Hi there, I am trying a new type of article: reading a paper (RAP for short). In each of these articles, I will read a paper and ask myself some questions about it, such as its general ideas and implementation, and finally I will try to draw some conclusions and give some of my opinions.

The new category for these articles is called Biscuits, food for my teatime. There’s no fixed schedule; I will write RAPs from time to time.

Title: Rationalizing Neural Predictions

From: EMNLP 2016

PDF: http://aclweb.org/anthology/D/D16/D16-1011.pdf

Q: What does the paper do?

A1: The paper proposes an approach that learns to extract rationales from input text. A rationale is a piece of the text that justifies the decision (signal) in a specific task, e.g., the predicted class in a Sentiment Classification problem.

A2: In addition, the function that extracts rationales is learned without access to any rationale annotations, which means the approach only uses the original task data (the Multi-Aspect Sentiment Analysis dataset and the AskUbuntu data in the paper).

A3: The extraction looks like this (Figure 1). The aspects appearance, smell and palate are shown in red, blue and green respectively. The extracted text seems to explain the decisions for the different aspects.

Q: What is the solution proposed?

A1: It assumes that a subset of the text can perform as well as, or close to, the full text. Imagine finding that most of the words in an ad are useless and only the price matters. What if there were a robot that could highlight all the prices for you, so that you would not be confused by the rhetoric in ads?

A2: The paper defines two functions, $enc(\mathbf{x})$ and $gen(\mathbf{x})$. Take Sentiment Analysis as an example. As in an ordinary neural network, the $enc(\mathbf{x})$ function maps the full text $\mathbf{x}$ to a prediction. The function $gen(\mathbf{x})$ extracts a subset of $\mathbf{x}$, and that subset can in turn be fed to the encoder, giving $enc(gen(\mathbf{x}))$.

A3: So the paper wants the following equation to hold, at least approximately: $$enc(gen(\mathbf{x}))=enc(\mathbf{x})$$ which means the extracted subset should yield the same prediction as the full text under the function $enc$.

A4: At the same time, the supervision signal from Sentiment Analysis directly optimizes the parameters of $enc$ through an MSE loss, so the task signal also reaches $gen(\mathbf{x})$ indirectly: $$\mathscr{L}(\mathbf{x},\mathbf{y})=||enc(gen(\mathbf{x}))-\mathbf{y}||^2_2$$ where $\mathbf{y}$ is the task supervision signal.
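To make the composition $enc(gen(\mathbf{x}))$ concrete, here is a minimal toy sketch. Everything in it is my own assumption for illustration: the word scores, the rule used by `gen`, and the averaging in `enc` are hand-written stand-ins, not the neural models from the paper.

```python
from typing import List

# Hypothetical word scores standing in for a trained encoder's parameters.
WORD_SCORES = {"great": 1.0, "smooth": 0.9, "flat": 0.2, "awful": 0.0}


def gen(x: List[str]) -> List[str]:
    """Toy generator: keep only the words that the lexicon knows."""
    return [word for word in x if word in WORD_SCORES]


def enc(words: List[str]) -> float:
    """Toy encoder: mean score of the known words, 0.5 if there are none."""
    scores = [WORD_SCORES[w] for w in words if w in WORD_SCORES]
    return sum(scores) / len(scores) if scores else 0.5


def loss(x: List[str], y: float) -> float:
    """Squared error between the prediction from the rationale and the target y."""
    return (enc(gen(x)) - y) ** 2


x = "the palate is great and smooth".split()
print(gen(x))                  # the extracted subset
print(enc(gen(x)) == enc(x))   # the subset predicts the same as the full text
print(loss(x, 0.9))            # squared error against a made-up target
```

In this toy the equality holds by construction; in the paper both functions are learned, and the loss is what pushes the generator towards subsets that predict well.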

A5: To control the “shape” of the rationales, the paper also adds several regularizers to the loss, which restrict the number of selected words and encourage the extracted text to be coherent.
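As I read it, the regularizer has two parts: one that counts the selected words and one that counts how often the selection switches on and off along the sequence. Below is a small sketch of that idea on a binary selection vector; the function name `omega` and the weights `lambda1` and `lambda2` are placeholder values I picked, not the settings from the paper.

```python
from typing import List


def omega(z: List[int], lambda1: float = 0.1, lambda2: float = 0.2) -> float:
    """Penalty on a binary selection z: selected-word count plus on/off switches."""
    n_selected = sum(z)
    n_switches = sum(abs(cur - prev) for prev, cur in zip(z, z[1:]))
    return lambda1 * n_selected + lambda2 * n_switches


contiguous = [0, 1, 1, 1, 0, 0]  # one phrase of three words
scattered = [1, 0, 1, 0, 1, 0]   # three isolated words
print(omega(contiguous), omega(scattered))
```

Both selections pick three words, but the scattered one switches five times instead of twice, so it is penalized more. That is how the second term favours contiguous phrases.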

A6: So under a neural network framework, this $gen(\mathbf{x})$ function can be viewed as a Sequence Labeling problem using recurrent-like structures or, to be more specific, an extractive Text Compression problem. The paper employs RCNN.

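A7: So it’s almost done. Since the approach takes all possible generated rationales into account, the loss becomes an expectation over the rationales $\mathbf{z}$ drawn from $gen(\mathbf{x})$, somewhat like this:

$$\mathbb{E}_{\mathbf{z}\sim gen(\mathbf{x})}\big[cost(\mathbf{z},\mathbf{x},\mathbf{y})\big]$$

where $cost(\mathbf{z},\mathbf{x},\mathbf{y})$ is the loss from A4 evaluated on the rationale $\mathbf{z}$ plus the regularizers from A5, and the gradient with respect to the generator parameters $\theta_g$ becomes:

$$\frac{\partial\,\mathbb{E}_{\mathbf{z}\sim gen(\mathbf{x})}\big[cost(\mathbf{z},\mathbf{x},\mathbf{y})\big]}{\partial\theta_g}=\sum_{\mathbf{z}}cost(\mathbf{z},\mathbf{x},\mathbf{y})\cdot\frac{\partial p(\mathbf{z}|\mathbf{x})}{\partial\theta_g}$$

Since it’s annoying to calculate the derivative of an expectation over all sequences, the gradient is transformed, using $\frac{\partial p}{\partial\theta_g}=p\cdot\frac{\partial\log p}{\partial\theta_g}$, into:

$$\sum_{\mathbf{z}}cost(\mathbf{z},\mathbf{x},\mathbf{y})\cdot p(\mathbf{z}|\mathbf{x})\cdot\frac{\partial\log p(\mathbf{z}|\mathbf{x})}{\partial\theta_g}=\mathbb{E}_{\mathbf{z}\sim gen(\mathbf{x})}\left[cost(\mathbf{z},\mathbf{x},\mathbf{y})\cdot\frac{\partial\log p(\mathbf{z}|\mathbf{x})}{\partial\theta_g}\right]$$

where the expectation of the derivatives can be calculated instead, which means sampling can be employed to save you a lot of trouble.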
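Here is a small numerical check of that sampling trick, as a sketch under my own assumptions: each word is selected independently with probability `sigmoid(theta_t)`, and `cost` is a made-up function of the selection only. With three words all 8 selections can be enumerated, so the derivative of the expectation (by finite differences) can be compared with the sampled expectation of `cost(z)` times the derivative of `log p(z|x)`.

```python
import itertools
import math
import random


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def cost(z) -> float:
    """Hypothetical cost: squared error of a toy prediction plus a length penalty."""
    weights = [0.6, 0.3, 0.1]
    prediction = sum(w * s for w, s in zip(weights, z))
    return (prediction - 0.6) ** 2 + 0.05 * sum(z)


def prob(z, theta) -> float:
    """p(z|x) for independent Bernoulli selections with p_t = sigmoid(theta_t)."""
    p = 1.0
    for s, t in zip(z, theta):
        p *= sigmoid(t) if s else 1.0 - sigmoid(t)
    return p


def expected_cost(theta) -> float:
    """Exact expectation: enumerate all 2^n selections (only feasible for tiny n)."""
    return sum(prob(z, theta) * cost(z)
               for z in itertools.product([0, 1], repeat=len(theta)))


def numeric_gradient(theta, eps: float = 1e-5):
    """Derivative of the expectation, by central finite differences."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((expected_cost(up) - expected_cost(down)) / (2 * eps))
    return grad


def sampled_gradient(theta, n_samples: int, rng: random.Random):
    """Expectation of the derivative: average of cost(z) * d log p(z|x) / d theta."""
    probs = [sigmoid(t) for t in theta]
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        z = [1 if rng.random() < p else 0 for p in probs]
        c = cost(z)
        for i, (s, p) in enumerate(zip(z, probs)):
            grad[i] += c * (s - p) / n_samples  # d log p / d theta_i = z_i - p_i
    return grad


theta = [0.0, 0.5, -0.5]
print(numeric_gradient(theta))
print(sampled_gradient(theta, 20000, random.Random(0)))
```

The two printed vectors should agree up to sampling noise. For real sentences the enumeration is impossible, which is why only the sampled version is usable in practice.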

Q: What is the performance of the proposed approach?

A1: On the Multi-Aspect Sentiment Analysis task, a subset of words is generated by the function $gen(\mathbf{x})$, and $\mathbf{precision}$ is evaluated based on whether the words in the subset are in the sentences describing the target aspect in the sentence-level annotation.
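A minimal sketch of how I understand this precision, with word positions standing in for words. The function name and the example indices are made up for illustration and are not taken from the paper's evaluation script.

```python
from typing import Iterable


def rationale_precision(selected: Iterable[int], annotated: Iterable[int]) -> float:
    """Fraction of selected word positions that fall inside the annotated sentences."""
    selected, annotated = set(selected), set(annotated)
    if not selected:
        return 0.0
    return len(selected & annotated) / len(selected)


# Positions 5..11 belong to the sentence annotated for the target aspect.
annotated = range(5, 12)
selected = [6, 7, 8, 15]  # the generator picked four words, one outside the sentence
print(rationale_precision(selected, annotated))  # 0.75
```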

A2: It seems that the approach is doing quite well in the aspects listed above. But we can still observe some fluctuations in some aspect(s).

Q: What are the pros and cons, in my opinion?

P1: The idea, deduction and implementation are concise under an end-to-end framework.

P2: The design of the $gen(\mathbf{x})$ function and the related assumption are simple but powerful, especially as no direct rationale annotation is needed, which seems to be the biggest strength of the paper.

P3: The functions are actually general, which means that they can be realized under other frameworks, like CRF.

P4: The examples included in the paper do make sense!

C1: The dataset in the experiment is de-correlated, which confuses me. What is the point of removing the samples with highly correlated aspects? The overall rating in Multi-Aspect Sentiment Analysis can be more or less related to some of the aspects.

C2: The paper only shows the precision on 3 aspects. I would quite like to see how it performs on highly correlated aspects, such as overall and taste.

C3: The evaluation part looks like an Information Retrieval setting/design where $\mathbf{precision}$ is calculated. So what about $\mathbf{recall}$ and $\mathbf{f1}$, since it’s about searching for words in the sentences of the annotated aspects?
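To illustrate why I would like to see the other two numbers, here is a sketch that computes all three on word positions. The helper name and the indices are hypothetical; the point is only that a generator selecting very few words can reach perfect precision while covering little of the annotated sentence.

```python
from typing import Iterable, Tuple


def precision_recall_f1(selected: Iterable[int],
                        annotated: Iterable[int]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1 of a selection against an annotation."""
    selected, annotated = set(selected), set(annotated)
    hits = len(selected & annotated)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(annotated) if annotated else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The annotated sentence covers 7 word positions; the rationale is a single word in it.
p, r, f1 = precision_recall_f1([6], range(5, 12))
print(p, r, f1)
```

Here precision is perfect, but recall is only 1/7 and F1 is 0.25.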

Q: How is the paper in general?

A1: I think the most impressive point of the paper is that it generates rationales from the task data alone, without any extra rationale annotation/supervision. The assumption and design are attractive, and they turn out to work quite well.

A2: The generation of rationales can be viewed as a specialized Text Compression problem with an extractive approach. The difference is that in a pure Text Compression problem the gold standard serves a more general purpose, whereas in this paper the gold word set serves the purpose of Sentiment Analysis, which makes this work more flexible.

A3: Since the automatic evaluation metrics for the Text/Sentence Compression task, especially the extractive one, are still under discussion, a manual/human evaluation would make the results more convincing to some degree.

A4: For future work, I guess this approach can be adapted into an abstractive one (as opposed to extractive), where unseen words can be generated to replace the highlighted words and provide a more concise meaning representation.

1. veinpy

Thanks for your review of this paper. I have also read it, as well as the code.

I found that in the slides for this paper, the author regards the learning method as a type of reinforcement learning. Could you give us an intuitive explanation of this part?

• joseph

I added a picture from the slides to the end of the article. For an initial understanding of Reinforcement Learning, you may refer to an introduction from Sutton (click here). As indicated in the paper, the network learns the generator function gen(x), which gives probability estimates for the different labeled sequences (i.e., P(z|x)). It behaves exactly like the policy function in RL, which also produces the actions to be taken. The loss (reward) from the prediction can be propagated back to tune the generator (policy), just as in one family of RL learning methods, the policy gradient methods (click here). That is why the author said this is a REINFORCE-style method.

• joseph

You may also find Andrej’s article useful (here). It’s just like teaching a boy how to correctly be polite and say hello to his grandpa. At first, your kid would select one of the following:
A) Hello, grandpa! (decent)
B) Hello, Tom! (not good)
C) Hello, old Tom! (impolite)
If he says A), he gets rewarded (a candy or so), otherwise nothing. Then gradually he would learn to prefer saying A) and avoid B) and C). What he has learned is a policy network that helps him select from all the candidates. The generator gen(x) in the paper plays the role of this policy network and is likewise something you train, except that it generates candidate labeled sequences instead of A), B) and C).
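The grandpa example can be turned into a tiny REINFORCE-style loop. This is only a toy sketch with assumptions of my own (a softmax over three preference values, a learning rate of 0.1, 2000 greetings), not the training code of the paper.

```python
import math
import random

GREETINGS = ["Hello, grandpa!", "Hello, Tom!", "Hello, old Tom!"]
REWARDS = [1.0, 0.0, 0.0]  # a candy for A), nothing for B) and C)


def softmax(theta):
    """Turn preference values into a probability for each greeting."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 2000, lr: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]  # the kid starts with no preference
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(GREETINGS)), weights=probs)[0]
        reward = REWARDS[action]
        for i in range(len(theta)):
            indicator = 1.0 if i == action else 0.0
            # Policy gradient step: reward * d log pi(action) / d theta_i
            theta[i] += lr * reward * (indicator - probs[i])
    return softmax(theta)


probs = train()
for greeting, p in zip(GREETINGS, probs):
    print(f"{p:.3f}  {greeting}")
```

After training, nearly all of the probability should sit on A). In the paper the "greetings" are the sampled labeled sequences z, and the reward is the (negative) cost of the prediction made from them.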