In this post I will discuss A Structured Self-Attentive Sentence Embedding (published at ICLR 2017), an interesting paper that introduces a sentence representation different from the conventional sentence embedding vector. To understand this post and the paper, readers should have a basic understanding of latent space representations of words (word embeddings), recurrent architectures such as LSTMs and BiLSTMs, and non-linear functions such as softmax and tanh. I highly recommend reading the original paper after this post to learn about the proposed model in more detail.
Introduction
This paper proposes a new model for extracting a sentence embedding by using self-attention. Instead of the traditional 1-D vector, the authors propose a 2-D matrix to represent the sentence, with each row of the matrix attending to a different part of the sentence. They also propose a self-attention mechanism and a special regularization term for the model. The resulting embedding is interpretable and can be easily visualized to understand which parts of the sentence are encoded into it.
What is Attention?
In simple terms, attention is a mechanism by which a neural network focuses on specific parts of an input text sequence given some additional contextual information. It is much like how we humans pay attention to certain words or ideas while looking up answers to questions in a text document.
In neural network models, an attention mechanism is usually used on top of an LSTM or CNN layer to guide the extraction of a sentence embedding by providing some extra information or context. Consider question answering systems based on memory networks. In such systems the query or question vector provides the extra information needed to attend over the available memory vectors and identify the one that best answers the question.
However, for other tasks such as sentiment classification or sentiment analysis, attention is not directly applicable because there is no such extra information: the model is given only a single sentence as input. In such cases the most common approach is to use a recurrent architecture and either add a max pooling or averaging step across all time steps, or simply take the hidden representation at the last time step as the encoded embedding.
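To make these pooling options concrete, here is a minimal PyTorch sketch (my choice of framework, not something prescribed by the paper); the tensor shapes and values are purely illustrative.

```python
import torch

# Hidden states produced by some recurrent encoder for a batch of sentences:
# H has shape (batch, n_timesteps, hidden_size). Random values stand in for real outputs.
H = torch.randn(4, 10, 300)

# Option 1: max pooling across time steps -> (batch, hidden_size)
max_pooled, _ = H.max(dim=1)

# Option 2: averaging across time steps -> (batch, hidden_size)
mean_pooled = H.mean(dim=1)

# Option 3: hidden representation at the last time step -> (batch, hidden_size)
last_hidden = H[:, -1, :]
```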
Propositions of the paper
- Carrying semantics along all the time steps of a recurrent model is relatively hard and unnecessary.
- The proposed self-attention mechanism allows extracting different aspects of the sentence into multiple vector representations.
- Attention is performed on top of an LSTM layer in the proposed sentence embedding model. This allows attention to be used in cases where there are no extra inputs. In addition, because it has direct access to the hidden representations from previous time steps, it relieves some of the long-term memorization burden from the LSTM.
Approach
The proposed sentence embedding model consists of two parts -
- The first part is a bidirectional LSTM
- The second part is the self-attention mechanism
The attention mechanism provides a set of weight vectors for the LSTM hidden states. These weights are multiplied with the LSTM hidden states and the resulting weighted LSTM hidden states are summed to generate an embedding for the sentence.
The Model
Part 1: BiLSTM
Consider a sentence with n words. Each word is represented by a word vector of dimension d. So the entire sentence S can be visualised as a 2-D matrix of size n x d, where the t-th row represents the word embedding w_t of the t-th word in the sequence. More formally,
S = (w_1, w_2, w_3, ..., w_n)
Word vectors by themselves are good representations of words in a latent space but not of the sentence, since they don't capture any sentence-level semantics or context. So, to capture dependencies between adjacent words within a single sentence, we process the sentence with a bidirectional LSTM. Each direction of the BiLSTM takes the d-dimensional word vector w_t and its u-dimensional hidden state from the previous time step h_{t-1} as inputs, and outputs a u-dimensional hidden state h_t. More formally,
h_t^f = LSTM_f(w_t, h_{t-1}^f)
h_t^b = LSTM_b(w_t, h_{t-1}^b)
where h_t^f denotes the hidden state of the forward LSTM at time step t and h_t^b denotes the hidden state of the backward LSTM at time step t. We then concatenate the two vectors h_t^f and h_t^b to obtain a new hidden state vector h_t of size 2u. Hence, over all n time steps we get a matrix H of size n x 2u,
H = (h_1, h_2, h_3, ..., h_n)
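As a rough illustration of this step, here is a minimal PyTorch sketch of a BiLSTM producing H; the dimensions d, u and n below are made-up values, not the paper's settings.

```python
import torch
import torch.nn as nn

d, u, n = 100, 150, 20   # word vector size, LSTM hidden size, sentence length (illustrative)

# S: one sentence of n word vectors, shaped (batch=1, n, d)
S = torch.randn(1, n, d)

# Bidirectional LSTM; forward and backward hidden states are concatenated per time step
bilstm = nn.LSTM(input_size=d, hidden_size=u, bidirectional=True, batch_first=True)

H, _ = bilstm(S)
print(H.shape)           # torch.Size([1, 20, 300]), i.e. (1, n, 2u)
```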
Part 2: Self-Attention Mechanism
Since we have a variable length input text sequence and we would like to encode it into a fixed-size embedding, we perform a linear combination of the n LSTM hidden vectors in H. Computing this linear combination requires the self-attention mechanism. To compute the attention weights, the mechanism takes all the LSTM hidden states in H as input and outputs a vector of attention weights a, as follows:
a = softmax(w_s2 tanh(W_s1 H^T))
where W_s1 is a weight matrix of size d_a x 2u, w_s2 is a vector of parameters of size d_a, and d_a is an adjustable hyperparameter. H^T is a 2u x n matrix, which on multiplication with W_s1 gives us a d_a x n matrix. The tanh activation function is then applied element-wise. Finally, multiplication with w_s2 gives us an n-dimensional vector, over which we apply a softmax to get a probability distribution over the n values. These values sum up to 1 and act as the attention weights.
Now that we have obtained the attention weights, for each i from 1 to n we multiply a_i with each value in the vector h_i to obtain weighted, or attended, context vectors. Then we sum up all of these weighted vectors to get a 2u-dimensional embedding, denoted m. Note that this vector representation focuses on only one specific aspect or component of the sentence, such as a particular set of related words or phrases. However, since there can be multiple components in a sentence that together form its overall semantics, especially for long sentences, we need multiple m's that each focus on a different part of the sentence. Thus we need to perform attention multiple times.
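Here is a hedged PyTorch sketch of this single attention hop; the parameter names W_s1 and w_s2 mirror the notation above, and the sizes are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, u, d_a = 20, 150, 350          # sentence length, LSTM hidden size, attention size (illustrative)

H = torch.randn(n, 2 * u)         # BiLSTM hidden states, one row per time step

W_s1 = nn.Parameter(torch.randn(d_a, 2 * u))
w_s2 = nn.Parameter(torch.randn(d_a))

# a = softmax(w_s2 tanh(W_s1 H^T)): an n-dimensional probability distribution
a = F.softmax(w_s2 @ torch.tanh(W_s1 @ H.T), dim=0)   # shape (n,)

# m = sum_t a_t * h_t: a single 2u-dimensional sentence embedding
m = a @ H                                              # shape (2u,)
```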
If we want to extract r different aspects of the sentence, we extend w_s2 into an r x d_a matrix, denoted W_s2, and the resulting annotation vector a becomes the annotation matrix A. Formally,
A = softmax(W_s2 tanh(W_s1 H^T))
The embedding vector m then becomes an r x 2u embedding matrix M. To compute the r weighted sums, we multiply the annotation matrix A with the LSTM hidden states H. The resulting matrix is the sentence embedding M.
M = AH
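A minimal sketch of the multi-hop version, again assuming PyTorch and illustrative sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, u, d_a, r = 20, 150, 350, 30   # illustrative sizes; r is the number of attention hops

H = torch.randn(n, 2 * u)          # BiLSTM hidden states

W_s1 = nn.Parameter(torch.randn(d_a, 2 * u))
W_s2 = nn.Parameter(torch.randn(r, d_a))

# A = softmax(W_s2 tanh(W_s1 H^T)), with the softmax applied along each row of the r x n result
A = F.softmax(W_s2 @ torch.tanh(W_s1 @ H.T), dim=1)   # shape (r, n)

# M = A H: the r x 2u sentence embedding matrix
M = A @ H                                              # shape (r, 2u)
```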
Penalisation Term
This is one of the most important and interesting parts of the paper. We just discussed that, to extract a variety of aspects of the sentence, we perform multiple hops of attention to generate r different attention weight vectors, which constitute our attention matrix A. However, this attention mechanism can suffer from redundancy: we are trying to look at specific aspects of the sentence, but if one or two aspects are more dominant than the others, the attention mechanism will focus only on them and end up generating a matrix with very similar rows of attention weights. To avoid this problem we need a penalisation term that forces diversity among the attention vectors, so that the summation embeddings m_i in M capture different aspects of the sentence.
So all we have to do is introduce some metric that encourages divergence between the different attention weight vectors. A natural choice of divergence measure is the Kullback-Leibler (KL) divergence (if this term is new to you, I suggest reading up on it). The authors, however, found that maximising the KL divergence between all pairs of attention weight vectors makes training unstable; they conjecture this is because the attention matrix A ends up with a lot of very small, close-to-zero values. The KL divergence also doesn't encourage each attention vector to focus on a different aspect of the same sentence. For these two reasons, it is not a good penalty term here.
So they introduce a new penalisation term as follows:
P = ||A A^T - I||_F^2
where ||·||_F stands for the Frobenius norm of a matrix. This penalisation term P is multiplied by a coefficient and minimised together with the original task loss.
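A minimal PyTorch sketch of this penalty; the function name attention_penalty and the coefficient penalty_coef are my own labels, not identifiers from the paper's code.

```python
import torch

def attention_penalty(A: torch.Tensor) -> torch.Tensor:
    """P = ||A A^T - I||_F^2 for an r x n attention matrix A."""
    r = A.size(0)
    identity = torch.eye(r, device=A.device)
    return torch.norm(A @ A.T - identity, p="fro") ** 2

# During training, this term would be scaled by a coefficient and added to the task loss:
# loss = task_loss + penalty_coef * attention_penalty(A)
```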
Let's try to understand what's going on here. Consider A, whose rows are the r attention weight vectors, and its transpose A^T, where
A = [a_1, a_2, ..., a_r]
Because of the softmax, the entries of any attention vector a_i are non-negative and sum to 1, so each a_i can be viewed as a discrete probability distribution over the n words. Now, if we multiply A and A^T we get an r x r matrix, say X,
[a_11, a_12, ..., a_1r]
[a_21, a_22, ..., a_2r]
[ ...,   ..., ...,  ...]
[a_r1, a_r2, ..., a_rr]
where a_ij = a_i · a_j (the dot product of the i-th and j-th attention vectors) for all i and j from 1 to r.
Since each a_i is a probability distribution with non-negative entries summing to 1, each a_ij = a_i · a_j lies between 0 and 1. It is 0 when the two attention distributions have no overlap at all, and it reaches 1 only when both distributions put all of their mass on the same single word. The diagonal element a_ii is also at most 1, and equals 1 only when a_i concentrates all of its weight on a single word. So when we subtract the identity matrix I (of the same dimension as X) and minimise the squared Frobenius norm of (X - I), two things happen: each diagonal element is pushed towards 1, which encourages every attention vector to focus on as few words as possible, and each off-diagonal element is pushed towards 0, which penalises overlap between different attention vectors. This ensures that the weighted summation matrix M that we use as our sentence embedding has different row vectors representing different, focused aspects of the sentence.
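To see why this works, here is a tiny worked example with made-up numbers (r = 2 attention rows over n = 3 words): redundant rows incur a large penalty, while focused, non-overlapping rows incur almost none.

```python
import torch

# Two redundant attention rows: both put most of their weight on the first word
A_redundant = torch.tensor([[0.9, 0.1, 0.0],
                            [0.9, 0.1, 0.0]])

# Two focused, non-overlapping rows: each attends to a different word
A_diverse = torch.tensor([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

I = torch.eye(2)

print(torch.norm(A_redundant @ A_redundant.T - I, p="fro") ** 2)  # ~1.41, a large penalty
print(torch.norm(A_diverse @ A_diverse.T - I, p="fro") ** 2)      # 0.0, no penalty
```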
Conclusion
The paper also discusses how to visualise the sentence embedding and describes the experiments that the authors carried out. I recommend reading through that part of the paper. In case you have trouble figuring it out, leave a comment below and I'll update the post with that part soon.