Attention
In this blog, we are going to study how the attention mechanism works.
Attention is a mechanism used in sequence-to-sequence models that allows the model to focus on specific parts of the input sequence when producing each part of the output sequence.
This is done by assigning weights, or alignment scores, to the hidden state of every encoder time step.
Let's take the example of machine translation to understand the attention mechanism.
(Example: translating the English sentence “turn off the light” into Hindi.)
In the above example, the translated word “बत्ती” does not depend on all the input words from the encoder; rather, it depends on the word “light”. Similarly, for the translated word “बंद”, “turn off” is the most relevant phrase.
Therefore, for a particular translated word, not every input token is equally relevant; usually only 2–3 input words actually contribute to generating a particular output word.
Previously, with a standard LSTM-based encoder-decoder architecture, the decoder at any time step t needs two inputs, S(t-1) and Y(t-1), where S(t-1) is the previous hidden state of the decoder and Y(t-1) is the input fed to the decoder (the previous target token, via teacher forcing during training).
With an attention-based encoder-decoder architecture, we provide three inputs:
S(t-1), Y(t-1) and C(t), where C(t) is the attention (context) vector.
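
As a rough sketch of that difference (the function and argument names below are purely illustrative, not taken from any particular framework):

```python
import numpy as np

def plain_decoder_step(decoder_cell, s_prev, y_prev):
    # Standard encoder-decoder: only S(t-1) and Y(t-1) are available.
    return decoder_cell(np.asarray(y_prev), s_prev)

def attention_decoder_step(decoder_cell, s_prev, y_prev, c_t):
    # Attention-based decoder: the context vector C(t) is provided as well,
    # typically concatenated with the (embedded) input token Y(t-1).
    step_input = np.concatenate([np.asarray(y_prev), np.asarray(c_t)])
    return decoder_cell(step_input, s_prev)
```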

This attention state, Ci, enables the decoder to assess the relevance of each encoder hidden state (h1, h2, h3 and h4) when predicting the output at a specific time step. For example, at time step t2 in the decoder, generating y2 depends on y1, s1 and c2. To compute c2, we calculate a weighted sum of all the encoder's hidden states. This weighted sum indicates how much each encoder hidden state contributes to predicting the decoder's output at time step t2.

c_i = \sum_{j=1}^{4} \alpha_{ij} h_j

where h1, h2, h3 and h4 are the hidden states of the encoder, α_ij is the attention weight assigned to hidden state hj, i is the time step of the decoder and j is the time step of the encoder.
This shows that in the attention mechanism we send a context vector at every time step of the decoder. In other words, at every decoder time step we pass in a weighted combination of the encoder's intermediate hidden states.
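
To make this concrete, here is a minimal NumPy sketch (the numbers and array names are made up purely for illustration) of turning alignment scores into attention weights and then into a context vector:

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Encoder hidden states h1..h4 (4 encoder time steps, hidden size 3).
H = np.array([[0.1, 0.3, 0.2],
              [0.5, 0.1, 0.4],
              [0.2, 0.2, 0.9],
              [0.7, 0.6, 0.1]])

# Alignment scores e_ij for one decoder time step i (assumed given here;
# the Bahdanau and Luong sections below show how they are computed).
scores = np.array([0.2, 2.1, 0.3, 0.1])

alpha = softmax(scores)          # attention weights alpha_ij, sum to 1
c_i = alpha @ H                  # context vector: weighted sum of the h_j
print(alpha)                     # most of the weight falls on h2
print(c_i)                       # same dimension as a single hidden state
```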
To calculate the alignment scores, or attention weights, with respect to all the encoder hidden states, we integrate a feed-forward neural network into the decoder architecture. During training, this network learns the optimal parameters, provided enough data is available.
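
Written out generally (a(·) is just a placeholder name for this scoring network; Bahdanau scores against the previous decoder state, while Luong, described below, uses the current one), the alignment score and attention weight for decoder step i and encoder step j are:

e_{ij} = a(s_{i-1}, h_j)

\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})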

To calculate the alignment scores, we have two types of attention mechanism:
1. Bahdanau Attention (also known as additive attention)
2. Luong Attention (also known as multiplicative attention)
Bahdanau Attention
Bahdanau attention uses an additive approach to calculate the alignment scores. It employs a feed-forward neural network with a single hidden layer.

e_{ij} = V^T \tanh(W [s_{i-1}; h_j] + b)

Here, tanh is the activation function of the hidden layer, W and b are the weights and biases of the hidden layer, and V is the weight of the output layer.

The attention input (context vector) Ci has the same dimension as an encoder hidden state, since it is a weighted sum of those hidden states.
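
Below is a minimal NumPy sketch of the Bahdanau (additive) score, using the same illustrative shapes as before (4 encoder states, hidden size 3) and randomly initialized W, b and V; in a real model these parameters are learned jointly with the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 3                                   # illustrative hidden size
H = rng.normal(size=(4, hidden))             # encoder hidden states h1..h4
s_prev = rng.normal(size=hidden)             # previous decoder state s_(i-1)

# Parameters of the single-hidden-layer feed-forward alignment network.
W = rng.normal(size=(hidden, 2 * hidden))    # hidden-layer weights
b = np.zeros(hidden)                         # hidden-layer bias
V = rng.normal(size=hidden)                  # output-layer weights

def bahdanau_scores(s_prev, H, W, b, V):
    # e_ij = V^T tanh(W [s_(i-1); h_j] + b), computed for every h_j at once.
    concat = np.hstack([np.tile(s_prev, (len(H), 1)), H])   # [s_(i-1); h_j]
    return np.tanh(concat @ W.T + b) @ V

scores = bahdanau_scores(s_prev, H, W, b, V)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                         # softmax -> attention weights
c_i = alpha @ H                              # context vector for this step
print(alpha, c_i)
```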
Luong Attention
Luong attention uses a multiplicative approach to calculate the alignment scores. It performs a dot product between hj and the decoder state St to find the attention weights. If two vectors are similar, their dot product is high, so the attention weight is large; if they are dissimilar, the attention weight is small.

e_{tj} = s_t^T h_j

In Luong attention, to calculate how relevant each encoder hidden state is at a given time step, the alignment model takes the dot product of every encoder hidden state (h1, h2, h3 and h4) with the current hidden state of the decoder (St). Using the current decoder state (rather than the previous one) gives a more up-to-date view of the decoder and thus makes training faster.
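
And a corresponding sketch of the Luong (multiplicative) score, with the same illustrative shapes; the plain dot-product variant needs no extra parameters, it simply compares the current decoder state with each encoder hidden state:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 3
H = rng.normal(size=(4, hidden))   # encoder hidden states h1..h4
s_t = rng.normal(size=hidden)      # current decoder hidden state S(t)

# Luong "dot" score: e_tj = s_t . h_j for every encoder position j.
# (The "general" variant inserts a learned matrix: s_t^T W h_j.)
scores = H @ s_t

alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # softmax over the encoder positions
c_t = alpha @ H                    # context vector for decoder step t
print(alpha, c_t)
```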
In summary:
The attention mechanism allows a neural network to dynamically focus on specific parts of the input sequence when generating each element of the output sequence.
- Bahdanau Attention: Additive, more flexible and complex, uses a neural network to calculate alignment scores.
- Luong Attention: Multiplicative, simpler and more efficient, uses dot product or other variations to calculate alignment scores.
Both attention mechanisms aim to improve the performance of sequence-to-sequence models by allowing the decoder to focus dynamically on the relevant parts of the input sequence, but they do so using different methodologies.