Self Attention
In this article, we will learn the ins and outs of how the self-attention mechanism works inside the Transformer.
Self-attention, also known as intra-attention, is a mechanism that allows each element in a sequence to focus on other elements in the same sequence to gather contextual information.
1. Input Representation:
Each word/token/element in the input sentence/sequence is represented as a word embedding vector.
In the original research paper, each token is represented as a 512-dimensional vector.
2. Query, Key and Value vectors:
- For each input word/element, three different vectors, the Query (Q), Key (K), and Value (V) vectors, are computed using learned weight matrices.
- These vectors typically have the same dimension as the input vector but are obtained through learned linear transformations/projections.
3. Compute the Attention Scores:
- The attention score is computed by taking the dot product of the query vector of the current element with the key vectors of all the elements, including the current element itself.
- These scores indicate how much focus the current element should place on each of the other elements.
4. Scale the Scores:
- To ensure stability and more effective gradient flow, the dot-product (attention) scores are scaled by the square root of the dimension of the key vectors.
- This is done because each word is represented by a 512-dimensional vector. Since the dimensionality is high, the variance (spread) of the dot products is also high, and with high variance the vanishing gradient problem occurs due to the saturating behaviour of the softmax function.
- By decreasing the variance, we can prevent the vanishing gradient problem.

5. Apply Softmax and Compute the Weighted Sum:
- The scaled scores are passed through a softmax function to obtain attention weights that sum to 1.
- Each element’s final representation is computed as a weighted sum of all value vectors, using these attention weights.
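To make these five steps concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The shapes and the random weight matrices are purely illustrative; in a real Transformer, Wq, Wk, and Wv are learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) static embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    Returns contextual embeddings of shape (seq_len, d_k).
    """
    Q = X @ W_q                         # step 2: queries
    K = X @ W_k                         # step 2: keys
    V = X @ W_v                         # step 2: values
    d_k = K.shape[-1]
    scores = Q @ K.T                    # step 3: dot-product attention scores
    scores = scores / np.sqrt(d_k)      # step 4: scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # step 5: softmax over each row
    return weights @ V                  # step 5: weighted sum of value vectors

# Toy usage with random numbers (real models learn W_q, W_k, W_v from data).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 512               # e.g. 4 tokens, 512-dim embeddings as in the paper
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 512)
```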


Explanation of Each Point
Input Representation
In NLP applications, the most important step is the conversion of words into meaningful numbers.

One of the most useful techniques is word2vec. However, the problem with word2vec is that it is static in nature, i.e. it only captures the “semantic meaning” or “average meaning” of a word. It does not capture the contextual meaning of the word.
For example, in the phrases ‘Money Bank’ and ‘River Bank’, the word ‘bank’ is used in two different contexts. However, with the word2vec embedding technique, this word will be represented the same way in both contexts.
To capture this contextual meaning, we use the self-attention mechanism. It is a mechanism which takes static word embeddings as input and generates contextual (dynamic) embeddings.


Inside the Self-Attention Block
The self-attention mechanism represents each word w.r.t. the surrounding words by deriving three forms of each word: (i) Query, (ii) Key, and (iii) Value. It uses the dot product to find the similarity between two words: if two vectors are similar, their dot product will be high.
To obtain these three vectors, we perform linear transformations with three different matrices whose weights are learned during the training process.

The Query, Key, and Value vectors of any word come from the static word embedding of that particular word via these linear transformations.

We start with random values for all the matrices (Wq, Wk, Wv), and during the training process these weights get updated with the help of the data.
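As a small illustration (with made-up numbers, since the real matrices are learned from data), here is how a single word’s static embedding is projected into its query, key, and value vectors:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model = 512

# Hypothetical static embedding of the word "bank" (e.g. from word2vec).
e_bank = rng.normal(size=(d_model,))

# Wq, Wk, Wv start out random and are updated during training;
# here they simply stay random to show the shapes involved.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q_bank = e_bank @ W_q   # query vector of "bank"
k_bank = e_bank @ W_k   # key vector of "bank"
v_bank = e_bank @ W_v   # value vector of "bank"
print(q_bank.shape, k_bank.shape, v_bank.shape)   # (512,) (512,) (512,)
```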

Geometric Intuition of Self Attention
Consider the following phrases, with all the values being hypothetical:
“Money Bank” and “River Bank”
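Since the values here are hypothetical anyway, the sketch below uses tiny 2-D vectors to stand in for real embeddings and lets “bank” attend to itself and to its single context word. Everything in it (the numbers, the use of raw embeddings as query/key/value with no learned projections) is a simplification for intuition only: the contextual “bank” vector gets pulled toward “money” in one phrase and toward “river” in the other.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical 2-D embeddings, purely for intuition (real embeddings are high-dimensional).
money = np.array([0.9, 0.1])
river = np.array([0.1, 0.9])
bank  = np.array([0.6, 0.5])   # static "bank" sits between the two senses

def contextual_bank(context_word):
    # Simplified: raw embeddings act as query/key/value, and "bank" attends
    # to itself plus its one context word.
    words = np.stack([context_word, bank])
    scores = words @ bank          # dot-product similarity with the query "bank"
    weights = softmax(scores)      # attention weights over [context word, "bank"]
    return weights @ words         # weighted sum = contextual embedding of "bank"

print(contextual_bank(money))   # pulled toward the "money" direction
print(contextual_bank(river))   # pulled toward the "river" direction
```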




Why do we scale the self-attention scores?
The softmax function converts a set of numbers into probabilities such that the probabilities sum to 1.
If you input large numbers into the softmax function, it assigns higher probabilities to those numbers, whereas small input values result in lower probabilities. This disparity can lead to the vanishing gradient problem during backpropagation, where the smaller values contribute minimally to the gradient and therefore receive less weight in the learning process. Consequently, the softmax function tends to amplify the influence of larger numbers, potentially skewing the model’s learning dynamics.
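This effect is easy to reproduce. In the toy example below (random vectors, illustrative only), the unscaled dot-product scores have a large spread, so the softmax typically pushes almost all of the probability mass onto one token, while the same scores divided by √d_k give a much smoother distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 512
rng = np.random.default_rng(1)
q = rng.normal(size=d_k)          # one query vector
K = rng.normal(size=(4, d_k))     # keys for four tokens

raw = K @ q                       # unscaled dot-product scores: large spread
scaled = raw / np.sqrt(d_k)       # scaled scores: spread close to 1

print(np.round(softmax(raw), 4))     # typically near one-hot: gradients vanish for the rest
print(np.round(softmax(scaled), 4))  # a smoother distribution over the four tokens
```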
In a way, self-attention works like gravity: it pulls words toward one another so that each word’s representation captures the context in which it is used in the sentence, rather than remaining static as in the case of word2vec embeddings. In other words, self-attention is context aware.
Why is self-attention called self?
Self-attention is called “self” because it allows each element to focus on other elements within the same sequence. This mechanism is essential for capturing dependencies and the contextual information of each word w.r.t. the other words in the given sentence.
So internally it is calculating the relevance of each word within the same (hence “self”) sentence by assigning attention scores.
For example: The cat sat on the mat
The word “cat” will consider the context of all the other words in the above sentence, including itself.
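Purely as an illustration (the weights below are made up, not computed by a model), the attention distribution for the query word “cat” might look something like this, with the weights over all six tokens summing to 1:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Hypothetical attention weights for the query word "cat" (made up, not model output).
weights_for_cat = [0.05, 0.40, 0.25, 0.05, 0.05, 0.20]

for token, weight in zip(tokens, weights_for_cat):
    print(f"{token:>4}: {weight:.2f}")
print("sum:", sum(weights_for_cat))   # sums to 1 (up to floating-point rounding)
```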
The attention mechanism calculates alignment/attention scores between two different sequences (like an English sentence and its translated sentence), whereas self-attention calculates alignment/attention or similarity scores within the same sequence.
In summary, self-attention is the heart of the Transformer: it computes a weighted representation of words based on their relevance to each other, enhancing the model’s ability to understand context and relationships within the input sequence.
Multi-Head Attention
Self-attention captures only one meaning of a sentence. However, when a sentence has two meanings, this mechanism is unable to capture both of them.
For example:
i) The animal didn’t cross the street because it was too tired.
ii) The animal didn’t cross the street because it was too crowded.
In these sentences, “it” refers to two different things: in the first sentence “it” refers to the animal, and in the second sentence “it” refers to the street.
These two meanings of “it” cannot be captured by a single self-attention layer. However, we can use multiple self-attention layers in parallel to capture the different meanings of the same sequence. This arrangement of multiple self-attention layers is called Multi-Head Attention.
Multi-head attention extends the self-attention mechanism by running multiple self-attention processes in parallel. Each of these parallel processes is called a “head”. Here’s how multi-head attention works:

- Multiple Sets of Q, K, V: Instead of having a single set of Q, K, and V vectors, multi-head attention computes several sets (heads) of Q, K, and V vectors. This is done by applying different learned linear transformations, i.e. multiple Wq, Wk, and Wv matrices, to the input embeddings.
- Parallel Attention Mechanisms: Each head performs the self-attention process independently, producing different attention scores and weighted sums.

- Concatenation: The outputs of all the heads are concatenated together.
- Final Linear Transformation: The concatenated output is passed through a final linear transformation to produce the final output.
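Here is a minimal NumPy sketch of those four steps, assuming the configuration from the original paper (eight heads of dimension 64 on a 512-dimensional model); the random matrices simply stand in for weights that would be learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    # One head: its own Q, K, V projections followed by scaled dot-product attention.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def multi_head_attention(X, heads, W_o):
    # Run every head in parallel, concatenate their outputs, then apply the final projection W_o.
    head_outputs = [attention_head(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_o

# Toy setup: 8 heads of size 64 on a 512-dimensional model.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 512, 8
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)   # (6, 512)
```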

In other words, while self-attention provides a way for a model to focus on different parts of a sequence when generating a representation for each word, multi-head attention enhances this capability by allowing multiple, independent attention mechanisms to work in parallel, resulting in richer and more nuanced representations.