How can one effectively explain the transformer architecture during an interview?
The transformer has revolutionised the field of NLP due to its remarkable ability to capture long-term dependencies in data and its highly parallelizable nature.
A transformer has two parts:
- Encoder: extracts features from the input sentence.
- Decoder: uses these features to produce the output sentence.

As the input sequence passes through the input embedding layer, each token (or word) is transformed into a fixed-length vector representation; in the original research paper, each token is represented as a 512-dimensional vector. This is conceptually similar to static word embeddings such as word2vec, although in a transformer the embedding layer is typically learned jointly with the rest of the model.
The order of words is crucial in sequence-to-sequence models. To preserve positional information, a positional encoding is added to each token’s embedding, so every token carries its position in the sequence alongside its vector representation.
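As a rough sketch of how the sinusoidal positional encoding from the original paper can be added to the embeddings (NumPy; the sequence length here is an illustrative value, the 512 matches the paper’s model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding as in the original Transformer paper."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return pe

# The encoding is simply added to the token embeddings,
# so each 512-dimensional vector also carries its position.
embeddings = np.random.randn(10, 512)                  # 10 tokens, d_model = 512
embeddings_with_position = embeddings + sinusoidal_positional_encoding(10, 512)
```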
This vector representation is then sent to the self-attention layer, which is the heart of the transformer architecture.
What is the primary goal of the self-attention layer?
The primary purpose of the self-attention layer is to generate dynamic word embeddings that adapt to context. Let me explain with an example:
- ‘Apple generated $3B in revenue last year.’
- ‘Apple is a healthy fruit.’
The word ‘Apple’ refers to a company in the first sentence and a fruit in the second. With static embeddings like Word2Vec, ‘Apple’ would have the same representation in both cases. Self-attention, however, creates contextual embeddings, allowing the model to distinguish between the different meanings of ‘Apple’ based on its surrounding words.
How does self-attention produce contextual embeddings?
Self-attention produces contextual embeddings by allowing each token in a sequence to attend to all other tokens in the sequence, including itself. This process lets the model capture relationships between words in different contexts.
To determine these weights dynamically, the self-attention mechanism computes three vectors for each input token: a Query, a Key, and a Value.
The attention score, or weight, is calculated by taking the dot product of the query vector of the current word with the key vectors of all words in the sequence, including itself. This score determines how much importance or relevance the current word should assign to each of the other words.
Once the attention scores are calculated, these dot products are scaled by the square root of the key-vector dimension to ensure numerical stability and more effective gradient flow.

Why do we scale the scores?
Since each word is represented by a high-dimensional vector (e.g., 512 dimensions), the raw attention scores can have a large variance, or spread. High variance causes problems during training because the scores are then passed through a softmax function, which turns them into probabilities: very large scores dominate and very small ones become almost irrelevant.
The softmax function assigns high probabilities to larger numbers and very low probabilities to smaller ones. This can cause a vanishing-gradient problem during backpropagation, where the smaller values contribute little to the gradient and receive less attention during learning. To reduce this effect, we scale the scores to keep them more balanced.
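A quick numeric illustration of this effect; the scores and the key dimension below are made up purely for demonstration:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

d_k = 512                                 # assumed key-vector dimension
raw_scores = np.array([22.0, 14.0, 9.0, 3.0])    # hypothetical unscaled dot products

print(softmax(raw_scores))                # nearly one-hot: almost all mass on the largest score
print(softmax(raw_scores / np.sqrt(d_k))) # noticeably smoother, so smaller scores still receive gradient
```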
The resulting probability weights are then multiplied by the value vector of each word and summed. By attending to the other tokens in the sequence, each token’s embedding is influenced by the words around it. This enables the model to capture different meanings of the same word depending on its context, producing a dynamic, context-sensitive representation rather than a static one.
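Putting the steps together, a minimal NumPy sketch of scaled dot-product attention; the projection matrices and the dimensions (d_model = 512, d_k = 64) are illustrative assumptions, not prescribed by the text above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings -> contextual embeddings."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # query, key, value vectors per token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each query with every key, scaled
    weights = softmax(scores, axis=-1)               # attention weights, each row sums to 1
    return weights @ V                               # weighted sum of value vectors

d_model, d_k = 512, 64
X = np.random.randn(7, d_model)                      # 7 tokens
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
contextual = scaled_dot_product_attention(X, W_q, W_k, W_v)   # shape (7, 64)
```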
Example:
“Apple generated $3B in revenue last year.”
“Apple is a healthy fruit.”
The word “Apple” will have a different embedding in each case because, in the first sentence, the attention mechanism would focus on words like “revenue” and “generates,” suggesting a company context. In the second sentence, it would focus on “fruit” and “healthy,” indicating a fruit context.
Multi-head Attention:
Self-attention captures a single interpretation of a sentence. However, when a sentence has multiple meanings, a single self-attention layer may struggle to capture both. For instance:
- ‘The animal didn’t cross the street because it was too tired.’
- ‘The animal didn’t cross the street because it was too crowded.’
In the first sentence, ‘it’ refers to the animal, while in the second, ‘it’ refers to the street. A single self-attention head may not distinguish these two relationships. To address this, we run several self-attention heads in parallel, each focusing on different aspects of the sequence. This technique is known as multi-head attention and helps capture multiple interpretations within the same sentence.
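A hedged sketch of the idea: several independent attention heads, each with its own projection matrices, whose outputs are concatenated (the final learned output projection is omitted here). The head count and shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)                  # each head learns its own attention pattern
    return np.concatenate(outputs, axis=-1)          # concatenated; a final projection usually follows

d_model, n_heads = 512, 8
d_head = d_model // n_heads                          # 64 dimensions per head, as in the original paper
X = np.random.randn(9, d_model)
heads = [tuple(np.random.randn(d_model, d_head) for _ in range(3)) for _ in range(n_heads)]
out = multi_head_attention(X, heads)                 # shape (9, 512) before the output projection
```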
What is the difference between self attention and cross attention?
In self-attention, the queries, keys, and values are all derived from the same sequence, so the sequence attends to itself. In cross-attention, used in the decoder, the queries come from the decoder’s sequence while the keys and values come from the encoder’s output, allowing each generated token to attend to the input sentence.
What is masked multi-head attention?
A masked attention layer is a variation of the standard attention mechanism used in the Transformer architecture, specifically designed to handle autoregressive tasks, such as language generation. The key difference between masked attention and normal attention lies in their handling of future tokens in the sequence.
1. Normal Attention Layer:
- Function: In a regular attention layer, every token in the input sequence attends to all other tokens, including tokens both before and after itself. This is useful for tasks where the full context is available, such as sequence-to-sequence tasks (e.g., translation) where the entire input is known.
- Use Case: Commonly used in the encoder of the Transformer architecture, where each token has access to the entire input sequence to learn relationships between all tokens.
2. Masked Attention Layer:
- Function: In a masked attention layer, each token can only attend to itself and the tokens that come before it, effectively “masking” any information about future tokens. This ensures that the model doesn’t “cheat” by looking ahead when predicting the next token in tasks like language generation, where predictions are made one step at a time.
- Masking Mechanism: The attention mask is usually a binary matrix indicating which positions may be attended to; positions corresponding to future tokens are set to negative infinity (or a very large negative number) in the score matrix before applying the softmax function. This forces the attention mechanism to ignore those future tokens (a short sketch follows the key differences below).
- Use Case: Commonly used in the decoder of the Transformer architecture for autoregressive tasks, like text generation, where the model generates tokens one by one, without knowledge of future tokens.
Key Differences:
- Normal attention: Each token attends to all other tokens in the sequence (full context).
- Masked attention: Each token only attends to itself and tokens before it, preventing access to future context.
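A small sketch of the masking step, with an illustrative sequence length; in practice the mask is applied to the scaled scores inside the attention computation, before the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 5
scores = np.random.randn(seq_len, seq_len)           # already-scaled attention scores (illustrative)

# Causal mask: entries above the diagonal correspond to future tokens.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(mask, -1e9, scores)         # -inf in theory; a large negative number in practice

weights = softmax(masked_scores, axis=-1)
# Each row i now puts (near-)zero weight on tokens j > i,
# so token i attends only to itself and the tokens before it.
```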
What is layer normalisation and how does it differ from batch normalisation?
Layer normalisation is a technique that helps stabilise and accelerate the training process. Unlike batch normalisation, which normalises values across the batch (down each feature column over many examples), layer normalisation normalises the values across the features of each individual token (along the row). This suits sequence-to-sequence problems, where batch sizes and sequence lengths are not fixed, and it helps maintain stable gradients and improve training efficiency.
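A minimal NumPy comparison of where the two normalisations compute their statistics; the tensor shape and epsilon are illustrative, and the learnable scale and shift parameters are omitted for brevity:

```python
import numpy as np

x = np.random.randn(4, 10, 512)                      # (batch, seq_len, d_model)

# Layer norm: statistics are computed per token, across the feature dimension,
# so it behaves the same regardless of batch size or sequence length.
def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Batch norm (for comparison): statistics are computed across the batch
# for each feature, which is awkward when batch and sequence sizes vary.
def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```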