Positional Encoding and Layer Normalisation

Sambhav Mehta
5 min read · Aug 9, 2024


In this article we will go through two important components of the transformer architecture: positional encoding and layer normalisation.

Unlike RNNs and LSTMs, which are sequential algorithms and so naturally position-aware, self-attention mechanisms process each word in a sequence simultaneously and independently. Positional encoding is necessary to provide the concept of sequence order because self-attention lacks an inherent sense of order. By embedding information about the position of each word in the sequence, positional encoding gives the context that is required for the model to accurately represent the sentence structure.

To assign positions to each word in a sequence, we need an encoding that satisfies key requirements:

  • Continuous Embedding: The positional encoding should use continuous values, as neural networks perform better with continuous inputs rather than discrete numbers like 1, 2, or 3.
  • Bounded Values: The encoding should produce values within a specific range to prevent them from becoming excessively large. Unbounded positional values can dominate the word embeddings and lead to unstable training, hindering the model's learning process.
  • Periodic Functions: The encoding should incorporate periodic functions that capture both absolute and relative positions, enabling the model to understand the order of words in a sequence effectively.

One possible solution that satisfies all of the above criteria is trigonometric functions such as sine and cosine.

Positional Encoding

The issue with using a single periodic function for positional encoding is that the values will eventually repeat after a certain position, losing the uniqueness needed for distinguishing different positions. To overcome this, multiple sine and cosine functions with varying frequencies are used.

Positional Encoding Formula
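For a token at position pos and dimension index i of a model with embedding size d_model, the original "Attention Is All You Need" paper defines the encoding as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

Even-numbered dimensions use sine and odd-numbered dimensions use cosine, and the wavelength grows geometrically with the dimension index, so every position receives a unique combination of values.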

The dimensionality of the positional encoding vector matches that of the word embedding vector, ensuring compatibility. Before passing the input to the self-attention layer, we combine the word embedding vector with its corresponding positional encoding. This combined vector encapsulates both semantic information from the word embedding and positional information, allowing the model to understand the word’s meaning and its position in the sequence.

We perform element-wise addition of the word embedding and positional encoding vectors to maintain the same dimensionality in the resulting vector. This ensures that the training time remains consistent. In contrast, using concatenation would double the dimensionality of the vector, leading to a significant increase in training time and computational resources. By keeping the dimensionality unchanged, element-wise addition allows the model to efficiently incorporate positional information without introducing additional computational overhead.

Positional encoding is very similar to binary encoding, except that we work in a continuous value domain rather than with discrete values like 0, 1, 2, 3.

In the original paper, each word embedding and positional encoding vector has 512 dimensions.
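To make the element-wise addition concrete, here is a minimal NumPy sketch. The function name and the toy shapes (a sequence of 10 tokens, d_model = 512) are illustrative, not from the original post:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional encoding matrix."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]       # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions -> sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions  -> cosine
    return pe

# Element-wise addition keeps the dimensionality at d_model (512 in the paper).
seq_len, d_model = 10, 512
word_embeddings = np.random.randn(seq_len, d_model)   # stand-in for learned embeddings
model_input = word_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(model_input.shape)   # (10, 512)
```

Because the positional matrix has the same shape as the embeddings, the addition adds no extra parameters and no extra dimensions, which is exactly the argument made above against concatenation.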

Layer Normalisation

Before understanding layer normalisation, let's first understand how batch normalisation works.

Batch normalisation is a technique used to normalise across the batch dimension (i.e. down each feature column) and is highly effective in ANNs and CNNs, where the batch size is fixed.

It is applied to the input layer and to the outputs of hidden layers.

What is internal covariate shift?

It refers to the phenomenon where the distribution of each layer’s input in a deep neural network changes during training as the parameters of the previous layer change.

This shift occurs because, as the weights in a network are updated during back-propagation, the output of each layer (which serves as the input to the next layer) can change in distribution, leading to instability in training.

How does batch normalisation address it?

Batch normalisation normalises the output of the layer so that mean = 0 and variance = 1 across the batch. This reduces the internal covariate shift by ensuring that the input distribution to each layer remains more stable during training.

By reducing internal covariate shift, it allows the model to use higher learning rates, which can lead to faster training and potentially better performance.

For each feature in the input, it computes the mean and variance across the batch.
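As a rough NumPy sketch (the toy numbers are made up for illustration), batch normalisation computes its statistics down the batch axis:

```python
import numpy as np

# Toy activations: a batch of 4 examples with 3 features each (made-up values).
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

# Batch normalisation: one mean/variance per feature, computed down the batch (axis=0).
mean = x.mean(axis=0, keepdims=True)     # shape (1, 3)
var = x.var(axis=0, keepdims=True)       # shape (1, 3)
x_bn = (x - mean) / np.sqrt(var + 1e-5)  # each column now has mean 0, variance 1
```

Framework layers such as torch.nn.BatchNorm1d perform the same computation and additionally learn a per-feature scale and shift.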

Layer normalisation normalises the values across the feature dimension for each individual input, making it more suitable for RNNs or sequence data where the batch size varies.

Let's understand layer normalisation using a diagram:

When using batch normalization on sequences with padding, the calculated mean and variance can be skewed because padding values distort the true representation of the data. Padding is added to make input sequences equal in length, but it introduces artificial values that affect the accuracy of the mean and variance. To address this issue, layer normalization is used instead. Layer normalisation normalizes across the features (or rows) within each individual sequence, rather than across the batch (or columns), ensuring that padding doesn’t interfere with the normalisation process and providing a more accurate representation of the data.
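A minimal sketch of the same idea, assuming a toy batch of two padded sequences (the numbers are illustrative): layer normalisation takes its statistics over the feature axis of each token, so the padded positions never leak into the statistics of the real tokens.

```python
import numpy as np

# Two padded sequences of length 4 with 3 features per token (made-up values);
# the all-zero rows in the first sequence stand in for padding tokens.
batch = np.array([
    [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.5, 1.5, 2.5], [1.0, 2.0, 1.0], [2.0, 0.5, 1.5], [1.0, 1.0, 1.0]],
])

# Layer normalisation: one mean/variance per token, computed over its features (axis=-1),
# so each token is normalised independently of every other token in the batch.
mean = batch.mean(axis=-1, keepdims=True)      # shape (2, 4, 1)
var = batch.var(axis=-1, keepdims=True)        # shape (2, 4, 1)
batch_ln = (batch - mean) / np.sqrt(var + 1e-5)
```

In PyTorch the equivalent operation is torch.nn.LayerNorm(d_model), which also adds a learnable scale and shift per feature.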

Conclusion:

Positional Encoding

Positional encoding is employed in transformer models to give the model information about word order in a sequence. Because self-attention does not naturally follow a sequence, positional encodings are added to the word embeddings. These encodings use sine and cosine functions of various frequencies to generate distinct vectors that indicate the position of each word, enabling the model to comprehend the sequence structure.

Layer Normalisation

In neural networks, layer normalization is a technique that helps stabilize and accelerate training, especially for transformer-like models. Instead of normalizing the input across the batch, it normalizes it across the features (or neurons) within a single data point. In particular, when working with different input sizes or sequences, this guarantees that each layer’s output has a consistent distribution, which aids in maintaining stable gradients and enhancing training efficiency.
