LSTM-based sequence-to-sequence / encoder-decoder architecture

In a seq2seq model, there are 3 important parts:
1. Encoder
2. Context vector (the encoder’s hidden state at the last time step)
3. Decoder
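A minimal PyTorch sketch of how these three parts fit together; the class names, `src_vocab`, `tgt_vocab` and `hidden_size` are illustrative assumptions, not part of the original write-up:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and keeps only the final (hidden, cell) states."""
    def __init__(self, src_vocab, hidden_size):
        super().__init__()
        # One-hot vectors of size src_vocab are fed straight into the LSTM.
        self.lstm = nn.LSTM(src_vocab, hidden_size, batch_first=True)

    def forward(self, x):                 # x: (batch, src_len, src_vocab)
        _, (h, c) = self.lstm(x)          # h, c: (1, batch, hidden_size)
        return h, c                       # this pair acts as the context vector

class Decoder(nn.Module):
    """Generates the target sequence one token at a time, seeded with the context vector."""
    def __init__(self, tgt_vocab, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(tgt_vocab, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, y_prev, state):     # y_prev: (batch, 1, tgt_vocab)
        out, state = self.lstm(y_prev, state)
        return self.out(out), state       # un-normalised scores over the target vocabulary
```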

How does an encoder-decoder work?
Let’s take English-to-Hindi machine translation as an example of a seq2seq task.

It is a supervised machine learning problem: we feed the English words to the encoder one at a time, the decoder predicts the output words, and the loss is used to compute gradients and update all trainable parameters (weights and biases). A toy parallel corpus of this kind is sketched below.
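The sentence pairs below are made up for this sketch and do not reproduce the exact vocabulary counts used later in the example:

```python
# Illustrative (input, output) pairs for English -> Hindi translation.
pairs = [
    ("i am happy",    "मैं खुश हूँ"),
    ("you are happy", "तुम खुश हो"),
]
```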
Step 1: Tokenisation of the words in the input and output columns
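A minimal word-level tokeniser is enough for this sketch (a real pipeline would also strip punctuation and handle casing more carefully):

```python
def tokenize(sentence):
    # Simple whitespace tokenisation, applied to both the input and output columns.
    return sentence.lower().split()

print(tokenize("I am happy"))   # ['i', 'am', 'happy']
```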

Step 2: Vectorisation
Convert the tokens into numbers, separately for the input and the output.
For simplicity we use the one-hot encoding technique here.
For the input side the vocabulary size is 5.
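For example, with an input vocabulary of size 5, one-hot encoding could look like this (the word-to-index mapping is an illustrative assumption):

```python
import numpy as np

# Illustrative input vocabulary of size 5; the indices are arbitrary.
input_vocab = {"i": 0, "am": 1, "happy": 2, "you": 3, "are": 4}

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

print(one_hot("happy", input_vocab))   # [0. 0. 1. 0. 0.]
```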

On the output side we have all the unique tokens/words plus 2 extra tokens (start_token and end_token). In total, the output side has 6 + 2 = 8 tokens.
The start token is sent to the decoder as its first input, along with the hidden state from the encoder’s last time step. The end token signals the decoder to stop generating words during inference/prediction.
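A sketch of building the output vocabulary with the two special tokens; the Hindi words and the token names `<start>`/`<end>` are illustrative assumptions:

```python
# 6 unique target words (illustrative) plus the two special tokens = 8 entries.
special_tokens = ["<start>", "<end>"]
target_words = ["मैं", "बहुत", "खुश", "हूँ", "तुम", "हो"]

output_vocab = {tok: i for i, tok in enumerate(special_tokens + target_words)}
print(len(output_vocab))   # 8
```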

Training a sequence-to-sequence model
Forward Propagation

- During forward propagation we send the OHE vectors to the encoder as input, one at each time step.
- At each time step, the cell state and hidden state of the encoder model are updated.
- At the last time step, we take the final hidden state of the encoder; it is passed to the decoder as the context vector, together with the start token.
- When the decoder predicts an output, it applies softmax and picks the word with the highest probability.

- Even if the decoder predicts a wrong word, we feed the actual (ground-truth) word at the next time step. This technique is called teacher forcing, and we use it because it makes the model converge much faster (see the sketch below).
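A sketch of this forward pass with teacher forcing, using plain `nn.LSTM` modules and made-up sizes (input vocabulary 5, output vocabulary 8, hidden size 16); the token indices are illustrative:

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, HIDDEN = 5, 8, 16           # illustrative sizes

enc = nn.LSTM(SRC_VOCAB, HIDDEN, batch_first=True)
dec = nn.LSTM(TGT_VOCAB, HIDDEN, batch_first=True)
out = nn.Linear(HIDDEN, TGT_VOCAB)

src = torch.eye(SRC_VOCAB)[torch.tensor([[0, 1, 2]])]        # one-hot source, shape (1, 3, 5)
tgt = torch.eye(TGT_VOCAB)[torch.tensor([[0, 2, 3, 4, 1]])]  # <start>, w1, w2, w3, <end>

# Encoder: only the final (hidden, cell) state is kept as the context vector.
_, state = enc(src)

# Decoder with teacher forcing: at step t we feed the *actual* target token t,
# regardless of what the decoder predicted at step t-1.
logits = []
for t in range(tgt.size(1) - 1):                  # stop before <end>
    step_in = tgt[:, t:t+1, :]                    # ground-truth token as input
    dec_out, state = dec(step_in, state)
    logits.append(out(dec_out))                   # scores for the token at step t+1
logits = torch.cat(logits, dim=1)                 # shape (1, 4, TGT_VOCAB)
```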
Backward Propagation:
Backward propagation through sequential data is called backpropagation through time (BPTT).
During backward propagation, we compute the loss using categorical cross-entropy at every decoder time step. From this loss we compute the gradients, which are then used to update the weights.
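In symbols, with T decoder time steps, V output-vocabulary entries, one-hot targets y and softmax outputs ŷ, the summed loss is (the notation here is an assumption for illustration):

```latex
L = -\sum_{t=1}^{T} \sum_{k=1}^{V} y_{t,k} \, \log \hat{y}_{t,k}
```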

Gradients flow first through the decoder, then through the context vector, and finally through the encoder.
Decoder Backpropagation:
- For each output step, compute the gradient of the loss with respect to the decoder’s output.
- Backpropagate through time (BPTT) to compute gradients for each parameter in the decoder.
- Accumulate gradients over all time steps.
- Update the decoder parameters using gradient descent or another optimisation algorithm.
Context Vector Gradient:
- Aggregate the gradients flowing back from the decoder into the context vector.
Encoder Backpropagation:
- Backpropagate the gradient of the context vector through the encoder layers.
- Compute gradients for each parameter in the encoder using BPTT.
- Update the encoder parameters using gradient descent or another optimisation algorithm.
Parameter Update:
- Apply the computed gradients to the encoder and decoder parameters, updating them in the direction that minimises the loss.
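Continuing the teacher-forcing sketch above (reusing its `enc`, `dec`, `out`, `logits` and `tgt`), one full training step with loss, BPTT and parameter update could look like this; the optimiser choice and learning rate are assumptions:

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss()       # categorical cross-entropy over the output vocabulary
params = list(enc.parameters()) + list(dec.parameters()) + list(out.parameters())
optimizer = optim.Adam(params, lr=1e-3)

# Targets are the indices of the tokens the decoder should have produced
# (everything after <start>), recovered here from the one-hot tensor.
target_idx = tgt[:, 1:, :].argmax(dim=-1)               # shape (1, 4)

loss = criterion(logits.reshape(-1, logits.size(-1)),   # (4, TGT_VOCAB)
                 target_idx.reshape(-1))                # (4,)

optimizer.zero_grad()
loss.backward()      # BPTT: gradients flow through decoder -> context vector -> encoder
optimizer.step()     # update encoder and decoder parameters
```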
Steps to improve the model:
- Use an embedding layer for both the encoder and the decoder. Either use pretrained embeddings such as word2vec or GloVe, or train your own embedding layer during the training process (see the code sketch after this list, which combines this with the stacked LSTM below).
- Use a deep/stacked LSTM.
a) It handles long-term dependencies more easily.
b) More trainable parameters give the model higher representational capacity (pair this with dropout to keep overfitting in check).
c) Layered representations help capture hierarchy, i.e. the initial layers capture word-level meaning, the middle layers sentence-level meaning, and the top layers paragraph-level meaning.
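A sketch combining both improvements, an embedding layer plus a 2-layer (stacked) LSTM encoder; the sizes and the dropout value are illustrative assumptions:

```python
import torch.nn as nn

# Illustrative sizes; not taken from the article's example.
SRC_VOCAB, EMB_DIM, HIDDEN, NUM_LAYERS = 5000, 300, 512, 2

class StackedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable embedding layer; swap in pretrained word2vec/GloVe vectors via
        # nn.Embedding.from_pretrained(weight_matrix) if available.
        self.embed = nn.Embedding(SRC_VOCAB, EMB_DIM)
        # num_layers=2 makes this a deep/stacked LSTM; dropout acts between layers.
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN, num_layers=NUM_LAYERS,
                            batch_first=True, dropout=0.2)

    def forward(self, token_ids):        # token_ids: (batch, seq_len) of word indices
        emb = self.embed(token_ids)
        _, (h, c) = self.lstm(emb)
        return h, c                      # final states from all stacked layers
```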
Drawbacks of an LSTM-based Encoder-Decoder:
- Computational complexity and resource intensity.
- Alignment issues between source and target words.
- Struggles with long sequences (roughly more than 30 words), because the single fixed-length context vector becomes a bottleneck.
- Requires a huge amount of data to train.