Understanding Seq2Seq Neural Networks – Part 8: When Does the Decoder Stop?

Dev.to / 3/26/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The decoder in a seq2seq model continues generating tokens until it outputs an EOS (end-of-sequence) token, rather than stopping after a fixed number of steps.
  • Each generation step feeds the current predicted token (via its embedding) into unrolled LSTM cells, then uses the resulting hidden states to predict the next token through the same fully connected layer.
  • In practice, decoding may also halt when a maximum output length is reached, providing a safeguard against infinite generation.
  • During training, teacher forcing is used: the decoder inputs the known correct token rather than its own previously predicted token to stabilize and guide learning via backpropagation.
  • The article concludes this seq2seq series and previews the next topic: introducing the attention mechanism.

In the previous article, we walked through the translation step by step.

But one question remains: how does the decoder know when to stop?

The decoder keeps generating tokens until it outputs an EOS (end-of-sequence) token.

So, we feed the predicted word "Vamos" into the decoder's embedding layer and unroll the two LSTM cells in each layer one more step.

Then, we run the output values (the short-term memories, also known as hidden states) through the same fully connected layer.

The next predicted token is EOS, so the decoder stops.

How the Decoder Works

This means we have translated the English sentence "let's go" into the correct Spanish sentence, "vamos".

The context vector, formed by the final states of both layers of unrolled encoder LSTM cells, is used to initialize the decoder's LSTMs.
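To make this concrete, here is a minimal sketch of the hand-off from encoder to decoder. The names and placeholder values are illustrative, not a real library's API; an actual model would carry numeric hidden and cell state tensors.

```python
# Hypothetical sketch: the encoder's final (hidden, cell) states -- one
# pair per LSTM layer -- form the context vector that the decoder's
# LSTMs are initialized with.

def encode(tokens, num_layers=2):
    # Pretend each encoder layer ends with a (hidden, cell) state pair
    # after reading the whole input sentence.
    return [(f"h{layer}", f"c{layer}") for layer in range(num_layers)]

context_vector = encode(["let's", "go"])  # one (h, c) pair per layer
decoder_init_states = context_vector       # the decoder starts from these
print(decoder_init_states)                 # → [('h0', 'c0'), ('h1', 'c1')]
```

The key idea is simply that the decoder does not start from zeroed states: it starts from whatever the encoder's layers ended up with after reading the input sentence.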

The input to the decoder's LSTMs comes from the output word embedding layer, which starts with EOS. After that, each step uses whatever word the output layer predicted at the previous step.

In practice, the decoder keeps predicting words until it predicts the EOS token or reaches some maximum output length.
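The stopping logic above can be sketched as a small loop. This is a toy, assumed implementation: `predict_next` stands in for the real LSTM cells plus fully connected layer, so the EOS and max-length checks are runnable on their own.

```python
# Toy sketch of the decoder's generation loop (hypothetical names).
EOS = "<EOS>"
MAX_LEN = 10  # safeguard against infinite generation

# Stand-in for "embedding -> LSTM cells -> fully connected layer":
# a simple lookup that maps the current token to the next prediction.
TOY_NEXT = {EOS: "vamos", "vamos": EOS}

def predict_next(token, state):
    # A real decoder would also update its hidden/cell states here.
    return TOY_NEXT[token], state

def greedy_decode(init_state):
    token, state = EOS, init_state  # decoding starts from the EOS token
    output = []
    while len(output) < MAX_LEN:    # cap on output length
        token, state = predict_next(token, state)
        if token == EOS:            # stop as soon as EOS is predicted
            break
        output.append(token)
    return output

print(greedy_decode(None))  # → ['vamos']
```

Notice that the loop has two exits, matching the text: predicting EOS, or hitting the maximum output length.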

All these weights and biases are trained using backpropagation.

When training an encoder-decoder model, instead of using the predicted token as input to the decoder LSTMs, we use the known correct token. This is known as teacher forcing (explained in this article).
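Teacher forcing can be sketched in a few lines. The helper name below is an assumption for illustration; the point is only how the decoder's training inputs are built from the known correct tokens.

```python
# Toy sketch of teacher forcing: instead of feeding the decoder its own
# previous prediction, feed it the known correct token at every step.
EOS = "<EOS>"

def decoder_inputs_teacher_forcing(target_tokens):
    # Inputs are the target shifted right: start with EOS, then the
    # correct tokens (the final EOS is only ever an expected output).
    return [EOS] + target_tokens[:-1]

target = ["vamos", EOS]  # the correct Spanish output sequence
inputs = decoder_inputs_teacher_forcing(target)

# Each (input, expected-output) pair is one training step for the decoder.
print(list(zip(inputs, target)))
# → [('<EOS>', 'vamos'), ('vamos', '<EOS>')]
```

Because the inputs never depend on the model's own (possibly wrong) predictions, training stays stable even early on, when the decoder would otherwise wander off after one bad guess.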

What’s Next

That’s it for sequence-to-sequence neural networks.

In the next article, we will introduce the attention mechanism for neural networks.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here