To compare the decoder's hidden state with the encoder's hidden states, we need a similarity score.
Two common ways to calculate this are:
- Cosine similarity
- Dot product
Cosine Similarity
It takes the dot product of the two vectors and then divides by the product of their magnitudes, which normalizes the result to the range -1 to 1.
Example
Encoder output:
[-0.76, 0.75]
Decoder output:
[0.91, 0.38]
Cosine similarity ≈ -0.39
- Close to 1 → very similar → strong attention
- Close to 0 → not related
- Negative → opposite → low attention
This is useful when:
- Values can vary a lot in size
- You want a consistent scale (-1 to 1)
The problem is that it's a bit more expensive. The normalization requires extra calculations (divisions and square roots), and in attention we don't always need that.
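The calculation above can be sketched in a few lines of plain Python (no libraries needed), reproducing the worked example:

```python
import math

def cosine_similarity(a, b):
    # Dot product: multiply corresponding values and add them up
    dot = sum(x * y for x, y in zip(a, b))
    # Normalize by the product of the vector magnitudes
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

encoder = [-0.76, 0.75]
decoder = [0.91, 0.38]
print(round(cosine_similarity(encoder, decoder), 2))  # -0.39
```

The extra `sqrt` calls and the division are exactly the overhead the cosine approach adds over a plain dot product.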
Dot Product
Dot product is much simpler. It does the following:
- Multiply corresponding values
- Add them up
Example
(-0.76 × 0.91) + (0.75 × 0.38) = -0.41
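The two steps map directly to code; this sketch reproduces the worked example above:

```python
def dot_product(a, b):
    # Multiply corresponding values and add them up
    return sum(x * y for x, y in zip(a, b))

print(round(dot_product([-0.76, 0.75], [0.91, 0.38]), 2))  # -0.41
```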
Dot product is preferred in attention because:
- It’s fast
- It’s simple
- It gives good relative scores
Even if the numbers are not normalized, the model can still figure out:
- Which words are more important
- Which words to ignore
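To see why unnormalized scores are enough, here is a minimal sketch (using made-up encoder states, not values from any real model) of how attention typically turns raw dot-product scores into weights with a softmax. The softmax normalizes everything at the end, so only the relative sizes of the scores matter:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Exponentiate, then divide by the total so the weights sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical encoder hidden states and one decoder hidden state
encoder_states = [[-0.76, 0.75], [0.91, 0.38], [0.20, -0.50]]
decoder_state = [0.91, 0.38]

scores = [dot(h, decoder_state) for h in encoder_states]
weights = softmax(scores)
# Larger dot products end up with larger attention weights
```

The encoder state most similar to the decoder state gets the largest weight; the ones with low or negative scores are effectively ignored.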



