How Self-Attention Works — QKV, Softmax, and Matrix Computation

Dev.to / 6/18/2026



Key Points

Self-attention in Transformers is fundamentally a matrix computation that lets each token compare itself with every token in the same sequence (itself included) and update its representation accordingly.
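The core pipeline projects input embeddings into Query (Q), Key (K), and Value (V) matrices, computes similarity scores via QKᵀ, divides them by √d_k (where d_k is the key dimension), applies a row-wise softmax to turn each token's scores into weights that sum to 1, and then forms a weighted sum of the rows of V.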
Because the operation is expressed in matrix form, implementations process all tokens of a sequence in parallel rather than token by token, which lets Transformers make efficient use of parallel hardware such as GPUs.
A concrete example (“I love you”) illustrates that the token “love” can attend strongly to context words like “I” and “you,” shifting its meaning from an isolated word to a context-aware representation.
The Q/K/V separation provides an intuition for how models learn different roles: Q decides what to look for, K defines what can be matched, and V carries the information mixed into the output.
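
The pipeline described in the points above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article's own code: the names `softmax`, `self_attention`, `W_q`, `W_k`, and `W_v` are assumptions made for this example, and the embeddings and projection matrices are random placeholders rather than learned parameters, so the resulting weights only show the mechanics and say nothing about what a trained model would attend to in "I love you".

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X has shape (seq_len, d_model); each W_* has shape (d_model, d_k).
    Returns the attended output and the attention weight matrix.
    """
    Q = X @ W_q                          # what each token looks for
    K = X @ W_k                          # what each token can be matched on
    V = X @ W_v                          # the information each token carries
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of value rows


# Placeholder inputs (assumed for illustration): 3 tokens, random values.
tokens = ["I", "love", "you"]
d_model, d_k = 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, W_q, W_k, W_v)

# Row i of `weights` shows how much token i attends to every token.
for token, row in zip(tokens, weights):
    print(token, np.round(row, 3))
```

All three tokens are handled by the same three matrix products, with no per-token loop inside `self_attention`, which is the parallelism the key points refer to. The row of `weights` for "love" is the distribution over "I", "love", and "you" that gets mixed into its new representation.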
