Why Transformer Is Needed
An architecture proposed in the 2017 paper "Attention Is All You Need." Earlier RNNs and LSTMs process words in order, so they're slow and bad at long text. Transformer reconciles parallel processing and learning long-range dependencies.
What Is Attention
Attention is a mechanism that learns "to understand the current word, which other words in the sentence to weight, and how much."
Example: "He sat on the bank"
Whether "bank" is a "financial institution" or "river edge" is decided by attention to other words in context ("sat"). Attention expresses inter-word relevance numerically.
Q / K / V (Query, Key, Value)
Three vectors are made for each word.
- Query (Q): "what am I looking for now"
- Key (K): "I can provide this kind of information"
- Value (V): the actual information content

