The Transformer: The Architecture Behind Modern AI

Dev.to / 5/7/2026


Key Points

  • The article traces how modern neural architectures evolved—from early single-neuron and MLP ideas to CNN/RNN-style sequence handling, and finally to the Transformer that replaced many prior mechanisms with attention.
  • It argues that the Transformer’s key breakthrough matches a cognitive shift: understanding meaning in parallel rather than translating sequentially, and formalizes this with the probability of the next token given all prior tokens.
  • It explains how decoder-only Transformers (the basis of GPT/Claude-style models) are built from repeated layers containing token/position embeddings, masked multi-head self-attention, and other components that together model context.
  • The piece emphasizes that attention turns context into computable signals, shaping every generated output according to both past and present tokens.

"Attention Is All You Need." -- Vaswani, 2017

The Path So Far

We started with a single neuron drawing a line. Added hidden layers to bend it. Taught the network to learn its own weights. Scaled training with mini-batches and Adam. Fought overfitting with dropout. Built filters for images. Gave networks memory for sequences. Replaced compression with attention.

Each architecture solved a problem the previous one couldn't. Each carried forward what worked and discarded what didn't.

Architecture evolution: MLP → CNN → RNN → Transformer

The Personal Connect

In the Attention blog post, I described how I used to compose sentences in Tamil first, then translate them word by word into English. It was slow, sequential, and lossy. When I finally started thinking directly in English, everything changed. I wasn't translating anymore. I was processing meaning, grammar, and context all at once, shaped by everything I'd read and heard before.

That shift, from sequential translation to parallel understanding, is exactly what the Transformer does. And the core idea is simple:

P(next token | all previous tokens)

What is the probability of the next token, given everything that came before? That single equation is the foundation of GPT, Claude, and every modern language model. Everything you produce is shaped by your past and present context, conscious or not. The Transformer makes that idea computational.
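To make that concrete, here is a minimal sketch (in NumPy-flavoured Python, not the series' actual code) of the autoregressive loop a decoder-only model runs at generation time. The model function is a stand-in for the full network; all it has to do is return a probability distribution over the vocabulary given the tokens so far.

import numpy as np

def generate(model, prompt_tokens, n_new, rng=np.random.default_rng(0)):
    """Sample one token at a time; every draw conditions on all previous tokens."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = model(tokens)                       # P(next token | all previous tokens)
        next_tok = rng.choice(len(probs), p=probs)  # sample from that distribution
        tokens.append(int(next_tok))
    return tokens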

Breaking Down the Decoder

The decoder-only Transformer (used by GPT, Claude, and most generative AI models) is a stack of identical layers. Each layer has four components, and we've seen every one of them before.

Token + Position Embedding: Each token becomes a vector (say, 128 numbers). Since attention doesn't care about order, a position signal is added. Token "slow" at position 3 gets a different embedding than "slow" at position 6. The model learns that position matters.
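A minimal sketch of that step, assuming learned embedding tables tok_emb and pos_emb (the names, sizes, and initialisation below are mine, not the post's):

import numpy as np

vocab_size, max_len, d_model = 1000, 64, 128
rng = np.random.default_rng(0)
tok_emb = rng.normal(scale=0.02, size=(vocab_size, d_model))  # one vector per token id
pos_emb = rng.normal(scale=0.02, size=(max_len, d_model))     # one vector per position

def embed(token_ids):
    """Token vector plus position vector: "slow" at position 3 differs from "slow" at position 6."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions]            # shape: (seq_len, d_model)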

Masked Multi-Head Self-Attention: This is the core. Every token computes how relevant every previous token is to it, then blends their information accordingly.

Consider the sentence from the RNN post: "My teacher said I was slow, but he didn't know I was just getting started."

When predicting what "he" refers to:
  "My"       → low relevance (possessive, context)
  "teacher"  → high relevance (the subject — "he" refers back here)
  "said"     → low relevance (verb, not a referent)
  "I"        → medium relevance (another person in the sentence)
  "was"      → low relevance (auxiliary verb)
  "slow"     → low relevance (adjective)
  "but"      → low relevance (conjunction)
  "he"       → current position

RNN had to compress everything into a fixed-size hidden state and hope "teacher" survived the journey. Here, attention reaches back directly. No compression, no forgetting.

The attention formula, from the Attention post:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V

Each token generates a Query ("what am I looking for?"), a Key ("what do I offer?"), and a Value ("what information do I carry?"). The dot product Q·Kᵀ scores how well each key matches the query. Softmax turns scores into weights. The weighted sum of values produces the output. The causal mask ensures each token only attends to itself and the tokens before it. No peeking ahead.
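As a sketch, here is masked scaled dot-product attention in NumPy, matching the formula above; it is illustrative rather than the series' implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """softmax(Q·Kᵀ / √d) · V, masked so each position only attends to itself and earlier positions."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (seq, seq) relevance scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal = future positions
    scores = np.where(mask, -1e9, scores)                  # block attention to the future
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    return weights @ V                                     # weighted blend of value vectors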

Multi-head attention runs this operation multiple times in parallel with different learned projections. Conceptually similar to CNN's multiple filters: in a CNN, each filter detects a different spatial pattern (edges, textures). In a Transformer, each head detects a different relationship (grammar, coreference, meaning). Eight heads, eight perspectives, same total computation.
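Sketched in the same style, multi-head attention can reuse the causal_attention helper above, splitting the model dimension into one slice per head. The projection names Wq, Wk, Wv, Wo and their shapes are assumptions for illustration; real implementations typically do the same thing with a single batched reshape.

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=8):
    """Run attention once per head on a slice of d_model, then recombine the heads."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # each (seq, d_model); the W matrices are (d_model, d_model)
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)      # this head's slice of the model dimension
        heads.append(causal_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ Wo       # back to (seq, d_model)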

Add & LayerNorm: The residual connection from Post 07. The input bypasses the attention layer and gets added back:

output = LayerNorm(x + Attention(x))

This keeps gradients alive through deep stacks. Layer normalization stabilizes the signal between layers. Without these, a 12-layer Transformer wouldn't train.
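A sketch of that wrapper (layer normalisation here is the plain per-position mean/variance version; the learnable scale and shift parameters are omitted for brevity):

import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    """Residual connection keeps gradients flowing; LayerNorm stabilises the signal."""
    return layer_norm(x + sublayer_out)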

Feed-Forward Network: A two-layer MLP with GELU activation, applied to each position independently:

FFN(x) = GELU(x · W₁ + b₁) · W₂ + b₂

This is where the non-linearity lives. Attention itself is a weighted sum (linear). The FFN transforms what each token learned from attention through a non-linear function, the same principle from Post 02. Without it, stacking attention layers would collapse to a single linear operation.
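A sketch of the feed-forward block using a common tanh approximation of GELU (the 4x hidden width noted below is the usual choice, not something the post specifies):

import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Two-layer MLP applied to each position independently: expand, apply GELU, project back."""
    # Typical shapes: W1 is (d_model, 4*d_model), W2 is (4*d_model, d_model)
    return gelu(x @ W1 + b1) @ W2 + b2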

These four components repeat N times. Each layer refines the representation. By the final layer, the vector for each token encodes its meaning in the full context of the sequence.

A final linear layer followed by softmax produces the probability distribution over the next token. This last layer is intentionally linear. Its job is to project the rich representations into vocabulary space. The non-linearity has already done its work in the layers below.
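Sketched as code, the output head reuses the softmax helper from the attention sketch above; W_out and b_out are placeholder names for the learned projection into vocabulary space:

def next_token_probs(h_final, W_out, b_out):
    """Linear projection to vocabulary logits, then softmax; no extra non-linearity here."""
    logits = h_final @ W_out + b_out   # (seq_len, vocab_size)
    return softmax(logits, axis=-1)    # each row is P(next token | context so far)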

How It Learns

All weights start random. The Transformer knows nothing. Training uses the same loop from this series: backprop computes gradients, Adam updates weights, dropout prevents memorization.

What's different is what it learns from. No labels. No human annotations. Just raw text. "Given these tokens, predict the next one." Billions of times. The model learns grammar, facts, reasoning, style, all as a side effect of next-token prediction.

This is called self-supervised learning. The training signal comes from the data itself. Every sentence is both the input and the answer. Predict the next word, check if you were right, adjust. The same try-miss-adjust loop from the Backpropagation post, at a scale that would have seemed impossible when we started with XOR.
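A sketch of that objective: shift the text by one position so every token is both input and target, then minimise cross-entropy on the model's predictions (model_probs is a hypothetical stand-in for the full forward pass):

import numpy as np

def next_token_loss(model_probs, token_ids):
    """Cross-entropy for next-token prediction: inputs are tokens[:-1], targets are tokens[1:]."""
    token_ids = np.asarray(token_ids)
    inputs, targets = token_ids[:-1], token_ids[1:]
    probs = model_probs(inputs)                        # (len(inputs), vocab_size)
    correct = probs[np.arange(len(targets)), targets]  # probability assigned to each true next token
    return -np.log(correct + 1e-9).mean()              # lower loss = better predictions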

See It

Open the playground. Two pretrained models on Shakespeare, a small one (112K params) and a larger one (826K params). Type a prompt like "ROMEO:" and generate text instantly. Both models are tiny, so the output will still be rough, not real Shakespeare. But compare the two side by side and you'll see the 826K model produces noticeably better structure: dialogue format, character names, verse-like line breaks. Scale matters, even at this toy level.

The Series, Complete

This series started because I was building with AI tools but didn't understand how any of it worked. Ten posts later, I understand the foundations. Not because I memorised the formulas, but because I recreated each piece, watched it work, and saw how it connects to the next. There is still plenty to learn. The journey continues.

The Transformer didn't invent any of these pieces. It composed them. The genius was in what it removed, not what it added.

What's Next

We've built the architecture. But architecture alone doesn't make intelligence. Training is what brings it to life: how data is prepared, how models scale, how they're fine-tuned, how they learn to follow instructions. That's a separate series.

References:
Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. (GPT-1)

Series: From Perceptrons to Transformers | Code: GitHub