PART 1 OF 7

The Transformer

GPT, Claude, Gemini, LLaMA — all built on this.

Plain English

A Transformer reads a sequence of tokens (words, subwords, symbols) and predicts what comes next. It does this by letting every token look at every other token and decide how much attention to pay to each one. The result is a probability distribution over the entire vocabulary — and the model samples from it to produce the next token.

It doesn't "understand" language. It compresses patterns in sequences so effectively that the output looks like understanding.

The Architecture

1. Tokenisation — raw text → integer token IDs
2. Embedding — token IDs → dense vectors in ℝᵈ
3. Positional encoding — inject sequence order
4. Self-attention (× N layers) — tokens attend to each other
5. Feed-forward network — per-token nonlinear transformation
6. Layer norm + residual connections — stabilise gradients
7. Output projection — final vector → logits over vocabulary
8. Softmax — logits → probability distribution → sample next token
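
To make those shapes concrete, here is a toy walkthrough of the pipeline in NumPy. Everything is a made-up stand-in (random weights, a 1,000-token vocabulary, 64 dimensions), not a real trained model, and steps 4–6 are left as a placeholder because the rest of this page builds them up properly.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d_model, seq_len = 1000, 64, 8               # toy sizes

    token_ids = rng.integers(0, vocab_size, seq_len)         # 1. tokenisation (assumed done)
    embedding = rng.normal(size=(vocab_size, d_model))       # 2. embedding table
    x = embedding[token_ids]                                 #    (seq_len, d_model)
    x = x + rng.normal(size=(seq_len, d_model))              # 3. positional encoding (stand-in)

    # 4-6. N layers of self-attention + feed-forward would refine x here

    W_out = rng.normal(size=(d_model, vocab_size))           # 7. output projection
    logits = x @ W_out                                       #    (seq_len, vocab_size)

    z = logits[-1]                                           # 8. softmax over the last position
    probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    next_token = rng.choice(vocab_size, p=probs)             #    sample the next token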

The Maths: Self-Attention

The core mechanism. For each token, compute three vectors from its embedding:

Q = X · W_Q    K = X · W_K    V = X · W_V

Attention(Q, K, V) = softmax( Q · Kᵀ / √dₖ ) · V

Q (Query)
"What am I looking for?"

K (Key)
"What do I contain?"

V (Value)
"What do I contribute?"

Q · Kᵀ computes a similarity score between every pair of tokens. The softmax normalises these into weights that sum to 1. Dividing by √dₖ prevents the dot products from growing too large (which would push softmax into saturation, killing gradients).

The result: each token's output is a weighted blend of all other tokens' values, where the weights are relevance scores computed on the fly from learned projections.
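
In code, the whole formula is a handful of lines. A minimal NumPy sketch, assuming Q, K and V have already been computed, one row per token:

    import numpy as np

    def attention(Q, K, V):
        """Scaled dot-product attention: softmax(Q · Kᵀ / √dₖ) · V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # (tokens, tokens) similarity scores
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability only
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # each row now sums to 1
        return weights @ V                               # weighted blend of value vectors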

What "Matrix × Vector" Actually Does

Every time you see a W · x in this model, something concrete is happening. A matrix is just a grid of numbers. A vector is a list of numbers. Multiplying them together produces a new list — each output number is a weighted sum of all the inputs.

The mechanical view
Take each row of the matrix. Multiply it element-by-element with the input vector. Sum the results. That's one output number. Repeat for every row. If the matrix is 512×512 and the input is 512 numbers, you get 512 output numbers — each one a different weighted combination of the same 512 inputs.

What it means
Each row of the matrix is a question: "how much of each input feature do I want?" The weights were learned during training. So W_Q · x asks: "given this token's embedding, what query should it produce?" The matrix is the learned transformation — it remixes the inputs into a new representation.

That's it. Every "linear layer" in the model is just this: multiply a matrix by a vector. The entire Transformer is matrix multiplications, with nonlinearities and normalisation between them. The "intelligence" lives in the specific numbers that training put into those matrices.
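
A tiny worked example, with a made-up 3×3 matrix standing in for a learned W:

    import numpy as np

    W = np.array([[1.0,  0.0, 2.0],    # row 1: input 1 plus twice input 3
                  [0.5,  0.5, 0.0],    # row 2: average of inputs 1 and 2
                  [0.0, -1.0, 1.0]])   # row 3: input 3 minus input 2
    x = np.array([2.0, 4.0, 1.0])

    y = W @ x                          # each output = one row dotted with x
    # y[0] = 1.0*2 + 0.0*4 + 2.0*1 =  4.0
    # y[1] = 0.5*2 + 0.5*4 + 0.0*1 =  3.0
    # y[2] = 0.0*2 - 1.0*4 + 1.0*1 = -3.0
    print(y)                           # [ 4.  3. -3.]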

What "Q · KT" Actually Computes

Q and K are both matrices — one row per token, each row a vector of numbers. Q · KT computes the dot product between every query and every key.

A dot product between two vectors is simple: multiply each pair of corresponding numbers and add them up. If two vectors point in the same direction, the dot product is large. If they're unrelated, it's near zero. If they point in opposite directions, it's negative.

Token 5 has a query vector: [0.3, −0.1, 0.8, …]
Token 2 has a key vector: [0.4, −0.2, 0.7, …]
Dot product = (0.3×0.4) + (−0.1×−0.2) + (0.8×0.7) + … = high score → these tokens are relevant to each other

Do this for every pair and you get a square matrix of scores — row i, column j tells you how much token i should pay attention to token j. This is the attention score matrix.
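
The same pairwise scoring in NumPy, with toy sizes (4 tokens, 3 dimensions; real models use hundreds):

    import numpy as np

    rng = np.random.default_rng(1)
    Q = rng.normal(size=(4, 3))    # 4 tokens, each with a 3-number query
    K = rng.normal(size=(4, 3))    # 4 tokens, each with a 3-number key

    scores = Q @ K.T               # (4, 4): scores[i, j] = query i · key j
    assert np.allclose(scores[2, 1], np.dot(Q[2], K[1]))   # exactly the dot product above
    print(scores.shape)            # (4, 4): one score per (query, key) pair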

What "Softmax" Actually Does

The attention scores are raw numbers — they could be anything. Softmax converts each row into a probability distribution: all positive, all summing to 1.

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)

In plain English:

1. Take each score and raise e (≈2.718) to its power — this makes everything positive and amplifies differences
2. Add up all the results
3. Divide each one by the total

The biggest scores get most of the weight. Small scores get almost nothing. The output is a set of weights that says: "pay 72% attention to token 3, 15% to token 7, 8% to token 1, and basically ignore everything else."

Softmax is the collapse mechanism. It takes a field of raw relevance scores and forces a decision: which tokens matter right now? The sharper the scores, the more decisive the collapse.
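
Those three steps, literally, as a NumPy sketch. The only extra is the standard trick of subtracting the maximum score first so e^z can't overflow; softmax gives the same answer either way:

    import numpy as np

    def softmax(z):
        z = z - z.max()        # stability shift: largest score becomes 0
        e = np.exp(z)          # step 1: all positive, differences amplified
        return e / e.sum()     # steps 2 and 3: divide each by the total

    scores = np.array([2.0, 0.5, -1.0, 1.5])
    print(softmax(scores))     # ~[0.53, 0.12, 0.03, 0.32], sums to 1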

Putting It Together: One Step of Attention

Here's the full flow for one token, in plain operations:

1. Take this token's embedding (a list of ~512–4096 numbers)
2. Multiply it by W_Q to get a query vector — "what am I looking for?"
3. Every token (this one included) is multiplied by W_K to get a key vector — "what do I contain?"
4. Dot product of my query against every key — "how relevant is each token to me?"
5. Divide by √dₖ to keep the numbers stable
6. Softmax to turn those scores into weights that sum to 1
7. Every token is also multiplied by W_V to get a value vector — "what do I contribute?"
8. Multiply each value vector by its attention weight and add them all up
9. The result is this token's new representation — a blend of everything relevant in the sequence

That's it. That's attention. Every layer does this for every token. Each layer refines the representations. By the final layer, each token's vector encodes its meaning in context — not just what the word means, but what it means here.
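
Those nine steps as one runnable NumPy sketch for a single token. Dimensions are toy values, and every W is a random stand-in for a matrix that training would have learned:

    import numpy as np

    rng = np.random.default_rng(2)
    seq_len, d_model, d_k = 6, 16, 16

    X = rng.normal(size=(seq_len, d_model))    # step 1: one embedding per token
    W_Q = rng.normal(size=(d_model, d_k))      # learned in training; random here
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))

    i = 3                                      # the token being updated
    q = X[i] @ W_Q                             # step 2: this token's query
    K = X @ W_K                                # step 3: every token's key
    scores = K @ q / np.sqrt(d_k)              # steps 4-5: query · every key, scaled
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # step 6: softmax, weights sum to 1
    V = X @ W_V                                # step 7: every token's value
    new_repr = w @ V                           # steps 8-9: weighted blend of values

    print(new_repr.shape)                      # (16,): token i's new representation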

Multi-Head Attention

One attention head learns one kind of relationship. Multiple heads (typically 32–128 in large models) run in parallel, each with its own W_Q, W_K, W_V projections:

MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W_O
where headᵢ = Attention(Q·W_Qᵢ, K·W_Kᵢ, V·W_Vᵢ)

One head might learn syntax. Another learns co-reference. Another learns semantic similarity. The model doesn't choose — it learns all of them simultaneously.
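
A sketch of the multi-head wrapper, again in NumPy with toy sizes (4 heads; real models use far more). The attention function from the earlier sketch is restated so the block runs on its own:

    import numpy as np

    def attention(Q, K, V):
        s = Q @ K.T / np.sqrt(Q.shape[-1])
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        return (s / s.sum(axis=-1, keepdims=True)) @ V

    rng = np.random.default_rng(3)
    seq_len, d_model, h = 6, 32, 4
    d_head = d_model // h                       # each head works in a smaller subspace

    X = rng.normal(size=(seq_len, d_model))
    heads = []
    for _ in range(h):                          # each head gets its own projections
        W_Q = rng.normal(size=(d_model, d_head))
        W_K = rng.normal(size=(d_model, d_head))
        W_V = rng.normal(size=(d_model, d_head))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

    W_O = rng.normal(size=(d_model, d_model))
    out = np.concatenate(heads, axis=-1) @ W_O  # Concat(head_1 … head_h) · W_O
    print(out.shape)                            # (6, 32)

Splitting d_model across the heads keeps the total cost close to that of one full-width head; each head just attends within a narrower subspace.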

Feed-Forward + Residuals

After attention, each token passes through a two-layer MLP (feed-forward network) with a nonlinearity (GeLU/SiLU):

FFN(x) = W₂ · GeLU(W₁ · x + b₁) + b₂
Output = LayerNorm(x + Attention(x)) → LayerNorm(x + FFN(x))

The residual connections (the + x) are critical — they give gradients a direct path back through the network, so even very deep stacks stay trainable. Attention decides what to look at. The FFN decides what to do with it.
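
Both formulas in NumPy. The GeLU here is the common tanh approximation, the layer norm is written without the learned scale and shift real implementations add, and the weights are random stand-ins:

    import numpy as np

    def gelu(x):                      # tanh approximation of GeLU
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    def layer_norm(x, eps=1e-5):      # rescale each token's vector to mean 0, variance 1
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def ffn(x, W1, b1, W2, b2):       # FFN(x) = W_2 · GeLU(W_1 · x + b_1) + b_2
        return gelu(x @ W1 + b1) @ W2 + b2

    rng = np.random.default_rng(4)
    d_model, d_ff = 16, 64            # the hidden layer is typically ~4x wider
    x = rng.normal(size=(6, d_model))
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

    out = layer_norm(x + ffn(x, W1, b1, W2, b2))   # residual (the + x), then norm
    print(out.shape)                               # (6, 16)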

Training: Next-Token Prediction

The entire model is trained on one objective: given all previous tokens, predict the next one. The loss function is cross-entropy over the vocabulary:

ℒ = −Σₜ log P(xₜ | x₁, …, xₜ₋₁)

Simple objective. Massive scale. The model learns grammar, facts, reasoning patterns, code structure — all as side effects of predicting the next token well enough.
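
The loss in code: softmax the logits at each position, look up the probability the model assigned to the token that actually came next, and sum the negative logs. Logits and targets are random stand-ins here:

    import numpy as np

    rng = np.random.default_rng(5)
    vocab_size, seq_len = 50, 4

    logits = rng.normal(size=(seq_len, vocab_size))   # model output, one row per position
    targets = rng.integers(0, vocab_size, seq_len)    # the tokens that actually came next

    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)         # softmax each row over the vocabulary

    # L = -sum_t log P(x_t | x_1 ... x_(t-1)): negative log of each correct token's probability
    loss = -np.log(probs[np.arange(seq_len), targets]).sum()
    print(loss)   # lower = higher probability assigned to what actually came next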

Part 2: Diffusion Models →