The Building Blocks
When you run an AI model – inference – you're executing a computational graph. Every concept below is something that graph is made of. None of them are magical. All of them are just numbers being multiplied and added together, at extraordinary scale.
Decoder vs Encoder: Two Families of Transformer
The original 2017 Transformer had two halves – an encoder and a decoder. The field split them apart and built entirely different model families from each half. Understanding which half you're using explains most of the differences between models.
Decoder-Only (Autoregressive)
Models: GPT-4, Claude, LLaMA, Mistral, Gemini, Qwen, Command R
How it works: Processes tokens left-to-right. Each token can only attend to tokens that came before it (causal masking). Generates one token at a time, feeding each output back as input.
Trained on: Next-token prediction – "given everything so far, what comes next?"
Good at: Text generation, reasoning, conversation, code, creative writing – anything where you produce a sequence one step at a time.
Encoder-Only (Bidirectional)
Models: BERT, RoBERTa, DeBERTa, DistilBERT, BGE, E5
How it works: Processes all tokens simultaneously. Every token can attend to every other token – forward and backward. The model sees the full input at once.
Trained on: Masked language modelling – hide random tokens and predict them from context (plus, in BERT's case, a next-sentence prediction objective).
Good at: Understanding, classification, named entity recognition, semantic search, embeddings – anything where you analyse existing text rather than generate new text.
The Key Difference: Causal Mask vs Full Attention
Decoder (GPT, Claude, LLaMA): Token 5 can see tokens 1–4. Token 6 can see tokens 1–5. Never looks ahead.
Encoder (BERT): Token 5 can see tokens 1–4 and tokens 6–N. Every token sees the whole sequence.
This single difference – whether attention is causal (forward-only) or bidirectional (full) – determines what the model can do. Decoders can generate because they're trained to predict the future without seeing it. Encoders can't generate, but they understand context more deeply because they see the entire input at once.
Why BERT can't write a story: It was trained to fill in blanks, not to continue sequences. It has no concept of "what comes next" – only "what fits here, given everything around it."
Why GPT can't do BERT's job as well: When GPT classifies a sentence, each token only sees what came before it. BERT sees the whole sentence. For understanding tasks, bidirectional context wins.
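Here's a minimal numpy sketch of that masking difference – the vectors are random and the sizes are toy, purely to show how the causal mask zeroes out attention to the future:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, dim = 5, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, dim))
K = rng.normal(size=(seq_len, dim))

scores = Q @ K.T / np.sqrt(dim)          # [seq_len, seq_len] similarity scores

# Decoder (causal): token i may only attend to tokens 0..i
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
decoder_weights = softmax(np.where(causal_mask, -np.inf, scores))

# Encoder (bidirectional): every token attends to every token
encoder_weights = softmax(scores)

print(decoder_weights[2])  # zeros at positions 3 and 4 - the future is hidden
print(encoder_weights[2])  # non-zero everywhere - the whole sequence is visible
```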
Encoder-Decoder (The Original Architecture)
Some models keep both halves. The encoder processes the full input bidirectionally, then the decoder generates output autoregressively while attending to the encoder's representations.
Models: T5, FLAN-T5, mBART, Whisper (for speech)
Good at: Translation, summarisation, speech-to-text – tasks where the input and output are different sequences with different structures
Size and Scale
The model families also differ dramatically in scale, which explains their different roles in practice:
| Architecture | Typical Size | Inference Cost | Primary Use |
|---|---|---|---|
| Encoder (BERT-class) | 110M – 350M | Low – single forward pass | Embeddings, classification, search, NER |
| Encoder-Decoder (T5-class) | 250M – 11B | Medium – encode once, decode many | Translation, summarisation, structured output |
| Decoder (GPT-class) | 1B – 2T+ | High – one forward pass per token | General reasoning, generation, conversation |
BERT-class models are small because they don't need to generate – they just need to understand. A 110M parameter BERT can power a semantic search system that handles millions of queries. A 70B decoder model can reason about novel problems but costs 600× more compute per input.
The big generative models (GPT, Claude, LLaMA) are all decoders. BERT and its descendants are encoders. They share the same attention mechanism, the same matrix multiplications, the same fundamental maths – the difference is which direction information is allowed to flow.
The Vocabulary
Parameter
A parameter is any number in the model that was learned during training. When someone says "a 70 billion parameter model," they mean the model contains 70 billion individually tuned numbers that collectively encode everything the model knows.
Weights – the numbers inside the matrices (WQ, WK, WV, FFN matrices, embeddings). The vast majority of parameters.
Biases – offset values added after a matrix multiply (the +b in Wx + b). Some architectures omit them entirely.
Scale factors – learned per-element multipliers in normalisation layers (γ in RMSNorm).
Parameters are fixed after training – they don't change during inference. The number of parameters determines the model's capacity (how much it can learn), its memory footprint (how much RAM/VRAM it needs), and its compute cost (how many operations per token). More parameters generally means more capable, but also more expensive to run.
Not all numbers in a running model are parameters. Activations (intermediate values) and the KV cache are computed fresh at runtime, and optimiser states exist only during training – none of them count toward the model's parameter total.
Inference
Inference is the act of running a trained model to produce an output. Training teaches the model. Inference uses what it learned.
Training
Feed data in, compute how wrong the output was, adjust parameters to be less wrong. Repeat billions of times. Costs millions of dollars and weeks of GPU time. Happens once.
Inference
Feed input in, run it forward through the model, get an output. No learning happens โ the parameters are frozen. Costs fractions of a penny per request. Happens billions of times.
Everything on this page – matmuls, attention, softmax, quantisation – is about making inference faster, cheaper, and more accessible. Training is a one-off cost. Inference is the ongoing cost that scales with every user, every query, every token.
Token & Tokenisation
Models don't process text as characters or words – they process tokens. A tokeniser splits text into subword chunks that the model was trained on. Each token maps to an integer ID, and each ID maps to a vector in the embedding table.
"Hello world" → ["Hello", " world"] → [15496, 995]
"unbelievable" → ["un", "believ", "able"] → [348, 31141, 481]
"๐" โ ["๐"] โ [76460]
Common words get a single token: "the", "Hello", "function"
Rare words get split into pieces: "quantisation" → "quant", "isation"
Vocabulary size is typically 32K–128K tokens – a fixed dictionary the model was trained with
Tokens are the atomic unit of everything. Context length is measured in tokens. Cost is measured per token. The model generates one token at a time. When people say a model has a "128K context window," they mean it can process ~128,000 tokens in a single pass – roughly 100,000 words.
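A toy illustration of the idea – this greedy longest-match tokeniser and its vocabulary are invented for the example; real tokenisers (BPE, SentencePiece) learn their merges from data and fall back to bytes for text they've never seen:

```python
# Toy greedy longest-match tokeniser over a made-up subword vocabulary.
# The pieces and IDs below are invented purely for illustration.
vocab = {"un": 0, "believ": 1, "able": 2, "Hello": 3, " world": 4, "quant": 5, "isation": 6}

def tokenise(text: str) -> list[int]:
    ids, i = [], 0
    while i < len(text):
        # take the longest vocabulary piece that matches at position i
        match = max((p for p in vocab if text.startswith(p, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token for {text[i]!r}")  # real tokenisers fall back to bytes
        ids.append(vocab[match])
        i += len(match)
    return ids

print(tokenise("unbelievable"))   # [0, 1, 2]  ->  "un" + "believ" + "able"
print(tokenise("Hello world"))    # [3, 4]
```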
Embedding
An embedding is a dense vector of numbers that represents a token (or any discrete item) as a point in continuous space. The embedding table is a matrix with one row per vocabulary token – look up a token ID, get its vector.
token_id = 15496 → embedding[15496] → [0.012, −0.34, 0.77, …] (4096 dims)
Words with similar meanings end up near each other in embedding space. "King" is near "queen." "Python" is near both "snake" and "programming" – in different directions. The geometry of the embedding space is the model's understanding of meaning.
The embedding table is the first and last thing in the model. Input tokens are looked up in it to enter the model. At the output, the final hidden state is projected back against it (or a separate output matrix) to produce logits over the vocabulary.
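A minimal sketch of the lookup itself – the table here is randomly initialised and deliberately narrow (a trained model's table holds learned values, with a row width of 4096 or more):

```python
import numpy as np

vocab_size, hidden_dim = 50_000, 16     # toy width; real models use 4096+ dimensions
rng = np.random.default_rng(0)

# The embedding table: one row per vocabulary token. Random here, learned in a real model.
embedding_table = rng.normal(scale=0.02, size=(vocab_size, hidden_dim))

token_ids = [15496, 995]                 # "Hello", " world"
vectors = embedding_table[token_ids]     # shape [2, hidden_dim] - one vector per token
print(vectors.shape)
```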
Context Window (Sequence Length)
The maximum number of tokens a model can process in a single pass. Everything โ your prompt, the conversation history, the system instructions, and the model's response โ must fit within this window.
GPT-2 (2019): 1,024 tokens
GPT-3 (2020): 2,048 tokens
GPT-4 (2023): 8K–128K tokens
Modern models (2024+): 128K–1M+ tokens
Attention is O(N²) in sequence length – doubling the context quadruples the attention compute. FlashAttention and other optimisations reduce the memory cost, but the fundamental compute scaling remains. Long context is expensive.
KV Cache
During autoregressive generation, the model produces one token at a time. Without caching, it would need to recompute the Key and Value vectors for every previous token at each step – an enormous waste of compute.
The KV cache stores the K and V tensors from all previous tokens so they only need to be computed once:
Step 1 (prefill): Process the entire prompt. Compute and cache K, V for all input tokens across all layers.
Step 2+ (decode): For each new token, compute its Q, K, V. Append its K, V to the cache. Attend to the full cached K, V. Generate the next token.
The KV cache is often the dominant memory consumer during long-context inference. For a 70B model with 128K context in BF16, the KV cache alone can exceed 40 GB. This is why quantised KV caches (INT8, FP8) and techniques like GQA (grouped-query attention) matter – they reduce the cache size without meaningful quality loss.
The distinction between "prefill" (processing the prompt) and "decode" (generating tokens) is fundamental to inference performance. Prefill is compute-bound (large MatMuls). Decode is memory-bound (reading the KV cache for a single token's attention).
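A toy single-head decode loop showing what the cache buys you – the projection matrices are random stand-ins, and a real model repeats this per layer and per head:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []        # one entry per token already processed

def decode_step(x):
    """Attend one new token against everything cached so far (single head)."""
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    k_cache.append(k)            # K and V are computed once, then reused forever
    v_cache.append(v)
    K = np.stack(k_cache)        # [tokens_so_far, dim]
    V = np.stack(v_cache)
    weights = softmax(K @ q / np.sqrt(dim))
    return weights @ V           # attention output for the new token

for _ in range(5):               # generate 5 tokens; the cache grows by one K,V pair per step
    out = decode_step(rng.normal(size=dim))
print(len(k_cache), out.shape)   # 5 (64,)
```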
Tensor
A tensor is a multi-dimensional array of numbers. That's it. A single number is a 0D tensor (scalar). A list of numbers is 1D (vector). A grid of numbers is 2D (matrix). Stack grids and you get 3D, 4D, and beyond.
Scalar – a single number: 3.14
Vector – a list: [0.2, −0.5, 0.8]
Matrix – a grid: 512 rows × 512 columns
Tensor – any of the above, or higher: [batch × sequence × hidden_dim]
Every weight, every activation, every input to an AI model is stored as a tensor. When people say "a 7B parameter model," they mean 7 billion numbers stored across hundreds of tensors.
Weight
A weight is a single learned number inside the model. During training, the model adjusts billions of these weights so that the outputs become useful. After training, the weights are fixed – they are the model. The weight tensors (WQ, WK, WV, the feed-forward matrices, the embeddings) hold everything the model has ever learned.
The model file you download is just the weight tensors serialised to disk. Load them into GPU memory, and you have a model.
Layer
A layer is one step in the model's processing pipeline. In a Transformer, each layer contains a self-attention block and a feed-forward block. The input enters, gets transformed, and exits as a new representation. Stack 32, 80, or 128 of these and you have a deep model – "deep" literally means "many layers."
Layer 1 – learns surface patterns (syntax, common phrases)
Layer N/2 – learns intermediate structure (relationships, grammar)
Layer N – learns abstract meaning (intent, reasoning, context)
Each layer refines the representation. Early layers are concrete; later layers are abstract. The "depth" is what gives these models their power.
Feed-Forward Network (FFN / MLP)
Inside every Transformer layer, after attention has decided what to look at, the feed-forward network decides what to do with it. It's a simple two-layer neural network applied independently to each token:
FFN(x) = W2 · activation(W1 · x + b1) + b2
Step 1: Project the token into a wider space (W1 expands, e.g. 4096 → 11008 dimensions)
Step 2: Apply a nonlinear activation function (GeLU, SiLU) – this is where the network gains its expressive power
Step 3: Project back down to the original size (W2 compresses, e.g. 11008 → 4096)
The expand-then-compress pattern gives the network a "wide bottleneck" where it can mix features freely before squeezing them back down. The FFN accounts for roughly two-thirds of the total parameters in a Transformer – and two-thirds of the compute.
Think of attention as routing – it chooses which information to combine. The FFN is processing – it transforms that information into something new. Every layer does both: route, then process.
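A minimal numpy sketch of the expand-activate-compress pattern – the weights are random and the dimensions are scaled-down stand-ins (a 7B-class model uses 4096 → 11008):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU, as used in several GPT implementations
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

d_model, d_ff = 512, 2048                            # toy sizes for the sketch
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_ff)     # up projection
W2, b2 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_model)  # down projection

def ffn(x):
    return W2 @ gelu(W1 @ x + b1) + b2               # expand -> nonlinearity -> compress

x = rng.normal(size=d_model)
print(ffn(x).shape)                                  # (512,) - same size as the input
```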
Up Projection & Down Projection
The two matrix multiplications inside the feed-forward network have specific names. They describe what happens to the dimensionality:
h = activation(Wup · x) – up projection: 4096 → 11008
y = Wdown · h – down projection: 11008 → 4096
Up projection (Wup) – expands the hidden state into a wider intermediate space. This is where the network gets room to represent complex feature combinations. The expansion ratio is typically 2.7–4× the model dimension.
Down projection (Wdown) – compresses back to the original dimension. The information that survived the activation function gets squeezed back into the residual stream.
In gated architectures (LLaMA, Mistral), there's also a gate projection – a second up projection whose output element-wise-multiplies the first, acting as a learned gate that controls which features pass through. This is the SwiGLU variant:
FFN(x) = Wdown · (SiLU(Wgate · x) ⊙ Wup · x)
The up and down projections are the largest weight matrices in the model. In a 7B model, each layer's FFN has three matrices of ~4096×11008 – that's ~135M parameters per layer, just in the FFN.
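The SwiGLU variant, sketched the same way (random weights, scaled-down dimensions standing in for 4096 and 11008):

```python
import numpy as np

def silu(x):
    return x / (1 + np.exp(-x))              # x * sigmoid(x)

d_model, d_ff = 512, 1376                    # toy stand-ins for LLaMA-7B's 4096 and 11008
rng = np.random.default_rng(0)
W_gate = rng.normal(scale=0.02, size=(d_ff, d_model))
W_up   = rng.normal(scale=0.02, size=(d_ff, d_model))
W_down = rng.normal(scale=0.02, size=(d_model, d_ff))

def swiglu_ffn(x):
    gate = silu(W_gate @ x)                  # learned gate, one value per intermediate feature
    return W_down @ (gate * (W_up @ x))      # element-wise gating, then compress back down

print(swiglu_ffn(rng.normal(size=d_model)).shape)   # (512,)
```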
Activation Function
An activation function is a simple nonlinear operation applied element-by-element to a vector. Without it, stacking layers of matrix multiplications would collapse into a single matrix multiplication – the network would be linear, incapable of learning anything complex.
ReLU – max(0, x) – kills negatives, passes positives. Simple. Fast. The classic.
GeLU – x · Φ(x) – a smooth, probabilistic gate. Used in GPT, BERT, most modern LLMs.
SiLU / Swish – x · σ(x) – smooth, self-gated. Used in LLaMA, Mistral.
Sigmoid – 1/(1 + e^(−x)) – squashes everything to (0,1). Used in gates and outputs.
The word "activation" has a dual meaning in AI. The activation function is the nonlinearity (ReLU, GeLU, etc.). An activation (noun) is the output value after applying that function – the intermediate numbers flowing through the network during inference. When people talk about "activation memory," they mean the memory used to store all these intermediate values.
RMSNorm – Root Mean Square Normalisation
Normalisation keeps numbers from exploding or vanishing as they pass through dozens of layers. RMSNorm is the version most modern Transformers use (LLaMA, Mistral, Gemma). It's simpler and faster than the original LayerNorm.
RMSNorm(x) = x / √(mean(x²) + ε) × γ
Step 1: Square every element, take the mean, take the square root – that's the RMS (root mean square)
Step 2: Divide each element by the RMS – now the vector has unit scale
Step 3: Multiply by a learned scale factor γ (per element) – the model learns the right magnitude
The difference from LayerNorm: RMSNorm skips the mean-subtraction step. It doesn't re-centre the values, only re-scales them. This removes one reduction operation per normalisation – which matters when you're doing it billions of times.
ε (epsilon) is a tiny constant (e.g. 10⁻⁶) added to prevent division by zero. Every normalisation formula has one.
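A direct translation of the formula – γ is set to ones here purely for illustration; in a real model it's a learned per-element gain:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt(np.mean(x ** 2) + eps)    # root mean square of the vector
    return x / rms * gamma                  # rescale, then apply the learned per-element gain

x = np.array([0.5, -2.0, 3.0, 0.1])
gamma = np.ones_like(x)                     # learned during training; ones here for the sketch
out = rms_norm(x, gamma)
print(out, np.sqrt(np.mean(out ** 2)))      # the normalised vector has RMS ~= 1
```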
MatVec (Matrix-Vector Multiply)
The fundamental operation of neural network inference. Take a matrix (the weights) and multiply it by a vector (the input). Every row of the matrix produces one output number – a weighted sum of the input.
y = W · x
where W is [out_dim × in_dim], x is [in_dim], y is [out_dim]
A model with 7 billion parameters does billions of these multiply-and-add operations per token. This is why GPUs matter โ they're built to do this in bulk. The entire cost of inference is dominated by matrix multiplications.
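The operation in a few lines – each output element really is just one row of the weight matrix dotted with the input:

```python
import numpy as np

out_dim, in_dim = 4, 3
W = np.arange(out_dim * in_dim, dtype=float).reshape(out_dim, in_dim)
x = np.array([1.0, 2.0, 3.0])

# Each output element is a dot product: one row of W against the input vector.
y_manual = np.array([W[i] @ x for i in range(out_dim)])
y = W @ x                                   # the same thing, done in bulk
print(y, np.allclose(y, y_manual))          # [ 8. 26. 44. 62.] True
```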
MatMul (Matrix-Matrix Multiply)
MatVec multiplies a matrix by a single vector. MatMul multiplies a matrix by another matrix – which is equivalent to doing many MatVecs in parallel, one per column of the second matrix.
C = A · B
where A is [M × K], B is [K × N], C is [M × N]
In attention: Q · Kᵀ is a MatMul – every query against every key, producing the [seq_len × seq_len] attention score matrix
In batched inference: processing multiple tokens at once turns every MatVec into a MatMul – the batch dimension becomes columns
In the FFN: the up and down projections are MatMuls when processing a full sequence
MatMul is what GPUs are truly optimised for. Tensor Cores on NVIDIA GPUs, AMX on Intel, and the matrix engines on TPUs are all specialised MatMul accelerators. The arithmetic intensity of MatMul (O(N³) compute for O(N²) data) means the hardware can stay busy – the ratio of compute to memory access is high enough to saturate the silicon.
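A quick numerical check that a MatMul is just MatVecs stacked side by side:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 4, 6, 3
A = rng.normal(size=(M, K))
B = rng.normal(size=(K, N))

C = A @ B                                                     # one MatMul: [M,K] x [K,N] -> [M,N]
C_cols = np.stack([A @ B[:, j] for j in range(N)], axis=1)    # N separate MatVecs, one per column
print(np.allclose(C, C_cols))                                 # True - a MatMul is many MatVecs
```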
Dot Product
The simplest form of similarity measurement. Take two vectors, multiply them element-by-element, and sum the results. If the vectors point in the same direction, the dot product is large. If they're orthogonal, it's zero. If opposite, it's negative.
a · b = Σᵢ aᵢ × bᵢ
In attention, Q · Kᵀ computes the dot product between every query and every key – this is how the model measures "how relevant is token A to token B?" Every attention score starts as a dot product.
Cosine Similarity
The dot product measures similarity, but it's affected by vector magnitude – longer vectors give bigger dot products even if they point the same direction. Cosine similarity fixes this by normalising:
cos(a, b) = (a · b) / (‖a‖ × ‖b‖)
+1 → vectors point in the same direction (identical meaning)
0 → vectors are orthogonal (unrelated)
−1 → vectors point in opposite directions (opposite meaning)
Cosine similarity is the standard metric for comparing embeddings – sentence similarity, RAG retrieval, semantic search. It measures the angle between two vectors, ignoring their length. Two documents can have very different word counts but similar meaning – cosine catches that.
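Both measures in a few lines of numpy, on made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = a * 10                        # same direction, much larger magnitude
c = np.array([-1.0, -2.0, -3.0])  # opposite direction

print(a @ b, a @ c)               # dot products: 140.0, -14.0 (magnitude-dependent)
print(cosine_similarity(a, b))    # ~1.0  - identical direction, length ignored
print(cosine_similarity(a, c))    # ~-1.0 - opposite direction
```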
Softmax
Takes a list of raw numbers and converts them into a probability distribution – all positive, all summing to 1. Large values get amplified; small values get crushed toward zero.
softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
In attention – converts raw attention scores into weights that sum to 1
At the output – converts logits into a probability distribution over the full vocabulary
Softmax is the collapse mechanism. It takes a field of possibilities and forces a decision.
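A direct implementation – subtracting the maximum first is the standard trick to keep exp() from overflowing, and it doesn't change the result:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)
print(np.round(probs, 3), probs.sum())   # roughly [0.638 0.235 0.095 0.032], sums to 1.0
```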
Logits
Logits are the raw, unnormalised scores that a model outputs before softmax is applied. The final layer of a Transformer produces one number per vocabulary token – typically 32,000–128,000 numbers. These are the logits.
logits = Woutput · hfinal
probabilities = softmax( logits / T )
Before softmax: logits can be any real number – positive, negative, large, small. They have no probabilistic meaning yet.
After softmax: they become a probability distribution – all positive, summing to 1. Now you can sample from them.
Logits are the model's raw opinion. Temperature scales them. Softmax normalises them. Sampling picks from them. Every text generation pipeline works with logits as the intermediate currency between the model's final layer and the token it actually outputs.
The name comes from "log-odds" – in logistic regression, the logit function is the inverse of the sigmoid. In modern usage, it just means "the pre-softmax output scores."
Q, K, V – Query, Key, Value
The three projections that make attention work. For each token, the model computes three different vectors from the same embedding:
Q (Query)
"What am I looking for?" โ the question this token asks of others
K (Key)
"What do I contain?" โ how this token advertises itself to queries
V (Value)
"What do I contribute?" โ the actual information this token carries
Q ยท KT computes relevance. Softmax normalises it. The result weights the V vectors. Each token's output is a weighted blend of every other token's value, based on learned relevance.
Attention – The Full Mechanism
Attention is the core operation that makes Transformers work. It's how the model decides which tokens are relevant to which other tokens – and blends their information accordingly.
Attention(Q, K, V) = softmax( Q · Kᵀ / √dk ) · V
1. Compute Q · Kᵀ – a matrix of similarity scores between every pair of tokens
2. Divide by √dk – prevents scores from growing too large, which would push softmax into near-one-hot outputs
3. Apply softmax per row – each token now has a probability distribution over all other tokens
4. Multiply by V – each token's output is a weighted sum of all value vectors, where the weights are the attention scores
In practice, models use multi-head attention – they run 32–128 independent attention operations in parallel, each with its own Q, K, V projections. One head might track syntax, another co-reference, another semantic similarity. The outputs are concatenated and projected back down.
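The four steps as a single numpy function, with an optional causal mask for the decoder case – the Q, K, V here are random, purely to show the shapes:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # 1-2. similarity, then scale by sqrt(d_k)
    if causal:                                             # decoder-style: hide the future
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)                              # 3. per-row probability distributions
    return weights @ V                                     # 4. weighted sum of value vectors

seq_len, d_k = 6, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V, causal=True).shape)               # (6, 16)
```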
SDPA – Scaled Dot-Product Attention
SDPA is the formal name for the attention formula above – "scaled dot-product attention." The "scaled" part is the division by √dk. In code, this is the function you actually call:
torch.nn.functional.scaled_dot_product_attention(Q, K, V)
PyTorch's SDPA implementation automatically selects the fastest available kernel for your hardware:
FlashAttention – fuses the entire attention computation into a single GPU kernel, avoiding materialising the full N×N attention matrix. Reduces memory from O(N²) to O(N). The reason modern models can handle 128K+ context.
Memory-efficient attention – similar tiling strategy for GPUs that don't support FlashAttention
Math fallback – the naive implementation when nothing better is available
Before FlashAttention, a 32K context window needed ~4 GB just for the attention matrix of one layer. FlashAttention made long-context models practical by never storing that matrix at all – it computes attention in tiles, streaming the result.
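A minimal usage sketch – the tensors follow the [batch, heads, seq_len, head_dim] layout the function expects, and which kernel actually runs underneath depends on your hardware and PyTorch build:

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 1024, 64
Q = torch.randn(batch, heads, seq_len, head_dim)
K = torch.randn_like(Q)
V = torch.randn_like(Q)

# is_causal=True applies the decoder-style mask without materialising an NxN score matrix.
out = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(out.shape)   # torch.Size([1, 8, 1024, 64])
```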
Output Projection
After multi-head attention concatenates the outputs from all heads, the result passes through one more linear layer – the output projection:
MultiHead = Concat(head1, …, headh) · WO
WO is a learned matrix that mixes the outputs from all attention heads back into a single representation of the original dimension. Without it, the heads would contribute independently – the output projection lets the model learn how to combine what different heads discovered.
There's also an output projection at the very end of the model – the language model head – which maps the final hidden state into logits over the vocabulary. This is the last matrix multiply before softmax and sampling.
RoPE – Rotary Position Embeddings
Transformers process all tokens in parallel – they have no inherent sense of order. Positional encoding injects sequence position into the model. RoPE is the method most modern LLMs use (LLaMA, Mistral, Qwen, Gemma).
The idea: encode position by rotating the query and key vectors in pairs of dimensions. Token at position m gets its Q and K vectors rotated by an angle proportional to m:
RoPE(x, m) = x · R(m·θ)
where R is a rotation matrix and θ varies by dimension pair
Relative by construction: when Q and K are both rotated, the dot product Q·Kᵀ naturally depends on the difference between their positions – not the absolute position. The model learns relative distance for free.
Extrapolation: because it's based on rotation angles, RoPE can extend to sequence lengths longer than training – the basis of most context-length extension techniques (NTK-aware scaling, YaRN).
No learned parameters: the rotation frequencies are fixed (like sinusoidal encodings), so RoPE adds zero extra parameters to the model.
The different dimension pairs rotate at different frequencies – low dimensions rotate slowly (capturing long-range position), high dimensions rotate fast (capturing local position). It's the same idea as Fourier features: encode position at multiple scales simultaneously.
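A single-vector sketch of the rotation – simplified, since real implementations rotate whole Q and K tensors at once, but the relative-position property is already visible:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of x by angles proportional to its position."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per dimension pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                      # pair up the dimensions
    rotated = np.empty_like(x)
    rotated[0::2] = x1 * cos - x2 * sin            # standard 2-D rotation, per pair
    rotated[1::2] = x1 * sin + x2 * cos
    return rotated

q = np.ones(8)
# The dot product of two rotated copies depends only on the *difference* of their positions:
print(np.allclose(rope(q, 3) @ rope(q, 7), rope(q, 10) @ rope(q, 14)))   # True
```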
Temperature
A single number that controls how decisive the model is when sampling. It scales the logits before softmax:
P(token) = softmax( logits / T )
T → 0 – softmax becomes argmax. The model always picks the most likely token. Deterministic, repetitive.
T = 1 – raw probabilities. The model samples naturally from its distribution.
T > 1 – distribution flattens. Less likely tokens get a real chance. Creative, chaotic.
Temperature doesn't change what the model "thinks." It changes how aggressively it commits to its strongest prediction. Low temperature = sharp collapse. High temperature = broad exploration.
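A toy sampler over a four-token vocabulary – the logits are made up, but the effect of T shows up clearly in the sampled frequencies:

```python
import numpy as np

def sample(logits, temperature, rng):
    if temperature <= 0:                           # treat T -> 0 as greedy decoding (argmax)
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([4.0, 3.5, 1.0, -2.0])           # toy scores over a 4-token vocabulary
rng = np.random.default_rng(0)
for T in (0.0, 0.7, 1.5):
    picks = [sample(logits, T, rng) for _ in range(1000)]
    print(T, np.bincount(picks, minlength=4) / 1000)   # higher T -> flatter distribution
```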
Latent / Latent Space
A latent is a hidden representation – a vector of numbers that the model uses internally but that doesn't correspond directly to anything a human would recognise. The "latent space" is the high-dimensional coordinate system these vectors live in.
In a Transformer
Every token's embedding is a latent – a point in a 4096-dimensional space. Tokens with similar meanings end up near each other. Each layer moves these points around, refining meaning through geometry.
In a Diffusion Model
The noisy image representation at each denoising step is a latent. Stable Diffusion works in a compressed "latent space" (64×64 instead of 512×512 pixels) – the VAE encoder maps images into it, and the decoder maps back out.
Why it matters: The model never works with raw inputs. It projects everything into latent space first – a compressed, structured representation where the geometry encodes meaning. Nearby points mean similar things. Directions encode concepts. The model reasons in this space, and the output is decoded back into tokens, pixels, or audio.
When someone says "latent diffusion," they mean "diffusion that happens in a compressed representation space, not in raw pixel space." When someone says "the model's latent representations," they mean the internal vectors the model computes between input and output.
Number Formats: Why Precision Matters
Every weight, every activation, every intermediate result is a number. How you represent that number – how many bits, what format – changes everything about cost, speed, and quality.
FP32 – 32-bit Floating Point
The gold standard. 1 sign bit, 8 exponent bits, 23 mantissa bits. Can represent numbers from ~10⁻³⁸ to ~10³⁸ in magnitude, with ~7 decimal digits of precision.
Pro: Maximum precision, no numerical issues
Con: 4 bytes per weight. A 70B model = 280 GB. Needs multiple high-end GPUs just to load.
BF16 – Brain Float 16
Google's format. 1 sign bit, 8 exponent bits (same range as FP32), but only 7 mantissa bits (~2 decimal digits of precision). Half the memory of FP32, same dynamic range.
Pro: Halves memory. Training-safe – the wide exponent range prevents overflow/underflow
Con: Less precise than FP16 in the mantissa. Fine for neural nets; terrible for scientific computing.
Most modern models train in BF16. It's the default precision for LLaMA, Mistral, and most open-weight models.
FP32 – [1 sign] [8 exponent] [23 mantissa] – 4 bytes – full precision
FP16 – [1 sign] [5 exponent] [10 mantissa] – 2 bytes – narrower range, more mantissa than BF16
BF16 – [1 sign] [8 exponent] [7 mantissa] – 2 bytes – same range as FP32, less mantissa
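The model-size figures quoted throughout this section are just parameter count × bytes per weight – a quick back-of-the-envelope check:

```python
params = 70e9                       # a 70B-parameter model

bytes_per_weight = {"FP32": 4, "BF16": 2, "FP16": 2, "INT8": 1, "INT4": 0.5}
for fmt, nbytes in bytes_per_weight.items():
    print(f"{fmt}: {params * nbytes / 1e9:.0f} GB")
# FP32: 280 GB, BF16/FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```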
Quantisation: Shrinking the Numbers
Quantisation converts floating-point weights into lower-precision integers. Instead of storing each weight as a 16-bit or 32-bit float, you map the values into a smaller set of discrete levels. The idea: neural network weights are statistically well-behaved – most values cluster around zero, with tails that decay rapidly. You don't need 32 bits of precision to represent them.
INT8 – 8-bit Integer
Maps each weight to one of 256 levels. A scale factor and zero-point convert between the quantised integers and the original float values.
wfloat ≈ scale × (wint8 − zero_point)
Memory: 1 byte per weight. 70B model = ~70 GB. Fits on a single large GPU.
Quality: Negligible degradation for most models. The standard for production inference.
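A minimal sketch of the symmetric variant (zero-point fixed at 0), applied to random stand-in weights – real schemes like GPTQ and AWQ are considerably more careful about how and where they round:

```python
import numpy as np

def quantise_int8(w):
    """Symmetric per-tensor INT8 quantisation: map floats onto 256 integer levels."""
    scale = np.abs(w).max() / 127.0              # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale          # w_float ~= scale * w_int8

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=10_000).astype(np.float32)   # weights cluster around zero
q, scale = quantise_int8(w)
error = np.abs(w - dequantise(q, scale)).max()
print(q.dtype, scale, error)    # int8, a tiny scale factor, worst-case error ~ scale / 2
```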
INT4 – 4-bit Integer
Maps each weight to one of 16 levels. Half a byte per weight. This is where things get aggressive – you're representing the entirety of a weight's learned knowledge with just 4 bits.
Memory: 0.5 bytes per weight. 70B model = ~35 GB. Fits on consumer hardware.
Quality: Measurable but often acceptable degradation. GPTQ, AWQ, and GGUF all support this.
Technique: Group quantisation – quantise blocks of 32–128 weights together with a shared scale factor, preserving local distributions.
INT4 is the sweet spot for running large models on consumer GPUs. Most "local LLM" setups use 4-bit quantised models.
BitNet & Ternary Weights: {−1, 0, +1}
The extreme end of quantisation. Microsoft's BitNet b1.58 constrains every weight to one of three values: −1, 0, or +1. That's 1.58 bits per weight (log₂(3)).
Traditional MatVec: multiply and accumulate (MAC) – each operation is a float multiply + float add
BitNet MatVec: add, subtract, or skip – no multiplications at all
When every weight is −1, 0, or +1, the matrix-vector multiply becomes pure addition and subtraction. The hardware implications are massive:
No multiplier circuits needed – adders are smaller, faster, and use less energy
~1.58 bits per weight – a 70B model fits in ~13 GB
Custom silicon – purpose-built chips can be radically simpler and more efficient
Microcontroller inference – models can potentially run on devices with no floating-point unit at all
The catch: you can't just quantise a trained model to ternary after the fact. BitNet models must be trained from scratch with ternary constraints. The model learns to distribute its knowledge across three values instead of continuous floats. The research so far suggests that, at sufficient scale, ternary models can match the quality of full-precision models.
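A toy ternary matrix-vector multiply – the weights here are random trits, and the point is simply that no multiplications are needed:

```python
import numpy as np

def ternary_matvec(W_trits, x):
    """Matrix-vector multiply where every weight is -1, 0, or +1: only adds and subtracts."""
    y = np.zeros(W_trits.shape[0])
    for i, row in enumerate(W_trits):
        y[i] = x[row == 1].sum() - x[row == -1].sum()   # +1: add, -1: subtract, 0: skip
    return y

rng = np.random.default_rng(0)
W = rng.choice([-1, 0, 1], size=(4, 8))           # a toy ternary weight matrix
x = rng.normal(size=8)
print(np.allclose(ternary_matvec(W, x), W @ x))   # True - same result, no multiplications
```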
Trits – The Ternary Digit
A bit holds two states: 0 or 1. A trit holds three: −1, 0, or +1. It's the fundamental unit of ternary computing, and it's what BitNet weights are measured in.
1 bit → 2 states → log₂(2) = 1.00 bits of information
1 trit → 3 states → log₂(3) ≈ 1.58 bits of information
1 byte → 256 states → 8 bits → can pack 5 trits (3⁵ = 243 ≤ 256)
Each trit encodes a simple instruction for the matrix multiply: +1 means add this input, −1 means subtract it, 0 means skip it entirely. No multiplication needed – just routing and addition.
Density: You can pack 5 trits into a single byte. A 70B-parameter ternary model packed this way stores its weights in ~14 billion bytes – about 13 GB.
Hardware: Ternary logic maps naturally to simple circuits – a trit can be encoded as two bits, or packed more efficiently. Custom silicon for trit-based models can skip the entire floating-point unit.
History: Ternary computing isn't new. The Soviet Setun computer (1958) was built on balanced ternary. BitNet brings the idea back, applied to neural network weights.
A ternary weight isn't a degraded float. It's a different primitive – one that trades precision for radical simplicity. When the model is trained to work in trits from the start, it learns to encode its knowledge in patterns of add, subtract, and skip.
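A sketch of that base-3 packing – five trits in, one byte out, and back again:

```python
def pack_trits(trits):
    """Pack 5 trits (each -1, 0, or +1) into one byte using base-3: 3^5 = 243 <= 256."""
    assert len(trits) == 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)        # shift each trit from {-1,0,1} to {0,1,2}
    return value

def unpack_trits(byte):
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)         # shift back from {0,1,2} to {-1,0,1}
        byte //= 3
    return trits

weights = [1, -1, 0, 0, 1]
packed = pack_trits(weights)
print(packed, unpack_trits(packed))        # one byte in, the same five trits out
```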
The Precision-Cost Tradeoff
| Format | Bits/Weight | 70B Model Size | Compute | Quality Loss |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | Float MAC | None (baseline) |
| BF16 | 16 | 140 GB | Float MAC | Negligible |
| INT8 | 8 | 70 GB | Int MAC | Minimal |
| INT4 | 4 | 35 GB | Int MAC | Measurable |
| BitNet 1.58b | 1.58 | ~13 GB | Add/Sub only | ~None at scale* |
* BitNet requires training from scratch with ternary constraints. Post-training quantisation to 1.58b is lossy.
Why This Matters
The entire trajectory of making AI accessible is a story of precision reduction. FP32 models need a server room. BF16 models need a workstation. INT4 models run on a gaming laptop. BitNet models could run on a phone – or a microcontroller.
Every halving of precision roughly halves memory, doubles throughput, and halves energy consumption. The model quality barely changes until you push below 4 bits – and with BitNet, even that boundary is dissolving.
Quantisation isn't degradation. It's compression. The same intelligence, in a smaller container. And at the extreme – ternary weights – the container becomes so simple that the hardware itself can be redesigned around it.
The future of AI isn't just better models. It's the same models, running everywhere, because the numbers got small enough.