โ† how ai works
PART 5 OF 7

Finetuning, LoRA & PEFT

How you adapt a model without training from scratch.

Why Finetune?

Training a foundation model from scratch costs millions of dollars and months of compute. But you can take an existing model and adapt it to your domain (your data, your tone, your task) for a fraction of the cost. This is finetuning. You start with a model that already understands language (or images, or audio), and you nudge its weights toward your specific use case.

Full Finetuning

Update every weight in the model using your dataset. The model sees your examples and adjusts all parameters via backpropagation, exactly like pre-training but with a smaller learning rate and far fewer steps.

Pros: Maximum expressiveness. The model can learn completely new behaviours.
Cons: You need the full model in GPU memory (a 70B model needs ~140GB in fp16). Risk of catastrophic forgetting: the model becomes great at your task but loses its general knowledge. Every finetune also produces a full-size copy of the model to store and serve.
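A minimal sketch of what that loop looks like in PyTorch with a Hugging Face causal LM; the model name and the pre-tokenised dataloader are placeholders, not a specific recipe:

import torch
from transformers import AutoModelForCausalLM

# Load a pretrained model; every parameter will receive gradients.
model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder name
model.train()

# Small learning rate: we nudge existing weights rather than relearn language.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in dataloader:  # assumed: yields dicts with input_ids, attention_mask, labels
    loss = model(**batch).loss   # HF models return the LM loss when labels are supplied
    loss.backward()              # gradients for every weight in the model
    optimizer.step()
    optimizer.zero_grad()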

The Memory Problem

A 7B parameter model in fp16 needs ~14GB just to hold the weights. But training also needs:

Weights: 14 GB (fp16)
Gradients: 14 GB (same shape as weights)
Optimiser: 28 GB (Adam keeps two extra states per weight)
Activations: variable (depends on batch size and sequence length)
─────────────────────
Total: ~56 GB minimum for a 7B model

For a 70B model: ~560GB. That's seven A100-80GB GPUs just to hold the training state. This is why parameter-efficient methods exist.
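A back-of-the-envelope calculator for those numbers, as a rough sketch (it ignores activations, which depend on batch size and sequence length):

def full_finetune_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Rough training-state memory for full finetuning with Adam:
    weights + gradients + two optimiser states, all at bytes_per_param bytes."""
    n = params_billion * 1e9
    return n * bytes_per_param * (1 + 1 + 2) / 1e9

print(full_finetune_memory_gb(7))    # ~56 GB
print(full_finetune_memory_gb(70))   # ~560 GB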

LoRA โ€” Low-Rank Adaptation

Plain English

Instead of updating all 7 billion weights, freeze the original model entirely and inject tiny trainable side-matrices into each attention layer. These matrices learn the difference between the base model and what you want. At inference, merge them back in with zero extra latency.

The key insight: the weight changes a finetune needs are approximately low-rank. You don't need to adjust every dimension of a weight matrix; most of the change can be captured by the product of two much smaller matrices.

The Maths

Original: h = W · x    (W is d×d, e.g. 4096×4096)

LoRA: h = W · x + (B · A) · x

Where: A is r×d, B is d×r   (r ≪ d, e.g. r=16)

Merged: W' = W + B · A   (no extra cost at inference)

W - frozen original weight matrix (4096 × 4096 = 16.7M params)
A - down-projection, maps 4096 → 16 (65K params)
B - up-projection, maps 16 → 4096 (65K params)
Total trainable: 130K params instead of 16.7M, a 128× reduction per layer
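A minimal PyTorch sketch of the same idea: wrap a frozen linear layer with trainable A (r×d) and B (d×r) matrices. This is illustrative, not the peft library's actual implementation, and the rank and alpha values are just examples.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)              # freeze the original W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection (r x d)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection (d x r), zero so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # h = W·x + (alpha/r)·B·A·x, where only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merged_weight(self):
        # W' = W + (alpha/r)·B·A, ready to copy back into the base layer
        return self.base.weight.data + self.scale * (self.B @ self.A)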

LoRA In Practice

Rank (r) - controls capacity. r=4 for simple style changes. r=16–64 for domain adaptation. r=256 for complex tasks. Higher rank = more params = closer to full finetune.
Alpha (α) - scaling factor. The LoRA output is scaled by α/r. Higher alpha = stronger adaptation. Typical: α = 2×r.
Target modules - which layers get LoRA adapters. Usually the Q, K, V projection matrices in attention. Some approaches add them to the feed-forward layers too.
Merging - after training, compute W' = W + (α/r)·B·A and replace the original weights. The adapter disappears: same model size, different behaviour.
Stacking - you can train multiple LoRA adapters for different tasks and swap them at inference without retraining. One base model, many specialisations.

A LoRA adapter for a 7B model is typically 10–100MB. The full model is 14GB. You can share the base model across 100 customers, each with their own adapter, and the economics change completely.
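In practice most people use the Hugging Face peft library rather than hand-rolling the adapters. A sketch of a typical configuration follows; the model name is a placeholder, and the q_proj/v_proj module names are the Llama-style ones (other architectures name their projections differently):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("your-base-model")   # placeholder name

config = LoraConfig(
    r=16,                                  # rank
    lora_alpha=32,                         # alpha, here 2×r
    target_modules=["q_proj", "v_proj"],   # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()   # typically well under 1% of total parameters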

QLoRA โ€” Quantised LoRA

LoRA freezes the base model but still loads it in fp16. QLoRA goes further: quantise the base model to 4-bit (the NF4 format), then train LoRA adapters in fp16 on top. The adapters themselves still train in fp16, but the frozen base takes 4× less memory.

Full finetune 7B: ~56 GB (multiple GPUs)
LoRA 7B (fp16): ~16 GB (one good GPU)
QLoRA 7B (4-bit): ~6 GB (consumer GPU)

QLoRA made it possible to finetune a 65B-parameter model on a single 48GB GPU, and a 7B model on a gaming GPU with 8GB of VRAM. The quality loss from quantisation is surprisingly small: within 1% of full-precision LoRA on most benchmarks.
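With transformers, bitsandbytes and peft, the QLoRA recipe looks roughly like this (a sketch; the model name is a placeholder and exact arguments vary between library versions):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, the QLoRA format
    bnb_4bit_compute_dtype=torch.float16,    # adapters and activations in fp16
    bnb_4bit_use_double_quant=True,          # also quantise the quantisation constants
)

model = AutoModelForCausalLM.from_pretrained("your-base-model",   # placeholder name
                                             quantization_config=bnb_config)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
# Base weights: frozen, 4-bit. LoRA adapters: trainable, fp16.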

PEFT โ€” Parameter-Efficient Fine-Tuning

LoRA is the most popular PEFT method, but it's not the only one. PEFT is the family of techniques that adapt a model by training only a small fraction of its parameters. They all share the same goal: don't touch most of the weights.

LoRA - low-rank side-matrices in attention layers. Merge at inference. The winner for most use cases.
Prefix Tuning - prepend trainable "virtual tokens" to the input at each layer. The model learns to condition on these soft prefixes. No weight modification at all, just extra context.
Prompt Tuning - simpler version: trainable embeddings prepended to the input only at the first layer. Even cheaper than prefix tuning. Works surprisingly well on large models.
Adapters - small bottleneck layers inserted between existing layers. Input → down-project → nonlinearity → up-project → add back. Similar idea to LoRA but as separate modules rather than side-matrices (see the sketch after this list).
IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) - learn scaling vectors that multiply the keys, values, and FFN activations. Even fewer parameters than LoRA: a few vectors per layer instead of two matrices.
DoRA (Weight-Decomposed Low-Rank Adaptation) - decomposes each weight into magnitude and direction, and applies LoRA only to the direction component. Better training dynamics, closer to full-finetune quality.
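To make the contrast with LoRA concrete, here is a minimal sketch of a classic bottleneck adapter module (illustrative only; the hidden and bottleneck sizes are arbitrary examples):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, add back to the residual stream."""
    def __init__(self, d_model: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as an identity function
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))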

When to Use What

Choose LoRA / QLoRA When

You have domain-specific data (legal, medical, code, your company's tone)
You want to merge back into the base model for zero-overhead inference
You need multiple task-specific adapters on one base model
You're working with limited GPU budget (QLoRA on consumer hardware)
You want to avoid catastrophic forgetting of general knowledge

Choose Full Finetune When

You have massive domain-specific data (millions of examples)
The task requires fundamentally new capabilities, not just style/tone
You have the compute budget (multi-GPU cluster)
You're building a foundation model variant for a specific domain
LoRA at maximum rank still underperforms

Alignment Finetuning: RLHF & DPO

Finetuning isn't just about domain knowledge. The reason ChatGPT sounds helpful instead of just completing text is alignment finetuning: teaching the model how to respond, not just what to say.

RLHF (Reinforcement Learning from Human Feedback)

Step 1 - SFT: Supervised finetuning on human-written examples of good responses. Teaches format and helpfulness.

Step 2 - Reward model: Humans rank multiple outputs. Train a model to predict which response humans prefer.

Step 3 - PPO: Use the reward model as a signal to further finetune the language model via reinforcement learning. The model learns to maximise human preference scores while staying close to the SFT model (KL penalty).
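The reward model in Step 2 is typically trained with a simple pairwise ranking loss; a sketch, assuming the two score tensors come from the reward model's scalar head:

import torch
import torch.nn.functional as F

def reward_ranking_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # Push the preferred response's score above the rejected one:
    # loss = -log sigmoid(r_preferred - r_rejected)
    return -F.logsigmoid(score_preferred - score_rejected).mean()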

DPO (Direct Preference Optimisation)

Skip the reward model entirely. Given pairs of (preferred, rejected) responses, directly optimise the language model to increase the probability of preferred outputs and decrease rejected ones.

The insight: the RLHF objective has a closed-form optimal policy, which lets you rewrite the reward in terms of the policy itself and the frozen reference model. Alignment then reduces to a simple binary cross-entropy loss over the preference data: no RL, no reward model, no PPO instability.
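A sketch of that loss, assuming you already have the summed token log-probabilities of each response under the policy being trained and under the frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Implicit "reward" of each response: how much more likely the policy
    # makes it, relative to the reference model.
    chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    # Binary cross-entropy on the preference: push chosen up, rejected down.
    return -F.logsigmoid(chosen - rejected).mean()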

Simpler, cheaper, more stable. Increasingly preferred over RLHF for alignment.

Finetuning Diffusion Models

Everything above applies to image and video models too. You can LoRA a Stable Diffusion model to learn a specific style, face, or object with just 5–20 images and 15 minutes of training on a consumer GPU.

DreamBooth - finetune the entire model on a few images of a subject, binding it to a unique token (e.g. "sks person"). The model learns to generate that specific subject in any context.
Textual Inversion - freeze the model, learn only a new embedding vector for a concept. The cheapest approach: a single vector. Limited expressiveness but zero risk to the base model.
LoRA for Diffusion - inject low-rank adapters into the U-Net's cross-attention layers. Style transfer, character consistency, aesthetic control. Adapters are ~10–100MB vs 2GB+ for the full model. Composable: mix multiple LoRAs at inference with weighted merging (see the sketch after this list).
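With the diffusers library, loading and blending diffusion LoRAs looks roughly like this (a sketch; the base model ID, adapter paths, adapter names and weights are placeholders, and the exact API varies between versions):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16   # example base model
).to("cuda")

# Two adapters trained separately, blended at inference time.
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")           # placeholder path
pipe.load_lora_weights("path/to/character_lora", adapter_name="character")   # placeholder path
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.6])

image = pipe("a portrait in the trained style").images[0]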

The Full Training Pipeline

1. Pre-training   → learn language/vision from massive data   (billions of tokens, months, millions of $)
2. SFT            → supervised finetuning on curated examples  (thousands of examples, hours)
3. RLHF / DPO     → align to human preferences                 (preference pairs, hours to days)
4. Domain LoRA    → adapt to a specific use case               (your data, minutes to hours)

Each stage is cheaper and faster than the last.
Each stage changes less of the model.
The expensive work is done once. The cheap work is done per customer.

The economics of AI deployment are defined by this pipeline. Pre-training is a fixed cost amortised across all users. Finetuning is the marginal cost per use case. LoRA collapsed that marginal cost by 100×. QLoRA collapsed it again. The trend is clear: adaptation is getting cheaper faster than pre-training is.

โ† Part 4: Beyond Transformers & Diffusion series index Part 6: The Wavefunction Metaphor โ†’