Plain English
Start with pure noise. Gradually remove the noise, step by step, guided by a text prompt, until a coherent image emerges. The model doesn't "draw" — it denoises. It's been trained on millions of image–caption pairs and learned what "looks right" for a given description at every noise level.
The image was always there, buried in the noise. The model just learned how to find it.
The Architecture
1. Text encoder — prompt → embedding vectors (CLIP / T5)
2. Noise schedule — define T timesteps of increasing noise
3. Forward process — gradually add Gaussian noise to training images
4. U-Net / DiT — neural network that predicts the noise to remove
5. Reverse process — iteratively denoise from pure noise → image
6. Latent space (optional) — operate in compressed space via VAE
7. Classifier-free guidance — amplify the effect of the text condition
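To show how these seven pieces compose at generation time, here is a toy end-to-end skeleton. Every component in it is a dummy stand-in (random tensors, a crude update rule), purely to make the data flow visible; the real maths for each piece follows in the sections below.

```python
import torch

def encode_text(prompt):               # 1. text encoder stand-in
    # (batch, tokens, dim); 77 tokens is CLIP-like, dim shrunk for the toy
    return torch.randn(1, 77, 64)

def unet(z, t, text_emb):              # 4. noise-prediction stand-in
    return torch.randn_like(z)

def generate(prompt, T=50):
    text_emb = encode_text(prompt)
    z = torch.randn(1, 4, 64, 64)      # start from pure noise (latent-shaped)
    for t in reversed(range(T)):       # 5. reverse process
        eps = unet(z, t, text_emb)     # predict the noise, conditioned on text
        z = z - eps / T                # crude stand-in for one denoising step
    return z                           # 6. a real system would VAE-decode here
```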
The Maths: Forward Process
Take a clean image x₀. At each timestep t, add Gaussian noise according to a schedule βt:
q(xt | xt−1) = 𝒩(xt; √(1 − βt) · xt−1, βt · I)
With αt = 1 − βt and ᾱt = ∏ₛ₌₁ᵗ αs (the running product of the αs up to step t), we can jump to any timestep directly:
q(xt | x0) = 𝒩(xt; √ᾱt · x0, (1 − ᾱt) · I)
At t = 0, you have the original image. At t = T, you have pure Gaussian noise. The schedule ᾱt controls how fast information is destroyed.
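In code, the closed-form jump is a few lines. A minimal PyTorch sketch, assuming a linear β schedule (the 1e-4 to 0.02 endpoints follow the original DDPM paper, but any monotone schedule works):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # the noise schedule βt
alphas = 1.0 - betas                       # αt = 1 − βt
alpha_bars = torch.cumprod(alphas, dim=0)  # ᾱt = ∏ₛ₌₁ᵗ αs

def q_sample(x0, t, eps):
    """Jump straight to timestep t: xt = √ᾱt·x0 + √(1−ᾱt)·ε."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
```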
The Maths: Reverse Process
The model learns to reverse this: given a noisy image at timestep t, predict what the slightly-less-noisy version looks like:
pθ(xt−1 | xt) = 𝒩(xt−1; μθ(xt, t), Σθ(xt, t))
In practice, the network predicts the noise εθ that was added, and μ is derived from it:
μθ(xt, t) = (1/√αt) · (xt − (βt/√(1 − ᾱt)) · εθ(xt, t))
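Continuing the sketch above (reusing betas, alphas, and alpha_bars), the mean formula translates directly:

```python
def posterior_mean(xt, t, eps_pred):
    """μθ(xt, t) from the predicted noise, per the formula above.
    Here t is a single integer timestep."""
    beta_t = betas[t]
    alpha_t = alphas[t]
    a_bar_t = alpha_bars[t]
    return (xt - (beta_t / (1 - a_bar_t).sqrt()) * eps_pred) / alpha_t.sqrt()
```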
The Training Objective
Beautifully simple. Sample a random timestep, add noise, and train the network to predict what noise was added:
ℒ = 𝔼t, x₀, ε [ ‖ε − εθ(√ᾱt · x₀ + √(1−ᾱt) · ε, t)‖² ]
The model learns: "given this noisy mess at this noise level, what does the noise look like?" That's it. Do this well enough across millions of images and every noise level, and the model can reverse-engineer any image from static.
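A sketch of one training step is almost a transcription of the loss, reusing T and q_sample from the forward-process code and assuming a model(xt, t) that returns εθ:

```python
def training_loss(model, x0):
    """One step of the ε-prediction objective above."""
    t = torch.randint(0, T, (x0.shape[0],))   # random timestep per image
    eps = torch.randn_like(x0)                # the noise we will add
    xt = q_sample(x0, t, eps)                 # noisy input at level t
    eps_pred = model(xt, t)                   # the network's guess at the noise
    return torch.nn.functional.mse_loss(eps_pred, eps)
```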
Conditioning: Text → Image
The text prompt enters via cross-attention. The U-Net's attention layers receive text embeddings as keys and values, while the noisy image features are the queries:
CrossAttention(Qimage, Ktext, Vtext) = softmax(Q · Kᵀ / √d) · V
Same attention mechanism as the Transformer, but cross-modal: the image queries text for guidance on what to denoise towards. The text doesn't tell the model what to draw — it constrains what the noise can collapse into.
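A minimal single-head version makes the shapes clear. A real U-Net learns linear projections for Q, K, and V and uses many heads; this sketch feeds the inputs in directly to keep the shapes in view:

```python
import torch

def cross_attention(img_feats, text_emb):
    """Single-head cross-attention: image queries, text keys/values.
    img_feats: (B, N_pixels, d), text_emb: (B, N_tokens, d)."""
    Q, K, V = img_feats, text_emb, text_emb      # projections omitted for brevity
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # (B, N_pixels, N_tokens)
    weights = scores.softmax(dim=-1)             # each pixel attends over text tokens
    return weights @ V                           # text-informed image features
```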
What the U-Net Actually Computes
The U-Net is a convolutional neural network. A convolution is a linear operation, like the matrix multiplications in a Transformer, but instead of operating globally on sequences it operates locally on grids of pixels, reusing the same small set of weights at every position.
What a convolution does
Take a small grid of weights (e.g. 3×3 — called a kernel). Slide it across the image. At each position, multiply the kernel weights by the pixel values underneath, add them up → one output number. Repeat for every position. The result is a new grid — a feature map — that responds to a particular pattern (edges, textures, shapes).
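Here is that sliding-window computation spelled out in NumPy, with a classic vertical-edge kernel as the example pattern (a minimal valid-padding version; real libraries vectorise this):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image: at each position, multiply the
    overlapping values, sum them, and write one output number."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

# A classic vertical-edge detector: responds where left ≠ right.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])
```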
Why it works for images
Convolutions exploit the fact that images have local structure: nearby pixels are related. A 3×3 kernel can detect an edge. Stack layers, and deeper kernels detect eyes, faces, objects. Same principle as attention (a weighted combination of inputs), but spatially local instead of global.
The "U" shape: the network downsamples (compresses the image to smaller and smaller feature maps — capturing what, losing where) then upsamples (expands back to full resolution — recovering where, guided by what). Skip connections link matching resolution levels so detail isn't lost.
What One Denoising Step Actually Does
Each step is concrete:
1. Take the current noisy image xt (a grid of numbers — pixel values with noise)
2. Tell the U-Net what timestep t we're at (embedded as a vector, injected into every layer)
3. Feed the text prompt via cross-attention — the image features query the text: "what should I look like?"
4. The U-Net outputs εθ — its prediction of the noise in the current image
5. Subtract a scaled version of that predicted noise from xt
6. Add a small amount of fresh random noise, except at the very last step (this keeps the process stochastic)
7. The result is xt−1 — slightly less noisy, slightly more coherent
Repeat 20–50 times. Each step the image gets a little clearer. The first few steps establish the broad composition (sky at top, ground at bottom). The middle steps add structure (buildings, trees). The final steps add fine detail (textures, edges, lighting).
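Put together, the loop looks like the sketch below, reusing posterior_mean and betas from earlier and assuming a model(xt, t, text_emb) that returns εθ. This is plain DDPM ancestral sampling over the full schedule; the 20–50 step counts used in practice come from faster samplers (e.g. DDIM) that skip timesteps. Taking σt = √βt is one standard choice for the fresh noise:

```python
@torch.no_grad()
def sample(model, text_emb, shape, T=1000):
    """The seven steps above, in a loop."""
    xt = torch.randn(shape)                       # start from pure noise
    for t in reversed(range(T)):
        eps = model(xt, t, text_emb)              # steps 1-4: predict the noise
        mean = posterior_mean(xt, t, eps)         # step 5: subtract scaled noise
        if t > 0:
            sigma = betas[t].sqrt()               # step 6: fresh noise, not at the end
            xt = mean + sigma * torch.randn_like(xt)
        else:
            xt = mean                             # step 7: the final, clean image
    return xt
```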
The U-Net isn't drawing. It's looking at static and answering: "what part of this is noise and what part is signal?" At each step, it peels away a layer of noise. What remains is the image that was always there — the text just told the model which image to find in the noise.
Classifier-Free Guidance
To strengthen the text's influence, the model runs two predictions — one conditioned on the prompt, one unconditional — and amplifies the difference:
ε̃ = εuncond + w · (εcond − εuncond)
w > 1 pushes the output further toward the prompt (typical: 7–15)
Higher guidance = more prompt adherence, less diversity. It's a knob that controls how tightly the constraint is applied.
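In code, guidance is two forward passes and a weighted combination, exactly the ε̃ formula above. This sketch assumes a model(xt, t, emb) that returns εθ and a null_emb, the embedding of the empty prompt:

```python
def guided_eps(model, xt, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance: run the model twice, amplify the difference."""
    eps_cond = model(xt, t, text_emb)    # conditioned on the prompt
    eps_uncond = model(xt, t, null_emb)  # unconditional prediction
    return eps_uncond + w * (eps_cond - eps_uncond)
```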