โ† how ai works
PART 3 OF 7

Determinism, Seeds & Sampling

Why the same prompt gives different outputs.

Seeds: The Starting Point of Randomness

Both Transformers and diffusion models use random numbers during generation. A seed is the starting value for the random number generator (RNG). Same seed → same sequence of random numbers → same output (in theory).

Transformer
The model outputs a probability distribution over tokens. Sampling picks one. The RNG seed controls which token gets chosen when multiple tokens have similar probabilities. Same seed + same prompt + temperature > 0 = same text.

Diffusion
Generation starts from pure noise — the seed determines the specific noise tensor. Different seed = different starting noise = completely different image. Same seed + same prompt + same settings = same image. This is why image tools let you copy seeds.
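
A minimal sketch of the idea, using NumPy's RNG as a stand-in for whatever generator a real pipeline uses (the function name here is illustrative, not any library's API):

import numpy as np

def starting_noise(seed: int, shape=(4, 64, 64)) -> np.ndarray:
    """Draw the initial noise tensor a diffusion run would start from."""
    rng = np.random.default_rng(seed)      # seed fixes the RNG state
    return rng.standard_normal(shape)      # same seed -> same Gaussian noise

a = starting_noise(seed=1234)
b = starting_noise(seed=1234)
c = starting_noise(seed=9999)

print(np.array_equal(a, b))   # True:  identical starting noise
print(np.array_equal(a, c))   # False: different noise, a different image emerges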

Source 1: Sampling — Intentional Randomness

Sampling is the deliberate source of variation. The model produces probabilities; sampling decides what to do with them.

Transformer Sampling

After softmax, you have a probability distribution over the vocabulary. Several strategies exist:

Greedy (temperature = 0) — always pick the highest-probability token. Deterministic. Same input = same output. No randomness at all.
Temperature sampling — divide the logits by temperature before softmax. Low temperature (0.1–0.5) → sharpens the distribution → more predictable. High temperature (1.0–2.0) → flattens it → more creative/random.
Top-k — only consider the top k most likely tokens, zero out the rest, then sample. Prevents the model picking wildly unlikely tokens.
Top-p (nucleus) — only consider the smallest set of tokens whose probabilities sum to p (e.g. 0.9). Adapts the candidate pool size based on confidence.

Temperature controls how sharply the distribution collapses. At 0, the wavefunction collapses deterministically to the peak. At high values, it explores the full field of possibilities.
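
A rough sketch of those strategies in plain NumPy (toy code, not any particular library's sampler; logits stands in for the raw scores the model produced for the next token):

import numpy as np

def sample(logits, rng, temperature=1.0, top_k=None, top_p=None):
    """Toy token sampler: temperature, then optional top-k / top-p filtering."""
    if temperature == 0:                      # greedy: no randomness at all
        return int(np.argmax(logits))

    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())     # softmax (shifted for stability)
    probs /= probs.sum()

    if top_k is not None:                     # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                     # keep the smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered

    probs /= probs.sum()                      # renormalise after filtering
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(42)               # the seed from the section above
logits = [2.0, 1.9, 0.3, -1.0]
print(sample(logits, rng, temperature=0))              # always token 0
print(sample(logits, rng, temperature=0.8, top_p=0.9))

Greedy ignores the RNG entirely; every other strategy consumes random draws from the seeded generator, which is why the seed matters as soon as temperature is above zero.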

Diffusion Sampling

In stochastic samplers, each denoising step adds a small amount of fresh Gaussian noise (except the final step). This keeps the process stochastic — the model doesn't just converge to the mathematically closest clean image, it explores the space of plausible outputs.

DDPM (original) — full stochastic schedule. Each step adds fresh noise. More variation, slower (1000 steps).
DDIM — deterministic variant. No added noise between steps. Same seed = exactly the same image. Much faster (20–50 steps).
Euler / DPM-Solver — treat denoising as solving a differential equation. Faster convergence, tuneable stochasticity.

The scheduler is the collapse trajectory. DDPM explores more of the possibility space. DDIM takes the most direct path. Same destination, different routes through the noise.
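
A heavily simplified sketch of one reverse step, with placeholder names (eps_model, alpha_bar) rather than any library's real API, using the standard eta parameterisation where eta = 0 gives the deterministic DDIM path and eta = 1 behaves like DDPM:

import numpy as np

def reverse_step(x_t, t, t_prev, eps_model, alpha_bar, rng, eta=0.0):
    """One denoising step; eta controls how much fresh noise is injected."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_model(x_t, t)                                  # predicted noise

    # Estimate the clean image implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)

    # eta = 0 -> no injected noise (DDIM); eta = 1 -> DDPM-like stochasticity.
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)

    noise = rng.standard_normal(x_t.shape) if eta > 0 else 0.0
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev - sigma**2) * eps + sigma * noise

With eta = 0 the only randomness left is the initial noise tensor, so a fixed seed reproduces the same image exactly, at least until the floating-point effects below get involved.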

Source 2: Floating-Point Non-Determinism — Accidental Randomness

Even with the same seed, same prompt, and greedy/deterministic sampling, you can still get different outputs. This is the one that surprises people.

The maths is deterministic. The hardware isn't.

Floating-point numbers (float16, float32) can't represent most real numbers exactly. They round. And the order in which you do additions affects the rounding:

(a + b) + c ≠ a + (b + c)   (in floating-point arithmetic)

This matters because GPUs are massively parallel — thousands of threads adding numbers together simultaneously. The order in which those threads finish is non-deterministic. Different execution order → different rounding → different results.
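
This is easy to check on a CPU. The sketch below shows the textbook non-associativity, and that merely reordering a float32 sum changes the result, which is exactly what a GPU's parallel reduction can do from run to run:

import numpy as np

# Textbook non-associativity with ordinary Python floats (IEEE 754 doubles):
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))    # False

# Summing the *same* float32 numbers in a different order gives a slightly
# different total, the same effect a parallel reduction produces when the
# order the threads finish in changes.
rng = np.random.default_rng(0)
values = rng.standard_normal(100_000).astype(np.float32)

forward = np.float32(0.0)
reverse = np.float32(0.0)
for v in values:
    forward += v
for v in values[::-1]:
    reverse += v

print(forward, reverse)       # differ somewhere in the last decimal places
print(forward == reverse)     # almost certainly False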

Different GPU model — different FP units, different rounding behaviour
Different GPU count — different parallelisation splits, different addition order
Different driver version — different kernel implementations, different thread scheduling
Same GPU, same driver — can still vary if the CUDA runtime doesn't enforce deterministic kernels
float16 vs float32 — fewer precision bits → larger rounding errors → bigger divergence

The difference starts tiny — maybe the 7th decimal place. But in an autoregressive model, one slightly different token changes every subsequent token. In diffusion, one slightly different denoising step shifts the trajectory through noise space. Small rounding error → amplified through thousands of sequential operations → completely different output.

Transformer Cascade

Token 47 has two candidates with near-identical probabilities.
FP rounding flips which one comes out larger.
Different token chosen.
That token changes the context for tokens 48, 49, 50…
By token 200, the outputs are completely different.
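
A toy illustration of that flip, with made-up logits chosen so two tokens are nearly tied; a perturbation in the sixth decimal place is enough to change the greedy choice:

import numpy as np

logits = np.array([3.141592, 3.141593, 1.0, -2.0])

# The same logits after a tiny upstream rounding difference.
perturbed = logits + np.array([2e-6, 0.0, 0.0, 0.0])

print(np.argmax(logits))      # 1
print(np.argmax(perturbed))   # 0: a different token, and a different context
                              #    for every token generated after it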

Diffusion Cascade

Step 12 of 50: noise prediction differs at the 6th decimal place.
Subtracted noise is slightly different.
x_{t-1} shifts by a fraction of a pixel.
By step 50, the dog has different fur texture.
Or a slightly different pose. Or a different background.

The Two Sources, Summarised

Sampling — intentional randomness. Controlled by seed, temperature, scheduler. You choose how much variation you want. This is the feature.
FP non-determinism — accidental randomness. Caused by hardware parallelism and rounding. You can't fully control it. This is the physics.

True determinism requires: same seed + same model weights + same precision + same hardware + same driver + deterministic CUDA kernels + same batch size. In practice, most systems settle for "reproducible enough" — same seed gives visually similar results, not bit-identical ones.
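
For a PyTorch stack specifically, the usual knobs look something like the sketch below (how much they actually pin down depends on the PyTorch version, the backend, and which ops the model uses):

import os
import torch

# Needed by cuBLAS for deterministic matmuls; set before the first CUDA call.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.manual_seed(1234)                     # seeds the CPU and CUDA RNGs
torch.use_deterministic_algorithms(True)    # error on non-deterministic ops
torch.backends.cudnn.deterministic = True   # force deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable auto-tuning, which can vary

Even with all of this set, the guarantee only holds for one combination of hardware, driver and library versions; move the same seed to a different GPU and you are back to "reproducible enough".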

โ† Part 2: Diffusion Models series index Part 4: Beyond Transformers & Diffusion โ†’