Generative Adversarial Networks (GANs)
Plain English
Two neural networks play a game. The Generator tries to create fake data that looks real. The Discriminator tries to spot the fakes. They train against each other, the generator getting better at faking and the discriminator getting better at detecting, until the fakes are indistinguishable from real data.
Think of it as a forger vs an art inspector. The forger keeps improving until the inspector can't tell the difference. At that point, the forger is a generative model.
The Maths
min_G max_D V(D, G) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]
D(x): the discriminator's confidence that x is real (0–1)
G(z): the generator's output from random noise z
The game: D wants to maximise V (correctly classify real vs fake); G wants to minimise it (fool D). At the Nash equilibrium, D outputs 0.5 for everything because it can no longer tell real from fake.
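To make the objective concrete, here is a minimal PyTorch-style sketch of the alternating updates. The names (G, D, opt_G, opt_D, real, z_dim) are placeholders for whatever generator, discriminator, optimisers and data you have; the non-saturating generator loss is the standard practical substitute for the log(1 − D(G(z))) term.

```python
import torch

# Minimal sketch of one alternating GAN update. G, D, opt_G, opt_D and `real`
# are placeholders; D is assumed to output probabilities in (0, 1).
def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    batch = real.size(0)

    # Discriminator: maximise log D(x) + log(1 - D(G(z)))  (minimise the negative)
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                      # don't backprop into G on this step
    loss_D = -(torch.log(D(real) + 1e-8).mean() +
               torch.log(1 - D(fake) + 1e-8).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: the "non-saturating" loss -log D(G(z)) is used in practice
    # instead of log(1 - D(G(z))), which gives vanishing gradients early on.
    z = torch.randn(batch, z_dim)
    loss_G = -torch.log(D(G(z)) + 1e-8).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

In practice the two networks take these steps in alternation, and keeping them balanced is exactly what makes GAN training notoriously touchy.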
GAN Architecture Variants
DCGAN: convolutional generator/discriminator. Replaced fully-connected layers with conv layers. Made training stable enough to actually work.
StyleGAN: injects style at each resolution layer. Produces photorealistic faces. Introduced the "mapping network" that converts z to a style space w before generation.
CycleGAN: unpaired image-to-image translation (horses ↔ zebras). Two generators, two discriminators, cycle-consistency loss.
Pix2Pix: paired image-to-image (sketch → photo). Conditional GAN with L1 reconstruction loss.
WGAN: uses Wasserstein distance instead of JS divergence. Smoother gradients, more stable training. Greatly reduces "mode collapse", where the generator only produces a few distinct outputs.
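For contrast with the standard loss above, here is a hedged sketch of the WGAN critic update. It assumes a critic C whose output is an unbounded score rather than a probability; the clipping constant is the one from the original paper, and WGAN-GP replaces clipping with a gradient penalty.

```python
import torch

# Sketch of the WGAN critic update. C is a critic with an unbounded scalar
# output (no sigmoid); the loss approximates the Wasserstein distance between
# real and generated distributions.
def critic_step(C, G, opt_C, real, z_dim=100, clip=0.01):
    fake = G(torch.randn(real.size(0), z_dim)).detach()
    loss_C = -(C(real).mean() - C(fake).mean())   # maximise C(real) - C(fake)
    opt_C.zero_grad(); loss_C.backward(); opt_C.step()
    # Crude Lipschitz constraint via weight clipping; WGAN-GP swaps this
    # for a gradient penalty, which works better in practice.
    for p in C.parameters():
        p.data.clamp_(-clip, clip)
    return loss_C.item()
```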
GANs dominated image generation from 2014 to 2021. Diffusion models largely replaced them for quality, but GANs remain faster at inference: a single forward pass versus 20–50 denoising steps. Some modern systems combine both: use diffusion for quality, then distil into a GAN for speed.
Video Generative Models
Plain English
Video is images + time. The fundamental challenge is temporal coherence: making sure frame 47 looks like it belongs after frame 46. A video model doesn't generate frames independently; it has to understand motion, physics, and continuity.
Most video models extend image architectures into the time dimension. The same attention mechanisms that let a Transformer look across tokens, or a diffusion model denoise pixels, are extended to look across frames.
The Core Problem
An image is a 2D grid of pixels: H × W × 3 (height, width, RGB). A video is a 3D grid: T × H × W × 3, where T is time (frames). For a 5-second 1080p clip at 24fps, that's 120 × 1080 × 1920 × 3 ≈ 746 million values.
You can't just do attention over 746M tokens. So video models use compressed latent spaces: encode the video into a much smaller representation, generate in that space, then decode back.
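A quick back-of-the-envelope makes the compression argument concrete. The 8× spatial downsampling, 4 latent channels and 2×2×2 patch size below are illustrative assumptions, not any particular model's settings.

```python
# Back-of-the-envelope for the numbers above. Plain arithmetic, no framework.
frames, height, width, channels = 120, 1080, 1920, 3       # 5 s at 24 fps, 1080p
raw = frames * height * width * channels
print(f"raw video tensor:  {raw:,} values")                 # 746,496,000

# Illustrative latent-space setup: a VAE that downsamples 8x spatially with
# 4 latent channels, then 2x2x2 spatio-temporal patches become one token each.
lat_t, lat_h, lat_w, lat_c = frames, height // 8, width // 8, 4
print(f"latent tensor:     {lat_t * lat_h * lat_w * lat_c:,} values")
print(f"attention tokens:  {(lat_t // 2) * (lat_h // 2) * (lat_w // 2):,}")
```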
Video Architecture Approaches
Temporal Diffusion (Sora, Runway): extend a 2D diffusion model to 3D. Add temporal attention layers between spatial attention layers. The U-Net (or DiT) denoises across space and time simultaneously. The noise schedule applies to all frames at once.
DiT (Diffusion Transformer): replace the U-Net with a Transformer operating on video patches. Chop each frame into patches, flatten across time, apply full attention. Sora uses this. The patch-based approach lets you handle variable resolutions and durations.
Autoregressive Video: generate frame-by-frame, conditioning each frame on the previous ones. Like a text Transformer but for images. Slower, but naturally handles arbitrarily long sequences.
Latent Video Diffusion: encode each frame with a VAE into a small latent, run diffusion in latent space across time, decode back. The latent compresses 8×–16× spatially. This is how most practical systems work; raw pixel diffusion is too expensive.
Spatial attention: attend across pixels within one frame
Temporal attention: attend across the same spatial position across frames
Full 3D attention: attend across everything (expensive, accurate)
Video = Spatial structure × Temporal coherence × Prompt conditioning
The biggest challenge isn't architecture; it's data and compute. Training video models requires orders of magnitude more data than images (each training example is hundreds of frames), and full 3D attention is O(T² × H² × W²) without tricks. Practical systems use factored attention: spatial-then-temporal rather than full 3D, as sketched below.
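Here is a minimal sketch of that factored spatial-then-temporal attention. The shapes and module layout are illustrative (no residuals, norms or conditioning), but the reshaping is the core trick: attend within each frame, then across frames at each spatial position.

```python
import torch
import torch.nn as nn

# Factored attention over a video latent of shape (B, T, N, C),
# where N = number of spatial patches per frame. Instead of one attention
# over all T*N tokens (quadratic in T*N), attend spatially within each
# frame, then temporally at each spatial position.
class FactoredAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                # x: (B, T, N, C)
        B, T, N, C = x.shape
        s = x.reshape(B * T, N, C)                       # each frame is its own sequence
        s, _ = self.spatial(s, s, s)
        x = s.reshape(B, T, N, C)

        t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)   # each spatial position, across time
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3) # back to (B, T, N, C)

x = torch.randn(2, 16, 256, 64)                          # 2 clips, 16 frames, 256 patches, dim 64
print(FactoredAttention(dim=64)(x).shape)                # torch.Size([2, 16, 256, 64])
```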
Audio Generative Models
Plain English
Audio is a 1D signal, amplitude over time, typically sampled at 16–48 kHz. That's 16,000–48,000 values per second; a 10-second clip is 480,000 samples at 48 kHz. Raw audio generation needs to produce coherent structure at multiple timescales: individual waveform cycles (microseconds), phonemes (milliseconds), words (hundreds of ms), sentences (seconds), and musical phrases (many seconds).
Most models don't work on raw waveforms directly. They work on compressed representations: spectrograms, learned codecs, or discrete audio tokens.
Representations
Mel spectrogram: a 2D image of frequency vs time. Convert audio to this, generate it like an image, then invert it back to audio with a vocoder (a round-trip sketch follows this list).
Neural audio codec (EnCodec, SoundStream): compress audio into discrete tokens using a learned encoder. Now audio generation becomes a language modelling problem: predict the next audio token.
Raw waveform: generate samples directly. Highest quality, highest cost. WaveNet pioneered this with autoregressive generation at 16 kHz, i.e. 16,000 predictions per second.
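As promised above, a minimal round-trip sketch of the mel-spectrogram representation using torchaudio. The parameter values are arbitrary, and Griffin-Lim stands in for the neural vocoder (HiFi-GAN and friends) a real system would use.

```python
import torch
import torchaudio

# Waveform -> mel "image" -> approximate waveform. Griffin-Lim is the classical,
# lower-quality stand-in for a neural vocoder, used here so the example is
# self-contained; the parameter values are arbitrary.
sr, n_fft, hop, n_mels = 16_000, 1024, 256, 80
wav = torch.randn(1, sr * 2)                 # stand-in for 2 s of real audio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
mel = to_mel(wav)                            # (1, n_mels, frames): the 2D "image"

inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sr)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop)
recovered = griffin_lim(inv_mel(mel))        # approximate waveform reconstruction
print(mel.shape, recovered.shape)
```

Whatever a generative model produces in this representation has to survive that final inversion step, which is why vocoder quality matters so much.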
Audio Architecture Approaches
WaveNet: autoregressive CNN. Generates one sample at a time, conditioned on all previous samples. Dilated causal convolutions give an exponentially growing receptive field (see the sketch after this list). Extremely high quality, extremely slow at inference.
Transformer TTS (Tacotron → VALL-E → Bark): text → mel spectrogram via attention, then a vocoder (HiFi-GAN, WaveGlow) converts to audio. VALL-E tokenises audio with a codec, then does language-model-style next-token prediction on audio tokens. Voice cloning from 3 seconds of reference audio.
AudioLM / MusicLM: hierarchical token generation. Coarse tokens capture structure and rhythm. Fine tokens capture acoustic detail. A Transformer generates coarse tokens first, then another generates fine tokens conditioned on them.
Audio Diffusion (Riffusion, Stable Audio): generate mel spectrograms using image diffusion. The spectrogram is literally a 2D image, frequency on one axis, time on the other. Text-conditioned diffusion produces the spectrogram, a vocoder converts to waveform.
Flow Matching (Voicebox, E2 TTS): learns a continuous flow from noise to speech. Like diffusion, but with straight-line paths in latent space instead of the forward/reverse noise schedule. Faster inference, comparable quality.
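The dilated-causal-convolution idea referenced above is easy to see in code. This is a rough sketch, not WaveNet itself: the real model adds gated activations, residual and skip connections, and a categorical output over quantised samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stacked dilated causal convolutions. Each layer doubles the dilation, so the
# receptive field grows exponentially with depth while the layer count grows
# only linearly.
class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, layers=10, kernel_size=2):
        super().__init__()
        self.pads = [(kernel_size - 1) * 2 ** i for i in range(layers)]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(layers))
        self.receptive_field = sum(self.pads) + 1       # 1024 samples for 10 layers

    def forward(self, x):                                # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))     # left-pad: never see the future
        return x

print(DilatedCausalStack().receptive_field)              # 1024
```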
Speech = text → phonemes → mel spectrogram → waveform
Music = prompt → coarse audio tokens → fine audio tokens → waveform
Sound effects = text → spectrogram via diffusion → waveform
Every approach eventually needs a decoder back to raw samples.
The trend across all audio models is the same as vision: compress into a latent space, generate there, decode back. The codec-based approach (turning audio into tokens) is winning because it lets you reuse all the Transformer infrastructure that already works for text.
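A minimal sketch of what "reuse the Transformer infrastructure" means once a codec has turned audio into discrete ids. The vocabulary size, model size and the random stand-in tokens are illustrative placeholders, not any real codec's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Once a neural codec has mapped audio to discrete ids in [0, codec_vocab),
# generation is plain next-token prediction. Sizes below are illustrative.
codec_vocab, dim, seq_len = 1024, 256, 128

embed = nn.Embedding(codec_vocab, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
to_logits = nn.Linear(dim, codec_vocab)

tokens = torch.randint(0, codec_vocab, (1, seq_len))     # stand-in for codec-encoded audio
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)   # causal mask

h = backbone(embed(tokens), mask=mask)                    # decoder-style causal Transformer
logits = to_logits(h)                                     # (1, seq_len, codec_vocab)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, codec_vocab),
                       tokens[:, 1:].reshape(-1))
print(loss.item())
```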
The Unifying Pattern
Text (Transformers): tokens are words or subword pieces. Predict the next token autoregressively.
Images (Diffusion): tokens are pixel patches. Denoise all patches simultaneously.
Video (Temporal Diffusion/DiT): tokens are spatiotemporal patches. Denoise across space and time.
Audio (Codec + Transformer): tokens are audio codec frames. Predict the next audio token.
GANs: no tokens. Direct mapping from noise → output. Generator and discriminator in adversarial equilibrium.
Every modality converges on the same two ideas: compress the signal into tokens or latents, then either predict the next one (autoregressive) or denoise them all at once (diffusion). The architecture differences are engineering; the underlying mechanism is always constrained collapse of a possibility space.
GAN Collapse
Single-shot. Noise → output in one pass. The adversarial training was the collapse: it happened during training, not at inference. The generator learned the collapsed distribution.
Video Collapse
3D collapse. The possibility space includes all possible videos. The prompt + temporal coherence constrains it. Denoising collapses space and time simultaneously.
Audio Collapse
Hierarchical collapse. Coarse structure collapses first (rhythm, melody). Fine detail collapses second (timbre, articulation). Like zooming in from macro to micro.
The reason every modality is converging on Transformers + diffusion isn't because those are the "right" architectures. It's because they're the most general-purpose collapse engines. GANs are faster but harder to train and less flexible. Autoregressive models are simpler but sequential. Diffusion is parallel but slow. The field is still searching for the architecture that is fast, parallel, stable, and general.