Generative Adversarial Networks (GANs)
Plain English
Two neural networks play a game. The Generator tries to create fake data that looks real. The Discriminator tries to spot the fakes. They train against each other, the generator getting better at faking and the discriminator getting better at detecting, until the fakes are indistinguishable from real data.
Think of it as a forger vs an art inspector. The forger keeps improving until the inspector can't tell the difference. At that point, the forger is a generative model.
The Maths
min_G max_D V(D, G) = 𝔼_{x~p_data}[log D(x)] + 𝔼_{z~p_z}[log(1 − D(G(z)))]
D(x): the discriminator's confidence that x is real (0–1)
G(z): the generator's output from random noise z
The game: D wants to maximise V (correctly classify real vs fake); G wants to minimise it (fool D). At the Nash equilibrium, D outputs 0.5 for everything because it can no longer tell real from fake.
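To make the objective concrete, here is a minimal PyTorch-style sketch of the alternating updates. The names (G, D, opt_G, opt_D, real, z_dim) are placeholders for whatever generator, discriminator, optimisers and data you have; the non-saturating generator loss is the standard practical substitute for the log(1 − D(G(z))) term.

```python
import torch

# Minimal sketch of one alternating GAN update. G, D, opt_G, opt_D and `real`
# are placeholders; D is assumed to output probabilities in (0, 1).
def gan_step(G, D, opt_G, opt_D, real, z_dim=100):
    batch = real.size(0)

    # Discriminator: maximise log D(x) + log(1 - D(G(z)))  (minimise the negative)
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                      # don't backprop into G on this step
    loss_D = -(torch.log(D(real) + 1e-8).mean() +
               torch.log(1 - D(fake) + 1e-8).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: the "non-saturating" loss -log D(G(z)) is used in practice
    # instead of log(1 - D(G(z))), which gives vanishing gradients early on.
    z = torch.randn(batch, z_dim)
    loss_G = -torch.log(D(G(z)) + 1e-8).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```

In practice the two networks take these steps in alternation, and keeping them balanced is exactly what makes GAN training notoriously touchy.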
GAN Architecture Variants
DCGAN: convolutional generator/discriminator. Replaced fully-connected layers with conv layers. Made training stable enough to actually work.
StyleGAN: injects style at each resolution layer. Produces photorealistic faces. Introduced the "mapping network" that converts z to a style space w before generation.
CycleGAN: unpaired image-to-image translation (horses ↔ zebras). Two generators, two discriminators, cycle-consistency loss.
Pix2Pix: paired image-to-image (sketch → photo). Conditional GAN with L1 reconstruction loss.
WGAN: uses Wasserstein distance instead of JS divergence. Smoother gradients, more stable training. Greatly reduces "mode collapse", where the generator only produces a few distinct outputs.
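For contrast with the standard loss above, here is a hedged sketch of the WGAN critic update. It assumes a critic C whose output is an unbounded score rather than a probability; the clipping constant is the one from the original paper, and WGAN-GP replaces clipping with a gradient penalty.

```python
import torch

# Sketch of the WGAN critic update. C is a critic with an unbounded scalar
# output (no sigmoid); the loss approximates the Wasserstein distance between
# real and generated distributions.
def critic_step(C, G, opt_C, real, z_dim=100, clip=0.01):
    fake = G(torch.randn(real.size(0), z_dim)).detach()
    loss_C = -(C(real).mean() - C(fake).mean())   # maximise C(real) - C(fake)
    opt_C.zero_grad(); loss_C.backward(); opt_C.step()
    # Crude Lipschitz constraint via weight clipping; WGAN-GP swaps this
    # for a gradient penalty, which works better in practice.
    for p in C.parameters():
        p.data.clamp_(-clip, clip)
    return loss_C.item()
```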
GANs dominated image generation from 2014 to 2021. Diffusion models largely replaced them for quality, but GANs remain faster at inference: a single forward pass versus 20–50 denoising steps. Some modern systems combine both: use diffusion for quality, then distil into a GAN for speed.
Video Generative Models
Plain English
Video is images + time. The fundamental challenge is temporal coherence: making sure frame 47 looks like it belongs after frame 46. A video model doesn't generate frames independently; it has to understand motion, physics, and continuity.
Most video models extend image architectures into the time dimension. The same attention mechanisms that let a Transformer look across tokens, or a diffusion model denoise pixels, are extended to look across frames.
The Core Problem
An image is a 2D grid of pixels: H × W × 3 (height, width, RGB). A video is a 3D grid: T × H × W × 3, where T is time (frames). For a 5-second 1080p clip at 24fps, that's 120 × 1080 × 1920 × 3 ≈ 746 million values.
You can't just do attention over 746M tokens. So video models use compressed latent spaces: encode the video into a much smaller representation, generate in that space, then decode back.
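A quick back-of-the-envelope makes the compression argument concrete. The 8× spatial downsampling, 4 latent channels and 2×2×2 patch size below are illustrative assumptions, not any particular model's settings.

```python
# Back-of-the-envelope for the numbers above. Plain arithmetic, no framework.
frames, height, width, channels = 120, 1080, 1920, 3       # 5 s at 24 fps, 1080p
raw = frames * height * width * channels
print(f"raw video tensor:  {raw:,} values")                 # 746,496,000

# Illustrative latent-space setup: a VAE that downsamples 8x spatially with
# 4 latent channels, then 2x2x2 spatio-temporal patches become one token each.
lat_t, lat_h, lat_w, lat_c = frames, height // 8, width // 8, 4
print(f"latent tensor:     {lat_t * lat_h * lat_w * lat_c:,} values")
print(f"attention tokens:  {(lat_t // 2) * (lat_h // 2) * (lat_w // 2):,}")
```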
Video Architecture Approaches
Temporal Diffusion (Sora, Runway): extend a 2D diffusion model to 3D. Add temporal attention layers between spatial attention layers. The U-Net (or DiT) denoises across space and time simultaneously. The noise schedule applies to all frames at once.
DiT (Diffusion Transformer): replace the U-Net with a Transformer operating on video patches. Chop each frame into patches, flatten across time, apply full attention. Sora uses this. The patch-based approach lets you handle variable resolutions and durations.
Autoregressive Video: generate frame-by-frame, conditioning each frame on the previous ones. Like a text Transformer but for images. Slower, but naturally handles arbitrarily long sequences.
Latent Video Diffusion: encode each frame with a VAE into a small latent, run diffusion in latent space across time, decode back. The latent compresses 8×–16× spatially. This is how most practical systems work; raw pixel diffusion is too expensive.
Spatial attention: attend across pixels within one frame
Temporal attention: attend across the same spatial position across frames
Full 3D attention: attend across everything (expensive, accurate)
Video = Spatial structure × Temporal coherence × Prompt conditioning
The biggest challenge isn't architecture; it's data and compute. Training video models requires orders of magnitude more data than images (each training example is hundreds of frames), and full 3D attention is O(T² × H² × W²) without tricks. Practical systems use factored attention: spatial-then-temporal rather than full 3D, as sketched below.
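Here is a minimal sketch of that factored spatial-then-temporal attention. The shapes and module layout are illustrative (no residuals, norms or conditioning), but the reshaping is the core trick: attend within each frame, then across frames at each spatial position.

```python
import torch
import torch.nn as nn

# Factored attention over a video latent of shape (B, T, N, C),
# where N = number of spatial patches per frame. Instead of one attention
# over all T*N tokens (quadratic in T*N), attend spatially within each
# frame, then temporally at each spatial position.
class FactoredAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                # x: (B, T, N, C)
        B, T, N, C = x.shape
        s = x.reshape(B * T, N, C)                       # each frame is its own sequence
        s, _ = self.spatial(s, s, s)
        x = s.reshape(B, T, N, C)

        t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)   # each spatial position, across time
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, N, T, C).permute(0, 2, 1, 3) # back to (B, T, N, C)

x = torch.randn(2, 16, 256, 64)                          # 2 clips, 16 frames, 256 patches, dim 64
print(FactoredAttention(dim=64)(x).shape)                # torch.Size([2, 16, 256, 64])
```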
Audio Generative Models
Plain English
Audio is a 1D signal, amplitude over time, typically sampled at 16–48 kHz. That's 16,000–48,000 values per second; a 10-second clip is 480,000 samples at 48 kHz. Raw audio generation needs to produce coherent structure at multiple timescales: individual waveform cycles (microseconds), phonemes (milliseconds), words (hundreds of ms), sentences (seconds), and musical phrases (many seconds).
Most models don't work on raw waveforms directly. They work on compressed representations: spectrograms, learned codecs, or discrete audio tokens.
Representations
Mel spectrogram: a 2D image of frequency vs time. Convert audio to this, generate it like an image, then invert it back to audio with a vocoder (a round-trip sketch follows this list).
Neural audio codec (EnCodec, SoundStream): compress audio into discrete tokens using a learned encoder. Now audio generation becomes a language modelling problem: predict the next audio token.
Raw waveform: generate samples directly. Highest quality, highest cost. WaveNet pioneered this with autoregressive generation at 16 kHz, i.e. 16,000 predictions per second.
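As promised above, a minimal round-trip sketch of the mel-spectrogram representation using torchaudio. The parameter values are arbitrary, and Griffin-Lim stands in for the neural vocoder (HiFi-GAN and friends) a real system would use.

```python
import torch
import torchaudio

# Waveform -> mel "image" -> approximate waveform. Griffin-Lim is the classical,
# lower-quality stand-in for a neural vocoder, used here so the example is
# self-contained; the parameter values are arbitrary.
sr, n_fft, hop, n_mels = 16_000, 1024, 256, 80
wav = torch.randn(1, sr * 2)                 # stand-in for 2 s of real audio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
mel = to_mel(wav)                            # (1, n_mels, frames): the 2D "image"

inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sr)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop)
recovered = griffin_lim(inv_mel(mel))        # approximate waveform reconstruction
print(mel.shape, recovered.shape)
```

Whatever a generative model produces in this representation has to survive that final inversion step, which is why vocoder quality matters so much.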
Audio Architecture Approaches
WaveNet: autoregressive CNN. Generates one sample at a time, conditioned on all previous samples. Dilated causal convolutions give an exponentially growing receptive field (see the sketch after this list). Extremely high quality, extremely slow at inference.
Transformer TTS (Tacotron → VALL-E → Bark): text → mel spectrogram via attention, then a vocoder (HiFi-GAN, WaveGlow) converts to audio. VALL-E tokenises audio with a codec, then does language-model-style next-token prediction on audio tokens. Voice cloning from 3 seconds of reference audio.
AudioLM / MusicLM: hierarchical token generation. Coarse tokens capture structure and rhythm. Fine tokens capture acoustic detail. A Transformer generates coarse tokens first, then another generates fine tokens conditioned on them.
Audio Diffusion (Riffusion, Stable Audio): generate mel spectrograms using image diffusion. The spectrogram is literally a 2D image, frequency on one axis, time on the other. Text-conditioned diffusion produces the spectrogram, a vocoder converts to waveform.
Flow Matching (Voicebox, E2 TTS): learns a continuous flow from noise to speech. Like diffusion, but with straight-line paths in latent space instead of the forward/reverse noise schedule. Faster inference, comparable quality.
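The dilated-causal-convolution idea referenced above is easy to see in code. This is a rough sketch, not WaveNet itself: the real model adds gated activations, residual and skip connections, and a categorical output over quantised samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stacked dilated causal convolutions. Each layer doubles the dilation, so the
# receptive field grows exponentially with depth while the layer count grows
# only linearly.
class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, layers=10, kernel_size=2):
        super().__init__()
        self.pads = [(kernel_size - 1) * 2 ** i for i in range(layers)]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(layers))
        self.receptive_field = sum(self.pads) + 1       # 1024 samples for 10 layers

    def forward(self, x):                                # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))     # left-pad: never see the future
        return x

print(DilatedCausalStack().receptive_field)              # 1024
```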
Speech = text → phonemes → mel spectrogram → waveform
Music = prompt → coarse audio tokens → fine audio tokens → waveform
Sound effects = text → spectrogram via diffusion → waveform
Every approach eventually needs a decoder back to raw samples.
The trend across all audio models is the same as vision: compress into a latent space, generate there, decode back. The codec-based approach (turning audio into tokens) is winning because it lets you reuse all the Transformer infrastructure that already works for text.
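A minimal sketch of what "reuse the Transformer infrastructure" means once a codec has turned audio into discrete ids. The vocabulary size, model size and the random stand-in tokens are illustrative placeholders, not any real codec's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Once a neural codec has mapped audio to discrete ids in [0, codec_vocab),
# generation is plain next-token prediction. Sizes below are illustrative.
codec_vocab, dim, seq_len = 1024, 256, 128

embed = nn.Embedding(codec_vocab, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)
to_logits = nn.Linear(dim, codec_vocab)

tokens = torch.randint(0, codec_vocab, (1, seq_len))     # stand-in for codec-encoded audio
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)   # causal mask

h = backbone(embed(tokens), mask=mask)                    # decoder-style causal Transformer
logits = to_logits(h)                                     # (1, seq_len, codec_vocab)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, codec_vocab),
                       tokens[:, 1:].reshape(-1))
print(loss.item())
```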
The Unifying Pattern
Text (Transformers): tokens are words or subword pieces. Predict the next token autoregressively.
Images (Diffusion): tokens are pixel patches. Denoise all patches simultaneously.
Video (Temporal Diffusion/DiT): tokens are spatiotemporal patches. Denoise across space and time.
Audio (Codec + Transformer): tokens are audio codec frames. Predict the next audio token.
GANs: no tokens. Direct mapping from noise → output. Generator and discriminator in adversarial equilibrium.
Every modality converges on the same two ideas: compress the signal into tokens or latents, then either predict the next one (autoregressive) or denoise them all at once (diffusion). The architecture differences are engineering; the underlying mechanism is always constrained collapse of a possibility space.
GAN Collapse
Single-shot. Noise → output in one pass. The adversarial training was the collapse: it happened during training, not at inference. The generator learned the collapsed distribution.
Video Collapse
3D collapse. The possibility space includes all possible videos. The prompt + temporal coherence constrains it. Denoising collapses space and time simultaneously.
Audio Collapse
Hierarchical collapse. Coarse structure collapses first (rhythm, melody). Fine detail collapses second (timbre, articulation). Like zooming in from macro to micro.
The reason every modality is converging on Transformers + diffusion isn't because those are the "right" architectures. It's because they're the most general-purpose collapse engines. GANs are faster but harder to train and less flexible. Autoregressive models are simpler but sequential. Diffusion is parallel but slow. The field is still searching for the architecture that is fast, parallel, stable, and general.