You can recognise a duck from a silhouette. From a sketch. From a photo at an angle you've never seen before. From a cartoon that shares almost no pixel-level similarity with a real duck. Your visual system doesn't match pixels; it matches structure.
What Is Structure?
Structure is the spatial relationship between features. A duck has a round head, a flat bill extending forward, a roughly oval body, short legs set far back. These relationships hold across poses, lighting, colours, and art styles. Structure is invariant: it survives transformation.
This is exactly what convolutional neural networks learn. Not pixels. Not colours. Spatial feature hierarchies.
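To make that concrete, here is a minimal sketch of the operation a convolutional layer applies. Everything in it is illustrative (a hand-written convolution and a hand-picked edge kernel, in plain NumPy), but it shows the key property: the same kernel detects the same feature wherever it appears.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution: the core operation a CNN layer applies."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector: responds wherever intensity jumps left-to-right.
edge_kernel = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])

# Two images with the same edge at different positions.
img_a = np.zeros((5, 5)); img_a[:, 2:] = 1.0   # edge at column 2
img_b = np.zeros((5, 5)); img_b[:, 3:] = 1.0   # edge at column 3

# The same kernel fires in both cases; only the location of the response
# shifts. That positional equivariance is why the feature survives
# translation.
print(conv2d(img_a, edge_kernel))
print(conv2d(img_b, edge_kernel))
```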
The Feature Ladder
Layer 1: edges. Horizontal, vertical, diagonal. Every image, every object.
Layer 2: textures. Feather patterns, water ripples, smooth surfaces.
Layer 3: parts. Bill shapes, eye patterns, wing curves.
Layer 4: objects. "This combination of parts, in this spatial arrangement, is a duck."
Layer 5: scenes. "A duck on water." "Ducks in a park." Context.
Each layer compresses: millions of pixels → thousands of edges → hundreds of textures → dozens of parts → a handful of objects. Recognition is compression with structure preservation.
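Here is a toy version of that ladder, assuming PyTorch. The five stages, channel widths, and strides are illustrative choices, not any particular architecture:

```python
import torch
import torch.nn as nn

# A toy five-stage ladder: each stage halves spatial resolution while
# widening the channel dimension -- trading pixels for features.
ladder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),     # edges
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),    # textures
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),    # parts
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),   # objects
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),  # scene context
)

x = torch.randn(1, 3, 224, 224)   # one RGB image
for stage in ladder:
    x = stage(x)
    if isinstance(stage, nn.Conv2d):
        print(tuple(x.shape))     # watch the spatial grid shrink
# (1, 16, 112, 112) -> ... -> (1, 256, 7, 7): roughly 150k input values in,
# a coarse 7x7 grid of high-level features out.
```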
Embeddings: Duckness as a Point in Space
After the feature ladder, a model doesn't output "duck" directly. It produces a vector: a point in a high-dimensional space. A 512-dimensional embedding. And in that space, something remarkable happens:
Mallards cluster together.
Rubber ducks are nearby but offset: structurally similar, different texture.
Geese are close: similar body plan, different proportions.
Swans are further: the elongated neck shifts the structural signature.
Penguins are distant: different posture, proportions, everything.
Cars are in a completely different region.
The distance between points is structural similarity. "Looks like a duck" has a precise mathematical meaning: close to the duck region in embedding space.
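A sketch of what "close to the duck region" means in code. The vectors below are synthetic stand-ins, built by perturbing one vector by different amounts; in practice you would take embeddings from a trained encoder. The point is only that cosine similarity falls off as structure diverges:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angular closeness in embedding space: 1.0 means same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
duck = rng.normal(size=512)

# Hypothetical neighbours: perturbations of the duck vector standing in
# for what a real encoder would produce.
rubber_duck = duck + 0.3 * rng.normal(size=512)   # near: shared structure
goose       = duck + 0.8 * rng.normal(size=512)   # close-ish
car         = rng.normal(size=512)                # unrelated region

for name, vec in [("rubber duck", rubber_duck), ("goose", goose), ("car", car)]:
    print(f"{name:12s} {cosine_similarity(duck, vec):.2f}")
# Similarity falls off with structural distance; the car lands near 0.
```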
How Diffusion Models Use Structure
A diffusion model generates images by denoising. At each step, the U-Net asks: "given this partially-noisy image and the prompt 'duck,' what noise should I remove?" The answer depends entirely on structural priors: the model has learned what spatial arrangements of features are consistent with "duck."
Early denoising steps establish global structure: rough shape, pose, composition. Late steps add local detail: feather texture, eye highlights, water reflections. The model generates structure top-down, from the abstract to the specific.
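A minimal DDPM-style sampling loop shows the shape of that process. The noise schedule here is an assumed textbook linear schedule, and `noise_predictor` is a placeholder for the real conditioned U-Net, so this generates nothing meaningful; it only makes the denoising loop concrete:

```python
import numpy as np

def noise_predictor(x, t, prompt):
    """Stand-in for the conditioned U-Net: predicts the noise present in x
    at step t. A real model is a trained network; this placeholder just
    lets the sampling loop below run."""
    return 0.1 * x

T = 50
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x = np.random.normal(size=(64, 64, 3))   # start from pure noise
for t in reversed(range(T)):
    eps = noise_predictor(x, t, prompt="duck")
    # Remove the predicted noise, rescale toward the data distribution.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        x += np.sqrt(betas[t]) * np.random.normal(size=x.shape)
# Early iterations (large t) move global structure; late ones (small t)
# settle local detail -- the top-down generation described above.
```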
What It Knows
Ducks have bills (not beaks like songbirds).
They float (waterline cuts the body roughly in half).
Mallard males have green heads.
Ducklings are yellow and fluffy.
They appear in ponds, parks, rivers.
What It Doesn't Know
What "wet" feels like.
That ducks migrate.
That the quack is produced by a syrinx.
That "duck" is also a verb.
Anything that isn't spatial structure.
Where Structure Breaks
Adversarial examples: change a few pixels imperceptibly and the model sees "truck" instead of "duck" (sketched in code after this list). The feature space has regions where small perturbations cross decision boundaries that don't align with human perception.
Decoys: a wooden duck decoy has perfect duck structure. A vision model correctly identifies it as a duck. Is that a success or a failure? It looks like a duck. It isn't one.
Unusual poses: a duck diving underwater, upside down, mid-flight from behind. The structural template doesn't match. Recognition degrades when structure breaks expected patterns.
Cross-domain transfer: a duck emoji, a duck hieroglyph, the letter "b" that kinda looks like a duck. Structure recognition generalises further than you'd expect, and sometimes further than you'd want.
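The adversarial failure mode has a famously simple recipe: the fast gradient sign method. Below is a toy version on a linear classifier with synthetic weights and a synthetic input; a real attack backpropagates through a full network, but the high-dimensional geometry is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                               # a 64x64 grey image, flattened
W = rng.normal(size=(2, d))            # class 0 = duck, class 1 = truck
x = W[0] / np.linalg.norm(W[0])        # an input the model calls "duck"

# Gradient of (truck logit - duck logit) w.r.t. the input: for a linear
# model this is just the weight difference.
grad = W[1] - W[0]

eps = 0.02                             # tiny per-pixel budget
x_adv = x + eps * np.sign(grad)        # one sign-aligned step

print("before:", (W @ x).argmax())     # 0 -> duck
print("after: ", (W @ x_adv).argmax()) # 1 -> truck
# Why it works: the clean margin grows like sqrt(d), but a sign-aligned
# step moves the logit gap by roughly eps * d -- so in high dimensions a
# per-pixel change of a few percent is enough to cross the boundary.
```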
Structure is the most intuitive channel. It's what we mean when we say "I know it when I see it." But seeing is just pattern matching on spatial features. The duck you see is a compressed representation: a lossy projection of reality through learned feature detectors. What you see isn't the duck. It's your model of the duck.