You see something in a pond. It has feathers, a flat bill, webbed feet. It floats. It quacks. Someone next to you says "duck."
Three things just happened. Three completely different ways of recognising what that thing is:
Looks Like a Duck
Structural recognition. You matched visual features (shape, colour, texture, proportions) against a learned template. You didn't need it to move or make a sound. The form was enough.
Quacks Like a Duck
Behavioural recognition. You heard a sound and matched it to an expected behaviour. You didn't need to see it. The action was enough. This is duck typing: identity through interface.
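In programming terms, duck typing means exactly this: never ask what an object is, only whether it responds to the right calls. Here is a minimal Python sketch; the Duck and RobotDuck classes are invented for illustration.

```python
class Duck:
    def quack(self):
        return "quack"

    def swim(self):
        return "paddling"


class RobotDuck:
    # No shared base class with Duck; it just happens to expose the same methods.
    def quack(self):
        return "quack (synthesised)"

    def swim(self):
        return "motoring"


def observe(thing):
    # Duck typing: no isinstance check, no declared interface.
    # If it quacks and swims, it gets treated as a duck.
    return f"{thing.quack()} while {thing.swim()}"


print(observe(Duck()))       # accepted
print(observe(RobotDuck()))  # also accepted, purely on behaviour
```

Nothing about the object's ancestry or name matters here; the interface is the identity.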
The Word "Duck"
Symbolic recognition. Someone used a token (a sound, a word, four letters) that maps to the concept. You didn't need to see it or hear it. The symbol was enough.
Three Channels, One Concept
These are three entirely different computational processes. Structure operates on spatial patterns. Behaviour operates on temporal patterns. Symbol operates on agreed conventions. They converge on the same concept from completely different directions.
Every system that "knows" things (biological, computational, artificial) uses some combination of these three. And the failures are different for each:
Structure fails when things look alike but aren't: a decoy duck, a photo of a duck, a cloud shaped like a duck.
Behaviour fails when things act alike but aren't: a parrot quacking, a speaker playing duck sounds, a rubber duck squeaking (see the sketch after this list).
Symbol fails when the mapping breaks: "duck" means a waterbird, a score of zero in cricket, dodging something, a type of fabric, and a term of endearment. The word is overloaded. The concept is not.
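The behavioural failure is easy to reproduce in code. In this hypothetical sketch, a Loudspeaker that only plays a recorded duck call passes the behavioural check, while a symbolic-style check (asking whether the thing is registered under the agreed name) rejects it:

```python
class Duck:
    def quack(self):
        return "quack"


class Loudspeaker:
    """Plays a recorded duck call; not a duck."""

    def quack(self):
        return "quack (recording)"


def behavioural_check(thing):
    # Behavioural channel: does it act like a duck?
    return callable(getattr(thing, "quack", None))


def symbolic_check(thing):
    # Symbolic channel (roughly): is it declared under the name we agreed on?
    return isinstance(thing, Duck)


speaker = Loudspeaker()
print(behavioural_check(speaker))  # True: it acts like a duck, so this channel is fooled
print(symbolic_check(speaker))     # False: it was never called a duck
```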
Why This Matters for AI
Every AI architecture is fundamentally a bet on which channel matters most:
Vision models (CNNs, ViTs, diffusion models) bet on structure. They learn spatial feature hierarchies. They know what ducks look like.
Language models (Transformers) bet on behaviour and co-occurrence. They learn what words appear near "duck." They know what ducks do, what's said about them, what contexts they appear in.
Multimodal models (CLIP, GPT-4V) bridge the channels. They learn that the image of a duck, the word "duck," and descriptions of duck behaviour all point to the same region in a shared representation space (a toy sketch of that space follows this list).
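A toy sketch of what "the same region in a shared representation space" means. The vectors below are invented for illustration; in a real model like CLIP they would come from an image encoder and a text encoder trained so that matching pairs land close together.

```python
import numpy as np


def cosine(a, b):
    # Cosine similarity: how closely two embeddings point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-made 4-d embeddings standing in for real encoder outputs.
image_of_duck  = np.array([0.9, 0.1, 0.8, 0.0])
word_duck      = np.array([0.8, 0.2, 0.9, 0.1])
caption_quacks = np.array([0.7, 0.3, 0.8, 0.2])  # "a bird that quacks and swims"
word_tractor   = np.array([0.1, 0.9, 0.0, 0.8])

print(cosine(image_of_duck, word_duck))       # high: same region of the space
print(cosine(image_of_duck, caption_quacks))  # high: the behaviour description converges too
print(cosine(image_of_duck, word_tractor))    # low: a different concept
```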
This series explores each channel: how it works, how it's computed, where it breaks, and what happens when you combine them.
The Deep Question
Does any system, biological or artificial, actually know what a duck is? Or do they all just have different approximations of duckness, built from whichever channel they have access to?
Structure says: it has the shape of a duck.
Behaviour says: it acts like a duck.
Symbol says: we agreed to call it a duck.
None of them say what a duck is. They say what a duck is like. That might be all that knowing ever was.
The concept of "duck" is a wavefunction. Structure, behaviour, and symbol are three measurement bases. Each collapses the concept differently. None captures it completely.