D-U-C-K. Four letters. One syllable. A vibration pattern in air. A sequence of Unicode code points. A token ID (probably somewhere around 37812, depending on the tokeniser). None of these have any physical relationship to the actual animal. The connection is pure convention.
The Symbol Problem
Words are arbitrary. In French it's "canard." In Japanese it's "アヒル." In Mandarin it's "鸭." Different sounds, different letters, same bird. The symbol has no intrinsic connection to the thing it represents. This is Saussure's insight: the signifier is arbitrary relative to the signified.
And yet symbols are the most powerful representational channel we have. You just read "duck" and your brain activated visual, behavioural, and conceptual representations simultaneously. Four letters did what evolution took millions of years to wire.
How Machines Process Symbols
"duck" โ tokeniser โ [37812] โ embedding lookup โ [0.12, -0.45, 0.83, ...]
The token ID is arbitrary (just an index).
The embedding vector is learned, placed by training.
The vector's position encodes meaning.
The tokeniser doesn't understand "duck." It's a lookup table. But the embedding vector, the 768 or 4096 numbers that represent "duck" inside the model, does encode something. It was placed in a specific region of vector space by billions of training examples that showed the model what "duck" means by showing it what "duck" does in sentences.
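A minimal sketch of that pipeline, with an invented three-word vocabulary, made-up token IDs, and random 8-dimensional vectors standing in for a trained embedding table (real vocabularies run to tens of thousands of entries, real embeddings to 768 or 4096 dimensions):

```python
import numpy as np

# Toy vocabulary: the words and IDs here are invented for illustration.
vocab = {"duck": 37812, "swim": 11023, "quack": 48991}

# Toy embedding table: random vectors stand in for learned ones.
rng = np.random.default_rng(0)
embedding_table = {token_id: rng.standard_normal(8) for token_id in vocab.values()}

def embed(word: str) -> np.ndarray:
    token_id = vocab[word]            # arbitrary: symbol -> index
    return embedding_table[token_id]  # learned: index -> position in vector space

print(embed("duck"))  # the *position* of this vector is what training shapes
```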
Word2Vec: The Original Insight
The 2013 result that changed everything: if you train a model to predict words from their context (or context from words), the learned vectors capture semantic relationships as geometric relationships:
king − man + woman ≈ queen
duck − bird + fish ≈ ??? (probably something aquatic)
paris − france + japan ≈ tokyo
The symbols have no inherent structure. But the vectors learned from symbol co-occurrence have rich structure. The geometry of the space reflects the geometry of the concepts. Analogy is vector arithmetic.
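The arithmetic is literal. A toy illustration with hand-made four-dimensional vectors (not real word2vec output; with trained embeddings the same nearest-neighbour search is what recovers "queen"):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-made 4-d vectors; dimensions loosely mean "royalty, male, female, person".
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.8]),
    "woman": np.array([0.1, 0.1, 0.9, 0.8]),
}

query = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], query))
print(best)  # "queen" -- the analogy falls out of the geometry
```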
The Grounding Problem
A language model has a rich representation of "duck." It knows ducks swim, quack, have feathers, are birds, are eaten, appear in parks. But it has never seen, heard, touched, or smelled a duck. Its "duck" is grounded in other symbols, not in sensory experience.
The Chinese Room
Searle's thought experiment: a person in a room follows rules to manipulate Chinese symbols perfectly, producing correct outputs, but doesn't understand Chinese. Symbols can be processed without grounding. Is an LLM the room? It manipulates "duck" perfectly in context. Does it understand ducks?
The Counter-Argument
Maybe grounding in other symbols is grounding. Your concept of "quark" is grounded in textbooks, not sensory experience. Your concept of "justice" has no physical referent at all. Many of your concepts are symbol-to-symbol mappings. The LLM's "duck" might be more like your "quark" than your "red."
Polysemy: When One Symbol Is Many Concepts
"Duck" the bird. "Duck" the verb (to dodge). "Duck" the fabric. "Duck" the cricket stroke. "Duck" the term of endearment ("ducks"). "Lame duck." "Sitting duck." "Rubber duck debugging."
One symbol. At least eight concepts. How does a Transformer handle this?
Contextual embeddings. The initial token embedding for "duck" is the same regardless of context. But after passing through attention layers, the representation changes. In "the duck swam," the vector shifts toward animal-space. In "duck!" it shifts toward action-space. In "sitting duck," it shifts toward vulnerability/idiom-space. The same symbol gets different representations depending on what's around it. Polysemy is resolved by attention.
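A sketch of how you could watch this happen, assuming the Hugging Face transformers library and bert-base-uncased (any contextual model would do, and the exact similarity numbers will vary; the pattern to expect is that the two bird uses sit closer to each other than to the verb use):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def duck_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("duck")]                        # "duck" in this context

swim  = duck_vector("the duck swam across the pond")
quack = duck_vector("the duck quacked at the bread")
dodge = duck_vector("duck before the ball hits you")

cos = torch.nn.functional.cosine_similarity
print(cos(swim, quack, dim=0))  # expected: higher (both are the bird)
print(cos(swim, dodge, dim=0))  # expected: lower (bird vs. the verb)
```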
Where Symbols Break
Tokenisation artefacts: "duck" is one token. "ducklings" might be two: "duck" + "lings." "Muscovy duck" is probably three tokens. The boundary between symbols doesn't align with the boundary between concepts.
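A quick way to see the mismatch, assuming the Hugging Face transformers library and the GPT-2 tokeniser (the exact splits differ between tokenisers, which is rather the point):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["duck", "ducklings", "Muscovy duck"]:
    # tokenize() shows the subword pieces; one concept may span several symbols.
    print(text, "->", tokenizer.tokenize(text))
```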
Out-of-vocabulary: a new word has no embedding. The model must infer meaning from subword tokens. It handles "duckish" by combining "duck" + "ish", but it has never seen the word, so the representation is approximate.
Translation gaps: some concepts don't have one-to-one symbol mappings across languages. The Japanese distinction between wild ducks (鴨, kamo) and domestic ducks (アヒル, ahiru) collapses to one English word.
Deceptive fluency: because symbols can be manipulated without grounding, a model can produce grammatically perfect, contextually appropriate text about ducks that contains fabricated facts. The symbol-level output is flawless. The reference-level output is wrong.
The symbol is the most efficient channel: four letters carry the weight of an entire concept. But efficiency comes from compression, and compression is lossy. The word "duck" activates a representation, but the representation is built from co-occurrence, not from reality. The map is not the territory. The word is not the duck.