โ† ducks
PART 6 OF 7

All the Ducks at Once

Multimodal convergence: what happens when you combine every channel.

Structure. Behaviour. Symbol. Three channels. Three different ways to represent a concept. What happens when you combine them all?

CLIP: Bridging Vision and Language

In 2021, OpenAI released CLIP (Contrastive Language-Image Pretraining). The idea was simple and profound: train a model on 400 million image-text pairs so that images and their captions end up near each other in the same vector space.

Image of a duck → image encoder → vector [0.2, -0.1, 0.8, ...]
"A duck swimming" → text encoder → vector [0.19, -0.12, 0.79, ...]

cosine_similarity(image_vec, text_vec) ≈ 0.95

The image and the caption are nearby in the same space.

Structure and symbol now share a common geometry. "Looks like a duck" and "the word duck" converge to the same region. You can search images with text. You can search text with images. The channels are aligned.
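
You can check this alignment directly. Here's a minimal sketch using the Hugging Face transformers port of CLIP; the checkpoint is the public ViT-B/32 release, and the image file is a placeholder.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("duck.jpg")  # placeholder: any photo of a duck
inputs = processor(text=["a duck swimming", "a cat sleeping"],
                   images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# Normalise so that a dot product is cosine similarity.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(img @ txt.T)  # the duck caption should score well above the cat caption

(In practice raw CLIP cosine scores come out well below the illustrative 0.95 above; what matters is the gap between the matching and non-matching caption.)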

The Shared Embedding Space

What Gets Close Together

A photo of a mallard + "mallard duck" + "a green-headed bird on a pond"

A rubber duck photo + "rubber duck" + "yellow bath toy"

All of these cluster in the same neighbourhood, because the training data contained all these associations.

What Gets Separated

"Duck" (bird) vs "duck" (verb) โ€” the text encoder uses context to separate these. "Duck!" gives a different vector than "the duck."

A photo of a duck vs a painting of a duck: nearby but offset. The model learns artistic style as a dimension.
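
You can measure the context effect with the same CLIP checkpoint as above; the phrases are illustrative.

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

phrases = ["Duck!", "the duck on the pond", "get down, take cover"]
tokens = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    feats = model.get_text_features(**tokens)
feats = feats / feats.norm(dim=-1, keepdim=True)
print(feats @ feats.T)  # pairwise cosines: the two duck senses don't coincide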

CLIPSeg: Finding the Duck in the Scene

CLIP tells you that an image contains a duck. But where is the duck? Which pixels are duck and which are pond? That's segmentation: assigning a label to every pixel in the image. CLIPSeg combines CLIP's text-image understanding with a segmentation decoder to do this with just a text prompt.

Input: image + "a duck"
Output: heatmap where every pixel gets a score for "duckness"

High score pixels → the duck
Low score pixels → the background (water, reeds, sky)

Classification: "there's a duck in this image" (whole-image label)
Detection: "there's a duck at this bounding box" (rectangle around it)
Segmentation: "these exact pixels are duck" (pixel-level mask)

Each step is finer-grained than the last. Classification knows what. Detection knows where, roughly. Segmentation knows exactly which pixels.
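
In code, the heatmap is one forward pass. A minimal sketch using the transformers port of CLIPSeg with the authors' public checkpoint; the image file is a placeholder.

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("pond.jpg")  # placeholder: a scene containing a duck
inputs = processor(text=["a duck"], images=[image], return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution "duckness" heatmap

heatmap = torch.sigmoid(logits).squeeze()  # per-pixel scores in [0, 1]
duck_mask = heatmap > 0.5                  # rough binary mask: duck vs background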

How CLIPSeg Works

The Architecture

CLIPSeg takes CLIP's frozen image encoder and adds a lightweight decoder on top. The text prompt is encoded by CLIP's text encoder into a conditioning vector. The decoder uses that vector to produce a per-pixel segmentation mask, guided by the text and grounded in the image features.
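
The conditioning mechanism is FiLM-style modulation, as in the CLIPSeg paper: the text vector produces a per-channel scale and shift applied to the visual features. A conceptual sketch, with illustrative names and shapes rather than the real implementation:

import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Illustrative FiLM block: the text vector scales and shifts visual features."""
    def __init__(self, cond_dim, feat_dim):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, feat_dim)  # per-channel scale
        self.to_beta = nn.Linear(cond_dim, feat_dim)   # per-channel shift

    def forward(self, visual_tokens, text_vec):
        # visual_tokens: (batch, tokens, feat_dim) from the frozen CLIP image encoder
        # text_vec: (batch, cond_dim) from the CLIP text encoder
        gamma = self.to_gamma(text_vec).unsqueeze(1)
        beta = self.to_beta(text_vec).unsqueeze(1)
        return gamma * visual_tokens + beta  # "look for what the text describes"

The design pay-off: the big image encoder stays frozen, and each new query costs only this cheap scale-and-shift plus a small decoder pass, so one model serves every prompt.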

Why It Matters

Traditional segmentation needs training data with hand-drawn pixel masks, which are expensive and slow to produce. CLIPSeg segments anything you can describe in words. Ask for "duck" and it finds ducks. Ask for "water near the duck" and it finds that too. Zero-shot, open-vocabulary segmentation: the text prompt is the class definition.
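
The prompt list is literally the label set. Continuing the sketch above (reusing its processor, model, and image; the prompts are illustrative):

import torch

prompts = ["a duck", "water near the duck", "reeds", "sky"]
inputs = processor(text=prompts, images=[image] * len(prompts),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits  # one heatmap per prompt
labels = logits.argmax(dim=0)        # per pixel: whichever prompt scored highest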

SAM and the Segmentation Landscape

CLIPSeg isn't alone. Meta's SAM (Segment Anything Model) takes a different approach: give it a point or a box (the paper also explores text prompts) and it segments the object you indicated. The field is converging on models that can isolate any concept in any image, guided by language:

CLIPSeg: text-prompted segmentation via CLIP features. Lightweight, fast.
SAM: point- or box-prompted segmentation. Trained on 1B+ masks. Heavier but more precise.
Grounding DINO + SAM: detect with language, then segment. "Find the duck" → bounding box → pixel-perfect mask (sketched below, after this list).
SEEM / X-Decoder: unified models that classify, detect, and segment in one pass.
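
Here's a hedged sketch of that detect-then-segment pipeline, using the transformers zero-shot detection API for Grounding DINO and Meta's segment-anything package for SAM. The checkpoint names, the local SAM weights file, and the image are assumptions, and error handling is omitted.

import numpy as np
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from segment_anything import SamPredictor, sam_model_registry

image = Image.open("pond.jpg")  # placeholder image

# Step 1: language -> bounding box.
det_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(det_id)
detector = AutoModelForZeroShotObjectDetection.from_pretrained(det_id)
inputs = processor(images=image, text="a duck.", return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]])
box = results[0]["boxes"][0].numpy()  # a detected duck box, (x1, y1, x2, y2)

# Step 2: bounding box -> pixel mask.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed local weights
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=box)  # pixel-level duck masks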

The pattern: structure recognition (knowing what a duck looks like) is now directed by symbolic recognition (the word "duck"). The channels aren't just aligned; they're collaborating. Language tells vision what to look for. Vision tells language what's actually there.

GPT-4V and Beyond: All Channels Active

Modern multimodal models combine everything. They can:

See a duck (vision encoder → structural features)
Describe it (language model → behavioural/symbolic generation)
Reason about it ("this is a mallard, which is a dabbling duck, common in temperate regions")
Compare it ("this duck is similar to the one in the other image but is a different species")
Answer questions about it ("what is this duck eating?" → "it appears to be eating aquatic vegetation")

All three channels operating simultaneously. Structure provides the visual grounding. Behaviour provides the contextual knowledge. Symbol provides the communication medium. The representation of "duck" is no longer one vector; it's a web of aligned representations across modalities.
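
For the question-answering step above, here's a sketch using the OpenAI Python SDK; the model name and image URL are placeholders, and any vision-capable chat model would do.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is this duck eating?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)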

Is This Understanding?

The Case For

It can identify species from photos. Describe habitats. Predict behaviour. Generate novel duck images that are structurally correct. Answer questions that require integrating visual and linguistic knowledge. If a human did all this, we'd say they "understand" ducks.

The Case Against

It has no embodied experience. It has never been cold and seen ducks flying south. It doesn't know what it's like to hold a duckling. Its "understanding" is a statistical alignment of modalities, not lived experience. It can tell you about ducks. It can't be near a duck.

The Convergence Hypothesis

Maybe understanding isn't a binary. Maybe it's a gradient:

No channels: no concept at all. A rock.
One channel: partial recognition. A CNN that classifies ducks.
Two channels: correlated recognition. CLIP connecting images and text.
Three channels: multimodal representation. GPT-4V seeing, reading, reasoning.
Three channels + embodiment: biological understanding. A child who has seen, touched, chased, and been startled by a duck.
All channels + all experience: complete understanding. Probably impossible. An asymptote.

Each additional channel doesn't just add information; it constrains the representation. A concept grounded in one channel can drift. A concept grounded in three channels is pinned. The more channels that agree, the more the representation reflects something real rather than something hallucinated.

Structure shows you the form.
Behaviour shows you the function.
Symbol gives you the handle.

Understanding might just be: enough channels, agreeing, with enough constraint, that the representation becomes useful.

The duck is a wavefunction. Each channel is a measurement. None captures the duck completely. But together, they collapse toward something that works.

This entire series was about one question: how do systems represent concepts? The answer: imperfectly, through multiple channels, each with its own strengths and failure modes. Combine them and you get something that looks like understanding. Whether it is understanding depends on where you draw the line, and that line might not exist.

โ† Part 5: When It's Not a Duck series index Part 7: Duck Typing for Humans โ†’