Type "a corgi on grass." Out of 24 photos in the gallery, the corgi rises to the top. Score: 0.31.
Type "something to eat." A bowl of strawberries, a plate of pasta, and a wood-fired pizza take the medal positions. Score range: 0.25 – 0.27.
No server. No API key. No image got uploaded anywhere. The whole pipeline — a 150 MB neural network and 24 image embeddings — lives in a tab in your browser.
This is CLIP, the model that quietly powers a huge slice of modern computer vision. And in 2026, you can ship it on Vercel for free.
The idea behind CLIP
OpenAI released CLIP in 2021 with one beautifully simple idea: train one model to put text and images into the same vector space.
That's it. That's the whole trick.
CLIP has two encoders. The text encoder turns "a corgi puppy" into a 512-dimensional vector. The vision encoder turns a photo of a corgi into a 512-dimensional vector. They were trained on 400 million (caption, image) pairs scraped from the web so that paired text and images end up near each other in that vector space.
Once that's true, search becomes a nearest-neighbour lookup, the way a vector database does it. The distance between the vector for "a corgi puppy" and the vector for an actual corgi photo is small. The distance between "a corgi puppy" and a photo of an astronaut is large. Sort by distance. Done.
"a corgi puppy" ─▶ text encoder   ─▶ [0.04, -0.12, 0.07, ...] ─┐
                                                               ├─▶ cosine sim
[image bytes]   ─▶ vision encoder ─▶ [0.05, -0.10, 0.08, ...] ─┘
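The math at the end is one for-loop and a multiplication, because the vectors are normalised to unit length first (Step 2 explains why):
function cosineSim(a: Float32Array, b: Float32Array): number {
  // Both vectors are already L2-normalised, so the dot product is the cosine similarity.
  let dot = 0
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i]
  return dot
}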
That's it. That's the search engine.
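The dot-product shortcut only equals cosine similarity when both vectors have unit length. As a sanity check, here is a minimal sketch of the general formula, which divides by both norms. The name `cosineSimGeneral` and the toy 2-d vectors are hypothetical, made up for illustration rather than taken from the project's code.

```typescript
// General cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration; works on vectors of any magnitude.
function cosineSimGeneral(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error('vectors must have the same dimension')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  // Treat a zero vector as having no similarity to anything.
  return denom === 0 ? 0 : dot / denom
}

// Scaling a vector changes its dot product but not its cosine similarity.
const queryToy = new Float32Array([3, 4])
const sameDirection = new Float32Array([6, 8])
const perpendicular = new Float32Array([-4, 3])
console.log(cosineSimGeneral(queryToy, sameDirection)) // parallel vectors: 1
console.log(cosineSimGeneral(queryToy, perpendicular)) // perpendicular vectors: 0
```

If every vector is normalised once at encode time, the division disappears and the five-line `cosineSim` is all that is left.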
Why this matters
If you understand "embed everything into the same vector space and compare with a dot product," you understand the heart of:
- Pinterest's visual search. "Find me more like this."
- Stable Diffusion's text conditioning. "Generate this prompt" is "find a region of vector space the model has learned to produce."
- Dataset deduplication. "Which of these 50 million images are near-duplicates?" Cluster by vector distance.
- Zero-shot classification. "Is this a cat, a dog, or a goat?" Encode the three labels, encode the image, take the closest.
- Content moderation at scale. "Is this image semantically similar to the policy violations we've labelled?"
Every one of these has been a multi-million-dollar engineering problem for some company in the last five years. The core trick is what we're about to build in 200 lines of TypeScript.
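To make the zero-shot classification bullet concrete, here is a rough sketch using made-up 3-d unit vectors in place of real 512-d CLIP embeddings. The function names are hypothetical, and the `scale` of 100 is an assumption standing in for CLIP's learned temperature; it only sharpens the probabilities and does not change which label wins.

```typescript
// Toy 3-d unit vectors stand in for real CLIP embeddings (made-up values).
type Labelled = { label: string; vec: number[] }

function dot(a: number[], b: number[]): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

// Softmax turns scaled similarities into probabilities that sum to 1.
function softmax(xs: number[]): number[] {
  const max = Math.max(...xs) // subtract the max for numerical stability
  const exps = xs.map(x => Math.exp(x - max))
  const total = exps.reduce((s, e) => s + e, 0)
  return exps.map(e => e / total)
}

// Pick the label whose embedding is closest to the image embedding.
function zeroShot(imageVec: number[], candidates: Labelled[], scale = 100) {
  const sims = candidates.map(c => dot(imageVec, c.vec))
  const probs = softmax(sims.map(s => s * scale))
  let best = 0
  for (let i = 1; i < probs.length; i++) if (probs[i] > probs[best]) best = i
  return { label: candidates[best].label, probs }
}

const candidateLabels: Labelled[] = [
  { label: 'cat', vec: [1, 0, 0] },
  { label: 'dog', vec: [0, 1, 0] },
  { label: 'goat', vec: [0, 0, 1] },
]
const photoVec = [0.6, 0.8, 0] // unit length, points mostly toward "dog"
const prediction = zeroShot(photoVec, candidateLabels)
console.log(prediction.label) // dog
```

In a real pipeline the label vectors would come from the text encoder and the photo vector from the vision encoder; everything after that is the same arithmetic.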
Step 1: load CLIP into the browser
The thing that used to require a Python server with a GPU now runs in a tab. The library doing this magic is Transformers.js — Hugging Face's port of their Python transformers library to ONNX Runtime in JavaScript.
import {
  AutoTokenizer, AutoProcessor, RawImage,
  CLIPTextModelWithProjection, CLIPVisionModelWithProjection
} from '@xenova/transformers'

const MODEL_ID = 'Xenova/clip-vit-base-patch32'

// Load the tokenizer, image processor, and both encoders in parallel.
const [tokenizer, processor, textModel, visionModel] = await Promise.all([
  AutoTokenizer.from_pretrained(MODEL_ID),
  AutoProcessor.from_pretrained(MODEL_ID),
  CLIPTextModelWithProjection.from_pretrained(MODEL_ID),
  CLIPVisionModelWithProjection.from_pretrained(MODEL_ID)
])
First visit: ~150 MB of ONNX weights stream from the Hugging Face CDN into your browser's Cache API. Every visit after that: a few hundred milliseconds.
Step 2: encode a phrase
async function encodeText(text: string) {
  const inputs = tokenizer(text, { padding: true, truncation: true })
  const { text_embeds } = await textModel(inputs)
  return l2normalise(text_embeds.data) // 512-d Float32Array
}
The text_embeds field is the projected vector — the one that lives in the shared space. The un-projected hidden state is the wrong vector to compare against.
We L2-normalise (divide each vector by its length) so that cosine similarity reduces to a dot product. Tutorials rarely explain this detail, but it matters: without normalisation, the raw dot product also rewards vectors that simply have a larger magnitude, so the ranking partly reflects vector length rather than how well an image matches your query.
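The snippets call `l2normalise` without showing it. A minimal sketch of what such a helper could look like follows; the signature is an assumption (it accepts any array-like of numbers and returns a fresh `Float32Array`), and the project's own version may differ.

```typescript
// Hypothetical implementation of the l2normalise helper used in the snippets.
// Returns a new unit-length Float32Array; an all-zero input yields an all-zero output.
function l2normalise(vec: ArrayLike<number>): Float32Array {
  let sumSquares = 0
  for (let i = 0; i < vec.length; i++) sumSquares += vec[i] * vec[i]
  const norm = Math.sqrt(sumSquares)
  const out = new Float32Array(vec.length)
  if (norm === 0) return out // avoid dividing by zero
  for (let i = 0; i < vec.length; i++) out[i] = vec[i] / norm
  return out
}

const unitVec = l2normalise([3, 4])
console.log(unitVec) // components 0.6 and 0.8, up to float32 rounding
```

Normalising once at encode time means the ranking loop never has to compute a square root.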
Step 3: encode an image
async function encodeImage(url: string) {
  const image = await RawImage.read(url)
  const inputs = await processor(image)
  const { image_embeds } = await visionModel(inputs)
  return l2normalise(image_embeds.data)
}
The processor handles resize → centre-crop → normalisation with CLIP's specific mean/std values. The vision model is a Vision Transformer: it cuts the 224×224 image into a 7×7 grid of 49 patches, each 32×32 px, treats each patch as a token, runs them through a transformer (the same family of architecture as GPT), and projects the [CLS] token down to 512-d.
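Two pieces of arithmetic hide inside that paragraph: how many tokens the transformer sees, and what "normalisation with CLIP's mean/std" does to a pixel. The sketch below spells both out. The function names are hypothetical, and the mean/std constants are the values I understand OpenAI's CLIP preprocessing to use; treat them as an assumption and check the model's preprocessor config before relying on them.

```typescript
// Number of tokens a ViT sees: one per patch plus one [CLS] token.
function vitTokenCount(imageSize: number, patchSize: number): number {
  if (imageSize % patchSize !== 0) throw new Error('image size must be a multiple of patch size')
  const perSide = imageSize / patchSize
  return perSide * perSide + 1
}

// Assumed per-channel mean/std for CLIP preprocessing, in RGB order.
const CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]
const CLIP_STD = [0.26862954, 0.26130258, 0.27577711]

// Map one 8-bit RGB pixel to the normalised values fed to the vision encoder:
// scale to [0, 1], subtract the channel mean, divide by the channel std.
function normalisePixel(rgb: [number, number, number]): number[] {
  return rgb.map((value, c) => (value / 255 - CLIP_MEAN[c]) / CLIP_STD[c])
}

console.log(vitTokenCount(224, 32)) // 7 * 7 patches + 1 [CLS] = 50
console.log(normalisePixel([255, 255, 255])) // a white pixel, all channels positive
```

In practice the processor does this for every pixel after resizing and cropping, so you never call it by hand.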
Same 512-d. Same space as the text. That's the whole magic.
Step 4: rank
const queryVec = await encodeText("a corgi on grass")
const imageVecs = await Promise.all(images.map(img => encodeImage(img.url)))
const ranked = imageVecs
  .map((v, i) => ({ image: images[i], score: cosineSim(queryVec, v) }))
  .sort((a, b) => b.score - a.score)
That's the search engine, in a handful of lines.
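The ranking step can be tried without loading any model. The sketch below runs the same map-then-sort pattern on made-up 2-d unit vectors and keeps only the best `k` results. `topK`, the file names, and the vectors are all hypothetical stand-ins.

```typescript
type Scored<T> = { item: T; score: number }

// Rank items by dot product against the query and keep the best k.
// Assumes every vector is already L2-normalised.
function topK<T>(queryVec: number[], items: T[], vecs: number[][], k: number): Scored<T>[] {
  return vecs
    .map((v, i) => ({
      item: items[i],
      score: v.reduce((sum, x, j) => sum + x * queryVec[j], 0),
    }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, k)
}

// Made-up 2-d unit vectors standing in for real CLIP embeddings.
const photoIds = ['corgi.jpg', 'astronaut.jpg', 'pizza.jpg']
const photoVecs = [[0.8, 0.6], [0, 1], [0.6, 0.8]]
const bestTwo = topK([1, 0], photoIds, photoVecs, 2)
console.log(bestTwo.map(r => r.item)) // ['corgi.jpg', 'pizza.jpg']
```

For a 24-image gallery a full sort is plenty; an approximate nearest-neighbour index only starts to pay off at much larger collection sizes.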
Step 5: cache so it stays fast
The model weights cache in the Cache API automatically. But re-encoding 24 images on every visit is ~5 seconds of WASM work for no reason — the vectors don't change. Stash them in IndexedDB keyed by image id + model id:
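const STORE = 'embeddings'
// Assumes openDb() (not shown) resolves to a plain IDBDatabase containing STORE.
async function putCached(modelId: string, imageId: string, vec: Float32Array) {
  const db = await openDb()
  const tx = db.transaction(STORE, 'readwrite')
  // slice() copies the vector, so only its own 2 KB is stored.
  tx.objectStore(STORE).put(vec.slice().buffer, `${modelId}::${imageId}`)
  await new Promise<void>((resolve, reject) => { // wait for the write to commit
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
}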
48 KB for the whole gallery (24 vectors × 512 floats × 4 bytes). Warm reloads now feel instant.
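The storage figure and the read path are both easy to check in isolation. The helpers below are hypothetical, written only for illustration: one reproduces the size arithmetic, one builds the same `modelId::imageId` key that `putCached` uses, and one turns a stored buffer back into a vector.

```typescript
// Bytes needed to store `count` float32 vectors of `dims` dimensions each.
function embeddingBytes(count: number, dims: number): number {
  return count * dims * Float32Array.BYTES_PER_ELEMENT
}

// Same key format as putCached: model id and image id joined by "::".
// Including the model id means a model swap never serves stale vectors.
function cacheKey(modelId: string, imageId: string): string {
  return `${modelId}::${imageId}`
}

// IndexedDB hands back the raw buffer; wrap it to get the vector again.
function toVector(buffer: ArrayBufferLike): Float32Array {
  return new Float32Array(buffer)
}

const storedBuffer = new Float32Array([0.5, 0.25]).buffer
const restoredVec = toVector(storedBuffer)
console.log(embeddingBytes(24, 512) / 1024) // 48 (KiB)
console.log(cacheKey('Xenova/clip-vit-base-patch32', 'corgi'))
console.log(restoredVec.length) // 2
```

A matching read function would look up `cacheKey(modelId, imageId)` in the same object store and pass the result through `toVector`, falling back to `encodeImage` on a miss.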
What this changes
Five years ago, "text-to-image search" was a paper. Two years ago, it was a Python server with a GPU and an SDK. Today, it's a Vercel deploy.
The line between "real AI engineering" and "a weekend project" keeps moving. Not because the models got smaller — CLIP is still 150 MB. The browser got bigger. WebAssembly. ONNX Runtime Web. IndexedDB. Cache API. The runtime stack ate everything the old Python service used to do.
If you're a beginner reading this thinking "AI is too hard, I'd never build that": you just read the whole thing. The code is on GitHub, every commit walks you through one concept, and there's a live demo at clip-from-zero.vercel.app. Clone it. Open clip.ts. Read the four functions. That's CLIP.
The next time you hear someone talking about "embeddings" or "vector search" or "RAG" or "multimodal" — you know what they mean. Numbers in a 512-d space. Cosine similarity. A dot product.
That's it.
🔗 Code: github.com/dev48v/clip-from-zero
🌐 Live demo: clip-from-zero.vercel.app
📚 Series: TechFromZero — a new technology every day, all free, all open source.
