I Built a Text-to-Image Search Engine That Runs Entirely in the Browser

Dev.to / 5/24/2026


Key Points

The article demonstrates a text-to-image search approach using CLIP, where both text queries and images are embedded into the same vector space for similarity search.
It shows that by encoding a query (e.g., “a corgi on grass”) and comparing it against precomputed image embeddings, the most relevant photo can be ranked without heavy infrastructure.
A key highlight is that the entire pipeline runs locally in the browser—no server, no API key, and no image uploads—using a ~150MB neural network plus 24 image embeddings.
The post notes that CLIP was introduced by OpenAI in 2021 and is widely used in modern computer vision, but in 2026 it can be shipped conveniently via Vercel.
The example illustrates that retrieval is essentially reduced to computing cosine similarity (a dot product over vectors) between the text and vision embeddings.
The article is presented as an implementation-focused build, emphasizing how the concept of shared embeddings enables database-like search over multimodal content.

Type "a corgi on grass." Out of 24 photos in the gallery, the corgi rises to the top. Score: 0.31.

Type "something to eat." A bowl of strawberries, a plate of pasta, and a wood-fired pizza take the medal positions. Score range: 0.25 – 0.27.

No server. No API key. No image got uploaded anywhere. The whole pipeline — a 150 MB neural network and 24 image embeddings — lives in a tab in your browser.

This is CLIP, the model that quietly powers a huge slice of modern computer vision. And in 2026, you can ship it on Vercel for free.

The idea behind CLIP

OpenAI released CLIP in 2021 with one beautifully simple idea: train one model to put text and images into the same vector space.

That's it. That's the whole trick.

CLIP has two encoders. The text encoder turns "a corgi puppy" into a 512-dimensional vector. The vision encoder turns a photo of a corgi into a 512-dimensional vector. They were trained on 400 million (caption, image) pairs scraped from the web so that paired text and images end up near each other in that vector space.

Once that's true, you can do search the way a database does it. The distance between the vector for "a corgi puppy" and the vector for an actual corgi photo is small. The distance between "a corgi puppy" and a photo of an astronaut is large. Sort by distance. Done.

"a corgi puppy"  ─▶  text encoder  ─▶  [0.04, -0.12, 0.07, ...]  ─┐
                                                                   ├─▶  cosine sim
[image bytes]    ─▶  vision encoder ─▶  [0.05, -0.10, 0.08, ...]  ─┘

The math at the end is one for-loop and a multiplication:

function cosineSim(a: Float32Array, b: Float32Array) {
  // Assumes a and b are already L2-normalised, so the dot product is the cosine similarity
  let dot = 0
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i]
  return dot
}

That's it. That's the search engine.
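The cosineSim function above returns a plain dot product, which equals cosine similarity only when both inputs already have unit length. As a minimal sketch for the general case, the hypothetical helper cosineSimilarity below (not part of the original project) also divides by both vector lengths:

```typescript
// Hypothetical helper (not from the original project): cosine similarity
// for vectors that have NOT been normalised to unit length.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  // A zero vector has no direction; treat its similarity as 0.
  return denom === 0 ? 0 : dot / denom
}

// Same direction, different lengths: similarity is 1.
console.log(cosineSimilarity(new Float32Array([3, 4]), new Float32Array([6, 8])))
// Perpendicular vectors: similarity is 0.
console.log(cosineSimilarity(new Float32Array([1, 0]), new Float32Array([0, 1])))
```

Normalising once at encode time and using the plain dot product at query time gives the same ranking while doing less work per comparison.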

Why this matters

If you understand "embed everything into the same vector space and compare with a dot product," you understand the heart of:

Pinterest's visual search. "Find me more like this."
Stable Diffusion's text conditioning. "Generate this prompt" is "find a region of vector space the model has learned to produce."
Dataset deduplication. "Which of these 50 million images are near-duplicates?" Cluster by vector distance.
Zero-shot classification. "Is this a cat, a dog, or a goat?" Encode the three labels, encode the image, take the closest.
Content moderation at scale. "Is this image semantically similar to the policy violations we've labelled?"
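
To make the zero-shot classification item concrete, here is a rough sketch that picks the label whose embedding is closest to an image embedding. The name classifyZeroShot and the toy 3-d vectors are assumptions for illustration; real CLIP vectors are 512-d and come from the two encoders.

```typescript
// Sketch of zero-shot classification over precomputed, roughly unit-length embeddings.
// `LabelledEmbedding` and `classifyZeroShot` are hypothetical names for this example.
interface LabelledEmbedding {
  label: string
  vec: Float32Array
}

function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

function classifyZeroShot(
  imageVec: Float32Array,
  labels: LabelledEmbedding[]
): { label: string; score: number } {
  if (labels.length === 0) throw new Error('need at least one label')
  let best = { label: labels[0].label, score: dot(imageVec, labels[0].vec) }
  for (const candidate of labels.slice(1)) {
    const score = dot(imageVec, candidate.vec)
    if (score > best.score) best = { label: candidate.label, score }
  }
  return best
}

// Toy stand-ins for encoded label prompts such as "a photo of a cat".
const labelVecs: LabelledEmbedding[] = [
  { label: 'cat', vec: new Float32Array([1, 0, 0]) },
  { label: 'dog', vec: new Float32Array([0, 1, 0]) },
  { label: 'goat', vec: new Float32Array([0, 0, 1]) },
]
// Toy stand-in for an encoded image that points mostly along the "dog" axis.
const imageVec = new Float32Array([0.1, 0.99, 0.1])
console.log(classifyZeroShot(imageVec, labelVecs).label) // "dog"
```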

Every one of these has been a multi-million-dollar engineering problem for some company in the last five years. The core trick is what we're about to build in 200 lines of TypeScript.

Step 1: load CLIP into the browser

The thing that used to require a Python server with a GPU now runs in a tab. The library doing this magic is Transformers.js — Hugging Face's port of their Python transformers library to ONNX Runtime in JavaScript.

import {
  AutoTokenizer, AutoProcessor, RawImage,
  CLIPTextModelWithProjection, CLIPVisionModelWithProjection
} from '@xenova/transformers'

const MODEL_ID = 'Xenova/clip-vit-base-patch32'

const [tokenizer, processor, textModel, visionModel] = await Promise.all([
  AutoTokenizer.from_pretrained(MODEL_ID),
  AutoProcessor.from_pretrained(MODEL_ID),
  CLIPTextModelWithProjection.from_pretrained(MODEL_ID),
  CLIPVisionModelWithProjection.from_pretrained(MODEL_ID)
])

First visit: ~150 MB of ONNX weights stream from the Hugging Face CDN into your browser's Cache API. Every visit after that: a few hundred milliseconds.

Step 2: encode a phrase

async function encodeText(text: string) {
  const inputs = tokenizer(text, { padding: true, truncation: true })
  const { text_embeds } = await textModel(inputs)
  return l2normalise(text_embeds.data)   // 512-d Float32Array
}

The text_embeds field is the projected vector — the one that lives in the shared space. The un-projected hidden state is the wrong vector to compare against.

We L2-normalise (divide each vector by its length) so that cosine similarity reduces to a dot product. This is the kind of small detail tutorials rarely explain, but it matters: without normalisation, the dot product also rewards vectors with a large magnitude, so the ranking partly reflects how long an embedding is rather than how well its direction matches your query.
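
The l2normalise helper is called in the snippets but its body is not shown. One possible implementation, offered as an assumption rather than the project's actual code, together with a toy 2-d example of how vector length can distort a raw dot product:

```typescript
// One possible body for l2normalise (an assumption; the post does not show it).
function l2normalise(vec: ArrayLike<number>): Float32Array {
  let sumSq = 0
  for (let i = 0; i < vec.length; i++) sumSq += vec[i] * vec[i]
  const norm = Math.sqrt(sumSq)
  const out = new Float32Array(vec.length)
  if (norm === 0) return out // a zero vector stays zero
  for (let i = 0; i < vec.length; i++) out[i] = vec[i] / norm
  return out
}

function dotProduct(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

// Toy vectors: `aligned` points almost exactly along the query but is short;
// `long` is 45 degrees away from the query but has a large magnitude.
const query = l2normalise([1, 0])
const aligned = new Float32Array([0.9, 0.1])
const long = new Float32Array([5, 5])

// Raw dot products reward length: 0.9 versus 5, so `long` wins.
console.log(dotProduct(query, aligned), dotProduct(query, long))
// After normalising: about 0.994 versus about 0.707, so `aligned` wins.
console.log(dotProduct(query, l2normalise(aligned)), dotProduct(query, l2normalise(long)))
```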

Step 3: encode an image

async function encodeImage(url: string) {
  const image = await RawImage.read(url)
  const inputs = await processor(image)
  const { image_embeds } = await visionModel(inputs)
  return l2normalise(image_embeds.data)
}

The processor handles resize → centre-crop → normalisation with CLIP's specific mean/std values. The vision model is a Vision Transformer: it cuts the 224×224 image into a 7×7 grid of 49 patches, each 32×32 px, treats each patch as a token, runs them through a transformer (yes, the same family of architecture as GPT), and projects the [CLS] token down to 512-d.
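
The patch arithmetic can be checked in a few lines. The function name vitSequenceLength is made up for this sketch; it only does the counting and does not touch the model.

```typescript
// Back-of-the-envelope check of the ViT patch arithmetic.
// `vitSequenceLength` is a hypothetical helper, not a Transformers.js API.
function vitSequenceLength(
  imageSize: number,
  patchSize: number
): { grid: number; patches: number; tokens: number } {
  if (imageSize % patchSize !== 0) {
    throw new Error('image size must be a multiple of patch size')
  }
  const grid = imageSize / patchSize // patches per side
  const patches = grid * grid        // total image patches
  return { grid, patches, tokens: patches + 1 } // +1 for the [CLS] token
}

console.log(vitSequenceLength(224, 32)) // { grid: 7, patches: 49, tokens: 50 }
```

So for this model the transformer sees a sequence of 50 tokens per image: 49 patches plus the [CLS] token whose output is projected into the shared space.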

Same 512-d. Same space as the text. That's the whole magic.

Step 4: rank

const queryVec = await encodeText("a corgi on grass")
const imageVecs = await Promise.all(images.map(img => encodeImage(img.url)))

const ranked = imageVecs
  .map((v, i) => ({ image: images[i], score: cosineSim(queryVec, v) }))
  .sort((a, b) => b.score - a.score)

That's the search engine. Five lines, plus the dot product from earlier.
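
For a gallery larger than 24 photos you would usually return only the best few matches. A small sketch of that, assuming the embeddings are already L2-normalised; IndexedImage, SearchHit, topK, and the toy 2-d vectors are illustrative names and data, not part of the original code.

```typescript
// Illustrative top-k search over precomputed, L2-normalised embeddings.
interface IndexedImage {
  id: string
  vec: Float32Array
}
interface SearchHit {
  id: string
  score: number
}

function dotProduct(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

function topK(queryVec: Float32Array, index: IndexedImage[], k: number): SearchHit[] {
  return index
    .map(({ id, vec }) => ({ id, score: dotProduct(queryVec, vec) }))
    .sort((a, b) => b.score - a.score) // highest similarity first
    .slice(0, Math.max(0, k))
}

// Toy gallery: real entries would hold 512-d vectors from encodeImage.
const gallery: IndexedImage[] = [
  { id: 'astronaut', vec: new Float32Array([0, 1]) },
  { id: 'corgi', vec: new Float32Array([1, 0]) },
  { id: 'puppy', vec: new Float32Array([0.8, 0.6]) },
]
const hits = topK(new Float32Array([1, 0]), gallery, 2)
console.log(hits.map(h => h.id)) // corgi first, then puppy
```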

Step 5: cache so it stays fast

The model weights cache in the Cache API automatically. But re-encoding 24 images on every visit is ~5 seconds of WASM work for no reason — the vectors don't change. Stash them in IndexedDB keyed by image id + model id:

const STORE = 'embeddings'

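async function putCached(modelId: string, imageId: string, vec: Float32Array) {
  const db = await openDb()  // openDb() is not shown; assumed here to resolve to a native IDBDatabase
  const tx = db.transaction(STORE, 'readwrite')
  // slice() copies only this vector's bytes, even if vec is a view on a larger buffer
  tx.objectStore(STORE).put(vec.slice().buffer, `${modelId}::${imageId}`)
  // Resolve only once the write has actually been committed
  await new Promise<void>((resolve, reject) => {
    tx.oncomplete = () => resolve()
    tx.onerror = () => reject(tx.error)
  })
}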

48 KB for the whole gallery. Warm reloads now feel instant.
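
The 48 KB figure follows from 24 vectors × 512 floats × 4 bytes each. The sketch below checks that arithmetic and shows one way to turn a vector into a storable buffer and back. The helper names are hypothetical, and the copy in toStorableBuffer is a defensive assumption in case the vector is a view onto a larger buffer.

```typescript
// Storage arithmetic and buffer round trip for cached embeddings.
// All three helpers are hypothetical names for this sketch.
function embeddingBytes(count: number, dims: number): number {
  return count * dims * Float32Array.BYTES_PER_ELEMENT // 4 bytes per float32
}

// Copy exactly this vector's bytes into a fresh buffer before storing it.
function toStorableBuffer(vec: Float32Array): ArrayBufferLike {
  return vec.slice().buffer
}

// Rebuild the vector from a buffer read back out of storage.
function fromStoredBuffer(buf: ArrayBufferLike): Float32Array {
  return new Float32Array(buf)
}

console.log(embeddingBytes(24, 512)) // 49152 bytes, i.e. 48 KiB

// A 2-element view onto an 8-element backing buffer.
const backing = new Float32Array([0, 0, 0.25, 0.5, 0, 0, 0, 0])
const view = backing.subarray(2, 4)
console.log(view.buffer.byteLength)              // 32: the whole backing buffer
console.log(toStorableBuffer(view).byteLength)   // 8: just the two floats
console.log(fromStoredBuffer(toStorableBuffer(view))) // the two floats, 0.25 and 0.5
```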

What this changes

Five years ago, "text-to-image search" was a paper. Two years ago, it was a Python server with a GPU and an SDK. Today, it's a Vercel deploy.

The line between "real AI engineering" and "a weekend project" keeps moving. Not because the models got smaller — CLIP is still 150 MB. The browser got bigger. WebAssembly. ONNX Runtime Web. IndexedDB. Cache API. The runtime stack ate everything the old Python service used to do.

If you're a beginner reading this thinking "AI is too hard, I'd never build that": you just read the whole thing. The code is on GitHub, every commit walks you through one concept, and there's a live demo at clip-from-zero.vercel.app. Clone it. Open clip.ts. Read the four functions. That's CLIP.

The next time you hear someone talking about "embeddings" or "vector search" or "RAG" or "multimodal" — you know what they mean. Numbers in a 512-d space. Cosine similarity. A dot product.

That's it.

🔗 Code: github.com/dev48v/clip-from-zero
🌐 Live demo: clip-from-zero.vercel.app
📚 Series: TechFromZero — a new technology every day, all free, all open source.
