I Asked Three Coding Agents to Build My Son's Cricket Coach a Website. The Result Wasn't Decided by the Model — It Was Decided by Taste.

Dev.to / 4/28/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The author tested three AI coding agents (OpenAI Codex GPT-5.5, Anthropic Claude Opus 4.7, and Google Gemini 3.1 Pro) across five runs at different effort levels, each with the exact same short prompt and 18 WhatsApp photos, to build a single-page website for a kids’ cricket academy.
  • Despite identical inputs, the five outputs varied widely, and the differences were less about visual quality than about how each model interpreted “easy contact from a phone.”
  • The winning version was not the prettiest; it better matched the real customer context in Bengaluru where parents contact via WhatsApp rather than using contact forms.
  • The experiment suggests that for practical web building with AI, alignment with user intent and local communication habits matters more than model effort level or aesthetics.
  • The author frames this as a useful “low-stakes” evaluation of AI coding agents, emphasizing taste, product understanding, and requirement interpretation over surface-level design.
  • The practical takeaway is that prompts should specify critical behavioural requirements (e.g., WhatsApp-first contact) if you want consistent outcomes across different agents.

TL;DR — Codex GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro. Same prompt. Same 18 photos. Five total runs across different effort budgets. The one that won wasn't the prettiest. It was the one that understood the job: parents in Bengaluru enquire on WhatsApp, not contact forms.

My son's cricket coach asked me for a website.

Saturday afternoon. He runs Bangalore Royal Cricket Academy — a small but seriously good academy for kids. He had two phone numbers, a folder of 18 WhatsApp photos taken by parents, and a one-line brief: "Like a real cricket academy, parents should be able to call or WhatsApp from their phone."

I'm a CTO. I'm in the trenches with AI coding agents most weeks. This felt like a clean, low-stakes test.

So I gave the exact same prompt and the exact same 18 photos to three coding agents:

  • OpenAI Codex (GPT-5.5, medium effort)
  • Anthropic Claude Opus 4.7 (low effort, then re-run on medium)
  • Google Gemini 3.1 Pro (low effort, then re-run on high)

Five outputs. One Saturday. Five very different opinions on what "a cricket academy website" actually is.

I went in expecting a verdict on visual quality. I didn't get one. I got something more interesting.

The setup

The prompt was deliberately short:

Build a single-page website for Bangalore Royal Cricket Academy. Brand line: "Nurturing champions, one delivery at a time." Programs: Summer Camp, Weekday Batch, Weekend Batch, Intensive (elite). Two phone numbers. The photos are in /photos for website. Parents should be able to contact us easily from their phone.

That's it. No design system. No colour palette. No mention of WhatsApp by name. No mention of tests, deployment, SEO meta, or Cloudflare. Whatever each agent decided "contact us easily from their phone" meant — that was on the agent.

What I got back, in five outputs

1. Claude Opus 4.7, low effort

Single-file HTML, Tailwind via CDN, Bebas Neue display font, royal navy + gold palette.

The headline made me sit up: "CHAMPIONS ARE / BUILT HERE." with the second half in gold. It was the only one of the five where the hero felt like it belonged on a printed flyer the coach would hand out at a school. Visually polished.

Engineering-wise, thin: no tests, no OG tags beyond a <meta name="description">, photos referenced as img-01.jpg through img-18.jpg and dumped into a uniform 4-column grid. tel: links only. No WhatsApp.

2. Claude Opus 4.7, medium effort

Same starting point, completely different output.

<section id="top" class="relative h-screen min-h-[640px] w-full overflow-hidden">
  <div class="absolute inset-0">
    <img src="assets/photos/brca-01.jpeg" alt="" class="kenburns" />
    <div class="absolute inset-0 bg-gradient-to-b from-navy-deep/85 via-navy/70 to-navy-deep/95"></div>
  </div>
  ...
</section>

Full-screen hero with a Ken Burns animation on the image. A scroll indicator with an animated dot inside a mouse outline. A gold cricket-seam pattern divider between sections — actual dashed lines that look like ball stitching. A two-image collage in the about section with offset margins. A CSS-columns masonry gallery using all 15 photos. An inline-SVG favicon as a data URI (one fewer request). OG tags. A theme-color meta. A WhatsApp deep-link button in the contact section.

<a href="https://wa.me/917337726777?text=Hi%20BRCA%2C%20I%27d%20like%20to%20know%20more%20about%20your%20programs."
   target="_blank" rel="noopener"
   class="bg-gold text-navy font-semibold px-6 py-3.5 rounded-md">
  💬 Message us on WhatsApp
</a>

This was the prettiest output of the five. By a clear margin. Bebas Neue + Inter, Ken Burns, gold seam, masonry — the only one I'd let near a printer.

Still Tailwind via CDN. Still no test suite. Still no automated deploy. Photos renamed semantically (brca-01.jpeg).

3. Codex GPT-5.5, medium effort

Vanilla HTML, an 800-line vanilla CSS file, and 16 lines of vanilla JS. White-and-navy local-business layout. Numbered "01–04" feature blocks. WhatsApp-green CTAs in the contact section.

It looks less editorial than Claude-medium. It also does five things none of the others did.

One. It picked 6 photos out of 18 and renamed them by content:

brca-team-ground.jpeg
brca-trophy-team.jpeg
brca-trophy-presentation.jpeg
brca-young-achievers.jpeg
brca-coaching-moment.jpeg
brca-floodlight-batch.jpeg

That's editorial judgement encoded in code output. It chose; it didn't dump everything into a grid.

Two. It wrote a _headers file:

/*
  X-Content-Type-Options: nosniff
  Referrer-Policy: strict-origin-when-cross-origin
  Permissions-Policy: camera=(), microphone=(), geolocation=()

/assets/*
  Cache-Control: public, max-age=31536000, immutable

Security headers and cache rules. I didn't ask for them.

Three. It wrote a real test suite using node:test:

// Excerpt — the imports and the text() helper are defined earlier in the file.
test('home page exposes call and WhatsApp enrollment links', async () => {
  const html = await text('index.html');
  assert.match(html, /href="tel:\+917337726777"/);
  assert.match(html, /href="tel:\+917337736777"/);
  assert.match(html, /https:\/\/wa\.me\/917337726777/);
  assert.match(html, /https:\/\/wa\.me\/917337736777/);
});

test('referenced local assets and Cloudflare Pages config exist', async () => {
  const html = await text('index.html'); // re-read the page for this test
  const imageRefs = [...html.matchAll(/src="([^"]+\.(?:jpg|jpeg|png|webp))"/gi)]
    .map(m => m[1]);
  assert.ok(imageRefs.length >= 6, 'at least six academy photos are used');
  for (const ref of imageRefs) {
    assert.ok(existsSync(new URL(ref, root)), `${ref} exists`);
  }
});

Three tests. They assert brand text, both phone numbers, both WhatsApp links, security file existence, responsive CSS, and that every referenced image actually exists on disk. That last one is the one I respect most. It catches the single most common silent break in a static site.

Four. Every primary CTA is a wa.me deep link with prefilled message text:

<a class="contact-link whatsapp"
   href="https://wa.me/917337726777?text=Hi%20BRCA%2C%20I%20would%20like%20to%20know%20more%20about%20cricket%20training."
   target="_blank" rel="noopener">
  <span>WhatsApp</span>
  <strong>7337726777</strong>
</a>

Not just wa.me/91…. Pre-filled message text. Parent taps. Message lands. Zero typing.

Five. It deployed it. It opened my browser, walked me through a Cloudflare OAuth handshake, then pushed the build to Cloudflare Pages. The .wrangler/cache/pages.json it left behind:

{ "account_id": "...", "project_name": "brca-academy" }

Most coding agents stop at "here's the HTML." Codex stopped at a live URL. That distinction — treating "build a website" as a unit of work that includes shipping, not just generating markup — is what made me rate it the most production-ready output of the five.

4. Gemini 3.1 Pro, low effort

Dark slate background. Electric blue + amber accents. 60 lines of vanilla JS with an IntersectionObserver scroll-reveal effect.

It looked like a SaaS analytics dashboard. Wrong audience by about ten years. Photos referenced as photo_1.jpeg through photo_18.jpeg. tel: links only.
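The scroll-reveal effect itself is the standard IntersectionObserver pattern. A minimal sketch of that pattern, with class names that are my own placeholders rather than Gemini's actual code:

// Add a class when an element scrolls into view; CSS handles the transition.
const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (entry.isIntersecting) {
      entry.target.classList.add('is-visible');
      observer.unobserve(entry.target); // reveal once, then stop watching
    }
  }
}, { threshold: 0.15 });

document.querySelectorAll('.reveal').forEach((el) => observer.observe(el));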

5. Gemini 3.1 Pro, high effort

Palette fixed: navy + amber. Playfair Display + Outfit for typography. About section with an image collage and an "Elite Training Facility" badge. Wider elite-program card with a dedicated highlights box. Mobile menu with hamburger.

Visually, a different website from the low-effort version. Genuinely better.

What it still didn't have:

  • WhatsApp deep links. Anywhere. tel: links only.
  • OG tags or theme-color.
  • A test suite.
  • A deployment config.
  • Semantic photo names — the photos were still generically numbered, img1.jpeg through img8.jpeg.

More budget bought better visuals. It didn't buy better judgement about what a Bengaluru cricket academy website is for.

What actually decided it

Not the prettiest hero. Not the cleverest animation.

This:

In Bengaluru, parents enquire on WhatsApp. Not email. Not contact forms. Not phone calls until they've messaged first.

The single biggest conversion lever for an Indian local business website is wa.me deep linking with prefilled message text. Parent opens the page. Parent taps the button. WhatsApp opens with "Hi BRCA, I would like to know more about cricket training" already typed. They send. Coach gets a notification.
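If you want that behaviour regardless of which agent you use, it's worth wiring explicitly, in the prompt or by hand. A minimal sketch of the link construction (the number and message are the ones from Codex's output; the helper itself is mine, not any agent's):

// Build a wa.me deep link with a prefilled, URL-encoded message.
// The number must be in international format with no "+", spaces, or dashes.
function whatsappLink(number, message) {
  return `https://wa.me/${number}?text=${encodeURIComponent(message)}`;
}

whatsappLink('917337726777', 'Hi BRCA, I would like to know more about cricket training.');
// -> https://wa.me/917337726777?text=Hi%20BRCA%2C%20I%20would%20like%20to%20know%20more%20about%20cricket%20training.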

Codex did this on every primary CTA. Claude-medium did it as one button at the bottom of the contact section. Claude-low, Gemini-low, and Gemini-high didn't do it at all.

That single decision was worth more than the prettiest hero in the comparison.

The thing I wasn't expecting

I went in assuming effort budget would be the variable that explained quality differences.

Compare what happened when I raised the effort budget on each model:

Claude (low → medium): The visual quality jumped from "pretty" to "editorial-grade". It added the Ken Burns animation, the masonry gallery, OG tags, a theme-color, semantic photo names (img-XX.jpg → brca-XX.jpeg), and a WhatsApp button. The model used the extra budget to upgrade both taste and product judgement.

Gemini (low → high): The visual quality jumped. The palette got fixed. The typography got upgraded. The layout got more sophisticated.

It still didn't add WhatsApp.

It still didn't write tests.

It still didn't deploy.

It still left photos as img1.jpeg.

More budget didn't teach the model what the website was for. It only taught it to make the wrong website prettier.

The headline isn't "Codex won because GPT-5.5 is the best model." The headline is:

Effort budget isn't the variable that explains output quality. Taste is.

Codex on a single medium run produced more production-ready output than Gemini on high. Claude on medium produced the most beautiful site in the lineup. Gemini on high produced a much-improved-but-still-fundamentally-misjudged website.

The extra budget surfaced what each model already understood about the job. It didn't change what the model thought the job was.

Sidebar: Two paths to a Cloudflare token

Worth mentioning because it's the kind of thing CTOs care about.

When each agent needed to deploy to Cloudflare Pages, they took one of two paths:

Path A — silent OAuth. Codex (medium) and Gemini (low) opened my browser, walked me through Cloudflare's OAuth flow, and got a session. Fast. Smooth. I never saw the token. The agent now has access to my entire Cloudflare account for the duration of that session.

Path B — paste-your-own-token. Claude (at both effort levels) and Gemini (at high effort) said: "Go to Cloudflare → My Profile → API Tokens → Create Token with these specific scopes — Account: Cloudflare Pages: Edit — and paste it here. I won't see your account session." More friction at install time. Also more control: the token is scoped, I can see exactly what I gave the agent, and I can rotate or revoke it without touching my main session.

Both are defensible. Path A optimises for time-to-deploy. Path B optimises for credential hygiene.
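For what Path B looks like in practice, here's a rough sketch: a tiny Node wrapper around wrangler's Pages deploy that relies on a scoped token in the environment instead of an OAuth session. The project name comes from Codex's pages.json; the dist directory is illustrative.

// Deploy a static build to Cloudflare Pages with a scoped API token (Path B).
// If CLOUDFLARE_API_TOKEN is set, wrangler uses it instead of an OAuth session.
// (CLOUDFLARE_ACCOUNT_ID may also be needed if the token can see multiple accounts.)
import { spawnSync } from 'node:child_process';

if (!process.env.CLOUDFLARE_API_TOKEN) {
  throw new Error('Set CLOUDFLARE_API_TOKEN to a token scoped to "Cloudflare Pages: Edit".');
}

spawnSync('npx', ['wrangler', 'pages', 'deploy', 'dist', '--project-name', 'brca-academy'], {
  stdio: 'inherit', // stream wrangler's own output
});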

If you're a solo developer building a side project, Path A is probably fine. If you're running production infrastructure for a fintech and an AI agent is asking for credentials, Path B is the only answer. The fact that two of the three agents converge on Path B at higher effort levels — Claude always, Gemini on its high-effort run — suggests their "thoughtful" mode is more security-aware. Codex stayed silent-OAuth even at medium. Worth knowing.

What this means for picking a coding agent in 2026

Three takeaways, none of them about benchmarks.

One. Test the agent on a job, not on a problem. "Build a website" and "build a website that converts WhatsApp leads for an Indian local business" are different evaluations. The first is a syntax exercise. The second tells you whether the agent can read the room.

Two. Effort budgets are amplifiers, not teachers. They make a model more of what it already is. If a model doesn't understand the job at low effort, high effort will produce a more polished version of the wrong thing.

Three. Production scaffolding is the cheapest signal of seriousness. Tests. Headers. OG meta. A 404.html. Curated photos with content-aware filenames. None of these were in my prompt. The agent that wrote all of them on its own is the one I trust with code I can't review line by line.

Coda — what actually shipped

I have to be honest about something the single-shot benchmark couldn't capture.

Codex won my engineering eval. That stands. It's the one I'd hand a junior dev and say "ship it."

But the one I reached for next was Claude.

Two more prompts with the medium-effort Claude — "add a persistent WhatsApp floating button," "add a three-card contact section like a real local business, with primary office / coaching desk / WhatsApp" — and a bit of browser automation to handle the Cloudflare deploy and DNS, and the site went live at brca.in.

That's the version the coach is using today. WhatsApp floating button. Three contact cards. A "Free trial session available" pill the coach asked for after the first parent enquiry. A schedule strip. Custom domain. Live HTTPS.

Why Claude, not Codex — given my own engineering verdict?

Because the single-shot test answers "which agent has the best instincts." The shipping test answers "which agent do I want as a collaborator."

Those are different questions. They had different answers for me.

Claude was the one I wanted to keep editing. The Bebas Neue + gold-seam aesthetic, the masonry gallery, the Ken Burns hero — those are the parts of the design I didn't want to throw away. Codex's output was more correct. Claude's output was the one I had a relationship with.

That's a real signal. Worth saying out loud.

The closer

The coach got a website. Parents got a WhatsApp button. The site is live at brca.in.

The first parent message landed in the inbox before sundown. "Hi BRCA, I would like to know more about cricket training."

The one-shot finding holds: at first contact, taste decided the comparison. Codex's instinct for what an Indian local business website needed to do was sharper than any other model's in the lineup.

But the part of the comparison nobody benchmarks is the part that matters most after the demo: which agent do you actually want to keep working with.

For me, on this job — it was Claude.

Champions are built here. Apparently websites too.

Site live at brca.in. Drop a comment if you'd like the source code for all five runs — happy to share the GitHub repo.

What I'd love to know: which agent are you reaching for in 2026 — and what's the smallest job you've used to test whether it actually understands the room? Reply below.