ChatGPT voice mode is a weaker model

Simon Willison's Blog


Key Points

  • The post argues that OpenAI’s ChatGPT voice mode uses an older, weaker model than many users assume, based on its self-reported knowledge cutoff of April 2024 ("GPT-4o era").
  • It highlights a perceived capability gap between model access points (free/voice) and higher-tier offerings, where more advanced models are said to handle significantly harder tasks.
  • The author connects the gap to differences in domains and feedback/reward structures—verifiable reward signals (like unit tests) are easier to use for reinforcement learning than subjective tasks like writing.
  • The discussion draws on Andrej Karpathy’s view that the model used in a given product experience shapes what people conclude about overall AI capability.
  • Overall, the takeaway is that how and where users access models can materially affect expectations, perceptions, and evaluations of AI performance.

10th April 2026

I think it's non-obvious to many people that the OpenAI voice mode runs on a much older, much weaker model - it feels like the AI that you can talk to should be the smartest AI but it really isn't.

If you ask ChatGPT voice mode for its knowledge cutoff date it tells you April 2024 - it's a GPT-4o era model.

This thought was inspired by this Andrej Karpathy tweet about the growing gap in understanding of AI capability, depending on the access points and domains through which people use the models:

[...] It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and at the same time, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems.

This part really works and has made dramatic strides because 2 properties:

  1. these domains offer explicit reward functions that are verifiable meaning they are easily amenable to reinforcement learning training (e.g. unit tests passed yes or no, in contrast to writing, which is much harder to explicitly judge), but also
  2. they are a lot more valuable in b2b settings, meaning that the biggest fraction of the team is focused on improving them.
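Karpathy's first property can be made concrete with a minimal sketch: a "verifiable" reward reduces an output to a binary pass/fail signal that an RL training loop can optimize directly, the way unit tests do for code. This is an illustration only, not OpenAI's implementation; all names here are hypothetical.

```python
def verifiable_reward(candidate_fn, test_cases):
    """Return 1.0 if the candidate passes every test case, else 0.0.

    This is the 'unit tests passed: yes or no' signal Karpathy describes.
    There is no equivalent automatic check for, say, the quality of an essay.
    """
    try:
        for args, expected in test_cases:
            if candidate_fn(*args) != expected:
                return 0.0
    except Exception:
        return 0.0  # a crashing solution also scores zero
    return 1.0


# A model-generated solution to the task "add two numbers":
def candidate(a, b):
    return a + b


# Binary, machine-checkable reward -- no human judgment required.
reward = verifiable_reward(candidate, [((1, 2), 3), ((0, 0), 0)])
```

The point of the sketch is that the grader is cheap, objective, and repeatable at training scale, which is exactly what makes coding and exploit-finding "easily amenable to reinforcement learning" in a way subjective writing tasks are not.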
Posted 10th April 2026 at 3:56 pm


Tags: ai, openai, andrej-karpathy, generative-ai, chatgpt, llms
