Key Points

Simon Willison shares a macOS command-line recipe using `uv` to transcribe an audio file with the Gemma 4 E2B model via MLX and the `mlx-vlm` toolkit.
The example runs `mlx_vlm.generate` with an audio input, a simple transcription prompt, and configurable generation parameters like `--max-tokens` and `--temperature`.
A brief test on a 14-second WAV demonstrates the approach working end-to-end, while also showing occasional transcription errors (e.g., mishearing “right” as “front” and dropping “well” from “how well that works”).
The post is positioned as a practical “how to” note for running local, MLX-based audio transcription with a Gemma 4 variant and its supporting libraries.

Simon Willison’s Weblog


12th April 2026

Thanks to a tip from Rahim Nathwani, here's a uv run recipe for transcribing an audio file on macOS using the 10.28 GB Gemma 4 E2B model with MLX and mlx-vlm:

uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0


I tried it on this 14-second .wav file and it output the following:

This front here is a quick voice memo. I want to try it out with MLX VLM. Just going to see if it can be transcribed by Gemma and how that works.

(That was supposed to be "This right here..." and "... how well that works" but I can hear why it misinterpreted that as "front" and "how that works".)
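
As a minimal sketch, the recipe above could be wrapped in a small shell function so it is easier to reuse across files. The function name `transcribe`, its optional second argument, and the `DRY_RUN` variable are hypothetical conveniences invented for this illustration. They are not part of `uv` or `mlx-vlm`. The flags themselves are copied unchanged from the command above.

```shell
# Hypothetical helper: wraps the uv recipe above in a reusable function.
# Usage: transcribe FILE [MAX_TOKENS]
# "transcribe" and DRY_RUN are illustrative assumptions, not mlx-vlm features.
transcribe() {
  audio="${1:?usage: transcribe FILE [MAX_TOKENS]}"
  max_tokens="${2:-500}"   # default matches the original recipe

  # Build the command as positional parameters so arguments stay intact.
  set -- uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
    mlx_vlm.generate \
    --model google/gemma-4-e2b-it \
    --audio "$audio" \
    --prompt "Transcribe this audio" \
    --max-tokens "$max_tokens" \
    --temperature 1.0

  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command (space-joined, for inspection only) instead of running it.
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running `DRY_RUN=1 transcribe memo.wav` prints the command that would be executed, which is a cheap way to check the arguments before the model download starts. Without `DRY_RUN`, the function runs the recipe exactly as written above.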

Posted 12th April 2026 at 11:57 pm