Tested DeepSeek V4 Pro on FoodTruck Bench, our 30-day agentic benchmark where models run a food truck via 34 tools (locations, pricing, inventory, staff, weather, events) with persistent memory and daily reflection. It's the first Chinese model to land in the frontier tier on our benchmark: tied with Grok 4.3 Latest on outcome, within 3% of GPT-5.2's median, and #4 overall behind Opus 4.6, GPT-5.2, and Grok 4.3.

The timing is the interesting part. We tested GPT-5.2 in mid-February; DeepSeek V4 Pro matches its numbers ten weeks later. The China–US frontier gap on this benchmark used to feel like a year. Right now it's about ten weeks.

The pricing gap is even sharper. GPT-5.2 charges $1.75/M input and $14/M output. DeepSeek V4 Pro is at $0.435/M input and $0.87/M output, with discounted cache reads on top: roughly 17× cheaper for the same agentic workload. That's promo pricing today, but DeepSeek's track record is that promo becomes the floor. On cost-efficiency (net worth per dollar of API spend), DeepSeek V4 Pro is #2 overall on the leaderboard, behind only Gemma 4 31B and ahead of every premium-tier model.

Against Grok 4.3 Latest specifically, the medians are basically tied at the same price, but DeepSeek wins on consistency: zero loans, ~6× less food waste, 30% more meals served per day, and a 2.4× tighter outcome distribution. Grok matches DeepSeek's peak; DeepSeek matches its own peak every time. Opus 4.6's peak run is still higher than DeepSeek's, and Gemma is still cheaper. Otherwise this is a real frontier-tier competitor at a Chinese price point.

Update: Xiaomi MiMo v2.5 Pro just finished its run set as well, with 5/5 survived, +1,019% median ROI, and $22,388 median net worth at $2.41/run. It lands at #6 on the leaderboard, between Gemma 4 31B and Sonnet 4.6. It's slightly behind DeepSeek on outcome and consistency (wider variance: $9K worst run vs. $29K best), but a real result for a Chinese model at this price point. That's now two Chinese models in our top 6, both at sub-$3.5/run.

When we started this benchmark in February, neither of these tiers existed outside US labs. Congrats to the DeepSeek and Xiaomi MiMo teams.

Full write-up: https://foodtruckbench.com/blog/deepseek-v4-pro
DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper
Reddit r/LocalLLaMA / 5/5/2026
📰 News · Signals & Early Trends · Industry & Market Moves · Models & Research
Key Points
- DeepSeek V4 Pro reportedly matches GPT-5.2 on FoodTruck Bench, an agentic 30-day benchmark where models control a simulated food truck using 34 tools with persistent memory and daily reflection.
- The article emphasizes that the performance gap between China and the US on this benchmark has narrowed to around ten weeks, compared with about a year previously.
- DeepSeek V4 Pro is far cheaper than GPT-5.2, costing about $0.435/M input and $0.87/M output (plus discounted cache reads), translating to roughly 17× lower cost for the same agentic workload.
- In cost-efficiency ("net worth per dollar" of API spend), DeepSeek V4 Pro ranks #2 overall, behind only Gemma 4 31B and ahead of every premium-tier model.
- A follow-up update says Xiaomi MiMo v2.5 Pro also performed strongly (5/5 survived, ~1,019% median ROI) to land #6 on the leaderboard, reinforcing that multiple Chinese agentic models are now competing in frontier tiers at low per-run cost.
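The pricing and cost-efficiency claims above can be sanity-checked with quick arithmetic on the numbers quoted in the post. A minimal sketch: note that the raw per-token price ratios come out at ~4× (input) and ~16× (output), so the post's ~17× workload figure presumably also reflects DeepSeek's discounted cache reads and the actual token mix of an agentic run, neither of which is captured below.

```python
# Raw per-token price ratios, GPT-5.2 vs. DeepSeek V4 Pro
# ($/M tokens, list prices as quoted in the post; DeepSeek's are promo).
gpt_in, gpt_out = 1.75, 14.00
dsk_in, dsk_out = 0.435, 0.87

print(f"input price ratio:  {gpt_in / dsk_in:.1f}x")    # ~4.0x
print(f"output price ratio: {gpt_out / dsk_out:.1f}x")  # ~16.1x

# Cost-efficiency as the benchmark defines it (net worth per dollar of
# API spend), using Xiaomi MiMo v2.5 Pro's reported medians.
net_worth, run_cost = 22_388, 2.41
print(f"net worth per API dollar: ${net_worth / run_cost:,.0f}")  # ~$9,290
```

This only reproduces the headline ratios from published prices; per-run totals would additionally require the benchmark's actual input/output/cache token counts, which the post does not give.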