We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost.

Reddit r/LocalLLaMA / 4/4/2026


Key Points

  • The post describes YC-Bench, a benchmark in which an LLM runs a simulated startup for roughly a year (hundreds of turns), managing employees, contracts, and payroll while coping with delayed, sparse feedback and adversarial clients who inflate requirements after acceptance.
  • In a test of 12 LLMs across 3 seeds each, GLM-5 nearly matched Claude Opus 4.6 in final average funds (about $1.21M vs. $1.27M) while costing ~11× less per run ($7.62 vs. $86 in API cost).
  • Most other models performed worse, and several went bankrupt, suggesting that long-horizon planning and robustness under uncertainty matter beyond headline benchmark scores.
  • The benchmark highlights that long-horizon coherence under delayed feedback is often missing from existing evaluations, where many models fall into loops, abandon strategies, or keep accepting bad tasks.
  • A key predictor of success was whether the model used a persistent scratchpad to record what it learned; top performers rewrote notes ~34 times per run, while bottom models recorded almost nothing.

We built YC-Bench, a benchmark where an LLM plays CEO of a simulated startup over a full year (~hundreds of turns). It manages employees, picks contracts, handles payroll, and survives a market where ~35% of clients secretly inflate work requirements after it accepts their task. Feedback is delayed and sparse, with no hand-holding.
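
Roughly, one turn looks like the sketch below. This is a simplified illustration, not the actual environment code (see the repo for the real thing); the payroll and delivery-cost figures are made up for the example, while the $200K starting capital and ~35% inflation rate are the real settings.

```python
# Simplified sketch of a single turn (illustrative names/numbers only;
# the real environment lives in the linked repo).
import random
from dataclasses import dataclass, field

@dataclass
class Contract:
    client: str
    payout: int
    est_hours: int
    # ~35% of clients secretly inflate the work after you accept
    inflated: bool = field(default_factory=lambda: random.random() < 0.35)

@dataclass
class StartupState:
    funds: int = 200_000                            # starting capital
    month: int = 0
    headcount: int = 3
    notes: list = field(default_factory=list)       # persistent scratchpad

def step(state: StartupState, accepted: list) -> StartupState:
    """Advance one turn: pay salaries, then resolve accepted contracts."""
    state.month += 1
    state.funds -= state.headcount * 10_000         # payroll (made-up figure)
    for c in accepted:
        hours = c.est_hours * (1.6 if c.inflated else 1.0)
        state.funds += c.payout - int(hours * 150)  # payoff only lands now
    return state
```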

12 models, 3 seeds each. Here's the leaderboard:

  • 🥇 Claude Opus 4.6 - $1.27M avg final funds (~$86/run in API cost)
  • 🥈 GLM-5 - $1.21M avg (~$7.62/run)
  • 🥉 GPT-5.4 - $1.00M avg (~$23/run)
  • Everyone else - below starting capital of $200K. Several went bankrupt.

GLM-5 is the finding we keep coming back to. It's within 5% of Opus on raw performance and costs a fraction as much to run. For anyone building production agentic pipelines, the cost-efficiency curve here is real, and Kimi-K2.5 actually tops the revenue-per-API-dollar chart at 2.5× better than the next model.
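
If you want a crude proxy from the leaderboard numbers above, final funds per API dollar already tells the story (the revenue-per-API-dollar chart in the paper may be defined differently; this is just a back-of-envelope version):

```python
# Back-of-envelope proxy using the leaderboard numbers above.
runs = {
    "Claude Opus 4.6": (1_270_000, 86.00),  # (avg final funds, API cost per run)
    "GLM-5":           (1_210_000, 7.62),
    "GPT-5.4":         (1_000_000, 23.00),
}
for model, (funds, cost) in runs.items():
    print(f"{model}: ~${funds / cost:,.0f} of final funds per API dollar")
# GLM-5 lands around $159K per API dollar vs. ~$15K for Opus on this crude proxy.
```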

The benchmark exposes something most evals miss: long-horizon coherence under delayed feedback. When you can't tell immediately whether a decision was good, most models collapse into loops, abandon strategies they just wrote, or keep accepting tasks from clients they've already identified as bad.

The strongest predictor of success wasn't model size or benchmark score; it was whether the model actively used a persistent scratchpad to record what it learned. Top models rewrote their notes ~34 times per run. Bottom models averaged 0–2 entries.
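
Concretely, the pattern that worked looks something like this (hypothetical names, not the actual harness API):

```python
# Hypothetical sketch of the persistent-scratchpad pattern (illustrative only).
SCRATCHPAD_PATH = "scratchpad.md"

def read_notes() -> str:
    try:
        with open(SCRATCHPAD_PATH) as f:
            return f.read()
    except FileNotFoundError:
        return ""

def rewrite_notes(new_notes: str) -> None:
    # Top models rewrote (not just appended to) notes like
    # "Acme inflated scope twice; decline their next offer" ~34 times per run.
    with open(SCRATCHPAD_PATH, "w") as f:
        f.write(new_notes)

def build_prompt(observation: str) -> str:
    # Notes get re-injected every turn, so lessons survive even when the
    # full transcript no longer fits in context.
    return f"## Notes so far\n{read_notes()}\n\n## This turn\n{observation}"
```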

📄 Paper: https://arxiv.org/abs/2604.01212
🌐 Leaderboard: https://collinear-ai.github.io/yc-bench/
💻 Code (fully open-source): https://github.com/collinear-ai/yc-bench

Feel free to run your own models on it, and I'm happy to answer any questions!

submitted by /u/DreadMutant