We built YC-Bench, a benchmark where an LLM plays CEO of a simulated startup over a full year (~hundreds of turns). It manages employees, picks contracts, handles payroll, and survives a market where ~35% of clients secretly inflate work requirements after you accept their task. Feedback is delayed and sparse, with no hand-holding. 12 models, 3 seeds each. Here's the leaderboard:

GLM-5 is the finding we keep coming back to. It's within 5% of Opus on raw performance and costs a fraction as much to run. For anyone building production agentic pipelines, the cost-efficiency curve here is real, and Kimi-K2.5 actually tops the revenue-per-API-dollar chart at 2.5× better than the next model.

The benchmark exposes something most evals miss: long-horizon coherence under delayed feedback. When a model can't tell immediately whether a decision was good, most models collapse into loops, abandon strategies they just wrote down, or keep accepting tasks from clients they've already identified as bad. The strongest predictor of success wasn't model size or benchmark score, but whether the model actively used a persistent scratchpad to record what it learned. Top models rewrote their notes ~34 times per run; bottom models averaged 0–2 entries.

📄 Paper: https://arxiv.org/abs/2604.01212

Feel free to run any of your models, and I'm happy to answer questions!
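The "persistent scratchpad" behavior the post credits for success could be sketched roughly as follows. This is a minimal illustration of the pattern, not code from YC-Bench; the class, prompt layout, and `ACTION:`/`NOTES:` reply convention are all assumptions.

```python
# Hypothetical sketch of the persistent-scratchpad pattern: the agent re-reads
# its notes every turn and rewrites them (rather than only appending), so
# lessons like "client X inflates scope" survive a long horizon of sparse
# feedback. Names and prompt format are illustrative, not from YC-Bench.

class ScratchpadAgent:
    def __init__(self):
        self.notes = ""  # persists across all turns of the run

    def act(self, observation: str, llm) -> str:
        # The prompt always includes the full scratchpad, not just recent chat
        # history, so old conclusions stay in context.
        prompt = (
            f"Scratchpad (your long-term notes):\n{self.notes}\n\n"
            f"Current state:\n{observation}\n\n"
            "Reply with an ACTION: line, then a NOTES: section containing "
            "your fully rewritten scratchpad."
        )
        reply = llm(prompt)
        action, _, new_notes = reply.partition("NOTES:")
        if new_notes.strip():
            self.notes = new_notes.strip()  # rewrite, don't just append
        return action.replace("ACTION:", "").strip()
```

The key design choice, per the post's finding, is that top models *rewrote* the notes (~34 times per run) instead of treating them as an append-only log, which keeps stale or superseded conclusions from accumulating.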
We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost.
Reddit r/LocalLLaMA / 4/4/2026
Key Points
- The article describes YC-Bench, a benchmark where an LLM runs a simulated startup over roughly a year with hundreds of turns, managing employees, contracts, payroll, and delayed/sparse feedback plus adversarial clients who inflate requirements after acceptance.
- In a test of 12 LLMs across 3 seeds each, GLM-5 nearly matched Claude Opus 4.6 in final average funds (about $1.21M vs. $1.27M) while costing ~11× less per run ($7.62 vs. $86 in API cost).
- Most other models performed worse and several went bankrupt, suggesting that long-horizon planning and robustness under uncertainty matter beyond headline benchmark scores.
- The benchmark highlights that long-horizon coherence under delayed feedback is often missing from existing evaluations, where many models fall into loops, abandon strategies, or keep accepting bad tasks.
- A key predictor of success was whether the model used a persistent scratchpad to record what it learned; top performers rewrote notes ~34 times per run, while bottom models recorded almost nothing.
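The headline cost and performance ratios in the key points can be sanity-checked from the figures quoted above (taking the summary's numbers as given):

```python
# Sanity-check the summary's headline ratios from its own figures.
opus_cost, glm_cost = 86.00, 7.62        # API cost per run, USD
opus_funds, glm_funds = 1.27e6, 1.21e6   # final average funds, USD

cost_ratio = opus_cost / glm_cost        # ≈ 11.3, matching the "~11× less" claim
perf_gap = 1 - glm_funds / opus_funds    # ≈ 0.047, i.e. within ~5% of Opus
```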