MiniMax just dropped M2.7, their best model yet. I work with the Kilo Code team, and we always test new models when they come out, so we ran M2.7 against Qwen3.5-plus, GLM-5, Kimi K2.5, and Qwen3.5-397b across two benchmarks.
TL;DR: M2.7 scores 86.2% on PinchBench, placing 5th overall and within 1.2 points of Claude Opus 4.6. On Kilo Bench, it passes 47% of tasks with a distinct behavioral profile: it may over-explore hard problems (which can lead to timeouts), but it also solves tasks that no other model can. It's a fast, affordable model that fills gaps frontier models miss.

PinchBench: #5 Out of 50 Models

PinchBench runs standardized OpenClaw agent tasks and grades them via automated checks and an LLM judge. M2.7 scored 86.2%, landing just behind GLM-5 and GPT-5.4 (both 86.4%) and just ahead of Qwen3.5-plus (85.8%). What's notable is the jump from M2.5 (82.5%) to M2.7 (86.2%): a 3.7-point improvement that moved MiniMax from the middle of the pack into the top tier.

Kilo Bench: 89 Tasks vs 5 Other Models

M2.7 came in second overall at 47%, two points behind Qwen3.5-plus. But the raw pass rate doesn't tell the full story. One pattern stood out: MiniMax-M2.7 reads extensively before writing. It pulls in surrounding files, analyzes dependencies, and traces call chains. On tasks where that extra context pays off, it catches things other models miss. On tasks where the clock is ticking, that same habit can make it run out of time.

Where M2.7 Stands Out

The most interesting finding from Kilo Bench isn't the pass rate; it's what each model uniquely solves. Every model in this comparison solved tasks that no other model could. M2.7's unique win on the SPARQL task is a good example of its strength: the task required understanding that an EU-country filter was an eligibility criterion, not an output filter. That's a reasoning distinction, not a coding one. A hypothetical oracle that picks the best model per task would solve 60 of the 89 tasks (67%), a 36% improvement over the best single model. These models aren't interchangeable; they're complementary. The 89 tasks split into clear tiers.
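The oracle figure above is simple set arithmetic: a task counts as solved if any model in the pool solves it. A minimal sketch, using made-up per-model result sets (the post doesn't publish the per-task data):

```python
def oracle_pass_rate(solved_by_model: dict, total_tasks: int) -> float:
    """Fraction of tasks solved by at least one model in the pool."""
    union = set().union(*solved_by_model.values())
    return len(union) / total_tasks

# Hypothetical solved-task sets for illustration only.
results = {
    "M2.7":         {1, 2, 3, 5, 8},
    "Qwen3.5-plus": {1, 2, 4, 6, 8, 9},
    "GLM-5":        {2, 3, 7},
}

rate = oracle_pass_rate(results, total_tasks=10)
# Union is {1..9}, so the oracle hits 9/10 = 0.9,
# versus 6/10 = 0.6 for the best single model.
```

Same idea at benchmark scale: the union over all five models covers 60 of 89 tasks, well above any single model's count, which is why the post calls the models complementary rather than interchangeable.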
Token Efficiency

Based on both benchmarks, here's how M2.7 fits into the model landscape available in Kilo:

M2.7 is a strong pick when you're working on tasks that reward deep context gathering: complex refactors, codebase-wide changes, or anything where understanding the surrounding code matters more than speed. Its PinchBench score puts it in the same tier as GPT-5.4 and GLM-5 for general agent tasks. Compared to frontier models like Opus 4.6 and GPT-5.4 that offer similar capabilities, it's much less expensive at $0.30/M input tokens and $1.20/M output tokens.

Consider a different model (such as M2.1 or M2.5) when you need very fast iteration cycles or are working on well-scoped, time-sensitive tasks. M2.7's median task duration (355s) is notably longer than its predecessors'.

Full analysis: https://blog.kilo.ai/p/minimax-m27
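The quoted rates make per-task cost estimates easy to sanity-check. A small sketch; the token counts below are hypothetical (a context-heavy task for a read-everything model like M2.7), only the per-million prices come from the post:

```python
def task_cost_usd(input_tokens: int, output_tokens: int,
                  in_price_per_m: float = 0.30,
                  out_price_per_m: float = 1.20) -> float:
    """Estimate task cost in USD given prices per million tokens."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

# Hypothetical context-heavy task: 400k tokens read, 20k tokens written.
cost = task_cost_usd(400_000, 20_000)
# 0.4 * $0.30 + 0.02 * $1.20 = $0.12 + $0.024 ≈ $0.144
```

Even a task that pulls in hundreds of thousands of context tokens stays in the cents range at these rates, which is the affordability argument in a nutshell.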
Benchmarked MiniMax M2.7 through 2 benchmarks. Here's how it did
Reddit r/LocalLLaMA / 3/19/2026
Key Points
- MiniMax M2.7 was released and benchmarked against several competitors (Qwen3.5-plus, GLM-5, Kimi K2.5, and Qwen3.5-397b) across PinchBench and Kilo Bench.
- In PinchBench, M2.7 scored 86.2%, ranking 5th out of 50 models and trailing GLM-5 and GPT-5.4 but narrowly ahead of Qwen3.5-plus; the result marks a 3.7-point jump from M2.5.
- In Kilo Bench, M2.7 passed 47% of tasks, finishing second overall behind Qwen3.5-plus, with a distinct behavioral profile: it tends to over-explore hard problems, which can cause timeouts.
- Qualitatively, M2.7 reads extensively before writing, pulling in surrounding files and dependencies to solve tasks others miss, which can be advantageous on complex problems but slower under tight time constraints.
- Overall, M2.7 is fast and affordable, filling gaps that frontier models miss and moving MiniMax into the top tier of evaluated models.