TLDR: I ran JetBrains' Kotlin HumanEval on 11 local models, including some small ones that fit on a 16 GB VRAM GPU. Here are the results.
- pass@1 / pass@3:
    - GPT-OSS 20B: 85% / 95%
    - Qwen3.5-35B-a3b: 77% / 86%
    - EssentialAI RNJ-1: 75% / 81% ← 8.8 GB file size
    - Seed-OSS-36B: 74% / 81%
    - GLM 4.7 Flash: 68% / 78%
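For anyone unfamiliar with the metric: pass@k is usually computed with the unbiased estimator from the original HumanEval paper, averaged over tasks. A minimal sketch in Python (the per-task sample counts below are hypothetical, not my actual run data):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval paper):
    n = samples generated per task, c = samples that passed the tests,
    k = evaluation budget. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sized draw with only failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task (n, c) counts with 3 samples per task:
tasks = [(3, 3), (3, 1), (3, 0)]
pass1 = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
pass3 = sum(pass_at_k(n, c, 3) for n, c in tasks) / len(tasks)
```

With 3 samples per task, pass@3 just asks whether any sample passed, while pass@1 is the expected success rate of a single draw.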
A few things I found interesting:
- GPT-OSS 20B still leads comfortably at 85% pass@1, despite being one of the smaller models by file size (12 GB)
- EssentialAI RNJ-1 at 8.8 GB took third place overall, beating models 2-3x its size
- Qwen jumped 18 points in seven months
Happy to answer questions about the setup.