AI Navigate

I Ran Kotlin HumanEval on 11 Local LLMs. An 8GB Model Beat Several 30B Models

Reddit r/LocalLLaMA / 3/15/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author ran JetBrains' Kotlin HumanEval on 11 local LLMs, including some that fit on a 16 GB VRAM GPU.
  • In the results, GPT-OSS 20B achieved pass@1 of 85% and pass@3 of 95%, Qwen3.5-35B-a3b 77% / 86%, EssentialAI RNJ-1 75% / 81% (8.8 GB), Seed-OSS-36B 74% / 81%, and GLM 4.7 Flash 68% / 78%.
  • GPT-OSS 20B dominates pass@1 despite being a relatively small model (~12 GB), while RNJ-1 at 8.8 GB placed third, beating models two to three times larger.
  • Qwen improved by 18 points in seven months, signaling rapid progress among local LLMs.

TLDR: I ran JetBrains' Kotlin HumanEval on 11 local models, including some small ones that fit on a 16 GB VRAM GPU. Here are the results.

  • pass@1 / pass@3:
    • GPT-OSS 20B: 85% / 95%
    • Qwen3.5-35B-a3b: 77% / 86%
    • EssentialAI RNJ-1: 75% / 81% ← 8.8 GB file size
    • Seed-OSS-36B: 74% / 81%
    • GLM 4.7 Flash: 68% / 78%
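For readers unfamiliar with the metric: pass@1 / pass@3 figures like those above are conventionally computed with the unbiased estimator from the original HumanEval paper (1 − C(n−c, k)/C(n, k), where n samples are generated per task and c of them pass the tests). The post doesn't state its exact harness, so this is a generic sketch, not the author's setup:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total completions sampled per task
    c: number of those completions that pass the unit tests
    k: evaluation budget (e.g. 1 or 3)
    """
    if n - c < k:
        # Too few failures left to fill a k-sample draw: pass@k is certain.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 3 samples per task and 1 passing, pass@1 = 1/3 and pass@3 = 1.0;
# the benchmark score averages this per-task estimate over all tasks.
```

The estimator corrects for the bias of naively reporting "at least one of k samples passed", which overestimates pass@k when k equals the number of samples drawn.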

A few things I found interesting:

  • GPT-OSS 20B still dominates at 85% pass@1, despite being one of the smaller models by file size (12 GB)
  • EssentialAI RNJ-1 at 8.8 GB took third place overall, beating models 2-3x its size
  • Qwen jumped 18 points in seven months

Happy to answer questions about the setup.

submitted by /u/codeforlyfe