We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local

Reddit r/LocalLLaMA / 5/2/2026


Key Points

  • The post reports that using Qwen3.6-27B with an agentic LangGraph/LangChain setup and local web search can achieve very high SimpleQA performance fully on a single RTX 3090 with an Ollama backend.
  • In the stated local benchmarks, Qwen3.6-27B reaches 95.7% SimpleQA (287/300) and 77.0% on xbench-DeepSearch (77/100), outperforming smaller Qwen3.5-9B in the same setup.
  • The author frames these results as agent + search performance rather than closed-book accuracy, and notes they are broadly comparable to public end-to-end agent systems like Perplexity Deep Research.
  • The evaluation includes self-grading by the same Qwen3.6-27B model and emphasizes caveats such as possible benchmark contamination on newer base models, judge noise, small sample sizes, and language bias in xbench-DeepSearch.
  • A key takeaway is that tool-calling quality and the LangGraph agent strategy may matter more than raw parameter count for local deep research workflows.

LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR has come a long way. I haven't posted in a while because I didn't think the project was ready for another prominent post in one of the leading outlets of local LLM research.

But I think LDR is finally there again, so it's time to report back.

Setup

  • RTX 3090, 24GB
  • Ollama backend (qwen3.6:27b)
  • LDR's langgraph_agent strategy — LangChain create_agent() with tool-calling, parallel subtopic decomposition, up to 50 iterations
  • LLM grader: qwen3.6:27b self-graded (I have used opus to review examples and it generally only underestimates accuracy)
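To make the strategy concrete, here is a toy sketch of the iterative tool-calling loop that a setup like this runs. This is not LDR's actual code; the model and search tool are stand-in stubs, and only the shape of the loop (repeated tool calls, capped at a 50-iteration budget) reflects the langgraph_agent strategy described above:

```python
# Toy sketch of an iterative tool-calling research loop. The real setup
# uses LangChain's create_agent() with qwen3.6:27b via Ollama; here both
# the model and the search tool are stubs so the loop shape is visible.
MAX_ITERATIONS = 50  # same cap as the langgraph_agent strategy

def fake_search(query):
    # Stand-in for a local web-search tool.
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "no results")

def fake_model(question, notes):
    # Stand-in for the LLM: requests one search, then answers from notes.
    if not notes:
        return {"tool_call": {"name": "search", "args": {"query": question}}}
    return {"answer": notes[-1].split(" is")[0]}

def run_agent(question, model=fake_model, tools=None):
    tools = tools or {"search": fake_search}
    notes = []
    for _ in range(MAX_ITERATIONS):
        step = model(question, notes)
        if "answer" in step:
            return step["answer"]
        call = step["tool_call"]
        notes.append(tools[call["name"]](**call["args"]))
    return "max iterations reached"

print(run_agent("capital of France"))  # -> Paris
```

The real strategy additionally decomposes the question into parallel subtopics and uses structured output, but the answer-or-call-a-tool loop above is the core pattern.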

Benchmarks (fully local LLM with web search)

| Model | SimpleQA | xbench-DeepSearch |
|---|---|---|
| Qwen3.6-27B | 95.7% (287/300) | 77.0% (77/100) |
| Qwen3.5-9B | 91.2% (182/200) | 59.0% (59/100) |
| gpt-oss-20B | 85.4% (295/346) | |

The sample size is small, but these are single runs (not cherry-picked from multiple reruns), and you can see from the spread between rows that this is unlikely to be just chance. Full leaderboard: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks
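One rough way to quantify the small-sample caveat is a binomial confidence interval on the counts from the table. A minimal stdlib sketch using the Wilson score interval (the counts are the reported ones; the interval math is standard, not part of LDR):

```python
# 95% Wilson score intervals for the reported SimpleQA counts, to put
# error bars on the small-sample scores from the table above.
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return the 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for name, k, n in [("Qwen3.6-27B", 287, 300), ("Qwen3.5-9B", 182, 200)]:
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> 95% CI [{lo:.1%}, {hi:.1%}]")
```

The two intervals barely overlap, which is consistent with the 27B model being genuinely ahead while still carrying real sampling uncertainty.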

Important framing — these are agent + search scores, not closed-book

Note, however, that these results are in the same range as Perplexity Deep Research (93.9%), Tavily (93.3%), etc. [Tavily forces the LLM to answer only from retrieved docs (a pure retrieval test); Perplexity Deep Research is an end-to-end agent and discloses no grader or sample size.]

Even if our results were only 90%, it would already be a great success.

I can also confirm from daily use that these results feel consistent with how the system performs on the random queries I ask day to day.

Caveats:

  • SimpleQA contamination risk on newer base models is real
  • LLM-judge noise + sampling error
  • xbench-DeepSearch is in Chinese, which is an advantage for the Chinese Qwen models
  • No BrowseComp / GAIA numbers yet - I also don't believe we are good at those benchmarks yet; I will have to run them to verify the current state

The thing that surprised me:

Results seem to track tool-calling quality more than raw size for local deep research. The langgraph_agent strategy hammers the model with multi-iteration tool calls, parallel subagent decomposition, and structured output — exactly the axis where the newer Qwen generations have improved most. Hypothesis only; if anyone wants to design an ablation we'd love the data.

Some cool LDR features that I want to additionally highlight:

  • Journal Quality System (shipped in v1.6.0) - academic source grading using OpenAlex and DOAJ. I haven't seen this anywhere else in the open-source deep-research space.
  • Per-user SQLCipher AES-256 DB (PBKDF2-HMAC-SHA512, 256k iterations) — admins can't read your data at rest. No password recovery; we don't hold the keys.
  • Zero telemetry: no analytics, no tracking.
  • Cosign-signed Docker images with SLSA provenance + SBOMs.
  • MIT licensed; everything is open source.
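The per-user encryption scheme described above can be sketched with Python's standard library. This is illustrative, not LDR's actual code: the function name and salt handling are assumptions, but the primitive (PBKDF2-HMAC-SHA512 at 256k iterations deriving a 256-bit key) matches the stated design:

```python
# Sketch of PBKDF2-HMAC-SHA512 key derivation for an AES-256 database
# cipher, matching the parameters stated above (256,000 iterations).
import hashlib
import os

def derive_db_key(password: str, salt: bytes, iterations: int = 256_000) -> bytes:
    # dklen=32 yields a 256-bit key suitable for AES-256 / SQLCipher.
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt, iterations, dklen=32)

salt = os.urandom(16)          # stored alongside the DB, never the key itself
key = derive_db_key("correct horse battery staple", salt)
print(len(key) * 8)            # -> 256
```

Because the key exists only as a function of the user's password and a per-user salt, an admin holding the database file and salt still cannot decrypt it, which is also why there is no password recovery.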

Repo: https://github.com/LearningCircuit/local-deep-research

Happy to share strategy configs and to help reproduce the Qwen runs.

Thanks to all the academic and other open source foundational work that made this repo possible.

submitted by /u/ComplexIt