We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local

Reddit r/LocalLLaMA / 5/2/2026


Key Points

  • The post reports that using Qwen3.6-27B with an agentic LangGraph/LangChain setup and local web search can achieve very high SimpleQA performance fully on a single RTX 3090 with an Ollama backend.
  • In the stated local benchmarks, Qwen3.6-27B reaches 95.7% SimpleQA (287/300) and 77.0% on xbench-DeepSearch (77/100), outperforming smaller Qwen3.5-9B in the same setup.
  • The author frames these results as agent + search performance rather than closed-book accuracy, and notes they are broadly comparable to public end-to-end agent systems like Perplexity Deep Research.
  • The evaluation includes self-grading by the same Qwen3.6-27B model and emphasizes caveats such as possible benchmark contamination on newer base models, judge noise, small sample sizes, and language bias in xbench-DeepSearch.
  • A key takeaway is that tool-calling quality and the LangGraph agent strategy may matter more than raw parameter count for local deep research workflows.

LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR has come a long way. I haven't posted in a while because I didn't think the project was ready for another prominent post in one of the leading outlets of local LLM research.

But I think LDR is finally there again, so it's time to report back.

Setup

  • RTX 3090, 24GB
  • Ollama backend (qwen3.6:27b)
  • LDR's langgraph_agent strategy — LangChain create_agent() with tool-calling, parallel subtopic decomposition, up to 50 iterations
  • LLM grader: qwen3.6:27b self-graded (I have used opus to review examples and it generally only underestimates accuracy)
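To make the strategy concrete, here is a toy sketch of the iterative tool-calling loop that a setup like this runs. This is not LDR's actual code; the model and search tool are stand-in stubs, and only the shape of the loop (repeated tool calls, capped at a 50-iteration budget) reflects the langgraph_agent strategy described above:

```python
# Toy sketch of an iterative tool-calling research loop. The real setup
# uses LangChain's create_agent() with qwen3.6:27b via Ollama; here both
# the model and the search tool are stubs so the loop shape is visible.
MAX_ITERATIONS = 50  # same cap as the langgraph_agent strategy

def fake_search(query):
    # Stand-in for a local web-search tool.
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "no results")

def fake_model(question, notes):
    # Stand-in for the LLM: requests one search, then answers from notes.
    if not notes:
        return {"tool_call": {"name": "search", "args": {"query": question}}}
    return {"answer": notes[-1].split(" is")[0]}

def run_agent(question, model=fake_model, tools=None):
    tools = tools or {"search": fake_search}
    notes = []
    for _ in range(MAX_ITERATIONS):
        step = model(question, notes)
        if "answer" in step:
            return step["answer"]
        call = step["tool_call"]
        notes.append(tools[call["name"]](**call["args"]))
    return "max iterations reached"

print(run_agent("capital of France"))  # -> Paris
```

The real strategy additionally decomposes the question into parallel subtopics and uses structured output, but the answer-or-call-a-tool loop above is the core pattern.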

Benchmarks (fully local LLM with web search)

| Model | SimpleQA | xbench-DeepSearch |
|---|---|---|
| Qwen3.6-27B | 95.7% (287/300) | 77.0% (77/100) |
| Qwen3.5-9B | 91.2% (182/200) | 59.0% (59/100) |
| gpt-oss-20B | 85.4% (295/346) | |

The sample size is small, but these are single runs (not cherry-picked from multiple reruns), and you can see from the spread between rows that this is unlikely to be just chance. Full leaderboard: https://huggingface.co/datasets/local-deep-research/ldr-benchmarks
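One rough way to quantify the small-sample caveat is a binomial confidence interval on the counts from the table. A minimal stdlib sketch using the Wilson score interval (the counts are the reported ones; the interval math is standard, not part of LDR):

```python
# 95% Wilson score intervals for the reported SimpleQA counts, to put
# error bars on the small-sample scores from the table above.
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Return the 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for name, k, n in [("Qwen3.6-27B", 287, 300), ("Qwen3.5-9B", 182, 200)]:
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> 95% CI [{lo:.1%}, {hi:.1%}]")
```

The two intervals barely overlap, which is consistent with the 27B model being genuinely ahead while still carrying real sampling uncertainty.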

Important framing — these are agent + search scores, not closed-book

Note, however, that these results are in the same range as Perplexity Deep Research (93.9%), Tavily (93.3%), etc. [Tavily forces the LLM to answer only from retrieved docs (a pure retrieval test); Perplexity Deep Research is an end-to-end agent and discloses no grader or sample size.]

Even if our results were only 90%, it would already be a great success.

I can also confirm from daily use that these results feel consistent with how the system performs on the random queries I ask day to day.

Caveats:

  • SimpleQA contamination risk on newer base models is real
  • LLM-judge noise + sampling error
  • xbench-DeepSearch is in Chinese, which is an advantage for the Chinese Qwen models
  • No BrowseComp / GAIA numbers yet - I also don't believe we are good at those benchmarks yet; I will have to run them to verify the current state

The thing that surprised me:

Results seem to track tool-calling quality more than raw size for local deep research. The langgraph_agent strategy hammers the model with multi-iteration tool calls, parallel subagent decomposition, and structured output — exactly the axis where the newer Qwen generations have improved most. Hypothesis only; if anyone wants to design an ablation we'd love the data.

Some cool LDR features that I want to additionally highlight:

  • Journal Quality System (shipped in v1.6.0) - academic source grading using OpenAlex and DOAJ. I haven't seen this anywhere else in the open-source deep-research space.
  • Per-user SQLCipher AES-256 DB (PBKDF2-HMAC-SHA512, 256k iterations) — admins can't read your data at rest. No password recovery; we don't hold the keys.
  • Zero telemetry: no analytics, no tracking.
  • Cosign-signed Docker images with SLSA provenance + SBOMs.
  • MIT licensed; everything is open source.
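The per-user encryption scheme described above can be sketched with Python's standard library. This is illustrative, not LDR's actual code: the function name and salt handling are assumptions, but the primitive (PBKDF2-HMAC-SHA512 at 256k iterations deriving a 256-bit key) matches the stated design:

```python
# Sketch of PBKDF2-HMAC-SHA512 key derivation for an AES-256 database
# cipher, matching the parameters stated above (256,000 iterations).
import hashlib
import os

def derive_db_key(password: str, salt: bytes, iterations: int = 256_000) -> bytes:
    # dklen=32 yields a 256-bit key suitable for AES-256 / SQLCipher.
    return hashlib.pbkdf2_hmac("sha512", password.encode(), salt, iterations, dklen=32)

salt = os.urandom(16)          # stored alongside the DB, never the key itself
key = derive_db_key("correct horse battery staple", salt)
print(len(key) * 8)            # -> 256
```

Because the key exists only as a function of the user's password and a per-user salt, an admin holding the database file and salt still cannot decrypt it, which is also why there is no password recovery.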

Repo: https://github.com/LearningCircuit/local-deep-research

Happy to share strategy configs and to help reproduce the Qwen runs.

Thanks to all the academic and other open source foundational work that made this repo possible.

submitted by /u/ComplexIt