Dynamic few-shot retrieval on Apple's on-device 3B LLM: 40% → 70%+ on shell commands

Reddit r/LocalLLaMA / 4/10/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • An experiment using Apple’s on-device ~3B LLM for shell-command generation found baseline accuracy around ~40%, with documentation-style context not improving results and self-critique sometimes reducing accuracy.
  • The biggest improvement came from dynamic few-shot retrieval: pulling relevant examples from a ~21k community TLDR corpus via FTS5 and presenting them as “solved examples to copy,” raising accuracy to ~70%+ at ~0.5 seconds per query.
  • Accuracy scaled with the retrieved “bank” size and curation; the author reports further gains to ~78% using custom overrides, suggesting retrieval quality is a key lever for small on-device models.
  • Adding self-consistency (multiple samples + majority vote) and layering CoT on top of retrieval greatly increased latency (~3x) but produced minimal accuracy gains overall, with self-consistency mainly reducing variance.
  • The author notes Apple supports LoRA adapters on FoundationModels as a potential next step, but emphasizes that context framing can be more important than the underlying text for this task.

I've been poking at Apple's on-device 3B model (via FoundationModels on Tahoe) to see where its ceiling sits on code-adjacent tasks. I tested shell command generation as a concrete benchmark (100 prompts, ~10 approaches).

https://i.redd.it/ferxmyorh7ug1.gif

Bare model: ~40% correct. Mostly wrong flags and some hallucinated commands. Feeding documentation as context didn't help: not man pages, not tldr pages as docs, not self-critique loops. All were within noise of baseline, and self-critique was actively worse (33%); the model "fixes" correct commands into wrong ones.

What worked: dynamic few-shot retrieval from tldr's 21k community examples via FTS5. Same corpus, reframed as solved examples to copy from instead of reference material. On a clean held-out set: ~70% at ~0.5s per query. That's a 30-point jump from reframing alone. Accuracy scales with bank size, so more or better-curated examples should push it further (I got it up to 78% with custom overrides).
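For anyone who wants the shape of the retrieval step, here's a minimal sketch (not my actual code; the table layout and tiny corpus are illustrative, and it assumes your Python's sqlite3 was built with FTS5, which is the default in most builds):

```python
import sqlite3

# Illustrative stand-in for the ~21k tldr (description, command) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE examples USING fts5(description, command)")
corpus = [
    ("compress a directory into a tar.gz archive", "tar -czvf archive.tar.gz dir/"),
    ("extract a tar.gz archive", "tar -xzvf archive.tar.gz"),
    ("list files sorted by size", "ls -lS"),
]
conn.executemany("INSERT INTO examples VALUES (?, ?)", corpus)

def retrieve(query: str, k: int = 3) -> list[tuple[str, str]]:
    """Top-k matching examples, best first (FTS5's built-in BM25 ranking)."""
    # Quote each token so punctuation in the query can't break FTS5 syntax.
    fts_query = " OR ".join(f'"{t}"' for t in query.split())
    return conn.execute(
        "SELECT description, command FROM examples"
        " WHERE examples MATCH ? ORDER BY rank LIMIT ?",
        (fts_query, k),
    ).fetchall()
```

The retrieved pairs then get pasted into the prompt as few-shot examples; nothing fancier than that is needed to see the jump.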

I also tested self-consistency (temp 0.3, 3 samples, majority vote) and CoT on top of retrieval. Both were ~3x slower and neither moved accuracy much, but SC crushed variance across runs. Probably worth exploring further.
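The self-consistency wrapper is tiny; a sketch (assuming `generate` is whatever callable hits the model, with temperature handled inside it):

```python
from collections import Counter

def self_consistency(generate, prompt: str, n: int = 3) -> str:
    """Sample n completions and return the majority answer.

    `generate` is any prompt -> command callable; sampling temperature
    (0.3 in my runs) is assumed to live inside it. Counter.most_common
    breaks ties by first-seen order, so a 1/1/1 split returns sample #1.
    """
    samples = [generate(prompt).strip() for _ in range(n)]
    return Counter(samples).most_common(1)[0][0]
```

This is where the ~3x latency comes from: n model calls per query for what amounts to variance reduction.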

Haven't tried finetuning yet. Apple allows LoRA adapters on FoundationModels, so that's the obvious next lever, though it complicates distribution.

Takeaway: for small on-device models, how you frame the context matters more than what's in it. Same 21k strings, 30+ point gap depending on whether they're presented as docs or examples. Curious if others have seen the same split on Qwen 3B / Gemma 2B / Phi-3.
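To make the framing split concrete, the two prompt shapes looked roughly like this (illustrative, not my exact templates):

```python
def as_docs(hits: list[tuple[str, str]], task: str) -> str:
    """'Reference material' framing: stayed within noise of baseline for me."""
    body = "\n".join(f"- {desc}: {cmd}" for desc, cmd in hits)
    return f"Documentation:\n{body}\n\nWrite a shell command for: {task}"

def as_examples(hits: list[tuple[str, str]], task: str) -> str:
    """'Solved examples to copy' framing: the one behind the ~30-point jump."""
    body = "\n".join(f"Task: {desc}\nCommand: {cmd}" for desc, cmd in hits)
    return f"{body}\nTask: {task}\nCommand:"
```

Same retrieved strings either way; only the surrounding scaffold changes, and the model completes the `Command:` pattern instead of "reading" docs.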

Full writeup with everything I tried: https://es617.dev/2026/04/08/apple-on-device-llm-shell.html

Repo with the CLI and benchmark data, if anyone wants to play with it: https://github.com/es617/hunch

submitted by /u/es617_dev