I recently posted about using Apple's on-device 3B model to build a shell assistant and benchmarking accuracy. I used the bare model and dynamic retrieval.
As a next step, I wanted to explore training a LoRA adapter for the model, partly to see if it improves accuracy, but mostly to understand what Apple's training toolkit looks like and whether you can get it running on accessible hardware. Sharing here in case anyone's interested.
Apple ships a Python training toolkit with a 12GB checkpoint. Standard LoRA needs ~24GB of RAM just to load the model and ~15GB of GPU memory for training; my 24GB Mac OOM'd. So I built a custom QLoRA pipeline: memory-mapped loading plus 4-bit quantization drops that to ~1GB of RAM and ~5GB of GPU memory. It runs on a free Colab T4 or locally on a 24GB Mac.
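To illustrate the two tricks (this is a toy sketch, not Apple's toolkit or the actual bitsandbytes kernels): load the checkpoint memory-mapped so the full tensor never sits in RAM, then quantize weights blockwise with absmax scaling to 4-bit integers. Real QLoRA uses an NF4 codebook instead of plain int4, but the memory math is the same.

```python
import os
import tempfile

import numpy as np

def quantize_4bit_block(block: np.ndarray):
    """Absmax 4-bit quantization: map floats onto 16 signed levels [-8, 7]."""
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # all-zero block: any scale reconstructs exactly
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Write a fake weight shard to disk, then open it memory-mapped:
# reads are lazy, so loading "costs" almost no RAM.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.random.randn(4096, 64).astype(np.float32))
weights = np.load(path, mmap_mode="r")

# Quantize one 64-element block; error is bounded by half a quantization step.
q, scale = quantize_4bit_block(np.asarray(weights[0]))
recon = dequantize_block(q, scale)
err = float(np.abs(recon - weights[0]).max())
```

Storing 4-bit codes plus one float scale per block is what takes the 12GB checkpoint down to the ~1GB/~5GB footprint range; the LoRA adapter itself stays in full precision on top of the frozen quantized base.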
For Mac training specifically: bitsandbytes just merged native Metal kernels (PR #1875, not in a release yet, so you have to install from git). That makes local training ~2x faster than the CPU fallback. Still ~4x slower than a T4, but fully local, no uploads.
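If you want to try the Metal path before a release lands, the install looks roughly like this (a sketch, not a tested recipe: pip will build from source, and depending on your toolchain you may need a manual CMake build per the bitsandbytes docs instead):

```shell
# Install bitsandbytes from the main branch to pick up the merged
# Metal kernels (PR #1875), which haven't shipped in a PyPI release yet.
pip install git+https://github.com/bitsandbytes-foundation/bitsandbytes.git
```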
All three paths produce equivalent adapters: A100 LoRA, T4 QLoRA, and Mac QLoRA all land at the same accuracy within noise. Even with a minimal training set, adapters improve the bare model from ~40% to ~75%, and to ~86% when combined with retrieval. I haven't optimized the training data yet, so there's likely more headroom.
**Bug worth knowing about:** The adapter framework silently writes a ~160MB copy of the adapter to a SIP-protected cache on every call from CLI tools, and never cleans it up. I hit 269GB of invisible disk usage over ~300 benchmark runs, visible only from Recovery Mode. Apple confirmed the bug.
Training notebooks, QLoRA scripts, and MPS setup instructions are all in the repo if anyone's interested: https://github.com/es617/hunch/tree/main/training
Read more: https://es617.dev/2026/04/19/training-apple-on-device-llm.html