Training LoRA adapters for Apple's on-device 3B model on a free Colab T4 and a Mac

Reddit r/LocalLLaMA / 4/21/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The author explores training LoRA adapters for Apple’s on-device 3B model to improve accuracy and to understand Apple’s training toolkit requirements on consumer hardware.
  • Standard LoRA training proved too memory-heavy for their 24GB Mac, so they built a custom QLoRA pipeline using memory-mapped loading and 4-bit quantization to fit within ~1GB of RAM and ~5GB of GPU memory on a free Colab T4 or a local Mac.
  • LoRA training conducted on an A100, a T4 (QLoRA), and a 24GB Mac (QLoRA) produced adapters with equivalent accuracy, with even a minimal dataset raising accuracy from ~40% to ~75% (and ~86% when combined with retrieval).
  • They report a critical bug: the adapter framework silently writes a ~160MB copy of the adapter to a SIP-protected cache on every CLI call without cleanup, causing massive hidden disk usage that was only discovered via Recovery Mode (Apple confirmed the issue).
  • Training notebooks and setup instructions are published in a GitHub repository for others to reproduce the process on accessible hardware.

I recently posted about using Apple's on-device 3B model to build a shell assistant and benchmark its accuracy. That setup used the bare model plus dynamic retrieval.

As a next step, I wanted to explore training a LoRA adapter for the model, partly to see if it improves accuracy, but mostly to understand what Apple's training toolkit looks like and whether you can get it running on accessible hardware. Just wanted to share here in case anyone is interested.

Apple ships a Python training toolkit with a 12GB checkpoint. Standard LoRA needs ~24GB of RAM just to load the model and ~15GB of GPU memory for training. My 24GB Mac OOM'd. So I created a custom QLoRA pipeline: memory-mapped loading + 4-bit quantization drops it to ~1GB of RAM and ~5GB of GPU memory. It runs on a free Colab T4 or locally on a 24GB Mac.
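For anyone unfamiliar with the general pattern, here's a minimal QLoRA-style sketch using the Hugging Face stack (transformers + peft + bitsandbytes) rather than Apple's toolkit; the model name, target modules, and LoRA hyperparameters are placeholders, not the actual configuration from the repo:

```python
# Minimal QLoRA sketch: 4-bit quantized base model + low-rank adapters.
# Illustrative only -- Apple's checkpoint needs its own loader; names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "your-3b-checkpoint",                   # placeholder model identifier
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```

The point of the 4-bit base + trainable adapters split is that the frozen weights dominate memory, so quantizing them is what gets you from ~24GB down into the few-GB range.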

For Mac training specifically: bitsandbytes just merged native Metal kernels (PR #1875; not in a release yet, so install from git). That makes local training ~2x faster than the CPU fallback. It's still ~4x slower than a T4, but fully local with no uploads.
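Before kicking off a local run, a quick sanity check that PyTorch can actually see the Metal backend saves a silent fall-back to CPU (this is just standard PyTorch, not part of the repo's scripts):

```python
# Verify the Metal (MPS) backend is available before training locally.
# The Metal support in bitsandbytes mentioned above is unreleased, so it must be installed from git.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
    print("Training on Apple GPU via MPS")
else:
    device = torch.device("cpu")
    print("Falling back to CPU (expect much slower training)")
```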

All three paths (A100 LoRA, T4 QLoRA, Mac QLoRA) produce equivalent adapters, with the same accuracy within noise. Even with a minimal training set, adapters improve the bare model from ~40% to ~75%, and to ~86% when combined with retrieval. I haven't optimized the training data yet, so there's likely more headroom.

**Bug worth knowing about:** The adapter framework silently writes a ~160MB copy of the adapter to a SIP-protected cache on every call from CLI tools, with no cleanup. I hit 269GB of hidden disk usage over ~300 benchmark runs, visible only from Recovery Mode. Apple confirmed the bug.

Training notebooks, QLoRA scripts, and MPS setup instructions are all in the repo if anyone is interested: https://github.com/es617/hunch/tree/main/training

Read more: https://es617.dev/2026/04/19/training-apple-on-device-llm.html

submitted by /u/es617_dev