TinyLoRA shows LoRA training works at 13 parameters + my own experiments to verify the claims

Reddit r/LocalLLaMA / 3/30/2026


Key Points

  • The “tinylora” work argues that meaningful LoRA fine-tuning can alter model behavior using only about 13 trainable parameters, and the author reports successfully replicating the paper’s results.
  • In the author’s replication on Qwen3.5, increasing rank/global parameter count can hurt convergence, suggesting there is a narrow regime where tiny parameterization remains optimizable.
  • The author finds improvements by giving attention layers and MLP layers their own shared 13-parameter sets (26 total: 13 for all attention, 13 for all MLP), which outperforms a single shared global 13-parameter approach.
  • They propose further experiments comparing global (shared across many layers) versus local (per-layer) parameter optimization, potentially using 2–6 parameters per layer to better target layer-specific adjustments.
  • The author suggests these tiny adapters are less suitable for memorizing facts but may be effective at steering behavior, and they hint at a “behavior lookup table” concept analogous to DeepSeek’s engram idea but implemented as a library of LoRA adapters.

The tinylora paper shows that we can alter model behavior with only a few parameters.

https://arxiv.org/pdf/2602.04118

I tried replicating the paper and made a TinyLoRA implementation for Qwen3.5, and it does work, which is crazy to think about. I got the same results as the paper: for example, increasing the rank just made the optimization space too large for training to converge correctly.
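To make the setup concrete, here is a minimal sketch of one plausible tiny-LoRA parameterization: frozen random low-rank bases, with only a 13-element scale vector trained and shared across layers. The exact construction in the paper may differ; the bases, shapes, and toy hidden size here are my own assumptions.

```python
import numpy as np

RANK = 13   # the ~13 trainable parameters shared across layers
d = 64      # toy hidden size (assumption, not the paper's)
rng = np.random.default_rng(0)

# Frozen random low-rank bases (hypothetical parameterization).
A = rng.standard_normal((RANK, d)) / np.sqrt(d)     # frozen down-projection
B = rng.standard_normal((d, RANK)) / np.sqrt(RANK)  # frozen up-projection

# The only trainable parameters: one scale per rank direction,
# shared by every adapted layer in the model.
g = np.zeros(RANK)

def lora_delta(x):
    # delta(x) = B @ diag(g) @ A @ x  -- 13 trainable scalars in total
    return B @ (g * (A @ x))

x = rng.standard_normal(d)
print(lora_delta(x).shape)  # (64,)
```

With `g` initialized to zero the adapter starts as a no-op, which mirrors standard LoRA initialization; training then only has a 13-dimensional space to search.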

What did improve it was giving the MLP and attention layers their own shared 13 parameters to adjust, i.e. all MLP layers share one set of 13 parameters and all attention layers share another set of 13, for a total of 26. That was better than increasing the number of global parameters overall, or having a single global set of 13 parameters like in the paper.
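The split-by-type variant can be sketched like this: two independent 13-parameter groups, with each sublayer looking up the group for its type. The frozen-basis construction and toy layer layout are assumptions carried over from the sketch above.

```python
import numpy as np

RANK, d = 13, 64
rng = np.random.default_rng(0)

# Two independent 13-parameter groups: one shared by every attention
# sublayer, one by every MLP sublayer (26 trainable scalars total).
groups = {
    "attn": np.zeros(RANK),
    "mlp": np.zeros(RANK),
}

# Frozen bases, one pair per adapted sublayer (toy: 2 attn + 2 MLP).
layers = []
for i in range(4):
    kind = "attn" if i % 2 == 0 else "mlp"
    A = rng.standard_normal((RANK, d)) / np.sqrt(d)
    B = rng.standard_normal((d, RANK)) / np.sqrt(RANK)
    layers.append((kind, A, B))

def adapter_out(layer_idx, x):
    kind, A, B = layers[layer_idx]
    g = groups[kind]  # the shared 13 params for this layer type
    return B @ (g * (A @ x))

total_trainable = sum(v.size for v in groups.values())
print(total_trainable)  # 26
```

The optimizer still only sees 26 scalars, but attention and MLP behavior can now be steered independently.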

Next, I would like to try giving each individual MLP and attention layer its own parameters to optimize, maybe even just 2-6 each, to see whether individual layers can adjust the model better despite having fewer parameters than a larger set shared across many layers. In other words, to test global vs. local optimization of the model.
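For a sense of scale, here is the parameter budget each variant implies. The block count is a hypothetical stand-in for a Qwen-sized model, assuming one attention and one MLP sublayer per block.

```python
# Parameter budget comparison (hypothetical: 48 transformer blocks,
# each with one attention and one MLP sublayer).
n_blocks = 48
n_sublayers = 2 * n_blocks

global_shared = 13               # one set for the whole model
split_by_type = 2 * 13           # 13 for all attn + 13 for all MLP
per_layer_lo = 2 * n_sublayers   # 2 params per sublayer
per_layer_hi = 6 * n_sublayers   # 6 params per sublayer

print(global_shared, split_by_type, per_layer_lo, per_layer_hi)
# 13 26 192 576
```

Even the largest per-layer variant stays in the hundreds of parameters, so the "too large to converge" regime the paper reports for higher ranks may be about the geometry of the search space rather than raw count.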

My hypothesis is also that this wouldn't be well suited to memorizing facts, but it does seem good at altering behavior, which I tested on downstream tasks via lm-eval.

What this might imply

We might be able to train models with much less memory than we initially thought, but only for changing behavior. Imagine something like the new Engram from the DeepSeek paper:
https://github.com/deepseek-ai/Engram
But instead of an engram lookup, we could have a lookup table of behaviors made of LoRA adapters, much larger and more varied than MoE, which could even be updated over time, since the adapters are very small and require very little memory to train.
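A behavior lookup table along those lines could be as simple as a dictionary mapping behavior names to 13-float adapters applied on top of frozen shared bases. The behavior names, scales, and residual form here are all illustrative assumptions, not anything from the papers.

```python
import numpy as np

RANK, d = 13, 64
rng = np.random.default_rng(0)

# Frozen bases shared by the whole library (assumption: each "behavior"
# only needs its own 13 scales, so the library stays tiny).
A = rng.standard_normal((RANK, d)) / np.sqrt(d)
B = rng.standard_normal((d, RANK)) / np.sqrt(RANK)

# Lookup table: behavior name -> tiny adapter (13 floats each).
behavior_table = {
    "concise": rng.standard_normal(RANK) * 0.01,
    "formal": rng.standard_normal(RANK) * 0.01,
    "step_by_step": rng.standard_normal(RANK) * 0.01,
}

def apply_behavior(name, x):
    g = behavior_table[name]
    return x + B @ (g * (A @ x))  # residual tiny-LoRA update

x = rng.standard_normal(d)
y = apply_behavior("concise", x)
print(y.shape)  # (64,)
```

At 13 floats per behavior, a library of thousands of behaviors is kilobytes of storage, which is what makes the "updated over time" part plausible.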

submitted by /u/fiery_prometheus