Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation

arXiv cs.LG / 4/3/2026


Key Points

  • Ouroboros introduces an input-conditioned “Controller” hypernetwork for recursive transformers, generating a per-recurrence diagonal modulation vector so each depth step can perform distinct, hidden-state-dependent transformations.
  • The approach keeps the recursive transformer’s main weights frozen and uses SVD-initialized LoRA bases that are modulated per step, adding only 9.2M trainable parameters while enabling input-dependent depth behavior.
  • Stability and effective deep iteration are improved via gated recurrence (with a strong initial retention bias) and per-step LayerNorm; the paper reports that gated recurrence is essential, since removing it makes recursive layer application strictly worse.
  • On a Qwen2.5-3B “Prelude/Recurrent/Coda” setup with partial layer retention, Ouroboros reduces training loss by 43.4% over an unmodified 17-layer baseline and recovers 51.3% of the performance lost by removing layers.
  • Despite strong training-distribution gains, the Controller does not yet outperform the baseline on held-out text, which the authors attribute to frozen downstream layers and analyze further.
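The core mechanism in the first bullet can be sketched in a few lines: a small Controller network reads the current hidden state and emits a rank-sized modulation vector, which scales frozen SVD-derived LoRA bases so each recurrence step applies a different, input-dependent transformation. The sketch below is a minimal toy illustration, not the paper's implementation; the Controller MLP shape, activation choices, and dimensions are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # toy hidden size and LoRA rank (the paper tests ranks 8, 32, 64)

# Frozen base weight; its top-r SVD factors serve as the frozen LoRA bases.
W = rng.standard_normal((d, d)) / np.sqrt(d)
U, S, Vt = np.linalg.svd(W)
B = U[:, :r] * np.sqrt(S[:r])            # frozen "up" basis, shape (d, r)
A = np.sqrt(S[:r])[:, None] * Vt[:r]     # frozen "down" basis, shape (r, d)

# Hypothetical Controller: a tiny MLP mapping hidden state -> r-dim modulation.
Wc1 = rng.standard_normal((d, 8)) * 0.02
Wc2 = rng.standard_normal((8, r)) * 0.02

def controller(h):
    """Return an input-conditioned diagonal modulation vector of length r."""
    return np.tanh(np.maximum(h @ Wc1, 0.0) @ Wc2)

def modulated_step(h):
    """One recurrence step: frozen path plus Controller-modulated LoRA path."""
    m = controller(h)                 # depends on the current hidden state
    delta = ((h @ A.T) * m) @ B.T     # equivalent to h @ (B diag(m) A).T
    return h @ W.T + delta

h0 = rng.standard_normal(d)
h1 = modulated_step(h0)
h2 = modulated_step(h1)  # the second step sees a different modulation vector
```

Because `m` is recomputed from the hidden state at every step, repeated application of the same shared block no longer performs the same transformation, which is the distinction the paper draws against static per-step LoRA.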

Abstract

Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
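The stability components named in the abstract (gated recurrence with a bias initialized so the gate retains roughly 88% of the previous state, plus a separate LayerNorm per recurrence step) can be sketched as follows. This is a simplified illustration under stated assumptions: the shared block is replaced by a single toy matrix, and the scalar-gate form is one plausible reading of "gated recurrence", not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
d, depth = 16, 4  # toy hidden size and recurrence depth (paper tests 1, 4, 8, 16)

# Gate bias chosen so sigmoid(bias) = 0.88 (strong retention at initialization).
gate_bias = np.log(0.88 / 0.12)

def layernorm(x, gamma, beta, eps=1e-5):
    mu, var = x.mean(), x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Per-step LayerNorm parameters: one (gamma, beta) pair for each depth step.
gammas = [np.ones(d) for _ in range(depth)]
betas = [np.zeros(d) for _ in range(depth)]

W = rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in for the shared block

def recurrent_forward(h):
    for t in range(depth):
        g = 1.0 / (1.0 + np.exp(-gate_bias))       # gate, ~0.88 at init
        update = np.tanh(layernorm(h, gammas[t], betas[t]) @ W.T)
        h = g * h + (1.0 - g) * update             # gated recurrence
    return h

h_out = recurrent_forward(rng.standard_normal(d))
```

The high initial retention means each step changes the hidden state only slightly at the start of training, which is a standard way to keep deep iterated application from diverging; the paper's ablation (recursion without the gate being strictly worse) is consistent with this role.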