Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE
arXiv cs.LG / 3/13/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper systematically investigates partial RoPE, in which rotary position embeddings are applied to only a subset of each head's dimensions, and evaluates its impact on training dynamics across architectures, sequence lengths, and datasets (see the sketch after this list).
- It reports up to a 10x reduction in RoPE-cache memory relative to standard full RoPE while reaching comparable final loss.
- It finds that using RoPE on roughly 10% of dimensions yields convergence similar to full RoPE across model sizes and data qualities.
- It observes that NoPE can produce unstable learning trajectories; this can be mitigated either by applying RoPE to a minimal fraction of dimensions or by QK-Norm, though the latter converges to a higher final loss.
- It offers practical guidance for balancing efficiency and training stability in transformer design by emphasizing partial RoPE as a viable option.
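The core mechanism is simple to state: rotate only a leading fraction of each attention head's dimensions and pass the rest through unrotated. Below is a minimal PyTorch sketch of that idea; the function names (`rope_angles`, `apply_partial_rope`), the `rope_fraction` parameter, and the interleaved even/odd pairing convention are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def rope_angles(seq_len: int, rot_dim: int, base: float = 10000.0):
    # Standard RoPE frequencies, computed only for the rotated sub-dimensions.
    # The cos/sin cache scales with rot_dim, which is where the cache savings
    # reported above come from.
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    pos = torch.arange(seq_len).float()
    angles = torch.outer(pos, inv_freq)           # (seq_len, rot_dim / 2)
    return angles.cos(), angles.sin()

def apply_partial_rope(x: torch.Tensor, rope_fraction: float = 0.1):
    """Rotate only the leading `rope_fraction` of each head's dimensions.

    x: (batch, heads, seq_len, head_dim) query or key tensor.
    The remaining dimensions pass through unrotated (NoPE-style).
    Hypothetical sketch; pairing/layout conventions vary across codebases.
    """
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rope_fraction)
    rot_dim -= rot_dim % 2                        # rotation needs 2-D planes
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    cos, sin = rope_angles(x.shape[-2], rot_dim)
    cos = cos.to(device=x.device, dtype=x.dtype)
    sin = sin.to(device=x.device, dtype=x.dtype)

    # Pairwise rotation: (even, odd) dims form 2-D planes rotated by the angles.
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack(
        (x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1
    ).flatten(-2)

    return torch.cat((rotated, x_pass), dim=-1)
```

With `rope_fraction=0.1` the cos/sin cache covers only about a tenth of the head dimensions, consistent with the roughly 10x cache reduction noted above; `rope_fraction=0.0` recovers NoPE and `1.0` recovers full RoPE.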