Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE

arXiv cs.LG / 3/13/2026

💬 Opinion / Ideas & Deep Analysis / Models & Research


Key Points

The paper systematically investigates partial RoPE by applying rotary position embedding to only a subset of hidden dimensions and evaluates its impact on training dynamics across architectures, sequence lengths, and datasets.
It reports memory savings of up to 10x over the standard RoPE cache while reaching a comparable final loss.
It finds that using RoPE on roughly 10% of dimensions yields convergence similar to full RoPE across model sizes and data qualities.
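It observes that some models trained with NoPE (No Positional Encoding) show unstable learning trajectories. These can be mitigated either by applying RoPE to a minimal fraction of dimensions or by QK-Norm, although the QK-Norm variant converges to a higher loss.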
It offers practical guidance for balancing efficiency and training stability in transformer design by emphasizing partial RoPE as a viable option.
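
To make the idea of rotating only a subset of dimensions concrete, here is a minimal sketch of partial RoPE for a single query or key vector. It is an illustration under stated assumptions, not the paper's implementation: the function name `partial_rope`, the `rotary_fraction` parameter, the choice to rotate the leading dimensions, the interleaved pairing of dimensions, and the frequency base of 10000 are all assumptions that this summary does not specify.

```python
import math

def partial_rope(x, position, rotary_fraction=0.1, base=10000.0):
    """Apply rotary position embedding to the leading fraction of a vector.

    Hypothetical helper: the remaining dimensions pass through unchanged.
    """
    head_dim = len(x)
    # Rotated dimensions come in pairs, so round down to an even count.
    rot_dim = int(head_dim * rotary_fraction) // 2 * 2
    out = list(x)
    for i in range(rot_dim // 2):
        # One rotation angle per pair, scaled by the token position.
        theta = position * base ** (-2 * i / rot_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * cos_t - b * sin_t
        out[2 * i + 1] = a * sin_t + b * cos_t
    return out

# With a 64-dimensional head and the default fraction, 6 dimensions rotate.
q = [float(i + 1) for i in range(64)]
q_rot = partial_rope(q, position=5)
```

In this sketch, `rotary_fraction=1.0` corresponds to full RoPE and `rotary_fraction=0.0` leaves the vector untouched, which corresponds to NoPE.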

Abstract

Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which become especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently across model sizes, sequence lengths, architectures, and datasets of varying quality, with higher-quality data resulting in lower overall loss and similar benchmark performance; and (3) some models trained with NoPE (No Positional Encoding) exhibit unstable learning trajectories, which can be alleviated through minimal RoPE application or through QK-Norm, although QK-Norm converges to a higher loss. Together, these results offer practical guidance for model designers aiming to balance efficiency and training stability, while emphasizing the previously overlooked importance of partial RoPE.
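
The link between the rotated fraction and cache memory can be sketched with simple arithmetic. The example below assumes the RoPE cache holds one cosine and one sine value per position and per rotated dimension pair, stored in 16-bit precision. The cache layout, the helper name `rope_cache_bytes`, and the example sizes are assumptions for illustration; the abstract does not describe how the paper measures memory.

```python
def rope_cache_bytes(seq_len, head_dim, rotary_fraction=1.0, bytes_per_value=2):
    """Estimate the size of precomputed cos/sin tables (hypothetical layout)."""
    # Rotated dimensions come in pairs, so round down to an even count.
    rot_dim = int(head_dim * rotary_fraction) // 2 * 2
    # Two tables (cos and sin), one entry per position and per rotated pair.
    return 2 * seq_len * (rot_dim // 2) * bytes_per_value

# Assumed example sizes: a 131072-token context and 128-dimensional heads.
full = rope_cache_bytes(seq_len=131072, head_dim=128, rotary_fraction=1.0)
partial = rope_cache_bytes(seq_len=131072, head_dim=128, rotary_fraction=0.1)
print(f"full: {full} bytes, partial: {partial} bytes, ratio: {full / partial:.1f}x")
```

Under these assumptions the cache shrinks in proportion to the number of rotated pairs, so rotating about 10% of the dimensions gives a reduction of roughly an order of magnitude, in line with the figure the abstract reports.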
