Efficient Matrix Implementation for Rotary Position Embedding

arXiv cs.LG · April 14, 2026


Key Points

  • The paper identifies that common Rotary Position Embedding (RoPE) implementations incur avoidable overhead due to vector-level split/merge operations, especially in multi-dimensional (2D/3D) settings where hardware utilization suffers.
  • It introduces RoME (Rotary Matrix position Embedding), a mathematically equivalent reformulation that replaces vector operations with unified matrix transformations.
  • By removing dimension-specific operations, RoME simplifies implementation and supports more efficient fused parallel execution on modern NPUs (across Cube and Vector units).
  • Experiments report speedups both at the individual operator level and across full Transformer models, indicating practical performance benefits beyond micro-optimizations.
  • The authors provide a reference implementation in a public repository for evaluation and integration.
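The paper itself does not spell out its kernels here, but the core idea — that the usual split/merge rotation equals a single block-diagonal matrix multiply — can be sketched in a few lines of NumPy. The function names below are illustrative, not from the paper, and the matrix form is a naive dense sketch rather than the fused NPU implementation RoME describes:

```python
import numpy as np

def rope_vector(x, theta):
    # Conventional RoPE: split into interleaved pairs (x[2i], x[2i+1]),
    # rotate each pair by theta[i], then merge back — the split/merge
    # overhead the paper targets.
    x_even, x_odd = x[0::2], x[1::2]
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def rope_matrix(theta, d):
    # Equivalent block-diagonal rotation matrix: one 2x2 rotation block
    # per frequency, so the whole embedding is a single matmul.
    R = np.zeros((d, d))
    c, s = np.cos(theta), np.sin(theta)
    for i in range(d // 2):
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c[i], -s[i]
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s[i], c[i]
    return R

d, pos = 8, 3
# Standard RoPE frequency schedule (base 10000), position 3.
theta = pos * (10000.0 ** (-np.arange(d // 2) * 2.0 / d))
x = np.random.default_rng(0).standard_normal(d)
# Both formulations produce identical rotated embeddings.
assert np.allclose(rope_vector(x, theta), rope_matrix(theta, d) @ x)
```

On real hardware the matrix form pays off only when the matmul runs on dedicated units (the Cube units mentioned above) and the rotation matrix is reused across tokens; the dense version here is just a correctness check of the equivalence.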

Abstract

Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.