Efficient Matrix Implementation for Rotary Position Embedding

arXiv cs.LG · April 14, 2026


Key Points

  • The paper identifies that common Rotary Position Embedding (RoPE) implementations incur avoidable overhead due to vector-level split/merge operations, especially in multi-dimensional (2D/3D) settings where hardware utilization suffers.
  • It introduces RoME (Rotary Matrix position Embedding), a mathematically equivalent reformulation that replaces vector operations with unified matrix transformations.
  • By removing dimension-specific operations, RoME simplifies implementation and supports more efficient fused parallel execution on modern NPUs (across Cube and Vector units).
  • Experiments report speedups both at the individual operator level and across full Transformer models, indicating practical performance benefits beyond micro-optimizations.
  • The authors provide a reference implementation in a public repository for evaluation and integration.
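The paper itself does not spell out its kernels here, but the core idea — that the usual split/merge rotation equals a single block-diagonal matrix multiply — can be sketched in a few lines of NumPy. The function names below are illustrative, not from the paper, and the matrix form is a naive dense sketch rather than the fused NPU implementation RoME describes:

```python
import numpy as np

def rope_vector(x, theta):
    # Conventional RoPE: split into interleaved pairs (x[2i], x[2i+1]),
    # rotate each pair by theta[i], then merge back — the split/merge
    # overhead the paper targets.
    x_even, x_odd = x[0::2], x[1::2]
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def rope_matrix(theta, d):
    # Equivalent block-diagonal rotation matrix: one 2x2 rotation block
    # per frequency, so the whole embedding is a single matmul.
    R = np.zeros((d, d))
    c, s = np.cos(theta), np.sin(theta)
    for i in range(d // 2):
        R[2 * i, 2 * i], R[2 * i, 2 * i + 1] = c[i], -s[i]
        R[2 * i + 1, 2 * i], R[2 * i + 1, 2 * i + 1] = s[i], c[i]
    return R

d, pos = 8, 3
# Standard RoPE frequency schedule (base 10000), position 3.
theta = pos * (10000.0 ** (-np.arange(d // 2) * 2.0 / d))
x = np.random.default_rng(0).standard_normal(d)
# Both formulations produce identical rotated embeddings.
assert np.allclose(rope_vector(x, theta), rope_matrix(theta, d) @ x)
```

On real hardware the matrix form pays off only when the matmul runs on dedicated units (the Cube units mentioned above) and the rotation matrix is reused across tokens; the dense version here is just a correctness check of the equivalence.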

Abstract

Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.