CAWN: Continuous Acoustic Wave Networks for Autoregressive Language Modeling

arXiv cs.CL / 4/7/2026


Key Points

  • The paper proposes CAWN, a fully continuous sequence-mixing architecture for autoregressive language modeling that replaces transformer attention with multi-headed complex-domain phasors and causal phase accumulation for O(L) scaling.
  • To address long-context signal degradation seen in some linear-time sequence models, CAWN adds a dual-gated Selective Phase Resonance mechanism with frequency-dependent retention, hard-threshold gating, and a Temporal Syntax Cache for short-term dependencies.
  • It improves spatial/feature mixing by using depth-wise harmonic convolutions instead of standard dense projections, and it adds Block Attention Residuals for depth-wise state routing.
  • A 150M-parameter prototype is trained on a 100B-token corpus using continuous streaming and evaluated at a 5B-token milestone; it reportedly supports targeted retrieval across 2,000,000 tokens, with peak VRAM plateauing at 8.72 GB via O(1) chunked-prefill state passing.
  • The authors report empirical benefits using a Targeted Semantic Retrieval protocol, including robust vocabulary acquisition and extended contextual denoising.
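The paper's full kernel details are not reproduced here, but the core idea of the key points above, causal O(L) sequence mixing via complex-domain phasors with frequency-dependent retention, can be illustrated with a minimal NumPy sketch. All names (`phase_accumulation`, the per-head `freqs` and `retain` parameters) are illustrative assumptions, not the paper's API:

```python
import numpy as np

def phase_accumulation(x, freqs, retain):
    """Hypothetical sketch of causal O(L) phase accumulation.

    x:      (L, H) real-valued per-head amplitudes
    freqs:  (H,)   per-head angular frequencies
    retain: (H,)   frequency-dependent retention factors in (0, 1]
    Returns (L, H): real part of the running complex state at each step.
    """
    L, H = x.shape
    t = np.arange(L)[:, None]                       # (L, 1) time indices
    phasors = x * np.exp(1j * freqs[None, :] * t)   # complex-domain phasors
    state = np.zeros(H, dtype=np.complex128)        # O(1) carried state
    out = np.empty((L, H))
    for i in range(L):                              # causal left-to-right scan
        state = retain * state + phasors[i]         # leaky phase accumulation
        out[i] = state.real
    return out
```

Because each step only updates a fixed-size complex state, the scan is linear in sequence length and strictly causal: output at position i depends only on positions ≤ i, which is what permits the constant-memory state passing described later.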

Abstract

Modern Large Language Models (LLMs) rely on Transformer self-attention, which scales quadratically with sequence length. Recent linear-time alternatives, such as State Space Models (SSMs), often suffer from signal degradation over extended contexts. We introduce the Continuous Acoustic Wave Network (CAWN), a fully continuous sequence-mixing architecture. Instead of discrete matrix-based attention, CAWN projects hidden states into multi-headed complex-domain phasors, achieving sequence mixing through a causal, O(L) Phase Accumulation mechanism. To prevent signal degradation over ultra-long contexts, we introduce a dual-gated Selective Phase Resonance mechanism incorporating Frequency-Dependent Retention, Hard-Threshold Gating via Straight-Through Estimation, and a Temporal Syntax Cache to capture short-term local dependencies. We also replace standard dense linear projections with Depth-wise Harmonic Convolutions for optimal spatial frequency mixing, augmented by Block Attention Residuals for depth-wise state routing. Scaled to a 150M-parameter model, CAWN utilizes custom Triton kernels for hardware-efficient, true-complex phase accumulation in float32. Trained via a continuous streaming loop on a 100-Billion-token corpus, the prototype is evaluated at a 5-Billion-token milestone. Empirical evaluations via a Targeted Semantic Retrieval protocol demonstrate robust vocabulary acquisition and extended, explicitly learned contextual denoising. By leveraging O(1) state-passing via chunked prefill, the model retrieves targeted information across 2,000,000 tokens while strictly plateauing at 8.72 GB of Peak VRAM, empirically overcoming the O(L^2) context memory wall.
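The abstract's 2M-token retrieval at a flat VRAM ceiling hinges on chunked prefill with O(1) state passing: the context is consumed in fixed-size chunks, and only a constant-size recurrent state crosses chunk boundaries, so peak memory tracks the chunk size rather than the total context length. The sketch below illustrates that memory pattern with a generic leaky recurrence standing in for CAWN's phase accumulation; all names and the `chunk` parameter are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def chunked_prefill(embedded, retain, chunk=4):
    """Hypothetical sketch of O(1)-state chunked prefill.

    embedded: (L, H) embedded context tokens
    retain:   scalar or (H,) retention factor (stand-in for CAWN's
              frequency-dependent retention)
    chunk:    illustrative chunk size; peak memory scales with this,
              not with the full context length L
    """
    L, H = embedded.shape
    state = np.zeros(H)                      # constant-size carried state
    outputs = []
    for start in range(0, L, chunk):
        block = embedded[start:start + chunk]
        out = np.empty_like(block)
        for i, row in enumerate(block):      # recurrence within the chunk
            state = retain * state + row     # only `state` crosses chunks
            out[i] = state
        outputs.append(out)
    return np.concatenate(outputs, axis=0)
```

Because the per-chunk computation is identical regardless of how many chunks preceded it, the result is exactly the same as a single unchunked scan, which is why memory can plateau while context length keeps growing.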