Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model

arXiv cs.LG / April 29, 2026


Key Points

  • Nautile-370M is a newly introduced 371M-parameter small language model built to perform efficient reasoning within tight parameter and inference budgets.
  • Its architecture alternates two SeqCond Attention (SCA) layers, which use a linear-time spectral sequence operator, with one standard transformer layer, balancing long-context efficiency and state tracking against attention-style token-to-token routing.
  • The authors report training on limited compute: a single Google TPU v4-64 pod slice via TPU Research Cloud (TRC), followed by a reinforcement learning stage on a single NVIDIA DGX Spark.
  • The paper provides a theoretical result that SCA can exactly retrieve individual tokens from prefix summaries and can emulate softmax attention, arguing that SCA is at least as expressive as full self-attention in the continuous limit.
  • It also outlines a dedicated training data pipeline and proposes a reinforcement learning stage tailored to reasoning, verification, and response quality.
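The 2:1 layer pattern described above can be sketched in a few lines. The paper's SCA operator is spectral and its internals are not given here, so the sketch below substitutes a deliberately simple linear-time stand-in (a causal cumulative mean) purely to show the dataflow: two O(n) prefix-summary layers feeding one standard causal softmax-attention layer. All function names are illustrative, not from the paper.

```python
import numpy as np

def sca_layer(x):
    """Stand-in for a SeqCond Attention (SCA) layer. The real SCA is a
    spectral operator; here a causal cumulative mean is used only to
    illustrate the linear-time prefix-summary dataflow (assumption)."""
    sums = np.cumsum(x, axis=0)                  # O(n) prefix sums
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return sums / counts                         # position t sees tokens 0..t

def attention_layer(x):
    """Standard single-head causal softmax self-attention."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                       # causal masking
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def nautile_block(x):
    """One hybrid block in the reported 2:1 pattern:
    two SCA layers followed by one transformer layer."""
    x = sca_layer(x)
    x = sca_layer(x)
    return attention_layer(x)
```

In this arrangement the quadratic-cost attention layer appears only once per three layers, which is presumably how the design keeps inference cost down while retaining some exact token-to-token routing.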

Abstract

We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a linear-time spectral sequence operator inspired by SeqCondenser, alternate with one transformer layer. This design aims to retain the long-context efficiency and state-tracking benefits of structured sequential models while preserving the expressive token-to-token routing of attention. The model was trained on a single Cloud TPU v4-64 pod slice provided through the Google TPU Research Cloud (TRC) program; the subsequent reinforcement learning stage was carried out on a single NVIDIA DGX Spark. We prove that the SCA readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special case, establishing that SCA is at least as expressive as full self-attention in the continuous limit. We also describe the training data pipeline and outline a reinforcement learning stage specialized for reasoning, verification, and response quality.
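The abstract's exact-retrieval claim has a simple intuition: if each token is written into a prefix summary along an orthonormal per-position key, a linear readout against that key recovers the token exactly. The toy below illustrates only this intuition, not the paper's construction; it uses identity-matrix keys, so the summary width grows with sequence length, whereas the paper's result concerns a fixed readout mechanism in the continuous limit.

```python
import numpy as np

def build_prefix_summary(tokens):
    """Accumulate tokens into one summary matrix S = sum_t x_t e_t^T,
    where the e_t are orthonormal per-position keys (toy assumption:
    identity keys, so exact retrieval holds for n <= key dimension)."""
    n, d = tokens.shape
    keys = np.eye(n)
    S = np.zeros((d, n))
    for t in range(n):
        S += np.outer(tokens[t], keys[t])   # O(1) update per token
    return S, keys

def retrieve(S, keys, t):
    """Exact readout of token t: S @ e_t = x_t, since the keys are
    orthonormal and cross terms vanish."""
    return S @ keys[t]
```

Emulating softmax attention is then a matter of forming attention weights from such retrieved tokens, which is the sense in which a summary-based operator can be at least as expressive as full self-attention.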