State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

arXiv cs.LG / 5/4/2026


Key Points

  • The State Stream Transformer (SST) V2 proposes a parameter-efficient way to retain and stream a rich latent residual state across positions, instead of reconstructing latent reasoning context from scratch at every token.
  • SST V2 introduces an FFN-driven nonlinear recurrence within each decoder layer, using a learned horizontal blend to carry latent states through the full sequence and enabling extra “deliberation” at inference time by spending additional FLOPs per position (a hedged sketch follows this list).
  • The paper presents a two-pass parallel training method to handle the otherwise sequential dependency created by the recurrence, making compute-efficient training feasible (also sketched after this list).
  • Co-trained into an existing 27B backbone on only a small GSM8K dataset, SST V2 improves out-of-distribution GPQA-Diamond performance by +15.15 points over a fine-tuning-matched baseline and cuts that baseline's remaining GSM8K errors by 46%, suggesting the gains come from the architectural mechanism rather than scale or training data.
  • Analysis and probing indicate that latent-state exploration moves the model across distinct “semantic basins” in continuous latent space, and that the latent state at the first generated token already predicts whether the eventual answer will hold or change under additional latent computation at later positions.
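
To make the mechanism in the second and third bullets concrete, here is a minimal sketch assuming a plain PyTorch decoder FFN block; the module name `StateStreamFFN`, the sigmoid blend gate, and the exact update rule are illustrative guesses, not the paper's formulation.

```python
import torch
import torch.nn as nn

class StateStreamFFN(nn.Module):
    """Decoder-layer FFN block with a carried latent 'stream' (illustrative)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Learned per-channel blend between the incoming residual state and the
        # carried stream (hypothetical parameterisation).
        self.blend_gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) residual stream entering the FFN block.
        batch, seq_len, d_model = x.shape
        stream = x.new_zeros(batch, d_model)
        outputs = []
        for t in range(seq_len):  # sequential recurrence across positions
            gate = torch.sigmoid(self.blend_gate(torch.cat([x[:, t], stream], dim=-1)))
            blended = gate * stream + (1.0 - gate) * x[:, t]
            stream = self.ffn(blended) + blended  # nonlinear update carried forward
            outputs.append(stream)
        return torch.stack(outputs, dim=1)
```

The sequential loop above is exactly the dependency the two-pass parallel training procedure is meant to sidestep. One plausible (but assumed) instantiation: a first fully parallel pass with an empty carried state produces approximate stream values, and a second fully parallel pass reuses the shifted pass-1 streams as the carried input. This is a sketch of how the sequential dependency could be broken, not the paper's exact recipe.

```python
import torch

def two_pass_parallel(layer: "StateStreamFFN", x: torch.Tensor) -> torch.Tensor:
    """Approximate the sequential recurrence with two fully parallel passes.

    Relies on the StateStreamFFN sketch above; the scheme itself is hypothetical.
    """
    def update(x_all: torch.Tensor, stream_all: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(layer.blend_gate(torch.cat([x_all, stream_all], dim=-1)))
        blended = gate * stream_all + (1.0 - gate) * x_all
        return layer.ffn(blended) + blended

    # Pass 1: carried state fixed to zero, so every position is independent.
    approx_stream = update(x, torch.zeros_like(x))
    # Pass 2: shift the pass-1 streams right by one position and reuse them as
    # the carried input, again computing every position in parallel.
    carried = torch.cat(
        [torch.zeros_like(approx_stream[:, :1]), approx_stream[:, :-1]], dim=1
    )
    return update(x, carried)
```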

Abstract

Current transformers discard their rich latent residual stream between positions, reconstructing latent reasoning context at each new position and leaving potential reasoning capacity untapped. The State Stream Transformer (SST) V2 enables parameter-efficient reasoning in continuous latent space through an FFN-driven nonlinear recurrence at each decoder layer, where latent states are streamed horizontally across the full sequence via a learned blend. This same mechanism supports continuous latent deliberation per position at inference time, dedicating additional FLOPs to exploring abstract reasoning before committing to a token. A two-pass parallel training procedure resolves the sequential dependency of the recurrence to allow compute-efficient training. Hidden state analysis shows the state stream facilitates reasoning through exploration of distinct semantic basins in continuous latent space, where transitions at content-dependent positions move the model into a substantially different Bayesian posterior, directly influencing the latent space at future positions. We also find, via a learned probe, that at the first generated token position, the latent state already predicts whether the eventual answer will survive or break under additional latent computation at every subsequent position. Co-trained into an existing 27B backbone using only a small dataset of GSM8K examples, the SST delivers a +15.15 point gain over a fine-tuning-matched baseline on out-of-distribution GPQA-Diamond and cuts that same baseline's remaining GSM8K errors by 46%, together showing that the reasoning improvement is attributable to the architectural mechanism rather than scale or training data. On GPQA-Diamond, the resulting 27B SST also achieves higher accuracy than several larger open-weight and proprietary systems, including open-weight models up to 25 times larger.
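
As a rough illustration of the inference-time deliberation described in the abstract, the sketch below simply iterates the same carried-stream update a few extra times before a position's output is committed; the iteration count and the reuse of a single blend gate are assumptions, and the actual mechanism may allocate its extra FLOPs differently.

```python
import torch

@torch.no_grad()
def deliberate_step(layer: "StateStreamFFN", x_t: torch.Tensor,
                    stream: torch.Tensor, extra_steps: int = 2) -> torch.Tensor:
    """Iterate the carried-stream update a few extra times at one position.

    x_t: (batch, d_model) residual at the current position; `extra_steps` is an
    illustrative knob for spending additional inference-time FLOPs.
    """
    for _ in range(1 + extra_steps):  # one normal update plus extra deliberation
        gate = torch.sigmoid(layer.blend_gate(torch.cat([x_t, stream], dim=-1)))
        blended = gate * stream + (1.0 - gate) * x_t
        stream = layer.ffn(blended) + blended
    return stream
```

The learned probe could, in its simplest reading, be a linear classifier on the first generated token's latent state; the probe architecture and training target below are likewise assumptions rather than the paper's setup.

```python
import torch
import torch.nn as nn

class SurvivalProbe(nn.Module):
    """Linear probe over the first generated token's latent state (illustrative)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, first_token_state: torch.Tensor) -> torch.Tensor:
        # first_token_state: (batch, d_model); returns the probability that the
        # eventual answer survives further latent computation.
        return torch.sigmoid(self.classifier(first_token_state)).squeeze(-1)
```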