The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Models

arXiv cs.LG / 4/8/2026


Key Points

  • The paper introduces the UNDO Flip-Flop task, extending the standard Flip-Flop to require reversible semantic state retrieval under non-monotonic updates.
  • Experiments on one-layer and two-layer Mamba-2 show consistent failure to learn the provably expressible bounded stack rollback behavior, instead settling on a local toggle heuristic.
  • Under an adversarial retraction pressure test (within the training length distribution), performance for the two-layer model collapses to 41.10% accuracy, below random chance.
  • Causal ablation indicates the bottleneck is retrieval rather than storage, highlighting a gap between architectural expressivity and what gradient-based optimization can reliably discover.
  • The authors argue that theoretical expressivity results alone are insufficient to predict real training success for reversible semantic state management in state space models.
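To make the task concrete, here is a minimal reference implementation of what the UNDO Flip-Flop semantics could look like. The instruction names (`write`, `undo`, `read`), the bounded-stack depth, and the boundary rule at maximum depth are illustrative assumptions, not the paper's exact specification:

```python
# Hypothetical reference semantics for the UNDO Flip-Flop task.
# The instruction vocabulary and the depth-overflow rule below are
# assumptions for illustration, not the paper's exact definition.

def undo_flip_flop(instructions, max_depth=8):
    """Return the value emitted at each 'read' instruction.

    'write v' pushes v onto a bounded history stack;
    'undo' pops the most recent write, restoring the prior state;
    'read' reports the current top of the stack.
    """
    stack = [0]  # assumed initial state
    outputs = []
    for op, *arg in instructions:
        if op == "write":
            if len(stack) < max_depth:
                stack.append(arg[0])
            else:  # depth exceeded: overwrite the top (one possible boundary rule)
                stack[-1] = arg[0]
        elif op == "undo":
            if len(stack) > 1:
                stack.pop()
        elif op == "read":
            outputs.append(stack[-1])
    return outputs

seq = [("write", 1), ("write", 1), ("undo",), ("read",)]
print(undo_flip_flop(seq))  # correct rollback recovers the earlier write: [1]
```

The key property is non-monotonicity: answering a `read` after one or more `undo` instructions requires retrieving a historical value, not just tracking the most recent write as in the standard Flip-Flop.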

Abstract

State space models (SSMs) have been shown to possess the theoretical capacity to model both star-free sequential tasks and bounded hierarchical structures (Sarrof et al., 2024). However, formal expressivity results do not guarantee that gradient-based optimisation will reliably discover the corresponding solutions. Existing benchmarks probe either monotonic state tracking, as in the standard Flip-Flop task, or structural nesting, as in the Dyck languages, but neither isolates reversible semantic state retrieval. We introduce the UNDO Flip-Flop task to fill this gap. By extending the standard Flip-Flop with an UNDO, the task requires a model to maintain an implicit bounded stack and recover historical states under non-monotonic update sequences. We evaluate one-layer and two-layer Mamba-2 under this framework. Both variants fail to acquire the provably expressible stack-based rollback mechanism, converging instead on a local toggle heuristic that inverts the current state rather than retrieving stored history. Under an adversarial retraction pressure test held within the training length distribution, the two-layer model collapses to 41.10% accuracy, which is below random chance. The results confirm systematic rather than incidental failure. Causal ablation shows that the bottleneck lies in retrieval, not storage. These results draw a clear line between what an architecture can in principle represent and what gradient descent reliably learns, a distinction that theoretical expressivity analyses alone cannot capture.
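The failure mode described in the abstract, inverting the current state instead of retrieving stored history, can be sketched directly. The following contrast is an illustrative assumption about what such a toggle heuristic would compute (on a binary alphabet), not the paper's code; it shows why the heuristic can look correct on alternating writes yet diverge whenever two consecutive writes carry the same value:

```python
# Contrast between correct stack-based rollback and a local "toggle"
# heuristic on a binary alphabet. Both functions return the current
# state after processing a sequence of (op, value) instructions.
# All names here are illustrative, not taken from the paper.

def stack_rollback(seq):
    stack = [0]  # assumed initial state
    for op, *arg in seq:
        if op == "write":
            stack.append(arg[0])
        elif op == "undo" and len(stack) > 1:
            stack.pop()  # genuinely retrieve the stored prior state
    return stack[-1]

def toggle_heuristic(seq):
    state = 0
    for op, *arg in seq:
        if op == "write":
            state = arg[0]
        elif op == "undo":
            state = 1 - state  # invert the current bit instead of recalling history
    return state

agree   = [("write", 0), ("write", 1), ("undo",)]  # alternating writes: both yield 0
diverge = [("write", 1), ("write", 1), ("undo",)]  # repeated write: stack gives 1, toggle gives 0
print(stack_rollback(diverge), toggle_heuristic(diverge))
```

An adversarial retraction test can exploit exactly this gap: sequences with repeated values and multiple consecutive undos make the heuristic systematically wrong, which is consistent with the below-chance collapse reported for the two-layer model.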