StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing

arXiv cs.LG / 5/6/2026


Key Points

  • StateSMix is a new fully self-contained lossless compression approach that trains a Mamba-style state space model online (token-by-token) on the file being compressed, without pre-trained weights, GPUs, or external dependencies.
  • The compressor combines continuously updated probability estimates from the SSM over BPE tokens with sparse n-gram context mixing (bigram through 32-gram) implemented as nine large hash tables and integrated via a softmax-invariant logit-bias mechanism.
  • An entropy-adaptive scaling mechanism modulates how much the n-gram component contributes based on the SSM's predictive confidence, aiming to avoid over-correcting when the neural predictor is already reliable (a rough C sketch of this mixing step follows the list).
  • On the enwik8 benchmark, StateSMix reports 2.123 bpb (1 MB), 2.149 bpb (3 MB), and 2.162 bpb (10 MB), outperforming xz -9e (LZMA2) by 8.7%, 5.4%, and 0.7% respectively, with ablations showing the SSM is the primary driver and n-grams add a smaller complementary gain.
  • The system is implemented in pure C using AVX2 SIMD, achieves about 2,000 tokens/second on commodity x86-64 hardware, and gains about 1.9x speedup from OpenMP parallelization on 4 cores.
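To make the n-gram mixing concrete, here is a minimal C sketch of how sparse hash tables could bias the SSM's logits with an entropy-adaptive scale. The table orders, slot layout, hash function, and scaling formula below are illustrative assumptions, not the paper's exact design; only the headline parameters (nine tables, 16M slots each, bias applied to non-zero-count tokens, scale tied to SSM confidence) come from the description above.

```c
/* Illustrative sketch of sparse n-gram logit biasing with an
 * entropy-adaptive scale.  Orders, hashing, slot layout, and the
 * scale formula are assumptions made for this example only. */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

#define VOCAB     4096        /* assumed BPE vocabulary size            */
#define SLOTS     (1u << 24)  /* 16M slots per table, as in the paper   */
#define N_TABLES  9           /* nine tables, bigram through 32-gram    */

typedef struct {              /* one slot: predicted token + hit count  */
    uint16_t token;           /* (simplified single-entry layout)       */
    uint16_t count;
} Slot;

static Slot *table[N_TABLES];
/* assumed set of context orders covering bigram .. 32-gram */
static const int ORDER[N_TABLES] = {2, 3, 4, 6, 8, 12, 16, 24, 32};

int ngram_init(void)
{
    for (int t = 0; t < N_TABLES; t++) {
        table[t] = calloc(SLOTS, sizeof(Slot));   /* 64 MB per table */
        if (!table[t])
            return -1;
    }
    return 0;
}

/* FNV-1a style hash over the last (order-1) context tokens (illustrative). */
static uint32_t ctx_hash(const uint16_t *hist, int len, int order)
{
    uint64_t h = 1469598103934665603ull;          /* FNV offset basis */
    for (int i = len - (order - 1); i < len; i++) {
        h ^= hist[i];
        h *= 1099511628211ull;                    /* FNV prime */
    }
    return (uint32_t)(h & (SLOTS - 1));
}

/* Shannon entropy of the SSM's output distribution, in bits. */
static float entropy_bits(const float *p, int n)
{
    float H = 0.0f;
    for (int i = 0; i < n; i++)
        if (p[i] > 0.0f)
            H -= p[i] * log2f(p[i]);
    return H;
}

/* Add n-gram biases on top of the SSM logits.  Only tokens with a
 * non-zero count in some table are touched, so the correction is sparse,
 * and it enters before the softmax as a pure additive logit bias.  The
 * scale grows with the SSM's normalised entropy, so a confident
 * (low-entropy) SSM is left mostly alone. */
void mix_ngram_bias(float *logits, const float *ssm_probs,
                    const uint16_t *hist, int hist_len)
{
    float scale = entropy_bits(ssm_probs, VOCAB) / log2f((float)VOCAB);

    for (int t = 0; t < N_TABLES; t++) {
        if (hist_len < ORDER[t] - 1)
            continue;                              /* not enough context yet   */
        Slot *s = &table[t][ctx_hash(hist, hist_len, ORDER[t])];
        if (s->count == 0)
            continue;                              /* sparse: skip empty slots */
        logits[s->token] += scale * logf(1.0f + (float)s->count);
    }
}
```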

Abstract

We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram context mixing and arithmetic coding. The model is initialised from scratch and trained token-by-token on the file being compressed, requiring no pre-trained weights, no GPU, and no external dependencies. The SSM (DM=32, NL=2, approximately 120K active parameters per file) provides a continuously-updated probability estimate over BPE tokens, while nine sparse n-gram hash tables (bigram through 32-gram, 16M slots each) add exact local and long-range pattern memorisation via a softmax-invariant logit-bias mechanism that updates only non-zero-count tokens. An entropy-adaptive scaling mechanism modulates the n-gram contribution based on the SSM's predictive confidence, preventing over-correction when the neural model is already well-calibrated. On the standard enwik8 benchmark, StateSMix achieves 2.123 bpb on 1 MB, 2.149 bpb on 3 MB, and 2.162 bpb on 10 MB, beating xz -9e (LZMA2) by 8.7%, 5.4%, and 0.7% respectively. Ablation experiments establish the SSM as the dominant compression engine: it alone accounts for a 46.6% size reduction over a frequency-count baseline and beats xz without any n-gram component, while n-gram tables provide a complementary 4.1% gain through exact context memorisation. OpenMP parallelisation of the training loop yields 1.9x speedup on 4 cores. The system is implemented in pure C with AVX2 SIMD and processes approximately 2,000 tokens per second on commodity x86-64 hardware.
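The abstract describes a predict-code-learn loop that both encoder and decoder replay deterministically. The sketch below shows one plausible shape of that per-token loop, reusing the mix_ngram_bias signature from the earlier sketch; ssm_forward, ssm_backprop, ngram_update, softmax, and ac_encode_token are hypothetical helper names introduced here for illustration and are not taken from the paper.

```c
/* Sketch of the online per-token compression loop: predict, code, then
 * learn.  All helper names and signatures below are assumptions. */
#include <stddef.h>
#include <stdint.h>

#define VOCAB 4096   /* assumed BPE vocabulary size */

void ssm_forward(const uint16_t *hist, size_t len,
                 float *logits_out, float *probs_out);    /* SSM logits + softmax */
void ssm_backprop(uint16_t target);                       /* one online SGD step  */
void mix_ngram_bias(float *logits, const float *ssm_probs,
                    const uint16_t *hist, int hist_len);  /* as sketched above    */
void softmax(const float *logits, float *probs_out, int n);
void ngram_update(const uint16_t *hist, size_t len, uint16_t next);
void ac_encode_token(const float *probs, uint16_t token); /* arithmetic coder     */

/* Learning happens only after each token is coded, so the decoder can run
 * the identical predict/learn sequence and stay bit-exact with no side data. */
void compress_stream(const uint16_t *tokens, size_t n_tokens)
{
    float logits[VOCAB], probs[VOCAB], mixed[VOCAB];

    for (size_t i = 0; i < n_tokens; i++) {
        ssm_forward(tokens, i, logits, probs);          /* 1. SSM prediction          */
        mix_ngram_bias(logits, probs, tokens, (int)i);  /* 2. sparse n-gram bias      */
        softmax(logits, mixed, VOCAB);                  /* 3. final coding dist.      */
        ac_encode_token(mixed, tokens[i]);              /* 4. arithmetic-code token   */
        ssm_backprop(tokens[i]);                        /* 5. online weight update    */
        ngram_update(tokens, i, tokens[i]);             /* 6. bump hash-table counts  */
    }
}
```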