Faster Superword Tokenization

arXiv cs.CL / 4/8/2026


Key Points

  • The paper introduces a faster way to train BoundlessBPE/SuperBPE by aggregating “supermerge candidates” via frequency, avoiding the need to keep full documents in memory.
  • It proposes a two-phase formulation of BoundlessBPE that cleanly separates learning regular merges from learning supermerges while matching the original algorithm’s results.
  • The authors report a drastic training-speed improvement on 1GB of data, reducing BoundlessBPE from 4.7 CPU days to about 603 seconds and SuperBPE to about 593 seconds (over 600x faster).
  • The work shows a near-equivalence between the two-phase BoundlessBPE and SuperBPE: the hyperparameter that SuperBPE requires to be chosen manually can instead be determined automatically in BoundlessBPE’s second phase.
  • The paper open-sources reference Python and performance-oriented Rust implementations for BPE, BoundlessBPE, and SuperBPE.
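The core speedup idea in the first bullet — counting consecutive-pretoken windows by frequency instead of holding whole documents in memory — can be illustrated with a short sketch. This is not the paper's implementation; the function name, the fixed window size, and the streaming interface are illustrative assumptions.

```python
from collections import Counter

def count_supermerge_candidates(pretokenized_docs, window=2):
    """Aggregate "supermerge candidates" (runs of `window` consecutive
    pretokens) by frequency. Documents are streamed one at a time, so
    the full corpus never needs to sit in memory at once.
    Illustrative sketch only; not the paper's actual API."""
    counts = Counter()
    for pretokens in pretokenized_docs:
        for i in range(len(pretokens) - window + 1):
            counts[tuple(pretokens[i:i + window])] += 1
    return counts

docs = [["by", "the", "way"], ["by", "the", "book"], ["by", "the", "way"]]
candidates = count_supermerge_candidates(docs)
print(candidates.most_common(2))
```

Once candidates are aggregated this way, they can be ranked and merged by frequency exactly like regular pretokens in BPE training, which is what removes the need to retain documents.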

Abstract

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend and improve BPE by relaxing this limitation and allowing the formation of superwords, which are combinations of pretokens that form phrases. However, previous implementations were impractical to train: for example, BoundlessBPE took 4.7 CPU days to train on 1GB of data. We show that supermerge candidates, two or more consecutive pretokens eligible to form a supermerge, can be aggregated by frequency much like regular pretokens. This avoids keeping full documents in memory, as the original implementations of BoundlessBPE and SuperBPE required, leading to a significant training speedup. We present a two-phase formulation of BoundlessBPE that separates first-phase learning of regular merges from second-phase learning of supermerges, producing identical results to the original implementation. We also show a near-equivalence between two-phase BoundlessBPE and SuperBPE, with the difference being that a manually selected hyperparameter used in SuperBPE can be automatically determined in the second phase of BoundlessBPE. These changes enable a much faster implementation, allowing training on that same 1GB of data in 603 and 593 seconds for BoundlessBPE and SuperBPE, respectively, a more than 600x increase in speed. For each of BoundlessBPE, SuperBPE, and BPE, we open-source both a reference Python implementation and a fast Rust implementation.
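To make the two-phase formulation concrete, the first phase is ordinary BPE run over frequency-aggregated words. A minimal sketch of that phase, with the second phase amounting to running the same loop again where the "symbols" are whole pretokens rather than characters, might look as follows. All names here are illustrative assumptions, not the paper's open-sourced code.

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    """Phase-1 sketch: learn BPE merges from frequency-aggregated words.
    `word_counts` maps a tuple of symbols to its corpus frequency, so no
    raw documents are needed. Returns the learned merges and the final
    segmented vocabulary. Illustrative sketch only."""
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in word_counts.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        merged = {}
        for word, freq in word_counts.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        word_counts = merged
    return merges, word_counts

counts = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
merges, final_counts = bpe_merges(counts, 2)
print(merges)  # learned merge rules, most frequent pair first
```

In the two-phase view described above, the second phase would feed this same loop aggregated counts of supermerge candidates (tuples of pretokens), so supermerges are learned with the identical machinery, without ever materializing full documents.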