I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

Reddit r/MachineLearning / 5/2/2026


Key Points

  • A private Usenet archive spanning 1980–2013 has been assembled and processed into a 103.1B-token (cl100k_base) corpus with 408M posts across 18,347 newsgroups.
  • The dataset underwent extensive preprocessing, including full deduplication, quoted-text handling, exclusion of alt.binaries.* before record-level cleaning, and email redaction using pattern matching plus SHA-256 hashing of Message-IDs.
  • Raw MBOX archives were converted to gzip-compressed JSONL, and language detection was applied to every record using Meta’s fastText LID-176, resulting in 96.6% English with meaningful coverage of 100+ other languages.
  • The author highlights the corpus’s “temporal arc” that captures long-term language evolution—sparse pre-1986, growing through the early 1990s, peaking around 1999–2000, and declining as Usenet was displaced by forums and social media.
  • A data card, cleaning methodology, and representative samples are published on Hugging Face for use and inspection by researchers and practitioners.

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013.

Here's what it ended up being:

  • 103.1 billion tokens (cl100k_base; a counting sketch follows this list)
  • 408 million posts across 9 newsgroup hierarchies
  • 18,347 newsgroups covered
  • 33 years of continuous coverage
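
If you want to sanity-check the headline token count, here's roughly how a cl100k_base count is computed with tiktoken. This is a minimal sketch: the shard path and the `text` field name are illustrative, not the exact on-disk layout.

```python
import gzip
import json

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(jsonl_gz_path):
    """Sum cl100k_base tokens over the 'text' field of one JSONL shard."""
    total = 0
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # disallowed_special=() treats strings like "<|endoftext|>"
            # as ordinary text instead of raising, which matters for
            # raw web/Usenet data.
            total += len(enc.encode(record["text"], disallowed_special=()))
    return total

print(count_tokens("comp.lang.c.jsonl.gz"))  # illustrative shard name
```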

The processing pipeline included full deduplication, binary removal (alt.binaries.* excluded at the hierarchy level before record-level cleaning), quoted-text handling, email-address redaction via pattern matching, SHA-256 hashing of Message-IDs, and conversion from raw MBOX archives to gzip-compressed JSONL.
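
To make that concrete, here's a simplified Python sketch of the MBOX → JSONL step. The field names, redaction placeholder, and single-pass structure are illustrative rather than the exact production code:

```python
import gzip
import hashlib
import json
import mailbox
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
QUOTE_RE = re.compile(r"^\s*>")  # lines starting with '>' are quoted text

def clean_post(msg):
    """Convert one MBOX message into a redacted JSON record."""
    body = msg.get_payload(decode=True) or b""
    text = body.decode("utf-8", errors="replace")
    # One common way to handle quoted text: drop quoted lines outright.
    text = "\n".join(l for l in text.splitlines() if not QUOTE_RE.match(l))
    # Redact email addresses via pattern matching.
    text = EMAIL_RE.sub("[EMAIL REDACTED]", text)
    # Hash the Message-ID so threads remain linkable without exposing
    # the original identifier.
    msg_id = msg.get("Message-ID", "")
    return {
        "message_id_sha256": hashlib.sha256(msg_id.encode()).hexdigest(),
        "newsgroups": msg.get("Newsgroups", ""),
        "date": msg.get("Date", ""),
        "text": text,
    }

def mbox_to_jsonl(mbox_path, out_path):
    """Stream an MBOX archive into gzip-compressed JSONL."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for msg in mailbox.mbox(mbox_path):
            groups = msg.get("Newsgroups", "")
            # Hierarchy-level exclusion of binaries before record cleaning.
            # Simplified: real filtering must also handle crossposts that
            # list alt.binaries.* after other groups.
            if groups.startswith("alt.binaries."):
                continue
            out.write(json.dumps(clean_post(msg)) + "\n")

mbox_to_jsonl("comp.lang.c.mbox", "comp.lang.c.jsonl.gz")
```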

Language detection was run on every record using Meta's fastText LID-176 model. The corpus is 96.6% English with meaningful representation from 100+ other languages; the soc.culture.* groups in particular have high non-English density.
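
The per-record detection itself is straightforward. A minimal sketch, assuming the standard lid.176.bin model file from fasttext.cc:

```python
import fasttext  # pip install fasttext

# lid.176.bin is Meta's 176-language identification model, downloadable
# from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def detect_language(text):
    """Return (ISO language code, confidence) for one post body."""
    # fastText's predict() expects a single line of text.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Hvordan installerer jeg en nyhedslæser?"))
# -> ('da', 0.97) or similar
```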

The thing I find most interesting about this dataset from a training perspective is the temporal arc. Volume is sparse pre-1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet gets displaced by forums and social media. That's a 33-year window of language evolution baked into a single coherent corpus — before SEO, before engagement optimization, before AI-generated content existed.

I've published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face: https://huggingface.co/datasets/OwnedByDanes/Usenet-Corpus-1980-2013

Happy to answer questions about the processing pipeline or the data itself.

submitted by /u/OwnerByDane