I spent years building a 103B-token Usenet corpus (1980–2013) and finally documented it [P]

Reddit r/MachineLearning / 5/2/2026


Key Points

  • A private Usenet archive spanning 1980–2013 has been assembled and processed into a 103.1B-token (cl100k_base) corpus with 408M posts across 18,347 newsgroups.
  • The dataset underwent extensive preprocessing, including full deduplication, quoted-text handling, exclusion of alt.binaries.* before record-level cleaning, and email redaction using pattern matching plus SHA-256 hashing of Message-IDs.
  • Raw MBOX archives were converted to gzip-compressed JSONL, and language detection was applied to every record using Meta’s fastText LID-176, resulting in 96.6% English with meaningful coverage of 100+ other languages.
  • The author highlights the corpus’s “temporal arc” that captures long-term language evolution—sparse pre-1986, growing through the early 1990s, peaking around 1999–2000, and declining as Usenet was displaced by forums and social media.
  • A data card, cleaning methodology, and representative samples are published on Hugging Face for use and inspection by researchers and practitioners.

For the past several years I've been quietly assembling and processing what I believe is one of the larger privately held pretraining corpora around: a complete Usenet archive spanning 1980 to 2013.

Here's what it ended up being:

  • 103.1 billion tokens (cl100k_base; a counting sketch follows this list)
  • 408 million posts across 9 newsgroup hierarchies
  • 18,347 newsgroups covered
  • 33 years of continuous coverage
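
If you want to sanity-check the headline token count, here's roughly how a cl100k_base count is computed with tiktoken. This is a minimal sketch: the shard path and the `text` field name are illustrative, not the exact on-disk layout.

```python
import gzip
import json

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(jsonl_gz_path):
    """Sum cl100k_base tokens over the 'text' field of one JSONL shard."""
    total = 0
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # disallowed_special=() treats strings like "<|endoftext|>"
            # as ordinary text instead of raising, which matters for
            # raw web/Usenet data.
            total += len(enc.encode(record["text"], disallowed_special=()))
    return total

print(count_tokens("comp.lang.c.jsonl.gz"))  # illustrative shard name
```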

The processing pipeline included full deduplication, binary removal (alt.binaries.* excluded at the hierarchy level before record-level cleaning), quoted-text handling, email-address redaction via pattern matching, SHA-256 hashing of Message-IDs, and conversion from raw MBOX archives to gzip-compressed JSONL.
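
To make that concrete, here's a simplified Python sketch of the MBOX → JSONL step. The field names, redaction placeholder, and single-pass structure are illustrative rather than the exact production code:

```python
import gzip
import hashlib
import json
import mailbox
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
QUOTE_RE = re.compile(r"^\s*>")  # lines starting with '>' are quoted text

def clean_post(msg):
    """Convert one MBOX message into a redacted JSON record."""
    body = msg.get_payload(decode=True) or b""
    text = body.decode("utf-8", errors="replace")
    # One common way to handle quoted text: drop quoted lines outright.
    text = "\n".join(l for l in text.splitlines() if not QUOTE_RE.match(l))
    # Redact email addresses via pattern matching.
    text = EMAIL_RE.sub("[EMAIL REDACTED]", text)
    # Hash the Message-ID so threads remain linkable without exposing
    # the original identifier.
    msg_id = msg.get("Message-ID", "")
    return {
        "message_id_sha256": hashlib.sha256(msg_id.encode()).hexdigest(),
        "newsgroups": msg.get("Newsgroups", ""),
        "date": msg.get("Date", ""),
        "text": text,
    }

def mbox_to_jsonl(mbox_path, out_path):
    """Stream an MBOX archive into gzip-compressed JSONL."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for msg in mailbox.mbox(mbox_path):
            groups = msg.get("Newsgroups", "")
            # Hierarchy-level exclusion of binaries before record cleaning.
            # Simplified: real filtering must also handle crossposts that
            # list alt.binaries.* after other groups.
            if groups.startswith("alt.binaries."):
                continue
            out.write(json.dumps(clean_post(msg)) + "\n")

mbox_to_jsonl("comp.lang.c.mbox", "comp.lang.c.jsonl.gz")
```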

Language detection was run on every record using Meta's fastText LID-176 model. The corpus is 96.6% English with meaningful representation from 100+ other languages; the soc.culture.* groups in particular have high non-English density.
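
The per-record detection itself is straightforward. A minimal sketch, assuming the standard lid.176.bin model file from fasttext.cc:

```python
import fasttext  # pip install fasttext

# lid.176.bin is Meta's 176-language identification model, downloadable
# from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def detect_language(text):
    """Return (ISO language code, confidence) for one post body."""
    # fastText's predict() expects a single line of text.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", ""), float(probs[0])

print(detect_language("Hvordan installerer jeg en nyhedslæser?"))
# -> ('da', 0.97) or similar
```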

The thing I find most interesting about this dataset from a training perspective is the temporal arc. Volume is sparse pre-1986, grows steadily through the early 90s, peaks around 1999–2000, then declines as Usenet gets displaced by forums and social media. That's a 33-year window of language evolution baked into a single coherent corpus — before SEO, before engagement optimization, before AI-generated content existed.

I've published a full data card, cleaning methodology, and representative samples (5K posts per hierarchy + combined sets) on Hugging Face: https://huggingface.co/datasets/OwnedByDanes/Usenet-Corpus-1980-2013

Happy to answer questions about the processing pipeline or the data itself.

submitted by /u/OwnerByDane