Introducing talkie: a 13B language model trained on "vintage" data from 1930

Simon Willison's Blog, 28th April 2026


Key points

  • The new open-source "talkie" project introduces two 13B language models (a base model and a chat model) trained on historical pre-1931 English text, both released under the Apache 2.0 license.
  • The base model (talkie-1930-13b-base) was trained on 260B tokens; the chat model (talkie-1930-13b-it) was fine-tuned on instruction-response data extracted from pre-1931 reference works.
  • The team has built a web chat experience for trying the model, aiming to make a conversational interface possible even though the knowledge cutoff is historical.
  • The accompanying report raises research questions such as how well the model can predict future events, whether it can "invent" knowledge beyond its cutoff, and how much programming it can learn from examples.
  • Because pre-1931 training data likely falls outside copyright in the US, the author hopes the base model's training data will itself be released in the future.

28th April 2026 - Link Blog

Introducing talkie: a 13B vintage language model from 1930 (via) New project from Nick Levine, David Duvenaud, and Alec Radford (of GPT, GPT-2, Whisper fame).

talkie-1930-13b-base (53.1 GB) is a "13B language model trained on 260B tokens of historical pre-1931 English text".

talkie-1930-13b-it (26.6 GB) is a checkpoint "finetuned using a novel dataset of instruction-response pairs extracted from pre-1931 reference works", designed to power a chat interface. You can try that out here.

Both models are Apache 2.0 licensed. Since the training data for the base model is entirely out of copyright (the USA copyright cutoff date is currently January 1, 1931), I'm hoping they later decide to release the training data as well.

Their report suggests some fascinating research objectives for this class of model, including:

  • How good are these models at predicting the future? "we calculated the surprisingness of short descriptions of historical events to a 13B model trained on pre-1931 text"
  • Can these models invent things that are past their knowledge cutoffs? "As Demis Hassabis has asked, could a model trained up to 1911 independently discover General Relativity, as Einstein did in 1915?"
  • Can they be taught to program? "Figure 3 (left-hand side) shows an early example of such a test, measuring how well models trained on pre-1931 text can, when given a few demonstration examples of Python programs, write new correct programs."
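The "surprisingness" measure is presumably average per-token negative log-likelihood under the model. The idea can be illustrated with a toy bigram language model — my sketch, not the team's actual methodology, which uses the 13B model itself:

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and bigrams over a token list."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def surprisal(text, unigrams, bigrams, vocab_size):
    """Average negative log2-likelihood per bigram, add-one smoothed.
    Higher = more surprising to the model."""
    words = text.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        total += -math.log2(p)
    return total / max(len(words) - 1, 1)

# A toy "pre-1931" corpus
corpus = "the motor car is a fine machine the aeroplane is a fine machine".split()
uni, bi = train_bigram(corpus)
V = len(set(corpus))

# An era-appropriate phrase should score lower than an anachronism
print(surprisal("the motor car is a fine machine", uni, bi, V))
print(surprisal("the neural network is a fine machine", uni, bi, V))
```

Descriptions of events the model's training era could plausibly anticipate should come out less surprising than ones it couldn't — the same comparison the report runs at 13B scale.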

I have a long-running interest in what I call "vegan models" - LLMs that are trained entirely on licensed or out-of-copyright data. I think the base model here qualifies, but the chat model isn't entirely pure due to the reliance on non-vegan models to help with the fine-tuning - emphasis mine:

First, we generated instruction-response pairs from historical texts with regular structure, such as etiquette manuals, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and poetry and fable collections (see Figure 7), and fine-tuned our base model on them using a simple chat format.

Next, to improve instruction-following abilities, we generated synthetic prompts covering different types of tasks, such as summarizing documents, responding to direct information requests, and continuing multi-turn conversations coherently. We then ran online direct preference optimization on rollouts generated from these prompts, using Claude Sonnet 4.6 as a judge. [...]

Finally, we did another round of supervised fine-tuning, this time on rejection-sampled multi-turn synthetic chats between Claude Opus 4.6 and talkie, to smooth out persistent rough edges in its conversational abilities.
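That rejection-sampling step presumably means generating several candidate replies and keeping only those the judge scores highly. Here is a minimal sketch of the pattern with stand-in sampler and judge functions (the real pipeline pairs talkie with Claude Opus 4.6; everything below is illustrative):

```python
import itertools

# Canned replies standing in for samples from the chat model
CANDIDATES = [
    "I cannot say.",
    "Pray, allow me to explain at some length.",
    "The answer, in brief, is as follows.",
]

def make_sampler():
    """Stand-in generator: cycles deterministically through CANDIDATES
    (the real pipeline would sample from talkie itself)."""
    pool = itertools.cycle(CANDIDATES)
    return lambda prompt: next(pool)

def judge(prompt, reply):
    """Stand-in judge: score a reply between 0 and 1. The real judge
    was a modern LLM; here we naively prefer longer replies."""
    return min(len(reply) / 50, 1.0)

def rejection_sample(prompt, sample, k=6, threshold=0.5):
    """Draw k candidate replies, discard those the judge scores
    below the threshold, and return the highest-scoring survivor."""
    kept = [(judge(prompt, r), r) for r in (sample(prompt) for _ in range(k))]
    kept = [(s, r) for s, r in kept if s >= threshold]
    return max(kept)[1] if kept else None

best = rejection_sample("How does one address a duchess?", make_sampler())
print(best)  # the longest canned reply wins under this judge
```

The surviving transcripts then become supervised fine-tuning data — which is exactly how modern-model judgments leak stylistic fingerprints into the chat model.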

A major challenge in training talkie was avoiding accidental contamination, either from post-1931 text that snuck into the training corpus or from the fine-tuning assistance from modern LLMs introducing anachronistic knowledge to the chat model.
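One crude but illustrative defense against the first kind of contamination — my sketch, not necessarily their pipeline — is to scan each candidate document for four-digit years and drop anything mentioning a year at or past the cutoff:

```python
import re

CUTOFF_YEAR = 1931

def mentions_post_cutoff_year(text, cutoff=CUTOFF_YEAR):
    """True if the text contains a four-digit year at or past the cutoff.
    Crude by design: it catches explicit dates, but not anachronistic
    concepts smuggled in without a date attached."""
    years = re.findall(r"\b(1[89]\d\d|20\d\d)\b", text)
    return any(int(y) >= cutoff for y in years)

docs = [
    "Patented in 1928, the device enjoyed brief popularity.",
    "By 1945 the war had ended.",            # post-cutoff: dropped
    "A treatise on letter-writing, 1911.",
]
clean = [d for d in docs if not mentions_post_cutoff_year(d)]
print(len(clean))  # 2: the 1928 and 1911 documents survive
```

The second contamination channel — modern judge models nudging the chat model toward anachronistic habits — has no such mechanical filter, which is why the team wants era-appropriate judges.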

I recently wrote about Mr. Chatterbox, a similar vintage language model project which also had to turn to modern LLMs to create synthetic conversations to help train the chat model.

The talkie team aspire to eventually move beyond this limitation:

While we have tried to post-train talkie free from modern influence, reinforcement learning with AI feedback inevitably shapes talkie’s behavior anachronistically. (The 7B version of talkie emerged from RL speaking in listicles.) As we scale up, we hope to be able to use our vintage base models themselves as judges to enable a fully bootstrapped era-appropriate post-training pipeline.

I couldn't resist hitting the talkie demo with my classic test prompt:

Generate an SVG of a pelican riding a bicycle

It replied:

An SVG of a pelican riding a bicycle was generated in 1860. The pelican was represented as perched on the saddle, with its long bill pointed forward, and its legs astride the handles. The idea was suggested by the fact that pelicans are sometimes seen fishing on horseback, on the banks of the Rhine.

Posted 28th April 2026 at 2:47 am


Tags: ai, generative-ai, local-llms, llms, training-data, ai-ethics, llm-release
