[P] ClaudeFormer: Building a Transformer Out of Claudes — Collaboration Request

Reddit r/MachineLearning / 3/27/2026


Key Points

  • The post proposes “ClaudeFormer,” a multi-agent framework that emulates a Transformer architecture by using multiple Anthropic Claude agents to implement attention (router), MLP computation (workers), and value passing (source-written updates).
  • In the design, each worker Claude maintains its own persistent “residual.md” knowledge state, allowing computation and recovery even if conversation context is compacted or restarted.
  • The attention head router does not read full worker states; instead it ingests short 2–5K token summaries containing each worker’s queries and keys to decide which workers should exchange information.
  • The system’s verification is integrated into routing: when proof claims appear, the router dispatches the claim to an adversary worker for attempt/disproof rather than running a separate verification phase.
  • The author argues the approach can approximate a single model with much larger effective context by coordinating many smaller-context agents (e.g., 30 agents with 1M tokens each to simulate 30M tokens) for frontier math research.

I'm looking to work with people interested in math, machine learning, or agentic coding, on creating a multi-agent framework to do frontier math research. My core idea is to build the transformer architecture out of Claudes.

The Architecture (Q/K/V Attention)

Concretely, the setup consists of a single Claude controlling a team of Claudes through the agent-teams feature. The team leader is the "attention head" Claude, and the team members are the MLP Claudes.

ClaudeFormer maps precisely to the transformer. Each component has a direct analog:

Workers = MLPs. Each worker Claude maintains a single file — their `residual.md` — which contains their complete knowledge state. Every computation, every pattern, every conjecture, every proof attempt. The worker reads this file, thinks, does Macaulay2 computations, and updates the file. The file IS the state. If the worker's conversation context gets compacted or even restarted from scratch, they can fully recover by re-reading their residual file. Workers are effectively stateless functions applied to their residual stream.
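To make the "stateless function over a residual stream" idea concrete, here is a minimal sketch of one worker turn. The file names (`residual.md`, `inbox.md`) follow the post; the `run_claude` callable is a hypothetical stand-in for whatever agent API drives the worker:

```python
from pathlib import Path

def worker_turn(worker_dir: Path, run_claude) -> str:
    """One cycle: read full state + inbox, think, rewrite state, emit summary."""
    residual = worker_dir / "residual.md"
    inbox = worker_dir / "inbox.md"

    state = residual.read_text() if residual.exists() else ""
    messages = inbox.read_text() if inbox.exists() else ""

    # The agent sees ONLY these files, so a compacted or restarted session
    # recovers fully by re-reading them.
    new_state, summary = run_claude(state=state, inbox=messages)

    residual.write_text(new_state)   # the file IS the state
    inbox.write_text("")             # consume routed values
    return summary                   # short Keys/Queries digest for the router
```

Because the function takes no hidden state, any crashed worker can be replaced by a fresh session pointed at the same directory.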

The Router = Attention Mechanism. One Claude acts as the attention head. But critically, it does NOT read everyone's full residual files (which can run up to 350K tokens each, about as much context as a single Claude can actually use). Instead, each worker publishes a short summary (2-5K tokens) containing their Keys ("what I found") and Queries ("what I need"). The router reads all summaries and computes attention: "Worker 1's finding X answers Worker 7's question Y." Then it dispatches.
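As a toy sketch of the dispatch step, here is a keyword matcher standing in for the attention-head Claude. The `KEY:`/`QUERY:` line format in the summaries is my assumption, not something the post specifies:

```python
def route(summaries: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (source, target, topic) dispatches: a source's Key answers a target's Query."""
    keys: dict[str, list[str]] = {}
    queries: dict[str, list[str]] = {}
    for worker, text in summaries.items():
        for line in text.splitlines():
            if line.startswith("KEY:"):
                keys.setdefault(worker, []).append(line[4:].strip().lower())
            elif line.startswith("QUERY:"):
                queries.setdefault(worker, []).append(line[6:].strip().lower())

    dispatches = []
    for tgt, qs in queries.items():
        for q in qs:
            for src, ks in keys.items():
                # crude "attention score": substring overlap between Query and Key
                if src != tgt and any(q in k or k in q for k in ks):
                    dispatches.append((src, tgt, q))
    return dispatches
```

In the real system the matching would be done by the attention-head Claude reading all summaries in one context; the point of the sketch is only the data flow (summaries in, dispatch instructions out).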

Values = Written by the Source. Once the router has decided to share some information between workers, it tells Worker 1: "Send your Hasse invariant data to Worker 7 — they need it." Worker 1 then writes the Value directly to Worker 7's inbox file, tailored to what Worker 7 needs. The source writes the value because they understand their own work best.
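The source-writes-the-value step is just an append to the target's inbox file. A minimal sketch, assuming one directory per worker with an `inbox.md` (the layout is my assumption):

```python
from pathlib import Path

def send_value(root: Path, source: str, target: str, value: str) -> None:
    """Source worker writes a tailored Value into the target's inbox."""
    inbox = root / target / "inbox.md"
    inbox.parent.mkdir(parents=True, exist_ok=True)
    with inbox.open("a") as f:          # append; never overwrite earlier values
        f.write(f"\n## From {source}\n{value}\n")
```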

Verification = A Routing Decision. When the router sees a proof claim in a worker's summary, it routes it to an adversary: "Worker 1, send your proof to Worker 5. Worker 5, try to disprove it." This happens organically during dispatch, not in a special verification phase.
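Verification-as-routing can be sketched as one extra rule inside the dispatch loop. The trigger phrase (`PROOF:`) and the fixed adversary are illustrative assumptions:

```python
def adversarial_dispatches(summaries: dict[str, str], adversary: str) -> list[tuple[str, str, str]]:
    """Route any proof claim to an adversary worker for a disproof attempt."""
    return [
        (worker, adversary, "attempt to disprove this proof claim")
        for worker, text in summaries.items()
        if "PROOF:" in text and worker != adversary
    ]
```

Because this is just another dispatch, it happens in the same cycle as ordinary Key/Query routing, with no separate verification phase.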

The Theory Behind Why I Expect It to Work

The ClaudeFormer is meant to emulate a single Claude working on the problem with a much larger context window. The idea is that you can simulate a Claude with 30,000,000 tokens of context with 30 different Claudes that each have 1,000,000 tokens of context.

The benefit of a 30,000,000-token context window is that across those tokens, Claude can explore tons of different questions and avenues of the problem. And if any previous idea connects to the idea currently being explored, that earlier work is in context, so the connection gets made. That is the actual benefit of a greatly increased context window: you can make connections across all of it.

However, a reasonable heuristic is that only about 10% of the information already in your context window is relevant to your current subproblem. So where a 30,000,000-token Claude holds all of that information at once, you can emulate it with a router that selectively forwards the relevant information and keeps the irrelevant information out of context.

The two-level information hierarchy makes this scale. Workers maintain huge private files (350K tokens), but communicate through short summaries (2-5K tokens). The router reads N summaries, not N full files. With 30 workers at 5K tokens each, the router reads 150K tokens — well within a single context window.
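A quick back-of-envelope check of the numbers above, showing why the router reads summaries rather than residual files:

```python
WORKERS = 30
SUMMARY_TOKENS = 5_000       # per-worker Keys/Queries digest (upper end of 2-5K)
RESIDUAL_TOKENS = 350_000    # per-worker private residual.md

router_load = WORKERS * SUMMARY_TOKENS    # what the router actually reads
naive_load = WORKERS * RESIDUAL_TOKENS    # if it read every full residual file

print(router_load, naive_load)  # 150000 10500000
```

150K tokens fits comfortably in one context window; 10.5M tokens does not, which is the whole point of the two-level hierarchy.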

Scaling Dimensions

The architecture scales in three ways:

  1. Context Window: How many Claudes you have working in parallel. Each worker's residual file can hold up to 350K tokens of research context, so more workers means more total effective context and more directions explored simultaneously.

  2. Depth: How many dispatch cycles you run. Each cycle: workers think → publish summaries → router dispatches → workers exchange values → workers integrate and continue. More depth = more cross-pollination of ideas. The limit here is how long a residual-stream document a single Claude can actually use.

  3. Attention heads: With 90 worker Claudes, a single attention head cannot read all 90 summaries effectively. Instead, run multiple attention heads, each attending to a subgroup, and organize the subgroups so that every pair of workers shares at least one attention head.
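One simple way to satisfy the "every pair shares a head" condition, sketched under my own assumptions (the post doesn't specify a construction): partition workers into blocks and give each pair of blocks its own head. Two workers in the same block share every head containing that block; two workers in different blocks share the head for that block pair.

```python
from itertools import combinations

def pairwise_covering_heads(n_workers: int, block_size: int) -> list[list[int]]:
    """Heads = unions of block pairs; every worker pair lands in some head."""
    blocks = [list(range(i, min(i + block_size, n_workers)))
              for i in range(0, n_workers, block_size)]
    if len(blocks) == 1:
        return blocks
    return [a + b for a, b in combinations(blocks, 2)]

# 90 workers, blocks of 15 -> 6 blocks -> C(6,2) = 15 heads of 30 workers each.
heads = pairwise_covering_heads(90, 15)
```

At 5K tokens per summary, each head reads 30 × 5K = 150K tokens, which again fits in one context window. Denser covering designs exist, but this one is easy to reason about.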

What I Bring / Collaboration Request

What I bring: I'm an agentic engineer with extensive Claude Code experience, and I'm nearly done with a master's in computer science. I also have a friend currently getting a math PhD who's interested in the project; he can verify our results and act as a good heuristic for whether the system is actually making progress. And as far as I can tell from a literature search, this type of architecture hasn't been explored.

Leave a comment or DM if this sounds interesting to work on!

submitted by /u/Independent-Soft2330