Over the last month I've been working on a custom architecture that fully replaces the residual stream transformers use with a structured workspace.
The goal isn't to claim "I beat transformers"; it's a thought experiment about what happens structurally when you enforce a workspace instead, and where the compute actually goes. The findings were a lot of fun to uncover and, I think, genuinely interesting.
CWT uses 22.9M of core compute (attn + FFN) vs 41.7M in the compute-matched baseline, and comes within 1.7% PPL: roughly a ~45% reduction in core compute for near-equivalent quality.
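As a quick sanity check, the ~45% figure follows directly from the two core-compute numbers quoted above (a minimal sketch; the 22.9M and 41.7M values are the ones stated in this post):

```python
# Headline numbers from the post: core compute (attention + FFN)
# for CWT vs. the compute-matched baseline.
cwt_core = 22.9e6       # CWT core compute
baseline_core = 41.7e6  # baseline core compute

# Relative reduction in core compute.
reduction = 1 - cwt_core / baseline_core
print(f"Core-compute reduction: {reduction:.1%}")  # ~45.1%
```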
The other thing a structured workspace gives you is full visibility into how the model operates on a per-token basis. You can watch and record it as 3D visuals, something standard transformers can't easily offer, if at all.
All code, model weights, and the paper are open source. This is my first proper research paper; feedback and ideas are very welcome.
Paper:
https://steel-skull.github.io/CWT-V5.6/
Model:
https://huggingface.co/Steelskull/CWT-V5.6
Model code:
https://github.com/Steel-skull/CWT-V5.6
PS: there were compute and monetary constraints on this project, as I was paying out of pocket, so please understand that some things are limited in scope.


