Decoupled Attention from Weights - Gemma 4 26B

Reddit r/LocalLLaMA / 5/6/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post highlights a method called “decoupled attention” that splits the attention state from the model weights, placing attention on one local machine and weights on another.
  • This approach aims to bypass the usual scaling constraints of running large local LLMs by distributing compute and memory requirements across multiple machines.
  • The post is framed around “Gemma 4 26B” and links to a repository with functional code and to a video overview.
  • The author frames the work as exciting because it makes local deployment more feasible by reducing the resource bottleneck that weights typically create on a single device.

Absolutely unbelievably exciting work: split the attention state (i.e. a couple of GB) onto one local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scaling issue with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
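
For anyone who wants a mental model of the split, here's a minimal, single-head, single-layer sketch (my own illustration, not the larql API; all class and function names here are hypothetical). The weight node owns the big projection matrices, the attention node keeps only the KV cache and the softmax math, and in a real deployment the `project_*` calls would cross the network:

```python
# Conceptual sketch of decoupled attention -- NOT the larql API.
# Assumption: the weight node exposes matmul-style calls; the attention
# node holds only the KV cache (the "couple of GB" in the post's framing).

import numpy as np

class WeightNode:
    """Runs on the machine holding the full model weights (e.g. a cheap Xeon)."""
    def __init__(self, d_model: int, rng):
        # Hypothetical single-layer weights; a real model has many layers.
        self.w_qkv = rng.standard_normal((d_model, 3 * d_model)) * 0.02
        self.w_out = rng.standard_normal((d_model, d_model)) * 0.02

    def project_qkv(self, x: np.ndarray) -> np.ndarray:
        # In a real deployment this call crosses the network (RPC / sockets).
        return x @ self.w_qkv

    def project_out(self, attn: np.ndarray) -> np.ndarray:
        return attn @ self.w_out

class AttentionNode:
    """Runs locally: holds only the growing KV cache, never the weights."""
    def __init__(self, weights: WeightNode, d_model: int):
        self.weights = weights
        self.d_model = d_model
        self.k_cache: list[np.ndarray] = []
        self.v_cache: list[np.ndarray] = []

    def step(self, x: np.ndarray) -> np.ndarray:
        # Ask the weight node for the Q/K/V projections of the new token...
        q, k, v = np.split(self.weights.project_qkv(x), 3)
        self.k_cache.append(k)
        self.v_cache.append(v)
        ks = np.stack(self.k_cache)   # (seq, d)
        vs = np.stack(self.v_cache)   # (seq, d)
        # ...but do the stateful attention math locally.
        scores = ks @ q / np.sqrt(self.d_model)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        attn = probs @ vs
        return self.weights.project_out(attn)

rng = np.random.default_rng(0)
d = 64
node = AttentionNode(WeightNode(d, rng), d)
for _ in range(4):                    # decode a few tokens
    out = node.step(rng.standard_normal(d))
print(out.shape)                      # (64,)
```

The point of the design is bandwidth: only activation vectors (a few KB per token) move between machines, while the tens of GB of weights stay put on the box with cheap RAM.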

edit: just found https://www.youtube.com/watch?v=1jGR4zqpyKA for an excellent overview of what's happening here.

submitted by /u/yeah-ok