Decoupled Attention from Weights - Gemma 4 26B

Reddit r/LocalLLaMA / 5/6/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post highlights a method called “decoupled attention” that splits the attention state from the model weights, placing attention on one local machine and weights on another.
  • This approach aims to bypass the usual scaling constraints of running large local LLMs by distributing compute and memory requirements across multiple machines.
  • The post is framed around “Gemma 4 26B” and links to a repository with functional code and to a video overview.
  • The author frames the work as exciting because it makes local deployment more feasible by reducing the resource bottleneck that weights typically create on a single device.

Absolutely unbelievably exciting work: split the attention state (i.e. a couple of GB) onto one local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scaling issue with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
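
For anyone who wants a mental model of the split, here's a minimal, single-head, single-layer sketch (my own illustration, not the larql API; all class and function names here are hypothetical). The weight node owns the big projection matrices, the attention node keeps only the KV cache and the softmax math, and in a real deployment the `project_*` calls would cross the network:

```python
# Conceptual sketch of decoupled attention -- NOT the larql API.
# Assumption: the weight node exposes matmul-style calls; the attention
# node holds only the KV cache (the "couple of GB" in the post's framing).

import numpy as np

class WeightNode:
    """Runs on the machine holding the full model weights (e.g. a cheap Xeon)."""
    def __init__(self, d_model: int, rng):
        # Hypothetical single-layer weights; a real model has many layers.
        self.w_qkv = rng.standard_normal((d_model, 3 * d_model)) * 0.02
        self.w_out = rng.standard_normal((d_model, d_model)) * 0.02

    def project_qkv(self, x: np.ndarray) -> np.ndarray:
        # In a real deployment this call crosses the network (RPC / sockets).
        return x @ self.w_qkv

    def project_out(self, attn: np.ndarray) -> np.ndarray:
        return attn @ self.w_out

class AttentionNode:
    """Runs locally: holds only the growing KV cache, never the weights."""
    def __init__(self, weights: WeightNode, d_model: int):
        self.weights = weights
        self.d_model = d_model
        self.k_cache: list[np.ndarray] = []
        self.v_cache: list[np.ndarray] = []

    def step(self, x: np.ndarray) -> np.ndarray:
        # Ask the weight node for the Q/K/V projections of the new token...
        q, k, v = np.split(self.weights.project_qkv(x), 3)
        self.k_cache.append(k)
        self.v_cache.append(v)
        ks = np.stack(self.k_cache)   # (seq, d)
        vs = np.stack(self.v_cache)   # (seq, d)
        # ...but do the stateful attention math locally.
        scores = ks @ q / np.sqrt(self.d_model)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        attn = probs @ vs
        return self.weights.project_out(attn)

rng = np.random.default_rng(0)
d = 64
node = AttentionNode(WeightNode(d, rng), d)
for _ in range(4):                    # decode a few tokens
    out = node.step(rng.standard_normal(d))
print(out.shape)                      # (64,)
```

The point of the design is bandwidth: only activation vectors (a few KB per token) move between machines, while the tens of GB of weights stay put on the box with cheap RAM.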

edit: just found https://www.youtube.com/watch?v=1jGR4zqpyKA for an excellent overview of what's happening here.

submitted by /u/yeah-ok