Setting aside the 8-bit size of Nvidia’s marketed 4-bit quantization of the dense model…
The dense model’s KV cache uses 3x or more memory compared with other models I’ve seen. The big architectural choice seems to be a 256 head dim instead of 128.
I am looking at 490KB per token of 8-bit KV cache, versus 128KB on Qwen3.
I am running the Nvidia weights at 4-bit on an RTX PRO 6000 with 96GB of VRAM and an 8-bit KV cache, and I still only have room for 115k tokens of context.
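For anyone wanting to check the math: per-token KV cache size is just 2 (K and V) × layers × KV heads × head dim × bytes per element. A quick sketch, where the Qwen3-like config (64 layers, 8 KV heads, head dim 128) and the VRAM budget are my assumptions, not exact published figures:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # K and V each store layers * kv_heads * head_dim elements per token
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Assumed Qwen3-like config at 8-bit (1 byte/element): 64 layers, 8 KV heads, head dim 128
qwen_like = kv_bytes_per_token(64, 8, 128, 1)
print(qwen_like // 1024, "KiB per token")  # 128 KiB, matching the figure above

# Doubling head dim to 256 alone doubles the per-token cost; more layers
# or more KV heads multiply it further.

# Rough context budget: hypothetical VRAM left over after the 4-bit weights,
# divided by the ~490KB/token figure
budget = 56 * 1024**3
per_token = 490 * 1024
print(budget // per_token, "tokens fit")
```

The takeaway is that per-token cost scales linearly in each of those four factors, so a 256 head dim plus a deeper stack compounds quickly.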
I was surprised, is all. The model scales well in vLLM and seems quite smart.

