Anyone running Kimi on low VRAM + offloading to RAM? (I'm sure most are)

Reddit r/LocalLLaMA / 5/5/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit user tests running the Kimi model with limited GPU memory (e.g., a 12GB Tesla T4) by offloading the rest of the model to system RAM, aiming to understand the performance impact.
  • They report throughput results on a dual Xeon Platinum CPU setup (48 cores, 1.5TB RAM) with CPU-only execution reaching about 20 tokens/s input and 1.6 tokens/s output, described as very poor.
  • They mention using NUMA and observe an unexpected behavior: a Q8 model (from Unsloth) runs slightly faster than a Q4 model on their system.
  • The post focuses on practical benchmarking and questions about how quantization level and RAM offloading affect output token speed on low-VRAM hardware.
  • Overall, it highlights the performance trade-offs and tuning considerations for local LLM inference when VRAM is insufficient.

I'm curious how much output token speed benefits from something smaller, like a 12GB Tesla T4, with the remainder of the model offloaded to RAM.
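
For reference, this kind of partial offload is usually done by choosing how many transformer layers go to the GPU when the GGUF is loaded; everything else stays in system RAM and runs on the CPU. A minimal sketch with llama-cpp-python, assuming a hypothetical local quant file `kimi-q4.gguf` and a layer count tuned to whatever fits in ~12GB of VRAM:

```python
from llama_cpp import Llama

# Put only a few layers on the GPU; the rest of the weights stay in system RAM.
llm = Llama(
    model_path="kimi-q4.gguf",  # hypothetical path to a local GGUF quant
    n_gpu_layers=8,             # tune so the offloaded layers fit in ~12GB VRAM
    n_ctx=4096,
    n_threads=48,               # roughly match the physical core count
)

out = llm("Summarize NUMA in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The same idea applies to llama.cpp's CLI/server via its GPU-layer option; whatever is left in RAM remains bound by memory bandwidth rather than by the GPU, so the speedup depends on how much of the model actually fits on the card.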

I get about ~1.6 t/s output and ~20 t/s input CPU-only, which is obviously terrible. I'm using NUMA. I have dual Xeon Platinum 24-core CPUs (so 48c/96t) and 1.5TB of RAM.
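
A rough way to reproduce numbers like these is to time a completion and divide by the token counts the library reports. A minimal sketch, again with llama-cpp-python and a hypothetical `kimi-q8.gguf`, that approximates output tokens/s (llama.cpp's own verbose timings separate prompt eval from generation more precisely):

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="kimi-q8.gguf",  # hypothetical path to a local GGUF quant
    n_gpu_layers=0,             # CPU-only run, as in the post
    n_ctx=4096,
    n_threads=48,
)

start = time.perf_counter()
out = llm("Explain memory bandwidth bottlenecks in LLM inference.", max_tokens=128)
elapsed = time.perf_counter() - start

usage = out["usage"]
print(f"prompt tokens: {usage['prompt_tokens']}, output tokens: {usage['completion_tokens']}")
# Coarse estimate: for short prompts, generation dominates the elapsed time.
print(f"~{usage['completion_tokens'] / elapsed:.2f} output tokens/s")
```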

Strangely enough, the Q8 model from Unsloth runs slightly faster than the Q4 model on my system.

submitted by /u/Creative-Type9411