I'm curious how much output token throughput would benefit from something smaller like a 16GB Tesla T4, offloading the remainder of the model to RAM.
I get about ~1.6 t/s output and ~20 t/s input CPU-only, which is obviously terrible. I'm using NUMA on dual Xeon Platinum CPUs, 24 cores each (so 48c/96t total), with 1.5TB of RAM.
Strangely enough, the Q8 model from Unsloth runs slightly faster than the Q4 model on my system.
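For the partial-offload idea, here's a rough sketch of how it might look, assuming the runtime is llama.cpp (the poster never says); the model path and the `-ngl` value are placeholders you'd tune to fit the T4's VRAM:

```shell
# Sketch of a llama.cpp run offloading part of the model to the GPU.
# Model filename and layer count are hypothetical placeholders.
./llama-cli -m ./model-Q4_K_M.gguf \
    -ngl 20 \
    --numa distribute \
    -t 48 \
    -p "Hello"
# -ngl 20           offload ~20 layers to the GPU; the rest stays in system RAM
# --numa distribute spread allocations evenly across both NUMA nodes
# -t 48             one thread per physical core often beats using all 96 SMT threads
```

Raise `-ngl` until VRAM is nearly full; layers left on the CPU still bottleneck on RAM bandwidth, so the speedup scales roughly with the fraction of layers offloaded.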




