Config
- CtxSize: 131,072
- GpuLayers: 99
- CpuMoeLayers: 38
- Threads: 16
- BatchSize/UBatchSize: 4096/4096
- CacheType K/V: q8_0
- Tool Context: file mode (tools.kilocode.official.md)
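
For anyone wanting to reproduce this, here is a rough sketch of how the settings above might map onto a launch command. It assumes the backend is llama.cpp's `llama-server`, which the post does not state. The model path `model.gguf` and the helper `build_server_args` are hypothetical names used only for illustration.

```python
# Sketch only: assumes the config maps onto llama.cpp's llama-server flags.
# The model path and the function name are placeholders, not from the post.
CONFIG = {
    "ctx_size": 131072,
    "gpu_layers": 99,
    "cpu_moe_layers": 38,
    "threads": 16,
    "batch_size": 4096,
    "ubatch_size": 4096,
    "cache_type_k": "q8_0",
    "cache_type_v": "q8_0",
}


def build_server_args(model_path, cfg):
    """Return an argv list for llama-server built from the config dict."""
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(cfg["ctx_size"]),
        "--n-gpu-layers", str(cfg["gpu_layers"]),
        "--n-cpu-moe", str(cfg["cpu_moe_layers"]),  # keep MoE experts of the first N layers on CPU
        "--threads", str(cfg["threads"]),
        "--batch-size", str(cfg["batch_size"]),
        "--ubatch-size", str(cfg["ubatch_size"]),
        "--cache-type-k", cfg["cache_type_k"],
        "--cache-type-v", cfg["cache_type_v"],
    ]


if __name__ == "__main__":
    # Print the command as a single shell-style line.
    print(" ".join(build_server_args("model.gguf", CONFIG)))
```
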
| Metric | M Model | XL Model | Difference |
|---|---|---|---|
| Avg Tokens/sec | 28.92 | 29.78 | +0.86 (+3.0%) |
| Median Tokens/sec | 30.96 | 32.08 | +1.12 (+3.6%) |
| Avg Wall Seconds | 108.03s | 99.93s | -8.10s (-7.5%) |
| Avg Output Tokens | 3,031.8 | 2,895.8 | -136 (-4.5%) |
| Avg Input Tokens/sec | 50.20 | 55.96 | +5.76 (+11.5%) |
| Avg Decode Tokens/sec | 75.89 | 76.44 | +0.55 (+0.7%) |
The first run is ~33% slower because my benchmark code has a bug that counts the initialization time, and as you know, for an MoE model the weights have to be loaded from storage into RAM first. Each model was run 5 times to try to cancel it out, but I still included the first run because that's how I would realistically use it (turning it on, using it once, turning it off to run something else, etc.).
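
To show how much a slow first run moves the averages, here is a minimal sketch that summarizes a set of runs with and without the cold run. The per-run numbers in `example_runs` are made up for illustration (one run about 33% slower than the rest) and are not the measured data. The names `summarize` and `percent_diff` are hypothetical helpers, not part of my benchmark code.

```python
from statistics import mean, median


def summarize(tokens_per_sec, skip_first=False):
    """Average and median tokens/sec, optionally dropping the cold first run."""
    runs = tokens_per_sec[1:] if skip_first else list(tokens_per_sec)
    if not runs:
        raise ValueError("no runs left to summarize")
    return {"avg": mean(runs), "median": median(runs), "runs": len(runs)}


def percent_diff(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative values only: a cold first run followed by four warm runs.
    example_runs = [20.0, 30.0, 30.0, 30.0, 30.0]
    print("all runs: ", summarize(example_runs))
    print("warm only:", summarize(example_runs, skip_first=True))
    # Same formula as the Difference column, using the Avg Tokens/sec row.
    print(f"M vs XL: {percent_diff(28.92, 29.78):+.1f}%")
```

With a cold run in the set, the median stays close to the warm speed while the average gets pulled down, which is the same pattern as the gap between the Avg and Median rows in the table.
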




