FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

arXiv cs.LG · April 24, 2026


Key Points

  • The paper introduces FairyFuse, a CPU-focused LLM inference system that eliminates floating-point multiplications by using ternary weights in {-1, 0, +1}, replacing multiplies with masked additions/subtractions or no-ops.
  • FairyFuse fuses the eight real-valued sub-GEMVs from each widely-linear layer into a single AVX-512 loop, enabling “multiplication-free” execution on commodity CPUs.
  • Roofline analysis suggests that 16x weight compression shifts the bottleneck from memory bandwidth toward compute on bandwidth-limited CPUs; the authors report a 29.6x speedup at the kernel level.
  • End-to-end benchmarks show 32.4 tokens/second on a single Intel Xeon 8558P, improving over llama.cpp Q4_K_M by 1.24x while maintaining near-lossless quality (e.g., WikiText-2 perplexity 5.52 vs 5.47 FP16).
  • The approach delivers little benefit on GPUs, indicating the optimization is specifically tuned for CPU memory-bandwidth constraints.
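The core trick behind the key points above, replacing each multiply by a ternary weight with a conditional add, subtract, or no-op, can be sketched in plain scalar code (function names and shapes are illustrative, not from the paper; the real kernel packs weights and uses AVX-512 masked vector instructions):

```python
# Sketch: multiplication-free ternary matrix-vector product.
# With weights restricted to {-1, 0, +1}, y[i] += w * x[j] degenerates
# into a conditional addition, subtraction, or no-op -- no floating-point
# multiplications are ever executed.

def ternary_gemv(weights, x):
    """weights: rows of ternary ints in {-1, 0, +1}; x: input vector."""
    y = []
    for row in weights:
        acc = 0.0
        for w, xj in zip(row, x):
            if w == 1:        # masked addition
                acc += xj
            elif w == -1:     # masked subtraction
                acc -= xj
            # w == 0: no-op, the lane is skipped entirely
        y.append(acc)
    return y

W = [[1, 0, -1], [-1, 1, 1]]
x = [0.5, 2.0, 1.5]
print(ternary_gemv(W, x))  # → [-1.0, 3.0], same as the ordinary dot product
```

In the actual FairyFuse kernel, the branchy `if`/`elif` per element is replaced by AVX-512 mask registers, so the add and subtract lanes of a whole vector are selected in one instruction.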

Abstract

Large language models are increasingly deployed on CPU-only platforms where memory bandwidth is the primary bottleneck for autoregressive generation. Weight quantization to four bits or below reduces memory pressure, yet existing systems still dequantize weights and perform floating-point multiplications, limiting the achievable gains. Ternary weights in {-1, 0, +1} provide a more efficient alternative, replacing multiplications with conditional additions, subtractions, or no-ops. While Fairy2i shows that ternary LLMs can match FP16 quality, its runtime does not exploit this structure. We present FairyFuse, an inference system that enables multiplication-free execution on commodity CPUs by fusing the eight real-valued sub-GEMVs of each widely-linear layer into a single AVX-512 loop using masked additions and subtractions, with zero floating-point multiplications. Roofline analysis shows that 16x weight compression shifts memory-bound GEMV toward the compute regime on bandwidth-limited CPUs, yielding a 29.6x kernel speedup while offering little benefit on GPUs. End-to-end, FairyFuse achieves 32.4 tokens per second on a single Intel Xeon 8558P, outperforming llama.cpp Q4_K_M by 1.24x with near-lossless quality (WikiText-2 perplexity 5.52 vs. 5.47 FP16; downstream accuracy 66.0%).
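The roofline argument in the abstract can be made concrete with a back-of-envelope arithmetic-intensity calculation (illustrative numbers only, assuming ~1 bit per packed ternary weight, consistent with the stated 16x compression from FP16):

```python
# Sketch: why 16x weight compression shifts GEMV toward the compute regime.
# Arithmetic intensity = operations per byte of weights streamed from memory.
# During autoregressive decoding each weight is read once per token, so
# higher intensity means the memory-bandwidth roof matters less.

def arithmetic_intensity(ops_per_weight, bits_per_weight):
    return ops_per_weight / (bits_per_weight / 8.0)

fp16_ai = arithmetic_intensity(2, 16)  # mul + add per 2-byte FP16 weight
tern_ai = arithmetic_intensity(2, 1)   # cond add/sub per ~1-bit ternary weight

print(fp16_ai, tern_ai, tern_ai / fp16_ai)  # → 1.0 16.0 16.0
```

The 16x jump in intensity mirrors the 16x compression: on a bandwidth-limited CPU this moves the kernel off the memory roof, which is where the reported 29.6x kernel speedup comes from, while a GPU with ample bandwidth sees little benefit.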