INT3 compression + fused Metal kernels [R]

Reddit r/MachineLearning / 4/22/2026


Key Points

  • The researcher compresses models using INT3 quantization (reporting +0.14 nats) and pairs this with a newly built 2-bit KV cache to better support long-horizon tasks.
  • They have shipped an INT3-compressed model together with an INT2 KV cache implementation via custom fused Metal kernels optimized for Apple Silicon (M-series) Macs.
  • A Qwen 7B model is currently available in preview using this approach.
  • The project continues to optimize the kernels and is working on Triton-based GPU kernels for broader hardware support, with additional models planned.
  • The author invites feedback and asks the community which models (up to ~100B parameters) they should compress next, providing the Spiral repo for access and installation.

Hey guys, I am a researcher and solo founder. I compress models with INT3 at +0.14 nats and built a 2-bit KV cache for long-horizon tasks. I have shipped both (INT3 model + INT2 KV cache) with custom fused Metal kernels for Apple Silicon (M-series) Macs. Currently, Qwen 7B is available in preview.
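For readers unfamiliar with low-bit quantization, here is a minimal sketch of what group-wise 3-bit quantization generally looks like. This is not the author's actual method (which is unpublished here); the function names, group size, and symmetric 8-level mapping are all illustrative assumptions.

```python
# Hypothetical sketch of group-wise 3-bit quantization.
# NOT the Spiral implementation; group_size and the symmetric
# level mapping are assumptions for illustration only.

def quantize_int3(weights, group_size=32):
    """Map floats to 3-bit codes (0..7) with one scale per group."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # absmax scale so the largest value lands on the outermost level
        scale = max(abs(w) for w in group) / 3.5 or 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale + 3.5)  # levels -3.5..+3.5 -> codes 0..7
            codes.append(min(7, max(0, q)))
    return codes, scales

def dequantize_int3(codes, scales, group_size=32):
    """Recover approximate floats from 3-bit codes and group scales."""
    return [(q - 3.5) * scales[i // group_size] for i, q in enumerate(codes)]
```

With an absmax scale like this, the round-trip error per weight is bounded by half the quantization step (scale / 2), which is the kind of error that shows up downstream as a small loss increase measured in nats.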

```
# install
brew install reinforceai/spiral/spiral

# chat
spiral-chat
```

I am optimizing the kernels further and working on Triton kernels for GPU support. There is still room to pack things more efficiently, and I will share more models soon. I would appreciate any feedback, and let me know which models (up to ~100B parameters) you want me to compress.
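To make the "packing" idea concrete: 2-bit codes (as in an INT2 KV cache) fit four to a byte. The sketch below shows the general bit-packing technique; the actual Spiral memory layout is not described in the post, so this layout is purely an assumption.

```python
# Illustrative 2-bit packing, four codes per byte, lowest bits first.
# NOT the Spiral kernel layout; shown only to demonstrate the technique.

def pack_int2(codes):
    """Pack 2-bit codes (0..3) into bytes, four per byte."""
    packed = bytearray()
    for start in range(0, len(codes), 4):
        byte = 0
        for i, q in enumerate(codes[start:start + 4]):
            byte |= (q & 0b11) << (2 * i)  # slot i occupies bits 2i..2i+1
        packed.append(byte)
    return bytes(packed)

def unpack_int2(packed, n):
    """Recover n 2-bit codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]
```

In a fused kernel, the unpack step would happen in registers right before the attention math, so the cache stays at 2 bits per entry in memory, a 8x reduction versus FP16.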

github.com/ReinforceAI/spiral

submitted by /u/Financial_Buy_2287