OneComp: One-Line Revolution for Generative AI Model Compression

arXiv cs.AI / 4/1/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The article introduces OneComp, an open-source framework aimed at making post-training generative AI model compression practical under real constraints like memory, latency, and hardware cost.
  • OneComp takes a model identifier and target hardware, automatically inspects the model, plans mixed-precision assignments, and runs progressive quantization stages from layer-wise compression to block-wise and global refinement.
  • A central design idea is using the first quantized checkpoint as a “deployable pivot,” so later stages consistently improve the same model and quality increases as more compute is invested.
  • The work targets the fragmentation problem in compression practice by turning a heterogeneous expert workflow (quantization algorithms, precision budgets, calibration, and hardware execution regimes) into a reproducible, resource-adaptive pipeline.
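The workflow in the points above can be sketched in code. OneComp's actual API is not shown in this summary, so every name, signature, and precision policy below is hypothetical; this is only a minimal illustration of the described flow (model identifier plus target hardware in, staged mixed-precision plan out).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CompressionPlan:
    """Illustrative container for a mixed-precision compression run."""
    model_id: str
    hardware: str
    precision: Dict[str, int] = field(default_factory=dict)  # layer -> bit width
    stages_run: List[str] = field(default_factory=list)

def inspect_model(model_id: str) -> List[str]:
    # Stand-in for automatic model inspection: enumerate quantizable layers.
    return [f"block{i}.linear" for i in range(4)]

def plan_mixed_precision(layers: List[str], hardware: str) -> Dict[str, int]:
    # Toy policy (not OneComp's): 4-bit where the target advertises INT4
    # kernels, otherwise fall back to 8-bit.
    bits = 4 if "int4" in hardware else 8
    return {name: bits for name in layers}

def compress(model_id: str, hardware: str) -> CompressionPlan:
    layers = inspect_model(model_id)
    plan = CompressionPlan(model_id, hardware, plan_mixed_precision(layers, hardware))
    # Progressive quantization: layer-wise compression first, then
    # block-wise and finally global refinement, as described in the article.
    for stage in ("layer-wise", "block-wise", "global"):
        plan.stages_run.append(stage)
    return plan

plan = compress("my-org/llm-7b", "gpu-int4")
```

The point of the sketch is the shape of the interface, not the internals: the caller supplies only an identifier and a hardware target, and the pipeline owns inspection, precision planning, and stage ordering.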

Abstract

Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven calibration strategies, and hardware-dependent execution regimes. We present OneComp, an open-source compression framework that transforms this expert workflow into a reproducible, resource-adaptive pipeline. Given a model identifier and available hardware, OneComp automatically inspects the model, plans mixed-precision assignments, and executes progressive quantization stages, ranging from layer-wise compression to block-wise and global refinement. A key architectural choice is treating the first quantized checkpoint as a deployable pivot, ensuring that each subsequent stage improves the same model and that quality increases as more compute is invested. By converting state-of-the-art compression research into an extensible, open-source, hardware-aware pipeline, OneComp bridges the gap between algorithmic innovation and production-grade model deployment.
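The "deployable pivot" invariant from the abstract can be made concrete with a small sketch: the first quantized checkpoint is already usable, and each later refinement stage replaces it only if quality improves, so quality never regresses as more compute is spent. The function and the stage-gain numbers below are purely illustrative assumptions, not OneComp code.

```python
from typing import List

def refine_with_pivot(initial_quality: float, stage_gains: List[float]) -> List[float]:
    """Track pivot quality across refinement stages (illustrative only).

    initial_quality: quality of the first quantized checkpoint (the pivot).
    stage_gains: per-stage quality deltas a refinement stage would produce.
    Returns the pivot's quality after each stage.
    """
    pivot_quality = initial_quality  # first checkpoint is already deployable
    history = [pivot_quality]
    for gain in stage_gains:
        candidate = pivot_quality + gain
        # Commit only improvements: a bad stage leaves the pivot untouched,
        # so the deployed model is monotone in invested compute.
        if candidate > pivot_quality:
            pivot_quality = candidate
        history.append(pivot_quality)
    return history
```

For example, with a stage that would have hurt quality (a negative gain), the pivot simply keeps its previous checkpoint and later stages still build on the same model.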