PQuantML: A Tool for End-to-End Hardware-aware Model Compression

arXiv cs.LG / March 30, 2026


Key Points

  • PQuantML is introduced as a new open-source, hardware-aware library for end-to-end neural network model compression focused on meeting strict latency constraints in deployment environments.
  • The tool provides a unified workflow to apply pruning and fixed-point quantization either jointly or separately, including support for high-granularity quantization.
  • It includes multiple pruning techniques with different granularities and is designed to simplify training compressed models without requiring separate toolchains.
  • Experiments on jet substructure classification (jet tagging), an edge-deployment problem in real-time LHC data processing, show substantial reductions in parameter counts and bit-widths while preserving accuracy.
  • The paper compares PQuantML’s compression results against existing approaches like QKeras and HGQ.
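To make the two compression steps concrete, here is a minimal NumPy sketch of unstructured magnitude pruning followed by signed fixed-point quantization, the techniques PQuantML combines in its unified workflow. This is illustrative only and does not use PQuantML's actual API; the function names and bit-width parameters are assumptions for the example.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude weights (unstructured pruning).
    Illustrative helper, not part of PQuantML's API."""
    k = int(np.ceil(sparsity * w.size))
    if k == 0:
        return w.copy()
    # Threshold at the k-th smallest absolute value.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

def fixed_point_quantize(w, total_bits=8, frac_bits=6):
    """Round onto a signed fixed-point grid with `frac_bits` fractional bits,
    clipping to the representable range of `total_bits`."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale

# Apply both steps jointly to a toy weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
w_compressed = fixed_point_quantize(magnitude_prune(w, sparsity=0.5))
```

In a real joint workflow these operations run inside the training loop (e.g. via quantization-aware training), so the network can recover accuracy lost to rounding and sparsification; the sketch above only shows the post-hoc transforms.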

Abstract

PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models to environments with strict latency constraints, PQuantML simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, either jointly or individually. The library implements multiple pruning methods with different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as jet substructure classification (so-called jet tagging), an edge-deployment problem arising in real-time LHC data processing. Using various pruning methods with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools such as QKeras and HGQ.