PQuantML: A Tool for End-to-End Hardware-aware Model Compression

arXiv cs.LG / March 30, 2026


Key Points

  • PQuantML is introduced as a new open-source, hardware-aware library for end-to-end neural network model compression focused on meeting strict latency constraints in deployment environments.
  • The tool provides a unified workflow to apply pruning and fixed-point quantization either jointly or separately, including support for high-granularity quantization.
  • It includes multiple pruning techniques with different granularities and is designed to simplify training compressed models without requiring separate toolchains.
  • Experiments on jet substructure classification (jet tagging), an edge-deployment problem in real-time LHC data processing, show substantial reductions in parameter counts and bit-widths while preserving accuracy.
  • The paper compares PQuantML’s compression results against existing approaches like QKeras and HGQ.
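To make the two compression steps concrete, here is a minimal NumPy sketch of unstructured magnitude pruning followed by signed fixed-point quantization, the techniques PQuantML combines in its unified workflow. This is illustrative only and does not use PQuantML's actual API; the function names and bit-width parameters are assumptions for the example.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude weights (unstructured pruning).
    Illustrative helper, not part of PQuantML's API."""
    k = int(np.ceil(sparsity * w.size))
    if k == 0:
        return w.copy()
    # Threshold at the k-th smallest absolute value.
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

def fixed_point_quantize(w, total_bits=8, frac_bits=6):
    """Round onto a signed fixed-point grid with `frac_bits` fractional bits,
    clipping to the representable range of `total_bits`."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale

# Apply both steps jointly to a toy weight matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
w_compressed = fixed_point_quantize(magnitude_prune(w, sparsity=0.5))
```

In a real joint workflow these operations run inside the training loop (e.g. via quantization-aware training), so the network can recover accuracy lost to rounding and sparsification; the sketch above only shows the post-hoc transforms.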

Abstract

PQuantML is a new open-source, hardware-aware neural network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models to environments with strict latency constraints, PQuantML simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, either jointly or individually. The library implements multiple pruning methods with different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on representative tasks such as jet substructure classification (so-called jet tagging), an edge-deployment problem arising in real-time LHC data processing. Using various pruning methods with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools such as QKeras and HGQ.