HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference

arXiv cs.LG / 4/27/2026


Key Points

  • The paper introduces HGQ-LUT, a new LUT-aware training (LAT) method that targets ultra-low-latency, FPGA-efficient DNN inference while making training substantially more practical.
  • HGQ-LUT accelerates training by more than 100× on modern GPUs compared with prior state-of-the-art LAT approaches, aiming to eliminate the slow-training bottleneck.
  • It adds specialized LUT-Dense and LUT-Conv layers that use regular, accelerator-friendly tensor operations during training, then compile into hardware logic LUTs for deployment.
  • By combining fine-grained heterogeneous quantization (including zero-bit pruning) with a LUT-aware resource surrogate, HGQ-LUT can automatically explore accuracy–resource trade-offs without manual bit-width tuning.
  • The work integrates HGQ-LUT into open-source toolchains to support an end-to-end workflow and bit-exact verification for hybrid networks mixing LUT-based and conventional arithmetic blocks, with real-world motivation including CERN LHC experiments.
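The core idea behind the LUT-Dense layers can be illustrated with a toy sketch: during training the layer is an ordinary dense operation on quantized values, and for deployment every combination of quantized inputs is enumerated once so the whole neuron collapses into a lookup table. The function names, the 2-input fan-in, and the uniform quantizer below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantizer to `bits` bits over [0, 1) (illustrative assumption)."""
    levels = 2 ** bits
    return int(np.clip(np.floor(x * levels), 0, levels - 1))

def tabulate_neuron(w, b, in_bits, out_bits):
    """Enumerate every quantized input pair of a 2-input neuron and store the
    quantized ReLU output, turning the trained dense op into a pure table."""
    levels = 2 ** in_bits
    lut = np.zeros((levels, levels), dtype=np.int64)
    for i in range(levels):
        for j in range(levels):
            x = np.array([i, j]) / levels      # dequantize the input codes
            y = max(0.0, float(w @ x + b))     # dense op + ReLU, as in training
            lut[i, j] = quantize(y, out_bits)  # quantized activation code
    return lut

# After compilation, inference is one table lookup per neuron -- the kind of
# operation that maps directly onto FPGA logic LUTs.
w, b = np.array([0.6, 0.4]), 0.1
lut = tabulate_neuron(w, b, in_bits=2, out_bits=3)
a, c = quantize(0.5, 2), quantize(0.75, 2)  # input codes
print(lut[a, c])  # → 5
```

Because the table size grows exponentially with fan-in times input bit-width, keeping inputs to a few bits (as LUT-aware training enforces) is what makes this compilation feasible.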

Abstract

Lookup-table (LUT) based neural networks can deliver ultra-low latency and excellent hardware efficiency on FPGAs by mapping arithmetic operations directly onto logic primitives. However, state-of-the-art LUT-aware training (LAT) approaches remain difficult to use in practice: they are often orders of magnitude slower to train than conventional networks, require non-trivial manual tuning for hardware efficiency, and lack an end-to-end workflow. This work presents HGQ-LUT, integrated in https://github.com/calad0i/HGQ2, a new LAT approach that achieves state-of-the-art hardware efficiency while accelerating training by over 100 times on modern GPUs. HGQ-LUT introduces LUT-Dense and LUT-Conv layers that are implemented with regular, accelerator-efficient tensor operations during training and are then compiled into logic LUTs for hardware. By combining these layers with fine-grained, element-wise heterogeneous quantization (including zero-bit pruning) and a LUT-aware resource surrogate, HGQ-LUT enables the automatic exploration of accuracy–resource trade-offs without manual bit-width tuning. We further integrate HGQ-LUT into open-source toolchains, enabling unified design, compilation, and bit-exact verification of hybrid architectures that mix LUT-based with conventional arithmetic blocks. These features make LAT-based DNNs practical for real-world deployment, such as at the CERN Large Hadron Collider experiments.
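The element-wise heterogeneous quantization mentioned in the abstract can be sketched as a fake-quantizer where every weight carries its own bit-width, and a bit-width of zero prunes the weight entirely. This is a generic fixed-point illustration under assumed conventions (symmetric scaling, values in roughly [-1, 1]); the paper's actual parameterization, learned bit-widths, and resource surrogate are not reproduced here.

```python
import numpy as np

def hetero_quantize(w, bits):
    """Fake-quantize each weight to its own bit-width (illustrative sketch).

    A per-element bit-width of 0 forces the weight to exactly zero, which
    is how zero-bit pruning falls out of the same quantization scheme.
    """
    bits = np.asarray(bits)
    # signed fixed-point scale; placeholder of 1.0 where bits == 0
    scale = np.where(bits > 0, 2.0 ** (bits - 1), 1.0)
    q = np.round(w * scale) / scale      # round to the element's grid
    return np.where(bits > 0, q, 0.0)    # 0 bits -> pruned weight

w = np.array([0.7, -0.3, 0.05, 0.9])
print(hetero_quantize(w, [3, 2, 0, 4]))  # → [ 0.75  -0.5    0.     0.875]
```

Searching over such per-element bit-widths automatically, guided by a differentiable resource estimate, is what replaces manual bit-width tuning in this line of work.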