DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration

arXiv cs.RO / 4/7/2026

Key Points

The paper proposes a fully pipelined dual-precision floating-point MAC processing element tailored for energy-efficient AI and edge workloads, supporting both FP8 (E4M3, E5M2) and FP4 (E2M1, E1M2) formats.
It introduces a bit-partitioning method that lets a single 4-bit unit multiplier operate either as a conventional 4×4 multiplier for FP8 operations or as two parallel 2×2 multipliers for 2-bit operands, which the authors report as 100% hardware utilization without duplicated logic.
The design is implemented in 28 nm technology and reports 1.94 GHz operating frequency, 0.00396 mm² area, and 2.13 mW power consumption.
Compared with prior state-of-the-art approaches, the architecture claims up to 60.4% area reduction and 86.6% power savings, indicating strong efficiency potential for low-precision MAC-heavy accelerators.
The work is positioned as an accelerator-friendly hardware building block that could improve throughput-per-watt in AI systems that rely on low-precision arithmetic.
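
The bit-partitioning idea in the key points can be illustrated with a small behavioral model. This is a minimal sketch under stated assumptions, not the paper's circuit: it splits each 4-bit operand into two 2-bit halves, builds the 4×4 product from four 2×2 partial products, and in dual mode keeps only the two "diagonal" partial products as independent lane results. The function names (`mul4x4`, `dual_mul2x2`, `_split`) are hypothetical, and the lane packing (high lane in the upper two bits, low lane in the lower two) is an assumption.

```python
# Behavioral sketch of a partitionable 4-bit multiplier.
# Assumption: names and lane packing are illustrative, not taken from the paper.

def _split(x: int) -> tuple[int, int]:
    """Split a 4-bit operand into (high 2 bits, low 2 bits)."""
    if not 0 <= x <= 0xF:
        raise ValueError("operand must fit in 4 bits")
    return x >> 2, x & 0b11


def mul4x4(a: int, b: int) -> int:
    """Single 4x4 mode: combine four 2x2 partial products into an 8-bit result."""
    a_hi, a_lo = _split(a)
    b_hi, b_lo = _split(b)
    cross = a_hi * b_lo + a_lo * b_hi  # cross terms, weight 2^2
    return ((a_hi * b_hi) << 4) + (cross << 2) + a_lo * b_lo


def dual_mul2x2(a: int, b: int) -> tuple[int, int]:
    """Dual mode: treat each operand as two packed 2-bit lanes.

    Returns (high-lane product, low-lane product); the cross terms are not used.
    """
    a_hi, a_lo = _split(a)
    b_hi, b_lo = _split(b)
    return a_hi * b_hi, a_lo * b_lo
```

For example, `dual_mul2x2(0b1110, 0b0111)` returns `(3, 6)`, since the high lanes are 3 and 1 and the low lanes are 2 and 3. In this simple model the cross-term blocks sit idle in dual mode, so it does not by itself show how the paper reaches its reported 100% utilization; that depends on details of the actual design not given in this summary.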

Abstract

The rapid adoption of low-precision arithmetic in artificial intelligence and edge computing has created a strong demand for energy-efficient and flexible floating-point multiply-accumulate (MAC) units. This paper presents a fully pipelined dual-precision floating-point MAC processing engine supporting FP8 formats (E4M3, E5M2) and FP4 formats (E2M1, E1M2), specifically optimized for low-power and high-throughput AI workloads. The proposed architecture employs a novel bit-partitioning technique that enables a single 4-bit unit multiplier to operate either as a standard 4x4 multiplier for FP8 or as two parallel 2x2 multipliers for 2-bit operands, achieving 100 percent hardware utilization without duplicating logic. Implemented in 28 nm technology, the proposed processing engine achieves an operating frequency of 1.94 GHz with an area of 0.00396 mm^2 and power consumption of 2.13 mW, resulting in up to 60.4 percent area reduction and 86.6 percent power savings compared to state-of-the-art designs.
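
To put the reported figures in more familiar units, the short sketch below derives energy per clock cycle and area in square micrometers from the numbers in the abstract. The per-MAC figures rest on assumptions that the summary does not confirm: that the fully pipelined unit completes one MAC per cycle in FP8 mode and two per cycle in the dual 2×2 mode, and that the reported power applies equally to both modes. The helper names are hypothetical.

```python
# Back-of-the-envelope figures derived from the reported results.
# Assumption: MACs-per-cycle values passed in below are illustrative guesses.

FREQ_HZ = 1.94e9     # reported operating frequency
POWER_W = 2.13e-3    # reported power consumption
AREA_MM2 = 0.00396   # reported area


def energy_per_cycle_pj(power_w: float, freq_hz: float) -> float:
    """Average energy per clock cycle in picojoules (power / frequency)."""
    return power_w / freq_hz * 1e12


def energy_per_mac_pj(power_w: float, freq_hz: float, macs_per_cycle: int) -> float:
    """Energy per MAC in picojoules for an assumed MAC throughput per cycle."""
    return energy_per_cycle_pj(power_w, freq_hz) / macs_per_cycle


area_um2 = AREA_MM2 * 1e6  # 1 mm^2 = 1e6 um^2

if __name__ == "__main__":
    print(f"energy per cycle: {energy_per_cycle_pj(POWER_W, FREQ_HZ):.2f} pJ")
    print(f"energy per MAC, 1 MAC/cycle (assumed): "
          f"{energy_per_mac_pj(POWER_W, FREQ_HZ, 1):.2f} pJ")
    print(f"energy per MAC, 2 MACs/cycle (assumed): "
          f"{energy_per_mac_pj(POWER_W, FREQ_HZ, 2):.2f} pJ")
    print(f"area: {area_um2:.0f} um^2")
```

Under these assumptions the unit draws roughly 1.1 pJ per cycle and occupies about 3960 µm²; the per-MAC values scale inversely with whatever throughput the real design sustains in each mode.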