Optimizing Transformer model size & inference beyond FP16 + ONNX (pruning/graph opt didn’t help much) [P]

Reddit r/MachineLearning / 4/23/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • The author reports that after converting a transformer model to FP16 and optimizing inference with ONNX Runtime, further gains from both unstructured/structured pruning and ONNX graph optimizations were minimal, leaving the model at roughly 162 MB.
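As a rough sanity check on the ~2× FP16 number, here is a minimal PyTorch sketch (a hypothetical toy model standing in for the author's transformer, not their actual network): casting FP32 weights to FP16 halves the bytes per parameter, so on-disk size drops by exactly 2× before any other compression.

```python
import torch

# Hypothetical toy model standing in for the author's transformer.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)

def param_bytes(m: torch.nn.Module) -> int:
    # Total parameter storage in bytes: element count x bytes per element.
    return sum(p.numel() * p.element_size() for p in m.parameters())

fp32_bytes = param_bytes(model)
model = model.half()          # cast all weights from FP32 (4 bytes) to FP16 (2 bytes)
fp16_bytes = param_bytes(model)

print(fp32_bytes / fp16_bytes)  # exactly 2.0 for parameters; real checkpoints add some overhead
```

This is why FP16 alone tops out at 2×: every further factor has to come from fewer parameters (pruning, low-rank, distillation) or fewer bits per parameter (INT8/INT4).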

Hi everyone, I’ve been working on optimizing a transformer-based neural network for both inference speed and model size, but I feel like I’ve hit a plateau and would appreciate some guidance.

What I’ve tried so far:

  • Converted weights to FP16 (~2× size reduction)
  • Exported and optimized with ONNX Runtime for inference speed
  • Unstructured and structured pruning
  • ONNX graph optimizations

None of these gave significant additional gains, and I’m still at roughly 162 MB per model. Next steps I’m considering:

  • Low-rank factorization (SVD / LoRA-style compression)
  • More aggressive quantization (INT8/INT4, e.g. GPTQ, AWQ, or SmoothQuant)
  • Knowledge distillation into a smaller student model
  • Hardware/runtime-specific optimizations like TensorRT or FlashAttention

I’m not sure which of these actually gives meaningful real-world improvements after FP16 + pruning. I’d really appreciate advice on what tends to work best in practice for transformer compression beyond what I’ve already tried, and whether low-rank methods are actually effective post-training, or whether distillation/quantization is usually the only real win at this stage.

submitted by /u/Fragrant_Rate_2583