NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium

arXiv cs.CL / 4/27/2026


Key Points

  • The paper introduces NeuronMLP, an LLM inference optimization for AWS Trainium that combines Singular Value Decomposition (SVD) compression with tiling tailored to Trainium’s heterogeneous, systolic-array architecture.
  • It applies Trainium-specific techniques such as kernel fusion and new caching strategies to reduce costly data movement, better utilize SRAM bandwidth, and avoid expensive matrix transposes.
  • The approach is focused on accelerating multi-layer perceptron (MLP) layers, which are a key computation bottleneck for LLM inference on Trainium.
  • Experiments across nine datasets and six recent LLMs show that NeuronMLP delivers an average 1.35× speedup at the matmul-kernel level, which translates to an average 1.21× end-to-end LLM inference speedup at a 0.05 compression ratio versus AWS NKI-based kernels (a rough sketch of the SVD compression step follows this list).

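To make the compression step concrete, here is a minimal, hardware-agnostic sketch of rank-truncated SVD applied to an MLP weight matrix. The matrix shapes, the helper `svd_compress`, and the mapping from the 0.05 compression ratio to a rank are illustrative assumptions, not the paper's Trainium implementation.

```python
import numpy as np

def svd_compress(W, compression_ratio=0.05):
    """Approximate W (d_out x d_in) with two low-rank factors A @ B.

    The rank is chosen so the two factors together hold roughly
    `compression_ratio` of W's parameters; this ratio-to-rank mapping
    is an assumption made for illustration only.
    """
    d_out, d_in = W.shape
    # rank * (d_out + d_in) ~= compression_ratio * d_out * d_in
    rank = max(1, int(compression_ratio * d_out * d_in / (d_out + d_in)))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, d_in)
    return A, B

# Toy MLP up-projection (shapes are illustrative, not taken from the paper)
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 512)).astype(np.float32)
A, B = svd_compress(W, compression_ratio=0.05)
x = rng.standard_normal((512,)).astype(np.float32)
y_full = W @ x              # one large matmul
y_lowrank = A @ (B @ x)     # two much smaller matmuls
print(A.shape, B.shape)     # (2048, 20) (20, 512)
```
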
Abstract

Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and its special requirements on data layout. In this paper, we propose NeuronMLP, an efficient LLM inference method based on Singular Value Decomposition (SVD) compression and tiling on AWS Trainium. We introduce a series of techniques customized to Trainium, based on kernel fusion and novel caching strategies, to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. The proposed method is specifically optimized for multi-layer perceptron (MLP) layers in LLMs, which serve as a critical computational kernel for inference on Trainium. Evaluating NeuronMLP on nine datasets and six recent LLMs, we show that it significantly outperforms the state-of-the-art Neuron Kernel Interface (NKI)-based matrix multiplication (matmul) kernel implemented by AWS on Trainium: at the kernel level, it achieves an average 1.35x speedup, which translates to an average 1.21x speedup for end-to-end LLM inference under a compression ratio of 0.05.
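
The abstract's emphasis on tiling, kernel fusion, and avoiding data-movement overhead can be pictured with a hardware-agnostic sketch: the two low-rank factors are applied tile by tile, so the intermediate activation is consumed immediately rather than materialized in full. The tile size, loop order, and batched layout of `x` below are assumptions for illustration; the paper's actual NKI/Trainium scheduling and caching strategies are not reproduced here.

```python
import numpy as np

def lowrank_mlp_tiled(x, A, B, tile_tokens=128):
    """Fused, tiled evaluation of y = x @ (A @ B).T using the SVD factors.

    For each tile of tokens, the intermediate h = x_tile @ B.T is produced
    and immediately consumed by the second matmul, mimicking (at a high
    level) how fusion keeps intermediates in fast on-chip memory. Tiling
    and loop order are illustrative, not the paper's Trainium schedule.
    """
    n_tokens, _ = x.shape
    d_out = A.shape[0]
    y = np.empty((n_tokens, d_out), dtype=x.dtype)
    for t0 in range(0, n_tokens, tile_tokens):
        x_tile = x[t0:t0 + tile_tokens]   # (<=tile_tokens, d_in)
        h_tile = x_tile @ B.T             # (<=tile_tokens, rank), consumed right away
        y[t0:t0 + tile_tokens] = h_tile @ A.T
    return y

# Self-contained toy example with assumed shapes (d_in=512, rank=20, d_out=2048)
rng = np.random.default_rng(1)
A = rng.standard_normal((2048, 20)).astype(np.float32)   # (d_out, rank)
B = rng.standard_normal((20, 512)).astype(np.float32)    # (rank, d_in)
x = rng.standard_normal((512, 512)).astype(np.float32)   # (tokens, d_in)
print(lowrank_mlp_tiled(x, A, B).shape)                  # (512, 2048)
```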