NeuronMLP: Efficient LLM Inference via Singular Value Decomposition Compression and Tiling on AWS Trainium

arXiv cs.CL / 4/27/2026


Key Points

  • The paper introduces NeuronMLP, an LLM inference optimization for AWS Trainium that combines Singular Value Decomposition (SVD) compression with tiling tailored to Trainium’s heterogeneous, systolic-array architecture.
  • It applies Trainium-specific techniques such as kernel fusion and new caching strategies to reduce costly data movement, better utilize SRAM bandwidth, and avoid expensive matrix transposes.
  • The approach is focused on accelerating multi-layer perceptron (MLP) layers, which are a key computation bottleneck for LLM inference on Trainium.
  • Experiments across nine datasets and six recent LLMs show that NeuronMLP delivers an average 1.35× speedup at the matmul-kernel level, which translates to an average 1.21× end-to-end LLM inference speedup at a 0.05 compression ratio versus AWS NKI-based kernels (a rough sketch of the SVD compression step follows this list).

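To make the compression step concrete, here is a minimal, hardware-agnostic sketch of rank-truncated SVD applied to an MLP weight matrix. The matrix shapes, the helper `svd_compress`, and the mapping from the 0.05 compression ratio to a rank are illustrative assumptions, not the paper's Trainium implementation.

```python
import numpy as np

def svd_compress(W, compression_ratio=0.05):
    """Approximate W (d_out x d_in) with two low-rank factors A @ B.

    The rank is chosen so the two factors together hold roughly
    `compression_ratio` of W's parameters; this ratio-to-rank mapping
    is an assumption made for illustration only.
    """
    d_out, d_in = W.shape
    # rank * (d_out + d_in) ~= compression_ratio * d_out * d_in
    rank = max(1, int(compression_ratio * d_out * d_in / (d_out + d_in)))
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, d_in)
    return A, B

# Toy MLP up-projection (shapes are illustrative, not taken from the paper)
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 512)).astype(np.float32)
A, B = svd_compress(W, compression_ratio=0.05)
x = rng.standard_normal((512,)).astype(np.float32)
y_full = W @ x              # one large matmul
y_lowrank = A @ (B @ x)     # two much smaller matmuls
print(A.shape, B.shape)     # (2048, 20) (20, 512)
```
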
Abstract

Emerging AI accelerators have started to gain attention and offer new opportunities for efficient inference of large language models (LLMs). Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM inference through its heterogeneous architecture. However, leveraging the Trainium architecture for high performance can be challenging because of its systolic array architecture and its special requirements on data layout. In this paper, we propose NeuronMLP, an efficient LLM inference method based on Singular Value Decomposition (SVD) compression and tiling on AWS Trainium. We introduce a series of techniques customized to Trainium, based on kernel fusion and novel caching strategies, to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transposes. The proposed method is specifically optimized for multi-layer perceptron (MLP) layers in LLMs, which serve as a critical computational kernel for inference on Trainium. Evaluating NeuronMLP on nine datasets and six recent LLMs, we show that it significantly outperforms the state-of-the-art Neuron Kernel Interface (NKI)-based matrix multiplication (matmul) kernel implemented by AWS on Trainium: at the kernel level, it achieves an average 1.35x speedup, which translates to an average 1.21x speedup for end-to-end LLM inference under a compression ratio of 0.05.
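
The abstract's emphasis on tiling, kernel fusion, and avoiding data-movement overhead can be pictured with a hardware-agnostic sketch: the two low-rank factors are applied tile by tile, so the intermediate activation is consumed immediately rather than materialized in full. The tile size, loop order, and batched layout of `x` below are assumptions for illustration; the paper's actual NKI/Trainium scheduling and caching strategies are not reproduced here.

```python
import numpy as np

def lowrank_mlp_tiled(x, A, B, tile_tokens=128):
    """Fused, tiled evaluation of y = x @ (A @ B).T using the SVD factors.

    For each tile of tokens, the intermediate h = x_tile @ B.T is produced
    and immediately consumed by the second matmul, mimicking (at a high
    level) how fusion keeps intermediates in fast on-chip memory. Tiling
    and loop order are illustrative, not the paper's Trainium schedule.
    """
    n_tokens, _ = x.shape
    d_out = A.shape[0]
    y = np.empty((n_tokens, d_out), dtype=x.dtype)
    for t0 in range(0, n_tokens, tile_tokens):
        x_tile = x[t0:t0 + tile_tokens]   # (<=tile_tokens, d_in)
        h_tile = x_tile @ B.T             # (<=tile_tokens, rank), consumed right away
        y[t0:t0 + tile_tokens] = h_tile @ A.T
    return y

# Self-contained toy example with assumed shapes (d_in=512, rank=20, d_out=2048)
rng = np.random.default_rng(1)
A = rng.standard_normal((2048, 20)).astype(np.float32)   # (d_out, rank)
B = rng.standard_normal((20, 512)).astype(np.float32)    # (rank, d_in)
x = rng.standard_normal((512, 512)).astype(np.float32)   # (tokens, d_in)
print(lowrank_mlp_tiled(x, A, B).shape)                  # (512, 2048)
```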