Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving
arXiv cs.LG / 4/23/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper tackles the frequently underexplored gap between AI model development and real-world deployment by analyzing a BentoML-based inference system for scalable model serving (a minimal service sketch follows this list).
- It evaluates baseline performance using a pre-trained RoBERTa sentiment model across three workload scenarios (steady, bursty, and high-intensity) generated from gamma and exponential traffic patterns (see the load-generator sketch below).
- The study measures key metrics such as latency percentiles and throughput to pinpoint bottlenecks across the inference pipeline (see the percentile summary below).
- It then applies optimization strategies at multiple layers of the serving stack, re-runs the same tests, and uses statistical comparisons to quantify the improvements (see the significance-test sketch below).
- The analysis also examines how latency and throughput scale with load, and how running the system on a single-node K3s cluster affects resilience during disruptions (see the availability probe below).
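For concreteness, the sketch below shows what a BentoML service wrapping a pre-trained RoBERTa sentiment model could look like. The paper does not publish its service code, so the checkpoint name, resource settings, and use of BentoML's 1.2+ decorator API here are illustrative assumptions.

```python
import bentoml
from transformers import pipeline


@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 10})
class SentimentService:
    """Illustrative serving sketch; not the paper's actual service code."""

    def __init__(self):
        # Assumed checkpoint -- the paper only says "pre-trained RoBERTa".
        self.pipe = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        )

    @bentoml.api
    def predict(self, text: str) -> dict:
        # Returns e.g. {"label": "positive", "score": 0.98}.
        return self.pipe(text)[0]
```

Started with `bentoml serve`, this exposes `predict` as an HTTP endpoint (port 3000 by default) that a load generator can target.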
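The three workload scenarios can be approximated with a small load generator. The paper does not give its exact parameterization; this sketch assumes the gamma and exponential distributions drive request inter-arrival times, with a gamma shape below 1 producing bursts, and all rates and shapes are placeholder values.

```python
import random
import time

import requests  # assumes the service above is running locally

ENDPOINT = "http://localhost:3000/predict"  # hypothetical local endpoint


def inter_arrival_gaps(n, scenario, rate=10.0):
    """Yield n inter-arrival gaps (seconds) for a scenario.

    steady: exponential gaps -> Poisson arrivals at `rate` req/s.
    bursty: gamma gaps with shape k < 1 -> clustered bursts, scaled
            so the long-run mean rate stays at `rate` req/s.
    high:   exponential gaps at an elevated rate.
    """
    if scenario == "steady":
        for _ in range(n):
            yield random.expovariate(rate)
    elif scenario == "bursty":
        k = 0.3                    # shape < 1 clusters arrivals
        theta = 1.0 / (k * rate)   # mean gap k * theta = 1 / rate
        for _ in range(n):
            yield random.gammavariate(k, theta)
    else:  # "high-intensity": steady pattern at 5x the base rate
        for _ in range(n):
            yield random.expovariate(5 * rate)


def run(scenario, n=1000):
    """Replay one scenario and return per-request latencies (ms) and duration (s)."""
    latencies = []
    start = time.perf_counter()
    for gap in inter_arrival_gaps(n, scenario):
        time.sleep(gap)  # a real harness would send asynchronously
        t0 = time.perf_counter()
        requests.post(ENDPOINT, json={"text": "great product!"})
        latencies.append((time.perf_counter() - t0) * 1000.0)
    return latencies, time.perf_counter() - start
```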
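Latency percentiles and throughput of the kind the paper reports can then be summarized from the collected samples; the p50/p95/p99 selection is a common convention, not a figure quoted from the paper.

```python
import statistics


def summarize(latencies_ms, duration_s):
    """Tail-latency percentiles plus achieved throughput."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": len(latencies_ms) / duration_s,
    }
```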
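To quantify before/after improvements statistically, a non-parametric two-sample test is a reasonable choice given heavy-tailed latency distributions. The paper does not name its test; the Mann-Whitney U test here is one plausible stand-in.

```python
from scipy.stats import mannwhitneyu


def significantly_faster(baseline_ms, optimized_ms, alpha=0.05):
    """True if optimized latencies are stochastically lower than baseline."""
    _, p_value = mannwhitneyu(baseline_ms, optimized_ms, alternative="greater")
    return p_value < alpha
```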
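Finally, resilience on the single-node K3s cluster can be probed by polling the service while a disruption is injected (for example, deleting the serving pod). The health-check path and timings below are assumptions, not details from the paper.

```python
import time

import requests


def probe_availability(url="http://localhost:3000/healthz", seconds=120):
    """Poll a health endpoint once per second and measure downtime.

    Run this while injecting a disruption (e.g. `kubectl delete pod ...`)
    to see how long the single-node K3s deployment takes to recover.
    """
    failures = 0
    for _ in range(seconds):
        try:
            if requests.get(url, timeout=1).status_code != 200:
                failures += 1
        except requests.RequestException:
            failures += 1
        time.sleep(1)
    return {"availability": 1 - failures / seconds, "downtime_s": failures}
```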