Accelerating PayPal's Commerce Agent with Speculative Decoding: An Empirical Study on EAGLE3 with Fine-Tuned Nemotron Models
arXiv cs.LG / 4/23/2026
Key Points
- The study evaluates speculative decoding using EAGLE3 to accelerate PayPal’s Commerce Agent, leveraging a fine-tuned llama3.1-nemotron-nano-8B-v1 model.
- On identical 2xH100 hardware, vLLM-based EAGLE3 is benchmarked against NVIDIA NIM across 40 configurations covering speculative token counts, concurrency (1–32), and sampling temperatures.
- With gamma=3, the approach delivers 22–49% higher throughput and 18–33% lower latency while keeping acceptance rate roughly stable at 35.5% across conditions.
- Increasing the draft length to gamma=5 yields diminishing returns, with the acceptance rate dropping to around 25%.
- Output quality is reported as preserved by an LLM-as-Judge evaluation, and speculative decoding on a single H100 can match or exceed NIM on two H100s, enabling roughly a 50% reduction in GPU cost.
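The diminishing returns at gamma=5 follow from the standard speculative-decoding analysis: if each drafted token is accepted independently with probability a, the expected number of tokens emitted per target-model verification pass is (1 - a^(gamma+1)) / (1 - a). A minimal sketch, assuming the article's reported acceptance rates (35.5% and ~25%) can be read as per-token probabilities, which the article does not confirm:

```python
def expected_tokens_per_step(accept_rate: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass,
    under the standard speculative-decoding model: each of the gamma
    drafted tokens is accepted i.i.d. with probability `accept_rate`,
    and one extra token comes from the target model's own sample.
    Geometric-series closed form: (1 - a^(gamma+1)) / (1 - a)."""
    a = accept_rate
    return (1 - a ** (gamma + 1)) / (1 - a)

# Hypothetical mapping of the article's figures to per-token rates:
print(expected_tokens_per_step(0.355, 3))  # gamma=3, ~35.5% acceptance
print(expected_tokens_per_step(0.25, 5))   # gamma=5, ~25% acceptance
```

Under this (assumed) model, gamma=3 at 35.5% acceptance yields about 1.53 tokens per verification pass, while gamma=5 at 25% yields only about 1.33, so the longer draft buys nothing once the acceptance rate falls, which is consistent with the diminishing returns the study reports.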
