Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces Hierarchical Policy Optimization (HPO) for simultaneous speech translation (SST) that optimizes both translation quality and low-latency behavior.
  • It addresses the high compute cost of LLM-based SST by building on dialogue-style SST that reuses the LLM’s KV cache, reducing redundant computation.
  • Unlike prior dialogue reformulations that depend on scarce, high-quality supervised fine-tuning (SFT) annotations, HPO post-trains from imperfect SFT data using a hierarchical reward scheme.
  • Experiments for English→Chinese/German/Japanese report improvements of over +7 COMET and +1.25 MetricX at a target latency of 1.5 seconds, supported by ablation studies.
  • The authors provide code at GitHub, enabling reproducibility and further development of the HPO approach for SST.
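To make the idea of a hierarchical reward concrete, here is a minimal sketch of one plausible formulation: a primary tier that penalizes violating the latency budget, and a secondary tier that rewards translation quality only once the budget is met. The function name, the linear penalty, and all constants are illustrative assumptions, not the paper's exact reward.

```python
def hierarchical_reward(quality, latency, target_latency=1.5, penalty_scale=-1.0):
    """Toy two-tier reward (illustrative; not the paper's exact formulation).

    quality:        a translation-quality score, e.g. COMET-like, higher is better
    latency:        observed latency in seconds
    target_latency: latency budget (the paper evaluates at 1.5 s)
    """
    if latency > target_latency:
        # Primary tier: latency violations dominate, scaled by how far
        # the system exceeded the budget.
        return penalty_scale * (latency - target_latency)
    # Secondary tier: within budget, optimize translation quality.
    return quality
```

The hierarchy matters because a flat weighted sum of quality and latency lets a policy trade one freely against the other, whereas a tiered scheme enforces the latency constraint first.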

Abstract

Simultaneous speech translation (SST) generates translations while receiving partial speech input. Recent advances show that large language models (LLMs) can substantially improve SST quality, but at the cost of high computational overhead. To reduce this cost, prior work reformulates SST as a multi-turn dialogue task, enabling full reuse of the LLM's key-value (KV) cache and eliminating redundant feature recomputation. However, this approach relies on supervised fine-tuning (SFT) data in dialogue form, for which few human annotations exist, and existing synthesis methods cannot guarantee data quality. In this work, we propose a Hierarchical Policy Optimization (HPO) approach that post-trains models trained on imperfect SFT data. We introduce a hierarchical reward that balances translation quality and latency objectives. Experiments on English to Chinese/German/Japanese demonstrate improvements of over +7 COMET score and +1.25 MetricX score at a latency of 1.5 seconds. Comprehensive ablation studies further validate the effectiveness of different quality rewards, hierarchical reward formulations, and segmentation strategies. Code is available at https://github.com/owaski/HPO
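The compute savings from the dialogue reformulation can be illustrated with a toy encoding-cost model: with KV-cache reuse, each incoming speech chunk is encoded once; without it, the full prefix is re-encoded at every step, so cost grows quadratically with the number of chunks. This is a back-of-the-envelope sketch, not the paper's implementation.

```python
def encode_cost(chunk_sizes, reuse_cache):
    """Total tokens encoded over a streaming session (toy model).

    chunk_sizes: tokens arriving at each streaming step
    reuse_cache: if True, cached key/value states for earlier turns are
                 kept, so only the new chunk is encoded each step;
                 if False, the entire prefix is re-encoded every step.
    """
    total, prefix = 0, 0
    for n in chunk_sizes:
        prefix += n
        total += n if reuse_cache else prefix
    return total
```

For three 10-token chunks, reuse costs 30 tokens of encoding while full recomputation costs 60, and the gap widens linearly per additional step, which is why cache reuse matters for unbounded speech.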