WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

arXiv cs.CL / April 29, 2026


Key Points

  • WhisperPipe is a new streaming ASR architecture designed to balance transcription accuracy and computational efficiency for large transformer models like Whisper.
  • It uses a hybrid VAD pipeline (Silero VAD plus energy-based filtering) to reduce false activations by 34%, helping improve real-time reliability.
  • A dynamic buffering mechanism with overlapping context windows prevents information loss at segment boundaries while keeping memory usage bounded.
  • In experiments on 2.5 hours of diverse audio, WhisperPipe reaches a median 89 ms end-to-end latency and reduces peak GPU memory usage by 48%, with stable memory behavior over 150 minutes.
  • The system achieves competitive accuracy (WER within 2% of offline Whisper) while delivering 3–5x lower latency than prior streaming approaches and supports modular deployment from edge to cloud.
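The hybrid VAD idea in the first bullets can be sketched as a two-stage gate: a cheap energy check rejects obvious silence before the neural VAD is consulted at all. This is a minimal illustration, not the paper's implementation; `neural_vad` is a hypothetical stand-in for Silero VAD (the real Silero API operates on torch tensors), and the thresholds are illustrative.

```python
import math

def rms_energy(frame):
    """Root-mean-square energy of a frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def hybrid_vad(frame, neural_vad, energy_floor=0.01, prob_threshold=0.5):
    """Two-stage speech gate (sketch).

    Stage 1: an energy floor discards near-silent frames without invoking
    the model, which is what cuts false activations cheaply.
    Stage 2: `neural_vad` (hypothetical stand-in for Silero VAD) returns a
    speech probability in [0, 1]; the frame passes only above threshold.
    """
    if rms_energy(frame) < energy_floor:
        return False  # silence: skip the expensive neural model entirely
    return neural_vad(frame) >= prob_threshold

# Illustrative frames: 10 ms of silence vs. a quiet 220 Hz tone at 16 kHz.
silence = [0.0] * 160
tone = [0.05 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(160)]
```

A frame only reaches the neural stage if it clears the energy floor, so false activations on background hiss are filtered before any GPU work happens.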

Abstract

Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: (1) a hybrid Voice Activity Detection (VAD) pipeline combining Silero VAD with energy-based filtering to reduce false activations by 34%; (2) a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries; and (3) an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe demonstrates a median end-to-end latency of 89 ms (90th percentile: 142 ms) while consuming 48% less peak GPU memory and 80.9% lower average GPU utilization than baseline Whisper implementations. The system maintains stable memory usage over extended sessions, with zero growth across 150 minutes of continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3–5x lower latency than existing streaming solutions. The architecture's modular design enables deployment across resource-constrained environments, from edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.
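The abstract's second innovation, a bounded buffer with overlapping context windows, can be sketched as follows. This is an assumption-laden illustration, not the paper's code: segment and overlap sizes are made-up parameters, and each emitted segment re-includes the tail of the previous one so the decoder keeps context across boundaries while retained memory stays bounded.

```python
class OverlapBuffer:
    """Bounded streaming buffer that emits fixed-size, overlapping segments.

    Hypothetical sketch of the overlapping-context-window idea: each emitted
    segment shares `overlap` samples with the previous one, so information at
    segment boundaries is not lost, while the internal buffer never retains
    more than `segment` samples between pushes.
    """

    def __init__(self, segment=16000, overlap=4000):
        self.segment = segment  # samples per emitted segment (illustrative)
        self.overlap = overlap  # samples shared with the previous segment
        self.buf = []

    def push(self, samples):
        """Append incoming samples; return any complete segments to decode."""
        self.buf.extend(samples)
        out = []
        while len(self.buf) >= self.segment:
            out.append(self.buf[:self.segment])
            # Keep only the trailing `overlap` samples as context for the
            # next segment; everything earlier is dropped, bounding memory.
            self.buf = self.buf[self.segment - self.overlap:]
        return out

# Tiny demonstration with toy sizes: segments of 8 samples, overlap of 2.
buf = OverlapBuffer(segment=8, overlap=2)
first = buf.push(list(range(10)))       # enough for one full segment
second = buf.push(list(range(10, 14)))  # completes a second segment
```

Because the buffer only ever keeps the overlap tail plus not-yet-emitted samples, memory stays flat regardless of session length, consistent with the stable long-session behavior the abstract reports.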