Seeking the Absolute Lowest Latency for Qwen 3.5 9B: Best Inference Engine for 1-Stream Real-Time TTS?

Reddit r/LocalLLaMA / 3/23/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The poster is evaluating Qwen 3.5 9B with FP16 and EXL3 quantization on an RTX 3090 Ti to minimize latency in a single-stream real-time TTS pipeline (TTFT and TPS as core metrics).
  • Current TTFT is about 120-170 ms and TPS around 100-120 tokens/sec, with a target total latency of roughly 500-700 ms for generating ~100 tokens.
  • They are exploring low-latency inference techniques and flags (such as Flash Attention and cache optimizations) and considering speculative decoding with a smaller draft model to assess potential gains vs overhead.
  • They seek to identify the "gold standard" backend/inference engine configuration for ultra-low-latency, single-stream generation with Qwen 3.5 9B.

Hi everyone,

I'm building a real-time voice chat pipeline (STT -> LLM -> TTS) and I’m hitting a bottleneck in the "Time to Sentence" part. My goal is to minimize the total latency for generating a 100-token response.

My Requirements:
* Model: Qwen 3.5 9B (currently testing FP16 and EXL3 quants).
* Hardware: 1x NVIDIA RTX 3090 Ti.
* Metric: Lowest possible TTFT (Time To First Token) + Highest TPS (Tokens Per Second) for a single stream (Batch Size 1).
* Target: Total time for ~100 tokens should be at or below roughly 500-700 ms.

Current Benchmarks (Single Stream):
I've been testing a few approaches and getting roughly:
* TTFT: ~120ms - 170ms
* TPS: ~100 - 120 tokens/sec
(Testing on a single NVIDIA RTX 3090 Ti)
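For context, here's the back-of-the-envelope latency budget I'm working from (a simple model that ignores network and scheduling overhead; the inputs are just my best-case numbers from above):

```python
def total_latency_ms(ttft_ms: float, tps: float, n_tokens: int) -> float:
    """Total wall time for a streamed response: TTFT covers the first
    token, then the remaining n_tokens - 1 stream at tps tokens/sec."""
    return ttft_ms + (n_tokens - 1) / tps * 1000.0

def required_tps(ttft_ms: float, budget_ms: float, n_tokens: int) -> float:
    """Decode speed needed to fit n_tokens inside budget_ms."""
    return (n_tokens - 1) * 1000.0 / (budget_ms - ttft_ms)

# Best-case numbers from the benchmarks above
print(total_latency_ms(120, 120, 100))   # ~945 ms total
print(required_tps(120, 700, 100))       # ~171 tok/s needed
```

So even at my best-case 120 ms TTFT and 120 TPS, ~100 tokens take roughly 945 ms; hitting the 700 ms ceiling would need around 170+ TPS (or streaming the first sentence to the TTS engine early).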

For this single-user, real-time use case, I'm trying to find what is currently considered the "gold standard" for low-latency inference. I've experimented with several backends, but it's been hard to balance minimal TTFT against high TPS: some engines excel at sustained generation once they get going, yet their initial overhead pushes the total response time higher than I'd like for a conversational interface.
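To compare backends apples-to-apples, I've been measuring TTFT and decode TPS the same way for each. A minimal, backend-agnostic sketch that works with any streaming client that yields tokens as a Python iterator (the `fake_stream` generator below is just a stand-in to make it runnable):

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tps) for a token iterator."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    # Decode TPS excludes the first token (prefill) so it matches
    # the tokens/sec figure engines usually report
    tps = (count - 1) / (total - ttft) if count > 1 and total > ttft else 0.0
    return ttft, tps

def fake_stream():
    """Hypothetical stand-in for a real streaming client."""
    time.sleep(0.02)          # simulated prefill delay
    for _ in range(20):
        time.sleep(0.001)     # simulated per-token decode
        yield "tok"

ttft, tps = measure_stream(fake_stream())
print(f"TTFT {ttft * 1000:.1f} ms, {tps:.0f} tok/s")
```

Splitting TTFT out from decode TPS like this is what exposed the "high sustained throughput, high startup overhead" pattern I mentioned.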

I'm particularly interested in any specific flags or low-latency modes, such as Flash Attention or optimized KV-cache configurations, that could shave off those crucial milliseconds. I've also been considering speculative decoding with a smaller draft model (a tiny Qwen or Gemma), but I'm unsure whether the overhead would yield a net gain for a 9B model or just eat into performance.
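On the speculative-decoding question, the expected gain can be estimated before wiring anything up, using the standard acceptance-rate analysis from the speculative decoding literature. The acceptance rate, draft length, and cost ratio below are made-up illustrations, not measurements of any Qwen pairing:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step when the target
    accepts each draft token with probability alpha and the draft
    proposes gamma tokens (geometric-series form)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def net_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Speedup vs plain decoding, where draft_cost is the cost of one
    draft forward pass relative to one target forward pass."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1)

# Illustrative only: 80% acceptance, 4 draft tokens, draft at 10% of target cost
print(net_speedup(0.8, 4, 0.10))   # ~2.4x if these numbers held
```

The flip side: with a poorly matched draft model (say, acceptance around 0.3), the same setup barely breaks even (~1.02x), which is the "eat into performance" scenario.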

Thanks for any insights!

submitted by /u/Nasa1423