Recently, I learned about the concept of continuous batching, where multiple users can interact with a single loaded LLM without significantly decreasing tokens per second. The primary limitation is the KV cache.
I am wondering if it is possible to apply continuous batching to a single-user workflow. For example, if I ask an AI to analyze 10 different sources, it typically reads them sequentially within a 32k context window, which is slow.
Instead, could we use continuous batching to run 10 parallel requests, each with a 3.2k context window, so the sources are read simultaneously? In theory this would cut the waiting time substantially.
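To make the intuition concrete, here is a toy simulation of what I have in mind. It is not a real inference engine: "steps" stand in for decode iterations, and the function names (`sequential_steps`, `continuous_batch_steps`) are just illustrative. The point is that with continuous batching, each decode step advances every unfinished sequence, so total steps track the longest sequence rather than the sum of all of them.

```python
# Toy simulation: reading 10 sources sequentially vs. in one continuous batch.
# Illustrative only -- not a real engine; lengths are decode-token counts.

def sequential_steps(lengths):
    """One source after another: total decode steps is the sum of lengths."""
    return sum(lengths)

def continuous_batch_steps(lengths):
    """All sources share the batch: each step decodes one token for every
    unfinished sequence, so total steps equal the longest sequence."""
    remaining = list(lengths)
    steps = 0
    while any(r > 0 for r in remaining):
        # one decode step advances every active sequence by one token
        remaining = [r - 1 if r > 0 else 0 for r in remaining]
        steps += 1
    return steps

lengths = [3200] * 10  # ten ~3.2k-token reads instead of one 32k pass
print(sequential_steps(lengths))        # 32000
print(continuous_batch_steps(lengths))  # 3200
```

Of course, in a real engine each batched step costs more than a single-sequence step, so the speedup would be less than 10x, but as I understand it, decoding is usually memory-bandwidth-bound, so batching should still win on wall-clock time.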
Is this approach possible, and if so, could you please teach me how to implement it?




