Recently, I learned about the concept of continuous batching, where multiple users can interact with a single loaded LLM without significantly decreasing tokens per second. The primary limitation is the KV cache.
I am wondering if it is possible to apply continuous batching to a single-user workflow. For example, if I ask an AI to analyze 10 different sources, it typically reads them sequentially within a 32k context window, which is slow.
Instead, could we use continuous batching to run 10 parallel requests, each with a 3.2k context window, so the sources are read simultaneously? In theory this would cut the waiting time substantially.
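To make the intuition concrete, here is a toy simulation of what I have in mind. It is not a real inference engine: "steps" stand in for decode iterations, and the function names (`sequential_steps`, `continuous_batch_steps`) are just illustrative. The point is that with continuous batching, each decode step advances every unfinished sequence, so total steps track the longest sequence rather than the sum of all of them.

```python
# Toy simulation: reading 10 sources sequentially vs. in one continuous batch.
# Illustrative only -- not a real engine; lengths are decode-token counts.

def sequential_steps(lengths):
    """One source after another: total decode steps is the sum of lengths."""
    return sum(lengths)

def continuous_batch_steps(lengths):
    """All sources share the batch: each step decodes one token for every
    unfinished sequence, so total steps equal the longest sequence."""
    remaining = list(lengths)
    steps = 0
    while any(r > 0 for r in remaining):
        # one decode step advances every active sequence by one token
        remaining = [r - 1 if r > 0 else 0 for r in remaining]
        steps += 1
    return steps

lengths = [3200] * 10  # ten ~3.2k-token reads instead of one 32k pass
print(sequential_steps(lengths))        # 32000
print(continuous_batch_steps(lengths))  # 3200
```

Of course, in a real engine each batched step costs more than a single-sequence step, so the speedup would be less than 10x, but as I understand it, decoding is usually memory-bandwidth-bound, so batching should still win on wall-clock time.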
Is this approach possible, and if so, could you please teach me how to implement it?




