Are there ways to set up llama-swap so that competing model requests are queued ?

Reddit r/LocalLLaMA / 3/29/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post asks whether llama-swap (running on a 48GB workstation) can be configured so that when GPU/host memory is exhausted, incoming model inference requests are queued rather than failing.
  • The author intends to keep using LiteLLM as the front-end API layer while delegating the actual model hosting/swapping behavior to a llama-swap instance.
  • It seeks guidance on how to support multiple models behind the same API endpoint so students can request whichever model they want.
  • The author also asks whether using AMD hardware introduces additional complications for llama-swap/LiteLLM integration or performance.
  • Overall, the request is focused on operational behavior (request handling and concurrency) and deployment considerations for an educational/student-access setup.
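For context, a minimal llama-swap configuration might look like the sketch below. All model names, file paths, and the `ttl` value are placeholders, and the exact keys should be checked against the llama-swap README; the general pattern is that each entry defines a command to launch a backend (such as llama.cpp's `llama-server`) and llama-swap proxies requests to whichever model was last requested, starting and stopping backends as needed. By design, llama-swap holds incoming requests while it swaps a model in, which is the queuing behavior the post asks about.

```yaml
# Hypothetical llama-swap config (names and paths are placeholders).
# llama-swap starts the matching backend on demand and proxies requests
# to it; requests arriving mid-swap wait until the model is loaded.
models:
  "llama-3.1-8b":
    cmd: llama-server --port ${PORT} -m /models/llama-3.1-8b-q4.gguf
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 300          # optionally unload after 300s of inactivity
  "qwen2.5-14b":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-14b-q4.gguf
    proxy: "http://127.0.0.1:${PORT}"
```

Because only one backend runs at a time under this setup, the 48GB budget is consumed by a single model rather than by concurrent loads, which sidesteps the out-of-memory failure mode described in the post.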

Hello everyone :) As the title says, I am looking to provide a 48GB workstation to students as an API endpoint. I am using LiteLLM currently and want to keep using it, but under the hood I would love to run a llama-swap instance so that I can offer different models and students can just query the one they want. But if no memory is left, I would like the job to be queued. Is there functionality like that?
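The LiteLLM side of this setup could be sketched as follows. This is an assumption-laden example, not a verified deployment: the model names are hypothetical, and it assumes llama-swap exposes an OpenAI-compatible endpoint on port 8080 (check the project's defaults). Each LiteLLM `model_name` maps to a model key that llama-swap recognizes, so students hit one API endpoint and LiteLLM forwards to llama-swap, which swaps models as needed.

```yaml
# Hypothetical LiteLLM proxy config pointing at a local llama-swap
# instance. Model names must match the keys in the llama-swap config.
model_list:
  - model_name: llama-3.1-8b
    litellm_params:
      model: openai/llama-3.1-8b        # treated as an OpenAI-compatible backend
      api_base: http://localhost:8080/v1 # assumed llama-swap address
      api_key: "none"                    # llama-swap needs no key by default
  - model_name: qwen2.5-14b
    litellm_params:
      model: openai/qwen2.5-14b
      api_base: http://localhost:8080/v1
      api_key: "none"
```

With this layering, LiteLLM handles authentication, per-student keys, and rate limits, while llama-swap handles which model is actually resident in memory.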

Also, I am running on AMD; does that introduce any further problems?

submitted by /u/Noxusequal