Previously, a model could only belong to a single group. Now you can create whatever groups you want: one for big models that should run on their own, one for STT plus a bigger model, one for RAG use cases, etc. When a new model is requested, llama-swap picks which running models to unload based on the "cost" of evicting them.
Check out the config: config.example.yaml on the main branch of mostlygeek/llama-swap.
    # =============================================================================
    # matrix: run concurrent models with a solver-based swap DSL
    # =============================================================================
    #
    # Note:
    # A config must use either a matrix or legacy groups, not both. A configuration error
    # will occur if both are defined. Configuration examples for legacy Groups can be found:
    # https://github.com/mostlygeek/llama-swap/blob/40e39f7/config.example.yaml#L334-L396
    #
    # The matrix declares valid combinations of models that can run concurrently.
    # When a model is requested, the solver finds the cheapest way to make it
    # available by evicting as few (and least costly) running models as possible.
    #
    # Solver behavior:
    #   1. Request arrives for model X
    #   2. If X is already running, forward immediately. Done.
    #   3. Find all sets containing X
    #   4. For each candidate set, compute cost: sum of evict_costs for
    #      every running model NOT in that set
    #   5. Pick lowest cost candidate. Ties broken by definition order.
    #   6. Evict what needs to stop. Start X. Forward request.
    #
    # Subset semantics: a set [a, b, c] means any subset is valid.
    # Only the requested model is started — others are not preloaded.
    #
    # A model not appearing in any set can only run alone.
    #
    matrix:
      # vars: short names for models (alphanumeric, 1-8 chars)
      # - required for sets and evict_costs settings
      # - each entry is a short name to a real model ID. Do not use an alias
      # - used to keep set DSL logic short and easier to read
      # - sets and evict_costs only use identifiers defined in vars
      vars:
        g: gemma-model
        q: qwen-model
        m: mistral-model
        v: voxtral-model
        e: reranker-model
        L: llama-70B
        sd: stable-diffusion

      # evict_costs: relative cost of losing a running model (default: 1)
      evict_costs:
        v: 50 # vllm backend, slow cold start
        L: 30 # 70B weights, slow to load

      # sets: named sets of concurrent model combinations
      # Values are DSL strings with operators:
      #   &     AND (models run together)
      #   |     OR (alternatives)
      #   ()    grouping
      #   +ref  inline another set's expression
      #
      # Expansion examples:
      #   "L"                 → [L]
      #   "a & b"             → [a, b]
      #   "a | b"             → [a], [b]
      #   "(a | b) & c"       → [a, c], [b, c]
      #   "(a | b) & (c | d)" → [a,c], [a,d], [b,c], [b,d]
      #   "+llms & v"         → expands llms inline, then applies & v
      sets:
        # LLM + TTS: switching between g/q/m won't evict v
        # expands to: [g,v], [q,v], [m,v]
        standard: "(g | q | m) & v"

        # LLM + TTS + reranker
        # expands to: [g,v,e], [q,v,e]
        with_rerank: "(g | q) & v & e"

        # LLM + image generation, no TTS
        # expands to: [g,sd], [q,sd]
        creative: "(g | q) & sd"

        # 70B model uses all GPUs, can only run alone
        # expands to: [L]
        full: "L"
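To make the solver steps in the config comments concrete, here's a rough Python sketch of how I read them. This is not llama-swap's actual code: the `solve` function, its return shape, and the assumption that the currently running models always form a valid combination are mine. The sets and costs are copied from the example config.

```python
# Expanded sets from the example config comments, in definition order.
SETS = [
    {"g", "v"}, {"q", "v"}, {"m", "v"},   # standard
    {"g", "v", "e"}, {"q", "v", "e"},     # with_rerank
    {"g", "sd"}, {"q", "sd"},             # creative
    {"L"},                                # full
]
EVICT_COSTS = {"v": 50, "L": 30}  # every other model uses the default of 1


def solve(requested, running, sets=SETS, costs=EVICT_COSTS):
    """Hypothetical helper: return (models_to_evict, models_running_afterwards)."""
    running = set(running)

    # Step 2: already running, nothing to evict or start.
    if requested in running:
        return set(), running

    # Step 3: find all sets containing the requested model.
    candidates = [s for s in sets if requested in s]
    # "A model not appearing in any set can only run alone."
    if not candidates:
        candidates = [{requested}]

    # Step 4: cost of a candidate = evict_costs of running models NOT in it.
    def cost(candidate):
        return sum(costs.get(model, 1) for model in running - candidate)

    # Step 5: min() returns the first lowest-cost candidate,
    # so ties are broken by definition order.
    best = min(candidates, key=cost)

    # Step 6: evict what is outside the chosen set, then start the request.
    # Other members of the set are not preloaded.
    evict = running - best
    return evict, (running - evict) | {requested}
```

With g and v running, a request for q evicts only g (cost 1) and keeps v loaded. A request for sd evicts v (cost 50) via `[g,sd]` rather than both g and v (cost 51) via `[q,sd]`.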

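The expansion examples in the config comments can be reproduced with a small recursive-descent parser. The sketch below is my own guess at the semantics, not the project's parser: `expand` and `tokenize` are made-up names, and I'm assuming `&` binds tighter than `|` when there are no parentheses (the config examples always parenthesize, so they don't settle this). The `llms` set used with `+llms` is hypothetical too.

```python
import re

# A token is an identifier, a +ref, or one of the operators & | ( )
TOKEN = re.compile(r"\s*(\+?[A-Za-z0-9]+|[&|()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def expand(text, named_sets=None):
    """Expand a DSL string into a list of model combinations (lists of names)."""
    named_sets = named_sets or {}
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        # "a | b" keeps the alternatives as separate combinations.
        combos = parse_and()
        while peek() == "|":
            take()
            combos = combos + parse_and()
        return combos

    def parse_and():
        # "a & b" pairs every left combination with every right combination.
        combos = parse_atom()
        while peek() == "&":
            take()
            right = parse_atom()
            combos = [left + extra for left in combos for extra in right]
        return combos

    def parse_atom():
        token = take()
        if token == "(":
            combos = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return combos
        if token in (None, "&", "|", ")"):
            raise ValueError(f"unexpected token: {token!r}")
        if token.startswith("+"):
            # +ref: inline another set's expression.
            return expand(named_sets[token[1:]], named_sets)
        return [[token]]

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result
```

For instance, `expand("(g | q | m) & v")` gives the three combinations listed for `standard`, and `expand("+llms & v", {"llms": "g | q | m"})` gives the same result by inlining the referenced set first.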



