using all 31 free NVIDIA NIM models at once with automatic routing and failover

Reddit r/LocalLLaMA / 3/29/2026


Key Points

  • The post describes a LiteLLM-based proxy/router that automatically fans out requests across all 31 NVIDIA NIM free-tier models instead of manually choosing a single model.
  • It uses latency-based routing to send each request to the fastest currently available model and implements retry-and-failover when a model hits rate limits or goes down.
  • The setup verifies which models are live on the API, applies cooldown windows for unhealthy models (e.g., 60 seconds), and automatically recovers routing afterward.
  • It defines multiple model groups (e.g., nvidia-auto, nvidia-coding, nvidia-reasoning, nvidia-general, nvidia-fast) and supports cross-tier fallbacks such as coding → reasoning → general.
  • The router exposes an OpenAI-compatible endpoint (e.g., localhost:4000), and the author shares a GitHub repo along with guidance for installing dependencies and running the config.

been using nvidia NIM free tier for a while and the main annoyance is picking which model to hit and dealing with rate limits (~40 RPM per model).

so i wrote a setup script that generates a LiteLLM proxy config to route across all of them automatically:

  • validates which models are actually live on the API
  • latency-based routing picks the fastest one each request
  • rate limited? retries then routes to next model
  • model goes down? 60s cooldown, auto-recovers
  • cross-tier fallbacks (coding -> reasoning -> general)

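the routing/cooldown behavior above can be sketched in plain python. this is a conceptual toy, not LiteLLM's actual internals — the model names, the `Router` class, and the moving-average latency estimate are all illustrative:

```python
import time

# Toy sketch of the routing idea: pick the model with the lowest recent
# latency, skip models that are on cooldown after a failure, and bench a
# failed model for 60s so routing recovers automatically.

COOLDOWN_SECONDS = 60

class ToyRouter:
    def __init__(self, models):
        self.latency = {m: 0.0 for m in models}        # rolling latency estimate
        self.cooldown_until = {m: 0.0 for m in models} # unix time when usable again

    def pick(self):
        now = time.time()
        available = [m for m in self.latency if self.cooldown_until[m] <= now]
        if not available:
            raise RuntimeError("all models cooling down")
        return min(available, key=lambda m: self.latency[m])

    def report_success(self, model, elapsed):
        # simple exponential moving average of observed latency
        self.latency[model] = 0.7 * self.latency[model] + 0.3 * elapsed

    def report_failure(self, model):
        # e.g. HTTP 429: bench the model, recover automatically later
        self.cooldown_until[model] = time.time() + COOLDOWN_SECONDS

router = ToyRouter(["deepseek-v3.2", "kimi-k2", "phi-4-mini"])
router.report_success("phi-4-mini", 0.4)
router.report_success("kimi-k2", 1.2)
router.report_failure("deepseek-v3.2")
print(router.pick())  # fastest model that isn't benched -> phi-4-mini
```

in the real setup LiteLLM handles all of this via its `latency-based-routing` strategy and cooldown settings in the generated config.
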
31 models right now - deepseek v3.2, llama 4 maverick/scout, qwen 3.5 397b, kimi k2, devstral 2, nemotron ultra, etc.

5 groups u can target:

  • nvidia-auto - all models, fastest wins
  • nvidia-coding - kimi k2, qwen3 coder 480b, devstral, codestral
  • nvidia-reasoning - deepseek v3.2, qwen 3.5, nemotron ultra
  • nvidia-general - llama 4, mistral large, deepseek v3.1
  • nvidia-fast - phi 4 mini, r1 distills, mistral small

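the cross-tier fallback is just "try the requested group, then walk its fallback chain until a call succeeds." a minimal sketch, where `call_group` is a hypothetical stand-in for a real request to one of the groups above:

```python
# Conceptual sketch of cross-tier fallback (coding -> reasoning -> general).
# `call_group` stands in for an actual completion request to a model group.

FALLBACK_CHAIN = {
    "nvidia-coding": ["nvidia-reasoning", "nvidia-general"],
    "nvidia-reasoning": ["nvidia-general"],
}

def complete_with_fallback(group, call_group):
    # try the requested group first, then its fallbacks in order
    for g in [group] + FALLBACK_CHAIN.get(group, []):
        try:
            return g, call_group(g)
        except RuntimeError:  # e.g. every model in the group rate-limited
            continue
    raise RuntimeError("all tiers exhausted")

# example: coding tier fully rate-limited, reasoning tier answers
def fake_call(group):
    if group == "nvidia-coding":
        raise RuntimeError("429")
    return "ok"

print(complete_with_fallback("nvidia-coding", fake_call))
# -> ('nvidia-reasoning', 'ok')
```

in practice LiteLLM's `fallbacks` config does this for you; the sketch just shows why a coding request can quietly land on a reasoning model.
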
add groq/cerebras keys too and u get ~140 RPM across 38 models, all free.

openai compatible so works with any client:

import openai

client = openai.OpenAI(base_url="http://localhost:4000", api_key="sk-litellm-master")
resp = client.chat.completions.create(model="nvidia-auto", messages=[...])

setup is just:

pip install -r requirements.txt
python setup.py
litellm --config config.yaml --port 4000

github: https://github.com/rohansx/nvidia-litellm-router

curious if anyone else is stacking free providers like this. also open to suggestions on which models should go in which tier. 🚀

submitted by /u/synapse_sage