using all 31 free NVIDIA NIM models at once with automatic routing and failover

Reddit r/LocalLLaMA / 3/29/2026


Key Points

  • The post describes a LiteLLM-based proxy/router that automatically fans out requests across all 31 NVIDIA NIM free-tier models instead of manually choosing a single model.
  • It uses latency-based routing to send each request to the fastest currently available model and implements retry-and-failover when a model hits rate limits or goes down.
  • The setup verifies which models are live on the API, applies cooldown windows for unhealthy models (e.g., 60 seconds), and automatically recovers routing afterward.
  • It defines multiple model groups (e.g., nvidia-auto, nvidia-coding, nvidia-reasoning, nvidia-general, nvidia-fast) and supports cross-tier fallbacks such as coding → reasoning → general.
  • The router exposes an OpenAI-compatible endpoint (e.g., localhost:4000), and the author shares a GitHub repo along with guidance for installing dependencies and running the config.

been using nvidia NIM free tier for a while and the main annoyance is picking which model to hit and dealing with rate limits (~40 RPM per model).

so i wrote a setup script that generates a LiteLLM proxy config to route across all of them automatically:

  • validates which models are actually live on the API
  • latency-based routing picks the fastest one each request
  • rate limited? retries then routes to next model
  • model goes down? 60s cooldown, auto-recovers
  • cross-tier fallbacks (coding -> reasoning -> general)

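the routing/cooldown behavior above can be sketched in plain python. this is a conceptual toy, not LiteLLM's actual internals — the model names, the `Router` class, and the moving-average latency estimate are all illustrative:

```python
import time

# Toy sketch of the routing idea: pick the model with the lowest recent
# latency, skip models that are on cooldown after a failure, and bench a
# failed model for 60s so routing recovers automatically.

COOLDOWN_SECONDS = 60

class ToyRouter:
    def __init__(self, models):
        self.latency = {m: 0.0 for m in models}        # rolling latency estimate
        self.cooldown_until = {m: 0.0 for m in models} # unix time when usable again

    def pick(self):
        now = time.time()
        available = [m for m in self.latency if self.cooldown_until[m] <= now]
        if not available:
            raise RuntimeError("all models cooling down")
        return min(available, key=lambda m: self.latency[m])

    def report_success(self, model, elapsed):
        # simple exponential moving average of observed latency
        self.latency[model] = 0.7 * self.latency[model] + 0.3 * elapsed

    def report_failure(self, model):
        # e.g. HTTP 429: bench the model, recover automatically later
        self.cooldown_until[model] = time.time() + COOLDOWN_SECONDS

router = ToyRouter(["deepseek-v3.2", "kimi-k2", "phi-4-mini"])
router.report_success("phi-4-mini", 0.4)
router.report_success("kimi-k2", 1.2)
router.report_failure("deepseek-v3.2")
print(router.pick())  # fastest model that isn't benched -> phi-4-mini
```

in the real setup LiteLLM handles all of this via its `latency-based-routing` strategy and cooldown settings in the generated config.
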
31 models right now - deepseek v3.2, llama 4 maverick/scout, qwen 3.5 397b, kimi k2, devstral 2, nemotron ultra, etc.

5 groups u can target:

  • nvidia-auto - all models, fastest wins
  • nvidia-coding - kimi k2, qwen3 coder 480b, devstral, codestral
  • nvidia-reasoning - deepseek v3.2, qwen 3.5, nemotron ultra
  • nvidia-general - llama 4, mistral large, deepseek v3.1
  • nvidia-fast - phi 4 mini, r1 distills, mistral small

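the cross-tier fallback is just "try the requested group, then walk its fallback chain until a call succeeds." a minimal sketch, where `call_group` is a hypothetical stand-in for a real request to one of the groups above:

```python
# Conceptual sketch of cross-tier fallback (coding -> reasoning -> general).
# `call_group` stands in for an actual completion request to a model group.

FALLBACK_CHAIN = {
    "nvidia-coding": ["nvidia-reasoning", "nvidia-general"],
    "nvidia-reasoning": ["nvidia-general"],
}

def complete_with_fallback(group, call_group):
    # try the requested group first, then its fallbacks in order
    for g in [group] + FALLBACK_CHAIN.get(group, []):
        try:
            return g, call_group(g)
        except RuntimeError:  # e.g. every model in the group rate-limited
            continue
    raise RuntimeError("all tiers exhausted")

# example: coding tier fully rate-limited, reasoning tier answers
def fake_call(group):
    if group == "nvidia-coding":
        raise RuntimeError("429")
    return "ok"

print(complete_with_fallback("nvidia-coding", fake_call))
# -> ('nvidia-reasoning', 'ok')
```

in practice LiteLLM's `fallbacks` config does this for you; the sketch just shows why a coding request can quietly land on a reasoning model.
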
add groq/cerebras keys too and u get ~140 RPM across 38 models, all free.

openai compatible so works with any client:

import openai

client = openai.OpenAI(base_url="http://localhost:4000", api_key="sk-litellm-master")
resp = client.chat.completions.create(model="nvidia-auto", messages=[...])

setup is just:

pip install -r requirements.txt
python setup.py
litellm --config config.yaml --port 4000

github: https://github.com/rohansx/nvidia-litellm-router

curious if anyone else is stacking free providers like this. also open to suggestions on which models should go in which tier. 🚀

submitted by /u/synapse_sage