[$50k–$150k Budget] Production Local LLM System (~50 Users, RAG + Fine-Tuning) Hardware + Model Advice

Key Points

The post seeks practical guidance for building an on-prem LLM infrastructure with a $50k–$150k budget to support internal tools for roughly 50 concurrent users using RAG over private documents and potentially fine-tuning.

Hi all,

I’m working on bringing LLM infrastructure in-house for a business use case and would really appreciate input from anyone running production setups.

Budget: $50k to $150k USD

Deployment: On-prem (data sensitivity)

Use case: Internal tools + RAG over private documents + fine-tuning

Scale:

∙ Starting with a handful of users

∙ Planning to scale to ~50 concurrent users

Requirements:

∙ Strong multi user inference throughput

∙ Support modern open weight models (dense + MoE)

∙ Long context support (32k to 128k+ baseline, curious how far people are actually pushing context lengths in real multi user setups without killing throughput)

∙ Stability and uptime > peak performance

Current direction:

∙ Leaning toward a 4× RTX Pro 6000 Max-Q as the main option

∙ Also considering Apple hardware if it’s actually competitive for this kind of workload

Questions (Hardware):

Any hardware setups people would recommend specifically for the models they’re running?
Should I be prioritizing NVLink at this scale, or is it not worth it?
For a build like this, what do you recommend for: CPU, motherboard (PCIe lanes / layout), RAM, storage (NVMe, RAID, etc.), power supply?
Any real world lessons around reliability / failure points?

Questions (Models):

What models are people actually running locally in production right now?
For RAG + internal tools, what’s working best in practice?
Any “sweet spot” models that balance: quality, VRAM usage, throughput under load?

Serving stack:

Is vLLM still the best default choice for multi-user production setups at this scale?

Architecture question:

For business use cases like this, are people mostly seeing success with strong RAG + good base models first, then adding fine-tuning later for behavior/style, or is fine-tuning becoming necessary earlier in real deployments?

Open to:

∙ Used/refurb enterprise hardware

∙ Real world configs + benchmarks

∙ “What I wish I knew” lessons

Trying to make a solid, production ready decision here, really appreciate any insights.

Thanks!

submitted by /u/MorningCrab
[link] [comments]