Hi all,
I’m working on bringing LLM infrastructure in-house for a business use case and would really appreciate input from anyone running production setups.
Budget: $50k to $150k USD
Deployment: On-prem (data sensitivity)
Use case: Internal tools + RAG over private documents + fine-tuning
Scale:
∙ Starting with a handful of users
∙ Planning to scale to ~50 concurrent users
Requirements:
∙ Strong multi-user inference throughput
∙ Support for modern open-weight models (dense + MoE)
∙ Long context support: 32k to 128k+ baseline; curious how far people actually push context in real multi-user setups without killing throughput (rough KV-cache math after this list)
∙ Stability and uptime > peak performance
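For the context-length point above, here's the back-of-envelope KV-cache math I've been working from. This is a rough sketch in Python; the layer/head dimensions are assumed values for a ~70B-class dense model with GQA, not any specific product's specs:

```python
# Rough KV-cache sizing. Per token, the cache holds K and V for every layer:
#   bytes/token = 2 * layers * kv_heads * head_dim * bytes_per_elem
# Dimensions below are assumptions for a ~70B-class dense model with GQA.
layers = 80
kv_heads = 8           # grouped-query attention
head_dim = 128
bytes_per_elem = 2     # fp16/bf16 cache; an fp8 KV cache would halve this

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 320 KiB

for ctx in (32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_bytes_per_token * ctx / 2**30:5.1f} GiB per sequence")
```

At these dims, a single 128k-token sequence eats ~40 GiB of cache on its own, which is why long context and high concurrency compete directly for the same VRAM.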
Current direction:
∙ Leaning toward a 4× RTX Pro 6000 Max-Q build as the main option
∙ Also considering Apple hardware if it’s actually competitive for this kind of workload
Questions (Hardware):
- Any hardware setups people would recommend specifically for the models they’re running?
- Should I be prioritizing NVLink at this scale, or is it not worth it?
- For a build like this, what do you recommend for CPU, motherboard (PCIe lanes / layout), RAM, storage (NVMe, RAID, etc.), and power supply?
- Any real world lessons around reliability / failure points?
Questions (Models):
- What models are people actually running locally in production right now?
- For RAG + internal tools, what’s working best in practice?
- Any “sweet spot” models that balance: quality, VRAM usage, throughput under load?
Serving stack:
Is vLLM still the best default choice for multi-user production setups at this scale?
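For what it's worth, the shape I'm expecting is vLLM's OpenAI-compatible server with tensor parallelism across the 4 cards, plus a trivial client-side smoke test. This is a sketch, not a tested config; the model ID, port, and limits below are placeholder assumptions:

```python
# Assumes a vLLM OpenAI-compatible server is already up, e.g.:
#   vllm serve <model-id> --tensor-parallel-size 4 --max-model-len 32768
# Base URL and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

resp = client.chat.completions.create(
    model="<model-id>",  # must match whatever the server was launched with
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```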
Architecture question:
For business use cases like this, are people mostly seeing success by starting with strong RAG on top of a good base model, then adding fine-tuning later for behavior/style? Or is fine-tuning becoming necessary earlier in real deployments?
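To make the "RAG first, fine-tune later" option concrete, this is the minimal shape I have in mind. It's a sketch: retrieve() is a hypothetical stand-in for whatever vector store or hybrid search ends up doing retrieval, and the endpoint/model names are the same placeholders as above:

```python
# Minimal RAG loop against a local OpenAI-compatible endpoint.
# retrieve() is hypothetical; swap in your actual vector DB / BM25 search.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-locally")

def retrieve(query: str, k: int = 5) -> list[str]:
    """Hypothetical retriever returning the top-k document chunks."""
    raise NotImplementedError

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="<model-id>",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context; say so if it is insufficient."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

The appeal of this shape is that fine-tuning later only changes the model the server loads; the retrieval layer and the client code don't have to move.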
Open to:
∙ Used/refurb enterprise hardware
∙ Real world configs + benchmarks
∙ “What I wish I knew” lessons
Trying to make a solid, production-ready decision here; I really appreciate any insights.
Thanks!