We built a corpus of 500,000 documents simulating a real company, then let RAG systems compete to see which performs best. Introducing EnterpriseRAG-Bench, a benchmark for testing how well RAG systems work on messy, enterprise-scale internal knowledge.

Most RAG benchmarks are built on public data: Wikipedia, web pages, papers, forums, and so on. That’s useful, but it doesn’t match what many people are actually building against: Slack threads, email chains, tickets, meeting transcripts, PRs, CRM notes, docs, and wikis. So we generated a synthetic company that behaves more like a real one. The released dataset simulates a company called Redwood Inference and includes about 500k documents spanning those source types.
The part we spent the most time on was not just “generate a lot of docs” but the methodology for making the docs feel like they belong to the same company. At a high level, the generation pipeline first defines the company (with human-in-the-loop inputs), then builds shared scaffolding (teams and terminology, a directory structure, per-source writing guidelines), and finally generates documents with cross-document awareness so that links and dependencies line up. A rough sketch of that flow follows.
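To make the staged flow concrete, here is a minimal sketch, assuming a generic llm client with a generate(prompt) method. Every name below (CompanyProfile, build_scaffolding, generate_doc) is hypothetical and not taken from the EnterpriseRAG-Bench repo:

```python
# Hypothetical sketch of the staged generation flow described above.
# None of these names come from the EnterpriseRAG-Bench repo; `llm` is a
# stand-in for whatever model client the real pipeline uses.
from dataclasses import dataclass

@dataclass
class CompanyProfile:
    name: str                    # e.g. "Redwood Inference"
    teams: list[str]
    terminology: dict[str, str]  # shared glossary so docs agree on names
    directory: list[str]         # employee names/roles reused everywhere

def build_scaffolding(profile: CompanyProfile, source_type: str) -> str:
    """Per-source writing guidelines (tone, structure, typical authors)."""
    return (
        f"You are writing an internal {source_type} document for "
        f"{profile.name}. Use these teams: {', '.join(profile.teams)}. "
        f"Use this glossary consistently: {profile.terminology}."
    )

def generate_doc(llm, profile: CompanyProfile, source_type: str,
                 topic: str, related_docs: list[str]) -> str:
    """Generate one document with cross-document awareness: prior docs
    from the same project are injected so links and references line up."""
    prompt = build_scaffolding(profile, source_type)
    prompt += f"\nTopic: {topic}\nRelated existing documents:\n"
    prompt += "\n---\n".join(related_docs[-3:])  # keep context bounded
    return llm.generate(prompt)
```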
The paper also reports a couple of baseline findings.
The repo includes the dataset, generation framework, evaluation harness, and leaderboard: https://github.com/onyx-dot-app/EnterpriseRAG-Bench

Would love feedback from other people building RAG/search systems over internal company data. In particular, I’m curious which retrieval setups people think would do best here: hybrid search, rerankers, agents, metadata filters, query rewriting, graph-style traversal, and so on. One possible baseline is sketched below.
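As a starting point for that question, here is a minimal hybrid-search sketch that fuses BM25 and dense embeddings with Reciprocal Rank Fusion. The library choices (rank_bm25, sentence-transformers) and the model name are assumptions of mine, not part of the benchmark’s evaluation harness:

```python
# Hybrid retrieval baseline: BM25 + dense embeddings, fused with
# Reciprocal Rank Fusion (RRF). Libraries and model are my own picks,
# not from the benchmark's harness.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

def hybrid_search(query: str, docs: list[str], k: int = 10, rrf_k: int = 60):
    # Sparse side: classic BM25 over whitespace tokens.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse_rank = np.argsort(-bm25.get_scores(query.lower().split()))

    # Dense side: cosine similarity over normalized sentence embeddings.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = model.encode(docs, normalize_embeddings=True)
    q_emb = model.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(doc_emb @ q_emb))

    # RRF: each list contributes 1 / (rrf_k + rank) per document.
    scores = np.zeros(len(docs))
    for ranking in (sparse_rank, dense_rank):
        for rank, idx in enumerate(ranking):
            scores[idx] += 1.0 / (rrf_k + rank + 1)
    return [docs[i] for i in np.argsort(-scores)[:k]]
```

A cross-encoder reranker could then rescore the fused top-k; on a corpus full of internal jargon, the sparse side tends to carry exact-match terms a general-purpose embedding model has never seen.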
An Open Benchmark for Testing RAG on Realistic Company-Internal Data
Reddit r/LocalLLaMA / 5/6/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The article introduces EnterpriseRAG-Bench, an open benchmark designed to evaluate RAG systems on realistic, messy enterprise internal knowledge rather than public data.
- It releases a synthetic “Redwood Inference” company corpus containing about 500,000 documents drawn from common internal tools and systems (e.g., Slack, Gmail, Jira, Confluence, GitHub, Google Drive, CRM notes).
- The dataset’s key differentiator is its generation methodology, which first defines the company via human-in-the-loop inputs, then creates shared scaffolding: teams and terminology, directory structures, and per-source documentation guidelines.
- Project documents are generated with cross-document awareness, producing realistic links and dependencies across sources such as PRDs, meeting notes, tickets, PRs, and customer notes.
- To keep document diversity high and avoid repetitive themes, the benchmark uses cheaper topic scaffolding by source type for the large majority of the corpus, addressing duplication observed in naive generation (see the sketch after this list).
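To make the topic-scaffolding point concrete, here is a hypothetical sketch of the idea: sample a diverse topic pool once per source type, then condition each document on an assigned topic rather than letting the model choose themes freely. The function names and the llm client are stand-ins, not code from the repo:

```python
# Hypothetical illustration of "topic scaffolding by source type":
# pre-sample a diverse topic pool once per source, then assign topics to
# documents, instead of letting the model free-run (naive generation
# tends to converge on a handful of repeated themes).
import random

def build_topic_pool(llm, source_type: str, n: int = 500) -> list[str]:
    """One cheap call per source type; the pool is reused for the bulk
    of the corpus, far cheaper than per-document planning."""
    prompt = (f"List {n} distinct, realistic topics for internal "
              f"{source_type} documents at an ML infrastructure company. "
              f"One per line, no duplicates.")
    return [t.strip() for t in llm.generate(prompt).splitlines() if t.strip()]

def assign_topics(pool: list[str], num_docs: int, seed: int = 0) -> list[str]:
    """Spread documents evenly across the pool so no theme dominates."""
    rng = random.Random(seed)
    order = rng.sample(range(num_docs), num_docs)  # shuffled doc indices
    return [pool[i % len(pool)] for i in order]
```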