Querying Structured Data Through Natural Language Using Language Models
arXiv cs.CL / 4/6/2026
Key Points
- The paper proposes an open-source method to query structured, non-text datasets using natural language by training an LLM to generate executable queries.
- It argues that standard RAG approaches can struggle with numerical and highly structured data, and instead uses a pipeline that generates synthetic training question–answer pairs reflecting both user intent and dataset semantics.
- The authors fine-tune a compact DeepSeek R1 Distill 8B model with QLoRA, which trains low-rank adapters on top of a 4-bit quantized base model, aiming for deployment on commodity hardware rather than relying on large proprietary LLMs.
- Experiments on accessibility data for essential services in Durangaldea, Spain show high accuracy across monolingual, multilingual, and unseen-location scenarios, indicating strong generalization for query generation.
- The results suggest small, domain-specific models can achieve high precision and be adapted to broader multi-dataset systems, supporting use in resource-constrained environments.
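
As an illustration of the first two points, the following is a minimal sketch of how synthetic question–query pairs might be generated from a structured table and checked by executing them. It assumes the executable queries are SQL over a single table; the summary does not name the query language, the schema, or the generation procedure, so the `accessibility` table, its columns, the question templates, and the travel-time values below are all hypothetical and serve only as an example.

```python
import sqlite3
from itertools import product

# Toy rows (invented values): minutes to the nearest service per municipality.
ROWS = [
    ("Durango", "pharmacy", 4.0),
    ("Durango", "hospital", 12.5),
    ("Elorrio", "pharmacy", 6.5),
    ("Elorrio", "hospital", 21.0),
]

# Hypothetical question templates standing in for varied user phrasing.
TEMPLATES = [
    "How long does it take to reach the nearest {service} from {place}?",
    "What is the travel time to a {service} in {place}?",
]


def build_db(rows):
    """Load the toy rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accessibility (municipality TEXT, service TEXT, minutes REAL)"
    )
    conn.executemany("INSERT INTO accessibility VALUES (?, ?, ?)", rows)
    return conn


def make_pairs(rows, templates):
    """Pair every natural-language question with the query that answers it."""
    pairs = []
    for (place, service, _), template in product(rows, templates):
        question = template.format(service=service, place=place)
        # Values come from the trusted toy rows above; real user input
        # should never be interpolated into SQL like this.
        query = (
            "SELECT minutes FROM accessibility "
            f"WHERE municipality = '{place}' AND service = '{service}'"
        )
        pairs.append({"question": question, "query": query})
    return pairs


def run_query(conn, query):
    """Execute a generated query and return the first column of each row."""
    return [row[0] for row in conn.execute(query).fetchall()]


conn = build_db(ROWS)
pairs = make_pairs(ROWS, TEMPLATES)

# Keep only pairs whose query executes and returns exactly one value.
valid_pairs = [p for p in pairs if len(run_query(conn, p["query"])) == 1]
```

Each entry in `valid_pairs` could then serve as one fine-tuning example, with the question as the prompt and the query as the target. Executing every query before training is one simple way to filter out pairs the dataset cannot answer.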




