Querying Structured Data Through Natural Language Using Language Models

arXiv cs.CL / 4/6/2026


Key Points

  • The paper proposes an open-source method to query structured, non-text datasets using natural language by training an LLM to generate executable queries.
  • It argues that standard RAG approaches can struggle with numerical and highly structured data; to support query generation instead, it introduces a pipeline that produces synthetic question–answer training pairs reflecting both user intent and dataset semantics.
  • The authors fine-tune a compact DeepSeek R1 Distill 8B model with QLoRA and 4-bit quantization, aiming for deployment on commodity hardware rather than relying on large proprietary LLMs.
  • Experiments on accessibility data for essential services in Durangaldea, Spain, show high accuracy across monolingual, multilingual, and unseen-location scenarios, indicating strong generalization for query generation.
  • The results suggest small, domain-specific models can achieve high precision and be adapted to broader multi-dataset systems, supporting use in resource-constrained environments.
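
The synthetic training-data idea in the key points can be illustrated with a small template-based generator. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `accessibility` table and its columns, the question templates, the value lists, the use of SQL as the target query language, and treating the "answer" of each pair as the target query are all hypothetical choices made for illustration, since this summary does not specify them.

```python
import itertools
import random

# Hypothetical slot values; the real dataset's fields are not given in this summary.
MUNICIPALITIES = ["Durango", "Elorrio", "Abadiño"]
SERVICES = ["pharmacy", "school", "health centre"]

# Each template pairs a natural-language phrasing with a parameterized query.
# SQL is an assumed target language, and the table and column names are invented.
TEMPLATES = [
    (
        "How long does it take to reach the nearest {service} in {municipality}?",
        "SELECT minutes FROM accessibility "
        "WHERE municipality = '{municipality}' AND service = '{service}';",
    ),
    (
        "Which municipality has the fastest access to a {service}?",
        "SELECT municipality FROM accessibility "
        "WHERE service = '{service}' ORDER BY minutes ASC LIMIT 1;",
    ),
]


def generate_pairs(seed=0):
    """Fill every template with every slot combination and return shuffled pairs."""
    pairs = {}
    for question_tpl, query_tpl in TEMPLATES:
        for municipality, service in itertools.product(MUNICIPALITIES, SERVICES):
            slots = {"municipality": municipality, "service": service}
            question = question_tpl.format(**slots)
            # Keying on the question drops duplicates produced by templates
            # that do not use every slot.
            pairs[question] = {"question": question, "query": query_tpl.format(**slots)}
    examples = list(pairs.values())
    random.Random(seed).shuffle(examples)  # seeded so the output is reproducible
    return examples


if __name__ == "__main__":
    for example in generate_pairs()[:3]:
        print(example)
```

A real pipeline would need far more varied phrasings (and, for the multilingual scenario, several languages) than two fixed templates provide. The sketch only shows how each question can be tied to a query that is correct by construction.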

Abstract

This paper presents an open-source methodology for allowing users to query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model (DeepSeek R1 Distill 8B) using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen-location scenarios, demonstrating both robust generalization and reliable query generation. Our results highlight that small, domain-specific models can achieve high precision for this task without relying on large proprietary LLMs, making this methodology suitable for resource-constrained environments and adaptable to broader multi-dataset systems.
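
To make the "generate executable queries" step concrete, the sketch below runs a model-produced query against an in-memory SQLite table and scores it by comparing its result with that of a reference query. Everything specific here is an assumption: the abstract does not name the query language, the schema, or the accuracy metric, so SQL, the `accessibility` table, the made-up sample rows, and execution-match scoring are stand-ins chosen for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with invented rows (not the paper's data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accessibility (municipality TEXT, service TEXT, minutes REAL)"
    )
    conn.executemany(
        "INSERT INTO accessibility VALUES (?, ?, ?)",
        [
            ("Durango", "pharmacy", 4.0),
            ("Elorrio", "pharmacy", 7.5),
            ("Durango", "school", 6.0),
        ],
    )
    conn.commit()
    return conn


def run_generated_query(conn, query):
    """Execute one SELECT statement and reject anything else.

    This is a crude guard: it also rejects harmless queries that contain a
    semicolon inside a string literal.
    """
    stripped = query.strip().rstrip(";").strip()
    if not stripped.lower().startswith("select") or ";" in stripped:
        raise ValueError("only single SELECT statements are allowed")
    return conn.execute(stripped).fetchall()


def execution_accuracy(conn, examples):
    """Fraction of (predicted, reference) query pairs that return the same rows."""
    correct = 0
    for predicted, reference in examples:
        expected = sorted(run_generated_query(conn, reference))
        try:
            actual = sorted(run_generated_query(conn, predicted))
        except (ValueError, sqlite3.Error):
            actual = None  # rejected or invalid queries count as wrong
        correct += actual == expected
    return correct / len(examples)
```

Comparing result sets rather than query strings means two differently written but equivalent queries both count as correct, which is one plausible way to read "accuracy" for a query-generation model. In a deployment, opening the database in read-only mode would be a sturdier safeguard than the string check used here.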