DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona
arXiv cs.CL / 3/25/2026
Key Points
- The paper introduces DALDALL, a persona-based data augmentation framework designed to improve legal information retrieval in low-resource settings where training data is scarce.
- Instead of generating large volumes of synthetic queries with generic prompting, DALDALL uses domain-specific professional personas (e.g., attorneys, prosecutors, judges) to produce synthetic queries with higher lexical and semantic diversity.
- Experiments on the CLERC and COLIEE benchmarks show that persona-based augmentation improves lexical diversity, as measured by Self-BLEU (lower scores indicate less overlap among the generated queries), while maintaining semantic fidelity to the original queries.
- Dense retrievers fine-tuned on persona-augmented data achieve competitive or better recall than retrievers trained on the original data alone or on data from generic augmentation strategies.
- Overall, the work positions persona-based prompting as an effective approach for creating higher-quality training data for specialized legal IR tasks.
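To make the Self-BLEU diversity measure mentioned above concrete, here is a minimal sketch in Python. It scores each query against all the others in the set and averages the results, so a lower value suggests more lexically diverse queries. The whitespace tokenization, bigram limit, add-one smoothing, and the example queries are all assumptions made for illustration; the summary does not state the configuration the paper actually uses.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=2):
    """Simplified BLEU: clipped n-gram precision with add-one smoothing
    and a brevity penalty. Illustrative only, not a reference implementation."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # For each n-gram, keep the maximum count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        overlap = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing (an assumption) avoids log(0) on zero overlap.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    hyp_len = len(hypothesis)
    # Brevity penalty against the reference closest in length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(queries, max_n=2):
    """Average BLEU of each query against all other queries in the set."""
    if len(queries) < 2:
        raise ValueError("Self-BLEU needs at least two queries")
    tokenized = [q.lower().split() for q in queries]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)


# Hypothetical queries, invented for illustration only.
generic_queries = [
    "what cases discuss breach of contract damages",
    "what cases discuss breach of contract remedies",
    "what cases discuss breach of contract claims",
]
persona_queries = [
    "precedent supporting expectation damages after supplier default",
    "can the state prove intent in a fraudulent inducement charge",
    "standard of review for liquidated damages clauses on appeal",
]

generic_score = self_bleu(generic_queries)
persona_score = self_bleu(persona_queries)
print(f"generic: {generic_score:.3f}  persona: {persona_score:.3f}")
```

In this toy comparison the near-duplicate generic queries score higher than the varied ones, which is the direction a diversity improvement would show up in. A real evaluation would normally rely on an established BLEU implementation and the tokenizer used by the benchmark.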