LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
arXiv cs.CL / 4/9/2026
Key Points
- The paper proposes the Guardian Parser Pack, an LLM-enhanced parsing and normalization pipeline that converts heterogeneous missing-person case documents into a single schema-compliant format for operational review and downstream spatial modeling.
- It combines multi-engine PDF extraction with OCR fallback, source-specific parsers, and schema-first harmonization with validation, with an optional LLM-assisted pathway for higher-quality field extraction.
- On a manually aligned subset of 75 cases, the LLM-assisted extraction substantially outperforms the deterministic approach, with an F1 score of 0.8664 versus 0.2578.
- Across 517 parsed records per pathway, the LLM-assisted method improves aggregate key-field completeness (96.97% vs. 93.23%) while still passing initial schema validation, indicating validator-guided repair helps preserve auditability.
- The deterministic pipeline is far faster (0.03 s/record vs. 3.95 s/record for the LLM pathway), highlighting a speed/quality tradeoff for high-stakes investigative workflows.
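
The two quality metrics cited above, key-field completeness and field-level F1, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the field names in `KEY_FIELDS`, the exact-match comparison, and the helper names are all assumptions made for illustration, since the summary does not specify the actual schema or scoring rules.

```python
from typing import Any

# Hypothetical key fields; the source does not list the real schema.
KEY_FIELDS = ("name", "age", "last_seen_date", "last_seen_location")


def is_filled(value: Any) -> bool:
    """Treat None and blank strings as missing; anything else as filled."""
    if value is None:
        return False
    if isinstance(value, str):
        return value.strip() != ""
    return True


def key_field_completeness(records: list[dict]) -> float:
    """Fraction of key-field slots that are filled across all records."""
    total = len(records) * len(KEY_FIELDS)
    if total == 0:
        return 0.0
    filled = sum(is_filled(r.get(f)) for r in records for f in KEY_FIELDS)
    return filled / total


def field_f1(predicted: dict, gold: dict) -> float:
    """Exact-match F1 over (field, value) pairs for a single record.

    Assumes values are hashable (strings, numbers); the paper's matching
    rule may be more lenient than exact equality.
    """
    pred = {(k, v) for k, v in predicted.items() if is_filled(v)}
    ref = {(k, v) for k, v in gold.items() if is_filled(v)}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, completeness rewards any non-empty value while F1 rewards only values that agree with the manually aligned reference, which is one way a pipeline could score high on completeness yet low on F1.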