LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

arXiv cs.CL / 4/9/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes the Guardian Parser Pack, an LLM-enhanced parsing and normalization pipeline that converts heterogeneous missing-person case documents into a single schema-compliant format for operational review and downstream spatial modeling.
  • It combines multi-engine PDF extraction with OCR fallback, source-specific parsers, and schema-first harmonization with validation, with an optional LLM-assisted pathway for higher-quality field extraction.
  • On a manually aligned subset of 75 cases, the LLM-assisted extraction substantially outperforms a deterministic approach, achieving higher F1 scores (0.8664 vs. 0.2578).
  • Across 517 parsed records per pathway, the LLM-assisted method improves aggregate key-field completeness (96.97% vs. 93.23%); all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than the driver of the observed gains.
  • The deterministic pipeline is far faster (0.03 s/record vs. 3.95 s/record for the LLM pathway), highlighting a speed/quality tradeoff for high-stakes investigative workflows.
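The schema-first validation with validator-guided repair described above can be sketched as a validate–repair–revalidate loop. This is an illustrative assumption, not the authors' implementation: the field names, the toy schema, and the type-coercion "repair" step (which stands in for an LLM repair call) are all hypothetical.

```python
# Hypothetical sketch of schema-first validation with a repair pass.
# REQUIRED_FIELDS, the repair logic, and the example record are
# illustrative assumptions, not the Guardian Parser Pack's actual schema.

REQUIRED_FIELDS = {"name": str, "age": int, "last_seen_location": str}

def validate(record: dict) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: expected {ftype.__name__}")
    return errors

def repair(record: dict, errors: list[str]) -> dict:
    """Stand-in for an LLM repair call: here we only coerce obvious types."""
    fixed = dict(record)
    for field, ftype in REQUIRED_FIELDS.items():
        if field in fixed and fixed[field] is not None:
            try:
                fixed[field] = ftype(fixed[field])  # e.g. "34" -> 34
            except (TypeError, ValueError):
                pass  # leave the violation for the validator to report
    return fixed

def validate_with_repair(record: dict, max_passes: int = 2):
    """Validate; on failure, run the repair step and re-validate."""
    for _ in range(max_passes):
        errors = validate(record)
        if not errors:
            return record, []
        record = repair(record, errors)
    return record, validate(record)

raw = {"name": "Jane Doe", "age": "17", "last_seen_location": "Springfield"}
fixed, errs = validate_with_repair(raw)
print(fixed["age"], errs)
```

In the paper's evaluated run all LLM outputs passed initial validation, so in this sketch the repair branch would simply never fire; it remains in place as a safeguard.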

Abstract

Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.
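The abstract reports field-level F1 against a manually gold-aligned subset. One common way to score such extraction is micro-averaged precision/recall over exact (field, value) matches, sketched below; the paper's actual alignment and matching rules (normalization, partial credit, field weighting) may well differ, and the example records are invented.

```python
def field_f1(predicted: dict, gold: dict):
    """Micro precision/recall/F1 over exact (field, value) matches.

    An illustrative metric sketch only; the paper's exact matching
    criteria are not specified here. None values count as unextracted.
    """
    pred_items = {(k, v) for k, v in predicted.items() if v is not None}
    gold_items = {(k, v) for k, v in gold.items() if v is not None}
    tp = len(pred_items & gold_items)          # exact field-value matches
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gold_items) if gold_items else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: 2 of 2 predicted fields correct, 2 of 3 gold fields found.
gold = {"name": "Jane Doe", "age": 17, "city": "Springfield"}
pred = {"name": "Jane Doe", "age": 17, "city": None}
p, r, f = field_f1(pred, gold)
print(round(f, 4))  # → 0.8
```

Under a definition like this, the reported gap (0.8664 vs. 0.2578) reflects both wrong values and missing fields in the deterministic comparator's output.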