LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

arXiv cs.CL / 4/9/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes the Guardian Parser Pack, an LLM-enhanced parsing and normalization pipeline that converts heterogeneous missing-person case documents into a single schema-compliant format for operational review and downstream spatial modeling.
  • It combines multi-engine PDF extraction with OCR fallback, source-specific parsers, and schema-first harmonization with validation, with an optional LLM-assisted pathway for higher-quality field extraction.
  • On a manually aligned subset of 75 cases, the LLM-assisted extraction substantially outperforms a deterministic approach, achieving higher F1 scores (0.8664 vs. 0.2578).
  • Across 517 parsed records per pathway, the LLM-assisted method improves aggregate key-field completeness (96.97% vs. 93.23%); all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than the driver of the observed gains.
  • The deterministic pipeline is far faster (0.03 s/record vs. 3.95 s/record for the LLM pathway), highlighting a speed/quality tradeoff for high-stakes investigative workflows.
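The schema-first validation with validator-guided repair described above can be sketched as a validate–repair–revalidate loop. This is an illustrative assumption, not the authors' implementation: the field names, the toy schema, and the type-coercion "repair" step (which stands in for an LLM repair call) are all hypothetical.

```python
# Hypothetical sketch of schema-first validation with a repair pass.
# REQUIRED_FIELDS, the repair logic, and the example record are
# illustrative assumptions, not the Guardian Parser Pack's actual schema.

REQUIRED_FIELDS = {"name": str, "age": int, "last_seen_location": str}

def validate(record: dict) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: expected {ftype.__name__}")
    return errors

def repair(record: dict, errors: list[str]) -> dict:
    """Stand-in for an LLM repair call: here we only coerce obvious types."""
    fixed = dict(record)
    for field, ftype in REQUIRED_FIELDS.items():
        if field in fixed and fixed[field] is not None:
            try:
                fixed[field] = ftype(fixed[field])  # e.g. "34" -> 34
            except (TypeError, ValueError):
                pass  # leave the violation for the validator to report
    return fixed

def validate_with_repair(record: dict, max_passes: int = 2):
    """Validate; on failure, run the repair step and re-validate."""
    for _ in range(max_passes):
        errors = validate(record)
        if not errors:
            return record, []
        record = repair(record, errors)
    return record, validate(record)

raw = {"name": "Jane Doe", "age": "17", "last_seen_location": "Springfield"}
fixed, errs = validate_with_repair(raw)
print(fixed["age"], errs)
```

In the paper's evaluated run all LLM outputs passed initial validation, so in this sketch the repair branch would simply never fire; it remains in place as a safeguard.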

Abstract

Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.
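The abstract reports field-level F1 against a manually gold-aligned subset. One common way to score such extraction is micro-averaged precision/recall over exact (field, value) matches, sketched below; the paper's actual alignment and matching rules (normalization, partial credit, field weighting) may well differ, and the example records are invented.

```python
def field_f1(predicted: dict, gold: dict):
    """Micro precision/recall/F1 over exact (field, value) matches.

    An illustrative metric sketch only; the paper's exact matching
    criteria are not specified here. None values count as unextracted.
    """
    pred_items = {(k, v) for k, v in predicted.items() if v is not None}
    gold_items = {(k, v) for k, v in gold.items() if v is not None}
    tp = len(pred_items & gold_items)          # exact field-value matches
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(gold_items) if gold_items else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: 2 of 2 predicted fields correct, 2 of 3 gold fields found.
gold = {"name": "Jane Doe", "age": 17, "city": "Springfield"}
pred = {"name": "Jane Doe", "age": 17, "city": None}
p, r, f = field_f1(pred, gold)
print(round(f, 4))  # → 0.8
```

Under a definition like this, the reported gap (0.8664 vs. 0.2578) reflects both wrong values and missing fields in the deterministic comparator's output.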