AXE: Low-Cost Cross-Domain Web Structured Information Extraction
arXiv cs.CL / 4/1/2026
Key Points
- AXE (Adaptive X-Path Extractor) reframes web structured data extraction as a DOM pruning problem: boilerplate and irrelevant nodes are removed from the HTML tree before structured outputs are generated.
- The pipeline distills a high-density, grounded context, which lets a small 0.6B-parameter LLM produce precise extractions and reach state-of-the-art zero-shot results on the SWDE dataset.
- AXE adds Grounded XPath Resolution (GXR) to ensure outputs are traceable back to specific source DOM nodes, improving reliability and auditability versus purely heuristic or ungrounded approaches.
- Despite its low compute footprint, AXE reports an F1 score of 88.1% and outperforms several larger, fully trained alternatives, suggesting a cost-effective path for large-scale web extraction.
- The authors release specialized adaptors and code publicly to enable practical adoption for web information extraction workloads.
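
To make the two central ideas concrete, here is a minimal Python sketch of pruning a DOM tree and then resolving an extracted value back to an XPath in the source document. It is not the authors' implementation: this summary does not describe how AXE decides which nodes to prune or how GXR matches values, so the sketch substitutes a fixed tag blocklist and a plain substring match. The names `BOILERPLATE_TAGS`, `index_xpaths`, `prune`, and `resolve` are hypothetical, and the sample page is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-written blocklist stands in for AXE's pruning criteria.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


def index_xpaths(root):
    """Map every element to a positional XPath, computed BEFORE pruning
    so that paths still point into the original document."""
    paths = {root: f"/{root.tag}"}
    stack = [root]
    while stack:
        parent = stack.pop()
        counts = {}
        for child in parent:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{counts[child.tag]}]"
            stack.append(child)
    return paths


def prune(elem):
    """Remove boilerplate subtrees and subtrees that carry no text.
    Returns True if elem should be kept. Tail text is ignored for brevity."""
    if elem.tag in BOILERPLATE_TAGS:
        return False
    for child in list(elem):
        if not prune(child):
            elem.remove(child)
    has_text = bool((elem.text or "").strip())
    return has_text or len(elem) > 0


def resolve(value, root, paths):
    """Return the source XPath of the first surviving node whose own text
    contains value, or None if the value cannot be grounded."""
    for elem in root.iter():
        if value in (elem.text or ""):
            return paths[elem]
    return None


# Invented, well-formed sample page (real HTML would need an HTML parser).
PAGE = """<html><body>
<nav><a>Home</a></nav>
<div><h1>Acme Phone X</h1><span>$199</span><div></div></div>
<footer>Copyright Acme</footer>
</body></html>"""

root = ET.fromstring(PAGE)
paths = index_xpaths(root)
prune(root)

kept_tags = [e.tag for e in root.iter()]
title_xpath = resolve("Acme Phone X", root, paths)
footer_xpath = resolve("Copyright Acme", root, paths)
```

In this sketch a value that appears only in pruned boilerplate, such as the footer text, resolves to `None`, which is one simple way to reject an output that cannot be traced to a retained source node. In a pipeline of the kind described above, the pruned tree would be serialized and passed to the language model, and each value the model returns would go through a resolution step like `resolve` before being accepted.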