AXE: Low-Cost Cross-Domain Web Structured Information Extraction

arXiv cs.CL / 4/1/2026

Key Points

AXE (Adaptive X-Path Extractor) reframes web structured data extraction as pruning the HTML DOM tree: boilerplate and irrelevant nodes are removed before any structured output is generated.
The pipeline distills high-density, grounded context so a small 0.6B LLM can produce precise extractions with state-of-the-art zero-shot results on the SWDE dataset.
AXE adds Grounded XPath Resolution (GXR) to ensure outputs are traceable back to specific source DOM nodes, improving reliability and auditability versus purely heuristic or ungrounded approaches.
Despite its low compute footprint, AXE reports a zero-shot F1 score of 88.1% on SWDE and outperforms several much larger, fully trained alternatives, suggesting a cost-effective path for large-scale web extraction.
The authors release specialized adaptors and code publicly to enable practical adoption for web information extraction workloads.
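
To make the pruning idea concrete, here is a minimal sketch of removing boilerplate nodes from a DOM tree before handing the remainder to a model. It is not AXE's pruning mechanism, which the summary describes only as "specialized": the fixed tag blocklist, the "no visible text" rule, and the names `BOILERPLATE_TAGS`, `has_text`, `prune`, and `PAGE` are all assumptions made for illustration. It also assumes well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Assumption: a toy blocklist standing in for a real pruning component.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "aside"}


def has_text(node):
    """True if the node or any descendant carries non-whitespace text."""
    return any(t.strip() for t in node.itertext())


def prune(node):
    """Remove boilerplate or text-free children in place, depth-first.

    Note: removing an element also discards its tail text, which is
    acceptable for this sketch because the tails here are whitespace.
    """
    for child in list(node):  # copy the child list, since we mutate node
        if child.tag in BOILERPLATE_TAGS or not has_text(child):
            node.remove(child)
        else:
            prune(child)
    return node


# Hypothetical product page, kept well-formed so ElementTree can parse it.
PAGE = """<html><body>
<nav><a href="/">Home</a></nav>
<div id="main"><h1>Acme Laptop</h1><span class="price">$999</span><div class="spacer"></div></div>
<footer>Copyright Acme</footer>
</body></html>"""

root = prune(ET.fromstring(PAGE))
pruned_html = ET.tostring(root, encoding="unicode")
print(pruned_html)
```

After pruning, only the main content block with the title and price remains, which is the kind of shorter, denser context a small model could plausibly handle. Real pages are rarely well-formed XML, so a production version would need a tolerant HTML parser.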

Abstract

Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction. Our code and adaptors are publicly available at https://github.com/abdo-Mansour/axetract.
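
The grounding step can be illustrated in a similar spirit. The sketch below assumes that grounding means: for each value a model emits, find a DOM node whose text matches it and record that node's XPath, rejecting values with no match. This is an interpretation of "traceable to a source node", not the paper's GXR algorithm; the exact-match rule and the names `xpath_index`, `ground`, and `DOC` are hypothetical.

```python
import xml.etree.ElementTree as ET


def xpath_index(root):
    """Map every element to an absolute positional XPath, e.g. /html/body[1]/div[1]."""
    paths = {root: "/" + root.tag}
    stack = [root]
    while stack:
        parent = stack.pop()
        counts = {}  # per-tag sibling counter for positional predicates
        for child in parent:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{counts[child.tag]}]"
            stack.append(child)
    return paths


def ground(root, value):
    """Return the XPath of the first element whose own text equals value.

    Returns None when no node matches, i.e. the value is ungrounded.
    Assumption: exact match after stripping whitespace.
    """
    paths = xpath_index(root)
    target = value.strip()
    for node in root.iter():  # document order
        if (node.text or "").strip() == target:
            return paths[node]
    return None


# Hypothetical pruned page and two candidate extractions.
DOC = "<html><body><div><h1>Acme Laptop</h1><span>$999</span></div></body></html>"
doc_root = ET.fromstring(DOC)

grounded = ground(doc_root, "$999")      # value present in the page
ungrounded = ground(doc_root, "$1,299")  # value absent from the page
print(grounded, ungrounded)
```

A value that resolves to `None` can be dropped or flagged for review, which is one simple way to keep every accepted extraction auditable against the source page.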