AXE: Low-Cost Cross-Domain Web Structured Information Extraction

arXiv cs.CL / 4/1/2026


Key Points

  • AXE (Adaptive X-Path Extractor) proposes reframing web structured data extraction as pruning the HTML DOM tree to remove boilerplate and irrelevant nodes before generating structured outputs.
  • The pipeline distills high-density, grounded context so a small 0.6B LLM can produce precise extractions with state-of-the-art zero-shot results on the SWDE dataset.
  • AXE adds Grounded XPath Resolution (GXR) to ensure outputs are traceable back to specific source DOM nodes, improving reliability and auditability versus purely heuristic or ungrounded approaches.
  • Despite its low compute footprint, AXE reports an F1 score of 88.1% and outperforms several larger fully trained alternatives, suggesting a cost-effective path for large-scale web extraction.
  • The authors release specialized adaptors and code publicly to enable practical adoption for web information extraction workloads.
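To make the pruning idea concrete, here is a minimal sketch using only Python's standard library. It is not the authors' pruner, which the paper describes as a specialized mechanism. It stands in a simple heuristic: drop subtrees whose tag is on a hypothetical boilerplate list (`BOILERPLATE_TAGS`) or that carry no text. The names `prune`, `has_text`, and the sample `PAGE` are illustrative assumptions, and the input is assumed to be well-formed markup so that `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical boilerplate list; a heuristic stand-in for the paper's pruner.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


def has_text(node):
    """True if the node or any descendant carries non-whitespace text."""
    return any(t.strip() for t in node.itertext())


def prune(node):
    """Remove boilerplate and text-free children in place, recursively.

    Note: removing a child also drops any tail text that followed it.
    """
    for child in list(node):  # copy the child list, since we mutate it
        if child.tag in BOILERPLATE_TAGS or not has_text(child):
            node.remove(child)
        else:
            prune(child)
    return node


# Illustrative, well-formed page snippet (an assumption, not SWDE data).
PAGE = """<html><body>
<nav><a href="/">Home</a></nav>
<div id="main"><h1>Acme Laptop</h1><span class="price">$999</span><div></div></div>
<script>track()</script>
<footer>Copyright</footer>
</body></html>"""

root = prune(ET.fromstring(PAGE))
pruned_html = ET.tostring(root, encoding="unicode")
print(pruned_html)
```

The pruned markup keeps only the product title and price, which is the kind of short, dense context a small model could then read. Real pages are rarely well-formed, so a production pipeline would need a tolerant HTML parser in place of `ElementTree`.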

Abstract

Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction. Our code and adaptors are publicly available at https://github.com/abdo-Mansour/axetract.
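The grounding requirement described in the abstract can be illustrated with a small sketch. The abstract does not spell out how GXR works, so the code below is an assumption about one simple way to achieve traceability: each extracted value is accepted only if some DOM node's own text equals it, and the node's positional XPath is recorded. The helpers `iter_with_xpath` and `ground`, the sample page, and the model output are all hypothetical.

```python
import xml.etree.ElementTree as ET


def iter_with_xpath(node, path):
    """Yield (xpath, element) pairs, using 1-based positional steps."""
    yield path, node
    counts = {}
    for child in node:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        step = f"{child.tag}[{counts[child.tag]}]"
        yield from iter_with_xpath(child, f"{path}/{step}")


def ground(root, extraction):
    """Map each field to the XPath of the first node whose own text equals
    the extracted value, or to None when no such node exists."""
    grounded = {}
    for field, value in extraction.items():
        grounded[field] = next(
            (xp for xp, el in iter_with_xpath(root, f"/{root.tag}")
             if (el.text or "").strip() == value.strip()),
            None,
        )
    return grounded


# Illustrative page and model output (assumptions, not data from the paper).
PAGE = "<html><body><div><h1>Acme Laptop</h1><span>$999</span></div></body></html>"
root = ET.fromstring(PAGE)
result = ground(root, {"title": "Acme Laptop", "price": "$999", "brand": "Initech"})
print(result)
```

In this sketch the `brand` value maps to `None` because it does not occur anywhere in the page, so a caller could discard or flag it as ungrounded. Exact text matching is a deliberate simplification; values that span several nodes or differ in formatting would need a looser comparison.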