Information Extraction from Electricity Invoices with General-Purpose Large Language Models

arXiv cs.CL / April 30, 2026


Key Points

  • The study benchmarks general-purpose LLMs (Gemini 1.5 Pro and Mistral-small) for extracting structured data from semi-structured Spanish electricity invoices without task-specific fine-tuning.
  • Benchmarking 19 parameter configurations and 6 prompting strategies on a subset of the IDSEM dataset, the researchers treat prompt engineering as the primary experimental variable.
  • Results show prompt quality outweighs hyperparameter tuning: F1 differences across parameter configurations are small, while the best few-shot strategies outperform zero-shot by more than 19 percentage points (a prompt-construction sketch follows this list).
  • The top approach, few-shot with cross-validation, reaches F1-scores of 97.61% for Gemini and 96.11% for Mistral-small, with invoice template structure emerging as the primary determinant of extraction difficulty.
  • The paper provides an empirical framework indicating that careful prompt design is the key lever for maximizing extraction fidelity in LLM-based business document automation.
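To make the zero-shot/few-shot distinction concrete, the sketch below builds both prompt variants for a single invoice. It is a minimal illustration, not the paper's actual prompts: the field names, the JSON output format, and the demonstration layout are all assumptions for the sake of the example.

```python
import json

# Hypothetical field set; the paper extracts many more fields from IDSEM
# invoices, but these stand in for illustration.
FIELDS = ["contract_number", "billing_period", "total_amount", "energy_kwh"]

def zero_shot_prompt(invoice_text: str) -> str:
    """Zero-shot baseline: instructions only, no worked examples."""
    return (
        "Extract the following fields from the Spanish electricity invoice "
        f"below and return them as a JSON object with keys {FIELDS}. "
        "Use null for any field that does not appear.\n\n"
        f"Invoice:\n{invoice_text}"
    )

def few_shot_prompt(invoice_text: str, examples: list[tuple[str, dict]]) -> str:
    """Few-shot variant: prepend solved (invoice, JSON) pairs as demonstrations."""
    demos = "\n\n".join(
        f"Invoice:\n{src}\nJSON:\n{json.dumps(gold, ensure_ascii=False)}"
        for src, gold in examples
    )
    return (
        "Extract the same fields as in the examples and answer with JSON only.\n\n"
        f"{demos}\n\nInvoice:\n{invoice_text}\nJSON:"
    )
```

Asking for JSON makes the model's output directly comparable with the gold annotations, which is what a field-level F1 evaluation like the one sketched after the abstract relies on.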

Abstract

Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
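The headline F1 numbers are field-level scores. The paper does not include its scoring code here, so the routine below is a plausible reconstruction assuming exact string match per field; the matching criterion and the micro-averaging are assumptions, not the authors' published implementation.

```python
def field_level_f1(predictions: list[dict], references: list[dict]) -> float:
    """Micro-averaged F1 over all (document, field) pairs, exact-match scoring."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, references):
        for field, gold_value in gold.items():
            pred_value = pred.get(field)
            if pred_value is None:
                fn += 1          # field missed entirely
            elif pred_value == gold_value:
                tp += 1          # exact match
            else:
                fp += 1          # wrong value counts against both
                fn += 1          # precision and recall
        # hallucinated fields absent from the gold annotation
        fp += sum(1 for f, v in pred.items() if f not in gold and v is not None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one document, one correct field and one wrong one.
pred = [{"total_amount": "54.32", "energy_kwh": "250"}]
gold = [{"total_amount": "54.32", "energy_kwh": "263"}]
print(round(field_level_f1(pred, gold), 2))  # 0.5
```

Under this kind of scoring, a gap of 19 percentage points between zero-shot and few-shot prompting corresponds to a large fraction of fields switching from mismatches to exact matches, which is why the paper treats prompt design rather than hyperparameter tuning as the decisive lever.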