Information Extraction from Electricity Invoices with General-Purpose Large Language Models

arXiv cs.CL / April 30, 2026


Key Points

  • The study benchmarks general-purpose LLMs (Gemini 1.5 Pro and Mistral-small) for extracting structured data from semi-structured Spanish electricity invoices without task-specific fine-tuning.
  • Benchmarking 19 parameter configurations and 6 prompting strategies on a subset of the IDSEM dataset, the researchers treat prompt engineering as the primary experimental variable.
  • Results show prompt quality outweighs hyperparameter tuning: F1 differences across parameter configurations are small, while the best few-shot strategies outperform zero-shot by more than 19 percentage points (a prompt-construction sketch follows this list).
  • The top approach, few-shot with cross-validation, reaches F1-scores of 97.61% for Gemini and 96.11% for Mistral-small, with invoice template structure emerging as the primary determinant of extraction difficulty.
  • The paper provides an empirical framework indicating that careful prompt design is the key lever for maximizing extraction fidelity in LLM-based business document automation.
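To make the zero-shot/few-shot distinction concrete, the sketch below builds both prompt variants for a single invoice. It is a minimal illustration, not the paper's actual prompts: the field names, the JSON output format, and the demonstration layout are all assumptions for the sake of the example.

```python
import json

# Hypothetical field set; the paper extracts many more fields from IDSEM
# invoices, but these stand in for illustration.
FIELDS = ["contract_number", "billing_period", "total_amount", "energy_kwh"]

def zero_shot_prompt(invoice_text: str) -> str:
    """Zero-shot baseline: instructions only, no worked examples."""
    return (
        "Extract the following fields from the Spanish electricity invoice "
        f"below and return them as a JSON object with keys {FIELDS}. "
        "Use null for any field that does not appear.\n\n"
        f"Invoice:\n{invoice_text}"
    )

def few_shot_prompt(invoice_text: str, examples: list[tuple[str, dict]]) -> str:
    """Few-shot variant: prepend solved (invoice, JSON) pairs as demonstrations."""
    demos = "\n\n".join(
        f"Invoice:\n{src}\nJSON:\n{json.dumps(gold, ensure_ascii=False)}"
        for src, gold in examples
    )
    return (
        "Extract the same fields as in the examples and answer with JSON only.\n\n"
        f"{demos}\n\nInvoice:\n{invoice_text}\nJSON:"
    )
```

Asking for JSON makes the model's output directly comparable with the gold annotations, which is what a field-level F1 evaluation like the one sketched after the abstract relies on.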

Abstract

Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electricity invoices without task-specific fine-tuning. Using a subset of the IDSEM dataset, we benchmark two architecturally distinct models, Gemini 1.5 Pro and Mistral-small, across 19 parameter configurations and 6 prompting strategies. Our experimental framework treats prompt engineering as the primary experimental variable, comparing zero-shot baselines against increasingly sophisticated few-shot approaches and iterative extraction strategies. Results demonstrate that prompt quality dominates over hyperparameter tuning: the F1-score variation across all parameter configurations is marginal, while the gap between zero-shot and the best few-shot strategy exceeds 19 percentage points. The best configuration (few-shot with cross-validation) achieves an F1-score of 97.61% for Gemini and 96.11% for Mistral-small, with document template structure emerging as the primary determinant of extraction difficulty. These findings establish that prompt design is the critical lever for maximizing extraction fidelity in LLM-based document processing, thereby providing an empirical framework for integrating general-purpose LLMs into business document automation.
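The headline F1 numbers are field-level scores. The paper does not include its scoring code here, so the routine below is a plausible reconstruction assuming exact string match per field; the matching criterion and the micro-averaging are assumptions, not the authors' published implementation.

```python
def field_level_f1(predictions: list[dict], references: list[dict]) -> float:
    """Micro-averaged F1 over all (document, field) pairs, exact-match scoring."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, references):
        for field, gold_value in gold.items():
            pred_value = pred.get(field)
            if pred_value is None:
                fn += 1          # field missed entirely
            elif pred_value == gold_value:
                tp += 1          # exact match
            else:
                fp += 1          # wrong value counts against both
                fn += 1          # precision and recall
        # hallucinated fields absent from the gold annotation
        fp += sum(1 for f, v in pred.items() if f not in gold and v is not None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one document, one correct field and one wrong one.
pred = [{"total_amount": "54.32", "energy_kwh": "250"}]
gold = [{"total_amount": "54.32", "energy_kwh": "263"}]
print(round(field_level_f1(pred, gold), 2))  # 0.5
```

Under this kind of scoring, a gap of 19 percentage points between zero-shot and few-shot prompting corresponds to a large fraction of fields switching from mismatches to exact matches, which is why the paper treats prompt design rather than hyperparameter tuning as the decisive lever.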