How to Extract Structured Data from Indian Invoice Scans and Images

Dev.to / 3/26/2026


Key Points

  • The article explains how to extract clean, validated structured JSON from messy Indian invoice and receipt scans by using the BharatParse API instead of generic OCR.
  • It highlights key domain-specific logic such as interpreting handwritten fuel quantities, validating GSTIN checksums, and ignoring irrelevant receipt sections like telecom tables.
  • BharatParse supports 13 predefined document schemas (e.g., gst_invoice, fuel, telecom, travel, utility, medical, ecommerce, rent, bank_statement, credit_card) and can auto-detect an input document type using a single POST request.
  • The guide notes broad input compatibility (PDF and common image formats including JPEG/PNG/WebP/TIFF) and claims a fast Python integration setup (under 5 minutes).
  • The primary workflow target is integrating invoice processing into expense management, accounting integrations, and GST reconciliation systems for Indian businesses.

How to Extract Structured Data from Indian Invoices Using Python (GST, Fuel, Telecom, IRCTC)

If you've ever built an expense management tool, accounting integration, or GST reconciliation system for Indian businesses, you know the problem: Indian invoices are a mess.

A Jio bill is 7 pages long but has only one useful page. A petrol pump receipt has handwritten amounts in blue ink over a printed template. An IRCTC ticket has the GST invoice buried on page 2. A Starbucks receipt is a blurry photo taken at an angle on a phone.

Traditional OCR tools like AWS Textract or Google Vision extract raw text — but they don't understand that 94:14 written on a fuel receipt means 94.14 litres, or that a GSTIN has a checksum you can validate, or that you should ignore the 80-row data usage table in a Jio bill and focus on the summary.

That's the problem I built BharatParse to solve — an API that turns any Indian invoice, bill, or receipt into clean, validated JSON with a single POST request.

In this article I'll show you how to integrate it in Python in under 5 minutes.

What BharatParse Handles

The API supports 13 document schemas out of the box:

Schema            Examples
gst_invoice       B2B tax invoices with GSTIN validation
restaurant        Starbucks, Zomato, local restaurant bills
fuel              Handwritten BPCL, HPCL, IOC pump receipts
telecom           Jio Fiber, Airtel, BSNL, Vi monthly bills
travel            IRCTC e-tickets, train ERS
utility           BESCOM, MSEDCL, Mahanagar Gas bills
medical           Pharmacy bills, hospital invoices
ecommerce         Amazon, Flipkart, Meesho invoices
rent              Rent receipts with landlord PAN extraction
bank_statement    HDFC, SBI, ICICI, Axis statements
credit_card       Credit card monthly statements
auto              Auto-detects the document type
generic           Any other Indian bill or receipt

Input formats supported: PDF, JPEG, PNG, WebP, TIFF — phone photos, scanner output, WhatsApp-shared images all work.

Getting Started

1. Get your free API key

Sign up at RapidAPI — the free tier gives you 50 extractions/month, no credit card needed.

2. Install requests

pip install requests

3. Make your first call

import base64
import requests
import json

def extract_invoice(file_path, schema="auto"):
    """
    Extract structured data from any Indian invoice.

    Args:
        file_path: Path to PDF or image file
        schema: Document type hint (default: auto-detect)

    Returns:
        dict: Extracted data with confidence score
    """
    # Determine file type from extension
    ext = file_path.rsplit(".", 1)[-1].lower()

    # Read and encode the file
    with open(file_path, "rb") as f:
        file_b64 = base64.b64encode(f.read()).decode()

    # Call BharatParse API
    response = requests.post(
        "https://bharatparse-indian-invoice-bill.p.rapidapi.com/v1/extract",
        headers={
            "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
            "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
            "Content-Type": "application/json"
        },
        json={
            "file_b64": file_b64,
            "file_type": ext,
            "schema": schema,
            "country": "IN"
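        },
        timeout=60,  # multi-page PDFs can take a while (the Jio sample below took ~24 s)
    )

    # Raise on HTTP errors (invalid key, exhausted quota) instead of parsing an error body
    response.raise_for_status()
    return response.json()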

# Test it
result = extract_invoice("invoice.pdf")
print(json.dumps(result, indent=2))
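
If you want to unit test the request logic without hitting the network, one option is to build the payload in its own function. The sketch below reuses the request fields from the snippet above (file_b64, file_type, schema, country). The helper names and the RAPIDAPI_KEY environment variable are my own assumptions, not something the API requires.

```python
import base64
import os
from pathlib import Path


def build_payload(file_path, schema="auto"):
    """Build the JSON body for the extract call without sending it."""
    path = Path(file_path)
    # Same convention as the snippet above: file_type is the lowercased extension
    ext = path.suffix.lstrip(".").lower()
    if not ext:
        raise ValueError(f"{path.name} has no extension to use as file_type")
    file_b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return {"file_b64": file_b64, "file_type": ext, "schema": schema, "country": "IN"}


def api_key_from_env(var_name="RAPIDAPI_KEY"):
    """Read the key from the environment (variable name is an assumption)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} before calling the API")
    return key
```

The returned dict can be passed as the json= argument of requests.post, and the key can replace the hard-coded "YOUR_RAPIDAPI_KEY" string.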

Real Examples

Example 1 — Restaurant Bill (Starbucks photo)

result = extract_invoice("starbucks_receipt.jpg", schema="restaurant")

Response:

{
  "schema_detected": "restaurant",
  "confidence": 0.92,
  "data": {
    "restaurant_name": "Starbucks",
    "hsn_code": "996331",
    "line_items": [
      {
        "name": "Tall Cold Coffee",
        "quantity": 1,
        "unit_price": 320.0,
        "total": 320.0
      }
    ],
    "taxable_value": 320.0,
    "cgst_rate": 2.5,
    "cgst_amount": 8.0,
    "sgst_rate": 2.5,
    "sgst_amount": 8.0,
    "grand_total": 336.0,
    "payment": {
      "mode": "starbucks_card",
      "card_last4": "1821"
    }
  },
  "warnings": ["Invoice date not visible in scan"],
  "processing_ms": 1843
}

Notice that it returned 996331 (the SAC code for restaurant services) in the hsn_code field, extracted CGST and SGST at 2.5% each, and identified the payment mode as a Starbucks card along with its last 4 digits.
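
Before posting figures like these to a ledger, it can be worth re-checking the tax arithmetic yourself. Here is a minimal sketch that assumes the flat field names shown in the restaurant response above. The function name and the 0.05 rupee tolerance are my own choices.

```python
def check_gst_arithmetic(data, tolerance=0.05):
    """Return a list of problems found; an empty list means the numbers agree."""
    problems = []
    taxable = data["taxable_value"]
    tax_total = 0.0
    for tax in ("cgst", "sgst"):
        rate = data[f"{tax}_rate"]
        amount = data[f"{tax}_amount"]
        expected = taxable * rate / 100
        # Each tax amount should equal rate% of the taxable value
        if abs(expected - amount) > tolerance:
            problems.append(f"{tax}_amount {amount} is not {rate}% of {taxable}")
        tax_total += amount
    # Taxable value plus both taxes should add up to the grand total
    if abs(taxable + tax_total - data["grand_total"]) > tolerance:
        problems.append("taxable_value + taxes does not equal grand_total")
    return problems


sample = {
    "taxable_value": 320.0,
    "cgst_rate": 2.5, "cgst_amount": 8.0,
    "sgst_rate": 2.5, "sgst_amount": 8.0,
    "grand_total": 336.0,
}
print(check_gst_arithmetic(sample))  # prints []
```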

Example 2 — Handwritten Fuel Receipt (BPCL pump memo)

This is where BharatParse really earns its value. Generic OCR tools fail on these.

result = extract_invoice("fuel_receipt.jpg", schema="fuel")

Response:

{
  "schema_detected": "fuel",
  "confidence": 0.90,
  "data": {
    "dealer_name": "N. M. Shamsuddin & Sons",
    "oil_company": "BPCL",
    "invoice_date": "2025-06-05",
    "fuel_items": [
      {
        "fuel_type": "Speed",
        "litres": 94.14,
        "rate_per_litre": 21.24,
        "amount": 2000.0
      }
    ],
    "total_amount": 2000.0
  },
  "warnings": [
    "Litres value '94:14' is handwritten and interpreted as 94.14"
  ],
  "processing_ms": 8598
}

It correctly interpreted 94:14 (written with a colon) as 94.14 litres, identified the fuel type as Speed (BPCL's premium petrol brand), and flagged the handwritten interpretation in warnings.
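
Handwritten values are the riskiest part of this response, so a cheap downstream safeguard is to confirm that litres times rate_per_litre roughly matches amount. This is a minimal sketch using the field names from the response above. The function name and the 1 rupee tolerance are assumptions on my part.

```python
def fuel_item_is_consistent(item, tolerance=1.0):
    """True when litres * rate_per_litre is within `tolerance` rupees of amount."""
    computed = item["litres"] * item["rate_per_litre"]
    return abs(computed - item["amount"]) <= tolerance


item = {"fuel_type": "Speed", "litres": 94.14, "rate_per_litre": 21.24, "amount": 2000.0}
print(fuel_item_is_consistent(item))  # prints True

# A misread such as 9414 litres fails the same check
misread = dict(item, litres=9414.0)
print(fuel_item_is_consistent(misread))  # prints False
```

Items that fail the check are good candidates for the human review queue, regardless of the confidence score.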

Example 3 — Jio Fiber Bill (7-page PDF)

result = extract_invoice("jio_bill.pdf", schema="telecom")

Response:

{
  "schema_detected": "telecom",
  "confidence": 1.0,
  "data": {
    "provider": "Jio",
    "customer_name": "Mr. Shyam Arjandas Warialani",
    "account_number": "411252569305",
    "due_date": "2025-09-30",
    "plan_name": "Postpaid_399_6M: Unlimited Data @ 30 Mbps",
    "vendor_gstin": "24AABCI6363G1ZP",
    "gst_bill_number": "W241252611070283",
    "sac_code": "998422",
    "charges": {
      "current_taxable_charges": 399.0,
      "cgst_rate": 9.0,
      "cgst_amount": 35.91,
      "sgst_rate": 9.0,
      "sgst_amount": 35.91,
      "total_current_charges": 470.82
    },
    "total_payable": 470.82,
    "payments_this_period": [
      {"mode": "credit_card", "date": "2025-09-01", "amount": 394.89}
    ]
  },
  "warnings": [],
  "processing_ms": 24406
}

From a 7-page PDF, it extracted only the useful billing summary — ignoring 80 rows of itemised data usage and focusing on what any accounting system actually needs.

Example 4 — IRCTC Train Ticket

result = extract_invoice("irctc_ticket.pdf", schema="travel")

Response:

{
  "schema_detected": "travel",
  "confidence": 0.95,
  "data": {
    "pnr": "8543381796",
    "train_number": "82902",
    "train_name": "IRCTC TEJAS EXP",
    "journey_date": "2026-01-24",
    "from_station": "AHMEDABAD JN (ADI)",
    "boarding_station": "VADODARA JN (BRC)",
    "to_station": "BORIVALI (BVI)",
    "passengers": [
      {
        "name": "SHYAM WARIALANI",
        "age": 67,
        "gender": "M",
        "current_status": "WL/44",
        "catering": "VEG"
      }
    ],
    "fare": {
      "ticket_fare": 1680.0,
      "convenience_fee": 35.4,
      "total_fare": 1715.4
    },
    "gst": {
      "invoice_number": "PS26854338179611",
      "supplier_gstin": "27AAACI7074F1ZK",
      "sac_code": "996421",
      "igst_rate": 5.0,
      "igst_amount": 80.0,
      "total_tax": 80.0
    }
  },
  "warnings": [],
  "processing_ms": 11990
}

It extracted from both pages of the ERS — the ticket details from page 1 and the GST invoice from page 2.
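
The travel response does not include a taxable value, which GST reconciliation usually needs. One way to derive it is from the tax amount and rate, as sketched below. This assumes ticket_fare is tax-inclusive, which the sample numbers are consistent with (1600 + 80 = 1680) but which the response itself does not state. Both function names are hypothetical.

```python
def taxable_from_tax(tax_amount, rate_percent):
    """Back out the taxable value from a tax amount and its percentage rate."""
    if rate_percent <= 0:
        raise ValueError("rate_percent must be positive")
    return round(tax_amount * 100 / rate_percent, 2)


def fare_adds_up(data, tolerance=0.01):
    """True when ticket_fare + convenience_fee matches total_fare."""
    fare = data["fare"]
    computed = fare["ticket_fare"] + fare["convenience_fee"]
    return abs(computed - fare["total_fare"]) <= tolerance


travel = {
    "fare": {"ticket_fare": 1680.0, "convenience_fee": 35.4, "total_fare": 1715.4},
    "gst": {"igst_rate": 5.0, "igst_amount": 80.0},
}
print(taxable_from_tax(travel["gst"]["igst_amount"], travel["gst"]["igst_rate"]))  # prints 1600.0
print(fare_adds_up(travel))  # prints True
```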

Building a Simple Expense Processor

Here's a practical example — a script that processes a folder of mixed invoices and outputs a CSV for accounting:

import base64
import csv
from pathlib import Path

import requests

API_KEY = "YOUR_RAPIDAPI_KEY"
HEADERS = {
    "X-RapidAPI-Key": API_KEY,
    "X-RapidAPI-Host": "bharatparse-indian-invoice-bill.p.rapidapi.com",
    "Content-Type": "application/json"
}

def extract(file_path):
    ext = file_path.suffix.lstrip(".").lower()
    with open(file_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    r = requests.post(
        "https://bharatparse-indian-invoice-bill.p.rapidapi.com/v1/extract",
        headers=HEADERS,
        json={"file_b64": b64, "file_type": ext, "schema": "auto", "country": "IN"},
        timeout=60,
    )
    r.raise_for_status()  # HTTP errors are caught by the try/except in process_folder
    return r.json()

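def get_total(data, schema):
    """Return the grand total from the `data` payload of any schema."""
    # `data` is already result["data"], so look the fields up directly
    for field in ["grand_total", "total_payable", "total_amount", "total_fare"]:
        if data.get(field):
            return data[field]
    # Some schemas nest totals (the travel sample keeps total_fare under "fare")
    for section in ["totals", "fare"]:
        nested = data.get(section) or {}
        for field in ["grand_total", "total", "total_fare"]:
            if nested.get(field):
                return nested[field]
    return None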

def process_folder(folder_path, output_csv="expenses.csv"):
    folder = Path(folder_path)
    supported = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".tiff", ".tif"}
    files = [f for f in folder.iterdir() if f.suffix.lower() in supported]

    rows = []
    for file in files:
        print(f"Processing {file.name}...")
        try:
            result = extract(file)
            schema = result.get("schema_detected", "unknown")
            data = result.get("data", {})
            confidence = result.get("confidence", 0)
            warnings = result.get("warnings", [])

            rows.append({
                "file": file.name,
                "type": schema,
                "vendor": (data.get("restaurant_name") or 
                           data.get("dealer_name") or
                           data.get("provider") or
                           data.get("vendor", {}).get("name") or
                           data.get("bank_name") or ""),
                "date": (data.get("invoice_date") or 
                         data.get("bill_date") or
                         data.get("journey_date") or ""),
                "total": get_total(data, schema) or "",
                "gstin": (data.get("gstin") or
                          data.get("vendor_gstin") or
                          data.get("vendor", {}).get("gstin") or ""),
                "confidence": confidence,
                "warnings": "; ".join(warnings) if warnings else ""
            })
        except Exception as e:
            print(f"  Error: {e}")
            # Keep to the CSV columns: DictWriter raises ValueError on unknown keys like "error"
            rows.append({"file": file.name, "type": "error", "warnings": str(e)})

    # Write CSV
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file","type","vendor","date","total","gstin","confidence","warnings"])
        writer.writeheader()
        writer.writerows(rows)

    print(f"\nDone. {len(rows)} invoices processed → {output_csv}")
    return rows

# Run it
results = process_folder("./invoices", "expenses.csv")

Drop any mix of PDFs and images into an invoices/ folder, run the script, and get a clean CSV with vendor, date, total, and GSTIN for every document.
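
Once you have the rows, a quick per-type summary is a handy sanity check before importing the CSV anywhere. The sketch below works on the list returned by process_folder and skips rows whose total is missing or non-numeric. The function name is my own.

```python
from collections import defaultdict


def totals_by_type(rows):
    """Sum the numeric `total` of each row, grouped by document type."""
    summary = defaultdict(float)
    for row in rows:
        total = row.get("total")
        # Error rows and rows without a detected total carry "" or no total at all
        if isinstance(total, (int, float)):
            summary[row.get("type", "unknown")] += total
    return dict(summary)


rows = [
    {"file": "a.jpg", "type": "restaurant", "total": 336.0},
    {"file": "b.jpg", "type": "fuel", "total": 2000.0},
    {"file": "c.jpg", "type": "restaurant", "total": 164.0},
    {"file": "d.pdf", "type": "error"},
]
print(totals_by_type(rows))  # prints {'restaurant': 500.0, 'fuel': 2000.0}
```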

Confidence Scores and Warnings

Every response includes a confidence score (0.0–1.0) and a warnings array:

result = extract_invoice("blurry_receipt.jpg")

if result["confidence"] < 0.70:
    print("Low confidence — recommend human review")
    print("Warnings:", result["warnings"])
elif result["warnings"]:
    print("Extracted successfully with notes:")
    for w in result["warnings"]:
        print(f"  - {w}")
else:
    print("Clean extraction — confidence:", result["confidence"])

This makes it easy to build a human-in-the-loop workflow — auto-approve high confidence extractions, flag low confidence ones for review.
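
For a batch of documents, the same rule can sort responses into queues. This is a minimal sketch assuming a dict that maps file names to API responses. The queue names and the 0.70 threshold are illustrative choices, not part of the API.

```python
def route_results(results, min_confidence=0.70):
    """Split {filename: response} into queues using confidence and warnings."""
    queues = {"auto_approve": [], "approve_with_notes": [], "needs_review": []}
    for name, result in results.items():
        confidence = result.get("confidence", 0.0)
        if confidence < min_confidence:
            queues["needs_review"].append(name)
        elif result.get("warnings"):
            queues["approve_with_notes"].append(name)
        else:
            queues["auto_approve"].append(name)
    return queues


results = {
    "jio_bill.pdf": {"confidence": 1.0, "warnings": []},
    "fuel_receipt.jpg": {"confidence": 0.90, "warnings": ["handwritten litres"]},
    "blurry_receipt.jpg": {"confidence": 0.55, "warnings": []},
}
print(route_results(results))
```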

GSTIN Validation

BharatParse automatically validates every GSTIN it extracts using full checksum verification. Invalid GSTINs are flagged in warnings rather than silently passed through:

result = extract_invoice("vendor_invoice.pdf", schema="gst_invoice")
vendor = result["data"].get("vendor", {})

print("GSTIN:", vendor.get("gstin"))
print("PAN:", vendor.get("pan"))

# Check for validation warnings
gstin_warnings = [w for w in result["warnings"] if "GSTIN" in w]
if gstin_warnings:
    print("GSTIN issue:", gstin_warnings[0])
else:
    print("GSTIN validated ✓")
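
If you want a second, local check, the GSTIN check character can be recomputed with the widely documented mod-36 scheme. The sketch below is my own implementation of that scheme and is not a description of how BharatParse validates internally. It checks only length, character set, and the check character, not whether the GSTIN is actually registered.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def gstin_check_char(first14):
    """Compute the 15th character from the first 14 (weights alternate 1, 2)."""
    total = 0
    for i, ch in enumerate(first14):
        factor = 1 if i % 2 == 0 else 2
        product = ALPHABET.index(ch) * factor
        # Add the base-36 quotient and remainder of each weighted value
        total += product // 36 + product % 36
    return ALPHABET[(36 - total % 36) % 36]


def is_valid_gstin(gstin):
    """True when the string has 15 valid characters and a matching check character."""
    gstin = gstin.strip().upper()
    if len(gstin) != 15 or any(ch not in ALPHABET for ch in gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]


# The two GSTINs from the sample responses earlier in this article
print(is_valid_gstin("24AABCI6363G1ZP"))  # prints True
print(is_valid_gstin("27AAACI7074F1ZK"))  # prints True
print(is_valid_gstin("27AAACI7074F1ZA"))  # prints False
```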

Practical Use Cases

Expense management apps — automatically categorise and extract amounts from employee expense receipts. No manual data entry.

GST reconciliation — extract invoice numbers, GSTINs, and tax breakdowns for GSTR-2A/2B matching.

Accounting integrations — push extracted data directly to Tally, Zoho Books, or QuickBooks India via their APIs.

Insurance claim processing — extract medical bills, pharmacy receipts, and hospital invoices for claim automation.

HRA compliance — extract rent receipts with landlord PAN for Form 16 and Section 10(13A) claims.

Corporate travel — extract IRCTC ticket details, journey dates, and GST invoices for travel expense reporting.

Pricing

The API is available on RapidAPI:

  • Free — 50 extractions/month, no credit card
  • Pro — $29/month — 500 extractions
  • Ultra — $79/month — 2,500 extractions
  • Mega — $199/month — 10,000 extractions
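
To estimate which tier fits your volume, here is a small sketch based on the prices and quotas listed above. It assumes there is no overage billing, since none is described here, and returns None when the volume exceeds the largest listed plan.

```python
# (name, price in USD per month, extractions per month), taken from the list above
PLANS = [("Free", 0, 50), ("Pro", 29, 500), ("Ultra", 79, 2500), ("Mega", 199, 10000)]


def cheapest_plan(monthly_extractions):
    """Return (name, price) of the cheapest plan whose quota covers the volume."""
    for name, price, quota in PLANS:  # PLANS is ordered from cheapest to most expensive
        if monthly_extractions <= quota:
            return name, price
    return None


print(cheapest_plan(400))   # prints ('Pro', 29)
print(cheapest_plan(3000))  # prints ('Mega', 199)
```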

Full documentation at bharatparse.netlify.app.

Wrapping Up

Indian document extraction is a genuinely hard problem — not because the technology is complex, but because Indian documents are diverse, inconsistent, and often handwritten. A tool that understands Indian document structure rather than just reading raw text makes a real difference in production.

If you're building anything that touches Indian invoices, bills, or receipts — expense management, GST tools, accounting integrations, fintech — give BharatParse a try. The free tier is enough to validate your use case.

Questions or edge cases? Drop them in the comments — I'm actively improving the extraction prompts based on real-world documents.

Tags: python, india, api, webdev, productivity