Built a mortgage OCR system that hit 100% final accuracy in production (US/UK underwriting)

Reddit r/LocalLLaMA / 3/28/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The article argues mortgage underwriting pipelines often fail due to unreliable document input, not underwriting logic, and describes a document processing OCR system now running in production for a US firm.
  • It reports 96% of underwriting fields extracted automatically with the remaining 4% handled via targeted human review, achieving 100% final accuracy at the output layer.
  • The core approach replaces generic OCR with underwriting-specific, document-type-aware extraction (e.g., Form 1003, W-2, pay stubs, bank statements, 1040 tax returns) plus field-level validation and source traceability.
  • The system emphasizes layout-aware extraction, confidence/override logging, and an auditable pipeline designed for compliance needs (SOC 2-aligned, HIPAA-style safeguards where needed, GLBA/lender requirements, deployable in VPC/on-prem).
  • Claimed outcomes include 65–75% fewer manual reviews, faster turnaround (24–48h to 10–30 minutes), substantial reductions in exceptions and ops headcount, and roughly $2M/year in cost savings versus generic OCR providers.

Most mortgage underwriting pipelines aren’t failing because of underwriting logic. They’re failing because the input data is unreliable.

I worked on a document processing system for a US mortgage underwriting firm that’s now live in production. Not a demo or benchmark.

What it does

  • 96% of fields extracted fully automatically
  • Remaining 4% resolved through targeted human review
  • 100% final accuracy at the output layer
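The 96/4 split above implies a confidence-based router: fields above a threshold pass through automatically, the rest go to targeted human review. Here's a minimal sketch of that routing step; the threshold value and field structure are my own illustrative assumptions, not the author's implementation.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.98  # hypothetical cutoff; tune per field type


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float


def route(fields: list[ExtractedField]) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into auto-accepted vs. queued for human review."""
    auto, needs_review = [], []
    for f in fields:
        (auto if f.confidence >= REVIEW_THRESHOLD else needs_review).append(f)
    return auto, needs_review


fields = [
    ExtractedField("borrower_name", "Jane Doe", 0.995),
    ExtractedField("gross_monthly_income", "7,250.00", 0.91),
]
auto, review = route(fields)
# High-confidence field flows through; low-confidence one is reviewed,
# so the merged output layer can reach 100% final accuracy.
```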

Problem with typical setups
Most teams rely on generic OCR tools: AWS Textract, Google Document AI, Azure Document Intelligence, and the like. In practice, extraction accuracy stalls around ~70%.

That leads to:

  • Constant manual corrections
  • Rework and delays
  • Large ops teams fixing data instead of underwriting

What changed
Instead of treating all documents the same, the system is built around underwriting-specific document types:

  • Form 1003
  • W-2
  • Pay stubs
  • Bank statements
  • 1040 tax returns
  • Employment/income verification docs

Each document type has its own extraction + validation logic.
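"Each document type has its own extraction + validation logic" suggests a dispatch layer keyed on document type. A rough sketch of what that could look like, assuming a registry of per-type validators (function names and rules here are illustrative, not the production system's):

```python
def validate_w2(fields: dict) -> list[str]:
    """Example field-level rule: W-2 wages must parse as a positive number."""
    errors = []
    try:
        if float(fields.get("wages", "").replace(",", "")) <= 0:
            errors.append("wages must be positive")
    except ValueError:
        errors.append("wages is not numeric")
    return errors


def validate_paystub(fields: dict) -> list[str]:
    """Example structural rule: a pay stub must carry a pay-period end date."""
    return [] if "pay_period_end" in fields else ["missing pay_period_end"]


# One validator per underwriting document type, instead of one generic path.
VALIDATORS = {
    "w2": validate_w2,
    "pay_stub": validate_paystub,
    # "form_1003", "bank_statement", "1040", ... each with its own rules
}


def process(doc_type: str, fields: dict) -> list[str]:
    validator = VALIDATORS.get(doc_type)
    if validator is None:
        raise ValueError(f"unsupported document type: {doc_type}")
    return validator(fields)
```

The point of the dispatch is that a bank statement and a 1040 fail in different ways, so their rules shouldn't share one code path.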

System design

  • Layout-aware extraction (not plain OCR)
  • Field-level validation rules per document type
  • Every field traceable to source location
  • Confidence + override logging
  • Fully auditable pipeline
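Traceability and override logging could be carried on the field record itself: every value keeps its source location, and a human correction is appended to an audit trail rather than silently overwriting. This is a sketch under my own assumptions about the record shape, not the system's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class FieldRecord:
    name: str
    value: str
    confidence: float
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates -- source traceability
    audit_log: list = field(default_factory=list)

    def override(self, new_value: str, reviewer: str) -> None:
        # Keep the original value in the audit trail instead of losing it.
        self.audit_log.append({"old": self.value, "new": new_value, "by": reviewer})
        self.value = new_value


rec = FieldRecord("loan_amount", "425,000", 0.93, page=2, bbox=(110, 540, 260, 560))
rec.override("452,000", reviewer="ops_reviewer_1")
```

Keeping the bounding box on every field is what makes "every field traceable to source location" cheap to satisfy downstream: an auditor can jump from any output value back to the exact pixels it came from.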

Compliance-ready

  • SOC 2 aligned (access control, audit logs, change tracking)
  • Handles sensitive financial/PII data (HIPAA-style safeguards where needed)
  • Compatible with GLBA + lender compliance requirements
  • Works in VPC / on-prem environments

Results

  • 65–75% reduction in manual review
  • Turnaround: 24–48h → 10–30 min per file
  • Field accuracy: ~70% → ~96% (pre-review)
  • 60%+ drop in exceptions
  • 30–40% lower ops headcount
  • ~$2M/year cost savings
  • 40–60% lower infra + OCR costs vs generic providers
  • Full auditability

Key insight
This isn’t an “AI model accuracy” problem. It’s a pipeline design problem.

If extraction is document-aware, validated, and auditable, the rest of underwriting becomes straightforward.

Post questions here or reach out via direct message. Open to general discussions and consultation inquiries.

submitted by /u/Fantastic-Radio6835