TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

arXiv cs.CL / April 28, 2026


Key Points

  • The paper introduces TexOCR, a document OCR approach focused on reconstructing scientific PDFs into compilable LaTeX rather than extracting only plain text or Markdown.
  • It contributes TexOCR-Bench, a benchmark with multi-dimensional evaluation for transcription accuracy, structural fidelity, and end-to-end LaTeX compilability.
  • It also releases TexOCR-Train, a large-scale training corpus used to train a 2B-parameter TexOCR model via supervised fine-tuning and reinforcement learning.
  • The reinforcement learning component uses verifiable rewards from LaTeX unit tests to enforce compilability and referential integrity, improving results over SFT alone.
  • Experiments across 21 frontier models show many existing systems break important document invariants (section consistency, float placement, and label-reference links), limiting downstream reliability.

Abstract

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.
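The paper does not publish its reward implementation, but the idea of a verifiable reward derived from LaTeX unit tests can be sketched in a few lines. The example below is a hypothetical referential-integrity check: it scores a generated LaTeX page by the fraction of `\ref`-style citations that resolve to a `\label` defined in the same source (the compilability component would additionally invoke a LaTeX compiler such as `pdflatex` on the output, which is omitted here). The function name and scoring scheme are illustrative assumptions, not the authors' method.

```python
import re

def referential_integrity_reward(latex_source: str) -> float:
    """Hypothetical verifiable reward for RL training (not the paper's code).

    Returns 1.0 if every \\ref/\\eqref/\\cref in the page resolves to a
    \\label defined in the same source, otherwise the fraction resolved.
    """
    # Collect all defined labels, e.g. \label{sec:intro}
    labels = set(re.findall(r"\\label\{([^}]+)\}", latex_source))
    # Collect all references, e.g. \ref{sec:intro}, \eqref{eq:loss}
    refs = re.findall(r"\\(?:ref|eqref|cref)\{([^}]+)\}", latex_source)
    if not refs:
        return 1.0  # nothing to check: vacuously consistent
    resolved = sum(1 for r in refs if r in labels)
    return resolved / len(refs)

# Example: one of two references is dangling, so the reward is 0.5.
page = r"\section{Intro}\label{sec:intro} See \ref{sec:intro} and \ref{sec:method}."
print(referential_integrity_reward(page))  # → 0.5
```

Because the check is fully automatic and binary-verifiable per reference, it fits the "RL with verifiable rewards" recipe described in the abstract: the model is rewarded only when the invariant actually holds, rather than for surface similarity to a reference transcription.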