Using PaddleOCR-VL-1.5 with llama-server for book OCR

Reddit r/LocalLLaMA / 4/26/2026


Key Points

  • The article describes using PaddleOCR-VL-1.5 (a vision-language model) to perform OCR on book page images via llama.cpp’s llama-server.
  • It reports strong handling of complex page layouts, including tables and mixed text/figure regions, producing structured Markdown with HTML tables.
  • The proposed pipeline is: layout detection → region-level OCR → conversion to Markdown/HTML for tables, enabling end-to-end processing of an entire folder of page photos.
  • A working setup is shared, specifying PaddleOCR-VL-1.5-GGUF with mmproj.gguf and using a Vulkan backend on Windows, along with a reference repository for the workflow.
  • The post ends by inviting others to share their experiments with vision-language models for OCR.

I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well.

Setup:
- Model: PaddleOCR-VL-1.5-GGUF + mmproj.gguf
- Backend: llama-server (Vulkan on Windows)
- Pipeline: layout detection → region OCR → Markdown with HTML tables (request sketch below)
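
If you want to try something similar, here's a minimal sketch of a single-page request against llama-server's OpenAI-compatible endpoint. The launch flags, port, prompt, and function name are my own assumptions for illustration, not the repo's actual code; adjust to your llama.cpp build and paths.

```python
# Launch llama-server first (flags illustrative; adjust to your build/paths):
#   llama-server -m PaddleOCR-VL-1.5.gguf --mmproj mmproj.gguf --port 8080
import base64
import requests  # pip install requests

def ocr_page(image_path: str, server: str = "http://localhost:8080") -> str:
    """Send one page image to llama-server's OpenAI-compatible chat endpoint."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text",
                 "text": "OCR this page. Return Markdown; use HTML for tables."},
            ],
        }],
        "temperature": 0.0,  # deterministic decoding suits OCR
    }
    r = requests.post(f"{server}/v1/chat/completions", json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```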

The pipeline processes an entire folder of page photos end-to-end, so you can essentially digitise a book with a single command.
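
As a rough sketch of what that folder-level driver could look like, reusing `ocr_page` from the snippet above; the glob pattern, page separator, and output file name are illustrative, not what the repo actually does:

```python
# Batch driver: OCR every page image in a folder and stitch the results
# into one Markdown file. Assumes pages sort correctly by file name.
from pathlib import Path

def ocr_book(folder: str, out_file: str = "book.md") -> None:
    pages = sorted(Path(folder).glob("*.jpg"))  # keep pages in reading order
    with open(out_file, "w", encoding="utf-8") as out:
        for page in pages:
            out.write(ocr_page(str(page)))
            out.write("\n\n---\n\n")  # page separator

ocr_book("scans/")  # point at your folder of page photos (path illustrative)
```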

Repo: https://github.com/akmalayari/ocr-book

Has anyone else experimented with vision-language models for OCR?

submitted by /u/Final-Frosting7742