Using PaddleOCR-VL-1.5 with llama-server for book OCR
Reddit r/LocalLLaMA / 4/26/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research

I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well. Setup: The pipeline can process an entire folder of page photos end-to-end. You can basically digitalise a book with a single command. Repo: https://github.com/akmalayari/ocr-book. Has anyone else experimented with vision-language models for OCR?
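The digest does not reproduce the repo's scripts, so the following is only a rough sketch of the kind of batch client the post describes. It assumes llama-server is already running locally with the PaddleOCR-VL-1.5 GGUF weights and mmproj.gguf loaded (the post mentions a Vulkan build on Windows), that its OpenAI-compatible /v1/chat/completions endpoint accepts base64-encoded image_url content, and that the port, folder names, and prompt text shown here are illustrative rather than taken from the repository.

```python
# Minimal batch-OCR client sketch (not the repo's actual script).
# Assumes llama-server was started beforehand, e.g. with something like
# (model/file names below are placeholders based on the post's description):
#   llama-server -m PaddleOCR-VL-1.5.gguf --mmproj mmproj.gguf --port 8080
import base64
import json
import urllib.request
from pathlib import Path

SERVER = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server's OpenAI-compatible endpoint
PROMPT = "Transcribe this book page to Markdown. Render tables as HTML."  # illustrative prompt

def ocr_page(image_path: Path) -> str:
    """Send one page image to the local server and return the model's Markdown output."""
    b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # keep the transcription deterministic
    }
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

def ocr_book(pages_dir: str, out_dir: str) -> None:
    """OCR every page image in a folder, writing one Markdown file per page."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img in sorted(Path(pages_dir).glob("*.jpg")):
        (out / f"{img.stem}.md").write_text(ocr_page(img), encoding="utf-8")
        print(f"done: {img.name}")

if __name__ == "__main__":
    ocr_book("pages", "ocr_output")  # illustrative folder names
```

Run over a folder of page photos, a loop like this yields one Markdown file per page, which is the "digitise a book with a single command" workflow the post claims.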
Key Points
- The article describes using PaddleOCR-VL-1.5 (a vision-language model) to perform OCR on book page images via llama.cpp’s llama-server.
- It reports strong handling of complex page layouts, including tables and mixed text/figure regions, producing structured Markdown with HTML tables.
- The proposed pipeline is: layout detection → region-level OCR → conversion to Markdown/HTML for tables, enabling end-to-end processing of an entire folder of page photos (a hypothetical assembly step is sketched after this list).
- A working setup is shared, specifying PaddleOCR-VL-1.5-GGUF with mmproj.gguf and using a Vulkan backend on Windows, along with a reference repository for the workflow.
- The post ends by inviting others to share their experiments with vision-language models for OCR.
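The bullets summarise the pipeline (layout detection, region-level OCR, Markdown/HTML conversion) without showing code, so the assembly step below is purely illustrative: the Region structure, its field names, and the reading-order heuristic are invented for this sketch and are not taken from the linked repository.

```python
# Hypothetical assembly step: stitch per-region OCR results into one page-level
# Markdown document. The Region structure and field names are invented for this
# sketch; the linked repo may represent layout output quite differently.
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # "text", "table", or "figure"
    top: float     # vertical position, used here to restore reading order
    content: str   # Markdown for text regions, HTML for table regions

def assemble_page(regions: list[Region]) -> str:
    """Concatenate regions in top-to-bottom order; table HTML is passed through verbatim."""
    parts = []
    for r in sorted(regions, key=lambda r: r.top):
        if r.kind == "figure":
            parts.append(f"*[figure: {r.content}]*")  # keep a caption placeholder
        else:
            parts.append(r.content)  # Markdown text or an HTML <table> block
    return "\n\n".join(parts) + "\n"

# Example: a heading, a small HTML table, and a paragraph restored to reading order.
print(assemble_page([
    Region("text", 0.05, "# Chapter 3"),
    Region("table", 0.40, "<table><tr><td>Year</td><td>Pages</td></tr></table>"),
    Region("text", 0.70, "Body text recognised from the lower half of the page."),
]))
```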