SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Reddit r/MachineLearning / 4/20/2026


Key Points

  • The author released SGOCR, an open-source dataset pipeline and V1 dataset designed to generate spatially grounded, OCR-focused VQA training tuples with rich metadata for diverse VLM training approaches.
  • The pipeline emphasizes grounding text in images rather than making models reason only about text or the broader scene, addressing a perceived gap in existing visual datasets.
  • Development used a multi-stage process (OCR extraction, anchor discovery/labeling, and verification) with model ensemble/validation experiments before selecting a more efficient final stack involving Nemotron-OCR-v2, Gemma4/Qwen3-VL fallback, and Gemini-2.5-Flash.
  • An agentic loop plus a dataset review frontend and an optimization loop based on sweep-style autoresearch were used to iteratively improve quality and reduce the risk of promising ideas being discarded early.
  • The project is shared for community feedback and to invite others building similar VLMs or datasets to discuss and collaborate.

Hello everyone!

I've been independently researching & developing small-but-powerful vision-language models (VLMs) and noticed a gap in visual datasets: none were teaching my model to simply ground text in imagery; instead they tried to get it to reason about the text or about the scene itself. This led me down a two-week side project to create SGOCR, an open-source dataset pipeline for generating spatially-grounded, OCR-focused VQA tuples with rich metadata to support diverse VLM training strategies.
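To make "spatially-grounded, OCR-focused VQA tuple" concrete, here's a minimal sketch of what one such training example might look like. This schema is my own illustration, not the actual SGOCR format; the field names and metadata keys are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GroundedVQATuple:
    """One spatially-grounded, OCR-focused VQA training example (hypothetical schema)."""
    image_id: str
    question: str   # targets a specific text region, not scene-level reasoning
    answer: str     # the OCR'd text the question grounds to
    bbox: tuple     # (x_min, y_min, x_max, y_max) in pixel coordinates
    metadata: dict = field(default_factory=dict)  # anchor label, OCR confidence, etc.

# Example tuple: the question asks what a localized region says,
# so the model must ground the text rather than reason about the scene.
example = GroundedVQATuple(
    image_id="img_0001",
    question="What text appears on the storefront sign?",
    answer="OPEN 24 HOURS",
    bbox=(120, 45, 380, 110),
    metadata={"anchor_label": "storefront sign", "ocr_confidence": 0.94},
)
```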

Code

v1 dataset

My development began with simply prompting Qwen2.5-VL locally and grew into a multi-stage beast. At one point, my OCR stage looked for consensus between 3 text recognition models (Parseq), my anchor stage did the same between GroundingDino, Florence 2, and SAM 3.1, and verification required sign-off from both Gemini 3.1 Pro & ChatGPT 5.3 Codex. I discovered that less is more in this case, and landed on Nvidia's nemotron-ocr-v2 for text extraction, a combination of Gemma4 with a Qwen3-VL fallback for anchor discovery & labeling, and gemini-2.5-flash as a teacher model with simple grounding checks for verification. I got away with the smaller 2.5 Flash teacher model because the highly grounded annotations provided in context let Flash focus on semantics.
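The final three-stage flow (OCR extraction → anchor discovery with fallback → grounding checks before teacher verification) can be sketched as below. The model calls are stubbed out with stand-in functions; all function names and return shapes here are assumptions for illustration, not the actual SGOCR code.

```python
# Hypothetical sketch of the three-stage pipeline. In the real project these
# stubs would call Nemotron-OCR-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash.

def extract_text(image):
    """Stage 1: OCR extraction (stand-in for nemotron-ocr-v2)."""
    return [{"text": "OPEN 24 HOURS", "bbox": (120, 45, 380, 110)}]

def discover_anchor_primary(image, region):
    """Stage 2a: primary anchor discovery & labeling (stand-in for Gemma4)."""
    return None  # simulate the primary model declining this region

def discover_anchor_fallback(image, region):
    """Stage 2b: fallback anchor discovery (stand-in for Qwen3-VL)."""
    return "storefront sign"

def passes_grounding_check(candidate):
    """Stage 3: cheap grounding checks run before the teacher-model verify pass."""
    x0, y0, x1, y1 = candidate["bbox"]
    return bool(candidate["text"]) and x1 > x0 and y1 > y0

def run_pipeline(image):
    """Wire the stages together, using the fallback only when the primary fails."""
    tuples = []
    for region in extract_text(image):
        anchor = discover_anchor_primary(image, region)
        if anchor is None:
            anchor = discover_anchor_fallback(image, region)
        if anchor is None:
            continue  # no usable anchor; drop the region
        candidate = {"text": region["text"], "bbox": region["bbox"], "anchor": anchor}
        if passes_grounding_check(candidate):
            tuples.append(candidate)  # survivors go on to the teacher model
    return tuples
```

The point of the structure is that the expensive teacher model only ever sees candidates that already passed the cheap grounding checks, which is part of why a smaller teacher sufficed.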

I utilized an agentic loop for development after first creating a dataset review frontend that stored my personal accept/reject/maybe marks to be referenced as human-grounded context later. I bootstrapped this process into a quality score that reflected the aspects of questions I accepted, and from there the rest was much easier to automate. I ran a custom optimization-loop agent, based on Karpathy's autoresearch (which I found a bit too hyperparameter-searchy), that uses a sweep-based process allowing better holistic observation, an opportunity to make code changes, and less risk of good ideas dying early because their evals were slightly lower than another variant's.
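One way to bootstrap a quality score from accept/reject/maybe marks is to average the human verdicts per question feature and score new candidates against that table. This is a minimal stand-in I wrote to illustrate the idea; the feature names and scoring rule are my assumptions, not the project's actual scorer.

```python
from collections import defaultdict

# Map the three review marks to numeric targets (a hypothetical choice).
MARK_VALUE = {"accept": 1.0, "maybe": 0.5, "reject": 0.0}

def fit_feature_scores(reviews):
    """Average the human mark for each (feature, value) pair seen in reviews.

    reviews: list of (features: dict, mark: str) pairs from the review frontend.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for features, mark in reviews:
        for key, value in features.items():
            sums[(key, value)] += MARK_VALUE[mark]
            counts[(key, value)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

def score_candidate(features, table, default=0.5):
    """Score an unreviewed question as the mean of its features' learned averages."""
    vals = [table.get((k, v), default) for k, v in features.items()]
    return sum(vals) / len(vals) if vals else default
```

A scorer like this can then gate which generated questions enter the dataset automatically, with only borderline scores routed back to human review.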

I'm looking for general feedback and interested if other people were looking for something like this, or building similar VLMs. Thanks for reading!

submitted by /u/Dreeseaw