Looking for guidance. Trying to create a model with TrOCR's encoder + Google's mT5 multilingual decoder but model fails to overfit on a single data sample

Reddit r/LocalLLaMA / 3/26/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A developer is attempting to build an OCR proof of concept for handwritten and printed Hindi (Devanagari) by combining TrOCR’s vision encoder with a Google mT5 multilingual decoder for Hindi tokenization.
  • Despite matching hidden sizes and substituting the decoder, the combined model fails to overfit a single training example, with loss plateauing around 2–3 and outputs degenerating into repeated characters rather than coherent text.
  • They report trying typical training adjustments (learning rate changes and repetition penalties) but still cannot achieve overfitting, suggesting a fundamental mismatch or training/labeling issue in the encoder-decoder integration.
  • The request asks for guidance on better tokenizer/decoder options compatible with TrOCR’s encoder or for recommendations to fix the current TrOCR encoder + mT5 decoder setup so it can learn Hindi outputs.
  • The discussion centers on practical troubleshooting for seq2seq OCR architecture compatibility, tokenization, and decoder conditioning rather than a new model release or result.

Hi everyone,

I am working on a proof-of-concept OCR system that can recognize both handwritten and printed Hindi (Devanagari) text in complex documents. I’m trying to build on top of TrOCR (microsoft/trocr-base-handwritten), since it already has a strong vision encoder trained for handwriting recognition.

The core problem I’m running into is on the decoder/tokenizer side — TrOCR’s default decoder and tokenizer are trained for English only, and I need Hindi output.

What I’ve tried so far:

I replaced TrOCR’s decoder with google/mt5-small, which natively supports Hindi tokenization. The hidden sizes matched, so I expected this to work.
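One wrinkle worth noting: transformers has no causal-LM class for mT5, so `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` cannot load mt5-small as a decoder directly. An alternative wiring is to keep the full `MT5ForConditionalGeneration` and feed the ViT image features in via `encoder_outputs`, bypassing mT5's own text encoder. The sketch below (an assumption about how the combination could be wired, not the poster's exact code) uses tiny randomly initialized configs so the shapes can be checked without downloading the real checkpoints; swapping in `from_pretrained("microsoft/trocr-base-handwritten")` / `from_pretrained("google/mt5-small")` would give the actual models:

```python
# Hedged sketch: ViT encoder (TrOCR-style) feeding an mT5 decoder.
# Tiny random configs stand in for the real checkpoints; the hidden
# sizes are matched, as in the post. Assumes torch + transformers.
import torch
from transformers import ViTConfig, ViTModel, MT5Config, MT5ForConditionalGeneration

HIDDEN = 64  # encoder hidden size must equal mT5's d_model for cross-attention

encoder = ViTModel(ViTConfig(hidden_size=HIDDEN, num_hidden_layers=2,
                             num_attention_heads=4, intermediate_size=128,
                             image_size=32, patch_size=8))
mt5 = MT5ForConditionalGeneration(MT5Config(d_model=HIDDEN, d_kv=16, d_ff=128,
                                            num_layers=2, num_decoder_layers=2,
                                            num_heads=4, vocab_size=250))

pixel_values = torch.randn(1, 3, 32, 32)
image_states = encoder(pixel_values).last_hidden_state  # (1, patches+1, HIDDEN)

labels = torch.tensor([[17, 23, 1]])  # made-up target token ids, EOS = 1
# Pass the image features where mT5's text-encoder output would normally go;
# with `labels` set, mT5 builds decoder_input_ids by shifting right itself.
out = mt5(encoder_outputs=(image_states,), labels=labels)
print(out.loss.item())
```

With the real checkpoints, only the cross-attention (and the ViT, if fine-tuned) has to adapt, so a single sample should be trivially memorizable once the plumbing is right.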

However, the model fails to overfit even a single data point. The loss comes down but plateaus around 2–3, and the output degenerates into repeated characters instead of a meaningful word or sentence. I have tried changing the learning rate and adding a repetition penalty, but overfitting just doesn’t happen.
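A loss that floors around 2–3 with repeated characters is often a label-preparation problem rather than an architecture one. For mT5, the decoder input is the label sequence shifted right with the decoder start token (mT5 reuses its pad token, id 0), and padding positions in the labels must be replaced with -100 so cross-entropy ignores them; forgetting the -100 masking teaches the model to emit padding/repeats. A minimal illustration with made-up token ids:

```python
PAD_ID = 0          # mT5's pad token id (also its decoder start token)
IGNORE_INDEX = -100 # value ignored by PyTorch's CrossEntropyLoss

def shift_right(labels, decoder_start_token_id=PAD_ID):
    """Build decoder_input_ids from labels (teacher forcing)."""
    return [decoder_start_token_id] + labels[:-1]

def mask_padding(labels, pad_id=PAD_ID):
    """Replace pad positions so the loss skips them."""
    return [IGNORE_INDEX if t == pad_id else t for t in labels]

labels = [1523, 884, 42, 1, 0, 0]   # token ids: text, EOS=1, then padding
print(shift_right(labels))   # [0, 1523, 884, 42, 1, 0]
print(mask_padding(labels))  # [1523, 884, 42, 1, -100, -100]
```

If the HuggingFace model is given `labels` directly it does the shifting itself, but the -100 masking of pad positions is still the caller's job.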

https://preview.redd.it/wh6ucn1mncrg1.png?width=2064&format=png&auto=webp&s=e6cea11021aa84f0d67b74be3a9eb5ffe61c3a74

I need guidance: is there any other tokenizer/decoder out there that works well with TrOCR’s encoder, or can you help me improve the current setup (TrOCR’s encoder + mT5’s decoder) so it learns Hindi output?
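For calibration: when the plumbing is correct, single-example overfitting should drive the loss essentially to zero, not 2–3. This toy loop (a tiny linear classifier, standing in for the OCR model) shows the behaviour to expect from the sanity check:

```python
# Single-example overfit sanity check: loss should collapse toward 0.
import torch

model = torch.nn.Linear(4, 3)                 # stand-in for the real model
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x = torch.randn(1, 4)                         # one "sample"
y = torch.tensor([2])                         # its label

for _ in range(500):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())  # near 0 when everything is wired correctly
```

If a seq2seq setup can't reproduce this on one image/text pair, the usual suspects are label shifting, pad masking, or a tokenizer that maps the target text to `<unk>`.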

submitted by /u/ElectronicHoneydew86