Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition

arXiv cs.CV / 4/7/2026

Key Points

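Optical Chemical Structure Recognition (OCSR), which converts 2D molecular diagrams in the literature into machine-readable formats, is an important task, but applying Vision-Language Models to it directly is difficult, and full-parameter supervised fine-tuning tends to be unstable.
The proposed method adapts DeepSeek-OCR-2 by formulating the task as image-conditioned SMILES generation from molecular images, and curbs training instability with a two-stage progressive supervised fine-tuning strategy that starts with LoRA and then moves to selective full-parameter fine-tuning.
The training data combines synthetic renderings from PubChem with real patent images from USPTO-MOL, using large-scale and diverse molecular depictions to improve coverage and robustness.
The fine-tuned model, MolSeek-OCR, reaches exact-matching accuracy on par with representative existing image-to-sequence models, but still falls short of state-of-the-art image-to-graph models.
The authors also explored reinforcement-style post-training and data-curation-based refinement, but neither improved the strict sequence-level fidelity required for exact SMILES matching.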
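
To make the exact-matching criterion in the points above concrete, here is a minimal sketch of how such a score could be computed. It is not the paper's evaluation code: the function name `exact_match_accuracy` and the default whitespace-only normalization are assumptions made for illustration. In practice both strings would typically be canonicalized with a cheminformatics toolkit before comparison, which is why the normalizer is left pluggable.

```python
from typing import Callable, Iterable


def exact_match_accuracy(
    predictions: Iterable[str],
    references: Iterable[str],
    canonicalize: Callable[[str], str] = str.strip,
) -> float:
    """Fraction of predicted SMILES strings that equal their reference.

    `canonicalize` is a hypothetical hook: the default only strips
    whitespace, so two different spellings of the same molecule are
    counted as a mismatch unless a real canonicalizer is passed in.
    """
    preds = list(predictions)
    refs = list(references)
    if len(preds) != len(refs):
        raise ValueError("predictions and references must have the same length")
    if not refs:
        return 0.0
    # A prediction counts only if the whole normalized string matches.
    hits = sum(canonicalize(p) == canonicalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)


# Ethanol matches exactly; the two benzene spellings differ as strings.
score = exact_match_accuracy(["CCO", "c1ccccc1 "], ["CCO", "C1=CC=CC=C1"])
```

The second pair describes the same molecule in aromatic and Kekulé notation, yet fails the string comparison. This illustrates why exact matching is a strict, sequence-level metric: a single differing token makes the whole prediction count as wrong.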

Abstract

Optical Chemical Structure Recognition (OCSR) is critical for converting 2D molecular diagrams from printed literature into machine-readable formats. While Vision-Language Models have shown promise in end-to-end OCR tasks, their direct application to OCSR remains challenging, and direct full-parameter supervised fine-tuning often fails. In this work, we adapt DeepSeek-OCR-2 for molecular optical recognition by formulating the task as image-conditioned SMILES generation. To overcome training instabilities, we propose a two-stage progressive supervised fine-tuning strategy: starting with parameter-efficient LoRA and transitioning to selective full-parameter fine-tuning with split learning rates. We train our model on a large-scale corpus combining synthetic renderings from PubChem and realistic patent images from USPTO-MOL to improve coverage and robustness. Our fine-tuned model, MolSeek-OCR, demonstrates competitive capabilities, achieving exact matching accuracies comparable to the best-performing image-to-sequence model. However, it remains inferior to state-of-the-art image-to-graph models. Furthermore, we explore reinforcement-style post-training and data-curation-based refinement, finding that they fail to improve the strict sequence-level fidelity required for exact SMILES matching.
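
The abstract does not spell out how the "selective full-parameter fine-tuning with split learning rates" is configured, so the following is only a rough sketch of one way such a setup could look. The helper `build_param_groups`, the module prefixes (`vision_encoder.`, `language_model.`, `projector.`) and the learning-rate values are all hypothetical and do not come from the paper. The idea shown is simply that parameters are routed into groups by name prefix, each group gets its own learning rate, and anything unmatched is left out of training.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    lr_by_prefix: Dict[str, float],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Split parameters into per-prefix groups with separate learning rates.

    Returns (groups, frozen_names). Each group is a dict with "params" and
    "lr" keys; parameters whose name matches no prefix are reported as
    frozen, which is the "selective" part of this sketch.
    """
    groups = {p: {"params": [], "lr": lr} for p, lr in lr_by_prefix.items()}
    frozen: List[str] = []
    for name, param in named_params:
        for prefix in lr_by_prefix:
            if name.startswith(prefix):
                groups[prefix]["params"].append(param)
                break
        else:
            # No prefix matched: leave this parameter out of the optimizer.
            frozen.append(name)
    # Drop empty groups so the optimizer never receives an empty list.
    return [g for g in groups.values() if g["params"]], frozen


# Hypothetical parameter names and rates, for illustration only.
named = [
    ("vision_encoder.w", "p0"),
    ("language_model.w", "p1"),
    ("projector.w", "p2"),
]
groups, frozen = build_param_groups(
    named, {"vision_encoder.": 1e-6, "language_model.": 1e-5}
)
```

The returned list follows the per-parameter-options shape that PyTorch optimizers accept, so with real tensors from `model.named_parameters()` it could be passed to, for example, `torch.optim.AdamW(groups)`, while the names in `frozen` would have `requires_grad` set to `False`. Which modules the authors actually unfreeze, and at what rates, is not stated in the abstract.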

