GLM-OCR Technical Report
arXiv cs.CL / 3/12/2026
Key Points
- GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding, combining a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder.
- To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, improving decoding throughput while keeping memory overhead low through shared parameters.
- At the system level, GLM-OCR adopts a two-stage pipeline: PP-DocLayout-V3 first performs layout analysis, and region-level recognition then runs in parallel over the detected regions.
- Extensive evaluations on public benchmarks and industrial scenarios show GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction.
- Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
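The MTP idea described above can be sketched as a decoding loop that emits several tokens per forward pass instead of one, cutting the number of decode steps roughly by the prediction width. This is a minimal illustrative sketch; `toy_model` and `mtp_decode` are hypothetical stand-ins, not the GLM-OCR API.

```python
def toy_model(context, k):
    """Hypothetical stand-in for an MTP head: predict the next k tokens
    from the context. Here it just emits placeholder token names."""
    start = len(context)
    return [f"tok{start + i}" for i in range(k)]

def mtp_decode(prompt, max_tokens, k=4):
    """Decode max_tokens tokens, emitting up to k per step instead of 1.
    The number of decoding steps drops from max_tokens to
    ceil(max_tokens / k), which is the throughput win MTP targets."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_tokens:
        need = max_tokens - (len(out) - len(prompt))
        out.extend(toy_model(out, min(k, need)))
        steps += 1
    return out[len(prompt):], steps

tokens, steps = mtp_decode(["<bos>"], max_tokens=10, k=4)
# 10 tokens in 3 steps rather than 10 single-token steps
```

In the real model the extra prediction heads share parameters with the base decoder, which is why the memory overhead stays low.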
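The two-stage pipeline can likewise be sketched as layout analysis followed by parallel per-region recognition. `detect_layout` and `recognize_region` below are illustrative stand-ins, not real PP-DocLayout-V3 or GLM-OCR calls.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page):
    """Stage 1 (stand-in): split a page into typed regions,
    as a layout-analysis model like PP-DocLayout-V3 would."""
    return [
        {"type": "text", "content": page["text"]},
        {"type": "table", "content": page["table"]},
    ]

def recognize_region(region):
    """Stage 2 (stand-in): transcribe one region independently."""
    return f"[{region['type']}] {region['content']}"

def parse_document(page, workers=4):
    regions = detect_layout(page)  # stage 1: layout analysis
    # Stage 2: regions are independent, so recognition parallelizes;
    # map preserves the original region order in the output.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize_region, regions))

results = parse_document({"text": "Hello", "table": "A|B"})
```

Because each region is recognized independently, batching them across workers (or across GPU batches in production) scales throughput without changing the per-region model.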