HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents

arXiv cs.CV / March 31, 2026

Key Points

  • The paper introduces HighlightBench, a diagnostic benchmark focused on how well multimodal LLMs interpret visual markup cues (e.g., highlights, underlines, bold) as logical directives for reasoning over scientific tables.
  • It addresses a key evaluation blind spot by separating failures where the markup is never seen from failures in reasoning with the markup, using five task families (a minimal attribution sketch follows this list).
  • The benchmark includes Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness to cover both perception and structured table reasoning behaviors.
  • A reference pipeline is provided that makes intermediate decisions explicit, enabling more reproducible baselines and more granular error attribution across the perception-to-execution chain.
  • Experimental results indicate that even strong models can be unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
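The paper's central diagnostic is this split between perception and reasoning failures. As a rough illustration of how such attribution could be scored, the sketch below assumes each example is judged on two axes, grounding correctness and final-answer correctness; the `ExampleResult` and `attribute_failure` names are hypothetical, not from the paper.

```python
from dataclasses import dataclass


@dataclass
class ExampleResult:
    grounding_correct: bool  # did the model identify the marked-up cells?
    answer_correct: bool     # did the final answer match the reference?


def attribute_failure(result: ExampleResult) -> str:
    """Classify one example along the perception-vs-reasoning split."""
    if result.answer_correct:
        return "success"
    if not result.grounding_correct:
        return "perception_failure"  # the markup was never seen correctly
    return "reasoning_failure"       # markup seen, but misused downstream
```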

Abstract

Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.
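The abstract does not spell out the reference pipeline's interfaces. As a minimal sketch, assuming three explicit stages (markup grounding, constrained retrieval, reasoning/execution) and hypothetical callables `markup_detector`, `retriever`, and `reasoner`, a trace-keeping pipeline might look like this:

```python
from dataclasses import dataclass, field


@dataclass
class PipelineTrace:
    """Intermediate decisions kept explicit so errors can be attributed per stage."""
    marked_cells: list = field(default_factory=list)      # stage 1: markup grounding
    retrieved_values: list = field(default_factory=list)  # stage 2: constrained retrieval
    answer: object = None                                 # stage 3: reasoning/execution


def run_pipeline(table, markup_detector, retriever, reasoner) -> PipelineTrace:
    trace = PipelineTrace()
    # Which cells carry highlights, underlines, or bold?
    trace.marked_cells = markup_detector(table)
    # Retrieval constrained to the cells the markup singled out.
    trace.retrieved_values = retriever(table, trace.marked_cells)
    # Aggregation/comparison over the retrieved values.
    trace.answer = reasoner(trace.retrieved_values)
    return trace
```

Keeping the full trace, rather than only the final answer, is what would let an error be pinned to the stage where the run first diverged from the reference, in line with the perception-to-execution attribution the paper describes.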