OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

arXiv cs.AI / 4/16/2026


Key Points

  • OmniTrace is proposed as a lightweight, model-agnostic framework that attributes, at generation time, each statement produced by omni-modal, decoder-only LLMs to the multimodal inputs (text, image, audio, video) that support it.
  • The method reframes attribution as a generation-time tracing problem over the causal decoding process, converting token-level attribution signals (e.g., attention or gradient-based scores) into coherent span-level, cross-modal explanations.
  • It aggregates traced signals into semantically meaningful spans using confidence-weighted and temporally coherent strategies, enabling concise supporting source selection without retraining or additional supervision.
  • Experiments on Qwen2.5-Omni and MiniCPM-o-4.5 for visual, audio, and video tasks show more stable and interpretable attribution than self-attribution and embedding-based baselines.
  • Results also indicate robustness across multiple underlying token-level attribution signals, supporting the idea that structured generation-time tracing can scale multimodal transparency.

Abstract

Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.
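The abstract does not include an implementation, but the pipeline it describes — per-token attribution scores over multimodal input segments, aggregated into confidence-weighted, temporally coherent spans, with concise source selection — can be sketched roughly as follows. This is a minimal illustration, not OmniTrace's actual algorithm or API: the names (`trace_spans`, `Span`), the greedy same-source grouping rule, and the score threshold are all assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int    # first generated-token index in the span
    end: int      # last generated-token index (inclusive)
    source: int   # index of the supporting input segment (text/image/audio/video chunk)
    score: float  # accumulated confidence-weighted attribution score

def trace_spans(token_scores, token_conf, min_score=0.2):
    """Group generated tokens into temporally coherent, span-level attributions.

    token_scores[t][s] : attribution of generated token t to input segment s
                         (e.g. normalized attention mass or a gradient-based score)
    token_conf[t]      : model confidence for token t (e.g. its softmax probability)
    """
    spans, cur = [], None
    for t, scores in enumerate(token_scores):
        # dominant input segment for this token
        src = max(range(len(scores)), key=lambda s: scores[s])
        w = token_conf[t] * scores[src]  # confidence-weighted signal
        if cur is not None and cur.source == src:
            cur.end = t                  # temporal coherence: extend the current span
            cur.score += w
        else:
            if cur is not None:
                spans.append(cur)
            cur = Span(start=t, end=t, source=src, score=w)
    if cur is not None:
        spans.append(cur)
    # concise source selection: keep spans whose aggregate signal clears the threshold
    return [sp for sp in spans if sp.score >= min_score]

# Hypothetical example: 3 generated tokens scored against 2 input segments.
scores = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
conf = [0.9, 0.8, 0.95]
spans = trace_spans(scores, conf)
# → two spans: tokens 0-1 attributed to segment 0, token 2 to segment 1
```

The sketch needs no retraining or supervision, matching the paper's claim: it only post-processes whatever token-level signal the decoder already exposes during generation.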