LLMbench: A Comparative Close Reading Workbench for Large Language Models

arXiv cs.AI / April 20, 2026


Key Points

  • LLMbench is introduced as a browser-based workbench that enables side-by-side, close reading of large language model (LLM) outputs rather than focusing mainly on numerical rating metrics.
  • The system adds four analytical overlays—token-level log-probability inspection, word-level differences, Hyland-style tone/metadiscourse analysis, and sentence-level structure with discourse connective highlighting.
  • It includes six analytical modes (e.g., stochastic variation, temperature gradients, prompt sensitivity, token probabilities, and cross-model divergence) to make the probabilistic structure behind generation more interpretable at the token level.
  • The tool visualizes outputs as probability distributions (using heatmaps, entropy sparklines, pixel maps, and 3D probability “terrains”) to reveal counterfactual histories of how each word could have emerged.
  • The paper argues that log-probability data—currently underused in humanities and social-science readings—should be treated as a valuable resource for critical study of generative AI models.
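The entropy sparklines mentioned above summarise how "open" each generation step was. As a hedged sketch (not the tool's actual implementation), per-token entropy can be estimated from the top-k log-probabilities that many LLM APIs return alongside each sampled token; the function name, inputs, and example values below are illustrative:

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy (in bits) over a token's top-k alternatives.

    top_logprobs: natural-log probabilities for the top-k candidate
    tokens at one generation step. Probability mass outside the
    top-k is ignored, so this is only a sketch of the true
    next-token entropy, renormalised over the observed candidates.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)  # renormalise over the top-k candidates
    return -sum((p / total) * math.log2(p / total) for p in probs)

# A confident step (one dominant candidate) vs. a more open one:
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
open_step = [math.log(0.40), math.log(0.35), math.log(0.25)]
```

Plotted across a whole response, such per-token values yield exactly the kind of sparkline that flags where the model "hesitated" and where it was effectively deterministic.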

Abstract

LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator, are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are displayed side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for a word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside six analytical modes, including Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence, that make the probabilistic structure of generated text legible at the token level. The tool treats the generated text as a research object in its own right: a sample drawn from a probability distribution, a text that could have been otherwise. Its visualisations, including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, show the counterfactual history from which each word emerged. This paper describes the tool's architecture, its six modes, and its design rationale, and argues that log-probability data, currently underused in humanistic and social-scientific readings of AI, is an important resource for the critical study of generative AI models.
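The word-level diff behind a Differences overlay can be sketched with a standard sequence alignment over word tokens; the snippet below is an illustrative approximation using Python's `difflib`, not LLMbench's actual code, and the two sample responses are invented:

```python
import difflib

def word_diff(text_a, text_b):
    """Word-level alignment of two model outputs.

    Returns (op, words_a, words_b) tuples, where op is one of
    'equal', 'replace', 'delete', or 'insert' -- the spans a
    side-by-side overlay would highlight.
    """
    a, b = text_a.split(), text_b.split()
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()]

# Two hypothetical model responses to the same prompt.
left = "The model predicts the next token greedily"
right = "The model samples the next token stochastically"
changes = [d for d in word_diff(left, right) if d[0] != "equal"]
# changes: [('replace', ['predicts'], ['samples']),
#           ('replace', ['greedily'], ['stochastically'])]
```

Aligning on words rather than characters keeps the highlighted spans readable as lexical choices, which suits close reading better than a character-level diff would.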