LLMbench: A Comparative Close Reading Workbench for Large Language Models

arXiv cs.AI / April 20, 2026


Key Points

  • LLMbench is introduced as a browser-based workbench that enables side-by-side, close reading of large language model (LLM) outputs rather than focusing mainly on numerical rating metrics.
  • The system adds four analytical overlays—token-level log-probability inspection, word-level differences, Hyland-style tone/metadiscourse analysis, and sentence-level structure with discourse connective highlighting.
  • It includes six analytical modes (e.g., stochastic variation, temperature gradients, prompt sensitivity, token probabilities, and cross-model divergence) to make the probabilistic structure behind generation more interpretable at the token level.
  • The tool visualizes outputs as probability distributions (using heatmaps, entropy sparklines, pixel maps, and 3D probability “terrains”) to reveal counterfactual histories of how each word could have emerged.
  • The paper argues that log-probability data—currently underused in humanities and social-science readings—should be treated as a valuable resource for critical study of generative AI models.
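The entropy sparklines mentioned above summarise how "open" each generation step was. As a hedged sketch (not the tool's actual implementation), per-token entropy can be estimated from the top-k log-probabilities that many LLM APIs return alongside each sampled token; the function name, inputs, and example values below are illustrative:

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy (in bits) over a token's top-k alternatives.

    top_logprobs: natural-log probabilities for the top-k candidate
    tokens at one generation step. Probability mass outside the
    top-k is ignored, so this is only a sketch of the true
    next-token entropy, renormalised over the observed candidates.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)  # renormalise over the top-k candidates
    return -sum((p / total) * math.log2(p / total) for p in probs)

# A confident step (one dominant candidate) vs. a more open one:
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
open_step = [math.log(0.40), math.log(0.35), math.log(0.25)]
```

Plotted across a whole response, such per-token values yield exactly the kind of sparkline that flags where the model "hesitated" and where it was effectively deterministic.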

Abstract

LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator, are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are displayed side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for a word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside six analytical modes, including Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence, that make the probabilistic structure of generated text legible at the token level. The tool treats the generated text as a research object in its own right: a sample drawn from a probability distribution, a text that could have been otherwise. Its visualisations, including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, show the counterfactual history from which each word emerged. This paper describes the tool's architecture, its six modes, and its design rationale, and argues that log-probability data, currently underused in humanistic and social-scientific readings of AI, is an important resource for the critical study of generative AI models.
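The word-level diff behind a Differences overlay can be sketched with a standard sequence alignment over word tokens; the snippet below is an illustrative approximation using Python's `difflib`, not LLMbench's actual code, and the two sample responses are invented:

```python
import difflib

def word_diff(text_a, text_b):
    """Word-level alignment of two model outputs.

    Returns (op, words_a, words_b) tuples, where op is one of
    'equal', 'replace', 'delete', or 'insert' -- the spans a
    side-by-side overlay would highlight.
    """
    a, b = text_a.split(), text_b.split()
    sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(op, a[i1:i2], b[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()]

# Two hypothetical model responses to the same prompt.
left = "The model predicts the next token greedily"
right = "The model samples the next token stochastically"
changes = [d for d in word_diff(left, right) if d[0] != "equal"]
# changes: [('replace', ['predicts'], ['samples']),
#           ('replace', ['greedily'], ['stochastically'])]
```

Aligning on words rather than characters keeps the highlighted spans readable as lexical choices, which suits close reading better than a character-level diff would.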