
Dropout Robustness and Cognitive Profiling of Transformer Models via Stochastic Inference

arXiv cs.LG / 3/19/2026


Key Points

  • The paper analyzes dropout-induced variability across 19 transformer models using Monte Carlo Dropout with 100 stochastic forward passes per sample to evaluate inference-time robustness.
  • It defines dropout robustness as maintaining high accuracy and stable predictions, quantifying stability with the standard deviation of per-run accuracies and a cognitive decomposition into memory and reasoning components.
  • In experiments across five dropout configurations, the study performs 95 unique evaluations on 1,000 samples, revealing substantial architectural variation in robustness that is not simply tied to model size.
  • Findings show that smaller models have highly stable predictions, mid-sized models achieve the best overall accuracy, and larger models excel at memory tasks; importantly, 53% of models suffer severe accuracy degradation under baseline MC Dropout, with task-specific models losing up to 24 percentage points.
  • Memory tasks are disproportionately affected: under high dropout, memory accuracy decreases by 27 percentage points whereas reasoning accuracy loses only 1 percentage point, and 84% of models display memory-biased performance. The study is presented as the first comprehensive MC Dropout benchmark for transformers and offers guidance for uncertainty-aware applications.
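The core procedure behind these results is Monte Carlo Dropout: dropout masks stay active at inference, and each input is run through the network many times (100 passes in the paper) to collect a distribution of predictions. The following is a minimal sketch of that idea on a toy linear classifier; the model, shapes, and function names here are illustrative, not taken from the paper's code.

```python
import numpy as np

def forward(x, W, drop_p, rng):
    """One stochastic forward pass: the dropout mask stays active at inference."""
    mask = rng.random(x.shape) >= drop_p       # Bernoulli keep-mask per feature
    h = (x * mask) / (1.0 - drop_p)            # inverted-dropout rescaling
    logits = h @ W                             # toy linear classifier
    return logits.argmax(axis=-1)              # hard prediction per sample

def mc_dropout_predict(x, W, drop_p=0.1, n_passes=100, seed=0):
    """Stack predictions from n_passes stochastic passes -> (runs, samples)."""
    rng = np.random.default_rng(seed)
    return np.stack([forward(x, W, drop_p, rng) for _ in range(n_passes)])
```

With `drop_p=0.0` every pass is identical (the mask keeps everything), so disagreement across the rows of the returned matrix is entirely dropout-induced variability.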

Abstract

Transformer-based language models are widely deployed for reasoning, yet their behavior under inference-time stochasticity remains underexplored. While dropout is common during training, its inference-time effects via Monte Carlo sampling lack systematic evaluation across architectures, limiting understanding of model reliability in uncertainty-aware applications. This work analyzes dropout-induced variability across 19 transformer models using MC Dropout with 100 stochastic forward passes per sample. Dropout robustness is defined as maintaining high accuracy and stable predictions under stochastic inference, measured by standard deviation of per-run accuracies. A cognitive decomposition framework disentangles performance into memory and reasoning components. Experiments span five dropout configurations yielding 95 unique evaluations on 1,000 samples. Results reveal substantial architectural variation. Smaller models demonstrate perfect prediction stability while medium-sized models exhibit notable volatility. Mid-sized models achieve the best overall performance; larger models excel at memory tasks. Critically, 53% of models suffer severe accuracy degradation under baseline MC Dropout, with task-specialized models losing up to 24 percentage points, indicating unsuitability for uncertainty quantification in these architectures. Asymmetric effects emerge: high dropout reduces memory accuracy by 27 percentage points while reasoning degrades only 1 point, suggesting memory tasks rely on stable representations that dropout disrupts. 84% of models demonstrate memory-biased performance. This provides the first comprehensive MC Dropout benchmark for transformers, revealing dropout robustness is architecture-dependent and uncorrelated with scale. The cognitive profiling framework offers actionable guidance for model selection in uncertainty-aware applications.
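The abstract's two measurements can be made concrete: stability is the standard deviation of per-run accuracies over the stochastic passes, and the cognitive decomposition splits per-sample accuracy into memory and reasoning subsets. Below is a hedged numpy sketch of those metrics; the memory/reasoning mask `is_memory` is a hypothetical sample labeling, since the paper's actual task partition is not given here.

```python
import numpy as np

def per_run_accuracies(preds, labels):
    """Accuracy of each stochastic run. preds: (runs, samples), labels: (samples,)."""
    return (preds == labels[None, :]).mean(axis=1)

def stability(preds, labels):
    """Std of per-run accuracies; lower means more stable under MC Dropout."""
    return per_run_accuracies(preds, labels).std()

def cognitive_decomposition(preds, labels, is_memory):
    """Split mean accuracy into (memory, reasoning) given a boolean sample mask."""
    per_sample_acc = (preds == labels[None, :]).mean(axis=0)
    return per_sample_acc[is_memory].mean(), per_sample_acc[~is_memory].mean()
```

A model is "dropout robust" in the paper's sense when `per_run_accuracies` stays high and `stability` stays near zero across the stochastic passes.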