Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents

arXiv cs.AI / 4/21/2026


Key Points

  • The paper introduces Mini-BEHAVIOR-Gran, a new embodied AI benchmark designed to study how instruction granularity affects language-guided agent behavior under controlled conditions.
  • Unlike prior benchmarks that use a single static instruction per task, this benchmark provides multiple instruction variants per task, from high-level goals to step-by-step guidance.
  • The authors evaluate four metrics for quantifying cross-task granularity (token count, entity count, action-verb count, and planning-width) and find planning-width correlates most consistently with agent performance.
  • When training and evaluation are organized using planning-width, the relationship between instruction granularity and performance is non-monotonic, showing a U-shaped pattern with peaks at both very fine and very coarse extremes.
  • The performance rebound at coarse granularity is attributed to shallow grounding: agents learn vision-dominant policies that rely on visual cues rather than on the instruction itself.
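Of the four candidate metrics listed above, the first three are surface statistics of the instruction text. A minimal sketch of how they might be computed is shown below; the whitespace tokenization, punctuation handling, and the verb/entity vocabularies are all illustrative assumptions, not the paper's implementation, and planning-width is omitted because it depends on the structure of the underlying task plan rather than on the instruction string alone.

```python
# Hedged sketch of three surface-level granularity metrics (token count,
# entity count, action-verb count). The word lists and tokenizer are
# assumptions for illustration only, not the benchmark's actual method.

ACTION_VERBS = {"go", "pick", "put", "place", "open", "close", "grasp"}  # assumed
ENTITIES = {"cup", "table", "drawer", "fridge", "sink"}  # assumed object vocabulary


def granularity_metrics(instruction: str) -> dict:
    # Naive tokenization: lowercase, strip simple punctuation, split on spaces.
    tokens = instruction.lower().replace(",", " ").replace(".", " ").split()
    return {
        "token_count": len(tokens),
        "entity_count": sum(t in ENTITIES for t in tokens),
        "action_verb_count": sum(t in ACTION_VERBS for t in tokens),
    }


# A coarse (goal-level) vs. fine (step-by-step) description of the same task.
coarse = "Put the cup in the drawer."
fine = ("Go to the table, pick up the cup, open the drawer, "
        "place the cup inside, close the drawer.")

print(granularity_metrics(coarse))  # {'token_count': 6, 'entity_count': 2, 'action_verb_count': 1}
print(granularity_metrics(fine))    # {'token_count': 18, 'entity_count': 5, 'action_verb_count': 5}
```

All three counts rise together as instructions get finer, which is why distinguishing their individual correlation with performance requires the controlled per-task variants the benchmark provides.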

Abstract

Instruction granularity is an important yet poorly controlled variable in language-guided embodied AI. Existing benchmarks typically pair each task with a single static instruction, making it difficult to study how agent behavior changes when the same task is described at different levels of detail. We introduce Mini-BEHAVIOR-Gran, a new benchmark for controlled studies of instruction granularity that extends Mini-BEHAVIOR with multiple instruction variants per task, ranging from high-level goal descriptions to step-by-step guidance. Using this benchmark, we compare four candidate metrics for cross-task granularity quantification (token count, entity count, action-verb count, and planning-width) and find that planning-width correlates most consistently with agent performance. Using planning-width to organize training and evaluation further reveals a non-monotonic, U-shaped relationship between instruction granularity and performance, with peaks at both the fine and coarse extremes. Further analysis suggests that the coarse-granularity performance rebound is associated with shallow grounding, where agents learn vision-dominant policies.