ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

arXiv cs.AI / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper argues that embodied agents should account for unexpected real-world conditions and missing affordance information rather than simply executing instructions.
  • It introduces DynAfford, a benchmark for dynamic environments where object affordances can change over time and are not specified in the instruction.
  • The benchmark requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly.
  • To support this, the authors propose ADAPT, a plug-and-play module that adds explicit affordance reasoning to existing planners.
  • Experiments show that ADAPT improves robustness and task success in both seen and unseen settings, and that a domain-adapted, LoRA-finetuned vision-language model can outperform GPT-4o for affordance inference.

Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions without considering whether the target objects can actually be manipulated; that is, they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we propose ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
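The abstract describes ADAPT as a plug-and-play module that sits between a planner and execution, checking each step's implicit affordance preconditions against the perceived object state. The paper does not give an implementation, so the sketch below is purely illustrative: the action/affordance mappings (`requires`, `repairs`), the `ObjectState` class, and the `affordance_gate` function are all hypothetical names invented here, not the authors' API.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """Perceived state of an object; affordances may change over time."""
    name: str
    affordances: set = field(default_factory=set)  # e.g. {"openable", "graspable"}

def affordance_gate(plan, world):
    """Hypothetical plug-and-play filter in the spirit of ADAPT: before each
    (action, obj) step, check that the object's current affordances satisfy
    the action's implicit precondition, and insert a repair step if not."""
    # Illustrative mapping from actions to the affordance they presuppose.
    requires = {"open": "openable", "grasp": "graspable"}
    # Illustrative repair actions that can restore a missing affordance.
    repairs = {"openable": "unlock", "graspable": "clear_obstruction"}
    adapted = []
    for action, obj in plan:
        need = requires.get(action)
        if need and need not in world[obj].affordances:
            adapted.append((repairs[need], obj))  # adapt: prepend a repair step
            world[obj].affordances.add(need)      # assume the repair succeeds
        adapted.append((action, obj))
    return adapted

# Usage: the fridge is currently locked, so "open" lacks its precondition
# and the gate inserts an "unlock" step before it.
world = {"fridge": ObjectState("fridge", {"graspable"})}
plan = [("open", "fridge")]
print(affordance_gate(plan, world))
# → [('unlock', 'fridge'), ('open', 'fridge')]
```

In the paper's framing, the precondition check would be performed by a vision-language model over observations rather than a hand-written table; the table here only stands in for that inference step.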