Towards Unconstrained Human-Object Interaction

arXiv cs.CV / April 16, 2026


Key Points

  • The paper addresses human-object interaction (HOI) detection as a computer vision problem and argues that current methods are constrained by fixed interaction vocabularies used at both training and inference.
  • It proposes the new Unconstrained HOI (U-HOI) task, which removes the need for predefined interaction lists, targeting more realistic “in-the-wild” settings.
  • The authors leverage multimodal large language models (MLLMs) to perform interaction recognition in this open-ended setting, evaluating multiple MLLM options for the task.
  • They introduce a processing pipeline that includes test-time inference and language-to-graph conversion to extract structured interaction representations from free-form text.
  • The authors release code for the proposed approach; their findings expose the limitations of existing HOI detectors and show that MLLMs are better suited to unconstrained HOI recognition.
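The language-to-graph step described above is not detailed here, but its core idea, turning free-form interaction descriptions into structured (subject, interaction, object) triplets, can be illustrated with a minimal sketch. The class and function names below, as well as the regex-based extraction, are illustrative assumptions and do not reflect the authors' actual pipeline, which presumably relies on an MLLM rather than pattern matching:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionEdge:
    """One edge of an interaction graph: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


def text_to_graph(description: str) -> list[InteractionEdge]:
    """Extract interaction triplets from simple free-form sentences
    such as 'a person is riding a bicycle' (toy heuristic only)."""
    # Naive pattern: optional article, subject, 'is/are', a gerund,
    # optional article, object.
    pattern = re.compile(
        r"(?:(?:the|an|a)\s+)?(\w+)\s+(?:is|are)\s+(\w+ing)\s+"
        r"(?:(?:the|an|a)\s+)?(\w+)",
        re.IGNORECASE,
    )
    edges = []
    for match in pattern.finditer(description):
        subject, predicate, obj = (g.lower() for g in match.groups())
        edges.append(InteractionEdge(subject, predicate, obj))
    return edges
```

For example, `text_to_graph("A person is riding a bicycle.")` yields one edge, `InteractionEdge("person", "riding", "bicycle")`; a real system would replace the regex with MLLM-driven parsing to handle arbitrary phrasing.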

Abstract

Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interactions between humans and objects. Current HOI models rely on a fixed vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi.