Context as Prior: Bayesian-Inspired Intent Inference for Non-Speaking Agents with a Household Cat Testbed

arXiv cs.CV / 5/1/2026

Key Points

  • The paper introduces CatSignal, a Bayesian-inspired probabilistic framework to infer intent for non-speaking agents using noisy, incomplete behavioral observations plus rich spatial context.
  • Instead of treating context as an ordinary input feature, the method uses a context-gated Product-of-Experts approach to produce posterior-like intent distributions from context, pose dynamics, and acoustic cues (a rough formulation is sketched after this list).
  • A household cat testbed serves as a proof of concept, demonstrating intent inference in a realistic embodied setting where goals cannot be communicated through language.
  • On a multimodal domestic cat dataset with leave-one-video-out evaluation, CatSignal reaches 77.72% overall accuracy, outperforming feature concatenation (71.83%) and late-fusion baselines.
  • Beyond accuracy gains, the approach substantially reduces failures caused by naive use of context, particularly in ambiguous cases where models can otherwise latch onto brittle shortcut predictions.
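
To make the "context as prior" idea concrete, a fusion of this kind can be written as a context prior multiplied by per-modality experts whose influence is modulated by context-dependent gates. The notation below is a reconstruction from the description above, not the paper's exact formulation:

```latex
% Plausible form of a context-gated Product-of-Experts posterior
% (all symbols are assumed, not taken from the paper):
%   y   : intent class
%   c   : spatial context
%   x_m : observation from modality m (e.g., pose dynamics, acoustic cues)
%   g_m : context-dependent gate in [0, 1]
\[
  p(y \mid c, x_{1:M}) \;\propto\;
  \underbrace{p(y \mid c)}_{\text{context prior}}
  \prod_{m=1}^{M}
  \underbrace{p_m(y \mid x_m)^{\,g_m(c)}}_{\text{gated modality expert}}
\]
```

A gate near zero mutes a modality, which is one way such a formulation could avoid over-trusting either the context prior or a noisy behavioral cue.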

Abstract

Many agents in real-world environments cannot reliably communicate their goals through language, including household pets, pre-verbal infants, and other non-speaking embodied agents. In such settings, intent must be inferred from incomplete behavioral observations in context-rich environments. This creates a core ambiguity: observable behavior is often noisy or underspecified, while context provides strong prior information but can also induce brittle shortcut predictions if used naively. We present CatSignal, a Bayesian-inspired probabilistic framework for multimodal intent inference that models spatial context as a prior-like constraint and behavioral observations as evidence. Rather than treating context as an ordinary input feature, our method uses a context-gated Product-of-Experts formulation to compute posterior-like intent distributions from context, pose dynamics, and acoustic cues. We instantiate this formulation in a household cat setting as a focused proof-of-concept for intent inference in non-speaking agents. Under leave-one-video-out evaluation on a multimodal domestic cat dataset, the proposed prior-guided fusion achieves the best overall accuracy of 77.72%, outperforming feature concatenation (71.83%) and stronger late-fusion baselines. More importantly, it substantially reduces context-driven shortcut failures in ambiguous cases. While simpler fusion strategies remain competitive in Macro-F1 and selective prediction, the proposed model provides the strongest overall accuracy and the best suppression of context-based shortcut collapse.
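
For intuition, here is a minimal NumPy sketch of that fusion step. Everything in it is hypothetical (the function name, the class set, and the fixed probability vectors); the actual model would presumably learn the per-modality experts and the gating from data rather than take them as inputs:

```python
import numpy as np

def context_gated_poe(prior, expert_probs, gates, eps=1e-12):
    """Combine a context prior with gated modality experts.

    Hypothetical sketch of a context-gated Product-of-Experts:
    posterior(y) is proportional to prior(y|c) * prod_m p_m(y|x_m) ** g_m(c).

    prior        -- (K,) context prior over K intent classes
    expert_probs -- (M, K) per-modality predictions p_m(y|x_m)
    gates        -- (M,) context-dependent weights in [0, 1]; a gate near 0
                    effectively mutes a modality the context deems unreliable
    """
    # Accumulate in log space for numerical stability.
    log_post = np.log(prior + eps)
    for g, probs in zip(gates, expert_probs):
        log_post += g * np.log(probs + eps)
    # Renormalize back to a probability distribution.
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Toy example with 3 intent classes ("food", "door", "play"):
prior = np.array([0.6, 0.3, 0.1])       # context: cat is near its bowl
experts = np.array([
    [0.4, 0.4, 0.2],                    # pose expert is ambiguous
    [0.7, 0.2, 0.1],                    # audio expert favors "food"
])
gates = np.array([0.5, 1.0])            # context downweights the pose cue
print(context_gated_poe(prior, experts, gates))
```

Because each gate acts as an exponent on its expert, the prior never hard-overrides the evidence, which is one plausible reading of how this differs from treating context as just another concatenated feature.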
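
The leave-one-video-out protocol mentioned above holds out all clips from one source video per fold, which prevents temporally correlated frames from leaking between train and test splits. A generic sketch using scikit-learn's LeaveOneGroupOut follows; the data arrays are random placeholders, not the paper's dataset or classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

# Random placeholders standing in for extracted multimodal features:
# X -- (n_clips, n_features), y -- intent labels, groups -- source video ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 16))
y = rng.integers(0, 3, size=120)
groups = np.repeat(np.arange(10), 12)   # 10 videos, 12 clips each

accs = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean leave-one-video-out accuracy: {np.mean(accs):.3f}")
```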