Hi - Filip from Interhuman AI here 👋 We just released Inter-1, a model we've been building for the past year. I wanted to share some of what we ran into while building it, because I think the problem space is more interesting than most people realize.

The short version of why we built this
If you ask GPT or Gemini to watch a video of someone talking and tell you what's going on, they'll mostly summarize what the person said. They'll miss that the person broke eye contact right before answering, paused for two seconds mid-sentence, or shifted their posture when a specific topic came up. Even the multimodal frontier models aren't doing this, because they don't process video and audio in temporal alignment in a way that lets them pick up on behavioral patterns.

Behavioral science vs emotion AI
Most models in this space are trained on basic emotion categories like happiness, sadness, anger, and surprise. Those categories were designed around clear, intense, deliberately produced expressions. They don't map well to how people actually communicate in a work setting.

The model explains itself
For every signal Inter-1 detects, it outputs a probability score and a rationale: which cues it observed, which modalities they came from, and how they map to the predicted signal.

Benchmarks
We tested against ~15 models, from small open-weight models to the latest closed frontier systems. Inter-1 had the highest detection accuracy at near real-time speed. The gap was widest on the hard signals (interest, skepticism, stress, and uncertainty), where even trained human annotators disagree with each other.

The dataset problem
The existing datasets in affective computing are built around basic emotions, narrow demographics, and limited recording contexts. We couldn't use them, so we built our own: large-scale, purpose-built, and combining in-the-wild video with synthetic data. Every sample was annotated by both expert behavioral scientists and trained crowd annotators working in parallel. Building the dataset was by far the hardest part, along with the ontology.

What's next
Right now it's single-speaker-in-frame, which covers most interview, presentation, and meeting scenarios. Multi-person interaction is next. We're also working on streaming inference for real-time use.

Happy to answer any questions here :)
Introducing Inter-1, a multimodal model detecting social signals from video, audio & text
Reddit r/artificial / 4/16/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- Inter-1 is a newly released multimodal model from Interhuman AI that detects 12 social signals by analyzing video, audio, and text with temporal alignment to capture behavioral patterns beyond “what was said” (a toy sketch of such alignment follows this list).
- The approach shifts away from basic emotion-category labeling toward an ontology grounded in behavioral science, using observable nonverbal/paraverbal cues (e.g., gaze, posture, vocal prosody, speech rhythm, word choice).
- For each detected signal, Inter-1 provides probability scores plus human-checkable rationales detailing which cues and modalities supported the prediction (a hypothetical output schema is sketched after this list).
- In blind evaluations with behavioral science experts, the rationales were preferred over a frontier multimodal model’s outputs 83% of the time.
- The model is positioned for practical use cases like analyzing interviews, training, and sales calls where communication dynamics matter as much as content.
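To make the temporal-alignment idea from the first key point concrete, here is a minimal sketch, assuming per-frame video embeddings and short-hop audio features that get mean-pooled onto a shared time grid. The feature rates, dimensions, and the `align_to_grid` helper are illustrative assumptions, not Interhuman AI's actual pipeline.

```python
import numpy as np

def align_to_grid(feats: np.ndarray, step_s: float, grid_s: float,
                  duration_s: float) -> np.ndarray:
    """Mean-pool (n_steps, dim) features into cells of width grid_s seconds."""
    n_cells = round(duration_s / grid_s)
    out = np.zeros((n_cells, feats.shape[1]))
    for i in range(n_cells):
        lo = round(i * grid_s / step_s)
        hi = min(len(feats), max(lo + 1, round((i + 1) * grid_s / step_s)))
        out[i] = feats[lo:hi].mean(axis=0)
    return out

duration = 10.0
video = np.random.randn(round(duration * 25), 512)  # 25 fps frame embeddings
audio = np.random.randn(round(duration * 50), 128)  # 20 ms-hop prosody features

# Both modalities resampled onto the same 100 ms timeline, then concatenated,
# so co-occurring cues (e.g. gaze aversion + pitch drop) line up per timestep.
fused = np.concatenate([
    align_to_grid(video, 1 / 25, 0.1, duration),
    align_to_grid(audio, 0.02, 0.1, duration),
], axis=1)  # shape (100, 640)
```

Once both streams share a timeline, cross-modal patterns such as a posture shift coinciding with a mid-sentence pause become visible to a downstream model as co-occurring features rather than disconnected events.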
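And to illustrate the kind of probability-plus-rationale output the fourth key point describes, here is a hypothetical schema. The field names, the "skepticism" example, and all cue values are assumptions for illustration, not Inter-1's documented format.

```python
from dataclasses import dataclass, field

@dataclass
class Cue:
    description: str  # observable behavior, e.g. "gaze aversion before reply"
    modality: str     # "video", "audio", or "text"
    start_s: float    # when the cue occurred in the clip
    end_s: float

@dataclass
class SignalDetection:
    signal: str                # one of the detected social signals
    probability: float         # model confidence in [0, 1]
    cues: list[Cue] = field(default_factory=list)
    rationale: str = ""        # human-checkable link from cues to the signal

# Hypothetical example of one detection on a clip
detection = SignalDetection(
    signal="skepticism",
    probability=0.78,
    cues=[
        Cue("brow furrow during the claim", "video", 12.4, 13.1),
        Cue("falling pitch with slowed speech rate", "audio", 12.0, 14.0),
        Cue("hedging phrase: 'I guess that could work'", "text", 13.0, 14.5),
    ],
    rationale="Converging visual, prosodic, and lexical cues co-occur "
              "within a two-second window around the speaker's reply.",
)
```

Structuring the output this way is what makes the rationales checkable: an annotator can jump to each cue's timestamp and modality and verify it independently of the model's final score.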