The Language of Touch: Translating Vibrations into Text with Dual-Branch Learning

arXiv cs.CV / 3/31/2026


Key Points

  • The paper tackles vibrotactile captioning by generating natural-language descriptions directly from vibrotactile signals, addressing a key gap in semantic interpretation for haptic data.
  • It introduces ViPAC, which uses a dual-branch learning strategy to separate periodic and aperiodic signal components and a dynamic fusion mechanism to integrate features adaptively.
  • The method adds training constraints—an orthogonality constraint and weighting regularization—to improve feature complementarity and consistency in the fused representation.
  • To enable evaluation, the authors build LMT108-CAP, the first vibrotactile-text paired dataset, generating five constrained captions per surface image with GPT-4o from the existing LMT-108 dataset.
  • Experiments indicate ViPAC outperforms baseline approaches adapted from audio/image captioning, improving both lexical fidelity and semantic alignment between signals and generated text.
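The dual-branch design in the points above can be sketched in a toy form. This is an illustrative stand-in, not the authors' implementation: the branch encoders, gate, and loss names below are assumptions, with simple linear projections playing the role of learned periodic/aperiodic encoders, a sigmoid gate acting as the dynamic fusion weight, and squared cosine similarity as an orthogonality penalty encouraging the two branches to capture complementary features.

```python
import math
import random

random.seed(0)

def branch_features(signal, proj):
    # Toy "branch": a linear projection standing in for a learned
    # periodic or aperiodic encoder (illustrative only).
    return [sum(w * x for w, x in zip(row, signal)) for row in proj]

def dynamic_fusion(f_per, f_aper, gate_logit):
    # Adaptively weight the two branches with a scalar sigmoid gate;
    # a learned gating network would produce gate_logit in practice.
    w = 1.0 / (1.0 + math.exp(-gate_logit))
    fused = [w * a + (1.0 - w) * b for a, b in zip(f_per, f_aper)]
    return fused, w

def orthogonality_penalty(f_per, f_aper):
    # Squared cosine similarity: 0 when the branch features are
    # orthogonal, pushing the branches toward complementary content.
    dot = sum(a * b for a, b in zip(f_per, f_aper))
    na = math.sqrt(sum(a * a for a in f_per))
    nb = math.sqrt(sum(b * b for b in f_aper))
    return (dot / (na * nb)) ** 2

# Random stand-ins for a vibrotactile sample and two branch projections.
signal = [random.gauss(0, 1) for _ in range(64)]
P_per = [[random.gauss(0, 1) for _ in range(64)] for _ in range(16)]
P_aper = [[random.gauss(0, 1) for _ in range(64)] for _ in range(16)]

f_per = branch_features(signal, P_per)
f_aper = branch_features(signal, P_aper)
fused, w = dynamic_fusion(f_per, f_aper, gate_logit=0.3)
loss_orth = orthogonality_penalty(f_per, f_aper)
```

In the paper, the fused representation would then condition a caption decoder, with the orthogonality term and a weighting regularizer added to the training loss; here the sketch only shows the feature-level mechanics.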

Abstract

The standardization of vibrotactile data by the IEEE P1918.1 working group has greatly advanced its applications in virtual reality, human-computer interaction, and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt to address vibrotactile captioning, i.e., generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, including hybrid periodic-aperiodic structures and the lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates signal features. It also introduces an orthogonality constraint and weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.