CGCMA: Conditionally-Gated Cross-Modal Attention for Event-Conditioned Asynchronous Fusion

arXiv cs.LG / 4/21/2026

Key Points

  • The paper studies asynchronous multimodal learning, where a continuous primary signal must be fused with delayed external context whose value depends on its arrival time and reliability.
  • It introduces CGCMA (Conditionally-Gated Cross-Modal Attention), which grounds event-relevant market states via text-attention and then uses a lag-aware gating mechanism to regulate (or suppress) residual cross-modal injection when web context is stale or contradictory.
  • The authors create CMI (Crypto Market Intelligence), an asynchronous evaluation dataset of 27,914 samples that pair high-frequency cryptocurrency price sequences with lagged real-news web intelligence.
  • On a short real-news evaluation set, CGCMA achieves the best mean downstream Sharpe ratio (+0.449 ± 0.257) under a shared zero-cost threshold-trading protocol, and ablations suggest the improvement is not due only to web scalar features or simple freshness heuristics.
  • Overall, the authors present the results as evidence that asynchronous cross-modal fusion is a valid problem setting and that CGCMA yields promising gains on this stress-test setup.
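
To make the evaluation metric in the key points concrete, here is a minimal sketch of a zero-cost threshold-trading Sharpe ratio. The summary does not spell out the protocol, so everything here is an assumption: the function names `threshold_positions` and `sharpe_ratio`, the long/short/flat position rule, the use of the sample standard deviation, and the optional `periods_per_year` scaling are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import statistics


def threshold_positions(preds, threshold):
    """Map predicted returns to positions: +1 long, -1 short, 0 flat.

    Assumed rule: trade only when the prediction clears the threshold.
    """
    return [1 if p > threshold else -1 if p < -threshold else 0 for p in preds]


def sharpe_ratio(preds, realized, threshold=0.0, periods_per_year=1):
    """Sharpe ratio of a threshold strategy with no transaction costs.

    Per-bar PnL is position * realized return; "zero-cost" means nothing
    is subtracted for fees or slippage. Returns 0.0 when the ratio is
    undefined (fewer than two bars, or zero PnL variance).
    """
    if len(preds) != len(realized):
        raise ValueError("preds and realized must have the same length")
    positions = threshold_positions(preds, threshold)
    pnl = [pos * r for pos, r in zip(positions, realized)]
    if len(pnl) < 2:
        return 0.0
    sd = statistics.stdev(pnl)  # sample standard deviation
    if sd == 0:
        return 0.0
    return statistics.mean(pnl) / sd * math.sqrt(periods_per_year)


# Usage with made-up numbers (not data from the paper).
preds = [0.02, -0.03, 0.001, 0.05]
realized = [0.01, -0.02, 0.03, -0.01]
print(sharpe_ratio(preds, realized, threshold=0.01))
```

Under a shared protocol of this kind, every model is scored with the same threshold rule and the same cost assumption, so differences in Sharpe ratio reflect differences in the predictions rather than in the trading rule.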

Abstract

We study asynchronous alignment, a first-class multimodal learning setting in which a dense primary stream must be fused with sporadic external context whose value depends on when it arrives. Unlike standard multimodal benchmarks that assume structural synchrony, this setting requires models to reason explicitly about freshness and trust. We focus on the event-conditioned case in which continuous market states are paired with delayed web intelligence, and we use high-frequency cryptocurrency markets only as a timestamped, high-noise stress test for this broader problem. We propose CGCMA (Conditionally-Gated Cross-Modal Attention), whose central design principle is to separate text-conditioned grounding from lag-aware trust control. Text first attends over price sequences to identify event-relevant market states, after which a conditional gate uses modality agreement, web features, and lag τ_lag to regulate residual injection and fall back toward unimodal prediction when external context is stale or contradictory. We introduce CMI (Crypto Market Intelligence), an asynchronous evaluation corpus with 27,914 real-news samples pairing high-frequency price sequences with lagged web intelligence. On the current short real-news corpus, CGCMA attains the highest mean downstream Sharpe ratio (+0.449 ± 0.257) among the evaluated baselines under a shared zero-cost threshold-trading evaluation on news-available bars. Additional controls show that the gain is not explained by web scalars alone and is not recovered by simple freshness heuristics. The resulting evidence supports problem validity and a promising asynchronous multimodal gain on this stress-test setting.
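
The two-stage design described in the abstract (text-conditioned grounding, then lag-aware gating of a residual injection) can be illustrated with a small pure-Python sketch. This is not the authors' architecture. The abstract does not give the gate's functional form, so the choices below are assumptions made for illustration: a single text query with scaled dot-product attention, cosine similarity as the "modality agreement" signal, the last price state as the unimodal representation, a sigmoid gate with a linear lag penalty, and all names and weights (`gated_fusion`, `w_agree`, `w_web`, `w_lag`, `bias`).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Branch on the sign so that large |x| never overflows math.exp.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0 or nb == 0 else dot(a, b) / (na * nb)


def gated_fusion(text_vec, price_states, web_feats, lag,
                 w_agree=1.0, w_web=None, w_lag=0.1, bias=0.0):
    """Return (fused, gate, attn) for one sample.

    text_vec:     text embedding, used as the attention query (length d)
    price_states: list of price-sequence states, each of length d
    web_feats:    scalar web features fed to the gate
    lag:          delay between the news and the current bar (any unit)
    """
    d = len(text_vec)
    if w_web is None:
        w_web = [0.0] * len(web_feats)

    # Stage 1: text attends over price states to ground the event.
    scores = [dot(text_vec, h) / math.sqrt(d) for h in price_states]
    attn = softmax(scores)
    grounded = [sum(a * h[i] for a, h in zip(attn, price_states))
                for i in range(d)]

    # Assumed unimodal fallback: the most recent price state.
    unimodal = price_states[-1]

    # Stage 2: gate from agreement, web features, and lag (assumed form).
    agreement = cosine(text_vec, grounded)
    logit = w_agree * agreement + dot(w_web, web_feats) - w_lag * lag + bias
    gate = sigmoid(logit)

    # Residual injection: the gate scales how much context is added.
    fused = [u + gate * g for u, g in zip(unimodal, grounded)]
    return fused, gate, attn


# Usage with toy values: the same news item, fresh versus very stale.
states = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
_, gate_fresh, _ = gated_fusion([1.0, 0.0], states, [0.8], lag=0.0)
_, gate_stale, _ = gated_fusion([1.0, 0.0], states, [0.8], lag=1e4)
print(gate_fresh, gate_stale)
```

In this sketch, a large lag drives the gate toward zero, so the fused output collapses to the unimodal price state, which mirrors the fallback behavior the abstract describes for stale context.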