TriFit: Trimodal Fusion with Protein Dynamics for Mutation Fitness Prediction

arXiv cs.LG / 4/15/2026

Key Points

  • TriFit is presented as a multimodal supervised framework for predicting the fitness effects of single amino-acid variants (SAVs) that explicitly incorporates protein dynamics alongside sequence and structure.
  • The model combines three embedding sources—ESM-2-based sequence embeddings, AlphaFold2-derived structural geometry embeddings, and Gaussian Network Model (GNM) dynamics features—fused via a four-expert Mixture-of-Experts (MoE) with trimodal cross-modal contrastive learning.
  • TriFit adaptively learns how to weight different modality combinations per protein using an MoE router, avoiding fixed assumptions about which modality matters most.
  • On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit reports an AUROC of 0.897 ± 0.0002, surpassing the supervised baselines Kermut (0.864) and ProteinNPT (0.844) as well as the best listed zero-shot model, ESM3 (0.769).
  • Ablations indicate that adding dynamics yields the largest marginal gain over pairwise modality combinations, and the model produces well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc calibration.
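
The router-weighted fusion described in the key points can be illustrated with a small NumPy sketch. This is not the authors' implementation: the summary only states that there are four experts and that a router weights modality combinations. The assignment of experts to the three modality pairs plus the full triple, the single-layer tanh experts, and the names `init_params` and `moe_fuse` are all assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(d, hidden, seed=0):
    """Random weights for a hypothetical four-expert fusion layer."""
    rng = np.random.default_rng(seed)
    # Assumed expert inputs: three modality pairs and one full triple.
    in_dims = [2 * d, 2 * d, 2 * d, 3 * d]
    return {
        "W_gate": rng.normal(scale=0.1, size=(3 * d, 4)),
        "W_experts": [rng.normal(scale=0.1, size=(k, hidden)) for k in in_dims],
    }

def moe_fuse(seq, struct, dyn, params):
    """Fuse three (batch, d) modality embeddings with a softmax-gated mixture.

    Returns the fused (batch, hidden) representation and the (batch, 4)
    router weights, which sum to 1 for each input.
    """
    full = np.concatenate([seq, struct, dyn], axis=-1)
    gate = softmax(full @ params["W_gate"])  # router conditioned on the input
    expert_inputs = [
        np.concatenate([seq, struct], axis=-1),  # sequence + structure
        np.concatenate([seq, dyn], axis=-1),     # sequence + dynamics
        np.concatenate([struct, dyn], axis=-1),  # structure + dynamics
        full,                                    # all three modalities
    ]
    outputs = np.stack(
        [np.tanh(x @ W) for x, W in zip(expert_inputs, params["W_experts"])],
        axis=1,
    )  # (batch, 4, hidden)
    fused = (gate[..., None] * outputs).sum(axis=1)
    return fused, gate
```

Because the gate is computed from the input itself, two proteins can receive different mixtures of the same experts, which is the sense in which the fusion avoids a fixed modality weighting.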

Abstract

Predicting the functional impact of single amino acid substitutions (SAVs) is central to understanding genetic disease and engineering therapeutic proteins. While protein language models and structure-based methods have achieved strong performance on this task, they systematically neglect protein dynamics; residue flexibility, correlated motions, and allosteric coupling are well-established determinants of mutational tolerance in structural biology, yet have not been incorporated into supervised variant effect predictors. We present TriFit, a multimodal framework that integrates sequence, structure, and protein dynamics through a four-expert Mixture-of-Experts (MoE) fusion module with trimodal cross-modal contrastive learning. Sequence embeddings are extracted via masked marginal scoring with ESM-2 (650M); structural embeddings from AlphaFold2-predicted C-alpha geometries; and dynamics embeddings from Gaussian Network Model (GNM) B-factors, mode shapes, and residue-residue cross-correlations. The MoE router adaptively weights modality combinations conditioned on the input, enabling protein-specific fusion without fixed modality assumptions. On the ProteinGym substitution benchmark (217 DMS assays, 696k SAVs), TriFit achieves AUROC 0.897 ± 0.0002, outperforming all supervised baselines including Kermut (0.864) and ProteinNPT (0.844), and the best zero-shot model ESM3 (0.769). Ablation studies confirm that dynamics provides the largest marginal contribution over pairwise modality combinations, and TriFit achieves well-calibrated probabilistic outputs (ECE = 0.044) without post-hoc correction.
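
The three GNM quantities named in the abstract (B-factors, mode shapes, and residue-residue cross-correlations) all follow from the standard Gaussian Network Model construction, sketched below. The 7.3 Å contact cutoff, the number of modes returned, and the function name `gnm_features` are assumptions chosen for illustration; the abstract does not give the paper's settings. The fluctuations are returned unscaled, so they are proportional to B-factors rather than equal to them, and the sketch assumes the contact network is connected.

```python
import numpy as np

def gnm_features(coords, cutoff=7.3, n_modes=3):
    """Standard GNM features from (N, 3) C-alpha coordinates in angstroms.

    Returns per-residue relative fluctuations (proportional to B-factors),
    the n_modes slowest mode shapes, and the normalized cross-correlation
    matrix. Assumes a connected contact network (no isolated residues).
    """
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    contact = (dist <= cutoff) & ~np.eye(n, dtype=bool)

    # Kirchhoff matrix: -1 for residues in contact, degree on the diagonal.
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))

    # Eigenvalues come back in ascending order; drop the zero mode.
    evals, evecs = np.linalg.eigh(kirchhoff)
    keep = evals > 1e-8
    evals, evecs = evals[keep], evecs[:, keep]

    # Pseudo-inverse of the Kirchhoff matrix, built from the nonzero modes.
    cov = (evecs / evals) @ evecs.T
    fluct = np.diag(cov).copy()                    # proportional to B-factors
    cross = cov / np.sqrt(np.outer(fluct, fluct))  # values in [-1, 1]
    modes = evecs[:, :n_modes]                     # slowest (lowest-frequency) modes
    return fluct, modes, cross

# Usage on an idealized 20-residue alpha-helix (synthetic coordinates).
i = np.arange(20)
angle = np.deg2rad(100.0) * i
helix = np.stack([2.3 * np.cos(angle), 2.3 * np.sin(angle), 1.5 * i], axis=1)
fluct, modes, cross = gnm_features(helix)
```

In practice the coordinates would come from a predicted structure rather than a synthetic helix, and the resulting per-residue and pairwise arrays would need to be encoded into fixed-size embeddings before fusion; that step is not specified in the abstract.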