ImmSET: Sequence-Based Predictor of TCR-pMHC Specificity at Scale

arXiv cs.LG / 3/31/2026


Key Points

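  • ImmSET (Immune Synapse Encoding Transformer) is a new transformer-based architecture that predicts interactions among sets of variable-length sequences, such as TCRs and pMHCs, from sequence information alone.
  • The authors identify an evaluation failure mode in prior sequence-based methods that inflates reported performance, and show that ImmSET remains robust under stricter evaluation.
  • They systematically test scaling behavior as training data grows and report that performance improves consistently with data volume across multiple data types.
  • ImmSET compares favorably with ESM2 (a protein language model) fine-tuned on the same data, and can outperform AlphaFold2- and AlphaFold3-based pipelines when sufficient training data is available.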

Abstract

T cells are a critical component of the adaptive immune system, playing a role in infectious disease, autoimmunity, and cancer. T cell function is mediated by the T cell receptor (TCR) protein, a highly diverse receptor targeting specific peptides presented by the major histocompatibility complex (pMHCs). Predicting the specificity of TCRs for their cognate pMHCs is central to understanding adaptive immunity and enabling personalized therapies. However, accurate prediction of this protein-protein interaction remains challenging due to the extreme diversity of both TCRs and pMHCs. Here, we present ImmSET (Immune Synapse Encoding Transformer), a novel sequence-based architecture designed to model interactions among sets of variable-length biological sequences. We train this model across a range of dataset sizes and compositions and study the resulting models' generalization to pMHC targets. We describe a failure mode in prior sequence-based approaches that inflates previously reported performance on this task and show that ImmSET remains robust under stricter evaluation. In systematically testing the scaling behavior of ImmSET with training data, we show that performance scales consistently with data volume across multiple data types and compares favorably with the pre-trained protein language model ESM2 fine-tuned on the same datasets. Finally, we demonstrate that ImmSET can outperform AlphaFold2 and AlphaFold3-based pipelines on TCR-pMHC specificity prediction when provided sufficient training data. This work establishes ImmSET as a scalable modeling paradigm for multi-sequence interaction problems, demonstrated in the TCR-pMHC setting but generalizable to other biological domains where high-throughput sequence-driven reasoning complements structure prediction and experimental mapping.
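
The abstract does not spell out the stricter evaluation protocol, so the following is only a minimal sketch of one common way to test generalization to pMHC targets: splitting paired records so that no pMHC appears in both training and test sets. The function name `split_by_target`, the record layout, and the toy sequences are all assumptions made for illustration; none of them come from the paper.

```python
import random
from collections import defaultdict


def split_by_target(pairs, test_fraction=0.2, seed=0):
    """Split (tcr, pmhc) records so that each pMHC lands in exactly one side.

    Hypothetical helper for illustration; not the authors' protocol.
    """
    # Group every record under its pMHC target.
    by_target = defaultdict(list)
    for tcr, pmhc in pairs:
        by_target[pmhc].append((tcr, pmhc))

    # Shuffle the targets (not the records) so whole targets are held out.
    targets = sorted(by_target)
    random.Random(seed).shuffle(targets)

    # Hold out at least one target for testing.
    n_test = max(1, round(len(targets) * test_fraction))
    held_out = set(targets[:n_test])

    train = [rec for t in targets[n_test:] for rec in by_target[t]]
    test = [rec for t in targets[:n_test] for rec in by_target[t]]
    return train, test, held_out


# Synthetic placeholder records: (TCR sequence, pMHC label). Not real data.
records = [
    ("CASSAAAF", "PEPTIDEA|MHC1"),
    ("CASSBBBF", "PEPTIDEA|MHC1"),
    ("CASSCCCF", "PEPTIDEB|MHC1"),
    ("CASSDDDF", "PEPTIDEC|MHC2"),
    ("CASSEEEF", "PEPTIDED|MHC2"),
    ("CASSFFFF", "PEPTIDEE|MHC1"),
]

train, test, held_out = split_by_target(records, test_fraction=0.2, seed=0)
print(f"{len(train)} training records, {len(test)} test records")
print("held-out targets:", sorted(held_out))
```

Splitting by target rather than by record means test-set scores reflect performance on pMHCs the model never saw during training, which is the setting the abstract refers to as generalization to pMHC targets.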