Distilling Genomic Models for Efficient mRNA Representation Learning via Embedding Matching

arXiv cs.AI / 4/13/2026


Key Points

  • The paper proposes a distillation framework that transfers mRNA representation learning from a large genomic foundation model into a much smaller, mRNA-specialized model, targeting a ~200× parameter reduction.
  • It finds that embedding-level distillation is more effective than logit-based distillation, which the authors report as unstable; the standard logit-based objective is sketched after this list for reference.
  • Experiments on the mRNA-bench benchmark show the distilled model achieves state-of-the-art results among similarly sized models and remains competitive with larger architectures on mRNA-related tasks.
  • The authors argue that embedding-based distillation is an effective training strategy for biological foundation models, improving scalability when compute constraints make large models impractical.
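To make the contrast concrete, below is the standard temperature-scaled logit-distillation objective (Hinton-style KD) that the key points describe as unstable. This is a generic PyTorch sketch of that baseline, not the paper's exact recipe; the function name, temperature, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL distillation on logits (generic baseline).

    Shapes assumed: (batch, seq_len, vocab_size). This illustrates the
    standard logit-based objective, not the paper's confirmed setup.
    """
    t = temperature
    # Soften both distributions, flatten tokens so KL averages per token.
    log_p_student = F.log_softmax(student_logits / t, dim=-1).flatten(0, -2)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).flatten(0, -2)
    # Match student to teacher; scale by t^2 to keep gradient magnitudes
    # comparable across temperature settings.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    return kl * (t ** 2)
```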

Abstract

Large genomic foundation models have recently achieved remarkable results, including in-vivo translation capabilities. However, these models quickly grow to several billion parameters and are expensive to run when compute is limited. To overcome this challenge, we present a distillation framework for transferring mRNA representations from a state-of-the-art genomic foundation model into a much smaller model specialized for mRNA sequences, reducing model size 200-fold. Embedding-level distillation worked better than logit-based methods, which we found unstable. Benchmarking on mRNA-bench demonstrates that the distilled model achieves state-of-the-art performance among models of comparable size and competes with larger architectures on mRNA-related tasks. Our results highlight embedding-based distillation of mRNA sequences as an effective training strategy for biological foundation models, enabling efficient and scalable sequence modelling in genomics, particularly when large models are computationally challenging or infeasible.
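The embedding-level distillation the abstract describes can be sketched as matching the student's token embeddings to a frozen teacher's, with a linear projection bridging the smaller student hidden size to the teacher's. The sketch below assumes an MSE objective over non-padding tokens; the class name, projection, and masking are illustrative choices, since the paper's exact objective (e.g., MSE vs. cosine, token-level vs. pooled) is not given in this summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDistillationLoss(nn.Module):
    """Match student token embeddings to frozen teacher embeddings.

    A linear projection aligns the (smaller) student hidden size with the
    teacher's; the loss is mean-squared error averaged over real tokens.
    Hypothetical sketch of embedding-level distillation, not the paper's
    confirmed implementation.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_emb, teacher_emb, mask=None):
        # student_emb: (batch, seq_len, student_dim)
        # teacher_emb: (batch, seq_len, teacher_dim), computed under no_grad
        # mask:        (batch, seq_len) floats, 1.0 for real tokens
        pred = self.proj(student_emb)
        per_token = F.mse_loss(pred, teacher_emb, reduction="none").mean(dim=-1)
        if mask is not None:
            # Average only over non-padding positions.
            return (per_token * mask).sum() / mask.sum().clamp(min=1)
        return per_token.mean()

# Usage sketch (models and attribute names are placeholders):
# with torch.no_grad():
#     teacher_emb = teacher(input_ids).last_hidden_state
# student_emb = student(input_ids).last_hidden_state
# loss = distill_loss(student_emb, teacher_emb, mask=attention_mask.float())
```

One design note: because only the frozen teacher's embeddings are needed, they can be precomputed once for the whole training corpus, so the expensive multi-billion-parameter model never has to run during student training.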