In Search of Lost DNA Sequence Pretraining

arXiv cs.LG / 4/21/2026



Key Points

The paper argues that DNA sequence pretraining research has focused too heavily on scale and downstream evaluation datasets while overlooking key aspects of the pretraining paradigm.
It identifies three critical problems for DNA pretraining: using inappropriate downstream datasets, flaws in the neighbor-masking strategy, and insufficient analysis of vocabulary design.
The authors conduct systematic investigations and provide principled guidelines for selecting evaluation datasets, designing tasks, and analyzing vocabulary for DNA models.
Extensive experiments support the importance of these issues and validate the recommendations.
The work also introduces a standardized benchmarking testbed to enable reproducible and rigorous evaluation of DNA pretraining methods and advance genomic foundation models.

Abstract

DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion on vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in-depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and support the rationale behind our recommendations. Finally, we introduce a standardized testbed that enables reproducible and rigorous benchmarking of DNA pretraining methods to advance the development of genomic foundation models.
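The abstract names vocabulary and neighbor masking as problem areas but does not describe the schemes involved. As a minimal illustrative sketch, assuming a k-mer vocabulary (one common choice in DNA language models, not confirmed here as the paper's setup), the snippet below shows how vocabulary size grows as 4^k and how overlapping k-mer tokens let a single masked token be rebuilt from its two neighbors. The function names `kmer_vocab`, `tokenize`, and `reconstruct_masked` are hypothetical helpers introduced for this example only.

```python
from itertools import product


def kmer_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id (4**k entries for DNA)."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def tokenize(seq, k, stride):
    """Split seq into k-mers. stride=1 gives overlapping tokens, stride=k non-overlapping ones."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def reconstruct_masked(tokens, i, k):
    """Rebuild overlapping (stride=1) token i from its left and right neighbors alone.

    The left neighbor supplies the first k-1 bases; the right neighbor supplies the last one.
    Assumes k >= 2 and that token i has a neighbor on both sides.
    """
    return tokens[i - 1][1:] + tokens[i + 1][k - 2]


seq = "ACGTACGTAC"  # toy sequence, not real data
k = 3

vocab = kmer_vocab(k)
overlapping = tokenize(seq, k, stride=1)
non_overlapping = tokenize(seq, k, stride=k)

# Masking only token 3 hides nothing: its neighbors fully determine it.
leaked = reconstruct_masked(overlapping, 3, k)

print(len(vocab), overlapping, non_overlapping, leaked)
```

Under these assumptions, masking an isolated overlapping token makes the prediction task trivial, which is one commonly cited reason for masking neighboring tokens together. Whether this corresponds to the specific flaws the paper analyzes is not stated in the abstract.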


