Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees

arXiv cs.AI / 4/15/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

Key Points

  • The paper argues that while deep neural networks excel at DNA sequence prediction, they are often black boxes, motivating more interpretable approaches such as axis-aligned decision trees.
  • It identifies a key limitation of standard decision-tree splitting—treating raw features independently at each node—leading to overly deep trees that hurt interpretability and generalization.
  • The proposed DEFT framework adaptively generates higher-level, biologically informed sequence features during decision-tree construction rather than relying only on raw inputs.
  • DEFT uses large language models to propose candidate features based on the local sequence distribution at each node and then iteratively refines them using a reflection mechanism.
  • Experiments across multiple genomic tasks show that DEFT can produce human-interpretable features while maintaining strong predictive performance.

Abstract

The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.