Mining Negative Sequential Patterns to Improve Viral Genomic Feature Representation and Classification

arXiv cs.LG / 4/30/2026

Key Points

  • The paper addresses the limits of existing viral genome classifiers that rely mainly on composition or frequency features, which can reduce interpretability and accuracy on complex or imbalanced datasets.
  • It introduces GeneNSPCla, a viral classification framework that uses Negative Sequential Patterns (NSPs) to extract discriminative absence-based signals from RNA viral genomic sequences and converts them into feature vectors for multiple supervised classifiers.
  • The authors propose GONPM+, a genomic-adapted negative pattern mining algorithm designed to find longer, more biologically meaningful negative sequential patterns.
  • Experiments across 8 classifiers show that GONPM+ improves average accuracy by 10.03% over the original negative pattern mining method and by 24.75% over positive pattern mining.
  • Overall, the results suggest that incorporating absence-based sequential information provides a complementary and effective perspective for viral genome representation and classification.

Abstract

Viruses are the most abundant biological entities on Earth and play a pivotal role in microbial ecosystems; as prominent human pathogens, they are also closely linked to human morbidity and mortality. Accurate identification of viruses from their genome sequences is therefore essential, but existing genome-based classification models, which rely largely on composition- or frequency-based subsequence features, often suffer from limited interpretability and reduced accuracy, particularly on complex or imbalanced datasets. To address these limitations, we propose GeneNSPCla (Genomic Negative Sequential Pattern-based Classification), a novel viral classification framework based on Negative Sequential Patterns (NSPs) that extracts discriminative absence-based features from the nucleotide sequences of RNA viral genomes. By transforming these NSPs into numerical feature vectors and feeding them to multiple supervised classifiers, GeneNSPCla captures both presence and absence signals in viral sequences. Furthermore, we propose GONPM+, a negative pattern mining algorithm adapted to genomic data that can discover longer and more biologically meaningful negative sequential patterns. The experimental results show that, averaged across 8 classifiers, GONPM+ improves accuracy by 10.03% over the original negative pattern mining algorithm and by 24.75% over the positive pattern mining algorithm. These findings highlight the effectiveness of incorporating absence-based sequential information, providing a new and complementary perspective for viral genome analysis and classification.
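
To make the idea of turning NSPs into feature vectors concrete, here is a minimal sketch. It is not the paper's implementation and does not perform GONPM+ mining. The abstract specifies neither the matching semantics nor the encoding, so everything here is an assumption: patterns are written by hand, each element is a single nucleotide, a negated element means "this nucleotide does not occur between the neighbouring positive elements", and each feature is a 0/1 indicator of whether a sequence matches a pattern. The names `compile_nsp` and `nsp_features` are hypothetical.

```python
import re

def compile_nsp(pattern):
    """Compile one pattern into a regex.

    pattern: list of (nucleotide, negated) pairs, e.g.
             [("A", False), ("C", True), ("G", False)]  ~  <A, not C, G>
    Assumed semantics: positives occur in order with arbitrary gaps;
    a negated nucleotide must be absent from the gap it sits in.
    """
    parts = []
    pending_neg = set()  # nucleotides forbidden in the next gap
    for sym, negated in pattern:
        if len(sym) != 1:
            raise ValueError("this sketch only handles single-nucleotide elements")
        if negated:
            if not parts:
                raise ValueError("a negated element must follow a positive one")
            pending_neg.add(sym)
            continue
        if parts:
            # Gap between two positives: unrestricted, or excluding the negated symbols.
            if pending_neg:
                parts.append("[^" + "".join(sorted(pending_neg)) + "]*")
            else:
                parts.append(".*")
            pending_neg = set()
        parts.append(re.escape(sym))
    if pending_neg:
        raise ValueError("a negated element must precede a positive one")
    return re.compile("".join(parts))

def nsp_features(sequences, patterns):
    """Return one binary vector per sequence: 1 if the pattern matches, else 0."""
    compiled = [compile_nsp(p) for p in patterns]
    return [[1 if rx.search(seq) else 0 for rx in compiled] for seq in sequences]

# Hand-written illustrative patterns (not mined from real data).
patterns = [
    [("A", False), ("C", True), ("G", False)],  # A ... G with no C in between
    [("G", False), ("U", False)],               # G ... U (purely positive)
    [("U", False), ("A", True), ("U", False)],  # U ... U with no A in between
]
sequences = ["AUUG", "ACGU", "GGCA"]  # toy RNA strings, not real genomes

features = nsp_features(sequences, patterns)
for seq, vec in zip(sequences, features):
    print(seq, vec)
```

Each row of `features` could then be passed to any supervised classifier as an input vector. A real pipeline would also need the mining step that selects which patterns to use, which this sketch leaves out.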