Generating Synthetic Malware Samples Using Generative AI

arXiv cs.LG / 4/27/2026



Key Points

The paper addresses a key cybersecurity challenge: malware datasets are hard to obtain and are often imbalanced, especially for new malware variants with limited training data.
It proposes a system that transforms malware binaries into “mnemonic opcode sequences,” using NLP to capture the contextual meaning of opcode features to better condition generative models.
Multiple generative approaches are evaluated, including GANs, WGAN-GP, and a modified diffusion model, with diffusion-based synthetic data showing the strongest benefit.
Experiments indicate that augmenting the training data with diffusion-generated samples improves classification of minority malware classes by up to 60% on average, raising overall malware classification performance to 96% (an 8% improvement).
The authors report that the synthetic malware data has high fidelity and robustness, enabling better detection even when the amount of known malware data is very small.
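
To make the "mnemonic opcode sequence" idea concrete, the following is a minimal sketch of how disassembly text might be reduced to mnemonics and encoded as integer tokens for an NLP-style model. It is not the authors' pipeline, which this summary does not describe. The objdump-style input layout, the function names (`extract_mnemonics`, `build_vocab`, `encode`), and the `<pad>`/`<unk>` tokens are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical objdump-style disassembly: "address:<TAB>raw bytes<TAB>mnemonic operands".
SAMPLE_DISASSEMBLY = """\
  401000:\t55                   \tpush   %rbp
  401001:\t48 89 e5             \tmov    %rsp,%rbp
  401004:\t31 c0                \txor    %eax,%eax
  401006:\te8 15 00 00 00       \tcall   401020
  40100b:\t5d                   \tpop    %rbp
  40100c:\tc3                   \tret
"""


def extract_mnemonics(disassembly: str) -> list[str]:
    """Keep only the instruction mnemonic from each line, dropping operands."""
    mnemonics = []
    for line in disassembly.splitlines():
        fields = line.split("\t")
        # Skip lines without an instruction field (labels, blank lines, etc.).
        if len(fields) < 3 or not fields[2].strip():
            continue
        mnemonics.append(fields[2].split()[0].lower())
    return mnemonics


def build_vocab(sequences: list[list[str]], min_count: int = 1) -> dict[str, int]:
    """Map each mnemonic to an integer id; ids 0 and 1 are reserved (assumed convention)."""
    counts = Counter(m for seq in sequences for m in seq)
    vocab = {"<pad>": 0, "<unk>": 1}
    # Most frequent first, ties broken alphabetically, so ids are deterministic.
    for mnemonic, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[mnemonic] = len(vocab)
    return vocab


def encode(seq: list[str], vocab: dict[str, int], max_len: int) -> list[int]:
    """Truncate or right-pad a mnemonic sequence to a fixed-length id sequence."""
    ids = [vocab.get(m, vocab["<unk>"]) for m in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


sequence = extract_mnemonics(SAMPLE_DISASSEMBLY)
vocab = build_vocab([sequence])
token_ids = encode(sequence, vocab, max_len=8)
```

Fixed-length id sequences like `token_ids` are the kind of input an embedding layer or other sequence model could consume. Which embedding or language-modeling technique the paper actually uses is not stated in this summary.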

Abstract

Malware attacks have a significant negative impact on organizations of all sizes. Recently, malware researchers have increasingly turned to machine learning techniques to combat the sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is further compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences and leverages natural language processing to extract the contextual meaning behind malware opcode features. This aids the learning of the generative AI (GenAI) models employed in this paper: Generative Adversarial Networks (GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), and a modified Diffusion model. The experimental results show that augmenting training data with Diffusion-based synthetic data significantly improves classification performance for minority classes by up to 60% on average. This enhancement ultimately leads to an overall malware classification performance of 96%, an 8% improvement. These findings demonstrate the high quality and fidelity of the synthetic data, its robustness, and its potential applications in malware analysis. Specifically, synthetic malware data proves effective in improving the classification of minority malware classes and detection rates, even when the amount of known malware data is very small.
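
For readers unfamiliar with WGAN-GP, the critic objective in its standard formulation is reproduced below for reference. The abstract does not say whether the authors change this loss, so it should be read as background rather than as the paper's exact training objective. Here $D$ is the critic, $P_r$ and $P_g$ are the real and generated data distributions, and $\lambda$ is the gradient-penalty weight.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The penalty term pushes the critic's gradient norm toward 1 at points interpolated between real and generated samples. This replaces the weight clipping used in the original WGAN.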