EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection

arXiv stat.ML / 5/1/2026

Key Points

Imbalanced fraud datasets often lead to biased classifiers, so the paper introduces EmDT to generate synthetic fraudulent transactions as a mitigation strategy.
EmDT uses UMAP clustering to identify distinct fraudulent patterns, then trains a diffusion model with a Transformer denoising network using sinusoidal positional embeddings to learn feature relationships during generation.
After generating synthetic samples, the method trains a standard decision-tree-based classifier (such as XGBoost) for the final fraud prediction task, since this model family remains well suited to tabular data.
Experiments on a credit card fraud dataset show that EmDT improves downstream classification performance over prior oversampling and generative approaches while keeping privacy protection comparable and preserving original feature correlations.
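
To make the "sinusoidal positional embeddings" point concrete, here is a minimal NumPy sketch of the standard sinusoidal table from the original Transformer formulation. It assumes, for illustration only, that each tabular feature (column) is treated as one token position. The function name `sinusoidal_embedding`, the column count of 30, and the embedding width of 16 are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def sinusoidal_embedding(num_positions: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Return a (num_positions, dim) table of fixed sinusoidal embeddings.

    Even columns hold sin(pos * freq), odd columns hold cos(pos * freq),
    with frequencies decaying geometrically from 1 down to roughly 1/base.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    positions = np.arange(num_positions, dtype=np.float64)[:, None]      # (P, 1)
    exponents = np.arange(0, dim, 2, dtype=np.float64) / dim             # (dim/2,)
    freqs = np.exp(-np.log(base) * exponents)                            # (dim/2,)
    angles = positions * freqs[None, :]                                  # (P, dim/2)

    table = np.empty((num_positions, dim), dtype=np.float64)
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

# Assumption: 30 feature columns, each treated as one position (hypothetical).
pos_table = sinusoidal_embedding(num_positions=30, dim=16)
```

In a Transformer denoiser, a table like this would typically be added to the per-feature token embeddings so that attention can distinguish columns. How EmDT combines it with the feature values is not stated in this summary.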

Abstract

Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this problem. In this work, we propose the Clustered Embedding Diffusion-Transformer (EmDT), a diffusion model designed to generate fraudulent samples. Our key innovation is to leverage UMAP clustering to identify distinct fraudulent patterns, and train a Transformer denoising network with sinusoidal positional embeddings to capture feature relationships throughout the diffusion process. Once the synthetic data has been generated, we employ a standard decision-tree-based classifier (e.g., XGBoost) for classification, as this type of model remains better suited to tabular datasets. Experiments on a credit card fraud detection dataset demonstrate that EmDT significantly improves downstream classification performance compared to existing oversampling and generative methods, while maintaining comparable privacy protection and preserving feature correlations present in the original data.
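
The abstract does not spell out the diffusion formulation, so the following is only a sketch of the forward (noising) step as it appears in standard DDPM-style models, which a denoising network is then trained to invert. The linear beta schedule, the 1000 steps, the random stand-in data, and the names `linear_beta_schedule` and `q_sample` are assumptions made for illustration, not details confirmed by the paper.

```python
import numpy as np

def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> np.ndarray:
    """Assumed linear variance schedule (a common DDPM default)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0: np.ndarray, t: np.ndarray, alpha_bar: np.ndarray, rng: np.random.Generator):
    """Noise clean rows x0 to timestep t in closed form.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I).
    Returns the noised rows and the noise that was added (the usual training target).
    """
    eps = rng.standard_normal(x0.shape)
    signal_scale = np.sqrt(alpha_bar[t])[:, None]        # (batch, 1)
    noise_scale = np.sqrt(1.0 - alpha_bar[t])[:, None]   # (batch, 1)
    return signal_scale * x0 + noise_scale * eps, eps

rng = np.random.default_rng(0)
num_steps = 1000
betas = linear_beta_schedule(num_steps)
alpha_bar = np.cumprod(1.0 - betas)                      # cumulative signal retention

# Stand-in for a batch of standardized minority-class rows (hypothetical data).
x0 = rng.standard_normal((8, 30))
t = rng.integers(0, num_steps, size=8)                   # one random timestep per row
x_t, eps = q_sample(x0, t, alpha_bar, rng)
```

During training, a denoiser would receive `x_t` and `t` and be optimized to predict `eps`. Sampling new synthetic rows runs the learned reverse process starting from pure noise.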