Synthetic Tabular Generators Fail to Preserve Behavioral Fraud Patterns: A Benchmark on Temporal, Velocity, and Multi-Account Signals

arXiv cs.LG / 4/16/2026


Key Points

  • The paper proposes “behavioral fidelity” as a new evaluation dimension for synthetic tabular data, focusing on whether generators preserve temporal, sequential, and structural fraud signals used in real detection systems.
  • It defines four behavioral fraud pattern types (P1–P4) including inter-event timing, burst structure, multi-account graph motifs, and velocity-rule trigger rates, along with a degradation-ratio metric calibrated to a real-data noise floor.
  • The authors prove that row-independent synthetic generators cannot reproduce multi-account graph motifs (P3) and can yield only non-positive within-entity inter-event-time autocorrelation, implying that the core burst fingerprints of fraud are unattainable regardless of model architecture or training-data size.
  • Benchmarks on IEEE-CIS Fraud Detection and the Amazon Fraud Dataset show multiple popular generators (CTGAN, TVAE, GaussianCopula, TabularARGN) fail badly, with degradation ratios up to ~39x on IEEE-CIS and 81.6–99.7x for row-independent methods on Amazon, while TabularARGN performs better (17.2x) but still degrades substantially.
  • The work releases an open-source evaluation framework and claims the P1–P4 behavioral-pattern framework generalizes to other domains with entity-level sequential tabular data (e.g., healthcare and network security).
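
The degradation-ratio metric above compares a behavioral statistic measured on synthetic data against the variability of that same statistic on real data. The paper's exact formula is not given here, so the sketch below is a hypothetical reading of "calibrated to a real-data noise floor": the noise floor is estimated as the bootstrap spread of the statistic on real data, and the ratio reports how many noise-floor units the synthetic value deviates by (0 = identical, ~1 = within real variability, k = k-times worse).

```python
import numpy as np

def degradation_ratio(real_samples, synthetic_value, rng=None, n_boot=1000):
    """Hypothetical sketch of a noise-floor-calibrated degradation ratio.

    The statistic here is the mean; the paper's P1-P4 statistics
    (IET quantiles, burst rates, motif counts, rule trigger rates)
    would be substituted in practice.
    """
    rng = np.random.default_rng(rng)
    real_samples = np.asarray(real_samples, dtype=float)
    # Noise floor: spread of the statistic across bootstrap resamples
    # of the real data alone.
    boots = np.array([
        rng.choice(real_samples, size=real_samples.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    noise_floor = boots.std(ddof=1)
    # How many noise-floor units the synthetic statistic is off by.
    return abs(synthetic_value - real_samples.mean()) / noise_floor
```

Under this reading, a generator that reproduces the statistic exactly scores 0, and a score of 39 means the synthetic statistic sits 39 real-data noise-floor units away from the real value.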

Abstract

We introduce behavioral fidelity -- a third evaluation dimension for synthetic tabular data that measures whether generated data preserves the temporal, sequential, and structural behavioral patterns that distinguish real-world entity activity. Existing frameworks evaluate statistical fidelity (marginal distributions and correlations) and downstream utility (classifier AUROC on synthetic-trained models), but neither tests for the behavioral signals that operational detection and analysis systems actually rely on. We formalize a taxonomy of four behavioral fraud patterns (P1-P4) covering inter-event timing, burst structure, multi-account graph motifs, and velocity-rule trigger rates; define a degradation ratio metric calibrated to a real-data noise floor (1.0 = matches real variability, k = k-times worse); and prove that row-independent generators -- the dominant paradigm -- are structurally incapable of reproducing P3 graph motifs (Proposition 1) and produce non-positive within-entity IET autocorrelation (Proposition 2), making the positive burst fingerprint of fraud sequences unachievable regardless of architecture or training data size. We benchmark CTGAN, TVAE, GaussianCopula, and TabularARGN on IEEE-CIS Fraud Detection and the Amazon Fraud Dataset. All four fail severely: on IEEE-CIS composite degradation ratios range from 24.4x (TVAE) to 39.0x (GaussianCopula); on Amazon FDB, row-independent generators score 81.6-99.7x, while TabularARGN achieves 17.2x. We document generator-specific failure modes and their resolutions. The P1-P4 framework extends to any domain with entity-level sequential tabular data, including healthcare and network security. We release our evaluation framework as open source.
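
The intuition behind Proposition 2 can be checked numerically. The simulation below is a minimal sketch, not the authors' code: the uniform-timestamp model of a row-independent generator and the two-regime exponential model of a bursty entity are illustrative assumptions. When rows are sampled independently, an entity's sorted timestamps are order statistics, and the lag-1 autocorrelation of the resulting inter-event times (IETs) is non-positive on average; a bursty sequence, with runs of short gaps followed by long idle gaps, shows the positive IET autocorrelation that real fraud sequences exhibit.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D sequence."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = (x * x).sum()
    return (x[:-1] * x[1:]).sum() / denom if denom > 0 else 0.0

rng = np.random.default_rng(42)

# Row-independent "generator": each row's timestamp is drawn i.i.d.;
# within an entity we sort timestamps and take inter-event times.
iid_acs = []
for _ in range(2000):
    ts = np.sort(rng.uniform(0.0, 1.0, size=50))
    iid_acs.append(lag1_autocorr(np.diff(ts)))

# Bursty entity: a run of short gaps (burst) then long gaps (idle),
# producing runs of similar IETs and hence positive autocorrelation.
burst_acs = []
for _ in range(2000):
    gaps = np.concatenate([rng.exponential(0.1, 25), rng.exponential(5.0, 25)])
    burst_acs.append(lag1_autocorr(gaps))

print(f"i.i.d. rows:  mean lag-1 IET autocorr = {np.mean(iid_acs):+.3f}")
print(f"bursty rows:  mean lag-1 IET autocorr = {np.mean(burst_acs):+.3f}")
```

No amount of extra training data changes the first result: the non-positive autocorrelation follows from the row-independence assumption itself, which is the structural point of Proposition 2.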