When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift

arXiv cs.LG / 4/22/2026


Key Points

  • A new arXiv study re-tests, under a leakage-free evaluation protocol, the widely cited claim that graph neural networks (GCN, GraphSAGE, GAT, EvolveGCN) outperform feature-only models for Bitcoin fraud detection on the Elliptic dataset.
  • In a seed-matched inductive-versus-transductive comparison, Random Forest on raw features achieves the best F1 score under the strictly inductive protocol (0.821), outperforming all evaluated GNNs; GraphSAGE reaches 0.689 ± 0.017.
  • A paired controlled experiment attributes a 39.5-point F1 gap to training-time exposure to test-period adjacency, a form of evaluation leakage.
  • Additional edge-shuffle ablations show that randomly rewired graphs can outperform the real transaction graph under temporal distribution shift, suggesting the dataset’s graph topology may be misleading.
  • Hybrid approaches that combine GNN embeddings with raw features yield only marginal improvements and still fall well below feature-only baselines.
  • The paper releases code, checkpoints, and a strict-inductive protocol for reproducible, leakage-free evaluation.
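
To make the leakage concern concrete, here is a minimal sketch of one way to restrict a training graph to training-period edges. It is not the authors' released protocol. The function name `inductive_train_edges`, the `last_train_step` cutoff, and the toy graph are all hypothetical, and the edge list is assumed to be a 2 × E integer array with one time step per node.

```python
import numpy as np

def inductive_train_edges(edge_index, node_time, last_train_step):
    """Keep only edges whose two endpoints both lie in the training period.

    edge_index: (2, E) integer array of (source, destination) node ids.
    node_time:  (N,) integer array giving each node's time step.
    last_train_step: hypothetical cutoff; nodes after it count as test-period.
    """
    src, dst = edge_index
    keep = (node_time[src] <= last_train_step) & (node_time[dst] <= last_train_step)
    return edge_index[:, keep]

# Toy graph (illustrative only): 6 nodes over time steps 1..3,
# with steps 1-2 treated as training and step 3 as test.
node_time = np.array([1, 1, 2, 2, 3, 3])
edge_index = np.array([[0, 2, 3, 4],
                       [1, 3, 4, 5]])

train_edges = inductive_train_edges(edge_index, node_time, last_train_step=2)
# Edges 3->4 and 4->5 touch a test-period node, so they are dropped.
```

A transductive setup would instead pass the full `edge_index` to the model during training, which is the kind of exposure to test-period adjacency the study flags.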

Abstract

The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 ± 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
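
The abstract does not say how the edge-shuffle ablation rewires the graph, so the following is only one plausible variant and not the authors' procedure. It permutes the destination endpoints of a 2 × E edge array, which keeps the edge count and every node's in- and out-degree but breaks the real pairing between senders and receivers. The function name `shuffle_edges` and the `seed` parameter are hypothetical, and this simple version may produce self-loops or duplicate edges.

```python
import numpy as np

def shuffle_edges(edge_index, seed=0):
    """Return a randomly rewired copy of edge_index (assumed shape (2, E)).

    Sources stay in place and destinations are permuted, so per-node
    in-degree and out-degree are preserved while the true
    source-to-destination pairing is destroyed.
    """
    rng = np.random.default_rng(seed)
    src, dst = edge_index
    return np.stack([src, rng.permutation(dst)])

# Toy edge list (illustrative only).
edge_index = np.array([[0, 0, 1, 2, 3],
                       [1, 2, 3, 4, 4]])
shuffled = shuffle_edges(edge_index, seed=42)
```

Training the same GNN once on `edge_index` and once on `shuffled`, with everything else held fixed, gives the kind of comparison the ablation describes: if the rewired graph scores as well or better, the real topology is not supplying useful signal.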