Self-Reinforcing Controllable Synthesis of Rare Relational Data via Bayesian Calibration

arXiv cs.LG · April 21, 2026


Key Points

  • The paper proposes RDDG (Relational Data Generator with Dynamic Guidance) to synthesize relational/structured tabular data for improving downstream performance on imbalanced classification tasks.
  • RDDG uses a two-stage process: core set selection to pick representative samples, followed by in-context learning to infer attribute patterns and correlations from that core set.
  • It generates new tabular data while preserving the constraints implied by the original relational structure and the properties targeted for the downstream task.
  • A key contribution is a self-reinforcing feedback mechanism that automatically evaluates the generated data quality and iteratively guides the generation process toward continuous improvement.
  • Experiments across multiple real and synthetic datasets show RDDG achieves better data fidelity and stronger gains in downstream imbalanced classification than prior methods, and the authors release code on GitHub.
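The paper does not spell out which core set selection algorithm RDDG uses, but the idea of picking a small set of representative samples can be illustrated with a standard coreset heuristic such as greedy k-center selection (an illustrative stand-in, not the paper's implementation):

```python
import numpy as np

def kcenter_greedy(X, k, seed=0):
    """Pick k representative row indices via greedy k-center selection.

    A common coreset heuristic: repeatedly add the point farthest from
    the current core set. Illustrative only; RDDG's actual selection
    method is not specified in this summary.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]
    # distance of every point to its nearest selected center
    dists = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dists))  # farthest point from the core set so far
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Toy usage: 100 points drawn around 5 well-separated cluster centers;
# a 5-point core set should land in 5 distinct clusters.
rng = np.random.default_rng(1)
centers = [(0, 0), (0, 5), (5, 0), (5, 5), (2.5, 2.5)]
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in centers])
core = kcenter_greedy(X, 5)
```

The selected rows would then serve as the in-context exemplars from which the LLM infers attribute patterns and correlations.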

Abstract

Imbalanced data is commonly present in real-world applications. While data synthesis can effectively mitigate the data-scarcity problem for rare classes, and LLMs have revolutionized text generation, the application of LLMs to relational/structured tabular data synthesis remains underexplored. Moreover, existing approaches lack an effective feedback mechanism that can guide LLMs toward continuously optimizing the quality of the generated data throughout the synthesis process. In this work, we propose RDDG (Relational Data Generator with Dynamic Guidance), a unified in-context learning framework that employs progressive chain-of-thought (CoT) steps to generate tabular data for enhancing downstream imbalanced classification performance. RDDG first uses core set selection to identify representative samples from the original data, then utilizes in-context learning to discover the inherent patterns and correlations among attributes within the core set, and subsequently generates tabular data while preserving the aforementioned constraints. More importantly, it incorporates a self-reinforcing feedback mechanism that provides automatic assessments of the quality of the generated data, enabling continuous quality optimization throughout the generation process. Experimental results on multiple real and synthetic datasets demonstrate that RDDG outperforms existing approaches in both data fidelity and downstream imbalanced classification performance. We make our code available at https://github.com/cszhangLMU/RDDG.
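The self-reinforcing feedback mechanism is the abstract's central claim: generated samples are automatically scored, and the assessment steers the next round of generation. The exact scoring rule is not given here, so the sketch below shows only the generic generate-evaluate-refine loop shape, with a hypothetical `score` callable standing in for RDDG's automatic quality assessment:

```python
import random

def feedback_loop(generate, score, rounds=5, batch=20, seed=0):
    """Generic generate-evaluate-refine loop.

    Each round, a batch is generated under the current guidance, scored,
    and split: high scorers become exemplars to imitate, low scorers
    become patterns to avoid. Illustrative of the loop structure only;
    RDDG's actual guidance and scoring are not specified in this summary.
    """
    rng = random.Random(seed)
    guidance = {"good": [], "bad": []}
    kept = []
    for _ in range(rounds):
        samples = [generate(rng, guidance) for _ in range(batch)]
        scored = sorted(samples, key=score, reverse=True)
        half = batch // 2
        guidance["good"] = scored[:half]   # exemplars to imitate next round
        guidance["bad"] = scored[half:]    # low-quality patterns to avoid
        kept.extend(s for s in scored[:half] if score(s) > 0.5)
    return kept

# Toy instantiation (hypothetical): "quality" peaks at value 1.0, and the
# generator drifts toward the mean of the current good exemplars.
def toy_score(x):
    return max(0.0, 1.0 - abs(x - 1.0))

def toy_generate(rng, guidance):
    good = guidance["good"]
    base = sum(good) / len(good) if good else 0.0
    return base + rng.uniform(-0.5, 0.5)

kept = feedback_loop(toy_generate, toy_score)
```

Under this toy setup, early rounds produce low-quality samples, and the guidance pulls later rounds toward the high-scoring region, mirroring the "continuous quality optimization" the abstract describes.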