Abstract
Non-asymptotic central limit theorem (CLT) rates play a central role in modern machine learning and operations research. In this paper, we study CLT rates for multivariate dependent data in Wasserstein-p (W_p) distance, for general p\ge 1. We focus on two fundamental dependence structures that commonly arise in practice: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal \mathcal O(n^{-1/2}) rate in W_1, as well as the first W_p (p\ge 2) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal W_1 rate for locally dependent sequences, we further obtain the first optimal W_1-CLT rate for multivariate U-statistics.
On the technical side, we derive a tractable auxiliary bound for W_1 Gaussian approximation errors that is well suited to studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail, without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interest; they enable our optimal W_1 rates and underpin our W_p (p\ge 2) results.