Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

arXiv cs.LG / 5/5/2026


Key Points

  • The paper introduces FastSinkhorn, a native CUDA implementation of the log-domain Sinkhorn algorithm for entropic regularized optimal transport (OT).
  • By performing all computations in the log domain and using warp-level shuffle reductions plus shared-memory tiling, the method remains numerically stable even for small regularization parameters (as low as ε = 10⁻⁴).
  • Benchmarks on dense OT problems (n = m = 8192) show 12× speedups over the POT library and 5.9× over GPU-accelerated PyTorch baselines, while using only 256 MB of GPU memory.
  • The authors validate the approach on applications such as image color transfer and 3D point cloud matching, and include a convergence analysis supporting its practicality for large-scale OT.
  • Overall, the work argues that carefully engineered native GPU kernels can significantly reduce framework overhead while maintaining stable log-domain OT computations.
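To make the key points concrete, the log-domain Sinkhorn iteration updates dual potentials f and g with log-sum-exp reductions instead of multiplying a Gibbs kernel. The sketch below is a minimal plain-Python rendering of that standard iteration, not the paper's CUDA code; the function name `sinkhorn_log` and the dense-list representation are illustrative choices.

```python
import math

def logsumexp(xs):
    # Numerically stable log-sum-exp: factor out the max before exponentiating,
    # so no term overflows and at least one term is exp(0) = 1.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sinkhorn_log(C, a, b, eps, iters=100):
    """Log-domain Sinkhorn for entropic OT.

    C: n x m cost matrix (list of lists), a/b: source/target marginals,
    eps: entropic regularization. Returns the transport plan P.
    """
    n, m = len(a), len(b)
    f = [0.0] * n  # dual potential on the source side
    g = [0.0] * m  # dual potential on the target side
    for _ in range(iters):
        # f-update: f_i = eps*log a_i - eps*LSE_j((g_j - C_ij)/eps)
        for i in range(n):
            f[i] = eps * math.log(a[i]) - eps * logsumexp(
                [(g[j] - C[i][j]) / eps for j in range(m)])
        # g-update: g_j = eps*log b_j - eps*LSE_i((f_i - C_ij)/eps)
        for j in range(m):
            g[j] = eps * math.log(b[j]) - eps * logsumexp(
                [(f[i] - C[i][j]) / eps for i in range(n)])
    # Recover the plan: P_ij = exp((f_i + g_j - C_ij) / eps)
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(m)]
            for i in range(n)]
```

On a toy 2×2 problem with uniform marginals and cost matrix [[0, 1], [1, 0]], the plan concentrates on the zero-cost diagonal as ε shrinks; in the paper's setting, the inner log-sum-exp reductions are what the warp-level shuffles and shared-memory tiles accelerate.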

Abstract

Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log domain, enabling robust computation for regularization parameters as small as ε = 10⁻⁴, where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves a 12× speedup over the widely used POT library and a 5.9× speedup over GPU-accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image color transfer, 3D point cloud matching, and convergence analysis, demonstrating that native CUDA kernels with careful numerical treatment provide a practical and efficient foundation for large-scale optimal transport computation.
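To see why standard-domain methods fail at small ε: the Gibbs kernel K = exp(−C/ε) underflows to exactly zero in double precision once C/ε exceeds roughly 745, so at ε = 10⁻⁴ any cost above ~0.07 vanishes and Sinkhorn's row/column normalizations divide by zero. A log-sum-exp (stable softmax) formulation sidesteps this by never exponentiating large negative values directly. The demo below is illustrative, not from the paper:

```python
import math

eps = 1e-4
costs = [0.3, 0.5, 1.0]  # one row of a hypothetical cost matrix

# Standard domain: every Gibbs kernel entry exp(-c/eps) underflows to 0.0
# in float64 (exp of anything below about -745 is flushed to zero), so
# normalizing this row would divide by zero.
standard = [math.exp(-c / eps) for c in costs]

# Log domain: keep -c/eps as-is and normalize via a stable softmax,
# subtracting the max so the largest term is exp(0) = 1.
logs = [-c / eps for c in costs]
m = max(logs)
weights = [math.exp(x - m) for x in logs]
norm = sum(weights)
softmax = [w / norm for w in weights]
```

The standard-domain row collapses to all zeros, while the log-domain softmax still recovers the correct relative weights (here, essentially all mass on the cheapest entry); this is the failure mode the abstract's "where standard-domain methods fail" refers to.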