Fast Log-Domain Sinkhorn Optimal Transport with Warp-Level GPU Reductions

arXiv cs.LG / 5/5/2026


Key Points

  • The paper introduces FastSinkhorn, a native CUDA implementation of the log-domain Sinkhorn algorithm for entropic regularized optimal transport (OT).
  • By performing all computations in the log domain and using warp-level shuffle reductions plus shared-memory tiling, the method remains numerically stable even for small regularization parameters (as low as ε = 10⁻⁴).
  • Benchmarks on dense OT problems (n = m = 8192) show 12× speedups over the POT library and 5.9× over GPU-accelerated PyTorch baselines, while using only 256 MB of GPU memory.
  • The authors validate the approach on applications such as image color transfer and 3D point cloud matching, and include a convergence analysis supporting its practicality for large-scale OT.
  • Overall, the work argues that carefully engineered native GPU kernels can significantly reduce framework overhead while maintaining stable log-domain OT computations.
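To make the key points concrete, the log-domain Sinkhorn iteration updates dual potentials f and g with log-sum-exp reductions instead of multiplying a Gibbs kernel. The sketch below is a minimal plain-Python rendering of that standard iteration, not the paper's CUDA code; the function name `sinkhorn_log` and the dense-list representation are illustrative choices.

```python
import math

def logsumexp(xs):
    # Numerically stable log-sum-exp: factor out the max before exponentiating,
    # so no term overflows and at least one term is exp(0) = 1.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sinkhorn_log(C, a, b, eps, iters=100):
    """Log-domain Sinkhorn for entropic OT.

    C: n x m cost matrix (list of lists), a/b: source/target marginals,
    eps: entropic regularization. Returns the transport plan P.
    """
    n, m = len(a), len(b)
    f = [0.0] * n  # dual potential on the source side
    g = [0.0] * m  # dual potential on the target side
    for _ in range(iters):
        # f-update: f_i = eps*log a_i - eps*LSE_j((g_j - C_ij)/eps)
        for i in range(n):
            f[i] = eps * math.log(a[i]) - eps * logsumexp(
                [(g[j] - C[i][j]) / eps for j in range(m)])
        # g-update: g_j = eps*log b_j - eps*LSE_i((f_i - C_ij)/eps)
        for j in range(m):
            g[j] = eps * math.log(b[j]) - eps * logsumexp(
                [(f[i] - C[i][j]) / eps for i in range(n)])
    # Recover the plan: P_ij = exp((f_i + g_j - C_ij) / eps)
    return [[math.exp((f[i] + g[j] - C[i][j]) / eps) for j in range(m)]
            for i in range(n)]
```

On a toy 2×2 problem with uniform marginals and cost matrix [[0, 1], [1, 0]], the plan concentrates on the zero-cost diagonal as ε shrinks; in the paper's setting, the inner log-sum-exp reductions are what the warp-level shuffles and shared-memory tiles accelerate.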

Abstract

Entropic regularized optimal transport (OT) via the Sinkhorn algorithm has become a fundamental tool in machine learning, yet existing implementations either suffer from numerical instability for small regularization parameters or incur significant overhead from deep learning frameworks. We present FastSinkhorn, a lightweight, native CUDA implementation of the log-domain Sinkhorn algorithm that combines warp-level shuffle reductions with shared-memory tiling to achieve high GPU utilization without sacrificing numerical stability. Our solver operates entirely in the log domain, enabling robust computation for regularization parameters as small as ε = 10⁻⁴, where standard-domain methods fail. On dense OT problems with n = m = 8192, our implementation achieves a 12× speedup over the widely used POT library and a 5.9× speedup over GPU-accelerated PyTorch baselines, while consuming only 256 MB of GPU memory. We validate our solver on image color transfer, 3D point cloud matching, and convergence analysis, demonstrating that native CUDA kernels with careful numerical treatment provide a practical and efficient foundation for large-scale optimal transport computation.
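To see why standard-domain methods fail at small ε: the Gibbs kernel K = exp(−C/ε) underflows to exactly zero in double precision once C/ε exceeds roughly 745, so at ε = 10⁻⁴ any cost above ~0.07 vanishes and Sinkhorn's row/column normalizations divide by zero. A log-sum-exp (stable softmax) formulation sidesteps this by never exponentiating large negative values directly. The demo below is illustrative, not from the paper:

```python
import math

eps = 1e-4
costs = [0.3, 0.5, 1.0]  # one row of a hypothetical cost matrix

# Standard domain: every Gibbs kernel entry exp(-c/eps) underflows to 0.0
# in float64 (exp of anything below about -745 is flushed to zero), so
# normalizing this row would divide by zero.
standard = [math.exp(-c / eps) for c in costs]

# Log domain: keep -c/eps as-is and normalize via a stable softmax,
# subtracting the max so the largest term is exp(0) = 1.
logs = [-c / eps for c in costs]
m = max(logs)
weights = [math.exp(x - m) for x in logs]
norm = sum(weights)
softmax = [w / norm for w in weights]
```

The standard-domain row collapses to all zeros, while the log-domain softmax still recovers the correct relative weights (here, essentially all mass on the cheapest entry); this is the failure mode the abstract's "where standard-domain methods fail" refers to.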