[P] Volga - Data Engine for Real-Time AI/ML

Reddit r/MachineLearning / 3/19/2026

Key Points

Volga is an open-source data engine for real-time AI/ML pipelines, positioned as a modern alternative to Flink, Spark, and Arroyo.
The project has been rewritten from a Python+Ray prototype to a native Rust core to deliver a standalone runtime without the traditional JVM infrastructure tax.
Built on Apache DataFusion and Apache Arrow, Volga offers a unified runtime for streaming, batch, and request-time compute tailored to AI/ML data workflows.
It introduces SQL-based pipelines built on an extended DataFusion planner, remote state storage (an LSM-tree on S3 via SlateDB), ML-specific aggregations (topk, _cate, and _where), and long-window tiling.
The author also shared a technical deep dive and links to the GitHub repo for Volga, enabling interested developers to review the design and contribute.

Hi all, wanted to share the project I've been working on:

Volga is an open-source data engine for real-time AI/ML. In short, it is a Flink/Spark/Arroyo alternative tailored for AI/ML pipelines, similar to systems like Chronon and OpenMLDB.

I’ve recently completed a full rewrite of the system, moving from a Python+Ray prototype to a native Rust core. The goal was to build a truly standalone runtime that eliminates the "infrastructure tax" of traditional JVM-based stacks.

Volga is built with Apache DataFusion and Arrow, providing a unified, standalone runtime for streaming, batch, and request-time compute specific to AI/ML data pipelines. It removes the need to stitch together separate systems (Flink + Spark + Redis + custom services).

Key Architectural Features:

SQL-based Pipelines: Powered by Apache DataFusion (extending its planner for distributed streaming).
Remote State Storage: LSM-Tree-on-S3 via SlateDB for true compute-storage separation. This enables near-instant rescaling and cheap checkpoints compared to local-state engines.
Unified Streaming + Batch: Consistent watermark-based execution for real-time and backfills via Apache Arrow.
Request Mode: Point-in-time correct queryable state to serve features directly within the dataflow (no external KV/serving workers).
ML-Specific Aggregations: Native support for topk, _cate, and _where functions.
Long-Window Tiling: Optimized sliding windows over weeks or months.
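
To make the long-window tiling item more concrete, here is a minimal Python sketch of the general technique: events are pre-aggregated into fixed-size tiles, and a sliding-window query combines a handful of tile partials instead of rescanning raw events. This is not Volga's implementation or API. The class name `TiledSlidingSum`, its parameters, and the tile-granularity window semantics are all assumptions made for illustration.

```python
class TiledSlidingSum:
    """Sliding-window sum backed by per-tile partial sums (illustrative only).

    Hypothetical helper, not part of Volga. The window is resolved at tile
    granularity: a query covers the current tile plus the preceding tiles
    needed to span `window_ms`.
    """

    def __init__(self, window_ms: int, tile_ms: int):
        if window_ms <= 0 or tile_ms <= 0 or window_ms % tile_ms != 0:
            raise ValueError("window_ms must be a positive multiple of tile_ms")
        self.window_ms = window_ms
        self.tile_ms = tile_ms
        self.tiles: dict[int, float] = {}  # tile start timestamp -> partial sum

    def _tile_start(self, ts_ms: int) -> int:
        # Align a timestamp down to the start of its tile.
        return ts_ms - ts_ms % self.tile_ms

    def add(self, ts_ms: int, value: float) -> None:
        # Fold the event into its tile; the raw event is not retained.
        start = self._tile_start(ts_ms)
        self.tiles[start] = self.tiles.get(start, 0.0) + value

    def query(self, now_ms: int) -> float:
        # Combine window_ms / tile_ms partial sums ending at the current tile.
        current = self._tile_start(now_ms)
        oldest = current - self.window_ms + self.tile_ms
        return sum(
            self.tiles.get(start, 0.0)
            for start in range(oldest, current + 1, self.tile_ms)
        )

    def evict(self, now_ms: int) -> None:
        # Drop tiles that can no longer fall inside any window ending at now_ms or later.
        oldest = self._tile_start(now_ms) - self.window_ms + self.tile_ms
        for start in [s for s in self.tiles if s < oldest]:
            del self.tiles[start]


if __name__ == "__main__":
    hour = 60 * 60 * 1000
    window = TiledSlidingSum(window_ms=30 * 24 * hour, tile_ms=hour)
    window.add(ts_ms=5 * hour + 10, value=2.0)
    window.add(ts_ms=5 * hour + 20, value=3.0)
    window.add(ts_ms=7 * hour, value=1.5)
    print(window.query(now_ms=8 * hour))  # reads 720 tile slots, not every raw event
```

With one-hour tiles, a 30-day window touches at most 720 partial sums per key regardless of event volume. The trade-off in this sketch is that the window edge is only accurate to one tile; a production engine would likely mix tile sizes or keep raw events for the boundary tiles.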

I wrote a detailed architectural deep dive on the transition to Rust, how we extended DataFusion for streaming, and a comparison with existing systems in the space:

Technical Deep Dive: https://volgaai.substack.com/p/volga-a-rust-rewrite-of-a-real-time
GitHub: https://github.com/volga-project/volga

submitted by /u/saws_baws_228