Anomaly Detection Belongs in Your Database — built SIMD-accelerated isolation forests into Stratum's SQL engine [P]

Reddit r/MachineLearning / 5/6/2026


Key Points

  • Stratum, a columnar analytics SQL engine for the JVM, now supports native anomaly detection by training and scoring isolation forest models directly from SQL.
  • Users can run queries like filtering transactions by an ANOMALY_SCORE threshold without using Python or building an external export/scoring pipeline.
  • The implementation claims SIMD acceleration via Java’s Vector API, achieving around 6 microseconds per transaction and integrating scoring into the query execution path.
  • The write-up explains why anomaly detection belongs in the database and provides benchmarks comparing the approach against PyOD and scikit-learn.
  • Stratum is open source under Apache 2.0, and the post highlights performance benefits from optimizations such as zone map pruning and chunked streaming during execution.

We added native anomaly detection to Stratum, our columnar analytics engine for the JVM. You can train and score isolation forest models entirely from SQL, with no Python and no export pipeline:

SELECT * FROM transactions WHERE ANOMALY_SCORE('fraud_model') > 0.7; 

Scoring takes 6 microseconds per transaction, is SIMD-accelerated, and runs inside the query engine. The full write-up covers why we built it, how isolation forests work, and benchmarks against PyOD/scikit-learn:

https://datahike.io/notes/anomaly-detection-in-your-database/
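
For readers who want a feel for what a threshold like 0.7 means, here is a minimal sketch of the standard isolation forest scoring formula, s = 2^(-E[h] / c(n)), where E[h] is a row's mean path length across the trees and c(n) is the expected path length for a subsample of n rows. This is the textbook formulation, not Stratum's code; the class and method names are hypothetical, and whether ANOMALY_SCORE uses exactly this normalization is an assumption.

```java
// Hypothetical sketch of textbook isolation forest scoring; not Stratum's code.
final class IsolationScoreSketch {
    private static final double EULER_GAMMA = 0.5772156649015329;

    /** c(n): average path length of an unsuccessful binary search tree lookup over n points. */
    static double averagePathLength(int n) {
        if (n <= 1) return 0.0;
        if (n == 2) return 1.0;
        // Harmonic number H(n-1), approximated by ln(n-1) + Euler's constant.
        double harmonic = Math.log(n - 1) + EULER_GAMMA;
        return 2.0 * harmonic - 2.0 * (n - 1) / (double) n;
    }

    /** s = 2^(-E[h] / c(n)): near 1 means easy to isolate (anomalous), near 0.5 means typical. */
    static double anomalyScore(double meanPathLength, int sampleSize) {
        if (sampleSize < 2) {
            throw new IllegalArgumentException("sampleSize must be at least 2");
        }
        return Math.pow(2.0, -meanPathLength / averagePathLength(sampleSize));
    }

    public static void main(String[] args) {
        int sampleSize = 256; // a commonly used subsample size; assumed here
        double typical = averagePathLength(sampleSize);
        System.out.printf("c(%d) = %.3f%n", sampleSize, typical);
        System.out.printf("score for mean path 4.0 = %.3f%n", anomalyScore(4.0, sampleSize));
        System.out.printf("score for typical path = %.3f%n", anomalyScore(typical, sampleSize));
    }
}
```

Under these assumptions, a row that the trees isolate after only a few splits scores well above 0.5, which is why a cutoff such as 0.7 selects only the rows that are easiest to separate from the rest.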

Stratum is open source (Apache 2.0): https://github.com/replikativ/stratum

Happy to answer questions about the implementation. The isolation forest is pure Java with Vector API SIMD, and scoring is fused into the query execution pipeline, so it benefits from zone map pruning and chunked streaming.
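
To illustrate why running inside the execution pipeline can matter, here is a generic sketch of zone map pruning over column chunks. Each chunk carries the min and max of its values, so a predicate such as amount > threshold can skip whole chunks before any per-row work, which would include scoring, is done. This shows the general technique only; the `Chunk` and `ScanResult` types and the `scanGreaterThan` method are hypothetical and say nothing about Stratum's internal layout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of zone map pruning over column chunks; not Stratum's code.
final class ZoneMapSketch {
    /** A column chunk together with its zone map: the min and max of its values. */
    record Chunk(double[] values, double min, double max) {
        static Chunk of(double... values) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double v : values) {
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
            return new Chunk(values, min, max);
        }
    }

    /** Matching values plus the number of chunks that had to be read row by row. */
    record ScanResult(List<Double> matches, int chunksScanned) {}

    /** Evaluates "value > threshold" one chunk at a time, skipping chunks the zone map rules out. */
    static ScanResult scanGreaterThan(List<Chunk> chunks, double threshold) {
        List<Double> matches = new ArrayList<>();
        int scanned = 0;
        for (Chunk chunk : chunks) {
            if (chunk.max() <= threshold) {
                continue; // no row in this chunk can match, so it is never touched
            }
            scanned++;
            for (double v : chunk.values()) {
                if (v > threshold) {
                    matches.add(v); // a fused pipeline would hand surviving rows to the scorer here
                }
            }
        }
        return new ScanResult(matches, scanned);
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
                Chunk.of(1.0, 2.0, 3.0),
                Chunk.of(10.0, 50.0, 20.0),
                Chunk.of(4.0, 5.0, 60.0));
        ScanResult result = scanGreaterThan(chunks, 15.0);
        System.out.println(result.matches() + " from " + result.chunksScanned() + " scanned chunks");
    }
}
```

Processing one chunk at a time also keeps memory bounded, which is presumably the point of chunked streaming: only the rows that survive pruning and filtering ever need a score.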

submitted by /u/flyingfruits