Isolation Forest + eBPF events to create a Linux-based endpoint detection system [P]

Reddit r/MachineLearning / 4/23/2026


Key Points

  • The project “guardd” aims to build a Linux host-based anomaly detection system using Isolation Forest with execution and network events captured from the endpoint (via eBPF events).
  • It aggregates events into 60-second windows, converts them into feature vectors (event counts, unique processes/files/IPs/ports, parent-child patterns, and ratios), and also tracks “new vs baseline” entities and relationships.
  • Training is fully unsupervised: it collects baseline data, trains an Isolation Forest, scores samples during detection, and applies a threshold based on a percentile of the training score distribution.
  • A key challenge right now is high false-positive rates, especially for browsers and other high-variance behaviors that may appear anomalous depending on what was included in baseline training.
  • The author is exploring improvements such as adding time-of-day/activity features, better normalization, handling bursty behavior more robustly, and considering whether a more hybrid (semi-supervised) approach would reduce sensitivity to noise.

Hey everyone. I’ve been working on a machine learning project called guardd and wanted to get some feedback on the ML side of it.

It’s basically a host-based anomaly detection system for Linux using Isolation Forest. I’m collecting exec and network events, grouping them into 60-second windows, and turning each window into a feature vector that gets scored by the model.
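A minimal sketch of the windowing step described above. The event dict shape and the `ts` field name are assumptions for illustration, not guardd's actual schema:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # window length from the post

def bucket_events(events):
    """Group events (dicts with an epoch-seconds 'ts' field) into
    fixed 60-second windows keyed by the window's start time."""
    windows = defaultdict(list)
    for ev in events:
        window_start = int(ev["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS
        windows[window_start].append(ev)
    return dict(windows)

events = [
    {"ts": 100.5, "type": "exec"},
    {"ts": 119.0, "type": "net"},
    {"ts": 125.0, "type": "exec"},
]
windows = bucket_events(events)
# events at t=100.5 and t=119.0 land in window 60; t=125.0 in window 120
```

Fixed tumbling windows like this are the simplest option; sliding or overlapping windows would trade more compute for less sensitivity to where a burst falls relative to window boundaries.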

Right now the features include counts of exec and network events per window, the number of unique processes, files, IPs, and ports seen in the window, some parent-child relationship patterns, a few simple ratios between features, and some “new vs baseline” tracking, like processes or relationships that weren’t seen during training.
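To make the feature list concrete, here is a rough sketch of turning one window into a flat vector. The field names (`comm`, `pcomm`, `daddr`, `dport`) are assumptions about the event schema, and the feature set is a subset of what the post lists:

```python
def window_features(events, baseline_procs):
    """Convert one 60-second window of events into a feature dict:
    counts, unique-entity cardinalities, a simple ratio, and a
    'new vs baseline' count of processes unseen during training."""
    execs = [e for e in events if e["type"] == "exec"]
    nets = [e for e in events if e["type"] == "net"]
    procs = {e["comm"] for e in execs}
    parent_child = {(e["pcomm"], e["comm"]) for e in execs}
    return {
        "exec_count": len(execs),
        "net_count": len(nets),
        "unique_procs": len(procs),
        "unique_ips": len({e["daddr"] for e in nets}),
        "unique_ports": len({e["dport"] for e in nets}),
        "unique_parent_child": len(parent_child),
        "net_exec_ratio": len(nets) / max(len(execs), 1),
        "new_proc_count": len(procs - baseline_procs),
    }
```

The values would then be stacked in a fixed key order into the numeric matrix the model consumes.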

Training is fully unsupervised. It collects baseline data, trains an Isolation Forest, then uses score_samples during detection. The threshold is just based on a percentile from the training score distribution.
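The train-then-threshold step can be sketched like this with scikit-learn. The random feature matrix stands in for real baseline vectors, and the 1st-percentile cutoff is an arbitrary choice for the example:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))  # stand-in for baseline feature vectors

model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples returns higher values for "normal" points, so the
# threshold is a low percentile of the training score distribution:
# anything scoring below it gets flagged.
train_scores = model.score_samples(X_train)
threshold = np.percentile(train_scores, 1)  # flags ~1% of baseline windows

def is_anomalous(x):
    """Score a single feature vector against the trained model."""
    return model.score_samples(x.reshape(1, -1))[0] < threshold
```

One implication of this scheme: the percentile fixes the false-positive rate on the baseline itself, so any normal-but-rare behavior that was underrepresented in training will sit in that flagged tail, which matches the browser problem described below.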

The main issue right now is false positives, especially from stuff like browsers. Anything with a lot of variance can end up looking anomalous depending on what ended up in the baseline, so the model is pretty sensitive to training data.

Right now I’m looking at adding some time-based features like time of day or activity patterns, improving normalization a bit, and trying to handle bursty behavior better.
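For the time-of-day idea, one common trick is a cyclical sin/cos encoding so that 23:59 and 00:01 end up close in feature space rather than at opposite ends of a 0-86400 scale. A sketch (not from the repo):

```python
import math

SECONDS_PER_DAY = 86400

def time_of_day_features(ts):
    """Encode seconds-since-midnight as a point on the unit circle,
    so times just before and after midnight are near each other."""
    angle = 2 * math.pi * (ts % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)
```

These two values can simply be appended to each window's feature vector.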

Curious what people think about feature design for this kind of data, how to make Isolation Forest less sensitive to noisy but normal behavior, and whether staying fully unsupervised makes sense here or if moving toward something more hybrid would be better.

Would appreciate any thoughts on the approach.

Repo is here: https://github.com/benny-e/guardd.git

submitted by /u/No-Insurance-4417