Isolation Forest + eBPF events to create a Linux-based endpoint detection system [P]

Reddit r/MachineLearning / 4/23/2026


Key Points

  • The project “guardd” aims to build a Linux host-based anomaly detection system using Isolation Forest with execution and network events captured from the endpoint (via eBPF events).
  • It aggregates events into 60-second windows, converts them into feature vectors (event counts, unique processes/files/IPs/ports, parent-child patterns, and ratios), and also tracks “new vs baseline” entities and relationships.
  • Training is fully unsupervised: it collects baseline data, trains an Isolation Forest, scores samples during detection, and applies a threshold based on a percentile of the training score distribution.
  • A key challenge right now is high false-positive rates, especially for browsers and other high-variance behaviors that may appear anomalous depending on what was included in baseline training.
  • The author is exploring improvements such as adding time-of-day/activity features, better normalization, handling bursty behavior more robustly, and considering whether a more hybrid (semi-supervised) approach would reduce sensitivity to noise.

Hey everyone. I’ve been working on a machine learning project called guardd and wanted to get some feedback on the ML side of it.

It’s basically a host-based anomaly detection system for Linux using Isolation Forest. I’m collecting exec and network events, grouping them into 60-second windows, and turning each window into a feature vector that gets scored by the model.
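A minimal sketch of the windowing step described above. The event dict shape and the `ts` field name are assumptions for illustration, not guardd's actual schema:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # window length from the post

def bucket_events(events):
    """Group events (dicts with an epoch-seconds 'ts' field) into
    fixed 60-second windows keyed by the window's start time."""
    windows = defaultdict(list)
    for ev in events:
        window_start = int(ev["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS
        windows[window_start].append(ev)
    return dict(windows)

events = [
    {"ts": 100.5, "type": "exec"},
    {"ts": 119.0, "type": "net"},
    {"ts": 125.0, "type": "exec"},
]
windows = bucket_events(events)
# events at t=100.5 and t=119.0 land in window 60; t=125.0 in window 120
```

Fixed tumbling windows like this are the simplest option; sliding or overlapping windows would trade more compute for less sensitivity to where a burst falls relative to window boundaries.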

Right now the features include counts of exec and network events per window, the number of unique processes, files, IPs, and ports seen in the window, some parent-child relationship patterns, a few simple ratios between features, and some “new vs baseline” tracking, like processes or relationships that weren’t seen during training.
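To make the feature list concrete, here is a rough sketch of turning one window into a flat vector. The field names (`comm`, `pcomm`, `daddr`, `dport`) are assumptions about the event schema, and the feature set is a subset of what the post lists:

```python
def window_features(events, baseline_procs):
    """Convert one 60-second window of events into a feature dict:
    counts, unique-entity cardinalities, a simple ratio, and a
    'new vs baseline' count of processes unseen during training."""
    execs = [e for e in events if e["type"] == "exec"]
    nets = [e for e in events if e["type"] == "net"]
    procs = {e["comm"] for e in execs}
    parent_child = {(e["pcomm"], e["comm"]) for e in execs}
    return {
        "exec_count": len(execs),
        "net_count": len(nets),
        "unique_procs": len(procs),
        "unique_ips": len({e["daddr"] for e in nets}),
        "unique_ports": len({e["dport"] for e in nets}),
        "unique_parent_child": len(parent_child),
        "net_exec_ratio": len(nets) / max(len(execs), 1),
        "new_proc_count": len(procs - baseline_procs),
    }
```

The values would then be stacked in a fixed key order into the numeric matrix the model consumes.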

Training is fully unsupervised. It collects baseline data, trains an Isolation Forest, then uses score_samples during detection. The threshold is just based on a percentile from the training score distribution.
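The train-then-threshold step can be sketched like this with scikit-learn. The random feature matrix stands in for real baseline vectors, and the 1st-percentile cutoff is an arbitrary choice for the example:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))  # stand-in for baseline feature vectors

model = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples returns higher values for "normal" points, so the
# threshold is a low percentile of the training score distribution:
# anything scoring below it gets flagged.
train_scores = model.score_samples(X_train)
threshold = np.percentile(train_scores, 1)  # flags ~1% of baseline windows

def is_anomalous(x):
    """Score a single feature vector against the trained model."""
    return model.score_samples(x.reshape(1, -1))[0] < threshold
```

One implication of this scheme: the percentile fixes the false-positive rate on the baseline itself, so any normal-but-rare behavior that was underrepresented in training will sit in that flagged tail, which matches the browser problem described below.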

The main issue right now is false positives, especially from stuff like browsers. Anything with a lot of variance can end up looking anomalous depending on what ended up in the baseline, so the model is pretty sensitive to training data.

Right now I’m looking at adding some time-based features like time of day or activity patterns, improving normalization a bit, and trying to handle bursty behavior better.
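For the time-of-day idea, one common trick is a cyclical sin/cos encoding so that 23:59 and 00:01 end up close in feature space rather than at opposite ends of a 0-86400 scale. A sketch (not from the repo):

```python
import math

SECONDS_PER_DAY = 86400

def time_of_day_features(ts):
    """Encode seconds-since-midnight as a point on the unit circle,
    so times just before and after midnight are near each other."""
    angle = 2 * math.pi * (ts % SECONDS_PER_DAY) / SECONDS_PER_DAY
    return math.sin(angle), math.cos(angle)
```

These two values can simply be appended to each window's feature vector.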

Curious what people think about feature design for this kind of data, how to make Isolation Forest less sensitive to noisy but normal behavior, and whether staying fully unsupervised makes sense here or if moving toward something more hybrid would be better.

Would appreciate any thoughts on the approach.

Repo is here: https://github.com/benny-e/guardd.git

submitted by /u/No-Insurance-4417