[P] I trained a Mamba-3 log anomaly detector that hit 0.9975 F1 on HDFS — and I’m curious how far this can go

Reddit r/MachineLearning / 4/3/2026

Key Points

  • A new log anomaly detection system built on Mamba-3/SSM reportedly achieved an F1 score of 0.9975 on the classic HDFS benchmark, slightly outperforming prior LogRobust results.
  • The author reports very high recall (~0.9973) and precision (~0.9976) on the benchmark’s anomalous vs. normal session sets, with only a handful of false alarms and misses.
  • Model efficiency is a key claim: the detector is small (~4.9M parameters), trains quickly (~36 minutes on an RTX 4090), fits in ~1GB GPU memory, and runs at sub-2ms inference (>500 log events/sec).
  • The improvement is attributed less to generic hyperparameter tuning and more to changing preprocessing: replacing BPE/NLP-style tokenization with template-based tokenization where each log template becomes a single event-type token.
  • The work emphasizes that treating logs as structured event templates rather than natural language sequences can unlock large gains when paired with newer SSM-based sequence models like Mamba-3.

Experiment #324 ended well. ;)

This time I built a small project around log anomaly detection. In about two days, I went from an F1 of roughly 0.6 in the first runs to a final F1 score of 0.9975 on the HDFS benchmark.

Under my current preprocessing and evaluation setup, LogAI reaches F1=0.9975, which is slightly above the 0.996 HDFS result reported for LogRobust in a recent comparative study.

What that means in practice:

  • on 3,368 anomalous sessions in the test set, it missed about 9 (recall = 0.9973)
  • on roughly 112k normal sessions, it raised only about 3 false alarms (precision = 0.9976)
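
A quick sketch of the arithmetic behind these figures. It assumes the session and miss counts above are exact (the post gives them only as approximate) and takes precision as reported instead of recomputing it from counts:

```python
# Figures quoted in the post; treated as exact here, which is an assumption.
anomalous_sessions = 3368
missed = 9  # false negatives ("about 9")

true_positives = anomalous_sessions - missed
recall = true_positives / anomalous_sessions

# Precision is taken as reported, not derived from a false-alarm count.
precision = 0.9976

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.4f}  f1={f1:.4f}")
```

With these inputs, recall and F1 both land within rounding distance of the reported 0.9973 and 0.9975.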

What I find especially interesting is that this is probably the first log anomaly detection model built on top of Mamba-3 / SSM, which was only published a few weeks ago.

The model is small:

  • 4.9M parameters
  • trains in about 36 minutes on an RTX 4090
  • needs about 1 GB of GPU memory
  • inference is below 2 ms on a single consumer GPU, so over 500 log events/sec

For comparison, my previous approach took around 20 hours to train.

The dataset here is the classic HDFS benchmark from LogHub / Zenodo, built from HDFS logs collected on Amazon EC2:

  • 11M+ raw log lines
  • 575,061 sessions
  • 16,838 anomalous sessions (2.9%)

This benchmark has been used in a lot of papers since 2017, so it’s a useful place to test ideas.

The part that surprised me most was not just the score, but what actually made the difference.

I started with a fairly standard NLP-style approach:

  • BPE tokenizer
  • relatively large model, around 40M parameters

That got me something like 0.61–0.74 F1, depending on the run. It looked reasonable at first, but I kept hitting a wall. Hyperparameter tuning helped a bit, but not enough.

The breakthrough came when I stopped treating logs like natural language.

Instead of splitting lines into subword tokens, I switched to template-based tokenization: one log template = one token representing an event type.

So instead of feeding the model something like text, I feed it sequences like this:

[5, 3, 7, 5, 5, 3, 12, 12, 5, ...]

Where for example:

  • "Receiving block blk_123 from 10.0.0.1" → Template #5
  • "PacketResponder 1 terminating" → Template #3
  • "Unexpected error deleting block blk_456" → Template #12
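
To make the mapping concrete, here is a minimal sketch of template-based tokenization. The regex masks and the `TemplateVocab` class are hypothetical stand-ins, not the parser used in this project; a real pipeline would more likely rely on a dedicated log parser. IDs are assigned in order of first appearance, so they will not match the template numbers in the example above.

```python
import re

# Hypothetical masking rules: replace variable fields with placeholders so
# that lines of the same event type collapse into one template string.
MASKS = [
    (re.compile(r"blk_-?\d+"), "<BLK>"),                # block identifiers
    (re.compile(r"\d+\.\d+\.\d+\.\d+(?::\d+)?"), "<IP>"),  # IPv4, optional port
    (re.compile(r"\b\d+\b"), "<NUM>"),                  # any remaining integers
]

def to_template(line: str) -> str:
    """Strip variable parts of a log line, leaving only its template."""
    for pattern, placeholder in MASKS:
        line = pattern.sub(placeholder, line)
    return line

class TemplateVocab:
    """Assigns one integer ID per distinct template, in order of first use."""

    def __init__(self) -> None:
        self.ids: dict[str, int] = {}

    def encode(self, line: str) -> int:
        template = to_template(line)
        if template not in self.ids:
            self.ids[template] = len(self.ids)
        return self.ids[template]

vocab = TemplateVocab()
session = [
    "Receiving block blk_123 from 10.0.0.1",
    "PacketResponder 1 terminating",
    "Receiving block blk_456 from 10.0.0.2",
    "Unexpected error deleting block blk_456",
]
token_ids = [vocab.encode(line) for line in session]
print(token_ids)  # [0, 1, 0, 2]
```

The first and third lines differ in block ID and IP address but share a template, so they get the same token. That is what keeps the vocabulary tiny.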

That one change did a lot at once:

  • vocabulary dropped from about 8000 to around 50
  • model size shrank by roughly 10x
  • training went from hours to minutes
  • and, most importantly, the overfitting problem mostly disappeared

The second important change was matching the classifier head to the architecture. Mamba is causal, so only the hidden state at the last token has seen the whole sequence and carries a compressed summary of its context. Once I respected that in the pooling/classification setup, the model started behaving the way I had hoped.
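
A toy illustration of why the pooling position matters in a causal model. The scalar recurrence below is only a stand-in for a real SSM layer, and the right-padding convention with `PAD = 0` is an assumption; the actual classifier head is not shown in this post.

```python
PAD = 0  # hypothetical padding id; real template ids are assumed to be > 0

def causal_states(tokens, decay=0.5):
    """Toy causal recurrence h_t = decay * h_{t-1} + x_t (not Mamba itself)."""
    h, states = 0.0, []
    for x in tokens:
        h = decay * h + x
        states.append(h)
    return states

def last_token_pool(states, tokens):
    """Return the state at the last non-padding position."""
    last = max(i for i, t in enumerate(tokens) if t != PAD)
    return states[last]

# Two right-padded sessions that differ only in their final real event.
a = [5, 3, 7, PAD, PAD]
b = [5, 3, 12, PAD, PAD]
states_a, states_b = causal_states(a), causal_states(b)

# Early states cannot see the later difference, so they are identical.
print(states_a[:2] == states_b[:2])
# The last real position summarizes the whole session and does differ.
print(last_token_pool(states_a, a), last_token_pool(states_b, b))
# Naively taking states[-1] would read a padded position instead.
print(states_a[-1])
```

In a padded batch the useful summary sits at each session's last real token, not at the last index of the tensor, which is an easy thing to get wrong.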

The training pipeline was simple:

  • Pretrain (next-token prediction): the model only sees normal logs and learns what “normal” looks like
  • Finetune (classification): the model sees labeled normal/anomalous sessions
  • Test: the model gets unseen sessions and predicts normal vs anomaly
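
The first two stages differ mainly in which sessions they see and what the target is. A minimal data-preparation sketch, with made-up sessions and a hypothetical `next_token_pairs` helper (the model and training loop are left out):

```python
def next_token_pairs(session):
    """Inputs are all tokens but the last; targets are the inputs shifted by one."""
    return session[:-1], session[1:]

# Hypothetical labeled training sessions: (template ids, label), 1 = anomalous.
labeled_sessions = [
    ([5, 3, 7, 5], 0),
    ([5, 12, 12, 5], 1),
    ([3, 5, 5, 7], 0),
]

# Stage 1 (pretrain): only normal sessions, labels dropped, next-token targets.
pretrain = [next_token_pairs(s) for s, label in labeled_sessions if label == 0]

# Stage 2 (finetune): every session, with its normal/anomalous label as target.
finetune = [(s, label) for s, label in labeled_sessions]

print(pretrain[0])    # ([5, 3, 7], [3, 7, 5])
print(len(finetune))  # 3
```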

Data split was 70% train / 10% val / 20% test, so the reported F1 is on sessions the model did not see during training.
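
For reference, a sketch of a 70/10/20 split at the session level. It assumes a seeded random shuffle; the post does not say whether the actual split was random, stratified, or chronological.

```python
import random

def split_indices(n_sessions, seed=0):
    """Shuffle session indices and cut them 70% / 10% / 20%."""
    idx = list(range(n_sessions))
    random.Random(seed).shuffle(idx)
    n_train = n_sessions * 70 // 100
    n_val = n_sessions * 10 // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test

train, val, test = split_indices(575_061)
print(len(train), len(val), len(test))  # 402542 57506 115013
```

Splitting by session, not by log line, keeps every event of a block in a single partition.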

Another useful thing is that the output is not just binary. The model gives a continuous anomaly score from 0 to 1.

So in production this could be used with multiple thresholds, for example:

  • > 0.7 = warning
  • > 0.95 = critical

Or with an adaptive threshold that tracks the baseline noise level of a specific system.
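
Both ideas can be sketched in a few lines. The fixed levels are the example values above; the adaptive rule (rolling mean plus `k` standard deviations over recent non-flagged scores) is just one possible choice and an assumption on my part, as are the class name and its defaults.

```python
from collections import deque
import statistics

def severity(score, warning=0.7, critical=0.95):
    """Map a 0..1 anomaly score to a level using fixed thresholds."""
    if score > critical:
        return "critical"
    if score > warning:
        return "warning"
    return "ok"

class AdaptiveThreshold:
    """Flags scores above mean + k * stdev of recent non-flagged scores."""

    def __init__(self, window=100, k=3.0, fallback=0.7):
        self.history = deque(maxlen=window)
        self.k = k
        self.fallback = fallback  # used until there is enough history

    def threshold(self):
        if len(self.history) < 2:
            return self.fallback
        mean = statistics.mean(self.history)
        std = statistics.pstdev(self.history)
        return mean + self.k * std

    def update(self, score):
        flagged = score > self.threshold()
        if not flagged:
            # Only quiet scores feed the baseline, so spikes do not inflate it.
            self.history.append(score)
        return flagged

detector = AdaptiveThreshold()
scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.90]
flags = [detector.update(s) for s in scores]
print(flags)
print(severity(0.8), severity(0.99))
```

On a quiet system the adaptive rule ends up far stricter than a fixed 0.7 cut, so in practice it would probably need a lower bound to avoid alerting on tiny fluctuations.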

A broader lesson for me: skills and workflows I developed while playing with AI models for chess transfer surprisingly well to other domains. That’s not exactly new - a lot of AI labs started with games, and many still do - but it’s satisfying to see it work in practice.

Also, I definitely did not get here alone. This is a combination of:

  • reading a lot of papers
  • running automated experiment loops
  • challenging AI assistants instead of trusting them blindly
  • and then doing my own interpretation and tuning

Very rough split:

  • 50% reading papers and extracting ideas
  • 30% automated hyperparameter / experiment loops
  • 20% manual tuning and changes based on what I learned

Now I’ll probably build a dashboard and try this on my own Astrography / Astropolis production logs. Or I may push it further first on BGL, Thunderbird, or Spirit.

Honestly, I still find it pretty wild how much can now be done on a gaming PC if you combine decent hardware, public research, and newer architectures quickly enough.

Curious what people here think:

  • does this direction look genuinely promising to you?
  • has anyone else tried SSMs / Mamba for log modeling?
  • and which benchmark would you hit next: BGL, Thunderbird, or Spirit?

If there’s interest, I can also share more about the preprocessing, training loop, and the mistakes that got me stuck at 60-70% before it finally clicked.

P.S. I also tested its effectiveness and reproducibility across different seeds. On most of them, it actually performed slightly better than before.

https://preview.redd.it/3hrr4prgbzsg1.png?width=1794&format=png&auto=webp&s=d50ff21226e9aa97c2c0bbefed77be5dd8389cb8

submitted by /u/Adam_Jesion