Where Do Flow Semantics Reside? A Protocol-Native Tabular Pretraining Paradigm for Encrypted Traffic Classification

arXiv cs.AI / 3/12/2026

Key Points

  • The paper argues that flattening encrypted traffic into byte sequences introduces an inductive-bias mismatch, with issues including unpredictable fields (e.g., ip.id) being treated as reconstruction targets, embedding confusion where semantically distinct fields collapse into the same embedding space, and loss of capture-time metadata critical for temporal analysis.
  • It proposes a protocol-native paradigm that treats protocol-defined field semantics as architectural priors and reframes the task to align with the tabular data modality rather than extending sequence-based models.
  • It introduces FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs), featuring predictability-guided filtering, FSU-specific embeddings, and dual-axis attention to capture intra-packet and temporal patterns.
  • FlowSem-MAE significantly outperforms state-of-the-art methods across datasets, and with only half the labeled data it surpasses many methods trained on the full data.
  • The work points to a paradigm shift in encrypted-traffic classification, with potential benefits for labeling efficiency and practical deployment.
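The predictability-guided filtering idea above can be made concrete with a toy sketch. The paper does not publish this code; the snippet below uses per-field Shannon entropy as a stand-in for "learnability", dropping fields whose observed values look random (like ip.id) so they never become reconstruction targets. The function names, the entropy criterion, and the 3-bit threshold are all illustrative assumptions, not the paper's actual mechanism.

```python
import math
from collections import Counter

def field_entropy(values):
    """Empirical Shannon entropy (bits) of a field's observed values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_predictable_fields(flows, max_entropy_bits=3.0):
    """Keep only fields whose value distribution is concentrated enough
    to be plausibly learnable as a reconstruction target.

    flows: list of dicts mapping field name -> observed value.
    Returns the set of field names retained.
    """
    fields = flows[0].keys()
    kept = set()
    for f in fields:
        h = field_entropy([flow[f] for flow in flows])
        if h <= max_entropy_bits:
            kept.add(f)
    return kept

# Toy example: ip.id varies pseudo-randomly per packet,
# while ip.ttl and tcp.flags are near-constant and thus learnable.
flows = [{"ip.id": i * 7919 % 65536, "ip.ttl": 64, "tcp.flags": "ACK"}
         for i in range(256)]
kept = filter_predictable_fields(flows)
print(kept)  # ip.id (8 bits of entropy here) is filtered out
```

A real system would estimate predictability from reconstruction loss or field semantics rather than raw entropy, but the effect is the same: random header fields stop polluting the pretraining objective.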

Abstract

Self-supervised masked modeling shows promise for encrypted traffic classification by masking and reconstructing raw bytes. Yet recent work reveals these methods fail to reduce reliance on labeled data despite costly pretraining: under frozen-encoder evaluation, accuracy drops from greater than 0.9 to less than 0.47. We argue the root cause is inductive bias mismatch: flattening traffic into byte sequences destroys protocol-defined semantics. We identify three specific issues: 1) field unpredictability: random fields like ip.id are unlearnable yet treated as reconstruction targets; 2) embedding confusion: semantically distinct fields collapse into a unified embedding space; 3) metadata loss: capture-time metadata essential for temporal analysis is discarded. To address this, we propose a protocol-native paradigm that treats protocol-defined field semantics as architectural priors, reformulating the task to align with the data's intrinsic tabular modality rather than incrementally adapting sequence-based architectures. Instantiating this paradigm, we introduce FlowSem-MAE, a tabular masked autoencoder built on Flow Semantic Units (FSUs). It features predictability-guided filtering that focuses on learnable FSUs, FSU-specific embeddings to preserve field boundaries, and dual-axis attention to capture intra-packet and temporal patterns. FlowSem-MAE significantly outperforms state-of-the-art methods across datasets. With only half the labeled data, it outperforms most existing methods trained on the full data.
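The dual-axis attention described above can be pictured as alternating self-attention over two axes of a (packets × FSUs × dim) tensor: once across FSUs within each packet (intra-packet structure), then across packets for each FSU (temporal structure). The minimal NumPy sketch below omits learned Q/K/V projections, normalization, and masking; the shapes and block layout are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention with identity
    Q/K/V projections. x: (..., seq, d) -> same shape."""
    d = x.shape[-1]
    scores = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(d), axis=-1)
    return scores @ x

def dual_axis_block(x):
    """x: (packets, fsus, d). Attend across FSUs within each packet
    (intra-packet axis), then across packets per FSU (temporal axis),
    with residual connections."""
    x = x + self_attention(x)          # seq axis = FSUs (intra-packet)
    x = x.transpose(1, 0, 2)           # -> (fsus, packets, d)
    x = x + self_attention(x)          # seq axis = packets (temporal)
    return x.transpose(1, 0, 2)        # back to (packets, fsus, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 12, 16))   # 8 packets, 12 FSUs, dim 16
y = dual_axis_block(x)
print(y.shape)  # (8, 12, 16)
```

Factoring attention per axis keeps cost at O(P·F² + F·P²) instead of O((P·F)²) for full attention over the flattened sequence, which is one practical argument for treating the flow as tabular rather than as one long byte sequence.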