Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

arXiv cs.LG / 4/24/2026


Key Points

  • Streaming continual learning benchmarks often create discrete tasks from a continuous stream via temporal partitioning, and this paper argues that such “temporal taskification” is not neutral but structurally affects the evaluation regime.
  • The authors propose a taskification-level framework (including plasticity/stability profiles, profile distance, and Boundary-Profile Sensitivity) to quantify how sensitive an induced regime is to small boundary perturbations before any model training.
  • Experiments on network traffic forecasting (CESNET-Timeseries24) keep the stream, model, and training budget fixed while varying only the temporal splits, and find substantial changes in forecasting error, forgetting, and backward transfer across different split lengths.
  • Shorter taskifications lead to noisier distribution-level patterns, larger structural differences, and higher boundary sensitivity, implying that benchmark outcomes can vary significantly due to evaluation setup choices.
  • The study concludes that benchmark conclusions in streaming CL depend not only on the learner and stream, but also on how the stream is taskified, motivating temporal taskification as a first-class evaluation variable.
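
The Boundary-Profile Sensitivity idea above can be sketched in code. The paper's exact definitions of the profiles, the profile distance, and BPS are not given here, so this is a minimal illustrative sketch under assumptions: a per-task profile of (mean, std) statistics, an L2 profile distance, and BPS as the average profile distance under small random jitters of the task boundaries. All function names are hypothetical, not the authors' implementation.

```python
import numpy as np

def task_profiles(stream, boundaries):
    # Crude per-task distributional profile: (mean, std) of each segment.
    # `boundaries` are sorted split indices into the 1-D stream.
    segments = np.split(stream, boundaries)
    return np.array([[seg.mean(), seg.std()] for seg in segments])

def profile_distance(p, q):
    # L2 distance between two taskifications with the same task count.
    return float(np.linalg.norm(p - q))

def boundary_profile_sensitivity(stream, boundaries, shift=5, n_trials=20, seed=0):
    # BPS sketch: mean profile distance under small random boundary
    # perturbations. `shift` should stay well below the boundary spacing
    # so perturbed boundaries never collide or produce empty segments.
    rng = np.random.default_rng(seed)
    base = task_profiles(stream, boundaries)
    dists = []
    for _ in range(n_trials):
        jitter = rng.integers(-shift, shift + 1, size=len(boundaries))
        perturbed = np.clip(np.sort(np.asarray(boundaries) + jitter),
                            1, len(stream) - 1)
        dists.append(profile_distance(base, task_profiles(stream, perturbed)))
    return float(np.mean(dists))
```

A stream whose segment statistics change sharply at the boundaries would yield low BPS (small jitters barely change the profiles), while a stream where the "tasks" are an arbitrary cut of a drifting distribution would yield higher BPS, matching the paper's diagnostic intent.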

Abstract

Streaming Continual Learning (CL) typically converts a continuous stream into a sequence of discrete tasks through temporal partitioning. We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark conclusions. To study this effect, we introduce a taskification-level framework based on plasticity and stability profiles, a profile distance between taskifications, and Boundary-Profile Sensitivity (BPS), which diagnoses how strongly small boundary perturbations alter the induced regime before any CL model is trained. We evaluate continual finetuning, Experience Replay, Elastic Weight Consolidation, and Learning without Forgetting on network traffic forecasting with CESNET-Timeseries24, keeping the stream, model, and training budget fixed while varying only the temporal taskification. Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation. We further find that shorter taskifications induce noisier distribution-level patterns, larger structural distances, and higher BPS, indicating greater sensitivity to boundary perturbations. These results show that benchmark conclusions in streaming CL depend not only on the learner and the data stream, but also on how that stream is taskified, motivating temporal taskification as a first-class evaluation variable.