MedCL-Bench: Benchmarking stability-efficiency trade-offs and scaling in biomedical continual learning
arXiv cs.AI · March 18, 2026
Key Points
- MedCL-Bench introduces a unified, task-diverse benchmark for evaluating continual learning in biomedical NLP, addressing the lack of standardized protocols.
- It streams ten biomedical NLP datasets across five task families and evaluates eleven continual learning strategies over eight task orders, reporting retention, transfer, and GPU-hour cost (the standard metrics are sketched after this list).
- Across backbones and task orders, direct sequential fine-tuning induces catastrophic forgetting, underscoring the need for continual learning approaches.
- Among CL methods, parameter isolation offers the best retention per GPU-hour, replay provides strong protection at higher compute cost, and regularization yields limited benefit (a minimal replay sketch follows the metrics example below).
- Forgetting is task-dependent: multi-label topic classification is the most vulnerable, while constrained-output tasks are more robust. MedCL-Bench provides a reproducible framework for auditing model updates before deployment.
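For readers unfamiliar with the retention and transfer metrics mentioned above, the sketch below computes the standard continual-learning quantities (average accuracy, backward transfer, and per-task forgetting) from a task-accuracy matrix. The `acc` values are illustrative assumptions, not MedCL-Bench results, and the paper's exact metric definitions may differ.

```python
import numpy as np

# Hypothetical accuracy matrix for a 3-task stream: acc[i, j] is
# accuracy on task j measured after training through task i.
# (Illustrative numbers only, not results from MedCL-Bench.)
acc = np.array([
    [0.82, 0.00, 0.00],
    [0.61, 0.79, 0.00],
    [0.45, 0.70, 0.81],
])

T = acc.shape[0]
final = acc[T - 1]  # accuracies after training on the full stream

# Average accuracy over all tasks at the end of the stream.
avg_acc = final.mean()

# Backward transfer: change on earlier tasks caused by later training;
# negative values indicate forgetting (Lopez-Paz & Ranzato, 2017).
bwt = np.mean([final[j] - acc[j, j] for j in range(T - 1)])

# Per-task forgetting: peak accuracy ever reached minus final accuracy.
forgetting = np.array([acc[:, j].max() - final[j] for j in range(T - 1)])

print(f"average accuracy:  {avg_acc:.3f}")
print(f"backward transfer: {bwt:.3f}")
print(f"forgetting/task:   {forgetting}")
```

Tracking the full matrix (rather than only final accuracies) is what lets a benchmark separate retention from transfer, since both are read off different entries of `acc`.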
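Replay, one of the strategy families compared above, mitigates forgetting by mixing stored past examples into each gradient step; the extra forward/backward passes on replayed data are what drive its higher GPU-hour cost. Below is a minimal reservoir-replay sketch, assuming a generic training setup: `ReplayBuffer`, `train_task`, and `model_step` are hypothetical names for illustration, not MedCL-Bench's implementation.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer of past examples, filled by reservoir sampling."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer: list = []
        self.seen = 0

    def add(self, example) -> None:
        # Reservoir sampling keeps a uniform sample over the whole stream.
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            idx = random.randrange(self.seen)
            if idx < self.capacity:
                self.buffer[idx] = example

    def sample(self, k: int) -> list:
        return random.sample(self.buffer, min(k, len(self.buffer)))

def train_task(model_step, task_batches, buffer, replay_k=16):
    """Train on one task while replaying stored examples from earlier tasks."""
    for batch in task_batches:
        replayed = buffer.sample(replay_k)
        model_step(batch + replayed)  # joint gradient step on old + new data
        for example in batch:
            buffer.add(example)
```

The trade-off the benchmark quantifies is visible here: a larger buffer or `replay_k` buys retention at the price of more compute per step, whereas parameter-isolation methods avoid that per-step overhead by routing tasks to separate parameters.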