SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

arXiv cs.AI / 4/21/2026

Key Points

  • SkillFlow is a new arXiv benchmark that evaluates whether autonomous agents can discover, repair, and continuously evolve a reusable external skill library over time, not just use given skills.
  • The benchmark includes 166 tasks across 20 task families, all built on a Domain-Agnostic Execution Flow (DAEF) workflow framework to ensure consistent agent procedures.
  • Agents are tested with an Agentic Lifelong Learning protocol: starting with no skills, solving tasks sequentially within each family, creating “skill patches” from trajectories and rubrics, and carrying the updated library forward (see the sketch after this list).
  • Experiments show sizable gaps in lifelong skill evolution quality: Claude Opus 4.6 improves success from 62.65% to 71.08%, while other models gain little or even regress, and heavy skill usage does not guarantee improvement.
  • SkillFlow is positioned as a structured testbed plus an empirical analysis of skill discovery, patching, transfer, and the main failure modes under lifelong evaluation.
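
A minimal Python sketch of that lifelong loop, assuming hypothetical names throughout (Skill, SkillLibrary, run_agent, derive_patch are illustrations, not SkillFlow's actual API):

```python
"""Minimal sketch of the Agentic Lifelong Learning protocol described above.

Every name here is a hypothetical illustration, not the paper's interface.
"""
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    instructions: str  # reusable procedure distilled from past trajectories


@dataclass
class SkillLibrary:
    skills: dict[str, Skill] = field(default_factory=dict)

    def apply_patch(self, patch: Skill) -> None:
        # A "skill patch" adds a new skill or repairs an existing one in place.
        self.skills[patch.name] = patch


def run_agent(task, library: SkillLibrary):
    """Hypothetical: run one task with the current library and return
    (success, trajectory, rubric_scores)."""
    raise NotImplementedError


def derive_patch(trajectory, rubric_scores) -> Skill | None:
    """Hypothetical: externalize lessons from the trajectory and rubric
    feedback as a skill patch, or None if nothing transferable was learned."""
    raise NotImplementedError


def lifelong_eval(task_families):
    results = []
    for family in task_families:
        # Assumption: the library starts empty and evolves within a family;
        # the digest does not say whether it also persists across families.
        library = SkillLibrary()
        for task in family:  # tasks are solved sequentially
            success, trajectory, rubric_scores = run_agent(task, library)
            patch = derive_patch(trajectory, rubric_scores)
            if patch is not None:
                library.apply_patch(patch)  # carry the updated library forward
            results.append((task, success))
    return results
```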

Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families, in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF), a workflow framework that gives all tasks in a family a consistent agent procedure. Agents are evaluated under an Agentic Lifelong Learning protocol: they begin without skills, solve tasks sequentially within each family, externalize lessons as trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
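
For concreteness, one hypothetical shape for a trajectory- and rubric-driven skill patch; the record fields below are illustrative assumptions, not the paper's schema:

```python
# Hypothetical skill-patch record; all field names are illustrative
# assumptions, not taken from the SkillFlow paper.
skill_patch = {
    "skill": "normalize_table_output",     # library entry to add or repair
    "source_task": "family-07/task-03",    # task whose trajectory produced it
    "evidence": {
        "trajectory_steps": [14, 15, 18],  # where the failure surfaced
        "failed_rubric_items": ["output_schema", "unit_consistency"],
    },
    "revision": "Validate column names against the requested schema "
                "and normalize units before emitting the final table.",
}
```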