ALTK‑Evolve: On‑the‑Job Learning for AI Agents

Hugging Face Blog / 4/8/2026

Key Points

ALTK‑Evolve proposes an “on‑the‑job learning” approach for AI agents, aiming to improve performance through learning during real task execution rather than only prior offline training.
The concept focuses on agent behavior that can adapt based on feedback and experience collected while operating in enterprise settings.
The article frames on-the-job learning as a practical pathway to make AI agents more robust to changing environments and task variations.
It positions ALTK‑Evolve as an enterprise-oriented research/engineering idea that could influence how teams deploy and continuously improve agent systems.

Enterprise Article Published April 8, 2026

Jayaram Radhakrishnan

TL;DR

Most AI agents re‑read transcripts instead of learning principles, so they repeat mistakes and don’t transfer lessons to new situations.
ALTK‑Evolve turns raw agent trajectories into reusable guidelines.
In benchmarks, the approach boosted reliability without bloating context, especially on hard, multi-step tasks (+14.2 percentage points on AppWorld).

The “eternal intern” problem

Imagine a brilliant line cook who has memorized every cookbook but forgets your kitchen every morning. They don’t remember your oven runs hot, or that regulars like extra salt; they’ll follow a recipe card yet freeze when you’re out of lemons. That’s most AI agents: excellent at following prompts, poor at accumulating wisdom about your environment. Feeding yesterday’s logs back into the prompt just makes them re‑read history; it doesn’t help them generalize from it.

A junior needs different recipes for “vinaigrette” and “duck à l’orange.” A chef learns “acid balances fat” and applies it everywhere. Likewise, reliable agents should distill principles from experience and apply them to new tasks, not just near duplicates of old ones. ALTK-Evolve's long-term memory subsystem does exactly that: it converts interaction traces into candidate guidelines, filters for quality, and injects only relevant guidance at the moment of action. Agents need principles, not transcripts.

A recent MIT study found that 95% of AI pilots fail because agents don't adapt and learn on the job. ALTK-Evolve addresses this learning gap with long-term episodic memory that helps agents reason better.

Solution: long-term memory with ALTK-Evolve

ALTK-Evolve is a memory system that helps AI agents improve over time: it generates guidelines from previous executions and applies them to new ones.

Operationally, the system runs as a continuous loop:

Downward flow (observation & extraction): Capture full agent trajectories (user utterances, thoughts, tool calls, results) in an Interaction Layer (e.g., Langfuse or another OpenTelemetry‑based observability tool). Pluggable extractors mine traces for structural patterns and persist them as candidate entities.
Upward flow (refinement & retrieval): A background consolidate‑and‑score job merges duplicates, prunes weak rules, and boosts proven strategies, evolving a high‑quality library of entities such as guidelines, policies, and SOPs. Retrieval pulls only the relevant items via the Interaction Layer and injects them back into context at the Application Layer.
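The consolidate-and-score step can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Guideline` schema, the exact-match duplicate key, the smoothed success-rate score, and the `min_score` threshold are hypothetical and are not taken from the ALTK-Evolve code base.

```python
from dataclasses import dataclass


@dataclass
class Guideline:
    """A candidate guideline mined from a trajectory (hypothetical schema)."""
    text: str
    successes: int = 0  # runs where the guideline was applied and the task succeeded
    failures: int = 0   # runs where the guideline was applied and the task failed

    @property
    def score(self) -> float:
        # Laplace-smoothed success rate; an illustrative choice, not the paper's formula.
        return (self.successes + 1) / (self.successes + self.failures + 2)


def normalize(text: str) -> str:
    """Crude duplicate key: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def consolidate(candidates: list[Guideline], min_score: float = 0.5) -> list[Guideline]:
    """Merge duplicates, prune weak rules, and rank proven strategies first."""
    merged: dict[str, Guideline] = {}
    for g in candidates:
        key = normalize(g.text)
        if key in merged:
            # Duplicate wording: pool the evidence instead of storing it twice.
            merged[key].successes += g.successes
            merged[key].failures += g.failures
        else:
            merged[key] = Guideline(g.text, g.successes, g.failures)
    kept = [g for g in merged.values() if g.score >= min_score]
    return sorted(kept, key=lambda g: g.score, reverse=True)


candidates = [
    Guideline("Check the auth token before calling the API", successes=3, failures=0),
    Guideline("check the auth token  before calling the API", successes=2, failures=1),
    Guideline("Always retry five times", successes=0, failures=4),
]
library = consolidate(candidates)
```

In this toy run the two auth-token variants collapse into one entry with pooled counts, and the retry rule is pruned because its smoothed score falls below the threshold. A real deployment would likely need semantic rather than exact-match deduplication.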

This approach works for a few key reasons:

Teaches judgment: Converts one‑off events into portable strategies that transfer across tasks.
Controls noise: Scoring keeps memory lean and useful, not a growing junk drawer.
Progressive Disclosure: Retrieval is just‑in‑time, not stuffing everything into the context.
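Just-in-time retrieval can be sketched as a top-k lookup over the guideline library. The token-overlap relevance function below is a deliberately simple stand-in (an assumption; the source does not say how relevance is computed), and `retrieve`, `build_prompt`, and `min_relevance` are hypothetical names.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(task: str, guideline: str) -> float:
    """Jaccard overlap between task and guideline tokens (stand-in for a real retriever)."""
    a, b = tokens(task), tokens(guideline)
    return len(a & b) / len(a | b) if a and b else 0.0


def retrieve(task: str, library: list[str], k: int = 5, min_relevance: float = 0.1) -> list[str]:
    """Return at most k guidelines, most relevant first, skipping weak matches."""
    scored = [(relevance(task, g), g) for g in library]
    relevant = [(s, g) for s, g in scored if s >= min_relevance]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [g for _, g in relevant[:k]]


def build_prompt(task: str, library: list[str], k: int = 5) -> str:
    """Inject only the retrieved guidelines, never the whole library."""
    guidelines = retrieve(task, library, k)
    if not guidelines:
        return task
    bullet_list = "\n".join(f"- {g}" for g in guidelines)
    return f"{task}\n\nRelevant guidelines:\n{bullet_list}"


library = [
    "Paginate playlist API results until the last page",
    "Confirm the payment amount before sending money",
    "Log in to the app before calling its API",
]
task = "List every song in my playlist using the playlist API"
prompt = build_prompt(task, library)
```

The context cost stays bounded by `k` no matter how large the library grows, which is the point of progressive disclosure: the payment guideline never reaches the prompt for a playlist task.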

Results: better reliability, especially on hard tasks

We evaluated the framework on AppWorld, where agents complete realistic multi-step tasks via APIs, averaging 9.5 APIs across 1.8 apps, with hard cases requiring more complex control flow. A ReAct agent received the task instruction plus the top 5 retrieved guidelines, which were generated on a prior run (train/dev), and was tested on an unseen partition (test-normal). We report Scenario Goal Completion (SGC), a strict consistency metric requiring success across variants.

Difficulty	Baseline SGC	SGC + Memory	Δ (pp)
Easy	79.0%	84.2%	+5.2
Medium	56.2%	62.5%	+6.3
Hard	19.1%	33.3%	+14.2
Aggregate	50.0%	58.9%	+8.9
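To make the strictness of SGC concrete, here is a minimal sketch that contrasts it with a raw task pass rate. It assumes, based only on the description above, that a scenario counts as complete when every one of its task variants passes; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def task_pass_rate(results: list[tuple[str, bool]]) -> float:
    """Fraction of individual tasks that passed."""
    return sum(passed for _, passed in results) / len(results)


def scenario_goal_completion(results: list[tuple[str, bool]]) -> float:
    """Fraction of scenarios in which every task variant passed."""
    by_scenario: dict[str, list[bool]] = defaultdict(list)
    for scenario_id, passed in results:
        by_scenario[scenario_id].append(passed)
    return sum(all(variants) for variants in by_scenario.values()) / len(by_scenario)


# Two scenarios with three variants each (toy data, not from the paper).
results = [
    ("s1", True), ("s1", True), ("s1", True),
    ("s2", True), ("s2", False), ("s2", True),
]
pass_rate = task_pass_rate(results)       # 5 of 6 tasks pass
sgc = scenario_goal_completion(results)   # only s1 is fully solved
```

A single flaky variant costs a whole scenario, so SGC here is 0.5 even though five of six tasks pass.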

Here are some key conclusions from the evaluations:

Generalization: The agent improves on the unseen Test‑Normal tasks, evidence that it’s learning principles, not memorizing recipes.
Complexity scaling: The harder the task, the more the agent benefits from concise learned guidelines. Hard tasks saw the largest lift, a 74% relative increase in success (19.1% to 33.3%), because guidelines help the agent navigate intricate control flows.
Consistency: SGC gains exceeded raw pass-rate improvements, reducing “flaky” behavior across scenario variants. The guidelines don’t just help the agent solve tasks; they help it solve them reliably across variants.
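The results table reports absolute gains in percentage points, while the 74% figure for hard tasks is a relative gain. A short check of that arithmetic, using only the numbers from the table (the `lift` helper is a hypothetical name):

```python
# (baseline SGC %, SGC with memory %) per difficulty, copied from the table.
table = {
    "Easy": (79.0, 84.2),
    "Medium": (56.2, 62.5),
    "Hard": (19.1, 33.3),
    "Aggregate": (50.0, 58.9),
}


def lift(baseline: float, with_memory: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = with_memory - baseline
    relative = absolute / baseline * 100
    return round(absolute, 1), round(relative, 1)


for difficulty, (baseline, with_memory) in table.items():
    absolute, relative = lift(baseline, with_memory)
    print(f"{difficulty}: +{absolute} pp absolute, +{relative}% relative")
```

For the Hard row this gives +14.2 points absolute and about +74.3% relative, matching the figure quoted above.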

Find more details about the experiments in the paper at https://arxiv.org/abs/2603.10600.

Getting started (choose your path)

You have a choice in how to integrate ALTK‑Evolve into your agent.

No‑code with Claude Code, Codex, and IBM Bob (Lite mode)

Install the plugin into Claude Code:

claude plugin marketplace add AgentToolkit/altk-evolve
claude plugin install evolve@evolve-marketplace

That’s it! The plugin extracts entities from trajectories and stores them as files on your filesystem. It uses Claude Code’s hooks for automatic retrieval.

Prefer to watch instead of read? See the short Evolve-Lite Claude Code walkthrough (video): Demo

Check out the walkthroughs here for examples of how to learn with Claude Code in Lite mode.

Lite mode is easy to test‑drive but has limitations. For example, it doesn’t glean insights from across agent sessions or perform consolidation and garbage collection of entities. The low‑code and pro‑code versions below address these limitations.

There are also one-step integrations with Codex and IBM Bob. Try them out!

Low‑code with a ReAct agent

Add a single altk_evolve.auto import and flip a flag to emit traces to an Arize Phoenix UI. Then sync traces to generate improvement guidelines. It works with popular LLM clients and agent frameworks (e.g., OpenAI, LiteLLM, and Hugging Face agents), so you keep your current stack and gain visibility.

To see just how easily this fits into existing projects, explore our hands‑on examples showcasing different framework integrations. For full details on configuration and capabilities, read our low‑code tracing documentation.

Pro‑code with CUGA

We integrated ALTK‑Evolve directly into CUGA via MCP to create a tight, low‑overhead learning loop. Before each run, the get_guidelines MCP tool is called to surface task‑specific steering and reduce trial‑and‑error. After the run, CUGA sends back structured execution traces via save_trajectory, so Evolve can learn from what actually happened and improve future guidance. The result is an integration that gets better over time while staying transparent, composable, and easy to adopt.
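The shape of this loop can be sketched in a few lines. Only the tool names `get_guidelines` and `save_trajectory` come from the text above; their parameters, the `EvolveTools` protocol, the `InMemoryEvolve` stand-in, and `run_with_memory` are assumptions for illustration, and the sketch does not use a real MCP client or CUGA itself.

```python
from typing import Callable, Protocol

# An agent takes (task, guidelines) and returns its execution steps.
Agent = Callable[[str, list[str]], list[dict]]


class EvolveTools(Protocol):
    """Assumed shape of the two tools; the real MCP schemas may differ."""
    def get_guidelines(self, task: str) -> list[str]: ...
    def save_trajectory(self, task: str, steps: list[dict]) -> None: ...


class InMemoryEvolve:
    """Toy stand-in so the sketch runs without a server."""

    def __init__(self) -> None:
        self.guidelines: list[str] = []
        self.trajectories: list[dict] = []

    def get_guidelines(self, task: str) -> list[str]:
        return list(self.guidelines)

    def save_trajectory(self, task: str, steps: list[dict]) -> None:
        self.trajectories.append({"task": task, "steps": steps})


def run_with_memory(task: str, agent: Agent, evolve: EvolveTools) -> list[dict]:
    """Fetch guidance before the run, report what happened after it."""
    guidelines = evolve.get_guidelines(task)
    steps = agent(task, guidelines)
    evolve.save_trajectory(task, steps)
    return steps


def echo_agent(task: str, guidelines: list[str]) -> list[dict]:
    """Placeholder agent that records the guidance it was given."""
    return [{"thought": f"working on: {task}", "guidelines_seen": guidelines}]


evolve = InMemoryEvolve()
evolve.guidelines.append("Log in before calling any app API")
steps = run_with_memory("Pay my phone bill", echo_agent, evolve)
```

Keeping both calls in one wrapper means the agent itself needs no knowledge of the memory system, which is one way to read the "transparent, composable" claim.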

Prefer a visual tour? Watch the CUGA integration walkthrough: video

Try it & tell us what your agent learned

Your agent shouldn’t wake up as an intern every morning. This approach helps it learn on the job. If you're using Claude Code, Codex, or IBM Bob, try it out in minutes and see how it improves your agent.

Star the repo: it helps others discover the project and directly guides what we build next.

Code: https://github.com/AgentToolkit/altk-evolve
Docs: https://agenttoolkit.github.io/altk-evolve
Quick start tutorials: https://agenttoolkit.github.io/altk-evolve/tutorials/
Feedback & ideas: Open a GitHub issue or join the discussions — concrete use cases, benchmarks, and integration requests are especially helpful.

Watch the demos

Claude Code walkthrough (video): Demo
OpenAI Codex walkthrough (video): Demo
IBM Bob demo walkthrough (video): Demo
CUGA integration walkthrough: video
