Multi-action Tangled Program Graphs for Multi-task Reinforcement Learning with Continuous Control
arXiv cs.AI / 4/29/2026
Key Points
- The paper proposes Multi-Action Tangled Program Graph (MATPG), a genetic-programming approach that aggregates MAPLE-style agents and uses a control-flow mechanism to activate multiple behaviors for continuous-control tasks.
- While MATPG was previously tested mainly on single-task reinforcement learning, the authors introduce a new multi-task benchmark using MuJoCo HalfCheetah with five randomly placed obstacles, each requiring distinct behaviors.
- Experiments on the new continuous multi-task setting show that MATPG performs strongly, with the authors reporting that MATPG combined with lexicase selection outperforms the compared configurations.
- The study also evaluates interpretability, finding that the decision flow of the evolved program graph can be traced end to end, supporting explainable policy structure.
- Overall, the work positions MATPG as an effective GP-based solution for continuous Multi-Task Reinforcement Learning and provides a new evaluation scenario to test such methods.
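The key points above credit lexicase selection for MATPG's strongest results. As a rough illustration (not the paper's implementation), standard lexicase selection filters the population through the test cases in a random order, keeping only individuals tied for the best error on each case; the `lexicase_select` helper, `error_fn`, and the numeric toy population below are all hypothetical names for this sketch:

```python
import random

def lexicase_select(population, error_fn, cases):
    """Generic lexicase selection sketch.

    Repeatedly filters the candidate pool: for each test case (in a
    freshly shuffled order), keep only the candidates achieving the
    minimum error on that case. Returns one survivor.
    """
    candidates = list(population)
    for case in random.sample(cases, len(cases)):
        best = min(error_fn(ind, case) for ind in candidates)
        candidates = [ind for ind in candidates
                      if error_fn(ind, case) == best]
        if len(candidates) == 1:
            break  # a single individual already dominates this ordering
    # Several individuals may survive all cases; break ties at random.
    return random.choice(candidates)

# Toy usage: individuals are numbers, error is distance to a target case.
winner = lexicase_select([1, 3, 5], lambda ind, c: abs(ind - c), [0])
```

Note that for continuous-valued errors, exact ties are rare, so practical variants often relax the equality test to an epsilon threshold (epsilon-lexicase selection); which variant the paper uses is not stated in this summary.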