On the Role of DAG Topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

arXiv cs.LG / 4/13/2026

Key Points

  • The paper studies energy-aware scheduling for cloud workflows represented as DAGs, using a GNN-based deep reinforcement learning scheduler to jointly target completion time and energy consumption in a single-workflow, queue-free setting.
  • It identifies specific out-of-distribution (OOD) conditions under which GNN-DRL schedulers fail, showing that scheduling quality can degrade when deployment workflows differ from those seen in training.
  • The authors explain that the observed degradation comes from structural mismatches between training and deployment DAG environments, which disrupt GNN message passing and reduce policy generalization.
  • Controlled OOD evaluations support this explanation, tying the performance degradation to structural mismatch between training and deployment environments.
  • The work argues for more robust graph representations to improve scheduler performance under distribution shifts, pointing to limitations of current GNN-based scheduling approaches.
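
To make the message-passing point concrete, here is a toy sketch (not the paper's architecture) of mean aggregation over predecessors in a DAG. The function name `mean_message_passing`, the 0.5/0.5 mixing rule, and the scalar features are all assumptions chosen for illustration. It shows that tasks with identical input features end up with different embeddings once the topology changes, which is the kind of structural sensitivity the key points describe.

```python
def mean_message_passing(num_nodes, edges, features, rounds=2):
    """Toy message passing on a DAG (illustrative assumption, not the paper's GNN).

    In each round, every node mixes its own value with the mean of its
    predecessors' values; nodes without predecessors receive a zero message.
    """
    # Build predecessor lists from directed edges (u -> v).
    preds = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        preds[v].append(u)

    h = list(features)
    for _ in range(rounds):
        new_h = []
        for v in range(num_nodes):
            if preds[v]:
                msg = sum(h[u] for u in preds[v]) / len(preds[v])
            else:
                msg = 0.0
            new_h.append(0.5 * h[v] + 0.5 * msg)
        h = new_h
    return h


# Same number of tasks and identical task features, different topology.
chain_edges = [(0, 1), (1, 2), (2, 3)]   # 0 -> 1 -> 2 -> 3
fan_in_edges = [(0, 3), (1, 3), (2, 3)]  # 0, 1, 2 all feed into 3
features = [1.0, 1.0, 1.0, 1.0]

chain_embeddings = mean_message_passing(4, chain_edges, features)
fan_in_embeddings = mean_message_passing(4, fan_in_edges, features)
print(chain_embeddings)   # [0.25, 0.75, 1.0, 1.0]
print(fan_in_embeddings)  # [0.25, 0.25, 0.25, 0.75]
```

In this toy setting the final task (node 3) gets a different embedding in the chain than in the fan-in graph, even though every input feature is the same. A policy trained only on one family of shapes would see embedding values it never encountered when given the other.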

Abstract

Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
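
As a rough illustration of the two objectives named in the abstract, the sketch below evaluates a fixed task-to-machine assignment for one workflow DAG on heterogeneous machines. Everything in it is an assumption made for the example rather than the paper's model: the names `evaluate_schedule` and `weighted_cost`, the per-machine `speed` and `power` values, the rule that energy equals power times busy time, and the weighted-sum trade-off. Machines are taken to be idle at time zero, which is one simple reading of the single-workflow, queue-free setting.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def evaluate_schedule(work, edges, assignment, speed, power):
    """Return (makespan, energy) for a fixed assignment (hypothetical model).

    A task starts once all predecessors have finished and its machine is free.
    Energy is counted only while a machine is busy: power * duration.
    """
    preds = {t: set() for t in work}
    for u, v in edges:
        preds[v].add(u)

    finish = {}
    machine_free = {m: 0.0 for m in speed}
    energy = 0.0
    # static_order() yields each task after all of its predecessors.
    for t in TopologicalSorter(preds).static_order():
        m = assignment[t]
        ready = max((finish[p] for p in preds[t]), default=0.0)
        start = max(ready, machine_free[m])
        duration = work[t] / speed[m]
        finish[t] = start + duration
        machine_free[m] = finish[t]
        energy += power[m] * duration
    return max(finish.values()), energy


def weighted_cost(makespan, energy, w=0.5):
    """Assumed scalarization: w weights time, (1 - w) weights energy."""
    return w * makespan + (1.0 - w) * energy


# Diamond-shaped workflow: a -> {b, c} -> d.
work = {"a": 2.0, "b": 4.0, "c": 4.0, "d": 2.0}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
speed = {"fast": 2.0, "slow": 1.0}
power = {"fast": 4.0, "slow": 1.0}

mixed = evaluate_schedule(
    work, edges, {"a": "fast", "b": "fast", "c": "slow", "d": "fast"}, speed, power
)
all_fast = evaluate_schedule(
    work, edges, {t: "fast" for t in work}, speed, power
)
print(mixed)     # (6.0, 20.0)
print(all_fast)  # (6.0, 24.0)
```

With these made-up numbers both assignments finish at the same time, but offloading one parallel branch to the slower, lower-power machine uses less energy. Whether such a trade is available depends on the DAG's shape, for example on how many branches can run in parallel.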