TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning

arXiv cs.LG / 4/15/2026


Key Points

  • The paper proposes TCL, a compiler framework aimed at speeding up and improving the transferability of tensor program optimization across different CPU/GPU hardware without relying on large offline tuning datasets.
  • TCL reduces data collection costs by using the RDU Sampler, which selects only about 10% of tensor programs via active learning that jointly optimizes representativeness, diversity, and uncertainty (a rough sketch of this kind of scoring follows the list).
  • It introduces a new Mamba-based cost model designed to capture long-range schedule dependencies with a favorable accuracy–efficiency trade-off through reduced parameterization and lightweight sequence modeling.
  • TCL also uses a continuous knowledge distillation approach to progressively transfer optimization knowledge across hardware platforms while avoiding issues like parameter explosion and data dependency common in traditional multi-task learning.
  • Experiments show TCL substantially improves tuning speed (16.8x on average on CPU and 12.48x on GPU) and modestly reduces inference latency versus Tenset-MLP (1.20x on CPU and 1.13x on GPU).
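
The RDU Sampler's exact selection procedure is not given in this summary. Purely as a hypothetical illustration, the sketch below shows how a greedy active-learning loop could combine the three criteria the paper names; the function name `rdu_select`, the weights `alpha`/`beta`/`gamma`, and the distance-based scores are assumptions, not TCL's implementation.

```python
import numpy as np

def rdu_select(features, uncertainty, budget_frac=0.10, alpha=1.0, beta=1.0, gamma=1.0):
    """Greedy active-learning selection sketch (hypothetical, not the paper's code).

    features    : (N, D) array of tensor-program feature vectors
    uncertainty : (N,) per-program uncertainty from the current cost model
    Returns indices of roughly budget_frac * N programs, scored by a weighted sum of
    representativeness, diversity w.r.t. already-picked items, and uncertainty.
    """
    n = len(features)
    budget = max(1, int(budget_frac * n))
    centroid = features.mean(axis=0)
    # Representativeness: closeness to the dataset centroid (higher = more typical).
    representativeness = -np.linalg.norm(features - centroid, axis=1)
    selected, remaining = [], set(range(n))
    for _ in range(budget):
        best, best_score = None, -np.inf
        for i in remaining:
            # Diversity: distance to the nearest already-selected program.
            if selected:
                diversity = min(np.linalg.norm(features[i] - features[j]) for j in selected)
            else:
                diversity = 0.0
            score = alpha * representativeness[i] + beta * diversity + gamma * uncertainty[i]
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

In such a scheme, only the selected ~10% of programs would be compiled and measured on hardware, which is where the claimed reduction in data collection cost would come from.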

Abstract

Deep learning (DL) compilers rely on cost models and auto-tuning to optimize tensor programs for target hardware. However, existing approaches depend on large offline datasets, incurring high collection costs and offering suboptimal transferability across platforms. To address these challenges, we introduce TCL, a novel, efficient, and transferable compiler framework for fast tensor program optimization across diverse hardware platforms. Specifically, TCL is built on three core enablers: (1) the RDU Sampler, a data-efficient active learning strategy that selects only 10% of tensor programs by jointly optimizing Representativeness, Diversity, and Uncertainty, substantially reducing data collection costs while maintaining near-original model accuracy; (2) a new Mamba-based cost model that efficiently captures long-range schedule dependencies while achieving a favorable trade-off between prediction accuracy and computational cost through reduced parameterization and lightweight sequence modeling; and (3) a continuous knowledge distillation framework that effectively and progressively transfers knowledge across multiple hardware platforms while avoiding the parameter explosion and data dependency issues typically caused by traditional multi-task learning. Extensive experiments validate the effectiveness of each individual enabler and the holistic TCL framework. When optimizing a range of mainstream DL models on both CPU and GPU platforms, TCL achieves, on average, 16.8x and 12.48x faster tuning, and 1.20x and 1.13x lower inference latency, respectively, compared to Tenset-MLP.
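
The abstract describes the cross-platform distillation enabler only at a high level. As a non-authoritative sketch, the snippet below shows one generic way a student cost model for a new platform could be trained against both fresh measurements and a frozen teacher from a previously tuned platform; all names (`distillation_step`, `lam`, the plain MSE losses) are illustrative assumptions rather than TCL's actual training objective.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, feats, measured_latency, optimizer, lam=0.5):
    """One training step of a distillation-style transfer (illustrative sketch only).

    student          : cost model (nn.Module) being adapted to the new hardware platform
    teacher          : frozen cost model trained on a previously tuned platform
    feats            : (B, ...) batch of schedule features
    measured_latency : (B,) ground-truth measurements on the new platform
    lam              : weight balancing new-platform supervision vs. teacher guidance
    """
    student.train()
    with torch.no_grad():
        teacher_pred = teacher(feats)                            # knowledge from the old platform
    student_pred = student(feats)
    task_loss = F.mse_loss(student_pred, measured_latency)       # fit the new hardware
    distill_loss = F.mse_loss(student_pred, teacher_pred)        # stay close to prior knowledge
    loss = task_loss + lam * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this per platform, with the previous student becoming the next teacher, is one plausible way to read the "progressive" transfer described above without maintaining a separate model (and its data) for every platform, as multi-task approaches would.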