TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
arXiv cs.LG / 4/15/2026
Key Points
- The paper proposes TCL, a compiler framework that speeds up tensor program optimization and improves its transferability across CPU and GPU hardware without relying on large offline tuning datasets.
- TCL reduces data-collection cost with the RDU Sampler, an active-learning strategy that selects only about 10% of tensor programs by jointly optimizing representativeness, diversity, and uncertainty (see the first sketch after this list).
- It introduces a Mamba-based cost model that captures long-range schedule dependencies at a favorable accuracy-efficiency trade-off, via reduced parameterization and lightweight sequence modeling (second sketch below).
- TCL also uses continual knowledge distillation to progressively transfer optimization knowledge across hardware platforms, avoiding the parameter explosion and data dependence common in traditional multi-task learning (third sketch below).
- Experiments show TCL substantially improves tuning speed (16.8x on average on CPU, 12.48x on GPU) and modestly reduces the latency of the tuned programs versus Tenset-MLP (1.20x on CPU, 1.13x on GPU).
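To make the RDU Sampler's idea concrete, here is a minimal greedy-selection sketch. It is not the paper's implementation: the equal weighting of the three terms, the Euclidean distances, the centroid-based representativeness score, and the greedy loop are all assumptions for illustration.

```python
import numpy as np

def rdu_select(features, uncertainties, budget):
    """Greedy active-learning selection combining representativeness,
    diversity, and uncertainty (RDU). Hypothetical scoring; the paper's
    exact formulation may differ.

    features:      (n, d) array of program feature vectors
    uncertainties: (n,) model uncertainty per program (e.g. ensemble variance)
    budget:        number of programs to label (~10% of n in the paper)
    """
    n = features.shape[0]
    centroid = features.mean(axis=0)
    # Representativeness: closeness to the dataset centroid (higher = better),
    # normalized to [0, 1].
    rep = -np.linalg.norm(features - centroid, axis=1)
    rep = (rep - rep.min()) / (np.ptp(rep) + 1e-8)
    unc = (uncertainties - uncertainties.min()) / (np.ptp(uncertainties) + 1e-8)

    selected = []
    min_dist = np.full(n, np.inf)  # distance to nearest already-selected point
    for _ in range(budget):
        # Diversity: far from everything selected so far (ones on the first pick).
        div = min_dist / (min_dist.max() + 1e-8) if selected else np.ones(n)
        score = rep + div + unc       # equal weights: an assumption
        score[selected] = -np.inf     # never re-pick a program
        i = int(np.argmax(score))
        selected.append(i)
        # Update the diversity term with distances to the new pick.
        d = np.linalg.norm(features - features[i], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected

# Usage: pick a ~10% labeling budget from 1,000 candidate programs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 16))
unc = rng.random(1000)
picked = rdu_select(feats, unc, budget=100)
```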
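The cost model's interface is easier to see in code: embed each schedule step, run a sequence backbone over the steps, and regress a cost score. The paper's backbone is a Mamba state-space block; a GRU stands in below so the sketch runs with plain PyTorch, and the feature width is an assumed placeholder.

```python
import torch
import torch.nn as nn

class ScheduleCostModel(nn.Module):
    """Sequence cost model over per-step schedule features. The paper uses a
    Mamba (state-space) backbone for long-range dependencies; a GRU stands in
    here so the sketch has no extra dependencies."""
    def __init__(self, feat_dim=164, hidden=128):  # feat_dim: assumed width
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # predicted cost score

    def forward(self, x):                 # x: (batch, steps, feat_dim)
        h, _ = self.backbone(self.proj(x))
        return self.head(h[:, -1]).squeeze(-1)  # score from the last step

# Usage: score 8 candidate schedules of 20 steps each.
model = ScheduleCostModel()
x = torch.randn(8, 20, 164)
scores = model(x)  # shape (8,)
```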
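Finally, a minimal sketch of the continual knowledge distillation step: when moving to new hardware, the student cost model fits freshly measured latencies while staying close to a frozen teacher trained on the previous platform. The MSE losses and the `alpha` balance are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def continual_kd_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Blend the new-hardware regression loss with a distillation term that
    preserves knowledge from the frozen previous-hardware teacher."""
    task = F.mse_loss(student_pred, target)           # new-hardware labels
    distill = F.mse_loss(student_pred, teacher_pred)  # old-hardware knowledge
    return alpha * task + (1.0 - alpha) * distill

def train_step(student, teacher, optimizer, feats, target, alpha=0.5):
    """One update: the teacher is frozen, only the student adapts."""
    with torch.no_grad():
        teacher_pred = teacher(feats)
    student_pred = student(feats)
    loss = continual_kd_loss(student_pred, teacher_pred, target, alpha)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the single student model is updated per platform, this avoids the per-task parameter growth that the multi-task alternative mentioned in the key points would incur.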