Algorithm Selection with Zero Domain Knowledge via Text Embeddings

arXiv cs.LG, April 23, 2026


Key Points

  • The paper introduces ZeroFolio, a feature-free algorithm-selection method that uses pretrained text embeddings instead of hand-crafted instance features.
  • ZeroFolio converts raw problem instances into plain text, embeds them with a pretrained model, and chooses an algorithm using weighted k-nearest neighbors over the embedding space.
  • The authors argue that pretrained embeddings can distinguish problem instances effectively even without any domain knowledge or task-specific training, enabling a reusable three-step pipeline across many domains.
  • Experiments on 11 ASlib scenarios across 7 domains show ZeroFolio beats a random-forest baseline trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 when using two-seed voting.
  • Ablation results identify inverse-distance weighting, line shuffling, and Manhattan distance as the key design choices, and the authors further find that combining the embedding-based selector with hand-crafted features via soft voting helps when both selectors are competitive.
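The three steps above (serialize, embed, select), together with the ablation's key design choices (line shuffling, Manhattan distance, inverse-distance weighting), can be sketched as follows. Note this is a minimal illustration, not the paper's implementation: `embed` is a toy character-frequency stand-in for the pretrained text-embedding model, and the hyperparameters (`k`, the shuffle seed) are assumptions rather than values from the paper.

```python
import math
import random
from collections import defaultdict

def serialize(instance_text, seed=0):
    """Step 1: read the raw instance file as plain text.
    The ablation found line shuffling helps, so we shuffle
    with a fixed seed (seed value is an assumption)."""
    lines = instance_text.splitlines()
    random.Random(seed).shuffle(lines)
    return "\n".join(lines)

def embed(text):
    """Step 2: placeholder for a pretrained embedding model.
    ASSUMPTION: any model mapping text -> fixed-length vector
    fits here; this toy version just counts letter frequencies."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def manhattan(u, v):
    """L1 (Manhattan) distance, the ablation's preferred metric."""
    return sum(abs(a - b) for a, b in zip(u, v))

def select(query_vec, train, k=3, eps=1e-8):
    """Step 3: weighted k-NN. Each of the k nearest training
    instances votes for its best-known algorithm with weight
    1/distance (inverse-distance weighting); highest total wins."""
    neighbors = sorted((manhattan(query_vec, v), algo) for v, algo in train)[:k]
    votes = defaultdict(float)
    for d, algo in neighbors:
        votes[algo] += 1.0 / (d + eps)
    return max(votes, key=votes.get)
```

With a training set of (embedding, best-algorithm) pairs, `select(embed(serialize(raw_text)), train)` returns an algorithm name; the whole pipeline needs no domain-specific features, which is the point the authors emphasize.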

Abstract

We propose a feature-free approach to algorithm selection that replaces hand-crafted instance features with pretrained text embeddings. Our method, ZeroFolio, proceeds in three steps: it reads the raw instance file as plain text, embeds it with a pretrained embedding model, and selects an algorithm via weighted k-nearest neighbors. The key to our approach is the observation that pretrained embeddings produce representations that distinguish problem instances without any domain knowledge or task-specific training. This allows us to apply the same three-step pipeline (serialize, embed, select) across diverse problem domains with text-based instance formats. We evaluate our approach on 11 ASlib scenarios spanning 7 domains (SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems). Our experiments show that this approach outperforms a random forest trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 with two-seed voting; the margin is often substantial. Our ablation study shows that inverse-distance weighting, line shuffling, and Manhattan distance are the key design choices. On scenarios where both selectors are competitive, combining embeddings with hand-crafted features via soft voting yields further improvements.
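The abstract's final point, combining the embedding-based selector with a hand-crafted-feature selector via soft voting, might look like the sketch below. Here each selector is assumed to output a per-algorithm score dict, and the equal weighting is an illustrative assumption; the paper's exact combination scheme is not spelled out in this summary.

```python
def soft_vote(probs_embed, probs_features, weight=0.5):
    """Soft voting: average the two selectors' per-algorithm scores
    (weight and score format are assumptions, not the paper's spec)
    and pick the algorithm with the highest combined score."""
    algos = set(probs_embed) | set(probs_features)
    combined = {
        a: weight * probs_embed.get(a, 0.0)
           + (1 - weight) * probs_features.get(a, 0.0)
        for a in algos
    }
    return max(combined, key=combined.get)
```

For example, if the embedding selector favors algorithm A (0.6 vs 0.4) but the feature-based selector favors B more strongly (0.7 vs 0.3), an equal-weight soft vote picks B, which matches the intuition that combining helps most when both selectors are individually competitive.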