Pack only the essentials: Adaptive dictionary learning for kernel ridge regression

arXiv cs.LG / 4/27/2026


Key Points

  • Kernel ridge regression (KRR) is limited by the O(n^2) space required to store and manipulate the full n×n kernel matrix, making it impractical for large datasets.
  • Nystrom approximations built from m uniformly sampled columns reduce storage to O(nm), but on datasets where the kernel matrix has high coherence they can require m ≈ O(n) columns (see the sketch after this list).
  • Sampling columns according to their ridge leverage scores (RLS) yields accurate Nystrom approximations with m proportional to the effective dimension, but computing exact RLS itself costs O(n^2) space.
  • The paper proposes SQUEAK, an algorithm that builds on INK-Estimate but uses unnormalized RLS. This simplifies the procedure, removes the need to estimate the effective dimension for normalization, and keeps the space complexity within a constant factor of exact RLS sampling.

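To ground the O(n^2)-versus-O(nm) trade-off above, here is a minimal numpy sketch (ours, not the paper's code) of exact KRR next to a Nystrom approximation with uniform column sampling. The RBF kernel, the regularization `lam`, and the "subset of regressors" solve are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def krr_exact(X, y, lam=1e-2):
    # Exact KRR: materializes the full n x n kernel matrix -> O(n^2) space.
    n = X.shape[0]
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return lambda Xt: rbf_kernel(Xt, X) @ alpha

def krr_nystrom_uniform(X, y, m, lam=1e-2, seed=0):
    # Nystrom KRR from m uniformly sampled columns -> only n x m and
    # m x m blocks are ever formed, hence O(nm) space.
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=m, replace=False)
    Knm = rbf_kernel(X, X[idx])          # n x m block of K_n
    Kmm = Knm[idx]                       # m x m block of K_n
    # "Subset of regressors" solve; the 1e-10 jitter is for stability.
    w = np.linalg.solve(Knm.T @ Knm + lam * n * Kmm + 1e-10 * np.eye(m),
                        Knm.T @ y)
    return lambda Xt: rbf_kernel(Xt, X[idx]) @ w

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
f_exact = krr_exact(X, y)
f_nys = krr_nystrom_uniform(X, y, m=50)
print(np.abs(f_exact(X[:5]) - f_nys(X[:5])))
```

When the kernel has high coherence, a few columns carry most of the information, and a uniform sample of m ≪ n columns is likely to miss them; that is the regime where m can have to grow toward O(n) for the approximation above to stay accurate.
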
Abstract

One of the major limits of kernel ridge regression (KRR) is that storing and manipulating the kernel matrix K_n for n samples requires O(n^2) space, which quickly becomes infeasible for large n. Nystrom approximations reduce the space complexity to O(nm) by sampling m columns from K_n. Uniform sampling preserves KRR accuracy (up to epsilon) only when m is proportional to the maximum degree of freedom of K_n, which may require O(n) columns for datasets with high coherence. Sampling columns according to their ridge leverage scores (RLS) gives accurate Nystrom approximations with m proportional to the effective dimension, but computing exact RLS also requires O(n^2) space. Calandriello et al. (2016) propose INK-Estimate, an algorithm that processes the dataset incrementally and updates the RLS, the effective dimension, and the Nystrom approximation on the fly. Its space complexity scales with the effective dimension, but it introduces a dependency on the largest eigenvalue of K_n, which in the worst case is O(n). In this paper we introduce SQUEAK, a new algorithm that builds on INK-Estimate but uses unnormalized RLS. As a consequence, the algorithm is simpler, does not need to estimate the effective dimension for normalization, and achieves a space complexity that is only a constant factor worse than exact RLS sampling.
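
To make the quantities in the abstract concrete, the sketch below (ours, not SQUEAK) computes exact RLS and the effective dimension under one common convention, tau_i = [K (K + lam·n·I)^{-1}]_{ii} with d_eff = sum_i tau_i; the exact regularization convention and the name `lam` are assumptions. This is precisely the O(n^2) baseline the paper avoids, since it forms the full kernel matrix.

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    # Exact RLS under one common convention:
    #   tau_i = [K (K + lam*n*I)^{-1}]_{ii}
    # Requires the full n x n matrix K, i.e. the O(n^2) space cost
    # that SQUEAK sidesteps by estimating RLS incrementally.
    n = K.shape[0]
    M = np.linalg.solve(K + lam * n * np.eye(n), K)  # = (K + lam*n*I)^{-1} K
    return np.diag(M)

def rls_column_sample(K, lam, m, seed=0):
    # Sample m columns with probability proportional to their RLS;
    # the normalizer is the effective dimension d_eff = sum_i tau_i.
    tau = ridge_leverage_scores(K, lam)
    d_eff = tau.sum()
    p = tau / d_eff
    idx = np.random.default_rng(seed).choice(K.shape[0], size=m,
                                             replace=True, p=p)
    return idx, d_eff
```

Note that normalized sampling needs d_eff as a normalizer; per the abstract, SQUEAK avoids exactly this step by working with unnormalized RLS, which is what removes the effective-dimension estimation from the procedure.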