Sparse-by-Design Cross-Modality Prediction: L0-Gated Representations for Reliable and Efficient Learning

arXiv cs.LG / 3/31/2026


Key Points

  • The paper proposes a unified, modality-agnostic sparsification method that makes accuracy–efficiency trade-offs comparable across the heterogeneous modalities found in KDD pipelines, such as graphs, text, and tabular data.
  • It introduces L0GM, which applies L0-style sparsity directly to learned, classifier-facing representations using feature-wise hard-concrete gating with an explicit knob controlling the active fraction of features.
  • An L0-annealing schedule is used to stabilize training and produce clearer, interpretable accuracy–sparsity Pareto frontiers.
  • Experiments on ogbn-products, Adult, and IMDB show competitive performance while activating fewer representation dimensions and improving probability calibration as measured by reduced Expected Calibration Error (ECE).
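To make the gating idea in the points above concrete, here is a minimal sketch of feature-wise hard-concrete gates, following the standard hard-concrete formulation. It is not the paper's implementation. The stretch limits GAMMA and ZETA, the temperature BETA, the helper names, and the linear warm-up in l0_weight are all assumptions chosen for illustration; the summary does not specify the actual control knob or annealing schedule.

```python
import math
import random

# Assumed hyperparameters (common hard-concrete defaults, not taken from the paper).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_gate(log_alpha, rng):
    """Draw one stochastic gate in [0, 1] for a feature with logit log_alpha."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # hard clip: exact zeros are possible


def prob_active(log_alpha):
    """Probability that a gate is non-zero (the differentiable L0 surrogate)."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_active_fraction(log_alphas):
    """Expected fraction of representation dimensions left switched on."""
    return sum(prob_active(a) for a in log_alphas) / len(log_alphas)


def gate_representation(h, log_alphas, rng):
    """Multiply a classifier-facing representation by per-feature gates."""
    return [x * sample_gate(a, rng) for x, a in zip(h, log_alphas)]


def l0_weight(step, warmup_steps, lam_max):
    """Hypothetical annealing: ramp the L0 penalty weight linearly to lam_max."""
    return lam_max * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    rng = random.Random(0)
    log_alphas = [8.0, 8.0, -8.0, -8.0]  # two features favoured, two suppressed
    print(gate_representation([1.0, 2.0, 3.0, 4.0], log_alphas, rng))
    print(expected_active_fraction(log_alphas))
```

In a training loop under these assumptions, the loss would be the task loss plus l0_weight(step, ...) times the sum of prob_active over features, so the sparsity pressure grows gradually instead of being applied at full strength from the first step.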

Abstract

Predictive systems increasingly span heterogeneous modalities such as graphs, language, and tabular records, but sparsity and efficiency remain modality-specific (graph edge or neighborhood sparsification, Transformer head or layer pruning, and separate tabular feature-selection pipelines). This fragmentation makes results hard to compare, complicates deployment, and weakens reliability analysis across end-to-end KDD pipelines. A unified sparsification primitive would make accuracy-efficiency trade-offs comparable across modalities and enable controlled reliability analysis under representation compression. We ask whether a single representation-level mechanism can yield comparable accuracy-efficiency trade-offs across modalities while preserving or improving probability calibration. We propose L0-Gated Cross-Modality Learning (L0GM), a modality-agnostic, feature-wise hard-concrete gating framework that enforces L0-style sparsity directly on learned representations. L0GM attaches hard-concrete stochastic gates to each modality's classifier-facing interface: node embeddings (GNNs), pooled sequence embeddings such as CLS (Transformers), and learned tabular embedding vectors (tabular models). This yields end-to-end trainable sparsification with an explicit control knob for the active feature fraction. To stabilize optimization and make trade-offs interpretable, we introduce an L0-annealing schedule that induces clear accuracy-sparsity Pareto frontiers. Across three public benchmarks (ogbn-products, Adult, IMDB), L0GM achieves competitive predictive performance while activating fewer representation dimensions, and it reduces Expected Calibration Error (ECE) in our evaluation. Overall, L0GM establishes a modality-agnostic, reproducible sparsification primitive that supports comparable accuracy, efficiency, and calibration trade-off analysis across heterogeneous modalities.
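Since the abstract reports calibration through Expected Calibration Error, a short sketch of the usual binned ECE estimate may help. It assumes equal-width confidence bins and top-label confidences; the function name and the choice of 10 bins are illustrative assumptions, and the paper's exact evaluation protocol is not given here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: truthy where the prediction matched the label.
    Bins are half-open [lo, hi), with confidence 1.0 placed in the last bin.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(ok)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 90% confidence, three of them right.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value means predicted confidences track observed accuracy more closely, which is the sense in which the abstract says L0GM reduces ECE.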