Abstract
Despite their dominance in vision and language, deep neural networks often underperform tree-based models on tabular data. To bridge this gap, we incorporate five key inductive biases into deep learning: robustness to irrelevant features, axis alignment, the ability to model localized irregularities, feature heterogeneity, and training stability. We propose \emph{LassoFlexNet}, an architecture that evaluates the linear and nonlinear marginal contribution of each input via Per-Feature Embeddings, and sparsely selects relevant variables through a Tied Group Lasso mechanism. Because these components introduce optimization challenges that destabilize standard proximal methods, we develop a \emph{Sequential Hierarchical Proximal Adaptive Gradient optimizer with exponential moving averages (EMA)} to ensure stable convergence. Across 52 datasets from three benchmarks, LassoFlexNet matches or outperforms leading tree-based models, achieving up to a 10\% relative gain, while retaining Lasso-like interpretability. We substantiate these empirical results with ablation studies and theoretical proofs showing that the architecture increases expressivity and structurally breaks the undesired rotational invariance of standard networks.