Second-Order, First-Class: A Composable Stack for Curvature-Aware Training

arXiv cs.LG / 3/30/2026



Key Points

The paper argues that second-order, curvature-aware training methods are underused because existing approaches are hard to implement, brittle to tune, and lack composable APIs.
It introduces Somax, an Optax-native “composable stack” that packages curvature-aware training into a single JIT-compiled step driven by a static execution plan.
Somax exposes first-class, swappable modules for curvature operators, estimators, linear solvers, preconditioners, and damping policies behind a single step interface. It composes with Optax by applying standard gradient transformations (momentum, weight decay, learning-rate schedules) to the computed direction.
By separating planning from execution, Somax reuses intermediate results and reduces per-step overhead compared with unplanned compositions that recompute redundantly.
Reported system-oriented ablations show that composition choices materially affect scaling behavior and time-to-accuracy, and that planning reduces per-step overhead relative to unplanned composition.
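
To make the composition idea concrete, here is a minimal pure-Python sketch of a curvature-aware step assembled from swappable parts: a curvature operator, a damping policy, a linear solver, and a chain of post-processing transformations applied to the computed direction. Every name in it (curvature_matvec, constant_damping, cg_solve, make_momentum, make_scale, chain, make_step) is hypothetical and chosen for illustration; none of it is Somax or Optax API, and the toy quadratic loss is an assumption, not an experiment from the paper.

```python
# Illustrative sketch only: all names are hypothetical, not Somax or Optax API.
# Toy loss (an assumption): 0.5 * w^T A w - b^T w with A symmetric positive definite.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def axpy(alpha, x, y):
    """Return alpha * x + y, elementwise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gradient(w):
    return [dot(row, w) - bi for row, bi in zip(A, b)]

def curvature_matvec(v):
    """Curvature operator: Hessian-vector product (the Hessian of the toy loss is A)."""
    return [dot(row, v) for row in A]

def constant_damping(lam):
    """Damping policy: the same damping value at every step."""
    return lambda step_index: lam

def cg_solve(matvec, rhs, iters=50, tol=1e-10):
    """Linear solver: conjugate gradients with a relative residual tolerance."""
    x = [0.0] * len(rhs)
    r = list(rhs)
    p = list(r)
    rs = dot(r, r)
    threshold = (tol ** 2) * rs
    for _ in range(iters):
        if rs <= threshold:
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)  # p = r + beta * p
        rs = rs_new
    return x

def make_momentum(decay):
    """Post-transformation: accumulate an exponentially decayed trace of updates."""
    state = {"trace": None}
    def transform(update):
        prev = state["trace"] if state["trace"] is not None else [0.0] * len(update)
        state["trace"] = axpy(decay, prev, update)
        return state["trace"]
    return transform

def make_scale(factor):
    return lambda update: [factor * u for u in update]

def chain(*transforms):
    def apply(update):
        for t in transforms:
            update = t(update)
        return update
    return apply

def make_step(matvec, damping, solver, post):
    """Assemble one training step from swappable parts."""
    def step(w, step_index):
        g = gradient(w)
        lam = damping(step_index)
        damped = lambda v: axpy(lam, v, matvec(v))  # (H + lam * I) v
        direction = solver(damped, g)               # solve (H + lam * I) d = g
        update = post(direction)                    # momentum, then step-size scaling
        return [wi + ui for wi, ui in zip(w, update)]
    return step

step = make_step(curvature_matvec, constant_damping(1e-3), cg_solve,
                 chain(make_momentum(0.5), make_scale(-0.5)))
w = [0.0, 0.0]
for i in range(60):
    w = step(w, i)
final_grad_norm = dot(gradient(w), gradient(w)) ** 0.5
```

Swapping the solver, the damping policy, or the post-processing chain changes only the arguments to make_step, which is the kind of explicit, swappable choice the key points describe. A real implementation would operate on parameter pytrees under JIT rather than Python lists.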

Abstract

Second-order methods promise improved stability and faster convergence, yet they remain underused due to implementation overhead, tuning brittleness, and the lack of composable APIs. We introduce Somax, a composable Optax-native stack that treats curvature-aware training as a single JIT-compiled step governed by a static plan. Somax exposes first-class modules -- curvature operators, estimators, linear solvers, preconditioners, and damping policies -- behind a single step interface and composes with Optax by applying standard gradient transformations (e.g., momentum, weight decay, schedules) to the computed direction. This design makes typically hidden choices explicit and swappable. Somax separates planning from execution: it derives a static plan (including cadences) from module requirements, then runs the step through a specialized execution path that reuses intermediate results across modules. We report system-oriented ablations showing that (i) composition choices materially affect scaling behavior and time-to-accuracy, and (ii) planning reduces per-step overhead relative to unplanned composition with redundant recomputation.
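
The separation of planning from execution can be sketched in a few lines of plain Python. In this hypothetical toy, each module declares which intermediates it reads and how often it runs (its cadence); a planner derives a fixed plan from those declarations once, and the planned executor computes each intermediate at most once per step and shares it across modules. The names Module, Plan, make_plan, run_planned, and run_unplanned, the module list, and the cadence values are all assumptions made for illustration and are not taken from Somax.

```python
# Hypothetical sketch of "plan once, then execute with reuse"; not Somax API.
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Module:
    name: str
    requires: Tuple[str, ...]  # intermediates this module reads
    cadence: int = 1           # run every `cadence` steps

@dataclass(frozen=True)
class Plan:
    intermediates: Tuple[str, ...]         # deduplicated, in first-use order
    schedule: Tuple[Tuple[str, int], ...]  # (module name, cadence)

def make_plan(modules):
    """Derive a static plan from the modules' declared requirements."""
    needed = []
    for m in modules:
        for r in m.requires:
            if r not in needed:
                needed.append(r)
    return Plan(tuple(needed), tuple((m.name, m.cadence) for m in modules))

def run_planned(plan, modules, step, counts):
    """Compute each needed intermediate once, then share it across modules."""
    active = [m for m in modules if step % m.cadence == 0]
    wanted = {r for m in active for r in m.requires}
    cache = {}
    for name in plan.intermediates:
        if name in wanted:
            counts[name] += 1
            cache[name] = f"{name}@{step}"  # stand-in for a real computation
    return {m.name: tuple(cache[r] for r in m.requires) for m in active}

def run_unplanned(modules, step, counts):
    """Baseline: every module recomputes whatever it needs."""
    out = {}
    for m in modules:
        if step % m.cadence == 0:
            values = []
            for r in m.requires:
                counts[r] += 1
                values.append(f"{r}@{step}")
            out[m.name] = tuple(values)
    return out

modules = [
    Module("estimator", ("grad", "hvp")),
    Module("preconditioner", ("hvp",), cadence=4),  # refreshed every 4 steps
    Module("solver", ("grad", "hvp")),
]
plan = make_plan(modules)

planned_counts, unplanned_counts = Counter(), Counter()
for s in range(8):
    planned_out = run_planned(plan, modules, s, planned_counts)
    unplanned_out = run_unplanned(modules, s, unplanned_counts)
    assert planned_out == unplanned_out  # same results, fewer computations
```

Over these eight toy steps the planned path performs 16 intermediate computations and the unplanned path 34, while both produce identical module inputs. The counts only illustrate redundant recomputation in this made-up setup and say nothing about the overheads measured in the paper.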