Author: Denis Avetisyan
This review examines how incorporating curvature information and adaptive techniques can significantly improve the training of large neural networks.

A comprehensive analysis of curvature-aware optimization, modular norms, and preconditioning strategies for enhanced generalization and scalability in deep learning.
Despite the empirical success of stochastic gradient descent in training deep neural networks, a principled understanding of its behavior in over-parameterized regimes remains elusive. This thesis, ‘Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale’, investigates this paradox by systematically exploring curvature-aware optimization techniques, from second-order approximations to adaptive preconditioning, and by establishing a unifying framework based on modular norms. Our analysis reveals that incorporating curvature information not only accelerates convergence but also improves generalization performance across diverse datasets. Will these advancements pave the way for more robust and interpretable deep learning models capable of tackling increasingly complex challenges?
The Descent: From Simple Steps to Sophisticated Strategies
Machine learning algorithms frequently seek the best possible solution – the parameters that minimize a specific error function – and this pursuit fundamentally relies on iterative optimization. The most basic of these methods is Gradient Descent, a technique that repeatedly adjusts parameters in the direction of the steepest decrease in error. Imagine a landscape where the height represents the error; Gradient Descent is like rolling a ball downhill. While conceptually simple and guaranteed to eventually converge given certain conditions, standard Gradient Descent can be remarkably slow, particularly in high-dimensional spaces or when the landscape features long, narrow valleys. Each step taken is determined solely by the immediate gradient, meaning the algorithm can oscillate wildly or get stuck in local minima, requiring a small learning rate to prevent overshooting and ensuring stable, though often sluggish, progress towards an optimal solution. This inefficiency motivates the development of more sophisticated optimization techniques, building upon the foundation of this initial, albeit slow, approach.
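To make the mechanics concrete, here is a minimal NumPy sketch of plain gradient descent on a deliberately ill-conditioned quadratic; the function names, hyperparameters, and toy loss are illustrative choices, not anything prescribed by the thesis.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, n_steps=100):
    """Plain gradient descent: repeatedly step against the gradient at the current point."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad_fn(theta)   # move downhill by a fixed step
    return theta

# Toy example: minimize the ill-conditioned quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])                      # a long, narrow valley
grad = lambda x: A @ x
print(gradient_descent(grad, [1.0, 1.0], lr=0.01, n_steps=500))
```

Note that the learning rate must stay small enough for the steepest direction (here the coordinate with curvature 100), which is exactly what makes progress along the shallow direction so slow.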
The optimization process in machine learning often encounters jagged, oscillating paths towards a solution, slowing down convergence. Momentum addresses this by incorporating a fraction of the previous update vector into the current one, effectively adding ‘inertia’ to the descent. This allows the algorithm to continue moving in a generally consistent direction, smoothing out oscillations and accelerating progress, particularly in high-dimensional spaces or when gradients are noisy. However, momentum isn’t a panacea; excessive momentum can cause the algorithm to overshoot the optimal solution, leading to instability, while insufficient momentum may fail to dampen oscillations effectively. Careful tuning of the momentum parameter – often denoted as $\beta$ – is therefore crucial to balance the benefits of acceleration with the risk of divergence, and its effectiveness is also dependent on the learning rate chosen for the gradient updates.
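A hedged sketch of the heavy-ball variant follows, reusing the same toy quadratic; the velocity buffer `v` is the only change relative to plain gradient descent, and the specific coefficients are illustrative.

```python
import numpy as np

def momentum_descent(grad_fn, theta0, lr=0.01, beta=0.9, n_steps=500):
    """Heavy-ball momentum: accumulate a velocity that smooths oscillating gradients."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = beta * v - lr * grad_fn(theta)     # inertia term plus the current gradient step
        theta = theta + v
    return theta

A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
print(momentum_descent(grad, [1.0, 1.0]))
```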
Nesterov’s Accelerated Gradient represents a refinement of traditional momentum-based optimization, achieving faster convergence by proactively anticipating the next position. Instead of calculating the gradient at the current parameter location, as standard momentum does, this method first takes a provisional step in the direction of the accumulated momentum, then evaluates the gradient at this “look-ahead” position. This seemingly subtle change allows the algorithm to correct its course before overshooting the optimal point, effectively reducing oscillations and accelerating learning. The core innovation lies in evaluating the gradient at the anticipated position: $v_{t+1} = \mu v_t - \eta \nabla F(x_t + \mu v_t)$ followed by $x_{t+1} = x_t + v_{t+1}$, where $v_t$ is the velocity and $\mu$ the momentum coefficient. The result is demonstrably faster convergence, particularly in scenarios with high curvature or noisy gradients, making Nesterov’s Accelerated Gradient a powerful technique for training complex machine learning models efficiently.
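The look-ahead evaluation is easiest to see in code. The sketch below follows the standard formulation above; the toy problem and hyperparameters are again illustrative rather than taken from the thesis.

```python
import numpy as np

def nesterov(grad_fn, theta0, lr=0.01, mu=0.9, n_steps=500):
    """Nesterov momentum: evaluate the gradient at the look-ahead point theta + mu * v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        lookahead = theta + mu * v             # provisional step along the momentum
        v = mu * v - lr * grad_fn(lookahead)   # correct course using the look-ahead gradient
        theta = theta + v
    return theta

A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
print(nesterov(grad, [1.0, 1.0]))
```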

Beyond First-Order Thinking: The Allure of Curvature
Newton’s method, an iterative optimization algorithm, utilizes second-order derivative information – specifically, the Hessian matrix – to determine search direction and update model parameters. Unlike first-order methods which use only gradients, incorporating the Hessian allows Newton’s method to account for the curvature of the loss function, enabling faster convergence, particularly in regions with high curvature. The update rule involves calculating the inverse of the Hessian, multiplied by the gradient, to determine the step size and direction: $ \Delta \theta = -H^{-1}\nabla J(\theta)$, where $J(\theta)$ is the loss function and $\theta$ represents the model parameters. However, computing and inverting the Hessian matrix has a computational complexity of $O(n^3)$ and requires $O(n^2)$ memory, where $n$ is the number of parameters, rendering it impractical for training large-scale machine learning models with millions or billions of parameters.
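For intuition, a single Newton step on a small quadratic is shown below; the linear solve stands in for the $H^{-1}\nabla J(\theta)$ product, and the cubic cost of that solve is exactly what rules the method out at scale. The helper names are illustrative.

```python
import numpy as np

def newton_step(grad_fn, hess_fn, theta):
    """One Newton update: solve H d = -g instead of forming H^{-1} explicitly."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    d = np.linalg.solve(H, -g)                 # O(n^3) solve; impractical for large n
    return theta + d

# For a quadratic f(x) = 0.5 * x^T A x, a single Newton step lands on the minimum.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
hess = lambda x: A
print(newton_step(grad, hess, np.array([1.0, 1.0])))   # -> [0., 0.]
```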
Quasi-Newton methods address the computational expense of Newton’s method by approximating the inverse of the Hessian matrix, denoted as $H^{-1}$, without explicitly calculating it. These methods, such as BFGS and L-BFGS, build up an approximation of $H^{-1}$ iteratively using gradient information gathered during optimization. Specifically, they employ updates based on the difference between successive gradient evaluations and parameter changes. L-BFGS is particularly suited for large-scale optimization as it stores only a limited number of gradient history vectors, reducing memory requirements compared to BFGS which stores the full approximate Hessian. While not guaranteeing the quadratic convergence of Newton’s method, Quasi-Newton methods typically exhibit superlinear convergence and offer a practical balance between computational cost and convergence speed.
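In practice one rarely codes the limited-memory recursion by hand; a minimal sketch using SciPy’s L-BFGS-B routine on the Rosenbrock function illustrates the idea, with `maxcor` controlling how many history pairs are kept to approximate $H^{-1}$.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS keeps only a short history of (parameter change, gradient change) pairs
# to represent H^{-1} implicitly, instead of storing a dense n x n matrix.
x0 = np.full(10, 1.3)
result = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
                  options={"maxcor": 10})      # maxcor = number of stored history pairs
print(result.x, result.nit)
```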
Curvature-Aware Optimization encompasses techniques designed to improve the efficiency of iterative optimization algorithms by considering the second derivatives, or curvature, of the loss function. Traditional first-order methods, such as stochastic gradient descent, utilize only gradient information; however, curvature provides insights into the local shape of the loss surface, allowing for adaptive step size and direction adjustments. By exploiting this geometric information, these methods aim to accelerate convergence, particularly in regions with high or varying curvature. This can involve approximating the Hessian matrix, preconditioning the gradient, or directly incorporating curvature into the update rule, leading to more informed and efficient parameter updates compared to methods relying solely on first-order gradients.

Adaptation is Key: The Rise of Smarter Optimizers
Adaptive gradient algorithms, such as AdaGrad and RMSProp, improve upon traditional stochastic gradient descent by individually adjusting the learning rate for each parameter during training. Fixed learning rates often struggle with sparse data or features exhibiting varying scales; parameters associated with infrequent updates may benefit from larger learning rates, while those receiving frequent updates may require smaller rates to prevent oscillation. AdaGrad addresses this by accumulating the sum of squared gradients for each parameter, effectively decreasing the learning rate for parameters with large cumulative gradients. RMSProp builds upon this by using a decaying average of past squared gradients, mitigating AdaGrad’s tendency for the learning rate to diminish too rapidly. This per-parameter adaptation allows for faster convergence and improved performance, particularly in non-convex optimization landscapes and with high-dimensional datasets, by tailoring the update step size to the historical gradient information of each individual parameter.
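A compact sketch of the RMSProp update (with AdaGrad noted as the no-decay special case) shows the per-parameter scaling; hyperparameters and the toy quadratic are illustrative.

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, decay=0.9, eps=1e-8, n_steps=500):
    """RMSProp: divide each coordinate's step by a decayed RMS of its past gradients."""
    theta = np.asarray(theta0, dtype=float)
    sq_avg = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        sq_avg = decay * sq_avg + (1 - decay) * g**2      # running average of g^2
        theta = theta - lr * g / (np.sqrt(sq_avg) + eps)  # per-parameter step size
    return theta

# AdaGrad is the special case that accumulates g^2 without decay:
#   sq_sum += g**2;  theta -= lr * g / (np.sqrt(sq_sum) + eps)
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
print(rmsprop(grad, [1.0, 1.0]))
```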
The Adam optimizer, short for Adaptive Moment Estimation, integrates the strengths of both momentum and adaptive learning rate methods. It computes adaptive learning rates for each parameter by estimating first and second moments of the gradients. Specifically, it maintains an exponentially decaying average of past gradients (momentum) and an exponentially decaying average of squared gradients to normalize the learning rate. This normalization addresses the issue of differing sensitivities to parameter updates encountered with simpler methods. The update rule incorporates bias correction to counteract the initialization bias of these estimates, particularly during initial optimization steps. Consequently, Adam has become a popular default optimizer in many deep learning applications due to its efficiency, robustness, and relatively low computational overhead, often requiring minimal hyperparameter tuning.
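The bias-corrected moment estimates are short enough to write out in full; the following sketch mirrors the standard Adam update with default-style hyperparameters, on the same illustrative toy problem.

```python
import numpy as np

def adam(grad_fn, theta0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=2000):
    """Adam: first/second moment estimates with bias correction for the early steps."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                   # first moment (mean of gradients)
    v = np.zeros_like(theta)                   # second moment (mean of squared gradients)
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
print(adam(grad, [1.0, 1.0]))
```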
Learning rate schedules systematically alter the learning rate during training to improve convergence. Initial high learning rates accelerate progress at the beginning of training, while decreasing rates later on facilitate convergence and prevent oscillations as the model approaches a minimum. Common schedules include step decay, where the learning rate is reduced by a factor at predefined intervals; exponential decay, applying a multiplicative factor at each step; and cosine annealing, which follows a cosine function to smoothly decrease the rate. These schedules address the trade-off between rapid initial learning and stable convergence, often outperforming constant learning rates in both speed and final performance, particularly in deep neural networks where complex loss landscapes are common. The optimal schedule is problem-dependent and often determined empirically.
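Two common schedules are easy to express directly; the sketch below implements cosine annealing and step decay as plain functions of the step index, with illustrative constants.

```python
import math

def cosine_lr(step, total_steps, lr_max=0.1, lr_min=1e-4):
    """Cosine annealing: smoothly decay the learning rate from lr_max to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def step_decay_lr(step, lr0=0.1, drop=0.5, every=30):
    """Step decay: multiply the learning rate by `drop` every `every` steps."""
    return lr0 * (drop ** (step // every))

for s in (0, 30, 60, 90):
    print(s, round(cosine_lr(s, 100), 4), round(step_decay_lr(s), 4))
```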

Beyond Standard Gradients: Geometry and Preconditioning
Optimization algorithms traditionally treat all parameter updates identically, yet neural network layers exhibit vastly different geometries. The Modular Norm Framework addresses this by proposing that the choice of norm – the measure of a vector’s length – should align with the specific geometry of each layer. This isn’t simply about computational efficiency; it’s about respecting the intrinsic structure of the optimization landscape. By selecting norms that mirror the layer’s dimensionality and parameter distribution, the framework enables more efficient and stable training. For example, a layer with highly anisotropic parameter space might benefit from a norm that downweights updates along dominant directions, preventing oscillations and accelerating convergence. This systematic approach allows for the creation of optimizers tailored to individual layers, potentially surpassing the performance of one-size-fits-all methods and unlocking improvements in large-scale model training, especially when considering the impact of $L_p$ norms on gradient behavior.
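As a loose illustration of the idea rather than the thesis’s exact construction, the sketch below scales each parameter’s update under a norm matched to its shape: a crude power-iteration estimate of the spectral norm for weight matrices and an RMS norm for vectors, so that a single learning rate governs the size of every layer’s step. All names and choices here are hypothetical.

```python
import numpy as np

def spectral_norm_estimate(M, n_iters=10):
    """Rough largest-singular-value estimate via power iteration on M^T M."""
    v = np.random.default_rng(0).standard_normal(M.shape[1])
    for _ in range(n_iters):
        v = M.T @ (M @ v)
        v /= np.linalg.norm(v)
    return np.linalg.norm(M @ v)

def normalized_update(grad, lr=0.1):
    """Scale a parameter's gradient to unit size under a shape-appropriate norm."""
    if grad.ndim == 2:                         # weight matrix: operator-norm proxy
        scale = spectral_norm_estimate(grad) + 1e-12
    else:                                      # bias/vector: RMS norm
        scale = np.sqrt(np.mean(grad**2)) + 1e-12
    return -lr * grad / scale

W_grad = np.random.randn(64, 32)
b_grad = np.random.randn(64)
print(normalized_update(W_grad).shape, np.linalg.norm(normalized_update(b_grad)))
```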
Optimization in high-dimensional spaces often encounters challenges due to the loss function’s complex geometry. Dualized updates offer a solution by shifting the optimization process from the primal space – where parameters are directly adjusted – to the dual space, which represents gradients as fundamental quantities. This transformation allows the optimizer to inherently respect the curvature and structure of the loss landscape, leading to more efficient convergence. By operating directly on gradients, dualized methods effectively precondition the optimization problem, mitigating the ill-conditioning that frequently plagues large neural networks. The approach can be understood as finding optimal updates in terms of the gradient itself, rather than directly manipulating the parameters, resulting in a more geometrically informed and robust optimization trajectory. Ultimately, this yields faster training and improved generalization performance, particularly in scenarios with highly non-convex loss functions.
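A toy way to see dualization: steepest descent under a chosen norm replaces the raw gradient with the direction that maximizes descent subject to a unit-norm constraint. Under the Euclidean norm this is the normalized gradient; under the $\ell_\infty$ norm it is the sign of the gradient. The sketch below is purely illustrative and not the dualized update used in the thesis.

```python
import numpy as np

def dual_steepest_step(g, lr=0.1, norm="l2"):
    """Steepest-descent direction under a chosen norm: argmax_{||d|| <= 1} <g, d>."""
    if norm == "l2":
        d = g / (np.linalg.norm(g) + 1e-12)    # Euclidean: direction of the gradient
    elif norm == "linf":
        d = np.sign(g)                         # infinity norm: sign of each coordinate
    else:
        raise ValueError(norm)
    return -lr * d

g = np.array([3.0, -0.01, 0.5])
print(dual_steepest_step(g, norm="l2"))
print(dual_steepest_step(g, norm="linf"))
```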
Shampoo represents a significant advancement in optimization techniques for large machine learning models by employing a preconditioning method rooted in approximations of the Fisher Information Matrix. Rather than directly optimizing parameters, Shampoo transforms the optimization landscape, effectively rescaling gradients to accelerate convergence. This is achieved by constructing a preconditioner from the outer products of gradients, allowing the optimizer to navigate highly curved loss surfaces more efficiently. The core innovation lies in its ability to estimate the inverse of the Fisher matrix without explicitly computing it, a process that would be computationally prohibitive for models with billions of parameters. By decoupling parameter updates across layers and utilizing second-order information via these gradient approximations, Shampoo demonstrably improves training speed and final performance, particularly in scenarios where standard gradient descent struggles with ill-conditioned optimization problems – essentially smoothing the path toward minimal loss and enabling more robust and faster model training.
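A single-matrix, hedged sketch of a Shampoo-style step is given below: left and right gradient statistics are accumulated and their inverse fourth roots precondition the gradient. Real implementations amortize and approximate these eigendecompositions; everything here, including the toy gradient, is illustrative.

```python
import numpy as np

def inv_root(M, p, eps=1e-6):
    """Compute M^{-1/p} for a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    w = np.maximum(w, eps)
    return (Q * w ** (-1.0 / p)) @ Q.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style step for a single weight matrix W with gradient G."""
    L = L + G @ G.T                            # left (row-space) statistics
    R = R + G.T @ G                            # right (column-space) statistics
    precond_grad = inv_root(L, 4) @ G @ inv_root(R, 4)
    return W - lr * precond_grad, L, R

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
L = 1e-3 * np.eye(8)
R = 1e-3 * np.eye(4)
for _ in range(3):
    G = W                                      # toy gradient of 0.5 * ||W||_F^2
    W, L, R = shampoo_step(W, G, L, R)
print(np.linalg.norm(W))
```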

The Cutting Edge: Towards Global Minima and Efficient Training
The pursuit of optimal solutions in deep learning hinges on effectively navigating the complex, high-dimensional loss landscapes that define model training. Recent advancements have yielded Prodigy, a novel optimization algorithm demonstrating significant progress towards consistently locating the global minimum within these landscapes. Unlike traditional methods often trapped in local minima or slowed by plateaus, Prodigy employs a unique strategy to escape suboptimal regions, allowing it to more efficiently descend towards the lowest possible loss value. This is achieved through a combination of adaptive learning rates and a momentum-based approach that encourages exploration while maintaining stability. Initial results indicate that Prodigy not only converges faster but also achieves lower loss values compared to established optimizers on challenging benchmark problems, suggesting a potentially transformative step towards training more accurate and robust deep neural networks.
Recent advancements in optimization algorithms have yielded Muon, a technique demonstrating a significant leap forward in training speed for large-scale pretraining. Experiments reveal Muon achieves the fastest training times to date on a benchmark 3-layer Multilayer Perceptron (MLP) tasked with learning from a downsampled version of the CIFAR-10 dataset. This performance suggests a new frontier in efficiently navigating the complex parameter spaces inherent in deep learning models. By accelerating the learning process, Muon not only reduces computational costs but also opens doors to exploring more extensive and intricate model architectures, potentially unlocking further improvements in artificial intelligence capabilities. The observed speed gains are particularly promising for researchers and practitioners dealing with resource-intensive pretraining phases, a cornerstone of modern deep learning workflows.
The efficient training of deep neural networks frequently encounters challenges at saddle points – critical points in the loss landscape where the gradient is zero, but the point isn’t a local minimum. Unlike traditional local minima, these saddle points aren’t isolated; instead, they form extended, high-dimensional manifolds that can significantly slow down optimization algorithms. Recent research highlights the importance of techniques designed to escape these saddle points, such as momentum-based methods and adaptive learning rate algorithms, which allow for continued progress even when facing conflicting gradient signals. Successfully addressing saddle points is therefore not merely about finding the minimum, but about accelerating the journey through the complex topography of the loss landscape, ultimately enabling the training of more powerful and efficient deep learning models.

The pursuit of scalable optimization, as detailed in this work regarding curvature-aware methods and modular norms, inevitably invites a certain skepticism. One anticipates the elegance of theory meeting the brutal reality of production logs. As John von Neumann observed, “Young man, unless you can explain these ideas to your grandmother, you don’t understand them.” This sentiment resonates deeply; the paper’s exploration of preconditioning and adaptive learning rates, while mathematically sound, will ultimately be judged by its resilience against unforeseen data distributions and hardware limitations. Better a well-understood, robust first-order method than a theoretically superior second-order one crippled by numerical instability or unforeseen edge cases.
The Road Ahead
This exploration of curvature-aware optimization and modular norms offers a neat, if predictably complex, framework. It’s a reminder that the pursuit of faster training invariably leads back to the second-order methods researchers abandoned decades ago – only now with more parameters and a fresh coat of theoretical paint. The claim of improved generalization remains, as always, empirically verified on a selection of benchmarks – which, one suspects, will soon require their own optimization algorithms to keep pace with model complexity.
The true challenge isn’t achieving faster convergence, but maintaining stability as these methods scale. Production environments have a knack for exposing edge cases that carefully constructed proofs conveniently overlook. Expect a proliferation of heuristics designed to ‘patch’ the inevitable divergence issues, disguised as clever engineering. It will be fascinating – and faintly depressing – to observe how quickly this elegant theory becomes another layer of technical debt.
Ultimately, this work is a refinement, not a revolution. Everything new is just the old thing with worse docs. The next step will undoubtedly involve adapting these techniques to whatever novel network architecture captures the zeitgeist, and the cycle will repeat. One anticipates a future where ‘optimization’ refers less to improving algorithms and more to managing their operational complexity.
Original article: https://arxiv.org/pdf/2512.18373.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/