Author: Denis Avetisyan
New research reveals how the popular ReLU activation function subtly influences the solutions found by gradient descent in high-dimensional neural networks.
This paper provides a rigorous convergence analysis demonstrating that gradient descent training for ReLU networks approximates the minimum-ℓ₂-norm interpolating solution.
Overparameterized neural networks present a paradox: while infinitely many solutions can perfectly fit training data, gradient descent consistently converges to a specific one. This work, ‘How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?’, rigorously characterizes this phenomenon, the implicit bias, for shallow ReLU networks trained with squared loss on high-dimensional random features. The authors demonstrate that, with high probability, gradient descent approximates the minimum-ℓ₂-norm solution among all interpolating solutions, differing by only Θ(√(n/d)), where n is the number of training examples and d is the feature dimension. Does this understanding of implicit bias pave the way for more predictable and robust training of deep neural networks?
The Illusion of Optimization: ReLU’s Hidden Preferences
Rectified Linear Unit (ReLU) networks, despite their straightforward activation function, demonstrate a consistent, yet often overlooked, ‘implicit bias’ during the training process. This means that when presented with multiple equally valid solutions that minimize the error on the training data, gradient descent doesn’t simply choose one at random; instead, it consistently gravitates towards solutions possessing specific characteristics – notably, those with smaller norms. Researchers have observed this tendency across diverse datasets and network architectures, suggesting it isn’t an artifact of particular configurations. This inherent preference isn’t explicitly programmed; rather, it emerges as a consequence of the ReLU’s behavior and the mechanics of gradient descent itself. Consequently, the learned weights aren’t merely a reflection of the data, but are also subtly shaped by this underlying bias, raising questions about the generalizability and true representational capacity of these foundational deep learning models.
The consistent preference for specific solutions by ReLU networks, even when numerous equally valid options exist, introduces a critical question regarding the trustworthiness of the learned weights. While gradient descent can find a solution that minimizes error, the observed tendency toward a particular outcome suggests the optimization process isn’t simply identifying a minimum, but is subtly guided toward a specific minimum. This isn’t a matter of finding the ‘best’ solution – multiple solutions may be equally good – but of consistently selecting one over others, potentially leading to models that perform predictably well on training data but generalize poorly to unseen data. The implications are significant, as this implicit bias could introduce systematic errors or vulnerabilities that aren’t readily apparent from standard performance metrics, demanding a deeper investigation into the factors influencing this preferential selection.
The prevalence of Rectified Linear Unit (ReLU) networks as a cornerstone of contemporary deep learning architectures necessitates a thorough understanding of their inherent behaviors. From image recognition systems to natural language processing tools, these networks underpin a vast array of applications impacting daily life. However, recent research highlights a subtle, yet significant, ‘implicit bias’ within ReLU networks: a tendency to favor specific solutions during the learning process, even when numerous equally valid options exist. Recognizing this bias isn’t merely an academic exercise; it’s crucial for building robust and reliable artificial intelligence. Failure to account for these predispositions could lead to unpredictable outcomes, compromised performance, and a diminished ability to generalize to new, unseen data. Therefore, investigating and mitigating this implicit bias is paramount to advancing the field and ensuring the trustworthy deployment of deep learning technologies.
Conventional machine learning theory posits that gradient descent, when minimizing empirical risk, should converge to a solution – but not necessarily a specific one, particularly when faced with non-convex optimization landscapes. Recent research demonstrates this isn’t always the case with ReLU networks; the algorithm consistently favors certain solutions over others, even in scenarios with multiple equally valid minima. This observed preference isn’t adequately explained by simply minimizing the error on the training data; something intrinsic to the network’s architecture and the gradient descent process itself is steering the optimization. The tendency towards specific solutions suggests an underlying inductive bias, influencing the learned weights and potentially impacting generalization performance – a phenomenon demanding further investigation to ensure the reliability and predictability of deep learning models built upon these networks.
Decoding Optimization: A Primal-Dual Perspective
Gradient descent, while commonly presented as a minimization algorithm, can be formally analyzed through a primal-dual formulation that recasts the optimization problem as a saddle-point problem. This involves introducing auxiliary ‘dual’ variables associated with the original ‘primal’ variables and constraints. By analyzing the saddle-point conditions – where the gradient with respect to the primal variables is zero and the gradient with respect to the dual variables is zero – the behavior of gradient descent can be rigorously studied. This approach transforms the original constrained optimization problem into an unconstrained optimization problem involving both primal and dual variables, allowing for a more complete characterization of the solution trajectory and convergence properties, particularly in non-convex settings. The resulting framework facilitates the derivation of analytical bounds on the optimization process and provides insights into the impact of hyperparameters like learning rate.
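This saddle-point view can be made concrete on the simplest relevant problem. The numpy sketch below (an illustration of the idea, not the paper's analysis) treats minimum-norm interpolation, min ½‖w‖² subject to Xw = y, via its Lagrangian ½‖w‖² + λᵀ(Xw − y), running gradient descent on the primal variable w and gradient ascent on the dual variable λ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # n examples, d features (overparameterized)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

# Saddle-point formulation of minimum-norm interpolation:
#   min_w max_lam  0.5 * ||w||^2 + lam @ (X @ w - y)
w = np.zeros(d)
lam = np.zeros(n)
eta = 0.1
for _ in range(5000):
    w -= eta * (w + X.T @ lam)        # gradient descent on the primal variable
    lam += eta * (X @ w - y)          # gradient ascent on the dual variable

w_min = np.linalg.pinv(X) @ y         # closed-form minimum-l2-norm interpolant
print(np.linalg.norm(X @ w - y))      # primal feasibility: residual near zero
print(np.linalg.norm(w - w_min))      # iterate matches the min-norm solution
```

Tracking both w and λ over the iterations is exactly the kind of trajectory analysis the primal-dual framework enables.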
The introduction of auxiliary variables within a primal-dual formulation expands the scope of optimization analysis beyond the primary variables. These variables, often Lagrange multipliers, are not directly solved for but are integral to representing constraints and deriving optimality conditions. By incorporating them, the optimization problem is transformed into a saddle-point problem, allowing for a more nuanced understanding of the interplay between the objective function, constraints, and the solution’s trajectory. This expanded representation facilitates the analytical derivation of convergence rates, stability conditions, and bounds on the solution’s error, which are often intractable when considering only the primary variables. Specifically, the use of auxiliary variables allows for the characterization of the optimization process in terms of both primal feasibility and dual feasibility, providing a more complete picture of the optimization dynamics and enabling the assessment of how closely gradient descent approximates the optimal solution as defined by the Karush-Kuhn-Tucker (KKT) conditions.
The primal-dual formulation, influenced by the principles of mirror descent, facilitates the observation of optimization trajectory by introducing auxiliary variables that represent the dual problem. This allows for the tracking of both the primal solution and its associated dual variable over the course of gradient descent. By analyzing the interplay between these variables, it becomes possible to identify the forces influencing convergence, such as the gradient of the primal objective and the constraints represented in the dual space. This approach moves beyond simply observing the primal solution’s progress to provide a more nuanced understanding of the optimization dynamics and the factors affecting the rate and stability of convergence. Specifically, the rate of change in both primal and dual variables provides insight into the algorithm’s behavior at each iteration.
The Karush-Kuhn-Tucker (KKT) conditions are a set of necessary conditions for optimality in constrained optimization problems, and sufficient ones under convexity and suitable constraint qualifications. At an optimal point, the gradient of the objective function must be expressible as a linear combination of the gradients of the constraint functions, with Lagrange multipliers that are non-negative for inequality constraints and unrestricted in sign for equality constraints. In full, the KKT conditions comprise stationarity, primal feasibility (the constraints hold), dual feasibility (non-negativity of the inequality multipliers), and complementary slackness, the condition that a multiplier can be nonzero only when its constraint is active. By providing a precise characterization of the optimal solution, the KKT conditions serve as a benchmark against which the outcome of iterative algorithms like gradient descent can be evaluated, allowing assessment of solution accuracy and identification of potential violations of optimality conditions.
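For the minimum-norm interpolation problem, the KKT conditions can be checked directly in closed form. The numpy sketch below is illustrative only; since the constraints here are equalities, the multipliers carry no sign restriction and there is no complementary slackness to verify:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 15, 80
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-l2-norm interpolant and its Lagrange multipliers for
#   min 0.5 * ||w||^2   subject to   X @ w = y
lam = -np.linalg.solve(X @ X.T, y)    # one dual variable per constraint
w = -X.T @ lam                        # stationarity: w + X^T lam = 0

# KKT checks: stationarity and primal feasibility.
print(np.linalg.norm(w + X.T @ lam))            # stationarity residual: 0
print(np.linalg.norm(X @ w - y))                # feasibility residual: ~0
print(np.allclose(w, np.linalg.pinv(X) @ y))    # agrees with the pseudoinverse
```

The multipliers λ are exactly the auxiliary dual variables of the preceding discussion; satisfying both residual checks certifies that w is the minimum-norm solution.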
The Curse of Dimensionality: Where Bias Takes Root
In high-dimensional data regimes, the implicit regularization induced by gradient descent training of ReLU networks becomes increasingly dominant and amenable to analytical characterization. As the dimensionality of the input space grows, gradient descent exhibits a strong preference for solutions with small ℓ₂-norm, effectively minimizing model complexity even without explicit regularization terms in the loss function. This bias isn’t solely attributable to the form of the loss function; it’s a fundamental property arising from the optimization dynamics themselves. The ability to analyze this behavior in high dimensions facilitates a clearer understanding of how these networks generalize and the factors influencing their learned representations, particularly concerning ‖λ‖₁, the ℓ₁-norm of the eigenvalues of the data covariance matrix (its trace).
Gradient descent, when applied to training neural networks, exhibits a tendency to converge on solutions possessing minimal ℓ₂-norm. This behavior isn’t solely attributable to explicit regularization terms within the loss function; instead, it arises as an intrinsic property of the gradient descent optimization process itself. Minimizing the ℓ₂-norm effectively promotes simplicity in the learned weights, acting as an implicit form of regularization. The mechanism is clearest in the linear regime: when the weights start near zero, every gradient step is a combination of the training examples, so the iterates never leave the span of the data, and the converged solution carries no weight components beyond those interpolation requires.
The observed preference for minimal ℓ₂-norm solutions in ReLU networks during gradient descent is not solely attributable to characteristics of the loss function being optimized. Analysis indicates this behavior arises directly from the mechanics of the gradient descent algorithm itself. While loss function regularization can encourage smaller weights, this tendency persists even without explicit regularization terms. The optimization process intrinsically favors simpler solutions, manifesting as a bias towards minimizing the ℓ₂-norm of the weight vector, regardless of the specific loss function used: because each update is assembled from the training examples themselves, the weights acquire no components the data does not demand, leaving the ℓ₂-norm as small as interpolation allows.
Analysis of gradient descent optimization in high-dimensional ReLU networks indicates the resulting solution converges near the minimum-ℓ₂-norm solution. The quantifiable distance between the limiting solution and the true minimum-ℓ₂-norm solution scales as Θ(n/‖λ‖₁), where n is the number of training examples and ‖λ‖₁ is the ℓ₁-norm of the eigenvalues of the data covariance matrix (its trace). A larger dataset therefore increases the divergence from the true minimum, while a larger trace, indicating greater total data variance, reduces it. This scaling relationship provides a precise measure of the bias introduced by gradient descent during optimization.
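The minimum-norm bias is easiest to demonstrate in the linear surrogate that high-dimensional analyses lean on. In the numpy sketch below (a toy illustration, not the paper's ReLU setting), plain gradient descent on the squared loss, started from zero, lands on the pseudoinverse, i.e. minimum-ℓ₂-norm, interpolant:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 25, 200                        # heavily overparameterized: d >> n
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

# Plain gradient descent on 0.5 * ||X @ w - y||^2, started from zero.
w = np.zeros(d)
eta = 0.5
for _ in range(2000):
    w -= eta * X.T @ (X @ w - y)

w_min = np.linalg.pinv(X) @ y         # minimum-l2-norm interpolant
print(np.linalg.norm(X @ w - y))      # training residual: ~0
print(np.linalg.norm(w - w_min))      # ~0: GD picked the min-norm solution
```

Among the infinitely many interpolating w (the system has d − n = 175 spare dimensions), gradient descent selects exactly the one of least norm, with no regularization term in sight.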
Mitigating the Bias: The Importance of Initialization
The very first values assigned to a neural network’s connections, known as initial weights, exert a surprisingly strong influence on the entire learning process. These weights don’t just set a starting point; they fundamentally shape the ‘optimization trajectory’ – the path gradient descent takes through the complex landscape of possible solutions. A poorly chosen initialization can lead the network into unfavorable regions, causing slow convergence, getting stuck in local minima, or even diverging entirely. This effect isn’t a mere coincidence; it introduces an ‘implicit bias’, steering the network towards solutions that are more easily reachable from that particular starting point rather than necessarily being the globally optimal or most generalizable ones. Consequently, carefully designed initialization strategies are essential for ensuring the network learns effectively and avoids being predisposed to suboptimal outcomes.
The stability of training deep neural networks is heavily influenced by the scale of initial weights; excessively large values can lead to vanishing or exploding gradients, hindering convergence. Consequently, ‘small initialization’ – setting initial weights to values close to zero – and ‘positive initialization’ – ensuring all weights are initially positive – have emerged as crucial techniques for reliable training. These methods effectively constrain the optimization landscape, preventing drastic updates during early stages and fostering a more gradual descent towards a stable minimum. By carefully controlling the initial weight distribution, researchers can significantly improve the convergence rate and overall performance of neural networks, particularly in complex architectures where the optimization process is prone to instability.
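The effect of initialization scale can be seen in the same linear surrogate. The sketch below (illustrative, not the paper's ReLU construction) runs gradient descent from random inits of different scales; because gradient updates never leave the span of the training data, whatever null-space component the init contains survives to convergence, so a larger init leaves the solution farther from the minimum-norm interpolant:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 25, 200
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)
w_min = np.linalg.pinv(X) @ y         # minimum-l2-norm interpolant

def train(scale, steps=2000, eta=0.5):
    """Gradient descent on the squared loss from a random init of given scale."""
    w = scale * rng.standard_normal(d)
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y)
    return w

for scale in [1e-3, 1e-1, 1.0]:
    w = train(scale)
    print(scale, np.linalg.norm(w - w_min))
# the gap to the min-norm solution grows roughly linearly with the init scale
```

All three runs interpolate the training data; only the small init also recovers the minimum-norm solution, which is why small initialization matters for the implicit bias.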
Disjoint initialization demonstrates a notable advantage when training models employing the Rectified Linear Unit (ReLU) activation function, particularly those with two layers. This technique strategically initializes the weights such that different neurons activate for different input examples, effectively preventing all neurons from firing simultaneously on any given instance. This contrasts with typical initialization methods that can lead to redundant computations as many neurons process the same information. By promoting sparse activation patterns from the outset, disjoint initialization helps to accelerate learning and improve generalization performance in two-layer ReLU networks. The method mitigates the vanishing gradient problem often associated with deep networks by ensuring a more diverse and responsive network state during initial optimization steps, allowing for more efficient exploration of the parameter space and ultimately leading to a more robust and accurate model.
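The paper's exact disjoint construction is not reproduced here, but its flavor can be sketched. In high dimension, random inputs are nearly orthogonal, so pointing each hidden neuron's incoming weights at a distinct training example (a hypothetical choice made purely for illustration) yields an activation pattern of roughly one neuron per example:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10, 500                        # n examples, high input dimension d
X = rng.standard_normal((n, d))

# Hypothetical disjoint-style init: point neuron j's incoming weights at
# training example j (one hidden neuron per example). In high dimension the
# examples are nearly orthogonal, so ReLU(w_j . x_i) is large when i == j
# and close to zero otherwise.
W = X / d                             # first-layer weights, shape (n, d)
A = np.maximum(X @ W.T, 0.0)          # hidden activations, shape (n, n)

on_diag = np.diag(A).mean()           # each neuron firing on "its" example
off_diag = (A.sum() - np.trace(A)) / (n * n - n)
print(on_diag, off_diag)              # diagonal dominates: sparse activations
```

The near-diagonal activation matrix is the sparse, non-redundant starting state the paragraph above describes: no two neurons begin by processing the same examples.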
Effective neural network training isn’t solely reliant on optimization algorithms; the starting point – the initial weights – profoundly influences the learning process. Carefully chosen initialization strategies serve as a powerful mechanism to steer gradient descent, preventing it from getting trapped in unfavorable local minima or converging to suboptimal solutions. These techniques proactively address inherent biases that can arise from the network’s architecture and data distribution, ensuring a more stable and efficient convergence. By thoughtfully setting initial conditions, researchers and practitioners can subtly guide the network towards desired outcomes, effectively shaping the solution landscape and mitigating the risks associated with poorly conditioned optimization problems. This proactive approach represents a practical and increasingly refined method for achieving robust and reliable performance in deep learning models.
The pursuit of elegant theoretical guarantees in high-dimensional neural networks feels… optimistic. This paper attempts to nail down the implicit bias of gradient descent with ReLU networks, showing a tendency toward minimum-ℓ₂-norm solutions. It’s a noble effort, truly. However, one suspects that production data, with its delightful quirks and edge cases, will inevitably introduce biases the researchers haven’t accounted for. As John Locke observed, “The mind is not furnished with ideas from birth.” Similarly, these models aren’t born with perfect optimization properties; they acquire them, and those acquisitions are always provisional. It’s a nice result, until it isn’t.
What’s Next?
This characterization of implicit bias in ReLU networks – a neat alignment with minimum-ℓ2-norm solutions – feels… suspiciously clean. As if the theoretical elegance will inevitably encounter a production dataset that delights in violating every assumption. The high-dimensional regime is, of course, where things get interesting, but it’s also a convenient place to hide the mess. One suspects that real-world data, with its peculiar correlations and lurking confounders, will force a reconsideration of this tidy picture. Perhaps the next step isn’t chasing ever more refined convergence proofs, but developing methods to diagnose when this implicit bias is actively working against generalization.
The focus on gradient descent is also telling. It’s a workhorse, certainly, but increasingly a legacy algorithm. The field races towards Adam, variants of SGD, and optimizers nobody can quite explain. Understanding the implicit bias of those systems feels critical. If a system crashes consistently, at least it’s predictable. These black boxes? They just accrue technical debt, disguised as innovation.
Ultimately, this work contributes to a growing library of theoretical results about how these networks ought to behave. It’s a valuable record, certainly. But one should remember: we don’t write code – we leave notes for digital archaeologists. The real challenge remains: building systems that are robust, reliable, and don’t require a PhD in optimization to debug.
Original article: https://arxiv.org/pdf/2603.04895.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-08 11:29