The Deep Learning Scaling Puzzle: Why Bigger Isn’t Always Better

Author: Denis Avetisyan


New research reveals how the dynamics of feature learning in deep neural networks explain both the successes and limitations of simply scaling up model size.

Internal feature learning in deep residual networks collapses with increasing depth, at a rate of $1/\sqrt{L}$, but this degradation is rectified by a depth-aware learning rate, $\eta_1 = \eta_c n \sqrt{L}$, which restores active learning across layers and enables consistent hyperparameter transfer and improved performance, as demonstrated by lower training and testing losses and higher accuracy across varying network depths and widths.

A rigorous mathematical framework, Neural Feature Dynamics, demonstrates a depth-induced vanishing of forward-backward interactions, offering insights into scaling laws for ResNets.

Despite the empirical success of deep learning scaling laws, the underlying mechanisms governing feature learning at scale remain poorly understood, leading to instability and diminishing returns in very deep networks. This work, ‘Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics’, introduces Neural Feature Dynamics (NFD), a rigorous mathematical framework for analyzing feature learning in deep ResNets. NFD reveals a depth-induced vanishing of forward-backward interactions that explains both the successes and failures of current scaling laws, identifying a critical collapse in feature learning within residual blocks. Can this framework guide the development of more robust and scalable deep learning architectures that consistently benefit from increased depth and compute?


Beyond Scaling: A Framework for Understanding Feature Dynamics

Recent investigations into Neural Scaling Laws reveal a critical limitation in the prevailing approach to artificial intelligence development. While consistently increasing the size of neural networks – measured by parameter count – demonstrably improves performance on various tasks, the marginal gains diminish rapidly. This suggests that simply scaling up models indefinitely is not a sustainable pathway toward achieving general intelligence; the computational costs and data requirements escalate disproportionately to the performance benefits. Empirical evidence indicates a point of diminishing returns, where further increases in model size yield progressively smaller improvements, highlighting the need for fundamentally new architectures and learning algorithms that move beyond brute-force scaling to unlock true intelligence.

Current theoretical frameworks attempting to explain the behavior of deep neural networks, such as the Infinite-Width Neural Tangent Kernel, frequently depend on assumptions of infinite width or small depth to achieve mathematical tractability. While these approximations offer valuable initial insights, they ultimately falter when applied to the increasingly deep and complex architectures prevalent in modern machine learning. Specifically, these methods often assume that the network’s behavior remains relatively stable across layers – a simplification that breaks down as depth increases, leading to inaccurate predictions about feature learning and generalization. The core limitation lies in their inability to adequately capture the dynamic, non-linear transformations occurring within each layer of a deep network, hindering a comprehensive understanding of how features evolve and interact as data propagates through the system. Consequently, a new approach is needed to accurately model the feature learning process in deep networks without relying on these restrictive and ultimately unsustainable simplifying assumptions.

Neural Feature Dynamics presents a novel, mathematically grounded framework for dissecting how deep ResNets learn features as their depth approaches infinity (L→∞). This approach moves beyond perturbative analyses by explicitly tracking the evolution of features across layers, revealing a surprising degree of stability and predictability in their transformations. Unlike existing methods reliant on simplifying assumptions about network behavior, Neural Feature Dynamics employs tools from dynamical systems theory to characterize feature learning as a continuous flow, allowing researchers to precisely predict how features will change with increasing depth. The resulting framework not only provides a more accurate depiction of feature learning in extremely deep networks but also offers a pathway to designing architectures that learn more efficiently and generalize more effectively, potentially circumventing the limitations of simply scaling model size.
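To make the object of study concrete, the following toy sketch (not the authors' code; the random weights, tanh nonlinearity, and $1/\sqrt{L}$ branch scaling are illustrative assumptions) propagates a feature vector through a residual stack and records how much each block moves it. As depth grows, the per-block change shrinks while the cumulative feature drift stays of order one, which is the kind of continuous depth limit Neural Feature Dynamics analyzes.

```python
# Toy forward pass through a width-n, depth-L residual stack with 1/sqrt(L)
# branch scaling (illustrative setup, not the paper's exact model).
import numpy as np

def feature_trajectory(L, n, seed=0):
    rng = np.random.default_rng(seed)
    h0 = rng.standard_normal(n) / np.sqrt(n)          # initial feature vector, O(1) norm
    h = h0.copy()
    per_block = []
    for _ in range(L):
        W = rng.standard_normal((n, n)) / np.sqrt(n)  # width-scaled random weights
        update = np.tanh(W @ h) / np.sqrt(L)          # 1/sqrt(L)-damped residual branch
        per_block.append(np.linalg.norm(update))
        h = h + update
    return np.array(per_block), np.linalg.norm(h - h0)

for L in (8, 64, 512):
    steps, drift = feature_trajectory(L, n=256)
    print(f"L={L:4d}  mean per-block change={steps.mean():.4f}  total feature drift={drift:.3f}")
```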

Increasing network depth with the μμP method restores gradient information across layers, enabling effective training and improved performance, unlike standard or residual networks, which suffer from vanishing gradients or overfitting as depth increases.

The Depth-μP Regime: A Point of Stable Dynamics

The Depth-μP scaling regime represents a critical operating point for deep residual networks. Rather than treating depth L and width n as independent knobs, it extends the Maximal Update Parametrization (μP) to depth: residual branches are damped by $1/\sqrt{L}$ and per-layer learning rates are rescaled with width and depth so that every block continues to learn features at a comparable rate. Deviating from this scaling, for instance by adding layers while keeping a depth-independent learning rate, leads to unstable training dynamics or to the collapse of internal feature learning described earlier, and ultimately prevents the network from benefiting from its added depth. Maintaining the regime keeps the network's expressive capacity balanced against its ability to propagate gradients and update weights reliably.
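A minimal sketch of how such a parameterization might be wired up, assuming the $1/\sqrt{L}$ residual damping and the caption's depth-aware rule $\eta_1 = \eta_c n \sqrt{L}$; the initialization scale and the parameter group to which the rule attaches are my assumptions, not the paper's prescription.

```python
# Sketch of a Depth-muP-style residual stack (my reading of the setup, not the authors' code).
import math
import torch
import torch.nn as nn

class DepthMuPStack(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.depth = depth
        self.blocks = nn.ModuleList(nn.Linear(width, width, bias=False) for _ in range(depth))
        for block in self.blocks:
            # Fan-in-scaled init; the exact Depth-muP initialization conventions are in the paper.
            nn.init.normal_(block.weight, std=1.0 / math.sqrt(width))

    def forward(self, h):
        for block in self.blocks:
            h = h + torch.relu(block(h)) / math.sqrt(self.depth)  # 1/sqrt(L) branch damping
        return h

width, depth, eta_c = 256, 64, 1e-4
model = DepthMuPStack(width, depth)
eta_1 = eta_c * width * math.sqrt(depth)   # depth-aware learning rate from the figure caption
optimizer = torch.optim.SGD(model.parameters(), lr=eta_1)
```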

The analysis reveals that within the Depth-μP regime, the dynamics of feature learning and gradient descent are accurately modeled as a Forward-Backward Stochastic System. This system characterizes the coupled evolution of feature vectors and their corresponding gradients during training. Specifically, the forward pass propagates information to compute features, while the backward pass calculates gradients used to update these features; the stochastic component arises from mini-batch gradient descent. This coupling is essential as the gradient directly influences feature updates, and the current features determine the subsequent gradient calculation, creating a feedback loop. The formulation as a stochastic system allows for the application of tools from stochastic analysis to study the convergence and stability of training dynamics within deep neural networks.
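Schematically, and in simplified form rather than the paper's exact NFD equations, the coupled system pairs a forward recursion over features with a backward recursion over gradients:

```latex
% Simplified forward-backward recursions for a depth-L residual stack
% (schematic rendering; not the paper's exact NFD equations).
% Forward pass: features accumulate 1/sqrt(L)-damped block updates.
\[
  h^{(\ell+1)} \;=\; h^{(\ell)} + \tfrac{1}{\sqrt{L}}\, f^{(\ell)}_{\theta}\!\big(h^{(\ell)}\big),
  \qquad \ell = 0,\dots,L-1 .
\]
% Backward pass: the gradient g^{(l)} of the loss with respect to h^{(l)}
% (written as a row vector) flows back through the same residual structure,
% so it is coupled to the features computed in the forward pass.
\[
  g^{(\ell)} \;=\; g^{(\ell+1)}\Big( I + \tfrac{1}{\sqrt{L}}\, \partial_h f^{(\ell)}_{\theta}\!\big(h^{(\ell)}\big) \Big),
  \qquad g^{(L)} = \partial_{h^{(L)}} \mathcal{L} .
\]
% Stochasticity enters through mini-batch sampling in the parameter updates,
% which in turn feed into the next forward pass.
```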

In the limit as network width n approaches infinity, the coordinates of both feature and gradient vectors become asymptotically independent. This decoupling arises because each coordinate aggregates contributions from a growing number of weakly correlated weights, so no single interaction dominates and the correlations between distinct coordinates vanish. Consequently, the analytical models used to study training dynamics simplify significantly. This independence is crucial because it enables the application of techniques from random matrix theory and free probability, leading to tractable expressions for the quantities that govern convergence, such as the variance of feature activations and the covariance structures entering the learning dynamics.
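A quick numerical illustration of this decoupling (a toy experiment of mine, not a result from the paper): the average correlation between distinct coordinates of a single hidden layer's features, measured over random inputs, shrinks as the width grows.

```python
# Toy check: off-diagonal coordinate correlations of a random hidden layer
# shrink as the width n increases (single layer at initialization).
import numpy as np

def mean_abs_offdiag_corr(n, num_inputs=2000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, n)) / np.sqrt(n)        # width-scaled weights
    X = rng.standard_normal((num_inputs, n)) / np.sqrt(n)
    H = np.tanh(X @ W.T)                                 # features: (num_inputs, n)
    C = np.corrcoef(H, rowvar=False)                     # n x n correlation across coordinates
    return np.abs(C[~np.eye(n, dtype=bool)]).mean()

for n in (32, 128, 512):
    print(f"n={n:4d}  mean |coordinate correlation| = {mean_abs_offdiag_corr(n):.4f}")
```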

Increasing network width to 256 stabilizes training dynamics for both standard and decoupled μμP-ResNets, further aligning their trajectories and supporting the theoretical prediction that larger widths restore gradient isotropy in deep networks.

Establishing Convergence: Rigorous Mathematical Foundations

Propagation of Chaos techniques, specifically a synchronous coupling argument, provide a methodology for examining the collective dynamics of features within a deep neural network. This approach models the evolution of a large ensemble of interacting features as a stochastic process and compares it with its mean-field limit. Synchronous coupling drives the finite-width system and the limiting system with the same source of randomness, so the gap between the two can be bounded directly at every iteration. By studying how this gap behaves under specific conditions, one gains insight into how individual feature interactions contribute to the overall network dynamics and learning process. The technique moves beyond analyzing individual feature behavior to understanding the emergent properties of the collective system, providing a more complete picture of network operation.

The mathematical proof of convergence is established through the application of Discrete Gronwall’s Inequality, a discrete-time analogue of the continuous Gronwall’s Inequality used to bound solutions to differential equations. This inequality is leveraged in conjunction with the property of Lipschitz Continuity, which constrains the rate of change of a function. Specifically, Lipschitz Continuity is applied to the feature propagation dynamics to ensure that the growth of feature magnitudes remains bounded. By satisfying the conditions required for both Discrete Gronwall’s Inequality and Lipschitz Continuity, the proof demonstrates the establishment of moment bounds – probabilistic constraints on the size of features – which are critical for proving the stability and convergence of the learning process.
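For reference, one standard statement of the discrete Gronwall inequality (a textbook form; the paper's precise variant may differ) shows how such moment bounds are accumulated step by step:

```latex
% One standard discrete Gronwall bound (textbook form; the paper's variant may differ).
% If a_{k+1} <= (1 + b_k) a_k + c_k with a_k, b_k, c_k >= 0, then
\[
  a_K \;\le\; \exp\!\Big(\sum_{k=0}^{K-1} b_k\Big)\Big(a_0 \;+\; \sum_{k=0}^{K-1} c_k\Big).
\]
% Here a_k would play the role of a moment of the coupling error at training step k,
% b_k is supplied by the Lipschitz constants of the update maps, and c_k collects
% the per-step finite-width and discretization error terms.
```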

Demonstrating convergence of the feature learning process under the Depth-μP regime is achieved through a mathematically rigorous analysis. In this regime, residual branches and learning rates are rescaled with depth and width so that feature learning neither explodes nor collapses as the network grows. Establishing convergence under these conditions, meaning that the finite network's learning dynamics approach a well-defined limiting process as depth and width increase, is crucial for guaranteeing the behavior of very deep networks. This analytical foundation then enables a deeper understanding of the mechanisms by which deep networks learn complex patterns and generalize to unseen data, moving beyond empirical observations to verifiable theoretical results.

Training depth-μP ResNets on CIFAR-10 reveals that approximation error decreases with increasing network depth and width, following a rate of $\mathcal{O}(1/L + 1/n)$, and demonstrates that these parameters are empirically interchangeable during both initialization and training.

Extending the Framework: Towards a Deeper Understanding

The foundation of the Neural Feature Dynamics framework lies in the formalism of Tensor Programs, a system designed to rigorously define and analyze the computations occurring within neural networks. Unlike traditional approaches that often treat forward and backward passes as monolithic operations, Tensor Programs decompose these processes into a series of precisely defined tensor contractions. This decomposition isn’t merely a mathematical exercise; it enables a detailed tracking of how information flows through the network during both training and inference. By representing computations as these explicit tensor operations, researchers gain the ability to analyze the dynamics of features learned by the network, and to predict how these features will evolve with changes in network architecture or training data. This precision is critical for moving beyond intuitive understandings of neural network behavior and towards a more mathematically grounded theory of deep learning, ultimately facilitating the development of more robust and interpretable models.

The behavior of neural networks in the infinite-width limit is rigorously understood through the application of the Master Theorem within the Tensor Programs framework. This theorem provides a precise method for tracking the evolution of key statistical quantities – such as the mean and variance – of variables within these programs as network width approaches infinity. Unlike previous analyses relying on approximations, the Master Theorem delivers exact, closed-form solutions, enabling researchers to predict network behavior without resorting to assumptions about parameter distributions or kernel properties. This capability is particularly impactful because it reveals how seemingly complex dynamics simplify dramatically in the infinite-width regime, offering a pathway to analyze and ultimately design more effective neural architectures and training procedures. The theorem’s predictive power extends beyond simple convergence, accurately describing the learning curves and generalization performance of these networks – a level of detail previously unattainable.
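As a toy illustration of the kind of quantity the Master Theorem pins down exactly (my own example, not one from the paper): for a single width-n layer at initialization, the empirical variance of the pre-activation coordinates concentrates on a deterministic limit as n grows.

```python
# Toy illustration: the empirical variance of pre-activation coordinates of a
# random width-n layer concentrates on a deterministic limit as n grows.
import numpy as np

rng = np.random.default_rng(0)
for n in (64, 512, 4096):
    x = rng.standard_normal(n)                        # input with unit per-coordinate variance
    W = rng.standard_normal((n, n)) / np.sqrt(n)      # N(0, 1/n) weights
    z = W @ x                                         # pre-activations
    print(f"n={n:5d}  empirical var={z.var():.4f}  deterministic prediction={x @ x / n:.4f}")
```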

Recent analyses leveraging the Neural Feature Dynamics framework and Tensor Programs present a departure from conventional understandings of infinitely wide neural networks. These findings directly challenge the assumptions underpinning the Neural Tangent Kernel – a dominant paradigm suggesting that, in the infinite-width limit, neural networks behave as linear models governed by a fixed kernel. The framework demonstrates that this linear approximation breaks down, revealing a more complex dynamical system characterized by evolving feature spaces. Furthermore, the work casts doubt on the validity of Mean-Field Parameterizations, which rely on simplifying assumptions about the distribution of network parameters. By offering a more nuanced perspective on the infinite-width limit, this research suggests that the behavior of deep networks is far richer and more dynamic than previously appreciated, opening new avenues for theoretical exploration and potentially leading to improved network designs.

Training ResNets on CIFAR-10 with online SGD reveals that while the minimum eigenvalues of the covariance matrices $\bm{\Sigma}_{t}^{(k)}$ and $\bm{\Theta}_{t}^{(k)}$ decrease over time, they remain positive and increase with network width, confirming Assumption 1 and highlighting the importance of sufficient network width to avoid eigenvalue collapse.

Future Directions: Towards Efficient and Sustainable Deep Learning

Recent advancements in deep network efficiency are being significantly shaped by the integration of sophisticated tools, notably large language models. These models are not merely assisting with the presentation of findings; they are actively accelerating the research process itself through automated identification of relevant prior work and the synthesis of complex information. This capability provides a novel perspective on how deep networks learn, allowing researchers to move beyond traditional architectural constraints and explore more resource-conscious designs. The use of these tools facilitates a broader, more interconnected understanding of the field, potentially unlocking new approaches to optimization and ultimately redefining the boundaries of what is computationally feasible in deep learning.

A deeper comprehension of how deep neural networks learn features is proving pivotal in the development of more efficient architectures. Rather than treating networks as monolithic blocks, researchers are now dissecting the feature learning process to identify redundancies and inefficiencies. This granular understanding allows for the design of networks that prioritize learning the most salient features first, thereby reducing computational demands and preventing overfitting. By strategically allocating resources to essential feature extraction, these optimized architectures demonstrate enhanced robustness to noisy data and require significantly fewer parameters to achieve comparable performance. This approach promises a future where deep learning models are not only powerful but also accessible and sustainable, even on resource-constrained devices.

The progression of this research extends beyond theoretical advancements, aiming for practical implementation in diverse, real-world applications. Future studies will concentrate on translating these understandings of efficient feature learning into tangible improvements across fields like computer vision, natural language processing, and robotics. Simultaneously, investigation will push the boundaries of deep learning itself, seeking to define its ultimate limitations and identify novel approaches to overcome them. This includes exploring alternative network architectures, optimization algorithms, and learning paradigms to establish whether fundamentally more efficient deep learning models are achievable, or if inherent computational costs represent an insurmountable barrier. The ultimate goal is not merely to refine existing techniques, but to redefine the possibilities within the field and unlock a new generation of intelligent systems.

Pre-activation ResNets demonstrate superior stability and faster convergence with lower variance compared to post-activation networks when trained on CIFAR-10, likely due to their ability to maintain consistent feature distributions across network depth.
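For concreteness, a minimal sketch of the two block orderings being compared (the standard pre- versus post-activation designs, with normalization layers omitted for brevity; this is the generic construction, not the exact architecture used in the experiments):

```python
# Minimal sketch of post-activation vs. pre-activation residual blocks.
import torch
import torch.nn as nn

class PostActBlock(nn.Module):
    """Original ordering: the nonlinearity is applied after the residual addition,
    so the identity path is interrupted at every block."""
    def __init__(self, width):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(width, width), nn.Linear(width, width)

    def forward(self, x):
        return torch.relu(x + self.fc2(torch.relu(self.fc1(x))))

class PreActBlock(nn.Module):
    """Pre-activation ordering: the nonlinearity lives inside the branch and the
    identity path passes through untouched, which helps keep feature statistics
    stable across depth."""
    def __init__(self, width):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(width, width), nn.Linear(width, width)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(torch.relu(x))))
```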

The exploration of scaling laws in deep neural networks, as detailed within, highlights a fundamental principle: the interconnectedness of a system’s parts. This research, focusing on feature learning dynamics within ResNets, demonstrates how depth-induced vanishing interactions dictate overall performance, a concept resonant with the idea that structure dictates behavior. As Henri Poincaré observed, “It is through science that we arrive at truth, but it is imagination that leads us to it.” This holds true here; imagining the network’s feature space evolving allows for a mathematically rigorous understanding of these scaling laws, revealing why certain architectures succeed while others falter. The framework offered isn’t merely about isolated components, but how these components interact within the larger, evolving structure.

The Road Ahead

The framework presented here, Neural Feature Dynamics, offers a compelling account of scaling laws in deep ResNets, grounding observed phenomena in the interplay of forward and backward signal propagation. Yet, to suggest this resolves the broader question of generalization would be premature. The analysis, while mathematically rigorous within its assumptions, still operates on a simplified picture of feature space. The implicit assumption of gradient independence, while yielding tractable results, is almost certainly an approximation; dependencies, after all, are the true cost of freedom. Future work must grapple with the consequences of relaxing this constraint, even if it means trading analytical tractability for empirical relevance.

More fundamentally, the emphasis on depth-induced vanishing interactions highlights a critical point: success in deep learning often stems not from discovering something new, but from carefully managing what is lost. This suggests a shift in focus from ever-larger models to architectures that explicitly preserve signal integrity. The architecture itself, when functioning optimally, should be invisible until it breaks. A truly elegant solution will not require increasingly complex regularization schemes, but rather an inherent robustness built into its structure.

Ultimately, the field must resist the temptation to treat scaling laws as immutable decrees. They are, at best, contingent properties of a particular inductive bias – a snapshot in a vast landscape of possible architectures and training procedures. Simplicity scales; cleverness does not. The challenge lies in identifying the minimal sufficient structure needed to achieve robust generalization, and recognizing that the most powerful models may be those that do the least.


Original article: https://arxiv.org/pdf/2512.21075.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
