Beyond the Gaussian Dream: How Neural Networks Learn Features

Author: Denis Avetisyan


New research reveals that Bayesian neural networks exhibit complex feature learning beyond simple Gaussian process behavior, offering a deeper understanding of their predictive power.

The study demonstrates that a large-deviation rate function, derived through a posterior lens, aligns with the quadratic posterior rate induced by Gaussian-process regression with the Neural Network Gaussian Process (NNGP) kernel, with the relative operator-norm gap between the selected kernel and the NNGP kernel serving as a key metric for assessing this correspondence.

This paper leverages large-deviation principles to demonstrate emergent kernel structures driving feature learning in Bayesian neural networks and enabling better posterior concentration analysis.

While Bayesian neural networks (BNNs) are often analyzed through Gaussian process limits, this perspective obscures the rich feature learning occurring within wider networks. The work ‘Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks’ introduces a framework leveraging large-deviation theory to reveal emergent kernels and understand posterior concentration beyond these limits. By jointly optimizing predictors and internal kernels, the authors demonstrate that these networks exhibit a data-dependent feature selection process, accurately capturing finite-width effects and non-Gaussian posterior behavior. Can this approach provide a more complete understanding of generalization in deep learning and guide the development of more robust and interpretable models?


Scaling Limits: The Illusion of Infinite Learning

Contemporary machine learning systems routinely employ automatic feature learning, a process where algorithms independently identify relevant patterns within data – a significant departure from earlier methods requiring manual feature engineering. However, as these models are scaled to encompass increasingly complex datasets and larger network architectures, fundamental theoretical limitations begin to emerge. While performance often improves with scale, this progress isn’t unbounded; researchers are discovering points of diminishing returns and unexpected behaviors. These limitations aren’t simply practical concerns about computational resources, but rather stem from inherent properties of the learning process itself – for example, the tendency towards overfitting or the difficulty in generalizing to unseen data. Investigating these boundaries is crucial, as it necessitates a shift from purely empirical optimization towards a deeper theoretical understanding of what these models can and cannot learn, ultimately guiding the development of more reliable and efficient algorithms.

The increasing width of modern neural networks, while often improving performance, presents a significant analytical challenge. Conventional techniques in statistical mechanics and optimization, historically employed to understand complex systems, begin to falter when applied to these massively parameterized models. This isn’t simply a matter of computational cost; the fundamental assumptions underpinning these traditional methods break down in the high-dimensional limit. Consequently, predicting how a very wide network will generalize to unseen data, or controlling its internal representations, becomes increasingly difficult. Researchers find it hard to determine which features the network will learn, or how sensitive it is to adversarial inputs, effectively creating a ‘black box’ where behavior is observed but not fully understood. This limitation hinders the development of more reliable and efficient algorithms, pushing the field towards novel theoretical frameworks capable of addressing the complexities of extreme scale.

Establishing the theoretical limits of neural networks is not merely an academic exercise, but a fundamental necessity for progress in machine learning. As models grow increasingly complex – with billions of parameters becoming commonplace – their behavior often deviates from established understandings, leading to unpredictable performance and a lack of generalizability. Pinpointing these limits allows researchers to move beyond empirical scaling laws and develop algorithms grounded in solid mathematical principles. This theoretical framework enables the design of more robust networks, less susceptible to overfitting or adversarial attacks, and ultimately, more efficient in terms of both computational resources and data requirements. By understanding why certain architectures succeed or fail at scale, the field can transition from trial-and-error experimentation to informed, principled algorithm development, paving the way for genuinely intelligent systems.

A wide Gaussian neural network trained on a Heaviside target predicts test inputs with varying accuracy depending on the activation function, demonstrating smoother predictions with <span class="katex-eq" data-katex-display="false">\tanh</span> activation compared to the more step-like outputs from ReLU.

The Allure of Gaussian Processes: A Network’s Hidden Simplicity

The behavior of infinitely wide neural networks approaches that of Gaussian processes (GPs). Specifically, as the number of hidden units in each layer tends towards infinity, the posterior distribution over the network’s functions, given training data, converges to a Gaussian process. This means that for any finite set of inputs, the outputs of the infinitely wide network are jointly Gaussian distributed. A Gaussian process is fully defined by its mean function and covariance kernel, allowing the function represented by the wide network to be characterized by these parameters. This convergence is not an approximation in the limit, but a formal mathematical result enabling the application of GP theory to analyze and predict the behavior of wide neural networks without directly computing the network’s parameters.
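The convergence claim can be checked numerically: draw many independent one-hidden-layer networks from a Gaussian prior and inspect the distribution of their outputs at a fixed input. A minimal sketch (the widths and sample counts are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_net_output(x, width, n_samples, rng):
    """Outputs at scalar input x of n_samples independent one-hidden-layer
    ReLU networks with unit Gaussian weights and a 1/sqrt(width) readout."""
    w1 = rng.standard_normal((n_samples, width))   # input weights
    b1 = rng.standard_normal((n_samples, width))   # hidden biases
    w2 = rng.standard_normal((n_samples, width))   # readout weights
    h = np.maximum(w1 * x + b1, 0.0)               # hidden activations
    return (w2 * h).sum(axis=1) / np.sqrt(width)   # CLT-scaled output

# Excess kurtosis (0 for a Gaussian) shrinks as the width grows.
kurt = {}
for width in (4, 256):
    y = random_relu_net_output(1.0, width, 20_000, rng)
    z = (y - y.mean()) / y.std()
    kurt[width] = (z ** 4).mean() - 3.0
    print(width, round(kurt[width], 3))
```

At width 4 the output distribution is visibly heavy-tailed; at width 256 its excess kurtosis is already close to zero, consistent with the Gaussian-process limit.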

The Neural Network Gaussian Process (NNGP) limit establishes a theoretical equivalence between infinitely wide neural networks and Gaussian processes. This allows for the application of Gaussian process theory – including tools for regression, classification, and uncertainty quantification – to analyze the behavior of these networks. Specifically, the NNGP framework defines a probability distribution over functions represented by the network, with the mean and covariance functions determined by the network’s architecture and parameters. This enables the calculation of predictive distributions and the assessment of model uncertainty without requiring Monte Carlo sampling, as is often necessary with traditional neural networks. Furthermore, the NNGP allows for the derivation of analytical expressions for quantities such as generalization error and the effect of different network configurations, providing insights into the learning dynamics and performance characteristics of wide networks.

The Neural Network Gaussian Process (NNGP) framework enables a simplification of analyses previously intractable for very wide neural networks. By characterizing the function computed by an infinitely wide network as a Gaussian process, researchers can leverage established Gaussian process theory to predict network behavior without requiring computationally expensive training or forward passes. Specifically, the NNGP allows for the analytical calculation of quantities like the mean and covariance of the network’s output, providing insights into its response to different inputs. This capability extends to understanding generalization performance; the NNGP framework facilitates the derivation of generalization bounds and the analysis of how network architecture impacts its ability to perform well on unseen data. The resulting theoretical tools allow for efficient hyperparameter optimization and architectural design choices without full empirical evaluation.
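Once the limit is a Gaussian process, the "analytical calculation" mentioned above reduces to the standard closed-form regression equations. The sketch below uses an RBF kernel purely as a hypothetical stand-in for an NNGP kernel; the function names and the toy Heaviside-style data are illustrative:

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test_diag, y, noise):
    """Closed-form GP regression.

    K_train: (n, n) kernel on training inputs; K_cross: (m, n) kernel between
    test and training inputs; K_test_diag: (m,) prior variance at test inputs.
    Returns posterior mean and pointwise posterior variance at the test inputs.
    """
    A = K_train + noise * np.eye(len(y))
    alpha = np.linalg.solve(A, y)              # (K + noise*I)^{-1} y
    mean = K_cross @ alpha
    v = np.linalg.solve(A, K_cross.T)          # (K + noise*I)^{-1} K_cross^T
    var = K_test_diag - np.einsum("mn,nm->m", K_cross, v)
    return mean, var

def rbf(X, Z, ell=1.0):
    """RBF kernel; a hypothetical stand-in for an NNGP kernel."""
    return np.exp(-(X[:, None] - Z[None, :]) ** 2 / (2 * ell ** 2))

X = np.linspace(-2, 2, 8)       # training inputs
y = np.sign(X)                  # Heaviside-style target, as in the figures
Xs = np.array([0.0, 3.0])       # test inputs
mean, var = gp_posterior(rbf(X, X), rbf(Xs, X), np.ones(len(Xs)), y, noise=0.1)
print(mean, var)
```

No Monte Carlo sampling is involved: two linear solves give both the prediction and its uncertainty.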

For a ReLU network trained on a Heaviside target, <span class="katex-eq" data-katex-display="false">n=128</span> width posterior samples demonstrate that n-tempered (LDP) scaling concentrates predictions around the LDP-MAP value, while standard NNGP scaling exhibits broader Gaussian fluctuations around the NNGP posterior mean.

Kernel Methods: Decoding the Network’s Internal Logic

As the width of a neural network approaches infinity, its behavior converges to that of kernel regression. Specifically, in the infinite-width limit, the network's prior over functions is a Gaussian process with a covariance known as the Neural Network Gaussian Process (NNGP) kernel. This NNGP kernel, determined by the network's architecture and weight priors, measures the similarity between data points as perceived by the infinite-width network. Consequently, Bayesian inference in a wide neural network is mathematically equivalent to Gaussian-process regression with the NNGP kernel, allowing established kernel methods to be applied for analysis and prediction. The NNGP kernel is formally given by K(x, x') = \mathbb{E}_{\theta} [ f(x, \theta) \, f(x', \theta) ], where f(x, \theta) denotes the network function with parameters \theta drawn from the prior. (The closely related Neural Tangent Kernel, built instead from the gradients \partial f / \partial \theta, governs gradient-descent training rather than Bayesian inference.)
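Because the NNGP kernel is the second moment of the network's output under the prior, it can be estimated by brute-force Monte Carlo. A rough sketch for a one-hidden-layer ReLU network on scalar inputs (widths and sample counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def nngp_kernel_mc(X, width, n_nets, rng):
    """Monte Carlo estimate of K(x, x') = E_theta[f(x, theta) f(x', theta)]
    for one-hidden-layer ReLU networks with standard Gaussian weights."""
    n = len(X)
    K = np.zeros((n, n))
    for _ in range(n_nets):
        w1 = rng.standard_normal(width)     # input weights (scalar inputs)
        b1 = rng.standard_normal(width)     # hidden biases
        w2 = rng.standard_normal(width)     # readout weights
        f = np.array([(w2 * np.maximum(w1 * x + b1, 0.0)).sum() / np.sqrt(width)
                      for x in X])
        K += np.outer(f, f)                 # accumulate the outer product
    return K / n_nets

X = np.array([-1.0, 0.0, 1.0])
K = nngp_kernel_mc(X, width=256, n_nets=4000, rng=rng)
print(np.round(K, 2))
```

As an average of outer products, the estimate is symmetric and positive semi-definite by construction; for this architecture the exact value at x = 0 is E[relu(b)^2] = 1/2 for b ~ N(0, 1), which the estimate should approach.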

Establishing a link between wide neural networks and kernel regression permits the application of established theoretical frameworks from kernel methods to assess generalization error. Specifically, techniques for bounding the generalization gap – the difference between training and test error – developed for kernel machines, such as reproducing kernel Hilbert space (RKHS) norm bounds and VC-dimension analysis, become potentially applicable to wide neural networks operating with the Neural Tangent Kernel (NTK). This allows researchers to utilize existing bounds on ||f||_{RKHS} to derive corresponding bounds on the generalization error of the neural network, offering a pathway to understand and control overfitting. Furthermore, kernel methods provide tools for analyzing the complexity of the function class represented by the network, influencing the rate of convergence and the sample complexity required for effective learning.

Posterior concentration, verified through Markov chain Monte Carlo (MCMC) methods such as Metropolis-Adjusted Langevin Algorithm (MALA) sampling, is critical for applying PAC-Bayes bounds to neural network generalization. These bounds relate generalization error to the complexity of the posterior distribution over network weights, and demonstrating posterior concentration allows for tighter bounds. Crucially, the existence of a non-zero kernel gap (the operator-norm discrepancy between the emergent, data-dependent kernel and the fixed NNGP kernel) indicates a capacity-control mechanism beyond that of the fixed-kernel regime. A positive kernel gap signifies that the network is not simply interpolating the training data under a fixed similarity measure, but is selecting features and operating in a lower-complexity subspace, thus improving generalization performance as quantified by PAC-Bayes bounds.
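MALA itself is a short algorithm: a gradient (Langevin) proposal followed by a Metropolis-Hastings correction that makes the chain exact. The sketch below targets a toy 2-D Gaussian as a stand-in for a BNN posterior; it is a generic textbook MALA, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mala(grad_log_p, log_p, x0, step, n_steps, rng):
    """Metropolis-Adjusted Langevin Algorithm.

    Proposal: x + step * grad log p(x) + sqrt(2 * step) * noise,
    accepted or rejected with the usual asymmetric MH ratio.
    """
    x = np.array(x0, dtype=float)
    samples, accepted = [], 0
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        prop = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise

        def log_q(a, b):  # log density of proposing a from b
            return -np.sum((a - b - step * grad_log_p(b)) ** 2) / (4.0 * step)

        log_alpha = (log_p(prop) + log_q(x, prop)) - (log_p(x) + log_q(prop, x))
        if np.log(rng.uniform()) < log_alpha:
            x, accepted = prop, accepted + 1
        samples.append(x.copy())
    return np.array(samples), accepted / n_steps

# Toy target: standard 2-D Gaussian (stand-in for a BNN posterior).
log_p = lambda x: -0.5 * np.sum(x ** 2)
grad = lambda x: -x
samples, acc_rate = mala(grad, log_p, [3.0, -3.0], step=0.5, n_steps=5000, rng=rng)
print(acc_rate, samples[1000:].mean(axis=0))
```

The step size is what one tunes to hit a target acceptance rate, such as the 0.75 reported later in the article.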

Comparing large-deviation MAP predictions to the NNGP posterior mean reveals discrepancies quantified by the kernel gap, indicating deviations from the fixed NNGP kernel along the input grid.

The Landscape of Possibility: Rare Events and Network Stability

Neural networks, despite their success, operate within a complex probability landscape where rare events – unusual configurations or behaviors – can significantly impact performance and training stability. Large deviation theory offers a rigorous mathematical framework to analyze these unlikely occurrences, moving beyond simple average-case behavior. By characterizing the rate at which the probability of an event decays as its likelihood diminishes, this theory allows researchers to quantify the chances of observing specific, potentially problematic, network states. It provides insights into how network parameters – such as width and depth – influence the susceptibility to these rare events, enabling a deeper understanding of generalization capabilities and optimization dynamics within these powerful, yet often opaque, systems. This approach isn’t about predicting the most likely outcome, but rather defining the limits of possibility and assessing the robustness of a network against unexpected behavior, particularly crucial in sensitive applications.

Through large deviation theory, researchers can move beyond simply observing network behavior to precisely quantifying the likelihood of specific outcomes. This isn’t merely descriptive; it allows for a systematic investigation into how alterations in network parameters – such as learning rate, weight initialization, or network architecture – influence the probability of observing particular behaviors. For instance, the theory elucidates how changing the network depth L impacts the chance of encountering unstable optimization, or how specific weight distributions affect the propensity for overfitting. By mapping the relationship between parameters and probabilities, this approach offers a powerful tool for designing more robust and predictable neural networks, moving beyond empirical trial-and-error to a more theoretically grounded understanding of network dynamics.

Analysis reveals a compelling relationship between network depth and optimization stability in wide neural networks. The rate function, which quantifies the probability of observing a particular network output, grows sublinearly with the magnitude of that output |y|, scaling as |y|^{2/(L+1)}, where L denotes network depth. This sublinear growth is crucially linked to stable optimization, as demonstrated by consistently low gradient norms, which remain below 1e-3 throughout training. Deeper networks therefore exhibit a more forgiving probability landscape, reducing the likelihood of optimization getting 'stuck' or diverging, and suggesting that increasing depth can inherently contribute to more robust training dynamics.
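The stated tail behaviour is easy to make concrete: under a rate function scaling as |y|^{2/(L+1)}, doubling the output magnitude multiplies the rate by 2^{2/(L+1)}, a factor that falls toward 1 as depth grows. The snippet below simply evaluates this scaling (the constant prefactor is omitted, and the depths are illustrative):

```python
import numpy as np

def rate_tail(y, L):
    """Tail scaling of the rate function stated in the text: |y|^(2/(L+1))."""
    return np.abs(y) ** (2.0 / (L + 1))

# Growth factor when the output magnitude doubles: 2^(2/(L+1)).
# At L = 1 the tail grows linearly in |y|; deeper networks penalize
# large outputs ever more weakly, i.e. their priors have heavier tails.
for L in (1, 3, 7):
    ratio = rate_tail(2.0, L) / rate_tail(1.0, L)
    print(L, round(ratio, 3))
```

Heavier prior tails mean rare large outputs are cheaper for deep networks, which is the probabilistic counterpart of the more forgiving optimization landscape described above.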

Training a wide Gaussian neural network on a Heaviside target deforms the prior rate function <span class="katex-eq" data-katex-display="false">I_{\mathrm{prior}}(y)</span> into a posterior rate function <span class="katex-eq" data-katex-display="false">I_{\mathrm{post}}(y)</span> at a fixed input <span class="katex-eq" data-katex-display="false">x_{\mathrm{test}}=3</span>, with the specific deformation differing based on the network's activation function (ReLU vs. tanh).

Beyond Empiricism: A Future Guided by Theoretical Principles

A foundational principle in neural network design emerges from the realization that infinitely wide networks converge to Gaussian processes. This convergence isn’t merely a mathematical curiosity; it establishes a direct link between network architecture and kernel methods, offering a principled pathway for crafting effective models. By understanding this relationship, designers can move beyond ad-hoc adjustments and instead strategically select or construct kernels – functions that define similarity between data points – to explicitly control the network’s behavior and inductive biases. This approach allows for the systematic exploration of different functional forms and the tailoring of network properties to specific tasks, ultimately improving generalization performance and offering a more robust foundation for machine learning applications. The kernel, therefore, becomes a central element in network design, enabling a shift from empirical tuning to a theoretically grounded methodology.

Neural networks often struggle to generalize beyond the training data, exhibiting fragility in the face of even slight input perturbations. However, recent advancements demonstrate that carefully tailoring the kernel – the function defining similarity between data points – can significantly enhance both generalization and robustness. By strategically controlling the kernel’s properties, researchers can effectively regularize the network, preventing overfitting and encouraging the learning of more meaningful representations. Techniques like variational inference provide a practical means of optimizing these kernels, allowing for efficient exploration of the vast kernel space and identification of configurations that maximize performance on unseen data. This approach not only improves the network’s ability to accurately predict outcomes for novel inputs, but also increases its resilience to noisy or adversarial examples, ultimately leading to more reliable and trustworthy artificial intelligence systems.

The Metropolis-adjusted Langevin algorithm (MALA) demonstrated robust performance in sampling the posterior distribution of finite-width neural networks, consistently achieving a stable acceptance rate of 0.75. This reliable sampling capability is crucial, as it validates the theoretical framework and opens avenues for future investigations. Researchers can now confidently extend these findings to explore more intricate neural network architectures, moving beyond simple models while maintaining a firm grasp on the underlying statistical properties. Furthermore, this work encourages a deeper examination of the inherent connections between kernel methods and Bayesian neural networks, potentially leading to novel approaches that combine the strengths of both paradigms and ultimately enhance the generalization and robustness of machine learning models.

Comparing the large-deviation rate function with the NNGP kernel reveals that the variational approach effectively minimizes the operator-norm gap, indicating a closer alignment between the learned and ideal kernel representations.
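The alignment metric quoted in the abstract, the relative operator-norm gap between a learned kernel and the NNGP kernel, is a one-line computation on Gram matrices. The kernels below are hypothetical stand-ins used only to exercise the metric:

```python
import numpy as np

def relative_operator_gap(K, K_nngp):
    """Relative operator-norm gap ||K - K_nngp||_op / ||K_nngp||_op.

    ord=2 gives the spectral (operator) norm of a matrix.
    """
    return np.linalg.norm(K - K_nngp, 2) / np.linalg.norm(K_nngp, 2)

# Hypothetical illustration: perturb a reference kernel and measure the gap.
X = np.linspace(-2, 2, 16)
K_nngp = np.minimum(X[:, None], X[None, :]) + 3.0   # shifted Brownian kernel, PSD
rng = np.random.default_rng(0)
B = rng.standard_normal((16, 16))
K_learned = K_nngp + 0.05 * (B + B.T) / 2.0         # small symmetric perturbation
print(relative_operator_gap(K_learned, K_nngp))
```

A gap of zero recovers the NNGP regime exactly; a persistent positive gap is the signature of feature learning that the paper tracks.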

The research dissects Bayesian neural networks not as monolithic entities, but as systems revealing internal structures through the lens of large deviations. This echoes a fundamental principle of reverse-engineering – to truly understand something, one must push it to its limits and observe the resulting behavior. As Ludwig Wittgenstein observed, “The limits of my language mean the limits of my world.” Similarly, this work demonstrates that by examining the network’s behavior in extreme scenarios – large deviations from the typical – the boundaries of its representational capacity, and the emergent kernels driving feature learning, become strikingly clear. The investigation isn’t simply about what the network learns, but how its limitations define the learned features themselves.

Unraveling the Network

The demonstration that Bayesian neural networks escape simple Gaussian process equivalences, that they learn kernels rather than merely embody them, feels less like an answer and more like a precisely defined question. The emergent kernels revealed through large deviations aren't simply descriptive; they hint at a computational principle. The field now faces the task of actively dissecting these kernels. What biases are implicitly encoded in the network architecture that predispose it to construct these particular representations? Identifying those constraints is not about limiting the network, but about understanding the logic governing its generalization: the rules it is reverse-engineering from data.

Current work understandably focuses on characterizing these kernels post-training. A more aggressive approach, and a true test of comprehension, will involve predicting the form of the emergent kernel based solely on the network's initial configuration and the structure of the data. Can one anticipate the features the network will learn? Success would signify not merely observation, but a capacity to engineer inductive biases directly, crafting networks that learn what is useful, not simply what is possible.

Ultimately, this line of inquiry suggests a shift in perspective. Bayesian neural networks aren't black boxes to be optimized; they are exploratory systems, probing the space of possible representations. The ‘posterior concentration’ isn't merely a mathematical convenience, but evidence of an internal search for a compact, informative kernel. The goal, then, isn't to achieve perfect prediction, but to map the landscape of learnable features, to understand what the network considers ‘simple’ and why.


Original article: https://arxiv.org/pdf/2602.22925.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-02-27 15:33