Author: Denis Avetisyan
This research tackles the challenge of catastrophic forgetting in Bayesian neural networks by combining continual learning techniques to enable robust performance on evolving data streams.

The paper explores strategies – including self-consistency, episodic replay, and elastic weight consolidation – to improve amortized Bayesian inference in continual learning scenarios.
Amortized Bayesian Inference (ABI) offers efficient posterior estimation, yet its performance often degrades when facing model misspecification or non-stationary data. This work, ‘Unsupervised Continual Learning for Amortized Bayesian Inference’, addresses this limitation by introducing a continual learning framework that combines self-consistency training with strategies to mitigate catastrophic forgetting, specifically episodic replay and elastic weight consolidation. Our approach enables robust ABI by sequentially adapting to new data without sacrificing performance on previously encountered tasks, yielding posterior estimates that rival those of Markov Chain Monte Carlo (MCMC) methods. Could this decoupling of simulation-based pre-training and unsupervised fine-tuning unlock trustworthy and adaptable Bayesian inference across a wider range of real-world applications?
Unveiling the Bayesian Bottleneck
Traditional Bayesian inference, while theoretically elegant, frequently encounters limitations when applied to models mirroring the complexity of real-world phenomena. The core difficulty lies in calculating the posterior distribution – the updated belief after observing data – which often involves multi-dimensional integrals that are analytically unsolvable. These integrals become exponentially more challenging as the number of model parameters increases, a situation known as the “curse of dimensionality.” Consequently, even with modest model complexity and reasonable datasets, direct computation of the posterior becomes computationally prohibitive, demanding excessive time and resources. This intractability motivates the development of approximate inference techniques, allowing researchers to sidestep the exact calculation and obtain solutions that, while not perfect, are sufficiently accurate for practical applications and statistical decision-making.
The practical application of Bayesian statistics to complex, real-world problems frequently hinges on the ability to effectively approximate posterior distributions. Calculating these distributions – which represent updated beliefs after observing data – is often analytically impossible, particularly as model dimensionality increases. Consequently, researchers employ a variety of approximation techniques, including Markov Chain Monte Carlo (MCMC) methods and variational inference, to obtain a computationally tractable representation of the posterior. These methods don’t yield the exact posterior, but instead provide a sufficiently accurate estimate that allows for meaningful inference, such as parameter estimation and model comparison. The quality of these approximations directly impacts the reliability of the conclusions drawn, making the development and refinement of robust approximation techniques a central focus within the field of Bayesian statistics. At the heart of all of this sits Bayes’ theorem, which relates the posterior over parameters \theta given data x to the likelihood and the prior:
\mathbb{P}(\theta|x) = \frac{\mathbb{P}(x|\theta)\,\mathbb{P}(\theta)}{\mathbb{P}(x)}
The pursuit of scalable Bayesian inference methods stems from a fundamental challenge: many real-world models, encompassing numerous parameters and intricate relationships, render traditional posterior estimation techniques computationally prohibitive. As model complexity increases, the time and resources required to obtain the posterior distribution – the core of Bayesian analysis – grow exponentially. This scalability barrier necessitates the development of innovative approaches, such as variational inference and Markov Chain Monte Carlo (MCMC) methods with advanced sampling schemes, to approximate the posterior effectively. Researchers are actively exploring techniques like stochastic gradient methods and distributed computing to handle increasingly large datasets and high-dimensional parameter spaces, ultimately enabling the application of Bayesian statistics to complex problems in fields ranging from machine learning and genomics to finance and cosmology. The goal isn’t simply to estimate the posterior, but to do so with computational efficiency, unlocking the full potential of Bayesian modeling for complex systems.

Mapping the Posterior: Neural Networks as Proxies
Amortized Bayesian Inference (ABI) addresses the computational expense of traditional Bayesian inference by employing a neural network to directly map samples from a prior distribution to samples from an approximate posterior distribution. This contrasts with methods requiring iterative optimization for each data point. By learning a function – parameterized by the neural network – that transforms prior samples, ABI avoids repeated calculations of the posterior. The network is trained to approximate the true posterior distribution, allowing for rapid generation of posterior samples once the network is trained. This process substantially reduces inference time, particularly in high-dimensional spaces or when dealing with large datasets, as it replaces per-data-point inference with a single forward pass through the neural network.
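The single-forward-pass idea can be sketched in a few lines. This is a minimal illustration, not the paper's architecture: a toy network (with random, untrained placeholder weights) maps a summary of the observed data to the mean and log-std of a Gaussian approximate posterior, from which samples are drawn in one pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical amortized posterior network: one hidden layer mapping an
# observed-data summary x to the mean and log-std of a Gaussian q(theta|x).
# Weights would normally be trained on simulations; here they are random
# placeholders purely for illustration.
W1 = rng.normal(size=(16, 2)); b1 = np.zeros(16)
W2 = rng.normal(size=(2, 16)); b2 = np.zeros(2)

def amortized_posterior(x_summary, n_samples=1000):
    """One forward pass replaces per-dataset iterative inference."""
    h = np.tanh(W1 @ x_summary + b1)
    mean, log_std = W2 @ h + b2
    # Draw posterior samples via the reparameterization trick.
    return mean + np.exp(log_std) * rng.normal(size=n_samples)

x_summary = np.array([0.3, 1.2])   # e.g. sample mean and std of observed data
samples = amortized_posterior(x_summary)
print(samples.shape)               # a full set of posterior draws, no MCMC loop
```

Once trained, inference for a new dataset costs only this forward pass, which is what makes the "amortized" label apt.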
Normalizing Flows and Flow Matching represent advancements in Amortized Bayesian Inference (ABI) by addressing limitations in the expressiveness of the learned posterior distribution. Traditional ABI methods often rely on simple parametric distributions, such as Gaussians, to approximate the posterior, which can lead to inaccuracies. Normalizing Flows achieve improved accuracy by employing a series of invertible transformations to map a simple base distribution into a more complex and flexible posterior. Flow Matching builds upon this by framing the problem as learning a vector field that transports samples from the prior to the posterior, offering further enhancements in representational capacity and enabling efficient sampling. Both techniques allow the learned posterior to better capture the intricacies of the true posterior distribution, particularly in high-dimensional or multimodal scenarios, leading to more reliable uncertainty quantification and improved downstream performance.
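The invertibility that makes normalizing flows work can be shown with a minimal affine coupling layer, sketched here in 2-D with fixed illustrative parameters rather than learned ones: one coordinate passes through unchanged and conditions a scale-and-shift of the other, so the Jacobian is triangular and the inverse is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal affine coupling layer in 2-D: the first coordinate is unchanged
# and parameterizes a scale/shift of the second, so the transform is
# invertible by construction and its log-determinant is cheap to compute.
def coupling_forward(z, w=0.5, b=0.1):
    z1, z2 = z[:, 0], z[:, 1]
    s = np.tanh(w * z1)            # log-scale, bounded for stability
    x2 = z2 * np.exp(s) + b * z1   # affine transform of z2 conditioned on z1
    log_det = s                    # log|det J| is just the log-scale here
    return np.stack([z1, x2], axis=1), log_det

def coupling_inverse(x, w=0.5, b=0.1):
    x1, x2 = x[:, 0], x[:, 1]
    s = np.tanh(w * x1)
    z2 = (x2 - b * x1) * np.exp(-s)
    return np.stack([x1, z2], axis=1)

z = rng.normal(size=(5, 2))        # samples from the simple base distribution
x, log_det = coupling_forward(z)
z_rec = coupling_inverse(x)
print(np.allclose(z, z_rec))       # the flow is exactly invertible
```

Stacking many such layers (with learned conditioners) is what lets a flow turn a Gaussian base into a complex, possibly multimodal posterior while keeping densities tractable.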
Simulation-Based Training (SBT) enables Amortized Bayesian Inference (ABI) to function effectively even when the true likelihood function p(x|\theta) is unknown or intractable. Instead of requiring explicit specification of this likelihood, SBT utilizes simulations from a generative model to approximate the posterior distribution. This is achieved by drawing parameters θ from the prior, simulating data x from the model, and training the neural network on the resulting (θ, x) pairs – maximizing the approximate posterior’s log-density on those pairs – so that it learns to map observations to posterior samples. The network thereby sidesteps the need to directly evaluate or define the true likelihood function. Consequently, SBT is particularly valuable in scenarios where the underlying data generating process is complex or poorly understood, allowing for Bayesian inference with minimal assumptions about the likelihood.
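The prior-sample, simulate, fit loop can be made concrete with a deliberately tiny stand-in: here the "amortizer" is just a linear model for the posterior mean, fit by least squares on simulated (θ, x) pairs. The simulator's density is never evaluated – only sampled – which is the defining property of simulation-based training.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulation-based training sketch (toy stand-in, not the paper's setup):
# we only *sample* from the simulator, never evaluate p(x|theta).
def prior(n):
    return rng.normal(0.0, 1.0, size=n)

def simulator(theta):
    # Black-box generative process; its density is never needed.
    return theta + rng.normal(0.0, 0.5, size=theta.shape)

theta = prior(10_000)
x = simulator(theta)

# Fit E[theta | x] = a * x + c by least squares on the simulated pairs.
A = np.stack([x, np.ones_like(x)], axis=1)
(a, c), *_ = np.linalg.lstsq(A, theta, rcond=None)

# For prior N(0,1) and noise N(0, 0.5^2), the exact posterior mean is
# 0.8 * x, which the learned coefficient approaches with enough simulations.
print(f"learned coefficient ~ {a:.2f}")
```

A real amortizer replaces the linear model with a conditional flow, but the training signal is the same: simulated pairs stand in for the intractable likelihood.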

Fortifying the Posterior: Consistency and Resilience
Bayesian Self-Consistency (BSC) improves Amortized Bayesian Inference (ABI) by minimizing the discrepancy between the learned posterior distribution and the prior distribution defined by the generative model. This is achieved through an iterative refinement process where the posterior serves as a refined prior for subsequent inference steps. Specifically, BSC formulates an optimization objective that directly penalizes differences between the posterior samples and samples generated from the model using parameters estimated from the posterior. This enforcement of consistency reduces posterior collapse and improves the calibration of uncertainty estimates, leading to more reliable and accurate probabilistic predictions.
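One common way to operationalize self-consistency – sketched here on a toy conjugate-Gaussian model, as an illustration rather than the paper's exact objective – exploits the identity log p(x) = log p(θ) + log p(x|θ) − log p(θ|x), which must hold for every θ. If the amortized q(θ|x) is substituted for the true posterior, the right-hand side should be constant across posterior draws, so its variance is a trainable consistency loss.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy conjugate model: theta ~ N(0,1), x|theta ~ N(theta,1), observed x = 1.
x_obs = 1.0

def log_prior(t):      return -0.5 * t**2 - 0.5 * np.log(2 * np.pi)
def log_lik(t):        return -0.5 * (x_obs - t)**2 - 0.5 * np.log(2 * np.pi)

def log_q(t, mu, var): # candidate amortized posterior q(theta | x)
    return -0.5 * (t - mu)**2 / var - 0.5 * np.log(2 * np.pi * var)

def self_consistency_loss(mu, var, n=5000):
    t = mu + np.sqrt(var) * rng.normal(size=n)
    # Each draw yields an estimate of log p(x); for the true posterior these
    # estimates all coincide, so their variance is zero.
    log_marginal = log_prior(t) + log_lik(t) - log_q(t, mu, var)
    return np.var(log_marginal)

# The true posterior here is N(0.5, 0.5); only that setting drives the
# loss to (numerically) zero.
print(self_consistency_loss(0.5, 0.5) < self_consistency_loss(0.0, 1.0))
```

Minimizing this variance with respect to the amortizer's parameters pulls q(θ|x) toward the true posterior without needing labeled (θ, x) pairs, which is what makes it usable for unsupervised fine-tuning.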
Catastrophic forgetting, the tendency of artificial neural networks to abruptly lose previously learned information when exposed to new data, is addressed through continual learning techniques such as Episodic Replay and Elastic Weight Consolidation. Episodic Replay stores a representative subset of past experiences and replays them during training on new tasks, effectively regularizing the learning process and preventing drastic weight changes that would erase prior knowledge. Elastic Weight Consolidation, conversely, identifies important weights for previous tasks and constrains their movement during subsequent learning, preserving performance on older tasks while allowing adaptation to new ones. These methods enable sequential learning without significant performance degradation on previously mastered tasks, facilitating model adaptation in dynamic environments.
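Both mechanisms are simple to state in code. The sketch below uses illustrative names and shapes, not the paper's implementation: a fixed-capacity replay buffer that mixes stored past examples into new batches, and an EWC-style quadratic penalty that anchors weights in proportion to their (Fisher-information) importance for earlier tasks.

```python
import numpy as np

rng = np.random.default_rng(4)

class ReplayBuffer:
    """Episodic replay: retain a bounded sample of past experience."""
    def __init__(self, capacity=256):
        self.capacity, self.data = capacity, []
    def add(self, batch):
        self.data.extend(batch)
        self.data = self.data[-self.capacity:]   # drop oldest beyond capacity
    def sample(self, k):
        idx = rng.choice(len(self.data), size=min(k, len(self.data)),
                         replace=False)
        return [self.data[i] for i in idx]

def ewc_penalty(weights, old_weights, fisher, lam=10.0):
    # Quadratic anchor: weights with high Fisher information (important for
    # the old task) are penalized more strongly for drifting away.
    return 0.5 * lam * np.sum(fisher * (weights - old_weights) ** 2)

buf = ReplayBuffer()
buf.add(list(rng.normal(size=32)))   # store examples from an earlier task
mixed = buf.sample(8)                # replayed alongside the new task's data

w_old = np.array([1.0, -2.0]); fisher = np.array([5.0, 0.1])
print(ewc_penalty(np.array([1.2, -1.0]), w_old, fisher))
```

In training, the replayed examples join each new batch and the EWC term is added to the loss; together they trade a little plasticity for substantially better retention.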
Maintaining consistent performance is critical in non-stationary environments where data distributions shift over time. Our approach mitigates performance degradation by ensuring the learned model remains accurate despite these changes. Evaluation using the Maximum Mean Discrepancy (MMD) ratio demonstrates this robustness; a ratio below 1 indicates that the distribution of the learned model’s outputs closely matches the true posterior distribution, exceeding the performance of a simulation-based baseline. This metric confirms the model’s ability to adapt to evolving data without experiencing substantial performance drops, a common issue in continual learning scenarios.
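The MMD ratio itself is easy to estimate from samples. The sketch below (synthetic stand-in samples, Gaussian kernel with an arbitrary bandwidth) computes a biased MMD² estimate for the model's posterior draws against a reference (e.g. MCMC), divides by the same quantity for the baseline, and checks that the ratio falls below 1.

```python
import numpy as np

rng = np.random.default_rng(5)

def mmd2(x, y, bandwidth=1.0):
    """Biased MMD^2 estimate between 1-D sample sets with a Gaussian kernel."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * bandwidth**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

reference = rng.normal(0.0, 1.0, 2000)    # stand-in for MCMC posterior draws
model     = rng.normal(0.05, 1.0, 2000)   # close to the reference
baseline  = rng.normal(0.8, 1.3, 2000)    # noticeably further away

ratio = mmd2(model, reference) / mmd2(baseline, reference)
print(ratio < 1.0)    # below 1: the model beats the baseline on this metric
```

Because MMD is zero only when the two distributions coincide (for a characteristic kernel), the ratio gives a scale-free way to say "closer to the true posterior than the baseline".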
Experimental results demonstrate a consistent reduction in both absolute mean bias and absolute standard deviation bias within estimated posterior distributions when utilizing the proposed method, as compared to both naive self-consistency and simulation-based training approaches. Specifically, the method exhibits improved accuracy in representing the true posterior distribution, indicated by lower biases across evaluated parameters. This improvement is consistently observed across multiple experimental setups and datasets, suggesting a robust enhancement in the reliability and precision of posterior estimation. Quantitative analysis confirms the statistically significant reduction in bias metrics, validating the effectiveness of the approach in mitigating inaccuracies present in alternative estimation techniques.
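The two bias metrics are straightforward to compute from posterior draws; the sketch below uses synthetic arrays as stand-ins for per-dataset posterior samples, comparing estimated draws against a reference posterior and averaging the absolute deviations of means and standard deviations.

```python
import numpy as np

rng = np.random.default_rng(6)

def posterior_biases(estimated, reference):
    """Average absolute bias of posterior means and stds across datasets.

    Both inputs have shape (n_datasets, n_draws)."""
    mean_bias = np.abs(estimated.mean(axis=1) - reference.mean(axis=1))
    std_bias  = np.abs(estimated.std(axis=1)  - reference.std(axis=1))
    return mean_bias.mean(), std_bias.mean()

# Synthetic stand-in: 10 datasets, 5000 reference draws each, with the
# "estimated" posterior perturbed only slightly.
reference = rng.normal(0.0, 1.0, size=(10, 5000))
estimated = reference + rng.normal(0.0, 0.02, size=reference.shape)

abs_mean_bias, abs_std_bias = posterior_biases(estimated, reference)
print(abs_mean_bias < 0.05 and abs_std_bias < 0.05)
```

Lower values on both metrics mean the estimated posterior's location and spread track the reference, which is the sense in which the paper's method improves on naive self-consistency and simulation-based training.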

Beyond the Algorithm: Implications for Cognition and Intelligence
The intricacies of human decision-making are increasingly illuminated by computational models, notably the Racing Diffusion Model. This framework posits that choices emerge from a competitive process where evidence accumulates over time for different options, and the first to reach a threshold determines the response. Recent advancements integrate Maximum Mean Discrepancy (MMD), a statistical measure of distribution differences, to refine these models and better capture the nuances of cognitive processes. Specifically, research utilizing MMD has provided a powerful lens through which to understand phenomena like the Stroop Effect – the interference experienced when naming the color of a word that spells a different color. By quantifying the distributional differences in neural activity or behavioral responses, MMD-informed Racing Diffusion Models offer a more precise and insightful account of how cognitive interference arises and impacts decision speed and accuracy, ultimately bridging the gap between theoretical frameworks and empirical observations.
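The racing mechanism itself can be simulated in a few lines. This is a generic illustration with made-up drift rates, not the paper's fitted model: each response option is an independent noisy accumulator, and the first to cross a threshold determines both the choice and the response time.

```python
import numpy as np

rng = np.random.default_rng(7)

def race_trial(drifts, threshold=1.0, dt=0.001, noise=1.0, max_t=5.0):
    """One trial of a racing diffusion: first accumulator to the bound wins."""
    drifts = np.asarray(drifts, dtype=float)
    evidence = np.zeros_like(drifts)
    t = 0.0
    while t < max_t:
        # Euler step of the diffusion: drift plus scaled Gaussian noise.
        evidence += drifts * dt + noise * np.sqrt(dt) * rng.normal(size=drifts.size)
        t += dt
        if (evidence >= threshold).any():
            return int(np.argmax(evidence)), t   # (chosen option, RT)
    return -1, max_t                             # no decision by the deadline

# A Stroop-like setup: the correct option has a strong drift while the
# competing (interfering) option still accumulates some evidence.
choices, rts = zip(*(race_trial([2.0, 0.2]) for _ in range(100)))
print(np.mean(np.array(choices) == 0))   # the correct option usually wins
```

Raising the competitor's drift (stronger interference) lowers accuracy and lengthens response times, which is exactly the pattern the Stroop effect produces empirically.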
Traditional cognitive models often struggle with the inherent variability and complexity of human thought, frequently treating cognitive features as isolated entities. DeepSet architectures offer a powerful solution by enabling the representation of these features as sets, allowing models to consider all possible combinations and relationships between them. This approach, inspired by set theory, moves beyond simple averaging or summation, instead capturing the holistic nature of cognitive processing. By learning representations of entire sets of features – such as those involved in recognizing objects, recalling memories, or making decisions – DeepSet models can achieve greater flexibility and accuracy in simulating human cognition. The architecture’s permutation invariance also ensures that the order of features within a set doesn’t affect the model’s output, mirroring the brain’s ability to process information regardless of input sequence, ultimately leading to more robust and biologically plausible cognitive models.
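The permutation invariance of a DeepSet follows directly from its structure: an element-wise embedding, a symmetric pooling operation, and a readout on the pooled vector. The sketch below uses random placeholder weights (standing in for trained parameters) purely to demonstrate the invariance property.

```python
import numpy as np

rng = np.random.default_rng(8)

# DeepSet sketch: phi embeds each element, embeddings are sum-pooled, and
# rho maps the pooled vector to the output. Sum pooling makes the whole
# function invariant to the ordering of set elements.
W_phi = rng.normal(size=(8, 3))   # placeholder weights for phi
W_rho = rng.normal(size=(1, 8))   # placeholder weights for rho

def deepset(elements):
    embedded = np.tanh(elements @ W_phi.T)   # phi applied element-wise
    pooled = embedded.sum(axis=0)            # permutation-invariant pooling
    return (W_rho @ pooled).item()           # rho on the set representation

s = rng.normal(size=(5, 3))                  # a set of 5 three-dim features
shuffled = s[rng.permutation(5)]
print(np.isclose(deepset(s), deepset(shuffled)))   # order does not matter
```

Any symmetric pooling (sum, mean, max) preserves the property; sum pooling is the canonical choice because it retains the most information about the set.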
The principles underpinning these computational models – initially developed to dissect cognitive phenomena – demonstrate a surprising versatility, extending far beyond the realm of psychology. Researchers are now applying similar methodologies to predict complex systems like air passenger traffic, leveraging linear regression and other statistical techniques to anticipate fluctuations and optimize resource allocation. Furthermore, the insights gained from modeling human decision-making are proving invaluable in the development of more robust and adaptive artificial intelligence. By incorporating principles of cognitive flexibility and error correction, AI systems can be engineered to handle unforeseen circumstances and learn from experience with greater efficiency, ultimately leading to more reliable and intelligent machines capable of navigating real-world complexities.

The pursuit of robust continual learning, as detailed in the study, inherently challenges established boundaries of model stability. It’s a process of deliberate disruption, testing the limits of neural networks against the relentless tide of sequential data. This aligns with John Stuart Mill’s assertion that “It is better to be a dissatisfied Socrates than a satisfied fool.” The research doesn’t seek perfect, static solutions, but rather systems capable of learning from – and adapting to – ongoing change, even if that means temporarily sacrificing initial certainty. The mitigation of catastrophic forgetting through techniques like episodic replay and elastic weight consolidation is not about preventing all errors, but about intelligently managing them – a sophisticated form of intellectual dissatisfaction driving continual improvement.
What Breaks Down Next?
The pursuit of continual learning in amortized Bayesian inference, as demonstrated, is less about achieving seamless knowledge accumulation and more about elegantly postponing the inevitable. The methodologies – episodic replay, elastic weight consolidation, self-consistency – function as increasingly elaborate scaffolding against the relentless erosion of previously learned representations. This raises the question: how much complexity can be layered before the structure collapses under its own weight? Future work will undoubtedly explore ever more sophisticated regularizations, but a more fruitful avenue may lie in deliberately embracing forgetting, treating it not as a failure, but as a feature. Perhaps true intelligence isn’t about retaining everything, but about efficiently discarding what is no longer relevant.
The current focus on mitigating catastrophic forgetting implicitly assumes a static ground truth – a fixed definition of “relevance.” Yet, the world rarely presents information as neatly packaged, immutable facts. A genuinely robust system should be able to revise its internal models, to reinterpret past data in light of new evidence, even if that requires discarding previously “consistent” beliefs. This demands a shift from simply preventing forgetting to actively managing belief revision – a messy, probabilistic process far removed from the tidy world of loss functions and optimization algorithms.
Ultimately, the field seems poised to confront a fundamental tension: the desire for stable, reliable inference versus the inherent plasticity required for adaptation. The solutions likely won’t lie in achieving perfect retention, but in developing systems that can gracefully navigate the boundary between remembering and letting go: systems that understand that forgetting, like breaking, is often a necessary step toward understanding how things truly work.
Original article: https://arxiv.org/pdf/2602.22884.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-28 11:46