Author: Denis Avetisyan
Despite promising surface-level results, many machine unlearning techniques fail to fully remove sensitive information from a model’s core knowledge, leaving a hidden trace.

This review assesses machine unlearning methods through the lens of internal feature representations and proposes a new framework based on class-mean features to achieve true data removal.
Despite recent advances in machine unlearning (MU) methods demonstrating promising results in erasing unwanted data, a critical vulnerability remains: the potential for inadvertently reinstating forgotten concepts. This paper, ‘An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations’, investigates this paradox by analyzing the internal feature representations of unlearned models, revealing that many state-of-the-art techniques succeed primarily because of a misalignment between last-layer features and the classifier, leaving hidden features surprisingly discriminative. We demonstrate that adjusting only the classifier can achieve negligible forgetting while preserving retained knowledge, suggesting that true unlearning requires addressing representation-level information. Consequently, we propose MU methods based on a class-mean features classifier to enforce alignment, raising the question of how faithful representation-level evaluation can become standard practice in privacy-preserving machine learning.
The Inevitable Fade: Confronting Machine Forgetfulness
Contemporary machine learning models excel at identifying patterns and making predictions, yet they face a fundamental challenge: the inability to efficiently ‘unlearn’ specific information. Unlike humans, these models don’t simply forget; removing the influence of a particular data point requires complex procedures, often approaching the computational cost of completely rebuilding the model from scratch. This isn’t merely an inconvenience; it represents a significant limitation in scenarios demanding data privacy, regulatory compliance – such as the ‘right to be forgotten’ – or adaptation to evolving datasets where corrections or removals are frequent. The persistence of learned information, even after attempts at removal, poses risks and hinders the deployment of these powerful technologies in dynamic, real-world applications where data is not static and demands ongoing refinement.
The rigidity of current machine learning models presents substantial challenges regarding data privacy and real-world applicability. Because models struggle to selectively ‘forget’ information, sensitive data inadvertently retained poses significant risks, potentially violating user privacy and regulatory requirements. Furthermore, this inflexibility hampers a model’s ability to adapt to evolving circumstances; as data drifts, becomes inaccurate, or requires correction, the inability to efficiently remove its influence necessitates costly and time-consuming retraining. This is particularly problematic in dynamic fields like financial modeling or medical diagnosis where ongoing data updates are crucial for maintaining accuracy and relevance, and where outdated or incorrect information can lead to flawed predictions and potentially harmful outcomes.
The escalating scale of modern machine learning models presents a substantial challenge when data needs to be revised or removed; completely retraining these models from scratch is often computationally prohibitive. As datasets grow to encompass billions of parameters, the resources – time, energy, and cost – required for full retraining become impractical for many real-world applications, particularly those operating with limited infrastructure or stringent latency requirements. This limitation motivates the development of ‘unlearning’ techniques – methods designed to selectively remove the influence of specific data points without necessitating a complete overhaul of the model. Efficient unlearning promises to not only address growing privacy concerns surrounding data retention but also enable models to adapt dynamically to evolving information landscapes, offering a more sustainable and responsive approach to machine learning.
Assessing the success of machine unlearning is a surprisingly nuanced endeavor, extending beyond simply verifying that a model no longer predicts the correct output for a removed data point. Many current unlearning techniques achieve what appears to be complete ‘output forgetting’ – the model no longer associates the removed data with a specific prediction – but fail to truly eliminate the data’s influence on the model’s internal feature representations. This means the ‘forgotten’ information can still subtly affect performance on other, unrelated tasks, creating potential vulnerabilities and undermining the goal of genuine data removal. Consequently, robust evaluation requires examining both output-level accuracy – does the model forget the specific data? – and feature-level changes – has the model genuinely unlearned the underlying patterns associated with that data? A truly effective unlearning method must demonstrate significant changes at both levels to ensure complete and reliable data removal.

The Architecture of Memory: Feature Alignment and Unlearning
RepresentationAlignment, a critical factor in successful unlearning, describes the extent to which a neural network’s internal feature representations correspond to the inherent groupings within the training data’s class structure. A high degree of alignment indicates that features are meaningfully organized around class boundaries, facilitating targeted removal of information associated with specific classes or data points during the unlearning process. Conversely, poorly aligned representations necessitate more extensive model modification to effectively ‘forget’ data, potentially impacting performance on retained tasks. The effectiveness of unlearning algorithms is directly correlated with the pre-existing level of RepresentationAlignment within the model; models exhibiting strong alignment demonstrate improved precision and efficiency in removing unwanted information compared to those with weak or disordered feature spaces.
Neural Collapse (NC) is a phenomenon observed during the late stages of deep neural network training, characterized by an alignment of learned feature representations and the final classification layer. Specifically, features of different classes converge to the vertices of a regular simplex, and the classifier weights align with these feature vectors. While NC generally improves generalization performance, it introduces vulnerabilities during unlearning because the concentrated feature representations amplify the impact of removing data associated with specific classes. The alignment means that deleting data from one class necessitates a more substantial alteration of the shared feature space and classifier weights than would be required if features were less correlated. This sensitivity makes achieving complete and efficient unlearning more challenging in networks exhibiting strong Neural Collapse.
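The first Neural Collapse property, within-class features collapsing onto their class means, can be checked with a simple scatter ratio. The sketch below is purely illustrative: the function name and toy features are invented here, and real use would take penultimate-layer activations from a trained network.

```python
# Rough check for the first Neural Collapse property (NC1): within-class
# feature variation shrinks relative to between-class variation. A ratio
# near zero indicates collapsed, tightly clustered class features.

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nc1_ratio(features, labels):
    """Return within-class scatter divided by between-class scatter."""
    classes = sorted(set(labels))
    global_mu = mean(features)
    within, between = 0.0, 0.0
    for c in classes:
        group = [f for f, y in zip(features, labels) if y == c]
        mu = mean(group)
        for f in group:
            within += sum((fi - mi) ** 2 for fi, mi in zip(f, mu))
        between += len(group) * sum((mi - gi) ** 2 for mi, gi in zip(mu, global_mu))
    return within / between

# Fully collapsed toy features: each class concentrated at a single point.
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(nc1_ratio(collapsed, [0, 0, 1, 1]))  # → 0.0
```

In a network exhibiting strong NC, this ratio approaches zero late in training, which is exactly the concentration that makes class-targeted deletion so disruptive.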
Utilizing ClassMeanFeatures as a technique to improve unlearning performance involves identifying and targeting the feature vectors that most strongly represent each class within the model. These class mean features, calculated as the average feature vector for all samples belonging to a specific class, serve as representative anchors for that class’s information. During unlearning, modifications are then focused on these identified feature vectors and associated weights, rather than applying uniform updates across the entire model. This targeted approach minimizes disruption to retained knowledge while efficiently removing information related to the data being forgotten, leading to improved unlearning accuracy and reduced performance degradation on retained tasks. The efficacy of this technique relies on the assumption that these class mean features encapsulate the most salient information for each class, and therefore represent the primary target for removal during the unlearning process.
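The idea can be sketched as a nearest-class-mean classifier over penultimate-layer features. Everything below is a minimal illustration with invented names and toy two-dimensional features, not the paper’s implementation:

```python
# Sketch of a class-mean-features (CMF) classifier: average each class's
# feature vectors into one anchor, then classify by the nearest anchor.

def class_means(features, labels):
    """Average the feature vectors of each class into one anchor per class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(feature, means):
    """Classify a feature vector by its nearest class mean (squared L2)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(means, key=lambda y: dist2(feature, means[y]))

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labs = [0, 0, 1, 1]
means = class_means(feats, labs)
print(predict([0.8, 0.2], means))  # → 0
```

Because the anchors are computed directly from the features, any residual information about a “forgotten” class shows up immediately as a usable anchor, which is what makes these means a natural target during unlearning.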
Effective unlearning relies on the degree to which a model’s internal feature representations correspond to the data’s class structure; strong alignment enables more targeted removal of information related to forgotten data. Existing unlearning methods often exhibit inefficiencies due to a lack of precise targeting, leading to either incomplete removal or detrimental impacts on retained knowledge. Our framework addresses these limitations by prioritizing the achievement of robust feature alignment during the unlearning process. This prioritization allows for a more granular and efficient removal of outdated information, minimizing collateral damage to the model’s performance on remaining tasks and facilitating a more complete “forgetting” of the specified data.

Strategies for Erasure: Navigating the Landscape of Unlearning Methods
Current machine unlearning research encompasses a range of algorithmic strategies differing in complexity and efficacy. Simpler methods, such as RandomLabel, which randomly re-labels data points to be forgotten, and RetainOnlyRetrain, which retrains the model solely on the remaining data, provide baseline unlearning capabilities. More advanced techniques, including SCRUB, which utilizes a scrubbing process to remove data influence, and UNSIR, which focuses on unlearning through influence suppression, employ more nuanced approaches to modify the model. These methods vary in computational cost and the degree to which they can successfully eliminate the effects of the forgotten data without significantly degrading performance on the retained data.
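As one concrete example, the relabeling step behind a RandomLabel-style baseline can be sketched as follows. The function name and parameters are illustrative; the subsequent fine-tuning on the relabeled forget set depends on the model and framework and is omitted:

```python
import random

# Sketch of the RandomLabel baseline's data-preparation step: assign each
# sample in the forget set a random incorrect label, then fine-tune the
# model on the relabeled data so its predictions for that data degrade.

def random_relabel(forget_labels, num_classes, seed=0):
    """Assign each forgotten sample a random label other than its own."""
    rng = random.Random(seed)
    new_labels = []
    for y in forget_labels:
        choices = [c for c in range(num_classes) if c != y]
        new_labels.append(rng.choice(choices))
    return new_labels

relabeled = random_relabel([3, 3, 3], num_classes=10)
print(relabeled)  # three labels drawn from 0-9, none equal to 3
```

Note that this only corrupts the model’s output mapping for the forget set; as the paper argues, it need not remove the class structure still encoded in the features.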
NegGradPlus and SalUn represent parameter-manipulation approaches to machine unlearning. NegGradPlus operates by reversing the gradients of the forgotten data during retraining, effectively nullifying their impact on model weights; negative gradients are accumulated for the specific data points slated for removal. SalUn (Saliency-based Unlearning) identifies and diminishes the influence of salient features, those most strongly activated by the data to be forgotten, by scaling down the corresponding weights. Both techniques aim to directly modify the model’s parameters to reduce the correlation between the removed data and the model’s predictions, without requiring access to the original training dataset or retraining on the entire dataset.
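The gradient-reversal idea behind NegGradPlus can be illustrated on a toy one-parameter model. This is a minimal sketch, not the method’s actual implementation: the linear “model”, squared loss, and step function below are all invented for illustration.

```python
# Toy sketch of the NegGradPlus idea: take gradient-descent steps on the
# retained data while negating (ascending) the gradient on the data to be
# forgotten. Real implementations operate on full network weights.

def grad_loss(w, x, y):
    """Gradient of 0.5 * (w * x - y)**2 with respect to w."""
    return (w * x - y) * x

def neggrad_plus_step(w, retain, forget, lr=0.1, alpha=1.0):
    g = 0.0
    for x, y in retain:              # descend on retained samples
        g += grad_loss(w, x, y)
    for x, y in forget:              # ascend on samples to forget
        g -= alpha * grad_loss(w, x, y)
    return w - lr * g

# One step pushes w toward fitting the retain point (1.0, 2.0) and away
# from the forget point (1.0, 0.0).
w = neggrad_plus_step(1.0, [(1.0, 2.0)], [(1.0, 0.0)])
print(w)  # → 1.2
```

The weighting `alpha` matters in practice: unbounded ascent on the forget loss can destabilize training, so the forget term is typically damped or scheduled.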
CMFUnlearning addresses the machine unlearning problem by explicitly utilizing ClassMeanFeatures (CMF) during the unlearning process. This method calculates the mean feature vector for each class and enforces alignment between the model’s feature representations and these class means after data removal. By minimizing the distance between updated feature representations and the corresponding CMF, CMFUnlearning effectively reduces the influence of forgotten data at the feature level. Empirical results demonstrate that this approach yields significantly lower feature-level forget accuracy compared to baseline unlearning methods, indicating a more complete removal of data traces without substantial performance degradation.
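The alignment objective at the heart of such an approach can be illustrated as a penalty on the distance between each sample’s features and its class-mean anchor. This is a minimal sketch with invented names, not the paper’s code:

```python
# Sketch of a class-mean alignment objective: penalize the mean squared
# distance between each retained sample's feature vector and its class
# mean, pulling representations back into alignment with the classifier.

def alignment_loss(features, labels, class_means):
    """Mean squared distance from each feature to its class-mean anchor."""
    total = 0.0
    for f, y in zip(features, labels):
        mu = class_means[y]
        total += sum((fi - mi) ** 2 for fi, mi in zip(f, mu))
    return total / len(features)

means = {0: [1.0, 0.0], 1: [0.0, 1.0]}
feats = [[1.0, 0.0], [0.5, 0.5]]
print(alignment_loss(feats, [0, 1], means))  # → 0.25
```

Minimizing a term of this form during unlearning acts on the features themselves rather than only on the classifier, which is the distinction the paper draws between output-level and representation-level forgetting.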
Current machine unlearning methods, despite utilizing diverse strategies like parameter manipulation or data retention/retraining, universally target the removal of specific data point influence without sacrificing overall model utility. However, research indicates that achieving complete unlearning necessitates a focus on the model’s underlying feature representations. Existing techniques often fail to adequately address the propagation of information embedded within these features, leading to residual traces of the forgotten data. Successfully removing these traces requires methods that explicitly modify or neutralize the learned feature representations to ensure complete removal of the target data’s influence, rather than solely focusing on direct parameter adjustments or data-level interventions.

Architectural Agnosticism: Unlearning Across Diverse Neural Networks
Machine unlearning, the process of selectively removing the influence of specific data points from a trained model, isn’t confined to a single neural network design. Research demonstrates that its principles and associated techniques, such as influence functions and approximate Newton methods, translate effectively across diverse architectures. Established convolutional neural networks, like ResNet, which excel at image recognition through hierarchical feature extraction, can be ‘unlearned’ using approaches similar to those applied to the more recently developed Vision Transformers (ViT). ViT, leveraging attention mechanisms, processes images as sequences, yet still accommodates unlearning strategies focused on modifying model parameters or activations. This architectural agnosticism is crucial, as it suggests a unified framework for data privacy and model adaptation can be applied broadly, irrespective of the underlying model’s complexity or specific design choices.
Despite the increasing diversity in machine learning model architectures – from the convolutional layers of ResNet to the attention mechanisms within ViT – the fundamental principle of machine unlearning remains steadfast: the precise and efficient removal of specific data’s influence. This consistent objective necessitates techniques capable of isolating and neutralizing the impact of designated information without disrupting the model’s broader knowledge base. While the implementation of unlearning diverges based on architectural nuances – requiring tailored approaches to gradient manipulation, weight adjustments, or influence function calculations – the core goal of selective information erasure provides a unifying framework for research and development across the entire landscape of neural networks. This allows for the creation of models that are not only powerful but also respectful of data privacy and adaptable to evolving requirements.
The increasing prevalence of machine learning in everyday life necessitates robust mechanisms for data control, making efficient unlearning a critical capability. In privacy-sensitive applications – such as healthcare, finance, and personalized recommendations – the right to be forgotten, or the ability to remove an individual’s data from a trained model, is becoming legally mandated and ethically essential. Beyond privacy, dynamic environments demand models that can adapt to changing circumstances, like evolving regulations or shifting user preferences; unlearning allows for the selective removal of outdated or irrelevant information without the costly and time-consuming process of full retraining. This adaptability extends to scenarios involving data corrections or the mitigation of biases, where targeted removal of problematic data points can significantly improve model fairness and reliability. Consequently, the development of scalable and efficient unlearning techniques is not merely an academic pursuit, but a fundamental requirement for responsible and sustainable machine learning deployment.
Despite the growing need for data privacy and model adaptability, machine unlearning remains an evolving field requiring targeted optimization. Current unlearning techniques, however, often prioritize suppressing outputs associated with removed data rather than genuinely erasing its influence on the model’s internal reasoning. Research indicates a significant disconnect between the classifier’s observable behavior and the underlying feature representations after unlearning, suggesting existing methods primarily focus on output-level corrections without addressing deeper, more fundamental alterations within the model itself. This misalignment raises concerns about potential vulnerabilities and the persistence of memorized information, highlighting the need for unlearning strategies that truly reshape internal representations and promote robust privacy protection.

Beyond Accuracy: Towards a Nuanced Evaluation of True Unlearning
Evaluating whether a machine learning model has truly ‘unlearned’ specific data is more complex than simply measuring performance on retained information. While output-level metrics like overall accuracy provide a general sense of retained knowledge, a deeper understanding requires probing the model’s internal representations. Feature-Level Metrics, such as Linear Probing, directly assess what the model has learned about specific features, revealing whether the model has genuinely forgotten the target data or merely learned to suppress its influence on outputs. This technique involves training a simple classifier on the model’s internal activations to determine if information related to the unlearned data still persists within those representations; a persistent signal indicates memorization rather than true unlearning. Consequently, relying solely on output metrics can be misleading, as a model might maintain high accuracy by memorizing workarounds while still retaining sensitive information internally.
Traditional evaluations of unlearning often focus on observable outputs, such as maintained accuracy on non-target data, but these metrics can be misleading. A model might appear to have successfully ‘forgotten’ specific information while merely learning to suppress its influence on predictions – essentially memorizing a workaround rather than genuinely erasing the learned representation. To address this limitation, researchers are increasingly turning to FeatureLevelMetrics, like Linear Probes, which examine the internal representations within the model itself. These probes analyze whether the model’s underlying features have actually changed in response to the unlearning process, revealing if the targeted data has been truly removed from the model’s knowledge base, or if it remains encoded but masked by a learned suppression strategy. This deeper level of analysis is crucial for verifying genuine unlearning and ensuring that models are not simply concealing, but rather forgetting, sensitive information.
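A linear probe of this kind can be sketched with nothing more than a perceptron trained on frozen features. The toy features below are illustrative; in practice the probe is fit on a model’s penultimate-layer activations, and high probe accuracy on the supposedly forgotten class signals residual information:

```python
# Minimal linear-probe sketch: train a perceptron on frozen "internal
# features" and check whether the supposedly forgotten class is still
# linearly separable from the rest.

def train_probe(features, labels, epochs=20, lr=0.1):
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):          # y in {0, 1}
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
            err = y - pred                          # perceptron update
            w = [wi + lr * err * fi for wi, fi in zip(w, f)]
            b += lr * err
    return w, b

def probe_accuracy(w, b, features, labels):
    correct = 0
    for f, y in zip(features, labels):
        pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
        correct += pred == y
    return correct / len(labels)

# Features of the "forgotten" class (label 1) remain clustered apart from
# the rest, so the probe recovers them perfectly: evidence that the model
# has not truly unlearned at the representation level.
feats = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
labs = [0, 0, 1, 1]
w, b = train_probe(feats, labs)
print(probe_accuracy(w, b, feats, labs))  # → 1.0
```

If unlearning had genuinely erased the class at the feature level, the probe’s accuracy on that class would fall toward chance instead.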
The current evaluation of unlearning techniques often relies on broad accuracy measurements, which can be misleading as they fail to reveal how a model forgets information. Future investigations must prioritize the development of more granular metrics capable of discerning the subtleties of this process. These advanced evaluations should move beyond simply verifying the removal of specific data points and instead focus on analyzing the model’s internal representations – its learned features – to confirm genuine ‘forgetting’ rather than superficial output suppression. Such metrics could assess the degree to which previously relevant features are attenuated or reorganized, providing a more complete understanding of a model’s behavioral changes post-unlearning and ultimately enabling the creation of more robust and privacy-respecting machine learning systems.
Effective model unlearning extends beyond simply maintaining performance on data the model should remember; it fundamentally concerns respecting data privacy and ensuring adaptable, responsible AI. As the feature-level evaluations described above make clear, unlearning can only be trusted when the gap between a model’s observable outputs and its internal representations is closed, so that targeted information is genuinely erased rather than merely suppressed.

The pursuit of machine unlearning, as detailed in this study, highlights a curious phenomenon: the illusion of erasure. While models may convincingly appear to forget at the output level, the underlying representations often retain traces of past information. This echoes a sentiment articulated by Ken Thompson: “The best programs are small and simple, and they have very little to do.” The elegance of a truly ‘unlearned’ system lies not in complex procedures masking old data, but in a fundamental simplicity – a clean slate at every level. The research suggests that focusing on representation-level unlearning, specifically through techniques like classifier alignment, offers a path toward achieving this graceful decay, allowing systems to age without being burdened by persistent echoes of the past.
The Echo of What Remains
The pursuit of machine unlearning, as this work demonstrates, quickly reveals itself not as erasure, but as a carefully constructed forgetting. Output-level success, it appears, is a relatively shallow victory, obscuring the persistence of information within the network’s architecture. Every delay in achieving true representational unlearning is, in essence, the price of understanding just how deeply ingrained these patterns become. The tendency for existing methods to merely shift, rather than eliminate, traces of prior data suggests a fundamental limitation: the assumption that forgetting can be measured solely by what is expressed, rather than what exists.
Future efforts must necessarily move beyond superficial metrics and confront the question of internal state. A focus on class-mean features, as proposed, is a logical progression, yet it is unlikely to be a final solution. The architecture itself, the very scaffolding upon which learning occurs, retains an echo of all it has processed. This is not a flaw, but an inherent property of any system accumulating experience. The challenge, then, is not to eliminate the past entirely, but to manage its influence, to ensure it does not unduly constrain future adaptation.
Architecture without history is fragile and ephemeral. Complete erasure may prove not only impossible but undesirable. The field must begin to explore methods for controlled retention: curating the past rather than attempting to obliterate it. Perhaps the true measure of success will not be how well a model forgets, but how gracefully it ages.
Original article: https://arxiv.org/pdf/2604.08271.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-12 09:48