Author: Denis Avetisyan
New research reveals that machine unlearning methods are vulnerable to attacks that can expose sensitive training data, even after it’s been ‘forgotten’.

Attackers can infer leaked labels by analyzing model parameters or inverting model predictions, bypassing the intended privacy protections of machine unlearning.
Despite growing concern for data privacy and the implementation of machine unlearning techniques, a fundamental vulnerability remains: the potential leakage of forgotten data labels. This paper, ‘Label Leakage Attacks in Machine Unlearning: A Parameter and Inversion-Based Approach’, investigates this threat by demonstrating how attackers can effectively infer the categories of data removed from a model through analysis of either model parameters or reconstructed data samples. Specifically, we introduce both parameter-based and model inversion-based attacks, leveraging techniques like k-means clustering and gradient optimization to expose these vulnerabilities, even with limited attacker knowledge. Given these findings, can truly privacy-preserving machine unlearning be achieved, and what novel defense mechanisms are required to mitigate these label leakage risks?
The Ghosts in the Machine: Why Forgetting is Hard
The proliferation of machine learning extends far beyond simple recommendations; models now underpin critical infrastructure in healthcare, finance, and even criminal justice. This widespread adoption, however, coincides with growing concerns regarding data privacy. Sensitive personal information is routinely used to train these algorithms, creating potential vulnerabilities should data be compromised or misused. Unlike traditional software where data is stored and managed with defined access controls, machine learning models learn from data, embedding patterns and potentially revealing individual attributes within their parameters. Consequently, a data breach doesn’t simply expose stored records; it risks exposing the very knowledge the model has acquired, demanding a new paradigm for data handling and model security that addresses the unique risks posed by learned information.
The escalating volume of data used to train modern machine learning models presents a significant challenge when modifications are required – particularly when adhering to evolving data privacy regulations like the ‘right to be forgotten’. Completely retraining a model from scratch with each data update, or upon individual request for data removal, proves computationally prohibitive and time-consuming, demanding vast resources and hindering real-time responsiveness. This traditional approach becomes especially impractical in dynamic environments where data is continuously streaming and models must adapt swiftly. The sheer scale of contemporary datasets and model complexity means that even incremental retraining can be extraordinarily expensive, making it untenable for many applications and creating a pressing need for more efficient alternatives that can selectively address data alterations without necessitating a complete overhaul of the learning process.
The proliferation of machine learning into areas handling personal and sensitive data necessitates the development of machine unlearning – techniques enabling the selective removal of an individual’s influence from a trained model. Unlike traditional retraining, which rebuilds a model from scratch, unlearning aims to efficiently ‘forget’ specific data points without incurring the substantial computational costs associated with complete model updates. This capability is not merely a performance optimization; it’s becoming a legal requirement, driven by regulations like the ‘right to be forgotten’ and increasing data privacy concerns. Effective machine unlearning promises a pathway to reconcile data-driven innovation with individual rights, offering a means to maintain model utility while respecting user autonomy and adhering to evolving data governance standards.
Current machine unlearning techniques, while attempting to erase the influence of specific data points, frequently introduce unacceptable trade-offs. Many methods demonstrably degrade the overall performance and accuracy of the machine learning model after data removal, rendering them impractical for real-world applications requiring sustained reliability. Perhaps more concerning is their susceptibility to privacy breaches; recent studies reveal that adversaries can reconstruct sensitive information from models even after unlearning has been applied. Data inversion techniques, specifically, have achieved alarmingly high Attack Success Rates – exceeding 70% in controlled experiments – indicating a significant vulnerability and emphasizing the critical need for the development of robust, privacy-preserving unlearning methodologies that truly safeguard data while maintaining model integrity.

Dissecting the Mechanisms of Oblivion
Fine-tuning and full retraining of machine learning models represent established unlearning techniques, but both demand significant computational resources proportional to the model's size and the volume of data being modified or removed; this is due to the necessity of updating all model weights through repeated gradient calculations. Conversely, random labeling, a more efficient approach, achieves unlearning by reassigning the data to be forgotten to randomly chosen labels and briefly fine-tuning on them; while computationally inexpensive, this method demonstrably degrades model accuracy, as the deliberately corrupted labels drive inaccurate weight adjustments and reduce generalization performance. The trade-off between computational cost and model utility is therefore a key consideration when selecting an unlearning strategy.
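As a toy illustration (all names, values, and hyperparameters here are invented, not the paper's implementation), random labeling can be sketched with a one-parameter logistic model: keep the forget-set inputs, replace their labels with random ones, and briefly fine-tune.

```python
import math
import random

# Toy sketch of random-labeling unlearning: the forget set keeps its inputs
# but receives fresh random labels, then the model is briefly fine-tuned.
# The model is a one-feature logistic regressor; everything is illustrative.

def predict(w, b, x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def sgd_step(w, b, x, y, lr=0.1):
    p = predict(w, b, x)
    g = p - y                      # dLoss/dlogit for binary cross-entropy
    return w - lr * g * x, b - lr * g

random.seed(0)
w, b = 0.0, 0.0
train = [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 0)]
for _ in range(200):               # ordinary training
    for x, y in train:
        w, b = sgd_step(w, b, x, y)

forget = [(2.0, 1)]                # examples whose labels must be "forgotten"
relabeled = [(x, random.randint(0, 1)) for x, _ in forget]
for _ in range(50):                # cheap fine-tune on the randomized labels
    for x, y in relabeled:
        w, b = sgd_step(w, b, x, y)
```

The randomized labels scramble what the model memorized about the forget set, which is precisely why overall accuracy can suffer on related inputs.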
Amnesiac unlearning operates by directly subtracting the contributions of specific data points from the model’s parameters, offering a more precise removal of information compared to methods like fine-tuning. This is achieved by identifying the influence of each training example on the model weights during the initial training phase and then reversing that influence during the unlearning process. However, successful implementation necessitates careful parameter management, including tracking per-example gradients and storing sufficient information to accurately reconstruct and subtract individual data contributions without impacting the performance on remaining data. Failure to properly manage these parameters can lead to instability, reduced model accuracy, or incomplete data removal.
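The bookkeeping can be sketched with a one-parameter squared-error model (all values invented): record each batch's parameter update during training, then undo the recorded updates of any batch containing the forgotten example. Note that subtracting a recorded update is exact arithmetic on the stored deltas, but it only approximates full retraining, since later updates were computed on weights shaped by earlier ones.

```python
# Toy sketch of amnesiac unlearning: store per-batch weight deltas during
# training, then subtract the deltas of batches holding the forgotten data.
# Model: scalar w with squared-error loss (w*x - y)^2; values illustrative.

def grad(w, x, y):
    return 2 * (w * x - y) * x     # dL/dw for squared error

lr = 0.05
w = 0.0
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(1.5, 3.0)]]   # batch -> [(x, y)]
deltas = []                                            # per-batch updates
for batch in batches:
    delta = 0.0
    for x, y in batch:
        delta += -lr * grad(w + delta, x, y)
    deltas.append(delta)
    w += delta

# "Forget" every batch containing x == 2.0 by undoing its recorded update.
for batch, delta in zip(batches, deltas):
    if any(x == 2.0 for x, _ in batch):
        w -= delta
```

The storage cost of the per-batch deltas is the "careful parameter management" the text refers to: without them, the contribution of a given batch cannot be reconstructed later.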
Negative gradient descent operates by adjusting model weights along the gradient of the loss function computed on a specific data point (the reverse of the usual descent update), thereby increasing the loss on that point and minimizing its influence on the model's output. While this allows for targeted removal of information, unconstrained application can lead to catastrophic forgetting, where the model loses previously learned knowledge due to significant weight alterations. Mitigation strategies involve implementing constraints such as weight clipping, regularization terms, or limiting the step size of the update to prevent substantial deviations from the existing weight values and preserve generalizable patterns.
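A minimal sketch of the update, on a toy squared-error model with invented constants: reverse the descent direction on the forgotten example and clip each step to limit collateral drift.

```python
# Toy sketch of negative-gradient unlearning: ascend the loss on the
# forgotten example (the reverse of descent), clipping each step so the
# weights cannot drift arbitrarily far. All values are illustrative.

def grad(w, x, y):
    return 2 * (w * x - y) * x     # dL/dw for squared error (w*x - y)^2

w = 1.9                            # trained toy model y ~ w * x
forget_x, forget_y = 1.0, 2.0      # the example whose influence is removed
lr, clip = 0.1, 0.5

for _ in range(5):
    step = lr * grad(w, forget_x, forget_y)   # ascent: add the gradient
    step = max(-clip, min(clip, step))        # bound each step
    w += step
```

Without the clip (or an equivalent regularizer), repeated ascent steps can push the weights far enough to destroy performance on the retained data, which is the catastrophic-forgetting failure mode described above.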
Current unlearning methods, encompassing techniques like fine-tuning, amnesiac unlearning, and negative gradient descent, each present inherent trade-offs between computational efficiency, retained model accuracy, and the level of privacy protection afforded. Despite advancements, all evaluated methods demonstrate significant vulnerability to data inversion attacks, consistently achieving Attack Success Rates (ASR) exceeding 70%. This indicates that even after applying unlearning procedures, sensitive training data can still be reconstructed from the modified model parameters with a high degree of probability, posing a substantial risk to data privacy and necessitating further research into more robust unlearning strategies.

Tracing the Echoes of Lost Data
Methods for identifying forgotten classes, those for which a model’s predictive performance has degraded, utilize several quantifiable criteria. Threshold-based comparisons assess prediction probabilities; a class is flagged as forgotten if the maximum predicted probability for any instance falls below a predetermined threshold. Entropy measurements offer an alternative, calculating the uncertainty associated with predictions for each class – higher entropy suggests the model is less confident and potentially indicates a forgotten class. These methods operate on the principle that confidently incorrect predictions, or consistently uncertain predictions, signal a loss of learned representation for a particular class. The specific thresholds or entropy values used are often determined empirically through validation datasets and performance trade-offs.
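The two criteria can be sketched directly (the probability vectors and thresholds below are made up for illustration): a class is flagged when its maximum predicted probability falls below a confidence threshold, or when the prediction entropy rises above an entropy threshold.

```python
import math

# Toy sketch of forgotten-class identification. A class is flagged when the
# model's mean prediction on its samples is either low-confidence (max prob
# below a threshold) or high-uncertainty (entropy above a threshold).

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

# per-class mean prediction vectors on held-out samples of that class
class_probs = {
    "cat": [0.90, 0.05, 0.05],    # confident -> retained
    "dog": [0.40, 0.35, 0.25],    # diffuse   -> likely forgotten
}
conf_thresh, ent_thresh = 0.6, 1.0

forgotten = [c for c, p in class_probs.items()
             if max(p) < conf_thresh or entropy(p) > ent_thresh]
```

As the text notes, the thresholds themselves are tuned empirically on validation data rather than fixed in advance.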
Analysis of model parameter changes following data point removal provides insights into data influence. Specifically, calculating the difference in parameter values before and after retraining without a given data point reveals the magnitude of that point's impact; larger differences suggest greater influence. Furthermore, the cosine (normalized dot-product) similarity between the original model parameters and those retrained without a specific data point offers a bounded measure of influence, ranging from -1 to 1: values close to 1 indicate the parameters barely changed, and hence that the point had little influence, while smaller or negative values indicate the point substantially shaped, or negatively correlated with, the model's learned representation. These methods operate on the principle that data points with significant influence will induce larger alterations in the model's parameters during the retraining process.
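As a toy illustration of this parameter-based signal (the weight values are invented), per-coordinate differences locate the parameters most tied to the removed data, while the normalized dot product (cosine) summarizes overall drift:

```python
import math

# Toy sketch of the parameter-based signal: compare weight vectors before
# and after unlearning. Large per-coordinate differences point to parameters
# shaped by the removed data; cosine similarity summarizes overall change.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

w_before = [0.8, -0.2, 1.5, 0.1]        # original parameters (toy values)
w_after  = [0.8, -0.2, 0.3, 0.1]        # after unlearning one class

diffs = [abs(a - b) for a, b in zip(w_before, w_after)]
most_changed = diffs.index(max(diffs))  # parameter most tied to removed data
sim = cosine(w_before, w_after)         # near 1.0 -> little overall change
```

An attacker who can difference two model snapshots in this way needs no access to the training data itself, which is what makes the parameter-based attack practical.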
The Youden index, defined as J = TPR - FPR (where TPR is the True Positive Rate and FPR is the False Positive Rate) and maximized over candidate decision thresholds, serves as a single statistic to evaluate the performance of methods designed to identify "forgotten" or influential training data. Unlike metrics focused solely on precision or recall, the Youden index balances sensitivity and specificity, providing a more comprehensive assessment of a method's ability to correctly identify influential data points while minimizing false positives. A higher Youden index indicates a better ability to distinguish between data points that significantly impacted model training and those that did not, offering a robust measure for comparing the effectiveness of different identification techniques across varying datasets and model architectures. This metric is particularly valuable when assessing the reliability of identification methods prior to applying unlearning techniques, as accurate identification is crucial for preventing unintended harm to model performance.
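The threshold-selection step can be sketched as a direct search for the J-maximizing cutoff (the scores, labels, and candidate thresholds below are invented):

```python
# Toy sketch of Youden-index threshold selection: for each candidate
# threshold, compute TPR and FPR of the "forgotten" detector and keep the
# threshold maximizing J = TPR - FPR.

def youden_best(scores, labels, thresholds):
    best_t, best_j = None, -1.0
    pos = sum(labels)
    neg = len(labels) - pos
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # detector scores per class
labels = [1,   1,   0,   1,   0,   0]      # 1 = class truly forgotten
t, j = youden_best(scores, labels, [0.2, 0.5, 0.75])
```

A perfect detector reaches J = 1, while a random one hovers near J = 0, which is why J is a convenient single number for comparing identification methods.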
While precise identification of forgotten classes is crucial for effective unlearning – ensuring only the influence of target data is removed without degrading general model performance – it is not a complete security solution. Research demonstrates that, even with accurate identification of affected parameters, model parameter-based attacks can still achieve Attack Success Rates (ASR) exceeding 90% in specific circumstances. These attacks exploit vulnerabilities in the model’s parameters themselves, circumventing the protections offered by accurate identification and successful unlearning of the initial data influence. This indicates a need for complementary defense mechanisms beyond accurate identification and unlearning to fully mitigate data extraction risks.

The Persistence of Memory: Benchmarking Against Reality
The efficacy of unlearning methodologies and techniques designed to identify forgotten classes underwent rigorous testing across a suite of established benchmark datasets. Researchers utilized commonly employed image datasets: MNIST, a collection of handwritten digits; FashionMNIST, mirroring the structure of MNIST but featuring apparel items; CIFAR10, comprising labeled color images of common objects; and SVHN, consisting of real-world digit images cropped from house numbers. This selection provided a standardized basis for comparison and allowed for a comprehensive assessment of how well these methods perform on diverse data distributions and complexities, ensuring the robustness and generalizability of the findings regarding data privacy and model security.
To rigorously evaluate the vulnerability of machine learning models to data reconstruction following unlearning, experiments utilized both LeNet and ResNet18 architectures. This deliberate choice in model complexity allowed researchers to assess whether the effectiveness of attacks varied with network depth and the number of parameters. LeNet, a relatively shallow network, provided a baseline for performance, while the deeper ResNet18 offered insights into how more sophisticated models fared against similar threats. By comparing results across these two architectures, the study demonstrated that even complex models remain susceptible to data inversion, highlighting a consistent vulnerability regardless of network size and suggesting that architectural improvements alone may not fully mitigate the risk of forgotten data exposure.
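A minimal model-inversion step can be sketched against a toy two-class softmax classifier: starting from a blank input, gradient-ascend the target class's log-probability to recover a class prototype. The weights, step size, and regularizer here are illustrative assumptions, not values from the paper, and real attacks on LeNet or ResNet18 operate on images rather than two features.

```python
import math

# Toy sketch of model inversion: optimize an input so that a fixed softmax
# classifier assigns it to the target class with high confidence, yielding
# a prototype of that (possibly unlearned) class. All values illustrative.

W = [[2.0, -1.0], [-1.0, 2.0]]          # one weight row per class

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def invert(target, steps=200, lr=0.1, reg=0.01):
    x = [0.0, 0.0]                      # start from a blank input
    for _ in range(steps):
        p = softmax([sum(wc * xi for wc, xi in zip(row, x)) for row in W])
        for i in range(len(x)):
            # d log p[target] / dx_i = W[target][i] - sum_c p[c] * W[c][i]
            g = W[target][i] - sum(p[c] * W[c][i] for c in range(len(W)))
            x[i] += lr * (g - reg * x[i])   # ascend with weight decay
    return x

proto = invert(0)
```

The recovered prototype leaks what the model still encodes about the target class; if a supposedly unlearned class can be inverted this way, the unlearning has failed in a directly observable sense.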
The efficacy of unlearning techniques was rigorously evaluated through attack success rate (ASR), a metric quantifying the potential for malicious reconstruction or inference of previously forgotten data. Investigations revealed that model parameter-based attacks, which target the learned weights of a neural network, can achieve remarkably high ASRs – exceeding 90% under certain conditions. This suggests that despite employing various unlearning algorithms, a significant risk remains that sensitive information, ostensibly removed from the model, can still be recovered by an adversary with access to the model’s parameters. The consistently high ASRs across different datasets and architectures underscores the need for more robust unlearning strategies that effectively mitigate this vulnerability and protect against data inversion attacks.
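The metric itself is simple: ASR is the fraction of attack trials in which the inferred label matches the true forgotten label. The trial outcomes below are invented for illustration.

```python
# Toy sketch of the Attack Success Rate (ASR) metric: the fraction of
# trials where the attacker's inferred label equals the true forgotten
# label. Trial data below is made up.

def attack_success_rate(predicted, actual):
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

inferred = [3, 7, 3, 1, 7, 3, 0, 3, 3, 7]   # labels inferred by attacker
truth    = [3, 7, 2, 1, 7, 3, 0, 5, 3, 7]   # true forgotten labels
asr = attack_success_rate(inferred, truth)  # 8 of 10 trials succeed
```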
Despite advancements in data unlearning and techniques designed to mitigate privacy risks, evaluations on benchmark datasets reveal a persistent vulnerability to data inversion attacks. These attacks, aiming to reconstruct or infer the presence of previously removed data, consistently achieve Attack Success Rates (ASR) ranging from 70 to 90 percent across diverse datasets like MNIST, FashionMNIST, CIFAR10, and SVHN. This high success rate holds true even when employing optimized unlearning algorithms, suggesting that current methods offer limited protection against determined adversaries seeking to exploit residual information embedded within model parameters. The consistency of these results underscores a critical need for developing more robust and privacy-preserving machine learning techniques that effectively eliminate the traces of forgotten data and safeguard sensitive information.

The research meticulously details how seemingly secure machine unlearning methods are susceptible to leakage attacks, revealing forgotten data through parameter analysis or model inversion. This echoes John von Neumann’s sentiment: “If you can’t break it, you don’t understand it.” The study doesn’t merely identify vulnerabilities; it actively breaks the assumed security of these systems, demonstrating that a comprehensive understanding necessitates rigorous testing and adversarial exploration. By successfully reconstructing labels from model outputs – or, conversely, from the model’s internal state – the work highlights the precarious nature of data privacy in machine learning, forcing a re-evaluation of existing unlearning protocols and a deeper appreciation for the system’s underlying weaknesses.
What’s Next?
The demonstrated susceptibility of machine unlearning to both parameter and inversion-based attacks suggests a fundamental disconnect between the intent of forgetting and the reality of model behavior. The work reveals that a model, even after diligent "unlearning," doesn't truly relinquish its memories; it merely rearranges them, leaving detectable traces for a sufficiently inquisitive adversary. One might posit that a bug is the system confessing its design sins, and here, the sin is the implicit assumption that erasure is equivalent to non-existence within a complex parameter space.
Future research must move beyond simply quantifying leakage; the focus should shift toward genuinely unlearning – developing algorithms that fundamentally alter model representations, rather than attempting to surgically remove data’s influence. The reliance on metrics like Youden’s Index, while useful, may prove insufficient to capture the nuances of information retention. A more fruitful avenue might involve exploring differential privacy techniques during training, embedding forgetfulness into the model’s very architecture.
Ultimately, the challenge isn’t just about hiding the past, but about accepting that every learned parameter is a testament to prior data. The pursuit of perfect unlearning may be a fool’s errand; a more realistic goal is to build models that are robust to information leakage, acknowledging that complete erasure is, perhaps, an unattainable ideal.
Original article: https://arxiv.org/pdf/2604.07386.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-12 08:14