Author: Denis Avetisyan
A new approach combines deep active learning with formal verification to generate targeted, diverse data for more efficient and robust model training.

This research demonstrates improved labeling efficiency and DNN robustness by augmenting active learning with adversarially generated examples verified using the Marabou tool.
Despite advances in deep learning, efficiently labeling data remains a significant bottleneck in model training. This challenge motivates research into deep active learning (DAL), and our work, ‘On Improving Deep Active Learning with Formal Verification’, investigates augmenting DAL with adversarial examples to enhance data efficiency. We demonstrate that adversarially perturbed inputs generated through formal verification, which guarantees a genuine violation of the robustness constraint, yield substantially greater performance gains than those produced by conventional gradient-based attacks. By applying this augmentation to existing DAL techniques and to a novel one, we achieve improved model generalization. Can these formally verified adversarial examples further unlock the potential of robust and sample-efficient deep learning?
The Fragility of Perception: Deep Networks and Adversarial Shadows
Despite achieving remarkable success in areas like image recognition and natural language processing, deep neural networks exhibit a surprising fragility. This vulnerability stems from their susceptibility to adversarial examples – subtly perturbed inputs, often imperceptible to humans, that consistently cause misclassification. These aren’t simply random errors; rather, they are carefully crafted inputs designed to exploit the decision boundaries learned by the network. A single pixel change, or a nearly inaudible alteration to an audio file, can reliably fool even the most accurate models. This poses a significant challenge, as these networks, while capable of high performance on standard datasets, lack the robustness expected in real-world applications where malicious actors might intentionally manipulate inputs to cause failures or compromise security. The existence of adversarial examples isn’t just a theoretical concern; it highlights a fundamental difference in how these networks ‘see’ the world compared to human perception.
The susceptibility of deep neural networks to adversarial examples presents a significant hurdle for their implementation in domains where reliability is paramount. Consider autonomous vehicles, where a subtly altered stop sign – imperceptible to a human – could be misinterpreted, leading to potentially catastrophic consequences. Similarly, in medical diagnosis, an adversarial perturbation to an X-ray image might cause a model to overlook a critical anomaly. This necessitates a shift beyond simply achieving high accuracy on standard datasets; instead, the focus must extend to verifying model behavior under a wide range of conditions, including those involving malicious inputs, and developing techniques to enhance robustness – the ability to maintain correct predictions even when faced with such challenges. Without rigorous verification and improved robustness, the deployment of these powerful models in safety-critical applications remains a considerable risk.
Conventional deep learning training prioritizes accuracy on clean data, inadvertently creating models susceptible to adversarial examples: inputs deliberately perturbed to cause misclassification. This stems from the models learning brittle, high-dimensional features rather than robust, semantically meaningful ones. Consequently, standard techniques like increasing training data or employing regularization often prove insufficient to bridge this robustness gap. Researchers are therefore actively exploring innovative strategies, including adversarial training – where models are trained with these deceptive inputs – and techniques that promote certified robustness, offering mathematical guarantees against certain types of attacks. These novel approaches aim to move beyond simply achieving high accuracy and instead focus on building models demonstrably reliable, even when confronted with malicious or unexpected inputs, which is crucial for real-world deployment in sensitive applications like autonomous driving and medical diagnosis.
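To make the contrast with the verification-based approach discussed later concrete, here is a minimal sketch of the gradient-based alternative, the Fast Gradient Sign Method (FGSM), which the experiments use as a baseline. The model, inputs, and step size are placeholders, not settings taken from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=0.1):
    """Minimal FGSM sketch: perturb the input along the sign of the loss gradient.

    Illustrative only; `model`, `x`, `y`, and `eps` are placeholders rather than
    details from the paper.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # loss w.r.t. the true label
    loss.backward()
    # One step of size eps in the direction that increases the loss,
    # then clamp back into the valid input range.
    x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0)
    return x_adv.detach()
```

Unlike verification-generated counterexamples, such a gradient step offers no guarantee that the perturbed input actually crosses the decision boundary.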

Strategic Sampling: The Efficiency of Active Learning
Training deep learning models typically requires vast datasets, but exhaustively labeling all possible inputs is computationally impractical. Deep Active Learning (DAL) addresses this limitation by strategically selecting a subset of unlabeled data points for annotation. Rather than random sampling, DAL algorithms prioritize instances expected to yield the greatest improvement in model performance. This is achieved by evaluating each candidate sample based on its potential to reduce model uncertainty or expose current weaknesses, effectively maximizing the information gained from each labeled example and significantly reducing the overall labeling effort required to achieve a target level of accuracy and robustness. The efficiency gains are particularly relevant in scenarios with limited labeling budgets or high data acquisition costs.
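As a point of reference, a generic pool-based DAL loop can be sketched as follows; the `train`, `score`, and `oracle` callables are placeholders for the model trainer, the acquisition function, and the human annotator, and are not names taken from the paper.

```python
import numpy as np

def active_learning_loop(X_pool, oracle, train, score, rounds=10, batch=100):
    """Generic pool-based active learning loop (illustrative sketch).

    train(X, y) -> model       fits a model on the currently labeled data
    score(model, X) -> array   informativeness of each unlabeled sample
    oracle(X) -> labels        stands in for the human annotator
    """
    labeled = np.random.choice(len(X_pool), size=batch, replace=False)  # random seed set
    model = None
    for _ in range(rounds):
        model = train(X_pool[labeled], oracle(X_pool[labeled]))
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        s = score(model, X_pool[unlabeled])              # higher = more informative
        query = unlabeled[np.argsort(-s)[:batch]]        # top-scoring candidates
        labeled = np.concatenate([labeled, query])       # send them for annotation
    return model
```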
Active learning strategies prioritize labeling data points located in proximity to the decision boundary, as these samples are inherently the most informative for model refinement. This focus stems from the principle that examples correctly classified with high confidence contribute less to learning than those the model struggles to categorize. By concentrating labeling efforts on these challenging, boundary-adjacent instances, the model receives targeted feedback that directly reduces classification error and accelerates convergence. This approach is demonstrably more efficient than random sampling, requiring fewer labeled examples to achieve a comparable level of performance, particularly in the context of robust model training against adversarial perturbations or distributional shifts. The reduction in labeling cost is a key benefit, as obtaining labels is often the most significant bottleneck in machine learning pipelines.
Uncertainty Sampling prioritizes labeling data points for which the model exhibits the lowest confidence in its prediction, effectively targeting areas where the model is most likely to err. Common metrics for quantifying uncertainty include prediction entropy and margin sampling. Diversity Considerations, used in conjunction with uncertainty metrics, address the limitation of selecting highly similar uncertain samples by encouraging the selection of points that are dissimilar to those already labeled; techniques such as k-center-greedy and core-set selection are employed to maximize coverage of the input space. Combining these approaches ensures that the training set includes both informative samples where the model is uncertain and a representative distribution of inputs, leading to more robust and generalizable models.
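The sketch below illustrates these two ingredients: entropy and margin scores as uncertainty measures, and a greedy k-center step over feature embeddings for diversity. It is a generic illustration of the standard techniques, not the selection code used in the paper.

```python
import numpy as np

def entropy_score(probs):
    """Prediction entropy per sample: higher means the model is less certain."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def margin_score(probs):
    """Negative gap between the top two class probabilities (small gap = uncertain)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return -(top2[:, 1] - top2[:, 0])

def k_center_greedy(features, labeled_idx, k):
    """Greedy k-center selection over feature embeddings for diversity."""
    dists = np.linalg.norm(
        features[:, None, :] - features[None, labeled_idx, :], axis=-1
    ).min(axis=1)                                   # distance to nearest labeled point
    chosen = []
    for _ in range(k):
        i = int(np.argmax(dists))                   # farthest point from the current set
        chosen.append(i)
        dists = np.minimum(dists, np.linalg.norm(features - features[i], axis=1))
    return chosen
```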
Training set augmentation improves model generalization and robustness by synthetically expanding the training data. This is achieved through the generation of novel examples, often utilizing techniques like adversarial training or applying various transformations to existing data points. By exposing the model to a wider range of inputs, including those near the decision boundary or representing potential adversarial perturbations, augmentation reduces overfitting and enhances the model’s ability to correctly classify unseen data. Specifically, models trained with augmented data demonstrate increased resistance to adversarial attacks, as they have already encountered and learned to correctly classify examples similar to those used in such attacks. The effectiveness of augmentation is dependent on the diversity and quality of the generated samples, and careful consideration must be given to avoid introducing biased or unrealistic data.
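In its simplest form, such augmentation amounts to training on the union of the clean data and the perturbed examples, as in the short sketch below; whether the perturbed inputs come from gradient attacks or from a formal verifier is orthogonal to this step.

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset

def augment_with_adversarial(clean_x, clean_y, adv_x, adv_y):
    """Mix perturbed examples into the training set (illustrative sketch).

    The perturbed inputs keep the labels of their clean counterparts, so the
    model is explicitly taught to classify them correctly.
    """
    clean = TensorDataset(clean_x, clean_y)
    perturbed = TensorDataset(adv_x, adv_y)
    return ConcatDataset([clean, perturbed])   # train on the union of both sets
```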
Formal Guarantees: Verifying Robustness with Mathematical Rigor
Formal verification, in the context of neural network robustness, employs mathematical techniques to definitively prove that a model will behave as expected within specified input ranges and under defined perturbations. Unlike empirical testing, which can only demonstrate performance on a finite set of inputs, formal verification aims to provide a guarantee of robustness. This is achieved by formulating the network’s behavior as a set of mathematical constraints and utilizing solvers – such as Satisfiability Modulo Theories (SMT) solvers – to determine whether those constraints are satisfied for all possible inputs within a defined domain. The strength of formal verification lies in its ability to certify robustness against $L_p$ norm-bounded perturbations, offering a quantifiable measure of a model’s resilience to adversarial attacks and ensuring predictable behavior in safety-critical applications.
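Concretely, the local robustness property such tools check is usually stated as follows (a standard formulation in the verification literature, not notation specific to this paper): for a classifier $f$, a reference input $x_0$, and a perturbation budget $\epsilon$,

$$\forall x:\;\; \|x - x_0\|_\infty \le \epsilon \;\Rightarrow\; \arg\max_i f_i(x) = \arg\max_i f_i(x_0)$$

The verifier searches for a satisfying assignment of the negated property; if one exists, that assignment is precisely a counterexample inside the $\epsilon$-ball.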
Marabou is a tool designed for the formal verification of neural networks, and a key capability is the generation of counterexamples. These counterexamples are specifically crafted inputs that cause the network to misclassify or produce an incorrect output, thereby demonstrating a vulnerability in the model’s decision boundary. The tool achieves this by posing a satisfiability query over the network’s constraints (Marabou builds on the SMT-based Reluplex approach) and determining whether a specified output behavior can be reached within a given set of input constraints. Generated counterexamples are not merely heuristic perturbations; they provide concrete evidence of model failure and can be used to guide refinement through techniques like adversarial training or network architecture modification, allowing developers to address weaknesses and improve robustness.
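A sketch of such a query through Marabou’s Python bindings (maraboupy) is shown below. It assumes an ONNX export of the network and a single flattened input; the exact call signatures and the structure of the value returned by `solve()` vary between Marabou versions, so treat this as an outline rather than working integration code.

```python
# Outline of a Marabou counterexample query inside an L-infinity ball.
# Assumes the maraboupy bindings; signatures differ across Marabou versions.
from maraboupy import Marabou

def find_counterexample(onnx_path, x0, eps, true_label, other_label):
    net = Marabou.read_onnx(onnx_path)
    in_vars = net.inputVars[0].flatten()
    out_vars = net.outputVars[0].flatten()
    for v, xi in zip(in_vars, x0.flatten()):
        net.setLowerBound(v, max(0.0, xi - eps))   # stay inside the perturbation ball
        net.setUpperBound(v, min(1.0, xi + eps))   # and inside the valid pixel range
    # Ask whether `other_label` can score at least as high as `true_label`:
    # out[true] - out[other] <= 0
    net.addInequality([out_vars[true_label], out_vars[other_label]], [1.0, -1.0], 0.0)
    return net.solve()   # a satisfying assignment, if any, is the counterexample
```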
FVAAL (Formal Verification-Assisted Active Learning) enhances active learning performance by integrating counterexamples generated through formal verification techniques. This method combines a margin-based selection strategy – prioritizing samples closest to the decision boundary – with data sourced from verification processes. Specifically, counterexamples, which demonstrate network vulnerabilities to defined perturbations, are added to the training set. This augmentation provides the model with challenging, yet valid, data points, effectively guiding the learning process and improving generalization capabilities. By leveraging verification-generated data, FVAAL aims to efficiently identify and address weaknesses in the neural network, resulting in improved accuracy with fewer labeled samples compared to traditional active learning or augmentation with solely gradient-based adversarial examples.
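A schematic view of one such round, combining margin-based selection with verifier-generated counterexamples, might look like the sketch below. This illustrates the idea only and is not the authors’ implementation; `verify(model, x)` stands in for a call to a formal verifier (such as the Marabou query above) that returns a counterexample near `x` or `None`, and the sklearn-style `predict_proba` interface is an assumption.

```python
import numpy as np

def verification_assisted_round(model, X_pool, oracle, verify, batch=100, n_adv=50):
    """One schematic verification-assisted active learning round (not the paper's code)."""
    probs = model.predict_proba(X_pool)                # sklearn-style API, an assumption
    top2 = np.sort(probs, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]                  # small margin = near the boundary
    query_idx = np.argsort(margins)[:batch]            # margin-based selection
    X_new, y_new = X_pool[query_idx], oracle(X_pool[query_idx])
    # Augment with verified counterexamples around a subset of the queried points,
    # keeping the oracle's label for each perturbed input.
    adv_x, adv_y = [], []
    for x, y in zip(X_new[:n_adv], y_new[:n_adv]):
        cex = verify(model, x)
        if cex is not None:
            adv_x.append(cex)
            adv_y.append(y)
    return X_new, y_new, np.asarray(adv_x), np.asarray(adv_y)
```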
Experimental results demonstrate that incorporating adversarial examples generated through formal verification into Deep Active Learning consistently enhances model performance. Specifically, this augmentation technique yields improved learning curves and achieves a higher Area Under the Budget Curve (AUBC) when evaluated across multiple datasets. Comparative analysis indicates that this approach outperforms standard adversarial training methods utilizing the Fast Gradient Sign Method (FGSM), as well as baseline Deep Active Learning models trained without any adversarial augmentation. The observed improvements in AUBC suggest that verification-generated examples provide more effective and informative training signals for active learning algorithms, leading to more robust and accurate models with fewer labeled samples.
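For readers unfamiliar with the metric, AUBC summarizes an entire learning curve in one number: the area under accuracy plotted against the labeling budget, normalized so that a learner with perfect accuracy at every budget scores 1. A minimal computation, assuming a simple trapezoidal normalization (the paper may normalize differently), looks like this:

```python
import numpy as np

def aubc(budgets, accuracies):
    """Area under the budget curve (accuracy vs. number of labeled samples)."""
    budgets = np.asarray(budgets, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    # Trapezoidal area under the learning curve, normalized by the budget range.
    area = np.sum(0.5 * (accuracies[1:] + accuracies[:-1]) * np.diff(budgets))
    return area / (budgets[-1] - budgets[0])
```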
Beyond Benchmarks: Charting a Course Towards Trustworthy AI
The efficacy of the developed active learning and formal verification techniques has been rigorously demonstrated across several established benchmark datasets. Performance evaluations on MNIST, fashionMNIST, and CIFAR-10 showcase the method’s adaptability to varying image complexities and dataset sizes. These datasets, commonly used for evaluating machine learning algorithms, provided a standardized platform to assess the ability of the proposed strategies to efficiently select informative samples for labeling and generate robust adversarial examples. Consistent positive results across these benchmarks suggest the potential for broader application to diverse computer vision tasks and highlight a promising pathway towards building more reliable and trustworthy deep learning models.
Recent research highlights the significant benefits of integrating adversarial query strategies into active learning frameworks. Two representative methods are DeepFool-based Active Learning (DFAL) and Batch Active learning by Diverse Gradient Embeddings (BADGE), both of which enhance performance by intelligently selecting data points for labeling. Rather than querying data at random, DFAL uses adversarial perturbations to find samples lying close to the decision boundary, while BADGE selects batches whose gradient embeddings are both uncertain and diverse. By strategically choosing queries that challenge the current model, these techniques accelerate the learning process and achieve higher accuracy with fewer labeled examples. This approach proves particularly valuable when labeled data is scarce or expensive to obtain, offering a pathway toward more efficient and robust deep learning systems.
Evaluations on the MNIST and fashionMNIST datasets reveal that the BADGE+FV-Adv methodology achieved state-of-the-art performance, as measured by the Area Under the Budget Curve (AUBC). This metric quantifies the model’s ability to improve its accuracy with a limited number of labeled examples, and BADGE+FV-Adv consistently outperformed competing active learning strategies. The superior AUBC scores indicate that this approach efficiently selects the most informative samples for labeling, leading to faster learning and enhanced model generalization. This result highlights the effectiveness of combining adversarial query strategies with formal verification techniques to optimize the active learning process and maximize performance within budgetary constraints.
Analysis reveals that the implemented formal verification technique consistently produces adversarial examples exhibiting greater diversity compared to those generated by the Fast Gradient Sign Method (FGSM). Specifically, these formally verified examples demonstrate a larger mean and standard deviation when analyzed within the feature space of the neural network. This suggests a broader exploration of the input space and a more comprehensive identification of potential vulnerabilities, as the adversarial perturbations are not limited to the most obvious or immediate directions. The increased diversity contributes to more robust stress-testing of the deep learning model, allowing for the discovery of a wider range of weaknesses and ultimately leading to improvements in its overall resilience against adversarial attacks.
Current investigations are directed towards extending the demonstrated capabilities to increasingly sophisticated neural network architectures and larger, more representative datasets. While initial successes have been achieved with benchmark examples, the true potential of this combined active learning and formal verification framework hinges on its adaptability to real-world complexities. Researchers are actively exploring new verification methodologies, moving beyond gradient-based approaches to encompass techniques that offer stronger guarantees of robustness and uncover a wider range of potential vulnerabilities. This includes investigating methods to efficiently verify networks with millions of parameters and datasets containing high-dimensional, noisy data, ultimately striving to build deep learning systems demonstrably resilient to adversarial manipulation and capable of reliable performance in critical applications.
The convergence of active learning and formal verification presents a pathway towards demonstrably reliable deep learning systems, critical for deployment in real-world applications. Active learning strategically selects the most informative data points for labeling, reducing the annotation burden and accelerating model training, while formal verification rigorously assesses model robustness by systematically generating and testing adversarial examples. This combined approach doesn’t merely predict performance, but provides guarantees about a model’s behavior within defined operational boundaries. By proactively identifying and mitigating vulnerabilities through verified robustness, these systems move beyond statistical correlations to offer a level of trustworthiness essential for safety-critical domains like autonomous vehicles, medical diagnosis, and financial modeling. The resulting models aren’t simply accurate on training data; they are demonstrably resilient to unforeseen inputs and malicious attacks, fostering confidence and enabling responsible innovation.
The pursuit of robust deep learning, as detailed in this exploration of active learning and formal verification, mirrors a fundamental truth about all complex systems. The generation of adversarial examples, used to stress-test the decision boundary of a DNN, isn’t simply about finding weaknesses, but about acknowledging the inevitable entropy inherent in any constructed order. As Donald Davies observed, “The best systems are those that anticipate their own decay.” This proactive anticipation, building resilience through rigorous testing and refinement, is crucial. The paper’s method attempts to gracefully manage that decay, improving labeling efficiency not by eliminating vulnerability, but by understanding and accounting for it within the model’s lifecycle. It’s a recognition that uptime, like temporal harmony, is a fleeting state, achievable only through continuous vigilance and adaptation.
The Long View
The pursuit of efficiency in deep active learning, as demonstrated by this work, inevitably confronts the limitations inherent in any system attempting to optimize itself. Generating adversarial examples through formal verification offers a means of stressing the decision boundary, revealing vulnerabilities, and prompting more informative labeling. Yet, this process merely delays the inevitable entropy. Systems learn to age gracefully, not by becoming impervious to change, but by adapting to it. The core question isn’t whether a model can be made ‘robust’ in the face of ever-evolving adversarial threats, but rather how it degrades, and whether that degradation is predictable.
Future investigations might benefit from shifting focus from simply maximizing labeling efficiency to characterizing the types of failures that arise during active learning. Understanding the geometry of these failures, the specific weaknesses exposed by particular adversarial strategies, could be more valuable than striving for universally ‘robust’ models. The search for an unassailable model is, perhaps, a misdirection.
Sometimes observing the process of decay, documenting the subtle shifts in the decision boundary over time, is better than trying to speed it up. The accumulation of knowledge about these processes, the mapping of failure modes, may prove more enduring than any temporary fix. The goal, then, isn’t to prevent the system from aging, but to learn from its decline.
Original article: https://arxiv.org/pdf/2512.14170.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/