Author: Denis Avetisyan
A new method efficiently extracts unbiased subnetworks from pre-trained models, offering a powerful approach to mitigate algorithmic bias without the need for new data or extensive retraining.

This paper introduces BISE, a pruning-based technique to identify and isolate unbiased representations within vanilla neural networks.
Despite growing concerns about algorithmic bias in deep learning, mitigating these issues often requires substantial data manipulation or model retraining. This work, ‘Bias In, Bias Out? Finding Unbiased Subnetworks in Vanilla Models’, introduces a novel approach to extract bias-agnostic subnetworks directly from conventionally trained neural networks – without requiring additional data or parameter updates. The proposed Bias-Invariant Subnetwork Extraction (BISE) method demonstrates that such “bias-free” subnetworks can be identified through pruning, maintaining robust performance while reducing reliance on biased features. Could this structural adaptation of pre-trained models offer a more efficient path towards fairer and more reliable AI systems?
The Inevitable Drift: Recognizing Bias in Artificial Systems
Even with remarkable progress in artificial intelligence, the potential for algorithmic bias continues to pose a significant threat to equitable outcomes across numerous applications. These biases, often stemming from skewed or incomplete training data, can perpetuate and even amplify existing societal inequalities, leading to discriminatory results in areas like loan applications, hiring processes, and even criminal justice. The consequences extend beyond simple unfairness; biased algorithms can systematically disadvantage certain demographic groups, hindering opportunities and reinforcing prejudiced patterns. While developers strive for objectivity, the inherent complexity of machine learning models means that biases can be subtle, difficult to detect, and challenging to mitigate, demanding ongoing scrutiny and proactive intervention to ensure responsible AI deployment.
Many contemporary artificial intelligence systems, even those leveraging sophisticated architectures like ResNet18 and BERT, are susceptible to a phenomenon known as shortcut learning, where models identify and exploit unintended correlations within training data rather than learning the underlying concepts. This means a model might, for example, associate a specific background color with a particular object in images, rather than recognizing the object itself – a spurious correlation. Consequently, when presented with data differing even slightly from its training set – perhaps the same object against a different background – performance can dramatically degrade, and pre-existing societal biases embedded within the data are often amplified. This reliance on superficial features, rather than robust generalization, leads to algorithmic bias and unfair or discriminatory outcomes, highlighting a critical limitation of purely data-driven approaches to artificial intelligence.
Growing international concern over the ethical and societal impacts of artificial intelligence has culminated in landmark legislation like the European Union’s AI Act. This pioneering regulatory framework directly addresses the pervasive threat of algorithmic bias, establishing legally binding requirements for developers and deployers of AI systems. The Act categorizes AI applications based on risk, with high-risk systems – those impacting fundamental rights, health, and safety – subject to stringent evaluation and ongoing monitoring for bias. These requirements include robust data governance practices, transparency regarding training data and model limitations, and demonstrable mitigation strategies to ensure fairness and prevent discriminatory outcomes. The EU AI Act signals a global shift toward proactive regulation of AI, pushing developers to prioritize ethical considerations and accountability alongside performance metrics, and setting a precedent for similar legislation worldwide.

Subtractive Systems: Isolating Bias-Invariant Subnetworks
Bias-Invariant Subnetwork Extraction (BISE) presents a method for reducing bias in neural networks that circumvents the need for carefully curated, unbiased training data, a common limitation of existing debiasing techniques. Unlike approaches that require retraining with balanced datasets, BISE operates on already trained models – termed vanilla trained models – and identifies subnetworks exhibiting reduced reliance on biased features through a process of structured pruning. This allows for the extraction of a debiased model directly from a potentially biased, pre-existing network, offering a post-hoc solution to address bias without the expense or difficulty of data re-collection or model re-training. The technique focuses on isolating subnetworks demonstrating minimal dependence on features correlated with known biases, effectively mitigating the impact of these features on model predictions.
Structured pruning forms the core of the BISE methodology by systematically removing connections within a pre-trained, vanilla trained model to isolate subnetworks exhibiting reduced reliance on biased features. This process isn’t random; instead, it targets connections based on their contribution to predictions when exposed to biased inputs. By evaluating the impact of removing specific connections – effectively creating different subnetworks – BISE identifies those that maintain performance across both biased and unbiased data. The ‘structured’ aspect refers to the removal of entire filters or channels, as opposed to individual weights, which promotes a more efficient and stable subnetwork. This targeted pruning allows BISE to pinpoint and extract subnetworks that generalize better by minimizing dependence on spurious correlations present in the biased training data.
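The channel-level removal described above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: the per-channel bias scores below are hypothetical stand-ins for the scores BISE derives, and the “layer” is just a toy weight matrix whose rows play the role of output channels.

```python
import numpy as np

def prune_channels(weight: np.ndarray, scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Structured pruning sketch: zero out whole output channels (rows of
    `weight`) with the highest bias scores, keeping the `keep_ratio`
    fraction with the lowest scores."""
    n_keep = max(1, int(round(keep_ratio * weight.shape[0])))
    keep = np.argsort(scores)[:n_keep]        # lowest-score channels survive
    mask = np.zeros(weight.shape[0], dtype=bool)
    mask[keep] = True
    pruned = weight.copy()
    pruned[~mask] = 0.0                       # remove whole channels, not single weights
    return pruned

# toy layer: 4 output channels, 3 inputs; hypothetical bias-dependence per channel
w = np.arange(12, dtype=float).reshape(4, 3)
scores = np.array([0.9, 0.1, 0.8, 0.2])
pruned = prune_channels(w, scores, keep_ratio=0.5)   # channels 1 and 3 survive
```

Zeroing entire rows, rather than scattered individual weights, is what makes the pruning “structured”: the surviving subnetwork keeps a regular shape that standard hardware can execute efficiently.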
BISE quantifies feature dependence on bias through the application of Mutual Information (MI). MI, expressed as I(X;Y), measures the statistical dependence between a given feature, X, and the presence of bias, Y. Higher MI values indicate a stronger correlation between the feature and the bias, signifying the feature’s reliance on biased attributes. During subnetwork extraction, BISE calculates MI for each connection within the vanilla trained model. Connections exhibiting low MI scores are prioritized, as they demonstrate minimal dependence on biased features. This prioritization ensures that the resulting BISE subnetwork relies on features that are more invariant to bias, effectively mitigating the impact of biased training data without requiring explicitly unbiased datasets.
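As a rough illustration of how I(X;Y) separates bias-dependent features from bias-invariant ones, the discrete estimator can be computed from co-occurrence counts. The synthetic features below are invented for the example – one tracks the bias attribute 95% of the time, the other is independent of it – and this plug-in estimator is only a sketch of the scoring idea, not BISE’s actual procedure.

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray) -> float:
    """Plug-in estimate of I(X;Y) in nats for two discrete variables."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.default_rng(0)
bias = rng.integers(0, 2, 1000)                       # hypothetical bias attribute
biased_feat = (bias ^ (rng.random(1000) < 0.05)).astype(int)  # tracks the bias 95% of the time
clean_feat = rng.integers(0, 2, 1000)                 # independent of the bias

hi_mi = mutual_information(biased_feat, bias)   # large: feature relies on the bias
lo_mi = mutual_information(clean_feat, bias)    # near zero: bias-invariant feature
```

Features (or connections) scoring like `lo_mi` are the ones a method such as BISE would prioritize keeping, while high scorers are candidates for removal.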
BISE achieves a reported accuracy of 96.1 ± 0.5% when evaluated on the BiasedMNIST dataset with a bias correlation coefficient (ρ) of 0.99. This represents a substantial performance gain compared to the original, or “vanilla,” model, which achieves only 10% accuracy on the same biased dataset. Furthermore, BISE outperforms the FFW method, which achieves an accuracy of 80.6% under the same conditions, demonstrating BISE’s efficacy in extracting subnetworks that are more robust to biased features within the dataset.
BISE identifies biased connections by distinguishing between bias-aligned samples and bias-conflicting samples within the training data. Bias-aligned samples are instances where the ground truth label and the biased feature consistently correlate, reinforcing the network’s reliance on the bias. Conversely, bias-conflicting samples present instances where the ground truth label and the biased feature disagree. By analyzing the network’s behavior on these two distinct sample types, BISE can quantify the degree to which specific connections are driven by the bias rather than the true underlying patterns. This differentiation enables the targeted pruning of connections strongly activated by bias-aligned samples but weakly activated by bias-conflicting samples, effectively mitigating the influence of the biased feature without sacrificing performance on unbiased data.
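The split and the resulting per-unit score can be sketched on toy data. Everything here – the labels, the bias attribute, and the activation values – is invented for illustration; the score is simply the gap in mean activation between the two sample groups, one plausible proxy for the bias-dependence signal the text describes.

```python
import numpy as np

def split_by_bias(labels: np.ndarray, bias_attr: np.ndarray):
    """Bias-aligned: label agrees with the bias attribute; conflicting: it does not."""
    aligned = np.flatnonzero(labels == bias_attr)
    conflicting = np.flatnonzero(labels != bias_attr)
    return aligned, conflicting

def bias_score(acts: np.ndarray, aligned: np.ndarray, conflicting: np.ndarray):
    """High score: the unit fires on aligned samples but not conflicting ones,
    i.e. it is driven by the bias rather than the true pattern."""
    return acts[aligned].mean(axis=0) - acts[conflicting].mean(axis=0)

labels    = np.array([0, 0, 1, 1, 0, 1])
bias_attr = np.array([0, 1, 1, 0, 0, 1])   # e.g. digit colour in BiasedMNIST
# two units: unit 0 tracks the bias, unit 1 fires uniformly on everything
acts = np.array([[1.0, 0.5], [0.1, 0.5], [0.9, 0.5],
                 [0.0, 0.5], [1.0, 0.5], [0.8, 0.5]])

aligned, conflicting = split_by_bias(labels, bias_attr)
scores = bias_score(acts, aligned, conflicting)   # unit 0 scores far higher
```

A unit like unit 0, activated strongly by aligned samples and weakly by conflicting ones, is exactly the kind of connection the targeted pruning would remove.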

The Persistence of Pattern: Leveraging Balanced Datasets
Several techniques, including FFW, Learning from Failure (LfF), and SoftCon, operate on the principle that neural networks develop biased subnetworks when trained on imbalanced datasets. These methods identify and isolate less-biased subnetworks through training on specifically constructed bias-balanced datasets, which mitigate the influence of confounding features. The core mechanism involves either filtering weights or layers based on feature importance or confidence scores derived during training with the balanced data. This process effectively prioritizes network components that demonstrate minimal correlation with the biasing attribute, leading to models that exhibit reduced discriminatory behavior while maintaining acceptable performance on the primary task.
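One common way to construct such a bias-balanced dataset is to subsample every (label, bias attribute) group down to the size of the smallest group, so no spurious pairing dominates the gradient updates. The sketch below is a minimal version of that idea with synthetic group sizes; the real methods above use more sophisticated weighting schemes.

```python
import numpy as np

def bias_balanced_indices(labels, bias_attr, rng):
    """Subsample so every (label, bias attribute) group has equal size."""
    groups = {}
    for i, (y, b) in enumerate(zip(labels, bias_attr)):
        groups.setdefault((y, b), []).append(i)
    m = min(len(v) for v in groups.values())          # smallest group size
    picked = [i for v in groups.values()
              for i in rng.choice(v, m, replace=False)]
    return np.sort(np.array(picked))

# skewed toy data: the bias attribute almost always matches the label
labels = np.array([0] * 8 + [1] * 8)
bias   = np.array([0] * 7 + [1] + [1] * 7 + [0])
idx = bias_balanced_indices(labels, bias, np.random.default_rng(0))
# one sample survives per (label, bias) group, so the pairing carries no signal
```

The cost of this simplicity is clear from the example: balancing by subsampling discards most of the bias-aligned data, which is one reason extraction-based approaches like BISE, which need no such dataset, are attractive.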
Unlike bias mitigation techniques applied post-training, methods utilizing bias-balanced datasets – such as FFW, LfF, and SoftCon – proactively address bias during the model training process. These approaches aim to prevent the model from learning spurious correlations between sensitive attributes and the target variable by ensuring a more equitable representation of data during weight updates. This differs from techniques like BISE, which primarily focus on correcting biases after a model has already been trained; the bias-balanced dataset strategies offer a complementary approach by attempting to construct inherently less biased models from the outset, potentially leading to more robust and generalizable performance.
The BISE methodology demonstrates sustained performance gains compared to baseline models even when the training data is almost perfectly confounded. Specifically, evaluations conducted on the BiasedMNIST dataset, with a bias correlation coefficient of ρ = 0.99, indicate that BISE consistently outperforms the original, unmodified model. This resilience under near-total correlation between labels and the biased attribute highlights the robustness of the BISE approach, suggesting it is less susceptible to the spurious correlations that dominate such data.
Evaluations conducted on the CivilComments dataset demonstrate that the BISE methodology achieves state-of-the-art performance in bias mitigation. Specifically, BISE outperforms existing debiasing techniques when assessed using standard metrics for both accuracy and disparity reduction. These results indicate BISE’s effectiveness in learning representations that are both predictive and fair, establishing it as a competitive solution within the field of responsible AI. The observed performance gains were statistically significant across multiple evaluation settings, further validating BISE’s robustness and generalizability beyond synthetic datasets like BiasedMNIST.
Evaluation of the pruned network on the BiasedMNIST dataset demonstrates a statistically significant reduction in color prediction accuracy. This decrease in accuracy serves as direct evidence that the pruning process successfully removed features correlated with the introduced bias. Specifically, the network’s ability to predict color – a non-essential attribute for digit recognition – was diminished, indicating that the pruned subnetworks rely less on bias-related cues for classification. This confirms the efficacy of the bias mitigation strategy in identifying and eliminating features contributing to unfair or inaccurate predictions.

Towards Robust Systems: Implications and Future Directions
The convergence of Bias-Invariant Subnetwork Extraction (BISE) with data-centric AI strategies represents a fundamental change in how artificial intelligence systems are developed and deployed. Traditionally, AI improvement focused almost exclusively on algorithmic refinement; however, this new paradigm prioritizes a holistic view encompassing both model architecture and the quality of the training data itself. BISE, by extracting sparse subnetworks from trained models, doesn’t merely shrink model size, but also enhances interpretability, making it easier to pinpoint sources of bias. Coupled with data-centric approaches – which emphasize rigorous data curation, augmentation, and bias detection within datasets – this combination fosters a level of transparency previously unattainable. The result is a move away from ‘black box’ AI towards systems where developers can proactively understand, mitigate, and account for potential biases, ultimately leading to more trustworthy and equitable outcomes.
Bias-Invariant Subnetwork Extraction (BISE) introduces a compelling advantage in artificial intelligence model development: a tunable balance between computational efficiency and predictive power. Unlike traditional methods that often prioritize accuracy at the cost of substantial model size, BISE leverages variable sparsity control to strategically prune connections within a neural network. This process doesn’t simply reduce parameters indiscriminately; instead, it identifies and retains the most critical connections while eliminating redundancies, allowing for the creation of leaner models without significant performance degradation. Consequently, BISE-extracted subnetworks are not only faster to deploy and require less memory, but also exhibit improved generalization capabilities, making them particularly well-suited for resource-constrained environments and real-world applications where efficiency is paramount. The ability to actively manage sparsity represents a crucial step towards democratizing AI by enabling its implementation on a wider range of hardware and platforms.
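The sparsity knob described above can be illustrated with plain magnitude pruning – the simplest stand-in for a tunable sparsity control, and deliberately not BISE’s MI-based criterion. Turning the knob trades parameters away monotonically, which is the efficiency/capacity dial the paragraph refers to.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of smallest-magnitude weights."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))                      # toy 64-weight layer
remaining = [np.count_nonzero(magnitude_prune(w, s))
             for s in (0.25, 0.5, 0.75)]
# fewer weights survive as the sparsity setting increases
```

In practice the interesting question is not the count of surviving weights but which ones survive; swapping the magnitude criterion for a bias-dependence score is precisely what distinguishes a debiasing extraction from ordinary compression.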
Modern artificial intelligence development increasingly prioritizes tools that enable developers to anticipate and mitigate bias directly within models. Techniques focusing on data-centric AI and methods like BISE aren’t simply reactive fixes applied after deployment; instead, they furnish a proactive framework for building inherently fairer systems. By allowing for careful control over model complexity and emphasizing data quality, these approaches encourage generalization across varied demographic groups. This shift means models are less likely to perpetuate existing societal biases, leading to more equitable outcomes and fostering greater trust in AI applications – ultimately promoting inclusivity and responsible innovation within the field.
The pursuit of genuinely intelligent artificial systems necessitates ongoing investigation into methods for mitigating and eliminating bias. While current debiasing techniques demonstrate promise, achieving robust and universally applicable solutions remains a significant challenge; biases can manifest subtly within datasets and algorithms, leading to unfair or discriminatory outcomes. Future research must prioritize the development of techniques that not only address existing biases but also prevent their introduction during model training and deployment. This includes exploring novel algorithmic approaches, enhancing data diversity and representation, and establishing standardized evaluation metrics for fairness. Successfully navigating this complex landscape is crucial not only for realizing the full potential of AI across diverse applications, but also for ensuring that these powerful technologies benefit all of humanity equitably and responsibly.
The pursuit of streamlined, efficient systems, as demonstrated by BISE’s extraction of unbiased subnetworks, inevitably introduces a form of simplification. This echoes a fundamental truth about complex systems: any reduction in parameters, while enhancing performance and mitigating bias, carries a future cost. As Vinton Cerf observed, “The Internet treats everyone the same, and that’s a beautiful thing.” However, this ‘sameness’ necessitates constant vigilance against emergent issues, much like BISE’s approach to bias. The method’s effectiveness in isolating unbiased subnetworks from vanilla models, without requiring retraining, highlights the enduring challenge of balancing efficiency with robustness – a principle deeply ingrained in the architecture of any lasting system. Technical debt, in this context, isn’t simply a flaw, but the system’s memory of past compromises.
What’s Next?
The extraction of ostensibly unbiased subnetworks, as demonstrated by BISE, represents not a solution, but a calculated postponement. Algorithmic bias isn’t erased; it’s localized, concentrated within the discarded parameters. The system doesn’t learn fairness; it sheds its impurities, much as any aging structure sheds unusable material. The critical question isn’t whether a smaller network exhibits less overt prejudice, but how quickly the discarded elements will manifest errors elsewhere – in adjacent systems, or in future iterations.
Future work must address the inevitable entropy of these ‘debiased’ subnetworks. Maintaining fairness isn’t a static property, but a continuous process of monitoring and recalibration. The method currently operates on a post-hoc basis; true progress lies in understanding how bias accumulates during training, and designing architectures that gracefully accommodate – or even utilize – such imperfections.
Ultimately, the pursuit of unbiased AI resembles the pursuit of perfect materials. It’s a compelling, yet ultimately futile endeavor. The real engineering challenge isn’t to eliminate flaws, but to anticipate them, to build systems resilient enough to function even as they decay, and to recognize that every ‘fix’ introduces new potential points of failure.
Original article: https://arxiv.org/pdf/2603.05582.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-10 00:41