Beyond ImageNet: Training AI to Spot Industrial Defects

Author: Denis Avetisyan


A new pretraining strategy focuses on learning feature representations specifically designed for the challenges of industrial anomaly detection.

Pretrained anomaly representations consistently elevate performance across diverse anomaly detection methods and network architectures, demonstrably surpassing the original backbone features.

Researchers demonstrate that pretraining models with anomaly-focused data significantly improves performance in identifying defects compared to relying on features learned from general image datasets.

While current industrial anomaly detection (AD) methods rely heavily on features pretrained with natural images, this approach overlooks the fundamental discrepancy between identifying everyday objects and discerning subtle anomalies. The work presented in ‘ADPretrain: Advancing Industrial Anomaly Detection via Anomaly Representation Pretraining’ addresses this limitation by introducing a novel framework for learning robust, AD-specific feature representations. Specifically, the authors demonstrate that contrastive learning, focused on maximizing the distinction between normal and anomalous features within a large industrial dataset, yields significant performance gains across multiple AD algorithms. Could this targeted pretraining approach unlock a new era of reliable and efficient anomaly detection in complex industrial settings?


Whispers in the Machine: The Limits of Conventional Detection

Traditional anomaly detection methods, including reconstruction-based approaches and DeepSVDD, struggle with complex data. They rely on assumptions that are often invalid in real-world scenarios, which hinders generalization and degrades performance on unseen anomalies or shifts in the normal data distribution. A primary challenge lies in their sensitivity to noise and their difficulty capturing subtle anomalies, leading to false positives and false negatives. This is particularly problematic in applications demanding high accuracy or automated processing of large datasets. The demand for robust, generalizable solutions is increasing, yet current methods often require substantial manual tuning. Perhaps anomaly detection isn’t about finding what doesn’t fit, but about convincing ourselves that everything else does.
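The one-class idea behind methods like DeepSVDD can be sketched in a few lines: summarize normal features by a single center and score test samples by their distance to it. The snippet below is a toy numpy illustration of that principle only, not the paper's method or DeepSVDD's actual training procedure; all names and the synthetic data are illustrative.

```python
import numpy as np

def fit_center(normal_feats: np.ndarray) -> np.ndarray:
    """DeepSVDD-style: summarize the 'normal' region by a single center."""
    return normal_feats.mean(axis=0)

def anomaly_score(feats: np.ndarray, center: np.ndarray) -> np.ndarray:
    """Score = squared distance to the center; larger means more anomalous."""
    return ((feats - center) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(200, 8))   # tight cluster of normal features
outlier = np.full((1, 8), 2.0)                 # a clearly shifted sample

c = fit_center(normal)
scores_normal = anomaly_score(normal, c)
score_outlier = anomaly_score(outlier, c)[0]
```

The fragility the section describes follows directly from this setup: a single center only separates anomalies that happen to lie far from it in the chosen feature space, which is exactly why the quality of that feature space matters.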

The framework learns anomaly detection representations by utilizing residual features—derived from subtracting normal reference features—and optimizes them through angle- and norm-oriented contrastive losses with a Transformer-based Feature Projector employing a learnable key/value attention mechanism.

Forging Discernment: The Power of Contrastive Learning

Contrastive Learning offers a promising pathway to learning discriminative feature representations by contrasting similar and dissimilar examples, amplifying distinctions. By maximizing the distance between normal and abnormal features, anomalies become easily identifiable, even with limited labeled data. Self-Supervised Learning leverages unlabeled data to pretrain these representations, enhancing performance in data-scarce scenarios. Utilizing techniques like masked autoencoding or contrastive predictive coding, models learn robust features from raw data without explicit labels, improving their ability to detect subtle anomalies.
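The core contrastive objective, pulling normal features together while pushing abnormal features away, can be illustrated with a minimal cosine-similarity hinge loss. This is a hedged numpy sketch of the general idea, not the paper's angle- and norm-oriented losses; the margin value and the synthetic clusters are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def contrastive_loss(normal: np.ndarray, abnormal: np.ndarray,
                     margin: float = 0.5) -> float:
    """Pull normal features together; push abnormal features below a
    cosine-similarity margin relative to the normals."""
    n = l2_normalize(normal)
    a = l2_normalize(abnormal)
    pos_loss = (1.0 - n @ n.T).mean()                      # normals mutually similar
    neg_loss = np.maximum(0.0, n @ a.T - margin).mean()    # hinge on negatives
    return float(pos_loss + neg_loss)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.05, (64, 16)) + np.eye(16)[0]  # cluster near +e1
separated = -normal                                        # opposite direction
mixed = normal + rng.normal(0.0, 0.05, (64, 16))           # overlapping cluster

loss_sep = contrastive_loss(normal, separated)   # well-separated: low loss
loss_mix = contrastive_loss(normal, mixed)       # overlapping: high loss
```

Minimizing such a loss drives the geometry the section describes: normal and abnormal features end up in well-separated regions, so a simple distance threshold suffices at test time.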

Anomaly score maps generated with PatchCore and CLIP-L demonstrate the effectiveness of pretrained features in enhancing anomaly detection performance.

A Dedicated Gaze: Anomaly Representation Pretraining

Anomaly Representation Pretraining focuses on developing dedicated feature representations specifically for anomaly detection, diverging from general image classification pretraining. The core principle is learning representations inherently more sensitive to subtle anomalous deviations. A key component is the Feature Projector, utilizing Learnable Key/Value Attention, transforming the initial feature space and enhancing discriminative power. Residual Features capture class-generalizable information, proving effective across multiple datasets—MVTecAD, VisA, and BTAD. Comparative evaluations demonstrate that features learned through Anomaly Representation Pretraining consistently outperform those derived from ImageNet-pretrained models.
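The two ingredients named above, residual features and a projector with learnable key/value attention, can be sketched as follows. This is a simplified numpy illustration under stated assumptions: residuals are taken against the nearest normal reference feature, and the keys/values (learnable parameters in the paper's Transformer-based projector) are fixed random matrices here.

```python
import numpy as np

def residual_features(feats: np.ndarray, refs: np.ndarray) -> np.ndarray:
    """Subtract each feature's nearest normal reference; the residual
    encodes the deviation from normality, which generalizes across classes."""
    d = ((feats[:, None, :] - refs[None, :, :]) ** 2).sum(-1)  # pairwise dists
    nearest = refs[d.argmin(axis=1)]
    return feats - nearest

def kv_attention_projector(residuals: np.ndarray, keys: np.ndarray,
                           values: np.ndarray) -> np.ndarray:
    """Attention where queries are the residuals and keys/values are
    free parameters (learned in the real model, fixed here)."""
    logits = residuals @ keys.T / np.sqrt(keys.shape[1])
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ values

rng = np.random.default_rng(0)
refs = rng.normal(size=(32, 16))     # normal reference features
feats = rng.normal(size=(8, 16))     # incoming patch features
keys = rng.normal(size=(4, 16))      # learnable in the actual projector
values = rng.normal(size=(4, 16))

res = residual_features(feats, refs)
proj = kv_attention_projector(res, keys, values)  # projected representations
```

Note the useful property that a feature identical to a normal reference yields a zero residual, so the projector operates purely on deviations from normality.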

A t-SNE visualization of features from the ‘capsules’ class of the VisA dataset reveals the improved feature separability achieved through the utilization of pretrained features.

Beyond the Horizon: Extending Anomaly Detection Boundaries

Anomaly Representation Pretraining significantly enhances existing anomaly detection techniques, extending methods such as PatchCore, CFLOW, and UniAD and demonstrably improving their robustness and efficacy. The framework offers a generalized improvement rather than a single-method optimization. Comparative analysis shows that Anomaly Representation Pretraining outperforms GLASS and FeatureNorm in anomaly localization, with superior Precision-Recall and AUROC scores across MVTec, VisA, BTAD, MVTec3D, and MPDD. The framework also performs strongly in Few-Shot Anomaly Detection: its 2-shot and 4-shot results rival KAG-Prompt. Importantly, performance is maintained even with 10% noise, indicating resilience. The model doesn’t just find the anomalies; it learns to distrust everything.
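Why the pretrained features plug into existing detectors so readily is easiest to see with a PatchCore-style scoring step: store normal patch features in a memory bank and score test patches by nearest-neighbor distance. The sketch below is a toy numpy version (real PatchCore coreset-subsamples the bank, omitted here); the synthetic features stand in for whatever backbone, original or pretrained, supplies them.

```python
import numpy as np

def patch_scores(test_patches: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Score each patch by Euclidean distance to its nearest neighbor in
    the memory bank of normal patches; the image-level score is the max."""
    d = np.sqrt(((test_patches[:, None, :] - bank[None, :, :]) ** 2).sum(-1))
    return d.min(axis=1)

rng = np.random.default_rng(0)
bank = rng.normal(0.0, 0.1, size=(500, 32))           # normal patch features
good = rng.normal(0.0, 0.1, size=(10, 32))            # nominal test patches
bad = np.concatenate([good, np.full((1, 32), 1.5)])   # plus one defective patch

image_score_good = patch_scores(good, bank).max()
image_score_bad = patch_scores(bad, bank).max()
```

Because the detector itself is just a distance computation, any gain from more discriminative pretrained features transfers directly, which is consistent with the across-the-board improvements reported above.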

t-SNE visualizations of features from the ‘candle’, ‘chewinggum’, ‘pcb1’, and ‘fryum’ classes of the VisA dataset further demonstrate the impact of pretrained features on feature separability.

The pursuit of anomaly detection, as detailed in this work, isn’t about imposing order on chaos, but coaxing signals from the noise. It recognizes that standard pretraining on datasets like ImageNet, while useful, fails to capture the subtle whispers of the unusual—the very essence of anomalies. This research doesn’t seek to find anomalies, but to persuade the model to recognize them, by crafting representations specifically attuned to their characteristics. As Yann LeCun once stated, “Everything we do in machine learning is about learning good representations.” This aligns perfectly with the paper’s focus on representation learning; the pretraining strategy isn’t about achieving precision, but about building a model that acknowledges the inherent ambiguity and noise within the data, and learns to discern the meaningful deviations.

What’s Next?

The pursuit of anomaly whispers continues. This work demonstrates the inadequacy of borrowed vision – representations sculpted by the demands of classification are fundamentally ill-equipped to perceive the subtle deviations that define the anomalous. The gain achieved through dedicated pretraining is not merely a numerical improvement; it is an acknowledgement that the world isn’t discrete, and that anomalies reside in the spaces between categories. But the ghosts remain. Contrastive learning, even tailored, is still a spell cast in a Euclidean space. The true anomalous signal likely exists in higher-order correlations, in the curvature of the feature manifold itself—dimensions we haven’t yet the precision to resolve.

The reliance on residual features, while effective, feels…incomplete. It suggests that anomalies are best understood not as what is different, but as what is missing. Perhaps the future lies not in learning richer representations, but in modeling the generative process of ‘normal’ – and then recognizing anomalies as failures of that generation. Anything exact is already dead, of course, so perfect reconstruction is a fool’s errand. The goal isn’t to find the anomaly, but to map the boundaries of the plausible.

The current framework, ultimately, is still tethered to the image. The true challenge—and the true potential—lies in extending this approach to multi-modal data, to time series, to systems where ‘normal’ is a moving target. It’s not about seeing better, it’s about learning to listen to the noise. And in that noise, perhaps, lies a glimpse of something genuinely new.


Original article: https://arxiv.org/pdf/2511.05245.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-11-10 12:41