Spotting the Flaws in Fake Data: A New Approach to Medical Image Quality

Author: Denis Avetisyan

Researchers develop a method to automatically identify and remove artificial distortions in synthetic medical images, boosting their usefulness for training AI.

The proposed method rigorously identifies distortions in data attributable to network effects, thereby enabling precise artifact detection.

This study presents a knowledge-based anomaly detection technique using shape analysis and isolation forest to identify network-induced artifacts in synthetic mammograms.

While synthetic data offers a promising solution to data scarcity in medical imaging, its uncritical adoption risks introducing subtle, yet performance-degrading, artifacts. This work, ‘Knowledge-based anomaly detection for identifying network-induced shape artifacts’, addresses this challenge with a novel method for detecting shape distortions in synthetic images, specifically within mammography. By combining per-image angle gradient analysis with an isolation forest-based anomaly detector, the approach effectively isolates anomalous regions with high accuracy—achieving AUC values up to 0.97 and demonstrating strong agreement with human readers. Could this knowledge-based approach become a standard quality control step in the responsible development of synthetic datasets for robust AI model training?

The Illusion of Abundance: Data Scarcity in Mammography

Training robust deep learning models for medical image analysis is hampered by the limited supply of annotated datasets, particularly in mammography. Manual annotation is expensive, which restricts both the scale and the diversity of available data, and this scarcity limits the accuracy of early cancer detection systems. Data augmentation helps, but it often introduces artifacts and fails to capture clinical variation, leading to poor generalization. Preliminary analyses also reveal that current synthetic data generation methods produce unrealistic features, potentially introducing bias. The pursuit of truly representative synthetic data therefore resembles a search for perfect form: an illusion that compels us onward.

Synthetic mammography images generated from both the CSAW-M and VinDr-Mammo datasets contain unrealistic features, highlighted with red annotations, that deviate from typical patient mammographic anatomy.

Synthetic Data: Mimicking Reality with Deep Learning

Deep learning techniques address limited medical imaging data, particularly in mammography, by generating realistic synthetic images. Generative models, like Latent Diffusion Models and StyleGAN2, learn underlying mammographic features from existing datasets, creating novel images. Two datasets, CSAW-M and VinDr-Mammo, serve as foundations for generating CSAW-syn and VMLO-syn, facilitating broader research access.

The patient datasets, CSAW-real and VMLO-real, and their synthetic counterparts, CSAW-syn and VMLO-syn, exhibit comparable overall characteristics.

Detecting the Imperceptible: Validating Synthetic Image Quality

Generated images require rigorous assessment to catch network-induced artifacts before they reach a training set. Quantitative metrics like Fréchet Inception Distance and Inception Score summarize a dataset as a whole and are insufficient to detect subtle distortions in individual images. To proactively identify these artifacts, the method combines three steps: boundary extraction, feature space construction using angle gradient analysis, and isolation forest anomaly detection. It achieved Area Under the Curve values up to 0.97 and a reported 14x improvement in artifact discovery, and its rankings agreed with human readers in reader studies (Kendall-Tau of 0.45 and 0.43).
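The anomaly detection step can be sketched with scikit-learn's IsolationForest. This is a minimal illustration, not the authors' implementation: the function name rank_by_anomaly, the choice of 200 trees, and the input layout (one row of shape features per image) are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def rank_by_anomaly(features, random_state=0):
    """Rank images from most to least anomalous.

    features: (n_images, n_features) array of per-image shape features
    (hypothetical layout; the article does not specify one).
    Returns (order, scores), where a higher score means more anomalous.
    """
    X = np.asarray(features, dtype=float)
    # n_estimators=200 is an arbitrary choice for this sketch.
    forest = IsolationForest(n_estimators=200, random_state=random_state)
    forest.fit(X)
    # score_samples returns lower values for more abnormal samples,
    # so negate it to get "higher = more anomalous".
    scores = -forest.score_samples(X)
    order = np.argsort(scores)[::-1]
    return order, scores
```

In a workflow like the one described, the top of the returned ordering would be the images to send to a human reader first.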

The distributions of bin-wise cumulative sum of angle gradients, representing breast shape from top to bottom, show substantial overlap between patient and synthetic datasets, though a complete correspondence is not achieved.
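One plausible reading of the "bin-wise cumulative sum of angle gradients" feature is sketched here. The exact definition used in the paper is not given in this article, so the details are assumptions: the boundary is taken as an ordered list of (x, y) points from top to bottom, the angle is the direction of each boundary segment, the gradient is the change in that direction between consecutive segments, and the number of bins (8) is arbitrary.

```python
import numpy as np


def angle_gradient_features(boundary, n_bins=8):
    """Bin-wise cumulative sum of turning angles along a boundary.

    boundary: (N, 2) array of ordered (x, y) points, top to bottom.
    Returns an array of length n_bins. This is a hypothetical feature
    definition for illustration, not the paper's exact formula.
    """
    pts = np.asarray(boundary, dtype=float)
    seg = np.diff(pts, axis=0)                      # vectors between neighbours
    angles = np.arctan2(seg[:, 1], seg[:, 0])       # direction of each segment
    angles = np.unwrap(angles)                      # remove 2*pi jumps
    grad = np.diff(angles)                          # turning angle per vertex
    per_bin = np.array([b.sum() for b in np.array_split(grad, n_bins)])
    return np.cumsum(per_bin)                       # running total, top to bottom
```

Under this definition a straight boundary yields all zeros, while a smooth convex outline yields a steadily growing (or shrinking, depending on orientation) curve, so kinks and dents show up as deviations from the typical profile.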

Expanding the Diagnostic Horizon: Leveraging Synthetic Datasets

Researchers have developed synthetic datasets, CSAW-syn and VMLO-syn, to augment the existing CSAW-real and VMLO-real datasets. This expands training volume and diversity for deep learning models. Incorporating synthetic data can enhance model generalization and accuracy in identifying subtle malignancy indicators, particularly when real-world data is imbalanced, provided the synthetic images are screened for artifacts. This scalable solution mitigates data scarcity, enabling more robust cancer detection systems. Like a perfectly balanced equation, expanding data capacity unlocks higher levels of diagnostic precision.

Analysis of breast area as a fraction of the total image area reveals that the CSAW-syn dataset extrapolates beyond the range observed in real images, while the VMLO-syn dataset exhibits a distribution shifted towards larger breast areas.
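A rough version of this comparison can be sketched as follows. The segmentation rule is an assumption made for illustration: any pixel above a fixed threshold counts as breast tissue, which presumes a near-black background. The function names are hypothetical.

```python
import numpy as np


def breast_area_fraction(image, threshold=0.0):
    """Fraction of pixels brighter than `threshold`.

    Assumes background pixels are at or below the threshold; a real
    pipeline would likely use a proper breast segmentation instead.
    """
    mask = np.asarray(image, dtype=float) > threshold
    return float(mask.mean())


def share_outside_real_range(real_fracs, syn_fracs):
    """Share of synthetic images whose area fraction falls outside
    the [min, max] range observed in the real images."""
    real = np.asarray(real_fracs, dtype=float)
    syn = np.asarray(syn_fracs, dtype=float)
    outside = (syn < real.min()) | (syn > real.max())
    return float(outside.mean())
```

A nonzero share from the second function would correspond to the kind of extrapolation beyond the real range described for CSAW-syn.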

The pursuit of robust synthetic data, as detailed in the article, demands a level of rigor beyond mere functional correctness. The proposed knowledge-based anomaly detection method, leveraging shape analysis and isolation forests, seeks not simply to accept generated images, but to check their validity. This echoes Fei-Fei Li’s sentiment: “AI has the potential to be the most transformative technology of our time, but only if we build it on a foundation of trust and understanding.” If the synthetic data contains undetected artifacts, such as shapes subtly distorted by network effects, the entire training process becomes suspect. The method’s focus on measurable shape consistency, rather than relying solely on visual assessment, embodies a commitment to that foundational trust. If it feels like magic, a flawlessly rendered image concealing underlying inconsistencies, one hasn’t revealed the invariant.

What’s Next?

The presented methodology, while demonstrating a functional approach to artifact detection in synthetic medical imagery, sidesteps the fundamental question of definitive artifact characterization. The isolation forest, a statistically elegant yet ultimately empirical technique, identifies anomalies by how easily a sample can be isolated through random partitioning of the feature space. This is a useful heuristic, certainly, but it lacks a formal proof of correctness regarding what constitutes a physiologically implausible mammographic shape. Future work must prioritize the development of shape priors derived from established biomechanical models and anatomical constraints. Only then can a system truly know an artifact, rather than merely suspect one.

A persistent limitation lies in the reliance on synthetic data for both training and evaluation. While pragmatically necessary, this introduces a circularity. The system learns to detect artifacts within the characteristics of the synthetic generation process itself. Establishing the generalizability of this approach requires rigorous testing against real-world clinical data, ideally with artifacts deliberately introduced under controlled conditions – a logistical challenge, but a mathematical imperative.
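A controlled test of that kind could be scored with a plain rank-based AUC. The sketch assumes each image has an anomaly score from the detector and a label marking whether an artifact was deliberately introduced (1) or not (0). The function name and data layout are hypothetical.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen artifact image (label 1)
    receives a higher anomaly score than a randomly chosen clean image
    (label 0), counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one image of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of images, which is acceptable for a small reader-study-sized set and keeps the definition explicit.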

The true advance will not be in achieving higher detection rates, but in formalizing the very definition of a medical image artifact. A theorem, not a benchmark, should be the ultimate goal. The current work represents a step towards that ideal, but the path remains open – and demands a level of mathematical rigor often absent in the pursuit of merely ‘working’ solutions.

Original article: https://arxiv.org/pdf/2511.04729.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
