Author: Denis Avetisyan
A new framework uses the power of large language models to automatically interpret and explain the inner workings of artificial vision systems.

LINE leverages text-to-image generation to provide human-understandable labels and explanations for individual neurons, achieving state-of-the-art performance in neuronal interpretability.
Despite advances in deep learning, understanding the concepts encoded within individual neurons remains a key challenge for both model transparency and safety. The work presented in ‘LINE: LLM-based Iterative Neuron Explanations for Vision Models’ addresses this limitation by introducing a novel, training-free framework that leverages large language models and text-to-image generation for automated neuron labeling. This approach, LINE, achieves state-of-the-art performance on benchmark datasets like ImageNet and Places365, discovering concepts often missed by existing methods. By providing both polysemantic evaluations and visual explanations, could LINE unlock a more comprehensive and interpretable understanding of complex vision models?
Illuminating the Neural Network’s Inner Vision
The astonishing capabilities of deep neural networks often mask a fundamental problem: their inherent opacity. While these systems excel at tasks like image recognition and natural language processing, understanding how they arrive at specific conclusions remains a significant challenge. This “black box” nature isn’t merely an academic curiosity; it actively hinders both trust and the potential for refinement. Without insight into the decision-making process, it’s difficult to identify biases, correct errors, or improve performance in a targeted manner. Consequently, deploying these powerful tools in critical applications – from healthcare diagnostics to autonomous vehicles – demands a greater degree of interpretability, pushing researchers to develop methods that can illuminate the inner workings of these complex algorithms and unlock their full potential.
Current visualization techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), frequently struggle to deliver truly insightful interpretations of deep neural network decision-making. While offering a general sense of where a network is ‘looking’ within an image, these methods often produce heatmaps that illuminate large, diffuse regions rather than the precise visual features driving a specific classification. This lack of granularity limits their utility for debugging network behavior or building trust in automated systems; a broad highlight on an entire object offers little information about why the network identified it, or if it’s focusing on the relevant characteristics. Consequently, researchers are actively pursuing alternatives capable of pinpointing the specific textures, edges, or components that trigger neuronal activation, moving beyond generalized areas to achieve a more refined understanding of the network’s ‘vision’.
Current limitations in understanding deep neural networks necessitate the development of tools that move beyond simply identifying where a network is looking, to discerning what specific features trigger neuronal activation. Existing interpretability techniques, while useful, often provide coarse visualizations, illuminating large swathes of an image without pinpointing the precise stimuli driving a decision. A more granular approach is crucial – one that can isolate the features a neuron responds to, and quantify its contribution to the network’s overall output. Such advancements would not only foster greater trust in these complex systems, particularly in critical applications like medical diagnosis and autonomous driving, but also facilitate targeted refinement and optimization of network architecture and training data, ultimately enhancing performance and reliability.

Reverse-Engineering the Neuron: Activation Maximization and Beyond
Activation maximization techniques function by iteratively adjusting an input – typically an image – to maximize the activation of a single neuron or a defined set of neurons within a neural network. This process utilizes gradient ascent; the input is modified in the direction that increases the neuron’s output. The resulting input, often visually interpretable despite appearing noisy, is considered a visualization of the neuron’s “preferred stimulus” – the patterns it most readily responds to. This allows researchers to gain insight into what features or patterns a particular neuron has learned to detect during training, effectively reverse-engineering the network’s internal representations. While not a direct representation of the training data, the generated input highlights the features that strongly contribute to that neuron’s activation.
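The gradient-ascent loop described above can be sketched in miniature. This is a toy illustration, not the paper's implementation: a hand-built "neuron" with a known preferred stimulus stands in for a real network unit, and the analytic gradient replaces backpropagation.

```python
# Toy sketch of activation maximization. Assumption: `neuron_activation` is a
# stand-in for a real network unit; it peaks when the input matches a hidden
# "preferred stimulus", which gradient ascent on the input then recovers.

PREFERRED = [0.8, -0.3, 0.5]  # the pattern this toy neuron responds to

def neuron_activation(x):
    # Maximal (at 0.0) when x matches PREFERRED exactly.
    return -sum((xi - pi) ** 2 for xi, pi in zip(x, PREFERRED))

def activation_gradient(x):
    # Analytic gradient of the activation with respect to the input.
    return [-2.0 * (xi - pi) for xi, pi in zip(x, PREFERRED)]

def maximize_activation(x0, lr=0.1, steps=200):
    x = list(x0)
    for _ in range(steps):
        g = activation_gradient(x)
        x = [xi + lr * gi for xi, gi in zip(x, g)]  # ascend the gradient
    return x

stimulus = maximize_activation([0.0, 0.0, 0.0])
print([round(v, 3) for v in stimulus])  # converges toward PREFERRED
```

With a real network the gradient comes from backpropagation through the model, and the optimized input is an image rather than a three-element vector, but the loop is the same.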
DEXTER and DiffExplainer represent advancements in activation maximization by integrating diffusion models into the image generation process. Traditional activation maximization directly optimizes pixel values; these newer techniques instead optimize text prompts which are then fed into a pre-trained diffusion model to synthesize images. This approach offers several advantages, including the generation of more naturalistic and semantically meaningful explanations, as the diffusion model constrains the output to resemble images from its training distribution. Furthermore, prompt optimization allows for a higher-level control over the generated stimulus, enabling exploration of concepts beyond simple pixel patterns and facilitating the identification of features the neuron responds to at a more abstract level.
Assessing the fidelity and interpretability of neuron-linked explanations generated through activation maximization techniques presents a significant challenge. Current evaluation methods often rely on human judgment, which is subjective and lacks scalability. Quantitative metrics, while offering objectivity, frequently fail to correlate with human perception of explanation quality or faithfulness to the model’s internal representations. Consequently, the development of robust benchmarks, consisting of diverse datasets and clearly defined evaluation criteria, is crucial for objectively comparing different explanation methods and ensuring their reliability. These benchmarks must move beyond simple image similarity metrics and incorporate measures of semantic alignment and counterfactual reasoning to effectively validate the generated explanations.

CoSy: A Standardized Framework for Evaluating Neuron Explanations
The CoSy benchmark addresses the need for systematic evaluation of open-vocabulary explanations for individual neurons within vision models. It provides a controlled environment for generating and assessing textual descriptions of neuron functionality, moving beyond human evaluation which can be subjective and difficult to scale. This is achieved through the creation of synthetic data where ground truth neuron responses are known, allowing for quantifiable metrics to be applied to generated explanations. The framework enables researchers to compare different explanation methods by measuring the alignment between predicted neuron activations based on the textual explanation and the actual observed activations, providing a standardized approach to assess explanation quality and faithfulness.
The CoSy benchmark utilizes synthetically generated data to provide ground truth for evaluating explanation quality, circumventing the need for human annotations, which are often subjective and costly. Alignment between generated textual explanations and neuron activations is quantified using two primary metrics: Area Under the Receiver Operating Characteristic curve (AUC) and Mean Absolute Deviation (MAD). AUC measures the ability of the explanation to discriminate between activating and non-activating stimuli, with higher values indicating better separation. MAD calculates the average absolute difference between the predicted activation score based on the explanation and the actual neuron activation, with lower values signifying closer alignment between explanation and neuron response.
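The two metrics have standard definitions that can be written out directly. A minimal sketch, assuming the textbook formulations (the exact CoSy computation may differ in detail):

```python
# Sketch of the two alignment metrics. Assumption: AUC is computed as the
# standard pairwise ranking probability; MAD is the plain mean absolute
# deviation between predicted and observed activations.

def auc(scores_pos, scores_neg):
    """Probability that an activating stimulus outscores a non-activating one."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(scores_pos) * len(scores_neg))

def mad(predicted, actual):
    """Mean absolute deviation between predicted and observed activations."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Explanation-predicted scores for stimuli that do / do not activate the neuron:
pos = [0.9, 0.8, 0.7]
neg = [0.2, 0.4, 0.1]
print(auc(pos, neg))                 # 1.0: perfect separation
print(mad([0.9, 0.1], [1.0, 0.0]))   # close to 0.1
```

A good explanation thus drives AUC toward 1.0 and MAD toward 0.0 simultaneously: it ranks activating stimuli above non-activating ones and predicts the activation magnitudes accurately.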
The CoSy benchmark facilitates quantitative evaluation of different explanation methods for vision models by providing a standardized scoring system. Utilizing metrics such as Area Under the Curve (AUC) and Mean Absolute Deviation (MAD), CoSy enables objective comparison of explanation quality based on alignment with neuron activations. Evaluations using CoSy demonstrate that the LINE framework achieves state-of-the-art performance, exhibiting an improvement of 0.180 on the ImageNet dataset and 0.050 on the Places365 dataset compared to prior methods.

LINE: An Automated Pipeline for Deciphering Neural Network Function
The LINE pipeline represents a novel approach to understanding the function of individual neurons by automatically assigning human-understandable labels without requiring any prior training data. This “black-box” system combines the capabilities of Large Language Models (LLMs) and Text-to-Image models; it iteratively proposes conceptual descriptions of neuron activity and then generates corresponding images to visually represent those concepts. By linking neuron activations directly to these generated images and associated text, LINE bypasses the need for manually annotated datasets, offering a potentially scalable solution for decoding the internal language of artificial neural networks and fostering greater interpretability in complex vision models.
The system, dubbed LINE, establishes a connection between the complex activity within artificial neural networks and human-understandable concepts through a novel iterative process. It begins by proposing potential concepts relevant to the network’s function, then generates corresponding images designed to visually represent those concepts. These images are then used to assess whether the network’s internal activations align with the proposed concept; if a strong correlation is found, the concept is accepted as a label for that specific neuron’s activity. This cycle of proposal, image generation, and validation allows LINE to effectively ‘teach itself’ what individual neurons are responding to, translating abstract mathematical activations into meaningful, interpretable labels without requiring any prior training data or human annotation.
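The propose-generate-validate cycle can be outlined as a loop. The following is a hypothetical sketch: `llm_propose`, `text_to_image`, and `neuron_response` are toy stand-ins for the real LLM, diffusion model, and probed network unit, and the candidate concepts and threshold are invented for illustration.

```python
# Hypothetical sketch of LINE's iterative labeling loop. All three helper
# functions are mocks (assumptions), not the paper's actual components.

def llm_propose(history):
    """Stand-in for an LLM: returns a candidate concept not yet tried."""
    candidates = ["stripes", "fur", "wheels", "water"]
    return next(c for c in candidates if c not in history)

def text_to_image(concept):
    """Stand-in for a text-to-image model: returns a tagged synthetic image."""
    return {"concept": concept}

def neuron_response(image):
    """Stand-in for probing the neuron: this toy unit responds to 'wheels'."""
    return 0.95 if image["concept"] == "wheels" else 0.1

def label_neuron(threshold=0.8, max_iters=4):
    history = []
    for _ in range(max_iters):
        concept = llm_propose(history)   # 1. propose a candidate concept
        image = text_to_image(concept)   # 2. render the concept as an image
        score = neuron_response(image)   # 3. measure the neuron's activation
        history.append(concept)
        if score >= threshold:           # 4. accept the label if alignment is strong
            return concept, score
    return None, 0.0

label, score = label_neuron()
print(label, score)
```

In the real pipeline, step 3 would feed the generated image through the vision model and read out the target neuron's activation, and the history of rejected concepts would be fed back to the LLM to steer the next proposal.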
The automated neuron labeling pipeline, LINE, surpasses existing benchmarks not simply through improved accuracy, but through the capacity to uncover previously unarticulated concepts within neural network activity. Evaluations on the CoSy dataset reveal that LINE identifies up to 39% more human-interpretable features than those present in established vocabularies used to define neuron function. This discovery capability suggests a move beyond simply recognizing what a network already ‘knows’, towards a system capable of elucidating genuinely novel patterns of computation. The ability to translate complex neural activations into understandable concepts represents a critical advancement in building artificial intelligence systems that are not only powerful, but also transparent and trustworthy, offering a pathway to demystify the ‘black box’ nature of deep learning.

Towards More Robust and Generalizable Neuron Interpretations
The true test of any neuron explanation lies in its performance beyond standard datasets. Researchers are increasingly utilizing challenging benchmarks like Salient ImageNet, specifically designed to expose weaknesses in explanation techniques and reveal reliance on superficial correlations. These datasets contain subtle, often adversarial, perturbations that can easily mislead methods relying on simple pixel patterns. Evaluating explanations on such data demonstrates whether an identified neuron genuinely responds to a semantic concept – like ‘stripes’ or ‘fur’ – or merely to low-level image features coincidentally present in the training set. Success on these benchmarks suggests a robustness indicative of a more generalizable understanding within the neural network, moving beyond memorization and towards true conceptual representation, ultimately fostering greater trust in AI systems.
Current methods for understanding how neural networks ‘see’ often rely on manually labeling concepts and then identifying neurons that activate in response. However, the emergence of techniques like CLIP-Dissect offers a powerful alternative, bypassing the need for human-defined labels by leveraging the semantic understanding already embedded within large vision-language models. CLIP-Dissect decomposes the network’s internal representations, linking individual neurons to concepts directly derived from CLIP’s text embeddings – essentially, asking ‘what does this neuron respond to, according to a model that already understands language and images?’. This approach provides a valuable form of cross-validation; if a neuron identified as responding to ‘stripes’ using CLIP also fires consistently when presented with striped objects in other datasets, it strengthens confidence in the explanation, offering a more robust and potentially generalizable understanding of the network’s inner workings than label-dependent methods alone.
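The core scoring step behind this idea, matching a neuron's activation pattern against concept embeddings by cosine similarity, is easy to illustrate. The vectors below are made-up three-dimensional toys; real CLIP embeddings are high-dimensional and produced by the pretrained model.

```python
# Illustrative sketch of CLIP-Dissect-style concept matching. Assumption:
# the concept embeddings and neuron signature here are invented toy vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Mock text embeddings for candidate concepts.
concept_embeddings = {
    "stripes": [1.0, 0.1, 0.0],
    "fur":     [0.0, 1.0, 0.2],
    "wheels":  [0.1, 0.0, 1.0],
}

# The neuron's activation signature projected into the same toy space.
neuron_signature = [0.9, 0.2, 0.1]

# Label the neuron with the concept whose embedding it aligns with most.
best = max(concept_embeddings,
           key=lambda c: cosine(neuron_signature, concept_embeddings[c]))
print(best)  # 'stripes' aligns best with this signature
```

Because the concept vocabulary is just a set of text embeddings, extending it costs nothing more than embedding new phrases, which is what makes the approach open-vocabulary.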
The pursuit of automated neuron labeling represents a pivotal step toward demystifying the complex decision-making processes within visual artificial intelligence. Currently, understanding what specific neurons represent often relies on painstaking manual annotation, a process that is both time-consuming and prone to subjective bias. Advanced research focuses on developing algorithms capable of automatically assigning meaningful labels to individual neurons, based on their activation patterns and the stimuli that trigger them. Success in this area promises not only to accelerate the pace of AI research, but also to foster the creation of more robust and reliable systems; by accurately identifying the functional role of each neuron, developers can better diagnose and correct biases, improve generalization to unseen data, and ultimately build AI that is truly interpretable and trustworthy. This automated understanding forms the foundation for creating AI systems that can explain how they arrive at specific conclusions, moving beyond simply what those conclusions are.

The pursuit of neuronal interpretability, as demonstrated by LINE, echoes a fundamental principle of elegant design: clarity through simplicity. This framework’s ability to automatically label and explain neuron behavior, bridging the gap between complex vision models and human understanding, highlights the beauty that emerges when form and function harmonize. As David Marr aptly stated, “Understanding vision requires understanding the computational roles of its parts.” LINE embodies this sentiment, revealing the computational roles within these models through the generation of insightful visual explanations and labels. The framework doesn’t merely describe what a neuron does, but seeks to illuminate how it contributes to the overall perceptual process, a testament to the power of uncovering underlying principles.
Where Do We Go From Here?
The automation of neuronal labeling, as demonstrated by LINE, feels less like a solution and more like a carefully constructed invitation to further complication. The framework elegantly shifts the burden of interpretation from human observation to the generative capacity of Large Language Models – a move that, while demonstrably effective, merely reframes the problem. The true challenge isn’t simply describing what a neuron does, but understanding why it does it, and whether that functionality aligns with anything resembling robust, generalizable intelligence. Each generated explanation, however compelling, remains a proxy, a polished surface concealing the inherent opacity of deep networks.
Future work must address the limitations of relying solely on textual descriptions. While LINE excels at associating neurons with semantic concepts, it doesn’t inherently reveal the computational principles governing their behavior. A compelling next step involves integrating these LLM-generated labels with methods for probing neuronal activation patterns – seeking internal consistency between what a neuron is said to represent and how it responds to various stimuli.
Ultimately, the pursuit of neuronal interpretability isn’t about achieving a perfect lexicon of neuronal function. It’s about recognizing that truly understanding a system requires more than just labeling its parts; it demands an appreciation for the subtle interplay between form and function, and a willingness to acknowledge the limits of any purely descriptive approach.
Original article: https://arxiv.org/pdf/2604.08039.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-04-11 11:57