Where the Eye Wanders: Predicting Gaze on Masterpieces

Author: Denis Avetisyan


A new deep learning model accurately simulates human visual attention when viewing paintings, offering insights into how we perceive art.

The system evaluates the alignment between predicted and actual gaze patterns through a congruency calculation, effectively quantifying how well the model anticipates visual attention.

This work introduces SPGen, a stochastic scanpath generation technique utilizing unsupervised domain adaptation to improve scanpath prediction on painting images.

Understanding human visual attention remains a challenge, particularly when applied to the nuanced interpretation of artwork. This limitation motivates the development of ‘SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation’, a novel deep learning model designed to predict realistic eye movement sequences, known as scanpaths, as viewers observe paintings. By leveraging a fully convolutional network with learnable priors and employing unsupervised domain adaptation, SPGen effectively transfers knowledge from natural images to the artistic domain, outperforming existing methods in scanpath prediction accuracy. Could this approach unlock new insights into how we perceive and appreciate visual art, ultimately aiding in its preservation and study?


The Illusion of Effortless Vision

The human visual system (HVS) operates with astonishing efficiency, a feat achieved despite inherent limitations in processing power and available data. Consider that the retina captures a vast amount of visual information with each glance, yet conscious perception remains focused and manageable. This isn’t due to unlimited capacity, but rather a highly optimized system that actively filters and prioritizes incoming stimuli. The brain doesn’t simply record everything; instead, it constructs a representation of the world based on relevance, expectation, and prior experience. This selective processing allows individuals to navigate complex environments, identify crucial details, and respond appropriately – all while conserving cognitive resources. Ultimately, the HVS demonstrates that effective vision isn’t about seeing more, but about intelligently processing what is seen, a principle increasingly vital in the development of artificial intelligence.

The human visual system doesn’t passively record everything it sees; instead, selective attention acts as a powerful filter, dramatically enhancing the processing of relevant information while suppressing the rest. This prioritization isn’t random; it’s guided by a complex interplay of bottom-up factors – like sudden movements or bright colors – and top-down influences, such as expectations and goals. Essentially, the brain allocates limited neural resources – the ‘spotlight of attention’ – to the most pertinent aspects of a visual scene, allowing for remarkably efficient processing despite the overwhelming amount of incoming data. This focused processing isn’t merely about noticing things; it’s fundamentally linked to conscious perception, meaning that much of what enters the visual field remains unnoticed unless it’s selected for attention. Consequently, understanding the neural mechanisms underpinning selective attention is critical not only for unraveling the complexities of human vision but also for building artificial intelligence capable of mimicking this efficient and adaptable information processing.

Mimicking the human visual system’s selective attention is proving vital for advancements in computer vision. Current systems often struggle with the sheer volume of visual data, leading to inefficiencies and inaccuracies; however, by incorporating principles of how humans prioritize information – focusing on salient features and filtering out noise – researchers are developing algorithms that dramatically improve object recognition, scene understanding, and real-time processing capabilities. This bio-inspired approach not only enhances the speed and accuracy of computer vision tasks but also allows for the creation of systems that are more robust to variations in lighting, viewpoint, and occlusion – ultimately paving the way for more intelligent and adaptable artificial vision.

The Two Sides of Attention: Reacting and Seeking

Bottom-up attention, also known as stimulus-driven attention, operates as an automatic and preattentive mechanism. This process is initiated by the physical properties of a visual stimulus, such as high contrast, bright colors, sudden movement, or the presence of edges and corners. Because it is largely unconscious, bottom-up attention captures our focus before we are even aware of a potential item of interest, functioning as an initial filtering stage for incoming sensory information. The strength of a stimulus in driving bottom-up attention is determined by its salience – how much it visually “pops out” from its surroundings – and does not require cognitive resources or prior expectations.

Top-down attention is a cognitive process wherein attentional resources are intentionally allocated based on pre-existing knowledge, expectations, and current goals. This contrasts with stimulus-driven attention; instead of reacting to external cues, top-down attention actively searches for specific information or features relevant to the task at hand. The prefrontal cortex plays a crucial role in implementing top-down control, modulating activity in other brain regions – such as the visual cortex – to prioritize processing of goal-relevant stimuli and suppress irrelevant distractions. Consequently, top-down attention allows for flexible and efficient visual search, enabling individuals to focus on what is important despite a complex visual environment.

Attentional processes are not mutually exclusive; bottom-up and top-down attention operate in tandem with both covert and overt shifts in gaze. Bottom-up attention rapidly captures focus via stimulus salience, while top-down attention intentionally guides processing based on goals and expectations. Covert attention allows for processing of stimuli without direct eye movement, while overt attention involves physical shifting of gaze to focus on specific locations. These systems interact dynamically; a salient stimulus detected via bottom-up processing can trigger a top-down search for related information, and conversely, goal-directed attention can enhance the processing of specific features. This integrated operation allows for efficient scanning of visual scenes and the construction of a coherent perceptual experience.

The dynamic interaction between bottom-up and top-down attention fundamentally shapes visual perception and scene understanding. Bottom-up processing rapidly identifies visually salient stimuli – high contrast regions, bright colors, or sudden movements – automatically capturing attention. Simultaneously, top-down processing utilizes existing knowledge, goals, and expectations to prioritize specific features or locations within the scene. This prioritization modulates the influence of bottom-up signals, directing resources towards task-relevant information and filtering out distractions. The resulting integrated processing stream generates a coherent mental representation, allowing for efficient interpretation of the visual environment and guiding subsequent behavior.

Decoding the Gaze: Following the Scanpath

The human visual system does not passively record an entire scene at once; instead, attention operates through a series of focused fixations – periods where the eye remains relatively still – interspersed with rapid movements called saccades. A scanpath is the sequential record of these fixations and saccades as a viewer explores a visual stimulus. Because fixations represent moments of active information processing and saccades direct attention to salient regions, the resulting scanpath directly reflects the cognitive processes underlying visual attention. Analyzing the characteristics of a scanpath – including fixation duration, saccade amplitude, and the order of fixated locations – therefore provides a quantifiable measure of attentional allocation and can reveal how individuals prioritize and extract information from visual scenes.

Scanpath analysis provides quantifiable data regarding visual prioritization and scene navigation. By tracking fixation durations, saccade amplitudes, and the order of fixations, researchers can determine which image regions attract the most attention and the sequence in which these regions are processed. Longer fixation durations typically indicate more cognitive processing of a given area, while saccades reveal the visual search strategy employed. Metrics derived from scanpaths, such as time-to-first-fixation, dwell time, and transition probabilities between areas of interest, allow for the creation of saliency maps and the modeling of attentional mechanisms. These analyses are applicable across various domains, including usability testing, advertising effectiveness, and the study of visual expertise.
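The basic scanpath metrics described above are straightforward to compute from a fixation sequence. Below is a minimal sketch in plain Python; the scanpath data, AOI coordinates, and function names are illustrative, not from the paper.

```python
import math

# A scanpath as a list of fixations: (x, y, duration_ms), in viewing order.
scanpath = [(120, 80, 250), (300, 90, 400), (310, 220, 180), (125, 85, 320)]

# An area of interest (AOI) as a bounding box (x0, y0, x1, y1).
aoi = (280, 60, 360, 260)

def in_aoi(fix, box):
    x, y, _ = fix
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def time_to_first_fixation(path, box):
    # Viewing time elapsed before the AOI is first fixated.
    elapsed = 0
    for fix in path:
        if in_aoi(fix, box):
            return elapsed
        elapsed += fix[2]
    return None  # AOI never fixated

def dwell_time(path, box):
    # Total duration of all fixations inside the AOI.
    return sum(f[2] for f in path if in_aoi(f, box))

def saccade_amplitudes(path):
    # Euclidean distance between consecutive fixation locations.
    return [math.dist(a[:2], b[:2]) for a, b in zip(path, path[1:])]

print(time_to_first_fixation(scanpath, aoi))  # 250
print(dwell_time(scanpath, aoi))              # 580
```

Transition probabilities between AOIs follow the same pattern: count ordered pairs of consecutive fixations that move from one AOI to another, then normalize per source AOI.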

Employing lightweight convolutional neural networks, specifically architectures like MobileNet, offers an efficient methodology for modeling human scanpaths due to their reduced computational demands and parameter count. MobileNet’s depthwise separable convolutions minimize the number of parameters while maintaining acceptable accuracy in feature extraction from visual stimuli. This allows for the rapid processing of image features relevant to predicting gaze locations, enabling real-time scanpath modeling and prediction without requiring extensive computational resources. The resulting models are suitable for deployment on resource-constrained devices and facilitate large-scale analysis of visual attention data.

Predictive gaze algorithms, derived from scanpath modeling, utilize learned patterns of human attention to forecast future fixation locations within a visual scene. These algorithms typically employ machine learning techniques, trained on datasets of recorded human scanpaths, to establish correlations between visual features and subsequent gaze positions. The resulting models can then take a new image as input and, based on the learned correlations, output a probability distribution over potential fixation points, effectively predicting where a human observer is likely to look next. Performance is commonly evaluated using metrics such as Area Under the ROC Curve (AUC) or Root Mean Squared Error (RMSE) between predicted and actual gaze locations, enabling quantitative comparison of different algorithmic approaches.
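Of the evaluation metrics mentioned above, RMSE is the simplest to compute: it is the root of the mean squared Euclidean distance between predicted and observed fixation coordinates. A minimal sketch with illustrative data:

```python
import math

def fixation_rmse(pred, actual):
    # Root mean squared Euclidean distance between paired (x, y) fixations.
    sq = [(px - ax) ** 2 + (py - ay) ** 2
          for (px, py), (ax, ay) in zip(pred, actual)]
    return math.sqrt(sum(sq) / len(sq))

predicted = [(100, 100), (210, 150), (300, 310)]
observed = [(110, 95), (200, 160), (305, 300)]
print(round(fixation_rmse(predicted, observed), 2))  # 12.25 (pixels)
```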

The model successfully generates plausible audio-visual predictions on the AVAtt dataset, demonstrating its capacity for cross-modal understanding.

Bridging the Gap: Unsupervised Adaptation to New Views

Unsupervised Domain Adaptation (UDA) addresses the challenge of deploying machine learning models in new environments where labeled data is unavailable. Traditional supervised learning requires extensive labeled data for each target domain, which is often costly and time-consuming to obtain. UDA techniques enable a model, initially trained on a source domain with abundant labeled data, to generalize effectively to a different, unlabeled target domain. This is achieved by learning domain-invariant features – representations that are similar across both domains – allowing the model to transfer knowledge without requiring any labeled examples from the new domain. The core principle is to minimize the discrepancy between the feature distributions of the source and target domains, effectively bridging the gap and enabling successful knowledge transfer.

The Gradient Reversal Layer (GRL) is a technique used in domain adaptation to minimize the discrepancy between feature distributions of source and target domains. During forward propagation, the GRL acts as an identity function, passing the features unchanged. However, during backpropagation, it reverses the sign of the gradient flowing from the feature extractor. This encourages the feature extractor to learn representations that are indistinguishable between the two domains, effectively learning domain-invariant features. By minimizing domain-specific information, the model can generalize better to the target domain without requiring labeled data from that domain, as the learned features are less biased towards the source domain’s characteristics.
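The asymmetry of the GRL (identity forward, sign-flipped gradient backward) can be shown in a few lines. This is a framework-free sketch; in a real model it would be implemented as a custom autograd operation, and the scalar "features" here stand in for tensors.

```python
class GradientReversal:
    """Minimal sketch of a Gradient Reversal Layer (illustrative only)."""

    def __init__(self, lam=1.0):
        self.lam = lam  # scaling factor applied to the reversed gradient

    def forward(self, x):
        # Identity on the forward pass: features flow through unchanged.
        return x

    def backward(self, grad_from_domain_classifier):
        # Sign flip on the backward pass: the feature extractor is pushed to
        # *maximize* the domain classifier's loss, encouraging features that
        # are indistinguishable between source and target domains.
        return -self.lam * grad_from_domain_classifier

grl = GradientReversal(lam=0.5)
features = 3.2
print(grl.forward(features))   # 3.2  (unchanged going forward)
print(grl.backward(2.0))       # -1.0 (reversed and scaled going back)
```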

Evaluation of unsupervised domain adaptation techniques relies on quantitative metrics assessing the alignment between model predictions and ground truth data in the target domain. Normalized Scanpath Saliency (NSS) averages the values of a normalized (zero-mean, unit-variance) predicted saliency map at the observed fixation locations, quantifying how well the model highlights the regions viewers actually look at. MultiMatch compares predicted and observed scanpaths along several dimensions – including shape, length, direction, position, and fixation duration – yielding a similarity score for the sequences as a whole. Congruency quantifies the agreement between predicted and observed fixations, reflecting the probability that a ground-truth fixation falls within a predicted region. Together, these metrics provide objective assessments of the model’s ability to generalize to new visual domains without requiring labeled data in the target domain.
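NSS in particular is compact enough to sketch directly: z-score the predicted saliency map, then average its values at the ground-truth fixation locations. The tiny 3x3 map and fixation list below are illustrative.

```python
import statistics

def nss(saliency_map, fixations):
    # Z-score the whole map, then average the normalized saliency at the
    # observed fixation locations. Positive scores mean fixations land on
    # regions the model marked as salient.
    values = [v for row in saliency_map for v in row]
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std over the whole map
    return statistics.mean(
        (saliency_map[r][c] - mu) / sigma for r, c in fixations
    )

saliency = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.3],
    [0.1, 0.2, 0.1],
]
fixations = [(1, 1), (1, 2)]  # (row, col) of observed fixations

print(round(nss(saliency, fixations), 2))  # 1.48
```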

Following unsupervised domain adaptation, our model achieved state-of-the-art results on multiple datasets. Specifically, the model attained the highest MultiMatch (MM) score on the Salicon dataset and the highest Normalized Scanpath Saliency (NSS) score on the Le Meur dataset, both indicating improved ability to identify and align with visually salient regions. Furthermore, a significant improvement in Congruency was observed on the AVAtt dataset, demonstrating a stronger correlation between predicted gaze fixations and actual human fixations during visual processing.

The proposed method utilizes a general architecture integrating <span class="katex-eq" data-katex-display="false">\mathbf{x}</span> as input, processing it through a dynamics model <span class="katex-eq" data-katex-display="false">f_{\theta}</span>, and predicting a subsequent state <span class="katex-eq" data-katex-display="false">\hat{\mathbf{x}}</span>.

Preserving the Past: Understanding How We Look at Art

The ways in which people visually engage with paintings and cultural artifacts are fundamental to both their long-term preservation and meaningful interpretation. Human attention isn’t distributed evenly across an artwork; rather, it follows specific patterns – known as scanpaths – revealing what elements capture interest and how viewers construct meaning. Recognizing these attentional priorities is vital for conservation efforts, allowing restorers to focus on preserving the most visually significant areas. Moreover, understanding how people look at art unlocks insights into cultural understanding, aesthetic preferences, and even the historical context surrounding the creation of the work. By deciphering the visual language inherent in the act of viewing, researchers can offer new perspectives on artistic intention and the enduring power of cultural heritage, ensuring these treasures continue to resonate with future generations.

Researchers are increasingly leveraging the analysis of scanpaths – the record of where a viewer’s gaze falls upon an image – to decode the complex interplay between perception and aesthetic appreciation. By mathematically modeling these visual trajectories, scientists can infer cognitive processes such as attention allocation, feature extraction, and emotional response. These models reveal that viewers don’t simply scan an artwork randomly; instead, patterns emerge indicating a preference for specific compositional elements, a focus on areas of high contrast or detail, and a tendency to follow established artistic conventions. Consequently, understanding these scanpaths provides valuable insights into how individuals interpret and engage with cultural artifacts, offering a novel approach to studying aesthetic preferences and the underlying cognitive mechanisms that shape our visual experience.

The methodology developed offers practical benefits extending into several disciplines focused on cultural preservation. Art historians can utilize the insights into visual attention patterns to refine interpretations of artwork, understanding not just what viewers see, but how they engage with composition and detail. Conservators gain a data-driven approach to assessing the impact of restoration efforts – pinpointing areas of visual importance that warrant particular care. Furthermore, museum curators are empowered to optimize exhibit design, strategically placing artworks and informational displays to align with natural viewing behaviors, ultimately enhancing the visitor experience and fostering a more profound connection with cultural heritage. This research, therefore, moves beyond theoretical understanding, providing tangible tools for those dedicated to the stewardship of art and history.

The study of visual attention, when applied to cultural artifacts, transcends mere aesthetic appreciation and ventures into the realm of preserving collective memory. By deciphering how individuals engage with paintings and other historical objects, researchers are effectively reconstructing the cognitive pathways through which meaning is derived and cultural narratives are internalized. This understanding isn’t simply academic; it informs crucial decisions regarding conservation efforts, ensuring that restoration prioritizes elements most salient to the human eye and, consequently, to cultural interpretation. Furthermore, this work provides a framework for enhancing museum experiences, allowing curators to design exhibits that resonate more deeply with viewers and foster a more profound connection to the past, ultimately safeguarding shared cultural heritage for generations to come.

Scanpath lengths on the MIT1003 dataset exhibit a characteristic distribution, reflecting typical visual exploration patterns.

The pursuit of accurate scanpath prediction, as demonstrated by SPGen, invariably leads to increasingly complex models. It’s a predictable trajectory. This paper attempts to bridge the gap between generic image understanding and the nuances of artistic paintings through domain adaptation – a clever approach, certainly. However, one suspects that even this elegantly constructed system will eventually succumb to the entropy of real-world data. As Andrew Ng once said, ‘AI is not about replacing humans; it’s about augmenting them.’ This applies equally to the models themselves; each refinement merely addresses a new set of edge cases, adding layers of complexity that future iterations will inevitably have to untangle. The core idea of adapting models to new domains is sound, but the illusion of a ‘solved’ problem is fleeting.

What’s Next?

The pursuit of predictive scanpaths, even with the elegance of unsupervised domain adaptation, inevitably encounters the limits of generalization. This work demonstrates transfer learning from images, but neglects the fundamental question of what happens when the source domain is itself a flawed representation of attention. Every optimization for saliency will, at some point, be optimized back towards noise. The model’s success hinges on approximating human gaze, yet human attention isn’t merely visual; it’s a negotiation between expectation, context, and the frankly inexplicable.

Future iterations will likely focus on increasingly granular domain adaptation – moving beyond broad image categories to account for artistic style, period, or even individual painterly technique. But architecture isn’t a diagram; it’s a compromise that survived deployment. The real challenge isn’t achieving higher accuracy on benchmark datasets, but building systems resilient to the unpredictable variations of real-world observation.

The promise of cultural heritage analysis remains compelling, but the field must acknowledge that predictive models don’t reveal meaning, they merely map patterns. One doesn’t refactor code; one resuscitates hope. The next phase isn’t about building a perfect gaze simulator, but about understanding – and gracefully accepting – the inherent messiness of visual attention itself.


Original article: https://arxiv.org/pdf/2602.22049.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-02-26 17:37