Author: Denis Avetisyan
New research benchmarks how well AI models can judge the similarity of graph visualizations, mirroring human visual perception.

This study assesses computational measures and multimodal large language models in evaluating graph visualization similarity, with implications for visual analytics.
Identifying similar graph visualizations remains a challenge, as automated metrics often diverge from human perceptual judgments. This discrepancy motivates the research presented in ‘Seeing Graphs Like Humans: Benchmarking Computational Measures and MLLMs for Similarity Assessment’, which investigates the alignment between computational measures, human assessments, and the emerging capabilities of multimodal large language models (MLLMs). Our findings demonstrate that MLLMs, particularly GPT-5, significantly outperform traditional metrics in mirroring human perception of graph similarity and offer interpretable rationales for their decisions. Could these explainable AI systems ultimately serve as intelligent guides, enhancing both the efficiency and insights of visual analytics workflows?
The Echo of Structure: Bridging Human Perception and Graph Similarity
The ability to determine the similarity between graphs – visual representations of relationships between entities – is fundamental to a surprisingly broad range of disciplines, from social network analysis and cheminformatics to computer vision and cybersecurity. However, current computational approaches to this assessment frequently prioritize precise algorithmic matching of nodes and edges, often overlooking the holistic, pattern-based reasoning employed by human observers. This disconnect arises because humans don’t typically deconstruct graphs into their constituent parts for comparison; instead, they rapidly grasp the overall structure, density of connections, and prominent features – a process that existing methods struggle to replicate. Consequently, algorithmic evaluations of graph similarity can diverge significantly from human perception, hindering the development of truly intuitive and effective tools for data analysis and visualization.
Initial human assessments of graph similarity aren’t based on meticulous node-by-node comparisons, but rather a swift evaluation of overall structure and how densely connected the graph appears. Research indicates individuals rapidly form impressions by prioritizing the ‘gestalt’ of the graph – its broad shape and the concentration of edges. A sparse graph, even with a few similar local motifs, will be perceived as distinct from a dense, interconnected one. This prioritization of global features and edge density suggests human visual processing leverages shortcuts, focusing on readily available information to quickly categorize and differentiate between graph structures – a process that often bypasses detailed algorithmic calculations of node or edge correspondence.
Current computational approaches to graph similarity frequently stumble when mirroring human judgment, largely because algorithms prioritize precise node-level or edge-level comparisons that diverge from how people intuitively perceive structural differences. While a computational method might flag a single altered connection as significant, human assessment tends to focus on broader patterns – the overall ‘shape’ of the graph and its density of connections. This disconnect arises because many algorithms lack the capacity to weigh global structural features and holistic density as heavily as the human visual system does, resulting in situations where an algorithm and a human observer will rank the similarity of two graphs quite differently, even when presented with visually obvious distinctions. Consequently, developing algorithms that more closely align with human perception requires a shift towards incorporating these higher-level structural cues and a greater emphasis on the gestalt principles that govern how people rapidly form impressions of complex visual patterns.
The development of truly effective graph analysis tools hinges on mirroring human visual perception. Current computational methods for assessing graph similarity frequently prioritize algorithmic efficiency over cognitive plausibility, resulting in outputs that diverge from intuitive human judgments. Consequently, a deeper understanding of how people rapidly grasp and compare graph structures – focusing on global arrangement and the concentration of connections – is not merely an academic exercise. It is a fundamental requirement for creating tools that are both powerful and interpretable, allowing users to effortlessly validate results and derive meaningful insights from complex network data. By aligning computational approaches with the principles of human visual cognition, researchers can bridge the gap between algorithmic precision and human understanding, fostering a more symbiotic relationship between people and the increasingly complex world of network science.

Portrait Divergence: A Metric Informed by Visual Perception
Portrait Divergence is a computational metric developed to assess graph similarity by modeling human visual perception. Unlike traditional graph comparison methods that rely solely on node or edge counts, or shortest path lengths, Portrait Divergence explicitly incorporates principles of how humans visually process and interpret graph layouts. The metric operates by quantifying the divergence between the ‘portraits’ – global structural representations – of two graphs, focusing on features like edge density and overall shape. This approach allows for the comparison of graphs with differing sizes or minor local variations, emphasizing shared high-level structural characteristics as perceived by a human observer. The resulting divergence score provides a quantitative measure of how visually similar two graphs are, according to principles of Gestalt psychology and perceptual organization.
Portrait Divergence prioritizes the overall arrangement of nodes and the concentration of connections when determining graph similarity. The metric calculates divergence based on differences in global graph properties, such as the distribution of node degrees and the presence of densely connected subgraphs, rather than focusing on localized node-to-node comparisons. This approach is intended to mirror human visual perception, where individuals tend to first perceive the general layout and prominent features of a graph before analyzing individual edges. Consequently, graphs with similar high-level structures and edge densities will exhibit lower divergence scores, even if they differ in specific edge configurations or node attributes.
Portrait Divergence distinguishes itself from traditional graph comparison methods by prioritizing high-level structural features – specifically, global organization and edge density – to produce a metric more aligned with human visual perception. Unlike measures focused solely on node or edge counts, or shortest path lengths, Portrait Divergence assesses similarity based on how easily a human observer would recognize equivalent structural patterns between graphs. This emphasis on qualitative visual characteristics yields a more interpretable divergence score, facilitating understanding of why two graphs are considered dissimilar, rather than simply that they are dissimilar. The resulting metric is therefore considered psychologically relevant, as its output correlates more strongly with human judgments of graph similarity.
Traditional graph comparison metrics often rely on quantitative properties such as node degree, path length, or clustering coefficient. Portrait Divergence distinguishes itself by integrating qualitative assessments of visual structure into the comparison process. This is achieved by prioritizing the overall arrangement and density of edges, rather than strictly counting specific features. The metric assesses similarity based on how effectively the global organization of a graph – its ‘shape’ – is preserved during comparisons, acknowledging that human perception is more sensitive to these holistic properties than to precise numerical values. This shift allows for the evaluation of graphs with differing sizes or minor variations in node attributes, focusing instead on the preservation of structural relationships.
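To make the idea concrete, the sketch below builds a simplified "portrait" for each graph – how often nodes see exactly k other nodes at shortest-path distance l – and compares the two portraits with a Jensen-Shannon divergence. This is a toy, pure-Python reduction of the published Portrait Divergence algorithm on two hand-coded four-node graphs, not the paper's implementation; the `portrait` and `jensen_shannon` helpers are illustrative simplifications.

```python
import math
from collections import Counter

def bfs_distances(adj, source):
    """Shortest-path distances from source via breadth-first search."""
    dist = {source: 0}
    frontier = [source]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def portrait(adj):
    """Simplified portrait: probability that a node has exactly k nodes
    at shortest-path distance l, over all (l, k) pairs in the graph."""
    rows = Counter()
    for s in adj:
        shells = Counter(bfs_distances(adj, s).values())
        del shells[0]  # drop the source node itself (distance 0)
        for l, k in shells.items():
            rows[(l, k)] += 1
    total = sum(rows.values())
    return {lk: c / total for lk, c in rows.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1)."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two tiny graphs as adjacency dicts: a 4-cycle and a 4-node path.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

print(jensen_shannon(portrait(cycle), portrait(cycle)))  # identical -> 0.0
print(jensen_shannon(portrait(cycle), portrait(path)))   # positive divergence
```

Because the portrait records only global distance statistics, relabeling nodes or making small local edits leaves the score nearly unchanged, which is the property the metric trades on.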

Validating Algorithmic Alignment with Human Judgement
A Visual Analytics System was developed to enable comparative analysis of graph structures, integrating the Portrait Divergence metric with outputs from large language models. This system allows for simultaneous visualization of graphs and associated rationales generated by both GPT-5 and Claude Sonnet 4.5, facilitating a direct comparison between algorithmic assessment and human perception. The system’s design prioritized a user interface capable of displaying graph layouts alongside textual explanations for similarity judgements, allowing researchers to assess the consistency and interpretability of both the Portrait Divergence metric and the language model outputs. Data generated through the system formed the basis for validating the computational alignment of graph similarity with human assessment.
To establish a comparative baseline for evaluating the computational alignment of graph similarity metrics with human perception, two large language models – GPT-5 and Claude Sonnet 4.5 – were employed to independently assess the similarity of paired graphs. Beyond providing similarity scores, both models were prompted to generate textual rationales explaining the basis for their judgements, enabling qualitative analysis alongside quantitative metrics. This approach facilitated a direct comparison between the models’ reasoning processes and the outputs of Portrait Divergence, contributing to a multi-faceted validation of the computational methods against human assessment criteria.
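The study presented rendered graph visualizations to the multimodal models; purely as an illustration of the score-plus-rationale pattern (this is not the authors' prompt, pipeline, or API), the snippet below sketches how such a request might be assembled for a text-only edge-list variant. The commented `client.chat` call is a placeholder, not a real client library.

```python
import json

def build_similarity_prompt(graph_a_edges, graph_b_edges):
    """Assemble a rating prompt asking a model for a similarity score
    plus a textual rationale (structure is illustrative only)."""
    return (
        "You are shown two graphs as edge lists.\n"
        f"Graph A: {json.dumps(graph_a_edges)}\n"
        f"Graph B: {json.dumps(graph_b_edges)}\n"
        "Rate their structural similarity from 1 (dissimilar) to 5 (identical), "
        "considering overall shape, edge density, and community structure.\n"
        'Reply as JSON: {"score": <1-5>, "rationale": "<one paragraph>"}'
    )

prompt = build_similarity_prompt([[0, 1], [1, 2]], [[0, 1], [0, 2]])
# The prompt would then go to the model under test, e.g. (hypothetical API):
# response = client.chat(model="gpt-5", messages=[{"role": "user", "content": prompt}])
print(prompt)
```

Requesting a structured JSON reply is what makes the rationale machine-readable alongside the numeric score, so both can be compared against human annotations.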
Analysis revealed a statistically significant correlation between the Portrait Divergence metric and human assessment of graph similarity, as quantified by a Cohen’s Kappa value of 0.424. Cohen’s Kappa measures inter-rater agreement beyond chance; values range from −1 to 1, where 0 indicates chance-level agreement and higher values indicate stronger agreement, so 0.424 represents a moderate level of agreement. This result indicates that the Portrait Divergence algorithm, while not perfectly aligned with human judgement, provides a quantifiable measure that generally reflects how humans perceive similarity between graphs. The statistical validation supports the use of Portrait Divergence as a tool for computational analysis where approximation of human perception is desired.
Quantitative analysis reveals GPT-5 demonstrates a statistically significant improvement in aligning with human judgement of graph similarity compared to the Portrait Divergence metric. Specifically, GPT-5 achieved a Cohen’s Kappa of 0.479, exceeding Portrait Divergence’s Kappa of 0.424 (p<0.05). Furthermore, GPT-5’s Spearman’s correlation coefficient of 0.353 is also significantly higher than that of Portrait Divergence (ρ=0.269, p<0.001). These results indicate that GPT-5’s assessments of graph similarity more closely reflect human perceptions than those generated by the Portrait Divergence algorithm, as measured by these statistical methods.
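Both statistics reported above are straightforward to compute. The sketch below implements Cohen’s Kappa (agreement beyond chance on categorical labels) and a tie-free Spearman rank correlation in plain Python; the six graph-pair labels and scores are invented for illustration and are not the study’s data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two categorical label sequences; ranges from -1 to 1."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[c] * cb[c] for c in ca.keys() | cb.keys()) / n ** 2
    return (observed - expected) / (1 - expected)

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free sequences: Pearson on ranks.
    Rank variances are equal for tie-free sequences of the same length."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Hypothetical judgements for six graph pairs (made-up data).
human_labels = ["similar", "similar", "different", "different", "similar", "different"]
model_labels = ["similar", "different", "different", "different", "similar", "different"]
human_scores = [5, 4, 2, 1, 3, 0]
model_scores = [4, 5, 2, 1, 3, 0]

print(cohens_kappa(human_labels, model_labels))   # agreement beyond chance
print(spearman_rho(human_scores, model_scores))   # monotonic rank agreement
```

A production analysis would reach for `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score`, which also handle ties and significance testing.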

Scaling Complexity: The Limits of Perception and Computation
The ability to assess graph similarity diminishes rapidly as network size increases, presenting a substantial challenge for both human observers and computational algorithms. Initial studies reveal that performance on similarity judgements plateaus – and often declines – beyond a relatively small number of nodes and edges, indicating a cognitive limit in processing complex relational data. This scalability issue isn’t merely perceptual; computational methods also struggle with the exponential growth in complexity as graphs expand, requiring increasingly intensive resources and time. Consequently, research is heavily focused on developing scalable techniques – including novel graph embeddings and efficient comparison metrics – that can maintain accuracy and speed even with large-scale networks, crucial for applications ranging from social network analysis to drug discovery and materials science. Addressing this fundamental limitation is paramount to unlocking the full potential of graph-based data analysis.
Human assessment of graph similarity isn’t simply a matter of counting edges or nodes; instead, perception is heavily influenced by how interconnected groups – the community structure – are arranged, and the prominence of individual nodes measured by their degree. Research indicates that humans readily identify and prioritize graphs where communities align, even if overall graph size or edge count differs significantly. Algorithmic judgements echo this pattern; methods that incorporate node degree and community detection consistently outperform those relying solely on global graph properties. This suggests that effective graph comparison requires an understanding of not just what is connected, but how those connections form meaningful clusters, influencing the perceived similarity and impacting the accuracy of computational assessments. Consequently, algorithms mirroring human sensitivity to these structural elements prove more robust and reliable in discerning nuanced relationships between complex networks.
The comparison of complex networks isn’t solely determined by overall size or broad structure; rather, the presence of recurring patterns – known as motifs or substructure – significantly complicates the process of both human visual assessment and computational analysis. These motifs, small, interconnected subgraphs that appear more often than expected by chance, act as building blocks within larger networks, and their identification requires detailed examination beyond simple node and edge counts. A network rich in specific motifs will present a unique ‘signature’ influencing how readily it’s perceived as similar to another, even if differing in overall topology. Consequently, algorithms designed for graph comparison must account for these substructures, as failing to do so can lead to inaccurate assessments of network similarity and obscure meaningful relationships between networks. The computational cost of motif detection, however, represents a substantial challenge in scaling these analyses to very large networks.
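To make the motif point concrete: two graphs with identical node and edge counts can still carry different motif signatures. The invented four-node examples below are separated immediately by a brute-force triangle census, which is fine at toy scale but, as noted above, far too costly for large networks without smarter enumeration.

```python
from itertools import combinations

def triangle_count(adj):
    """Count triangles by testing every node triple (O(n^3); illustrative,
    not how a real motif census scales to large networks)."""
    count = 0
    for a, b, c in combinations(adj, 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            count += 1
    return count

# Both graphs have 4 nodes and 4 edges, yet different motif signatures:
triangle_rich = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}    # triangle + pendant
triangle_free = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}    # a 4-cycle

print(triangle_count(triangle_rich))  # 1
print(triangle_count(triangle_free))  # 0
```

Any similarity measure that looks only at node and edge counts scores these two graphs as identical; a motif-aware measure does not.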
Current large language models demonstrate varying capabilities in graph comparison, with GPT-5 exhibiting stronger alignment to human perceptual judgements than its counterparts. However, this enhanced reasoning comes at a cost: GPT-5 requires significantly longer processing times – an average of 39.26 seconds – to reach a conclusion compared to Claude Sonnet 4.5’s 7.37 seconds. This discrepancy suggests a fundamental trade-off between the complexity of the model’s internal reasoning and its operational speed; while GPT-5 may more accurately reflect human understanding of graph similarity, its slower inference latency presents challenges for real-time applications or large-scale network analysis. The observed difference highlights the ongoing need to optimize model architectures for both accuracy and efficiency in the domain of complex network comparison.

The study meticulously charts the inevitable drift between computational metrics and human judgment when assessing graph similarity. It acknowledges that any attempt to quantify visual perception is inherently transient. This resonates with Claude Shannon’s assertion that, “The most important thing in communication is to convey the meaning, not the signal.” The research highlights how MLLMs, despite their imperfections, move closer to mirroring human assessment – acknowledging that perfect alignment is an asymptotic goal. Like all systems, the metrics employed will degrade over time, necessitating continuous recalibration to maintain relevance as the ‘signal’ of human visual processing evolves. Latency, in this context, is the increasing divergence between the computational assessment and the nuances of human perception.
What Lies Ahead?
The endeavor to quantify graph similarity, as explored in this work, inevitably highlights the inherent decay in any representational system. Each benchmark, each metric, is merely a snapshot – a momentary calibration against a moving target of human perception. The observed alignment between multimodal large language models and human judgment isn’t a triumph of replication, but a shared vulnerability to the same erosive forces of simplification. Time isn’t a measure of progress here, but the medium in which these approximations degrade.
Future iterations will likely focus on refining these models’ capacity to detect nuance – the subtle deviations from established patterns that often hold the most critical information. However, the pursuit of perfect correlation is a phantom. The true utility of these systems may not lie in mirroring human assessment, but in exceeding it – in revealing structural similarities that remain invisible to the unaided eye, even if those revelations challenge established intuition.
Ultimately, this research underscores a fundamental truth: incidents – discrepancies between algorithmic and human judgment – aren’t failures, but essential steps toward a more robust understanding. Each point of divergence illuminates the boundaries of current methods and guides the development of more resilient systems – systems that acknowledge, rather than resist, the inevitable march of entropy.
Original article: https://arxiv.org/pdf/2602.22416.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-01 11:15