AI vs. the Data Scientist: When Code Isn’t Enough

Author: Denis Avetisyan


New research reveals that current AI agents struggle to match human performance on complex data science tasks, particularly when domain expertise embedded in visual data is crucial.

A comparison of predictive modeling approaches reveals that incorporating domain knowledge (specifically, inferring roof health from visual data and combining it with tabular data) enables substantially higher predictive performance ($\text{normalized Gini} = 0.8310$) than standard tabular modeling that disregards visual cues and domain expertise ($\text{normalized Gini} = 0.3823$).

Agentic AI systems relying on tabular data and generic code generation demonstrably underperform human data scientists who leverage image-based domain knowledge, as measured by Normalized Gini.

Despite advances in automating data science workflows, a critical gap remains between the performance of agentic AI and human experts leveraging domain-specific knowledge. This study, ‘Can Agentic AI Match the Performance of Human Data Scientists?’, investigates this limitation by presenting a predictive task in which crucial information is embedded within image data, a modality where generic, tabular-focused agentic AI struggles. Our experiments, using a synthetic property insurance dataset, demonstrate that current agentic AI systems fall short of human performance, highlighting the necessity of incorporating domain knowledge for robust predictive modeling. Will future research yield agentic AI capable of effectively recognizing and integrating such contextual insights to truly match, or even surpass, human data scientists?


Forecasting Risk: The Foundation of Insurance Stability

The foundation of sound financial strategy within property insurance rests upon the ability to forecast potential losses with precision. Accurate loss prediction isn’t merely about calculating premiums; it’s a vital component of risk management that directly influences a company’s solvency and its capacity to fulfill policyholder obligations. Without reliable predictions, insurers risk underpricing policies and facing substantial financial strain from unexpected events, or conversely, overpricing and losing market share to competitors. Effective loss prediction allows for optimal capital allocation, proactive mitigation strategies, and the development of innovative insurance products tailored to evolving risk landscapes. Ultimately, the capacity to anticipate and quantify potential losses is inextricably linked to the long-term stability and success of any property insurance provider, impacting not only its bottom line but also its ability to provide security and peace of mind to its customers.

Established actuarial methods, foundational to property insurance for decades, increasingly face limitations when addressing contemporary risk landscapes. These techniques, often reliant on historical data and simplified models, can struggle to incorporate the sheer volume and variety of modern risk factors – from climate change-induced extreme weather events to the nuanced impacts of evolving building materials and urban development. The proliferation of data sources, while offering potential for improved accuracy, also presents challenges in data integration, cleaning, and meaningful analysis. Consequently, traditional approaches may underestimate potential losses or fail to accurately capture the correlations between diverse risk variables, hindering effective risk management and potentially impacting the financial stability of insurers.

Property insurance outcomes are determined by a combination of structured policy features and a hidden ‘Roof Health’ variable, encoded in a visual roof image, that influences claim counts and losses, requiring image-based inference for optimal prediction.

Automating Risk Analysis with Agentic Intelligence

Agentic AI systems leverage Large Language Models (LLMs) to autonomously execute data science tasks typically requiring human intervention. This automation extends beyond simple scripting; these systems can dynamically chain together multiple steps – data ingestion, cleaning, feature engineering, model selection, and evaluation – based on observed data and defined objectives. By utilizing LLMs, agentic systems interpret data schemas, suggest appropriate analytical techniques, and iteratively refine workflows without explicit programming. This capability facilitates the automation of complex processes like loss prediction, reducing the need for manual data manipulation and model tuning, and enabling faster iteration on analytical solutions.
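To make the idea concrete, here is a minimal sketch of such a loop, assuming the OpenAI Python client and a naive stopping rule; the prompts, model name, and execution strategy are illustrative assumptions, not the system evaluated in the paper.

```python
# Minimal sketch of an agentic data-science loop (illustrative only).
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# the prompts, model name, and stopping rule are hypothetical, not the paper's system.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def run_agentic_pipeline(df: pd.DataFrame, objective: str, max_steps: int = 5) -> dict:
    """Iteratively ask an LLM for the next analysis step and execute it."""
    namespace = {"df": df}
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Objective: {objective}\n"
            f"Available columns: {list(df.columns)}\n"
            f"Steps already executed: {len(history)}\n"
            "Write the next Python snippet (cleaning, feature engineering, or modeling) "
            "operating on the DataFrame `df`. Store any fitted model in a variable "
            "named `model`. Reply with code only, no markdown."
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        code = reply.choices[0].message.content
        history.append(code)
        exec(code, namespace)        # a real system would sandbox and validate this step
        if "model" in namespace:     # naive stopping rule: stop once a model exists
            break
    return namespace
```

A production system would sandbox the generated code and validate each intermediate result, but even this toy loop captures the core pattern: the LLM plans, the host executes, and the outcome feeds the next prompt.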

Agentic AI systems demonstrate the capability to integrate and analyze data from multiple sources to produce loss predictions. This functionality extends beyond traditional tabular datasets, such as customer demographics and financial history, to encompass unstructured inputs like images. The ingestion process allows these systems to extract features from images (for example, damage assessments from property photos) and combine those insights with structured data to refine predictive models. This multi-modal approach aims to improve prediction accuracy by leveraging a broader range of potentially relevant information than is typically used in standard tabular-only modeling.
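As a rough illustration of how an image could be folded into such a pipeline, the sketch below asks a vision-language model to grade a roof photo and returns a label that can be joined onto the policy's tabular record; the prompt, label set, and model name are assumptions made for illustration.

```python
# Sketch: turning a roof photo into a tabular feature with a vision-language model.
# The prompt, label set, and model name are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def roof_health_from_image(image_path: str) -> str:
    """Ask gpt-4o-mini to grade a roof photo as good, fair, or bad."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify the roof condition in this photo as exactly one word: good, fair, or bad."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()

# The label can then be merged into the policy table before training, e.g.:
# policies["roof_health"] = policies["image_path"].map(roof_health_from_image)
```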

Agentic AI systems trained exclusively on tabular data for loss prediction achieve a Normalized Gini coefficient of 0.3823. This metric indicates a limited capacity for accurate prediction when the model relies solely on structured, numerical data. The Normalized Gini assesses the discriminatory power of a model, with 1 indicating a perfect ranking of losses and 0 indicating no better than random; a value of 0.3823 suggests the model offers only moderate separation between high-loss and low-loss cases, highlighting the need to incorporate alternative data modalities, such as image data, to improve predictive performance and overall model utility.
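Normalized Gini is the evaluation metric used throughout the study; a standard way to compute it (this implementation is a common formulation, not code taken from the paper) is to divide the Gini of the model's ranking by the Gini of a perfect ranking:

```python
import numpy as np

def gini(actual, pred):
    """Gini coefficient of `actual` when ordered by descending `pred`."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # sort by prediction (descending), breaking ties by original position
    order = np.lexsort((np.arange(len(actual)), -pred))
    sorted_actual = actual[order]
    cum_share = np.cumsum(sorted_actual) / sorted_actual.sum()
    return cum_share.sum() / len(actual) - (len(actual) + 1) / (2.0 * len(actual))

def normalized_gini(actual, pred):
    """Gini of the model's ranking divided by the Gini of a perfect ranking (max 1.0)."""
    return gini(actual, pred) / gini(actual, actual)

# A model that ranks losses perfectly scores 1.0; a random ranking scores about 0.
# normalized_gini([0, 0, 10, 50], [0.1, 0.2, 0.6, 0.9])  # -> 1.0
```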

A synthetic property insurance dataset was created using text-to-image generation to visually represent roof health (good, fair, or bad), allowing for comparative analysis of AI pipelines utilizing tabular data versus those incorporating visual information.

Enhancing Prediction Through Visual Data Integration

Image data, sourced from aerial photography or on-site inspections, offers detailed assessments of property characteristics relevant to risk evaluation. This data includes visual indicators of structural integrity, material degradation, landscaping conditions, and potential hazards such as standing water or overgrown vegetation. Analysis of these visual cues allows for the identification of conditions that may not be readily apparent through traditional data sources, providing a more comprehensive understanding of a property’s current state and potential future liabilities. The granularity of detail captured in photographic data facilitates accurate condition scoring and enables proactive identification of maintenance needs or escalating risks.

The integration of computer vision models, specifically the CLIP model and Vision-Language Models like gpt-4o-mini, enables the automated extraction of relevant features from image data. CLIP, or Contrastive Language-Image Pre-training, establishes a relationship between visual and textual representations, allowing the model to understand the content of images. These models process aerial or on-site photographs to identify property characteristics and potential risks, converting visual information into quantifiable data points. The extracted features are then used as inputs for predictive modeling, enhancing the accuracy of risk assessment and property evaluation beyond what is achievable with traditional data sources alone.
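A brief sketch of this extraction step, using the Hugging Face `transformers` implementation of CLIP (the checkpoint and text prompts are illustrative assumptions; the paper's exact setup may differ), might look like:

```python
# Sketch: extracting visual features from a roof photo with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("roof.jpg")
prompts = ["a roof in good condition", "a roof in fair condition", "a roof in bad condition"]

with torch.no_grad():
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

# (a) Dense image embedding: numeric features for a downstream tabular model.
image_embedding = outputs.image_embeds.squeeze(0).numpy()                      # shape: (512,)
# (b) Zero-shot similarity to the prompts: an inferred roof-health distribution.
condition_probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0).numpy()  # shape: (3,)
```

Both outputs are useful downstream: the dense embedding can be appended to the tabular feature matrix, while the prompt similarities serve as an explicit, interpretable roof-condition signal.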

Incorporating CLIP-derived visual features demonstrably enhances predictive accuracy when analyzing property data. Utilizing CLIP image embeddings alone yields a Normalized Gini score of 0.5042, while leveraging the complete set of CLIP features raises it to 0.7719. This roughly 53% relative increase in Normalized Gini indicates that visual information, when processed through CLIP models, contributes substantially to more accurate risk assessment and property evaluation than methods relying solely on non-visual data.
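One straightforward way to realize this fusion, sketched below under assumed array shapes and an assumed gradient-boosting learner, is to concatenate the image embedding with each policy's structured features before training:

```python
# Sketch: fusing structured policy features with per-image CLIP embeddings.
# Shapes, column layout, and the gradient-boosting choice are illustrative assumptions.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

def fit_loss_model(tabular: np.ndarray, clip_embeddings: np.ndarray, losses: np.ndarray):
    """tabular: (n, n_policy_features); clip_embeddings: (n, 512); losses: (n,)."""
    X = np.hstack([tabular, clip_embeddings])   # simple early fusion of the two modalities
    X_train, X_test, y_train, y_test = train_test_split(
        X, losses, test_size=0.2, random_state=0
    )
    model = HistGradientBoostingRegressor()
    model.fit(X_train, y_train)
    return model, model.predict(X_test), y_test
```

The held-out predictions can then be scored with the normalized Gini function sketched earlier to reproduce the kind of comparison reported above.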

Validating Predictive Power with Synthetic Datasets

Synthetic data offers a controlled environment for the rigorous evaluation of predictive models by allowing manipulation of key input variables, such as Roof Health. This capability bypasses the limitations of relying solely on real-world data, which often suffers from imbalances, missing values, or inherent biases. By generating datasets with precise control over these variables, developers can systematically assess model performance under diverse conditions and isolate the impact of specific features on prediction accuracy. This controlled experimentation facilitates targeted model refinement and optimization, ultimately leading to more robust and reliable predictive capabilities.

The use of synthetic data facilitates a granular analysis of loss prediction model performance by enabling controlled variation of input features. Specifically, researchers can systematically adjust factors like Roof Health while maintaining consistency in other variables to observe the resultant impact on prediction accuracy. This methodology allows for the identification of which features most strongly influence model performance, and conversely, which features have minimal effect. By quantifying these relationships, data scientists can pinpoint areas where the model requires refinement, such as feature engineering or algorithm selection, leading to targeted improvements in loss prediction capabilities. This controlled experimentation is difficult, if not impossible, to achieve with real-world, observational data due to confounding variables and inherent data limitations.
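A minimal generator in this spirit, with made-up distributions and effect sizes purely for illustration (the paper's actual synthetic dataset additionally encodes roof health in generated images), might look like:

```python
# Sketch: synthetic policies with a controllable Roof Health factor driving claims.
# Distributions, effect sizes, and column names are illustrative assumptions.
import numpy as np
import pandas as pd

def generate_policies(n: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    roof_health = rng.choice(["good", "fair", "bad"], size=n, p=[0.5, 0.3, 0.2])
    building_age = rng.integers(0, 60, size=n)
    insured_value = rng.lognormal(mean=12.0, sigma=0.4, size=n)

    # Roof health multiplies the expected claim frequency; the multipliers are made up.
    multiplier = np.where(roof_health == "good", 1.0,
                 np.where(roof_health == "fair", 2.0, 4.0))
    claim_count = rng.poisson(0.05 * multiplier)
    severity = rng.gamma(shape=2.0, scale=insured_value * 0.01)
    loss = claim_count * severity

    return pd.DataFrame({
        "roof_health": roof_health,    # hidden from tabular-only pipelines in the study
        "building_age": building_age,
        "insured_value": insured_value,
        "claim_count": claim_count,
        "loss": loss,
    })
```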

Roof health assessment performed with the gpt-4o-mini model, within a pipeline designed by human data scientists, yielded a Normalized Gini coefficient of 0.7271. Performance improved significantly, to 0.8310, when perfectly accurate roof health labels were provided. This score approaches the theoretical maximum, or oracle performance, of 0.8379, indicating that the approach, when paired with accurate data, achieves near-optimal predictive capability for this task.

The Convergence of Human Insight and Artificial Intelligence

Despite the increasing capabilities of Agentic AI in automating crucial aspects of property insurance risk assessment, the role of human data scientists remains indispensable. These professionals are critical for validating the AI’s outputs, ensuring accuracy and preventing potentially costly errors that automated systems might overlook. Beyond simple verification, human expertise is essential for incorporating nuanced domain knowledge – understanding local building codes, regional weather patterns, and specific property characteristics – that significantly impacts risk evaluation. Complex cases, involving unique property features or incomplete data, often require the interpretive skills and critical thinking abilities that currently exceed AI’s capacity, necessitating human intervention to arrive at informed and reliable risk assessments.

The future of property insurance risk assessment lies in a powerful convergence of automated analysis and human judgment. Agentic AI systems excel at processing vast datasets and identifying patterns, but these systems are not infallible; they require validation and contextualization. By integrating the speed and scalability of AI with the nuanced understanding of experienced data scientists, insurers can achieve significantly higher levels of accuracy in evaluating risk factors. This synergy isn’t simply about augmenting existing processes; it enables the identification of subtle, complex relationships that might otherwise be missed, leading to more precise underwriting, optimized pricing, and ultimately, more resilient insurance practices. The result is a streamlined workflow where AI handles routine tasks, while human experts focus on complex cases, ensuring both efficiency and a thorough, reliable assessment of property risk.

The convergence of artificial intelligence and human expertise promises a paradigm shift in property insurance risk modeling, fostering not merely incremental improvements but genuinely innovative practices. By leveraging AI’s capacity for high-volume data analysis and pattern recognition, alongside the nuanced judgment and domain-specific knowledge of human data scientists, insurers can move beyond traditional, static models. This collaborative approach enables the creation of dynamic risk assessments that adapt to evolving environmental factors, incorporate localized vulnerabilities, and anticipate previously unforeseen threats. The resulting resilience extends beyond individual policies, strengthening the entire insurance ecosystem’s ability to withstand increasingly complex and frequent catastrophic events, and ultimately delivering more accurate and sustainable coverage for policyholders.

The study illuminates a critical point regarding system architecture and performance. It reveals that simply scaling code generation-even with agentic AI-doesn’t guarantee superior results when fundamental data understanding is lacking. This resonates with the observation of Carl Friedrich Gauss: “If other objects are considered, and their influence upon the observed quantities is known, it is possible to eliminate these disturbances; but if their influence is unknown, it is hopeless to determine the true state of things.” The agentic AI, operating primarily on tabular data, struggles with the ‘unknown influences’ inherent in the image modality – the nuanced domain knowledge a human data scientist intuitively incorporates. The paper demonstrates that predictive performance isn’t merely about computational power, but about a holistic grasp of the underlying system and the quality of information fed into it, mirroring Gauss’s emphasis on accounting for all influencing factors.

Where Do We Go From Here?

The limitations revealed by this work are not merely technical; they speak to a fundamental misunderstanding of intelligence itself. Current agentic systems, proficient as they are at manipulating symbols, remain brittle when confronted with data requiring embodied understanding – the intuitive grasp of relationships inherent in image modalities. Systems break along invisible boundaries – if one cannot see them, pain is coming. The observed performance gap isn’t about code generation speed; it’s about the absence of a framework for integrating abstract reasoning with perceptual information.

Future research must move beyond the pursuit of algorithmic cleverness and address the architecture of knowledge. Simply scaling language models, or even coupling them with code interpreters, will not suffice. The focus should shift towards creating systems capable of building internal representations that mirror the layered complexity of the real world – representations that privilege structure and context. Consider the human data scientist: they do not simply ‘see’ pixels, but interpret scenes, understand physical laws, and leverage decades of accumulated visual experience.

The path forward demands a more holistic approach. The emphasis must be on building systems that can not only process information but also understand its implications within a broader, interconnected framework. Anticipating weaknesses requires recognizing that predictive power doesn’t reside in the data itself, but in the system’s ability to extract meaningful patterns and apply them with nuance and discernment.


Original article: https://arxiv.org/pdf/2512.20959.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
