Author: Denis Avetisyan
New research reveals that artificial intelligence agents, analyzing the same data independently, can produce inconsistent results driven by subtle choices in which measures they compute.

A multiverse analysis demonstrates substantial nonstandard errors in AI-driven empirical studies, highlighting the limits of automated peer review and the potential for agents both to converge and to diverge when exposed to top-tier published research.
Despite the promise of reproducible science, empirical research remains susceptible to variation stemming from analytical choices, a phenomenon known as nonstandard error. This paper, ‘Nonstandard Errors in AI Agents’, investigates whether state-of-the-art AI coding agents exhibit similar inconsistencies when independently analyzing the same dataset. We find substantial divergence among agents, analogous to that observed among human researchers, driven primarily by differing methodological preferences in measure selection. If AI is to be broadly deployed in automated policy evaluation and empirical research, can we develop methods to mitigate these “nonstandard errors” without sacrificing analytical flexibility?
The Fragility of Tradition: Errors in Financial Modeling
Historically, financial modeling has been deeply rooted in bespoke code, meticulously written and maintained by individual analysts. This approach, while allowing for nuanced customization, introduces significant vulnerabilities to error: a single misplaced line of code can propagate throughout an entire model, leading to flawed conclusions. Furthermore, the manual nature of this process severely limits scalability; adapting models to incorporate new data, test different hypotheses, or expand to larger datasets requires substantial time and resources. As financial markets generate increasingly complex data streams, the limitations of human-crafted code become more pronounced, hindering both the speed and reliability of financial research and decision-making. The inherent difficulties in auditing and reproducing results from these complex, manually maintained systems also pose a substantial risk to transparency and accountability within the financial industry.
The pursuit of reliable and verifiable findings in finance is increasingly necessitating a move towards automated systems. Traditional financial modeling, historically reliant on manually written code, suffers from inherent limitations in scalability and a propensity for human error – issues that directly impede reproducibility. Automated methodologies offer a solution by enabling researchers to rigorously test hypotheses across vast datasets, such as those generated by high-frequency trading, with greater speed and precision. This shift isn’t merely about efficiency; it’s about building a more solid foundation for financial knowledge, where results can be consistently replicated and validated, fostering increased trust in financial research and its applications. The automation of workflows, from data acquisition and cleaning to model building and evaluation, promises a new era of robustness in a field where even minor inaccuracies can have significant consequences.
The increasing availability of high-frequency financial data, such as that from the New York Stock Exchange’s Trade and Quote (TAQ) database, presents both opportunities and significant methodological challenges. While this data granularity allows for more rapid hypothesis testing, traditional statistical methods often fall short in accurately estimating uncertainty. Analyses of high-frequency data frequently exhibit substantial nonstandard errors – meaning the usual assumptions underlying statistical tests are violated due to factors like serial correlation, heteroscedasticity, and the impact of market microstructure noise. Consequently, researchers must employ specialized techniques, including robust standard error estimation and alternative inference methods like bootstrapping or permutation tests, to avoid drawing spurious conclusions and ensure the reliability of findings derived from these complex datasets. Addressing these statistical nuances is crucial for advancing quantitative finance and building truly robust automated trading strategies.
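The resampling idea behind the robust inference methods mentioned above can be sketched with a moving-block bootstrap. This is a minimal, self-contained illustration on a synthetic AR(1) return series, not the paper's code; the series length, block size, and autocorrelation coefficient are arbitrary assumptions chosen only to show why the classical standard error misleads under serial correlation.

```python
import random
import statistics

random.seed(0)

# Synthetic AR(1) return series: serially correlated, as high-frequency
# returns often are, which violates the i.i.d. assumption behind the
# classical standard error of the mean.
returns = [0.0]
for _ in range(999):
    returns.append(0.5 * returns[-1] + random.gauss(0, 0.01))

n = len(returns)
classical_se = statistics.stdev(returns) / n ** 0.5

# Moving-block bootstrap: resample contiguous blocks so that the serial
# dependence inside each block is preserved, unlike plain resampling.
block = 20
boot_means = []
for _ in range(2000):
    sample = []
    while len(sample) < n:
        start = random.randrange(n - block)
        sample.extend(returns[start:start + block])
    boot_means.append(statistics.fmean(sample[:n]))

block_bootstrap_se = statistics.stdev(boot_means)

# With positive autocorrelation, the classical SE understates uncertainty.
print(classical_se, block_bootstrap_se)
```

Under positive autocorrelation the bootstrap standard error comes out noticeably larger than the classical one, which is exactly the gap that naive tests on high-frequency data ignore.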
Automated Inquiry: Deploying AI Agents for Financial Analysis
The research process was fully automated through the deployment of AI Agents utilizing the Claude Code platform. This involved constructing agents capable of independently performing tasks typically completed by human researchers, including data acquisition, analysis, and report generation. Claude Code served as the execution environment, enabling the agents to process information and produce outputs without manual intervention. The implementation aimed to increase research throughput and reduce potential biases inherent in manual analysis, offering a scalable solution for financial investigations.
The research process utilized two large language model (LLM) variants, Sonnet 4.6 and Opus 4.6, deployed through the Claude Code platform. Sonnet 4.6, known for its cost-effectiveness and speed, was employed for initial data processing and preliminary analysis. Opus 4.6, Anthropic’s most powerful model, was then leveraged for more complex analytical tasks, including the generation of comprehensive reports and the interpretation of nuanced market data. This tiered approach allowed for optimization of both computational resources and analytical rigor, ensuring efficient and thorough investigation of the SPY ETF and related financial metrics.
AI Agents were utilized to analyze the SPY ETF, with a primary focus on quantifying market efficiency through the examination of bid-ask spread characteristics. Specifically, the agents assessed both quoted and realized bid-ask spreads, noting a low cross-agent variation of 0.43% in the quoted spread. This metric provides insight into the liquidity and transaction costs associated with trading the SPY ETF, and its relatively narrow spread suggests a highly efficient market for this particular financial instrument. Analysis of the realized spread, alongside the quoted spread, allows for a more comprehensive understanding of the true costs incurred by traders.
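For concreteness, the two spread measures can be sketched as follows. These are the standard textbook definitions; the prices, the trade-direction convention, and the use of a later midpoint for the realized spread are illustrative assumptions, not details taken from the paper.

```python
# Quoted spread: ask minus bid, scaled by the quote midpoint.
# Realized spread: compares the trade price with the midpoint a short
# interval later, capturing the cost actually borne by the liquidity
# demander after temporary price pressure reverts.

def quoted_spread(bid, ask):
    mid = (bid + ask) / 2
    return (ask - bid) / mid

def realized_spread(price, mid_now, mid_later, direction):
    # direction: +1 for a buyer-initiated trade, -1 for seller-initiated.
    return 2 * direction * (price - mid_later) / mid_now

# Toy example: a buy at 100.02 against a 100.00/100.04 quote, with the
# midpoint drifting to 100.01 some minutes later.
qs = quoted_spread(100.00, 100.04)
rs = realized_spread(100.02, 100.02, 100.01, +1)
print(round(qs * 1e4, 2), "bps quoted;", round(rs * 1e4, 2), "bps realized")
```

The gap between the two numbers is informative: when the realized spread is smaller than the quoted spread, part of the quoted cost reflects price impact rather than revenue captured by the liquidity provider.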

Uncertainty Quantified: Dissecting Nonstandard Errors
Analysis of AI-generated results revealed the presence of Nonstandard Errors (NSE), a phenomenon also observed in human research where different analytical approaches yield varying results. Specifically, key market quality measures exhibited an interquartile range (IQR) of up to 10.70%, indicating a substantial degree of variability attributable to analytical choices rather than solely to random error. This NSE suggests that the reported findings are sensitive to the specific analytical methods employed and that a range of plausible values exists for the measured market qualities. The magnitude of this IQR serves as a quantifiable metric for the uncertainty inherent in the analysis.
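The IQR-across-agents metric described above takes only a few lines to compute. The per-agent estimates below are invented for illustration; only the computation mirrors how a relative IQR such as the 10.70% figure would be reported.

```python
import statistics

# Hypothetical per-agent estimates of one market-quality measure
# (e.g. a spread in basis points); the figures are invented.
agent_estimates = [3.1, 3.2, 3.0, 3.3, 3.6, 2.9, 3.4, 3.2]

# Quartiles of the cross-agent distribution; the IQR is Q3 - Q1.
q = statistics.quantiles(agent_estimates, n=4)
iqr = q[2] - q[0]
median = statistics.median(agent_estimates)

# Report dispersion relative to the median, mirroring the percentage
# IQRs quoted in the text.
relative_iqr_pct = 100 * iqr / median
print(f"IQR = {iqr:.3f} ({relative_iqr_pct:.2f}% of the median)")
```

Because every estimate comes from a defensible analysis of the same data, this spread measures disagreement attributable to analytical choices rather than sampling noise.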
Nonstandard errors (NSE) observed in the AI-generated results are directly attributable to the ‘Measure Choice Fork’, which represents systematic disagreements arising from differing methodological selections during analysis. Specifically, variations in the choice of statistical techniques – such as differing regression methods or time series analyses – contribute to a range of possible outcomes even when analyzing the same dataset. This isn’t random error, but rather a consequence of legitimate analytical choices that produce demonstrably different, yet defensible, results. The magnitude of this effect is reflected in the interquartile range (IQR) of key market quality measures, which reached up to 10.70%, indicating a substantial degree of variability stemming from these methodological differences.
Multiverse Analysis was implemented to assess the sensitivity of results to analytical choices. This involved systematically varying the methodologies used by the AI agents, specifically exploring different implementations of Linear Regression, Autocorrelation, and Variance Ratio techniques, alongside a broad range of parameter settings for each. By executing the analysis across this spectrum of options, we quantified the degree to which findings remained consistent despite alterations in the analytical approach, providing a measure of robustness beyond traditional statistical significance testing and accounting for the observed Nonstandard Errors stemming from methodological differences.
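A multiverse loop of this kind can be sketched as follows: each method-parameter combination runs as its own specification, and the spread of the resulting estimates is what the analysis then summarizes. The synthetic price series, the lag and horizon grids, and the simplified estimators below are all assumptions for illustration, not the agents' actual implementations.

```python
import random
import statistics

random.seed(1)
# Synthetic price series standing in for TAQ-style data.
prices = [100.0]
for _ in range(499):
    prices.append(prices[-1] * (1 + random.gauss(0, 0.001)))
returns = [(b - a) / a for a, b in zip(prices, prices[1:])]

def autocorr(xs, lag):
    m = statistics.fmean(xs)
    num = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

def variance_ratio(xs, q):
    # Variance of q-period returns over q times the 1-period variance;
    # close to 1 under a random walk.
    agg = [sum(xs[i:i + q]) for i in range(len(xs) - q + 1)]
    return statistics.variance(agg) / (q * statistics.variance(xs))

# The "multiverse": every method-parameter combination is a separate,
# defensible specification.
specs = {}
for lag in (1, 2, 5):
    specs[f"autocorr(lag={lag})"] = autocorr(returns, lag)
for q in (2, 5, 10):
    specs[f"variance_ratio(q={q})"] = variance_ratio(returns, q)

for name, value in sorted(specs.items()):
    print(f"{name}: {value:+.4f}")
```

The point of the loop is not any single estimate but the distribution over `specs`: if conclusions flip across reasonable entries in the grid, the finding is fragile.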
In Stage 3, AI agents utilized a suite of analytical methods – including Linear Regression, Autocorrelation, and Variance Ratio – each tested across a range of parameter settings to assess the sensitivity of results. This process yielded an 80-99% reduction in the Interquartile Range (IQR) for measure families that exhibited convergence, indicating increased robustness of the findings. Specifically, analysis of price impact demonstrated an IQR of 10.34% following this multiverse approach, suggesting a quantifiable level of variability even after accounting for methodological choices.
The Value of Scrutiny: Peer Review by Artificial Intelligence
To bolster the reliability of generated financial reports, an automated peer review system was integrated into the research pipeline. This process utilized independent AI agents tasked with critically assessing the methodologies and conclusions drawn by the primary research AI. Rather than human oversight, the validation relied entirely on algorithmic scrutiny, allowing for a scalable and objective evaluation of each report’s internal consistency and logical soundness. This approach facilitated the identification of potential flaws or biases that might otherwise go unnoticed, effectively creating a self-correcting system within the AI research framework and pushing the boundaries of automated financial analysis.
To bolster the reliability of AI-driven financial analysis, a system of automated peer review was implemented, employing independent AI agents to scrutinize the work of their counterparts. These secondary agents didn’t simply confirm results, but actively assessed the methodologies used by the primary research agents, examining the logical flow of analysis and the validity of assumptions. This process moved beyond mere replication, functioning as a critical appraisal of the research process itself, identifying potential weaknesses or biases in the initial findings. The system created a feedback loop, allowing for iterative improvement in the AI’s analytical capabilities and fostering a more robust and trustworthy research environment.
The AI agents didn’t simply accept initial research outputs; instead, they rigorously scrutinized methodologies by cross-validating findings against key financial metrics like Trading Volume and Intraday Volatility. This peer review process actively sought out potential biases within the models, and the resulting analysis uncovered a strong negative correlation – specifically, a coefficient of -0.601 – between the AI agents’ evaluation ratings and the presence of nonstandard errors in price impact calculations. This suggests that the AI peer review system effectively identified research with questionable statistical validity, demonstrating its capacity to enhance the reliability and trustworthiness of AI-driven financial analysis.
The integration of artificial intelligence into financial research presents a compelling shift towards scalable and reproducible modeling. Recent studies indicate that AI-driven analyses, when subjected to stringent validation protocols – such as peer review by other AI agents – yield robust and reliable results. This approach moves beyond the limitations of traditional financial modeling, which often relies on manual processes and can be susceptible to human bias or error. The capacity for AI to rapidly analyze vast datasets and cross-validate findings offers a significant advantage, while the implementation of automated peer review ensures methodological rigor. Ultimately, this combined approach promises a future where financial insights are generated with increased efficiency, transparency, and consistency, potentially reshaping the landscape of quantitative finance.
The study meticulously details how subtle choices in measurement – the ‘measure choice forks’ – introduce significant nonstandard errors within AI agents conducting empirical research. This echoes Vinton Cerf’s sentiment: “The Internet treats everyone the same.” In a similar vein, the AI agents, unburdened by human intuition, treat each measurement option with equal weight, leading to divergent results. The paper highlights a fascinating paradox: while these agents can rapidly process information, they lack the critical judgment to discern meaningful variation from statistical noise, demonstrating a need for refinement beyond sheer computational power. The exploration of multiverse analysis, in particular, reveals the sensitivity of conclusions to these arbitrary starting points.
What Remains to Be Resolved
The observed susceptibility of automated agents to measure choice forks is not merely a technical problem; it reveals a fundamental limitation in equating computational efficiency with epistemic rigor. The proliferation of nonstandard errors suggests that scaling automated research does not automatically yield improved knowledge – quite the opposite. The current paradigm prioritizes quantity of results over the quality of inference. A necessary correction involves developing metrics for epistemic stability, quantifying the sensitivity of conclusions to seemingly innocuous methodological choices.
The limited efficacy of AI peer review is a pointed rebuke to simplistic notions of automated quality control. The observation that exposure to high-impact publications can both converge and diverge agent behavior is particularly unsettling. It implies that the existing literature, rather than serving as a foundation for truth, may itself propagate systematic errors. Further investigation must explore the conditions under which agents internalize bias from curated datasets, and whether mechanisms can be devised to encourage genuinely critical assessment.
Ultimately, the task is not to perfect automated research, but to understand its inherent limitations. The pursuit of “artificial intelligence” should not distract from the fact that intelligence – of any kind – is inextricably linked to a framework of values. A purely computational approach, divorced from considerations of justification and error, risks amplifying noise rather than illuminating signal. Density of meaning, not volume, remains the ultimate metric.
Original article: https://arxiv.org/pdf/2603.16744.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
See also:
- Spotting the Loops in Autonomous Systems
- Seeing Through the Lies: A New Approach to Detecting Image Forgeries
- Staying Ahead of the Fakes: A New Approach to Detecting AI-Generated Images
- Palantir and Tesla: A Tale of Two Stocks
- The Glitch in the Machine: Spotting AI-Generated Images Beyond the Obvious
2026-03-18 14:03