Author: Denis Avetisyan
New research reveals that large language models aren’t immune to the same cognitive errors that plague human judgment, raising concerns about their use in critical operational roles.

This study demonstrates that large language models replicate and amplify human cognitive biases when solving the classic Newsvendor problem, suggesting complexity does not guarantee rational decision-making.
Despite increasing reliance on large language models (LLMs) for business decision-making, their potential to exacerbate human cognitive biases presents a significant, yet understudied, risk. This research, ‘Large Language Newsvendor: Decision Biases and Cognitive Mechanisms’, investigates decision-making patterns in leading LLMs using a dynamic newsvendor problem, revealing that these models not only replicate but often amplify biases like demand chasing and ordering errors. Surprisingly, our findings demonstrate a “paradox of intelligence,” where more sophisticated models exhibit greater irrationality through overthinking, while efficiency-optimized models perform near-optimally. Given these results, how can managers effectively select and constrain LLMs to mitigate bias and ensure reliable AI-assisted decisions in high-stakes operational contexts?
The Inevitable Drift: Bias in Human and Artificial Judgment
Human judgment, despite aspirations toward rationality, is consistently shaped by cognitive biases – systematic patterns of deviation from normatively optimal decision-making. These aren’t random errors, but predictable tendencies stemming from the brain’s reliance on heuristics – mental shortcuts that simplify complex problems. While often efficient, these shortcuts can lead to suboptimal outcomes across a spectrum of scenarios, from everyday purchasing decisions to critical judgments in professional settings. For instance, confirmation bias leads individuals to favor information confirming existing beliefs, while anchoring bias causes an over-reliance on initial pieces of information, regardless of relevance. The pervasiveness of such biases demonstrates that even seemingly logical individuals are susceptible to predictable irrationalities, impacting everything from financial investments to medical diagnoses and highlighting the need for strategies to mitigate their influence.
The “Newsvendor Problem” elegantly illustrates how human decision-making frequently diverges from economic rationality. This classic scenario posits an individual purchasing newspapers to resell, facing uncertain demand and the consequence of either being left with unsold copies or missing out on potential profit. While a simple calculation exists to determine the optimal order quantity – balancing the cost of overstocking against the cost of understocking – studies consistently demonstrate that people deviate from it in systematic ways, typically ordering too few copies when margins are high and too many when margins are low. This isn’t a matter of incompetence, but a cognitive bias: decision-makers anchor on average demand and weigh the prospect of leftover stock against missed sales asymmetrically, so their orders drift away from the profit-maximizing quantity. The problem isn’t solving for the optimal order, but the persistent tendency to deviate from it – an aversion to potential losses that can outweigh the potential for gains even when the expected value clearly favors a different course of action. The Newsvendor Problem therefore serves as a foundational model for understanding biases in inventory management and broader decision-making contexts.
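For readers who want the underlying arithmetic, the profit-maximizing order is the “critical fractile” quantile of the demand distribution. The sketch below is a minimal Python illustration; the prices and the Normal demand assumption are chosen for the example and are not taken from the study.

```python
from scipy.stats import norm

# Hypothetical economics for illustration only (not the study's parameters).
price, cost, salvage = 10.0, 6.0, 0.0

underage = price - cost      # profit forgone per unit of unmet demand
overage = cost - salvage     # loss per unit left unsold

# Critical fractile: the fraction of demand outcomes the order should cover.
critical_fractile = underage / (underage + overage)

# The optimal order is that quantile of the demand distribution,
# here assumed Normal with mean 100 and standard deviation 20.
q_star = norm.ppf(critical_fractile, loc=100, scale=20)
print(f"critical fractile = {critical_fractile:.2f}, optimal order = {q_star:.1f}")
```

With these illustrative numbers the fractile is 0.40, so the profit-maximizing order sits just below mean demand; the behavioral finding is that decision-makers, human or artificial, tend to drift away from this quantity round after round.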
While recognizing cognitive biases in decision-making is a vital first step, accurately gauging their influence within complex systems presents a significant challenge for traditional analytical methods. Recent investigations reveal a concerning trend: Large Language Models (LLMs), despite their advanced capabilities, can actually amplify these inherent biases – in some cases, by as much as 70%. This exacerbation suggests that simply automating decisions with LLMs doesn’t eliminate flawed reasoning; instead, it can scale those flaws to an unprecedented degree. Consequently, there is a growing demand for sophisticated modeling techniques that go beyond simple bias identification and move toward predictive quantification, allowing for proactive mitigation strategies and more robust, equitable outcomes in areas ranging from financial forecasting to criminal justice.

Simulating the Flawed Oracle: Modeling Bias with Large Language Models
The Newsvendor Problem, a classic stochastic modeling technique used to analyze inventory decisions under demand uncertainty, is being increasingly utilized as a framework for simulating human behavior with Large Language Models (LLMs). Researchers are employing LLMs to replicate the ordering processes of human subjects faced with the Newsvendor scenario, allowing for controlled experimentation and quantifiable analysis of cognitive biases. This approach involves presenting the LLM with simulated demand distributions and evaluating its resulting order quantities, enabling comparison with human ordering patterns and the identification of behavioral tendencies such as overstocking or understocking. The use of LLMs in this context provides a scalable and systematic method for studying human decision-making under uncertainty, moving beyond traditional experimental economics approaches.
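A rough sketch of the experimental loop is shown below: present the scenario, collect an order, reveal the round’s demand, and repeat. The `order_policy` callable is a stand-in for whatever prompt-and-parse wrapper around an actual LLM a given study uses; the demand parameters are illustrative.

```python
import numpy as np

def run_newsvendor_rounds(order_policy, n_rounds=30, rng=None):
    """Run repeated newsvendor rounds.

    `order_policy(history)` returns an order quantity given past demand;
    in a real experiment it would wrap a prompt to the LLM under test.
    """
    rng = rng or np.random.default_rng(42)
    history, orders = [], []
    for _ in range(n_rounds):
        orders.append(order_policy(history))
        history.append(float(rng.normal(100, 20)))   # demand revealed after ordering
    return np.array(orders), np.array(history)

# Example: a naive demand-chasing baseline that simply orders last round's demand.
orders, demand = run_newsvendor_rounds(lambda h: h[-1] if h else 100.0)
```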
Researchers are utilizing Large Language Models (LLMs) to conduct controlled experiments on behavioral biases in decision-making contexts, specifically within the Newsvendor Problem. These models allow for systematic manipulation of variables and observation of resulting ordering patterns, enabling the quantification of biases such as consistently ordering too much or too little stock. Investigations focus on three phenomena: ordering biases (systematic deviations from optimal quantities), presentation order effects (where the sequence in which information is presented influences decisions), and demand chasing (where orders are disproportionately pulled toward recent demand signals). By analyzing LLM behavior under varying conditions, researchers aim to isolate and understand the mechanisms driving these biases, providing insights into human cognitive processes and potentially improving forecasting and inventory management strategies.
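Given the logged orders and demands, the two headline biases reduce to simple statistics: the average gap between orders and the optimal quantity, and the sensitivity of each round’s order adjustment to the previous round’s demand surprise. A minimal sketch, reusing the arrays from the loop above (the exact estimators in the paper may differ):

```python
import numpy as np

def bias_metrics(orders, demand, q_star):
    """Mean ordering bias and a simple demand-chasing slope."""
    orders = np.asarray(orders, dtype=float)
    demand = np.asarray(demand, dtype=float)
    ordering_bias = float(np.mean(orders - q_star))     # > 0 means over-ordering on average
    # Demand chasing: regress the change in order on last round's demand surprise.
    order_change = np.diff(orders)
    demand_surprise = demand[:-1] - orders[:-1]
    chase_slope = float(np.polyfit(demand_surprise, order_change, 1)[0])
    return ordering_bias, chase_slope

# Example (using the arrays from the previous sketch):
#   ordering_bias, chase_slope = bias_metrics(orders, demand, q_star=94.9)
```

A chase slope near zero indicates orders that ignore round-to-round demand noise; a strongly positive slope is the signature of demand chasing.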
Within the Newsvendor Problem simulations, demand uncertainty is modeled using three probability distributions: Uniform, Normal, and Lognormal. These distributions dictate the range and likelihood of different demand levels, directly influencing the quantity of goods ordered by the Large Language Model. Recent research indicates significant deviations in ordering behavior when using GPT-4; specifically, under a Uniform distribution in a low-margin scenario, GPT-4 over-ordered roughly 70% more than human subjects did. This suggests that, while capable of simulating decision-making, LLMs may exhibit amplified biases under specific conditions compared to human ordering patterns.
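The three demand regimes are straightforward to reproduce. The parameters below are illustrative choices, roughly matched on mean demand, rather than the settings used in the study; the low-margin fractile of 0.40 is carried over from the earlier example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative demand regimes, each with mean demand of roughly 100 units.
demand_generators = {
    "uniform": lambda n: rng.uniform(50, 150, n),
    "normal": lambda n: rng.normal(100, 20, n),
    "lognormal": lambda n: rng.lognormal(np.log(100) - 0.02, 0.2, n),  # right-skewed
}

# Under a 0.40 critical fractile, the optimal order is the 40th percentile
# of each distribution, so the rational benchmark itself shifts with the regime.
for name, gen in demand_generators.items():
    print(name, round(float(np.quantile(gen(100_000), 0.4)), 1))
```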

The Architecture of Error: Rationality and Model Complexity
The architecture and parameter count of Large Language Models (LLMs) directly correlate with the manifestation and intensity of cognitive biases in simulated decision-making processes. Simulations indicate that simpler models, with fewer parameters, tend to exhibit biases such as anchoring and confirmation bias to a lesser degree than more complex models. Conversely, highly complex LLMs can amplify these biases or introduce new ones, like overconfidence or framing effects, due to their increased capacity for pattern recognition and potential for overfitting to training data. The specific types of biases observed also vary with model complexity; while simpler models might primarily demonstrate biases related to information processing limitations, more complex models can exhibit biases stemming from their ability to simulate sophisticated, yet flawed, reasoning processes. Quantitatively, increases in model parameter count have been shown to correlate with both increased bias magnitude, measured by deviation from rational choice, and increased variance in bias expression across different simulated scenarios.
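One way to make the quantitative claim concrete is to aggregate per-scenario deviations by model: bias magnitude as the mean absolute deviation from the rational order, and variance of that deviation across scenarios. The numbers below are purely hypothetical placeholders for such a log.

```python
import numpy as np

# Hypothetical log: mean order deviation from the optimal quantity,
# per model, across four simulated scenarios. Not real results.
deviations = {
    "efficiency_model": [2.1, -1.4, 3.0, 0.8],
    "frontier_model": [9.5, -6.2, 12.3, 4.1],
}

for model, devs in deviations.items():
    devs = np.asarray(devs, dtype=float)
    magnitude = np.mean(np.abs(devs))   # average distance from the rational choice
    spread = np.var(devs)               # variability of bias expression across scenarios
    print(f"{model}: bias magnitude {magnitude:.1f}, variance {spread:.1f}")
```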
The assumption of strict risk neutrality in simulations of decision-making processes can produce inaccurate results due to its failure to account for established behavioral economics principles. Human decision-making consistently demonstrates risk aversion – a preference for a certain outcome over a probabilistic one with the same expected value. When models operate under risk neutrality, they effectively treat all outcomes with equal utility regardless of probability, potentially masking biases that would be apparent when subjects exhibit typical loss aversion. Conversely, certain biases may be exaggerated in a risk-neutral framework as the model doesn’t reflect the dampening effect of risk aversion on extreme choices. Therefore, incorporating realistic parameters for risk preference is crucial for accurately representing human behavior and interpreting simulation outcomes.
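To see why the modeling choice matters, compare the order that maximizes expected profit with the order that maximizes an expected risk-averse utility. The CARA utility and its risk parameter below are illustrative assumptions, not part of the study.

```python
import numpy as np

rng = np.random.default_rng(0)
price, cost = 10.0, 6.0
demand = rng.normal(100, 20, 20_000)          # simulated demand draws

def profit(order, demand):
    return price * np.minimum(order, demand) - cost * order

def best_order(utility):
    # Choose the order that maximizes the mean utility of simulated profit.
    candidates = np.arange(60, 141)
    return candidates[np.argmax([utility(profit(q, demand)).mean() for q in candidates])]

risk_neutral = best_order(lambda x: x)                      # maximize expected profit
risk_averse = best_order(lambda x: 1 - np.exp(-0.05 * x))   # CARA utility, illustrative
print(risk_neutral, risk_averse)   # the risk-averse order is typically lower
```

In this setup the risk-neutral optimum sits near the critical-fractile quantity while the risk-averse order drops further, so interpreting a model’s conservative orders as pure bias requires first pinning down what risk preference the benchmark assumes.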
Simulation results from Large Language Models necessitate careful interpretation due to the impact of model parameters and inherent biases. Analysis reveals a strong correlation between model complexity and the manifestation of cognitive biases; therefore, calibration of these parameters is crucial for accurate results. Notably, GPT-4o exhibited a near-zero convergence slope of +0.003, signifying rapid adoption of the optimal solution, a characteristic not observed in the other tested models, which displayed substantial convergence lag. This disparity underscores the importance of acknowledging psychological factors – specifically, how models approach problem-solving – when evaluating simulation outcomes and drawing conclusions about decision-making processes.
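The convergence slope in that comparison can be read as the per-round trend in how far orders sit from the optimum, so a near-zero slope paired with small deviations indicates a model that locked onto the benchmark almost immediately. A minimal sketch of one plausible way to compute it (the paper’s exact definition may differ):

```python
import numpy as np

def convergence_slope(orders, q_star):
    """Per-round trend in absolute deviation from the optimal order."""
    deviations = np.abs(np.asarray(orders, dtype=float) - q_star)
    rounds = np.arange(len(deviations))
    return float(np.polyfit(rounds, deviations, 1)[0])
```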
Correcting the Course: Mitigating Bias Through Oversight and Prompting
Human-in-the-loop oversight represents a powerful strategy for mitigating biases embedded within artificial intelligence systems. This approach doesn’t aim to eliminate AI bias entirely – a complex undertaking given the data these systems learn from – but rather to introduce a crucial layer of human judgment into the decision-making process. By requiring a human reviewer to validate or challenge AI-generated outputs, particularly in high-stakes scenarios like loan applications or criminal risk assessment, potential biases can be identified and corrected before they manifest as unfair or discriminatory outcomes. This collaborative method leverages the strengths of both AI – speed and data processing – and human intelligence – contextual understanding and ethical reasoning – resulting in more equitable and reliable decisions. Furthermore, the very act of human review provides valuable feedback, allowing the AI model to learn from its mistakes and refine its algorithms over time, continually reducing the likelihood of biased outputs.
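In an inventory setting, that oversight layer can be as simple as a guardrail that accepts the model’s order only when it stays close to a benchmark quantity and otherwise escalates to a reviewer. The threshold and the escalation hook below are illustrative choices, not a procedure from the study.

```python
def review_gate(proposed_order: float, q_star: float, tolerance: float = 0.15) -> float:
    """Pass the model's order through only if it is within `tolerance`
    of the benchmark quantity; otherwise escalate to a human reviewer."""
    if abs(proposed_order - q_star) <= tolerance * q_star:
        return proposed_order
    return request_human_review(proposed_order, q_star)

def request_human_review(proposed_order: float, q_star: float) -> float:
    # Hypothetical escalation hook: in practice this would open a ticket
    # or prompt an analyst; here it simply falls back to the benchmark.
    print(f"Flagged for review: model proposed {proposed_order}, benchmark is {q_star:.1f}")
    return q_star
```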
Structured prompting represents a powerful technique for refining the outputs of Large Language Models and mitigating the impact of inherent cognitive biases. Rather than simply posing open-ended questions, this approach involves crafting prompts with specific constraints, logical frameworks, or multi-step reasoning requirements. By guiding the model through a defined process – such as requesting it to explicitly consider alternative viewpoints, justify its conclusions with evidence, or decompose a complex problem into smaller, manageable parts – researchers can encourage more rational and objective responses. This isn’t about eliminating bias entirely, but rather about channeling the model’s vast knowledge base towards outcomes less susceptible to common distortions in reasoning, such as confirmation bias or anchoring effects. The effectiveness of structured prompting lies in its ability to subtly steer the model’s generative process, promoting outputs that are not only informative but also demonstrably more aligned with logical principles and factual accuracy.
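In the newsvendor setting, such a prompt might walk the model through the normative calculation before it commits to a number. The wording below is an illustrative template, not the prompt used in the paper.

```python
STRUCTURED_PROMPT = """You are deciding an order quantity for a perishable product.
Work through these steps and show each one:
1. State the underage cost (price - cost) and the overage cost (cost - salvage).
2. Compute the critical fractile: underage / (underage + overage).
3. Identify the order quantity equal to that quantile of the demand distribution.
4. Do NOT adjust the result toward the most recent demand realization.
Reply with the final order quantity as a single integer on the last line."""
```

Constraining the response format also keeps the answer machine-parseable, which matters when thousands of simulated rounds are scored automatically.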
Recent advancements indicate a pivotal shift in the application of artificial intelligence, moving beyond mere identification of bias towards its active mitigation and promotion of objective outcomes. Interventions, such as human oversight and structured prompting, are proving effective in guiding AI decision-making processes, reducing the impact of inherent cognitive distortions. Notably, models like GPT-4o demonstrate a significant leap in error responsiveness; this enhanced adaptability allows for more effective correction of mistakes and a greater capacity to learn from feedback, suggesting a future where AI systems not only recognize flawed reasoning but actively strive for more rational and equitable results. This represents a crucial step towards leveraging AI’s power to improve the quality and fairness of decisions across a multitude of domains.
The study reveals a troubling echo of human fallibility within these complex systems. Despite their computational power, large language models aren’t immune to the cognitive biases that plague human decision-making, particularly as demonstrated through the Newsvendor problem. This isn’t simply a matter of inaccurate predictions; it’s a fundamental limitation in how these models process information. As Robert Tarjan once observed, “Complexity has a way of hiding simplicity.” The research underscores that increasing model scale doesn’t inherently lead to rational outcomes; it merely obscures the underlying biases. The illusion of stability, cached by time and computational power, masks a core truth: even the most sophisticated systems are susceptible to decay and irrationality. Vigilant oversight, therefore, remains essential to mitigate these inherent risks.
The Long Refactor
The observation that large language models inherit – and often exaggerate – human cognitive biases in seemingly rational tasks is less a surprise than an inevitable consequence of architectural lineage. Versioning, after all, is a form of memory, and models trained on the past will naturally encode its imperfections. The Newsvendor problem, recast through this lens, isn’t simply about predicting demand; it’s about perpetuating the predictable errors embedded within the training data. The arrow of time always points toward refactoring, but even the most rigorous adjustments can only mitigate, not eliminate, the ghosts in the machine.
Future work must move beyond simply detecting these biases. The real challenge lies in understanding their systemic origins – not as isolated glitches, but as emergent properties of complex systems. Exploring the interplay between model architecture, training methodologies, and the inherent irrationalities of the data itself will be crucial. The question isn’t whether models can think rationally, but whether they can learn to recognize – and compensate for – their own inherent limitations.
Ultimately, this line of inquiry reveals a deeper truth: complexity does not equate to transcendence. Systems, even those built on layers of abstraction, are still subject to entropy. The focus, therefore, should shift from pursuing artificial general intelligence to engineering for graceful degradation – accepting that all models, like all things, are destined to age, and preparing for the inevitable moment when their predictions become echoes of a flawed past.
Original article: https://arxiv.org/pdf/2512.12552.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
See also:
- Silver Rate Forecast
- Gold Rate Forecast
- Красный Октябрь stock forecast. KROT price
- MSCI’s Digital Asset Dilemma: A Tech Wrench in the Works!
- Dogecoin’s Big Yawn: Musk’s X Money Launch Leaves Market Unimpressed 🐕💸
- Bitcoin’s Ballet: Will the Bull Pirouette or Stumble? 💃🐂
- Guardian Wealth Doubles Down on LKQ Stock With $1.8 Million Purchase
- Binance and Botim Money Join Forces: Crypto in the UAE Gets a Boost-Or Does It? 🚀
- Twenty One Capital’s NYSE debut sees 20% fall – What scared investors?
- Monster Hunter Stories 3: Twisted Reflection gets a new Habitat Restoration Trailer
2025-12-16 23:38