Can AI Feel Inflation? Modeling Price Perceptions with Large Language Models

Author: Denis Avetisyan


New research explores whether large language models can accurately simulate how consumers perceive and anticipate changes in pricing, offering insights into economic modeling and the potential biases of AI.

The study demonstrates that large language models, even those with a knowledge cut-off of September 2021, exhibit fluctuating perceptions of inflation over time, as evidenced by the varying means calculated from repeated responses of a consistent respondent pool, a pattern that suggests an inherent instability in their simulated understanding of economic concepts.

This study investigates the capacity of a large language model to replicate human responses to questions about inflation expectations, using Shapley values to assess feature importance, and highlights the need for careful validation of AI-driven economic predictions.

While macroeconomic models often struggle to capture nuanced consumer perceptions, this paper, ‘Inflation Attitudes of Large Language Models’, investigates the capacity of a large language model (GPT) to simulate household inflation expectations and perceptions based on macroeconomic signals. Findings reveal that GPT can replicate key empirical regularities observed in human survey responses, particularly concerning demographic factors, yet lacks a fully consistent model of price dynamics. This approach, utilizing Shapley values for explainability, offers a novel framework for evaluating LLMs in social science contexts, but how can we best validate and ethically deploy such models for forecasting and policy analysis?


The Mirage of Consensus: Gauging Inflation’s Shadow

Accurate assessment of how the public perceives inflation is fundamental to crafting effective economic policy; misjudging these perceptions can lead to misguided interventions and destabilize financial systems. However, conventional methods of gauging public expectation, such as surveys, frequently suffer from limitations in both scope and timeliness. These surveys often capture a snapshot from a limited demographic, struggle to represent the full spectrum of economic understanding, and are hampered by the delays inherent in data collection and analysis. Consequently, policymakers may operate with an incomplete or outdated understanding of prevailing beliefs about price changes, hindering their ability to proactively address economic challenges and maintain stability. The responsiveness of these traditional methods is particularly problematic in rapidly evolving economic climates, where public sentiment can shift dramatically in short periods.

Recent research explores the application of Large Language Models (LLMs) to model the complex perceptions of economic agents regarding inflation, offering a dynamic alternative to conventional survey-based methods. These models, trained on vast datasets of text and economic indicators, can simulate how individuals might react to economic news and form expectations about future price changes with a granularity previously unattainable. This approach doesn’t merely replicate aggregate trends; it captures nuanced responses based on contextual understanding, potentially revealing subtle shifts in sentiment that influence actual economic behavior. Furthermore, the simulations generated by LLMs serve as a powerful analytical tool, providing a benchmark against which to assess the validity and interpret the underlying patterns within traditional survey data, thereby enriching the insights derived from existing economic indicators and offering a more comprehensive view of public inflation expectations.

Histograms reveal that human and GPT-generated perceptions of inflation, as measured in February 2023, align at time T=0 but diverge at T=1.5, based on data from IAS, ONS, and the authors’ calculations.

Conditioning the Oracle: Anchoring LLMs in Economic Reality

Accurate simulation of economic agents within Large Language Models (LLMs) requires ‘economic conditioning’ – the process of incorporating relevant macroeconomic data. Specifically, the Consumer Price Index including owner occupiers’ housing costs (CPIH) serves as a critical input. CPIH, unlike the traditional CPI, includes a measure of housing costs associated with owner-occupation, providing a more comprehensive assessment of inflation and the cost of living. By training LLMs on CPIH data, the models develop a statistically grounded understanding of price levels and can better reflect the economic circumstances of consumers, leading to more realistic and reliable simulations of economic behavior and forecasting.
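A minimal sketch of what such economic conditioning might look like in practice appears below: an official CPIH reading is embedded directly in the survey-style prompt given to a simulated respondent. The prompt wording, the helper name, and the example figure are illustrative assumptions, not the authors’ exact protocol.

```python
# Sketch of "economic conditioning": embedding an official CPIH figure into the
# prompt given to a simulated household respondent. Prompt text, helper name,
# and the example rate are illustrative assumptions, not the paper's protocol.

def build_conditioned_prompt(cpih_annual_rate: float, month: str) -> str:
    """Compose a survey-style question grounded in a stated CPIH reading."""
    return (
        f"You are a UK household answering an inflation survey in {month}. "
        f"Official statistics report that CPIH rose by {cpih_annual_rate:.1f}% "
        "over the past 12 months. "
        "How much do you think prices in the shops have changed over the past "
        "12 months? Answer with a single percentage."
    )

prompt = build_conditioned_prompt(cpih_annual_rate=8.8, month="February 2023")
print(prompt)
```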

Economically conditioning Large Language Models (LLMs) with data such as the Consumer Price Index including owner occupiers’ housing costs (CPIH) directly improves the accuracy of price-level predictions. By exposing the LLM to official statistical measures, the model’s output becomes more congruent with established economic indicators and reported values. This conditioning process allows the LLM to move beyond purely linguistic associations and ground its responses in quantifiable economic realities, enabling more informed projections of current and future price dynamics. Consequently, the model’s outputs are better suited for applications requiring alignment with official economic reporting and forecasting.

The temperature parameter within Large Language Models (LLMs) governs the randomness of token selection during response generation, directly shaping the diversity of simulated economic agent behavior. Lower temperature values (closer to zero) result in more deterministic outputs, favoring the most probable tokens and producing predictable, albeit potentially less varied, responses. Conversely, higher temperature values increase the probability of less likely tokens being selected, introducing greater variability and potentially more realistic heterogeneity among simulated agents. Sensitivity analyses quantify this relationship, demonstrating that changes to the temperature parameter, combined with economic conditioning data such as the Consumer Price Index, produce measurable changes in model predictions of indicators such as price levels and consumer spending, highlighting its critical role in calibrating model behavior.
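The mechanics are easy to see in a toy example. The sketch below applies the standard softmax-with-temperature formula to a few made-up logits; the values are assumptions chosen only to show how low temperatures concentrate probability mass while high temperatures spread it out.

```python
# Toy illustration of how the temperature parameter reshapes the sampling
# distribution over candidate tokens. The logits are made up; the point is the
# softmax-with-temperature formula, not any particular model's values.
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = logits / temperature
    scaled -= scaled.max()               # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([3.0, 2.0, 0.5])       # e.g. candidate answers "5%", "10%", "2%"
for t in (0.2, 1.0, 1.5):
    print(t, softmax_with_temperature(logits, t).round(3))
# Low temperature concentrates mass on the top answer (near-deterministic agents);
# higher temperature spreads it out (more heterogeneous simulated respondents).
```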

Sensitivity analysis reveals that GPT inflation perceptions are most strongly influenced by food and restaurant prices, with the 90th percentile range of these components (shaded area) encompassing the economic conditions used in the primary experiment (purple dashed line), as determined by analysis of data from IAS and ONS.

Testing the Limits: Validating LLM Performance and Robustness

Rigorous evaluation of the Large Language Model (LLM) necessitates the use of techniques such as Out-of-Sample Evaluation and Cross-Validation to determine its capacity to generalize to previously unobserved data. Out-of-Sample Evaluation involves training the model on a subset of the available data and then assessing its predictive accuracy on the remaining, held-out data. Cross-Validation extends this by partitioning the data into multiple folds, iteratively training on a subset of folds and validating on the remaining ones, providing a more robust estimate of performance. These methods are critical for preventing overfitting and ensuring that the LLM’s performance accurately reflects its ability to predict inflation perceptions and expectations in real-world scenarios, rather than simply memorizing the training data.
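The evaluation loop itself is standard. The sketch below shows the cross-validation logic on synthetic placeholder data, with a linear model standing in for any predictor of perceived inflation; in the paper’s setting the features and targets would come from the IAS/ONS data and the predictions from the LLM, but the fold-by-fold scoring is the same.

```python
# Sketch of out-of-sample evaluation via k-fold cross-validation on synthetic
# placeholder data: macro features mapped to mean perceived inflation. The
# linear model is a stand-in for any predictor scored on held-out folds.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))                              # placeholder macro features
y = X @ np.array([0.6, 0.3, 0.0, 0.1]) + rng.normal(scale=0.5, size=120)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(np.mean((pred - y[test_idx]) ** 2))      # per-fold mean squared error

print("cross-validated MSE:", np.mean(errors))
```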

The Large Language Model (LLM) utilizes a Generative Pre-trained Transformer (GPT) architecture, which is fundamentally designed to process and generate sequential data. This architecture employs a multi-layered structure of attention mechanisms, allowing the model to weigh the importance of different parts of the input sequence when predicting future elements, which is crucial for understanding the temporal dynamics of inflation perceptions and expectations. Specifically, the GPT architecture’s capacity to learn complex relationships within textual data, combined with its ability to extrapolate from observed patterns, enables it to model the nuanced language used when discussing economic forecasts and consumer price expectations. The model’s performance is directly linked to the depth and breadth of its pre-training data and to its ability to capture subtle linguistic cues indicative of inflationary sentiment.
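For readers unfamiliar with the core operation, the following is a minimal scaled dot-product attention sketch in NumPy. The shapes and values are toy assumptions; a real GPT adds learned projections, many heads and layers, and causal masking so each position attends only to earlier tokens.

```python
# Minimal scaled dot-product attention: values V are re-weighted by the
# similarity of queries Q to keys K. Toy shapes; real GPT adds learned
# projections, multiple heads/layers, and a causal mask.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # scaled similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))  # 5 tokens, embedding dim 8
print(attention(Q, K, V).shape)                        # (5, 8)
```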

Bias correction methods were implemented to address potential systematic errors in the LLM’s output, improving the fairness and accuracy of generated simulations. This involved techniques designed to reduce the influence of skewed or unrepresentative data used during training. Model performance was then rigorously validated by comparing its outputs to both human responses – assessing qualitative alignment with perceived inflation – and the results of established regression models, providing a quantitative benchmark for accuracy and identifying any remaining discrepancies that require further refinement of the bias correction process.
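One simple correction consistent with that description is a mean shift: remove the systematic gap between the model’s simulated responses and a human benchmark measured on a calibration sample. The data below are synthetic and the paper’s exact correction may differ; this is only a sketch of the idea.

```python
# Sketch of a mean-shift bias correction: subtract the average offset between
# simulated and observed perceptions. Synthetic numbers; the paper's exact
# correction procedure may differ.
import numpy as np

def mean_bias_correction(llm_responses: np.ndarray, human_benchmark: np.ndarray) -> np.ndarray:
    """Remove the average offset between simulated and observed perceptions."""
    bias = llm_responses.mean() - human_benchmark.mean()
    return llm_responses - bias

llm = np.array([9.5, 10.2, 8.8, 11.0])     # simulated perceived inflation, %
human = np.array([8.0, 8.6, 7.9, 9.1])     # survey responses, %
print(mean_bias_correction(llm, human).round(2))
```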

Histograms comparing human and GPT-generated perceptions of inflation at the initial time point (T=0, upper panel) and after 1.5 months (T=1.5, lower panel) reveal differences in forecasting behavior, based on data from November 2022.

Dissecting the Oracle: Unveiling the Drivers of LLM Predictions

The increasing reliance on large language models for economic forecasting and simulation necessitates a thorough understanding of model explainability. Without insight into the factors driving an LLM’s predictions, stakeholders are left with a ‘black box’ – hindering trust and informed decision-making. Determining which economic indicators, demographic data, or specific conditions most heavily influence the model’s outputs is therefore paramount. This isn’t simply about verifying accuracy; it’s about uncovering the model’s internal logic and ensuring its conclusions are grounded in reasonable economic principles. A transparent understanding of these influential factors allows for robust validation, identification of potential biases, and ultimately, greater confidence in the simulations’ predictive power and policy implications.

Shapley Value, a concept originating in cooperative game theory, offers a robust methodology for dissecting the complex decision-making processes within large language models. Rather than simply identifying correlations, it calculates each input feature’s marginal contribution to a prediction, averaging the results across all possible feature combinations. This allows researchers to move beyond acknowledging that a variable influences an outcome and instead determine how much it matters, expressed as a quantifiable value. For instance, when simulating economic behavior, Shapley Value can pinpoint whether changes in the Consumer Price Index (CPIH) or specific demographic data points exert a greater impact on the LLM’s predicted agent responses. The result is a transparent, mathematically grounded attribution of influence, enabling a deeper understanding of the model’s internal logic and bolstering confidence in its predictions.
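The definition, averaging a feature’s marginal contribution over all coalitions, can be computed exactly for small feature sets. The brute-force sketch below uses a toy linear “perception model” and an assumed baseline; the paper applies the same principle to GPT’s simulated responses with features such as CPIH sub-components and demographics.

```python
# Exact Shapley values by brute force, to make "average marginal contribution
# over all coalitions" concrete. The toy model and baseline are assumptions.
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Features absent from a coalition are set to their baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if (j in coalition or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in coalition else baseline[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Toy "perception model": weighted sum of food inflation, energy inflation, age.
predict = lambda z: 0.5 * z[0] + 0.3 * z[1] + 0.05 * z[2]
print(shapley_values(predict, x=[10.0, 20.0, 40.0], baseline=[2.0, 2.0, 40.0]))
# For a linear model this recovers weight * (feature - baseline) per feature.
```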

By leveraging Shapley value decomposition, researchers can pinpoint specific economic conditions – termed ‘treatment effects’ – that significantly influence the behavior of simulated agents within large language models. This analytical approach doesn’t simply identify that a condition matters, but quantifies how much it contributes to a given prediction. Crucially, this method enables a direct comparison between the LLM’s internal perception of economic drivers and those identified by traditional regression models. Discrepancies reveal potentially novel insights into economic relationships as understood by the AI, while consistencies validate existing economic theory. The ability to dissect the LLM’s reasoning process, and to compare it with established statistical techniques, builds confidence in the model’s outputs and unlocks opportunities for discovering previously unrecognised economic sensitivities.

The integration of demographic data into large language models significantly enhances their capacity to simulate realistic economic agent behavior. By moving beyond homogenous agent representations, these models can now account for variations in perceptions and responses based on factors like age, education, and income level. This nuanced approach allows for a more granular understanding of how different segments of the population react to economic stimuli, leading to more accurate and robust simulations. Consequently, researchers can investigate how disparities in demographic characteristics contribute to varied economic outcomes, and explore the potential impacts of policies on specific population groups – a level of detail previously unattainable with simplified agent models. This improved fidelity is crucial for building economic simulations that accurately reflect the complexities of real-world human behavior.
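In practice this heterogeneity enters through the respondent description supplied to the model. The sketch below enumerates a small grid of personas prepended to the survey question; the field names and wording are illustrative assumptions rather than the paper’s exact prompt design.

```python
# Sketch of demographic heterogeneity: each simulated respondent carries a
# persona prepended to the survey question. Fields and wording are assumptions.
from itertools import product

ages = ["25-34", "50-64"]
incomes = ["below median income", "above median income"]

def persona_prompt(age: str, income: str) -> str:
    return (
        f"You are a UK resident aged {age} with {income}. "
        "Over the next 12 months, how much do you expect prices to change? "
        "Answer with a single percentage."
    )

prompts = [persona_prompt(a, i) for a, i in product(ages, incomes)]
print(len(prompts), "distinct respondent profiles")
```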

Shapley values reveal that index-weighted sub-component inflation (purple) and a regression model (sand) are the primary drivers of GPT responses, exceeding the baseline of 0.07 (blue), based on data from IAS, ONS, and the authors’ calculations.

The pursuit of simulating human economic perceptions with large language models reveals a humbling truth: order is merely a cache between two outages. This research, attempting to model inflation expectations, demonstrates the model’s capacity to reproduce certain human responses, yet simultaneously exposes inherent biases and inconsistencies. It echoes a fundamental principle – systems aren’t tools, but ecosystems – where predicting emergent behavior remains profoundly challenging. As Robert Tarjan observed, “The most effective way to design a complex system is to assume it will fail.” This isn’t pessimism, but a recognition that even sophisticated simulations, like this investigation using Shapley Values, are imperfect prophecies, constantly tested by the chaotic reality they attempt to represent. The model’s limitations aren’t flaws, but inherent characteristics of any attempt to impose structure on inherently unpredictable phenomena.

The Garden Grows

This work doesn’t so much solve a problem as illuminate a new patch of garden. The model’s capacity to mimic inflation perceptions is less a triumph of engineering and more an observation of the inherent pattern-seeking within any complex system, be it silicon or synapse. The biases revealed aren’t bugs, but the inevitable shadows cast by the training data, a reminder that every projection of ‘rationality’ is, at its core, a distillation of past inconsistencies.

The pursuit of ‘explainable AI’ here feels less like dissecting a mechanism, and more like charting the flow of water through a root system. Shapley values offer a glimpse, but complete transparency remains a phantom. Future work will likely center not on correcting the model’s flaws, but on understanding the nature of its divergences – embracing the fact that prediction, especially concerning human belief, is fundamentally an act of controlled hallucination.

Resilience won’t be found in isolating the model from ‘noise’, but in designing systems that forgive its errors, and incorporate the unpredictable flourishes of genuine human irrationality. The garden grows best when left slightly wild.


Original article: https://arxiv.org/pdf/2512.14306.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
