Bridging the African Economics Knowledge Gap

Author: Denis Avetisyan


A new dataset reveals that large language models struggle with specialized African economic data, highlighting the need for enhanced retrieval mechanisms.

AfriEconQA, a benchmark dataset based on World Bank reports, demonstrates the limitations of current language models in African economic analysis and the benefits of retrieval-augmented generation.

Despite advances in large language models, specialized domain knowledge, particularly concerning African economic data, remains a significant challenge. To address this gap, we introduce AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports, a curated dataset of 8,937 question-answer pairs grounded in comprehensive World Bank documentation. Our evaluation reveals a substantial parametric knowledge deficit in current LLMs: zero-shot models fail to answer over 90% of queries, and even retrieval-augmented generation pipelines struggle with precision. This highlights the need for robust, domain-specific information retrieval and question answering systems; can we effectively bridge the knowledge gap and unlock the potential of LLMs for African economic analysis?


The Challenge of Discerning Economic Signals

Economic analysis fundamentally relies on discerning intricate connections within data that is inherently linked to specific points in time. Understanding not just what happened, but when, and how events unfold sequentially, is crucial for accurate forecasting and policy evaluation. Macroeconomic indicators, such as inflation rates or unemployment figures, don’t exist in isolation; their significance is determined by their trajectory, their relationship to previous values, and their correlation with other contemporaneous variables. Consequently, effective economic reasoning demands analytical tools capable of processing temporal data, identifying leading and lagging indicators, and accounting for the dynamic interplay between various economic forces – a task that requires moving beyond static snapshots to embrace the evolving nature of economic systems.

Current question answering systems, while proficient with factual recall, often falter when confronted with the intricacies of macroeconomic data and policy analysis. These systems typically struggle to discern the subtle relationships between indicators – how changes in inflation might influence unemployment, or the lagged effects of monetary policy. The nuances of economic reasoning demand an understanding of not just what the data shows, but why these trends occur and what potential future implications exist. Unlike questions with definitive answers, economic queries frequently require synthesizing information from multiple sources, interpreting ambiguous language within reports, and accounting for constantly evolving contexts – challenges that exceed the capabilities of most existing natural language processing models. Consequently, automated systems often provide superficial or inaccurate responses when tasked with complex economic reasoning, highlighting a critical gap in artificial intelligence’s ability to tackle real-world economic challenges.

Interpreting reports from institutions like the World Bank presents a unique challenge for automated systems due to the prevalence of temporal reasoning and contextual dependencies. These reports aren’t static snapshots; they chronicle evolving economic conditions, forecast future trends, and assess the impact of policies over time. Consequently, a system must not only identify key data points – such as GDP growth or inflation rates – but also accurately determine when those figures apply and how they relate to specific historical events or policy interventions. Simply recognizing numbers is insufficient; robust methods are needed to disambiguate temporal references – distinguishing, for example, between “projected growth for 2024” and “growth observed in the first quarter of 2023” – and to understand the broader economic context informing each data point, ensuring a nuanced and accurate interpretation of complex macroeconomic analyses.

Introducing AfriEconQA: A Focused Benchmark

AfriEconQA is a newly created benchmark dataset consisting of 8,937 question-answer pairs. These pairs are sourced directly from publicly available World Bank reports, ensuring content relevance and factual grounding in established economic data. The dataset’s construction involved extracting questions and corresponding answers as explicitly stated within the reports, rather than relying on paraphrasing or external knowledge. This direct derivation aims to provide a standardized and verifiable resource for evaluating question answering systems on African economic contexts.

AfriEconQA concentrates exclusively on economic data pertaining to the African continent, comprising information sourced from World Bank reports. This focused scope differentiates it from general question answering benchmarks and allows for targeted evaluation of systems specifically on African economic contexts. By limiting the subject matter, AfriEconQA facilitates a more granular assessment of a system’s performance in understanding and reasoning about African economic indicators, trends, and policies, providing a more relevant and precise metric than broader, multi-domain datasets.

AfriEconQA is specifically engineered to move beyond simple fact retrieval and evaluate a system’s capacity for complex reasoning within the domain of African economic data. The dataset’s questions require systems to synthesize information from multiple sentences or paragraphs, perform calculations based on reported figures, and draw inferences regarding economic trends and relationships. Evaluation metrics will focus on assessing not just the correctness of answers, but also the system’s ability to justify its reasoning process based on evidence present in the source World Bank reports, thereby gauging a deeper understanding of the underlying economic context.

Hybrid Retrieval: Combining Strengths

A hybrid retrieval system was implemented, combining BM25, a sparse retrieval method based on keyword frequency and inverse document frequency, with dense retrieval techniques utilizing vector embeddings. Dense retrieval employed models including BAAI/BGE-m3 and Google GenAI Embeddings to represent text as vectors, enabling semantic similarity comparisons. This approach leverages the strengths of both methods; BM25 provides reliable results based on lexical matching, while dense retrieval captures nuanced semantic relationships often missed by keyword-based searches. The combined system aims to improve retrieval accuracy and relevance by integrating both lexical and semantic information.
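The article does not publish the retrieval code, but the sparse half of such a pipeline is well defined. As a minimal sketch, here is a stdlib-only BM25 scorer over a toy corpus; the documents, query, and default parameters (k1=1.5, b=0.75) are illustrative, not taken from the paper's setup.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each document against the query with the BM25 formula."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency for each term in the corpus.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            score += idf * (tf[term] * (k1 + 1)) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

corpus = [
    "gdp growth in kenya slowed in 2023".split(),
    "inflation in ghana rose sharply".split(),
    "kenya inflation and gdp outlook".split(),
]
print(bm25_scores("kenya gdp".split(), corpus))
```

In a real pipeline these keyword-overlap scores would form one of the two ranked lists fed into the fusion step; a production system would typically use a tuned library implementation rather than this sketch.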

Dense Retrieval and BM25 represent distinct approaches to information retrieval. BM25, a lexical matching algorithm, identifies documents containing query keywords based on term frequency and inverse document frequency, providing reliable results when keyword overlap is strong. Conversely, Dense Retrieval utilizes vector embeddings – numerical representations of text – to encode both queries and documents into a vector space; similarity is then determined by calculating the cosine similarity between these vectors. This allows Dense Retrieval to identify documents semantically related to the query, even if they lack direct keyword matches, capturing nuanced meaning and context that BM25 may miss.
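The embedding models themselves (BGE-m3, Google GenAI Embeddings) are external components, but the ranking step they feed reduces to cosine similarity over vectors. A minimal sketch, using made-up four-dimensional vectors in place of real embedding output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for embedding-model output; a real system would call
# an encoder such as BGE-m3, producing vectors with hundreds of dimensions.
query_vec = [0.9, 0.1, 0.0, 0.3]
doc_vecs = {
    "doc_a": [0.8, 0.2, 0.1, 0.4],  # semantically close to the query
    "doc_b": [0.0, 0.9, 0.8, 0.1],  # unrelated
}
ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                reverse=True)
print(ranked)
```

Because similarity is computed in the embedding space rather than over surface tokens, doc_a can outrank doc_b even when neither shares a literal keyword with the query, which is exactly the behavior BM25 cannot provide.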

Reciprocal Rank Fusion (RRF) was implemented to consolidate results from both BM25 and dense retrieval models, improving ranking accuracy by prioritizing highly-ranked documents from either method. RRF calculates a combined score based on the reciprocal rank of each document in the respective retrieval lists; documents appearing higher in either list contribute more significantly to the final score. Evaluation demonstrated that utilizing Google Dense Retrieval in conjunction with RRF yielded the highest Mean Reciprocal Rank (MRR) of 0.763, indicating superior performance in retrieving relevant documents compared to other configurations tested.
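RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every list it appears in, with k = 60 as the conventional constant. The two ranked lists below are illustrative, not the paper's actual retrieval output:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    A document scores sum(1 / (k + rank)) across every list it appears in,
    so items ranked highly by either retriever float to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative outputs from a sparse (BM25) and a dense retriever.
bm25_ranking = ["d3", "d1", "d7", "d2"]
dense_ranking = ["d1", "d5", "d3", "d9"]
print(rrf_fuse([bm25_ranking, dense_ranking]))
```

Note that d1, ranked highly by both retrievers, ends up first even though neither list placed it above d3 and d1 simultaneously; this agreement bonus is what makes RRF effective without any score calibration between the two systems.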

Establishing a Baseline: The Limits of Parametric Knowledge

Evaluation of GPT-5 Mini on the AfriEconQA dataset established a crucial parametric baseline for assessing knowledge of African economic data. Results indicated a remarkably low LLM-Judge Accuracy of just 0.081, revealing limited inherent parametric understanding within the model. This suggests that, without external knowledge supplementation, the language model struggles with even fundamental queries related to African economics, highlighting a significant gap in its pre-trained knowledge base and justifying the exploration of retrieval-augmented methods to enhance performance on this specific domain.

GPT-5 Mini was intentionally utilized as a foundational control in this study, establishing a clear point of reference for evaluating the gains achieved through retrieval-augmented generation. By first assessing the model’s inherent limitations – a low accuracy on the AfriEconQA dataset demonstrating minimal pre-existing knowledge of African economic data – researchers could then rigorously quantify the improvements derived from integrating external knowledge sources. This approach allowed for a direct comparison, isolating the contribution of the retrieval mechanisms themselves and validating their effectiveness in addressing the knowledge gaps present in a standalone large language model. Consequently, any performance increase observed in subsequent tests could be confidently attributed to the implemented retrieval strategies, rather than inherent model capabilities.

Evaluations reveal that relying solely on the inherent parametric knowledge of large language models proves insufficient for accurately addressing complex queries, particularly within specialized domains like African economics. A standalone GPT-5 Mini model, for instance, achieved an LLM-Judge accuracy of just 0.081. Retrieval-augmented methods, which combine information retrieved from external sources with the model's existing knowledge, yield substantial performance gains: the most effective approach, a hybrid system leveraging both Google search and GPT-4o, attained a significantly higher LLM-Judge accuracy of 0.512, demonstrating that accessing and integrating external data is crucial for improving the reliability and precision of responses in knowledge-intensive tasks.
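The LLM-Judge metric behind these numbers is simply the fraction of answers a judge model marks as correct. A schematic of the bookkeeping, where the boolean verdicts and the sample size of 1,000 are placeholders chosen to reproduce the reported accuracies, not the actual evaluation data:

```python
def llm_judge_accuracy(verdicts):
    """Fraction of answers the judge model marked correct."""
    return sum(verdicts) / len(verdicts)

# Placeholder verdicts (True = judge accepted the answer). In the real
# evaluation these come from prompting a judge LLM with the question,
# the reference answer, and the candidate model's answer.
zero_shot_verdicts = [False] * 919 + [True] * 81
rag_hybrid_verdicts = [False] * 488 + [True] * 512

print(round(llm_judge_accuracy(zero_shot_verdicts), 3))   # baseline
print(round(llm_judge_accuracy(rag_hybrid_verdicts), 3))  # hybrid RAG
```

The metric's simplicity is deliberate: all of the judgment difficulty is pushed into the judge model's prompt, so the comparison between configurations stays a single, interpretable number.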

The creation of AfriEconQA highlights a critical failing in current Large Language Models: a lack of nuanced understanding regarding African economic data. This necessitates retrieval-augmented generation to bridge the knowledge gap. As Edsger W. Dijkstra observed, “It’s not enough to do things right; one must do the right things.” The dataset isn’t merely about providing answers; it’s about ensuring the models ask the correct questions of the data, and that requires a foundation of specific, verified knowledge. The benchmark’s focus on temporal reasoning further emphasizes this need – accurate analysis demands understanding not just what happened, but when, and the relationships between economic events. Clarity is the minimum viable kindness, and that clarity begins with data integrity.

What Remains to be Seen

The construction of AfriEconQA serves not as a culmination, but as a precise articulation of absence. Current large language models, when confronted with specialized economic data – specifically that concerning African economies – reveal a predictable failure mode: not malice, but a lack of structural information. Retrieval augmentation mitigates this, yet the necessity of external knowledge injection is itself an admission. The models do not know; they locate. This is a distinction often obscured by the illusion of fluency.

Future work must move beyond simply demonstrating this deficiency. The dataset invites investigation into how this knowledge gap manifests – whether it’s a consequence of data scarcity, algorithmic bias, or inherent limitations in the models’ capacity for temporal reasoning regarding developing economies. A focus on knowledge representation – beyond simple text ingestion – may prove more fruitful than continued scaling of parameters.

Ultimately, the value of this endeavor lies not in building better question-answering systems, but in refining the question itself. What does it mean for a machine to ‘understand’ an economy? Perhaps the pursuit of artificial intelligence will, ironically, force a more rigorous definition of intelligence itself. Emotion is a side effect of structure; clarity is compassion for cognition.


Original article: https://arxiv.org/pdf/2601.15297.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-01-23 22:14