Can AI Truly Research? A New Benchmark Puts Agents to the Test

Author: Denis Avetisyan


A challenging new dataset, DeepResearch-9K, reveals significant limitations in current artificial intelligence systems when it comes to performing complex, multi-step research tasks.

Model performance was evaluated on the DeepResearch-9K test set to demonstrate comparative efficacy.

DeepResearch-9K assesses agentic capabilities requiring extensive web search, information synthesis, and multi-hop reasoning, highlighting the need for advancements in AI-driven deep research.

Despite advances in large language models, complex, multi-step reasoning tasks demanding extensive web search and information synthesis remain a significant challenge. To address this, we introduce DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent, comprising 9,000 questions with associated search trajectories and verifiable answers, alongside the open-source DeepResearch-R1 training framework. Empirical results demonstrate that agents trained on this dataset achieve state-of-the-art performance on challenging deep-research benchmarks, yet reveal persistent limitations in current models. Will these resources pave the way for truly autonomous, deep-research agents capable of tackling complex real-world problems?


The Evolving Landscape of Autonomous Reasoning

Conventional natural language processing models, while adept at tasks like sentiment analysis or simple translation, frequently falter when confronted with problems demanding intricate, sequential reasoning. These models typically process information in a single pass, limiting their ability to break down a complex objective into manageable steps, remember prior actions, and adjust strategies accordingly. Real-world challenges – from planning a multi-leg journey to debugging software or conducting scientific research – necessitate a capacity for iterative problem-solving that extends beyond pattern recognition. The inherent limitations of these traditional approaches stem from their architecture; they lack the mechanisms to maintain state, explore different solution paths, and learn from the consequences of their actions – qualities crucial for navigating the complexities of dynamic environments and achieving long-term goals.

Autonomous Agents represent a significant leap beyond traditional Natural Language Processing by combining the ability to utilize external tools with a capacity for iterative problem-solving. Unlike systems designed for single-turn responses, these agents can decompose complex tasks into manageable steps, leveraging tools – such as search engines, calculators, or APIs – to gather information and execute actions. This process isn’t linear; agents can analyze results, refine strategies, and repeat steps until a desired outcome is achieved. The architecture enables them to tackle tasks demanding reasoning, planning, and adaptation – scenarios where static models falter. Consequently, Autonomous Agents demonstrate potential in automating intricate processes, from conducting research and writing reports to managing schedules and providing personalized assistance, effectively bridging the gap between language understanding and real-world action.

The emergence of autonomous agents as a distinct approach to artificial intelligence demands a departure from conventional evaluation metrics. Existing benchmarks, often designed for static tasks and single-response models, prove inadequate for assessing an agent’s ability to navigate complex, multi-step problems. Consequently, researchers are actively developing new methodologies, emphasizing long-horizon evaluation and measuring success not just on final outputs, but on the reasoning process itself. This includes creating environments that require agents to utilize tools, learn from failures, and adapt strategies over extended interactions. Furthermore, training paradigms are shifting toward reinforcement learning techniques and the generation of synthetic datasets specifically designed to challenge an agent’s planning, exploration, and generalization capabilities – ensuring progress isn’t merely reflected in narrow performance gains, but in genuine advancements towards robust and reliable autonomous behavior.

DeepResearch-9K: A Rigorous Test of Agent Intellect

DeepResearch-9K is a benchmark dataset constructed to assess the capabilities of artificial intelligence agents in performing complex, multi-step research tasks. The dataset necessitates substantial interaction with web resources, requiring agents to formulate search queries, navigate websites, and extract relevant information over extended sessions. Unlike datasets focused on single-step question answering, DeepResearch-9K presents tasks demanding agents synthesize information from multiple sources to arrive at a final answer, simulating real-world research scenarios. The dataset’s design focuses on evaluating an agent’s ability to not only retrieve information, but also to strategically manage the research process itself, including refining search strategies and identifying credible sources.

DeepResearch-9K employs a hierarchical composition structure wherein each task is broken down into multiple sub-questions, and the answers to these sub-questions are then synthesized to arrive at a final solution. This approach contrasts with single-step question answering and necessitates multi-hop reasoning and information aggregation. The dataset organizes information into layers of abstraction; initial sub-questions require basic fact retrieval, while subsequent levels demand the integration of information from multiple sources and the application of more complex reasoning skills. This compositional design ensures that the difficulty of a task is not simply a function of the amount of information required, but also of the cognitive steps necessary to process and combine that information, creating a more nuanced and challenging benchmark for agent evaluation.
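The compositional structure described above can be sketched as a simple data model. The schema below is illustrative, assuming each task stores sub-questions with explicit dependency links; the actual DeepResearch-9K format may differ.

```python
from dataclasses import dataclass, field

@dataclass
class SubQuestion:
    """One node in a hierarchically composed task (hypothetical schema)."""
    question: str
    answer: str
    depends_on: list = field(default_factory=list)  # indices of prerequisite sub-questions

@dataclass
class DeepResearchTask:
    level: str            # "L1", "L2", or "L3"
    final_question: str
    final_answer: str
    sub_questions: list = field(default_factory=list)

    def depth(self) -> int:
        """Longest dependency chain: a proxy for the number of reasoning hops."""
        memo = {}
        def chain(i):
            if i not in memo:
                deps = self.sub_questions[i].depends_on
                memo[i] = 1 + max((chain(d) for d in deps), default=0)
            return memo[i]
        return max((chain(i) for i in range(len(self.sub_questions))), default=0)
```

Under this sketch, task difficulty grows with dependency depth rather than with the raw number of sub-questions, matching the layered design described above.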

Performance on the DeepResearch-9K benchmark is directly correlated with an agent’s ability to optimize search tool usage as task complexity increases. The dataset is structured into levels – ℒ1, ℒ2, and ℒ3 – and analysis demonstrates a significant rise in the average number of search tool calls required for successful completion. Agents averaged 4.30 tool calls on ℒ1 tasks, but this increased to 10.74 calls for ℒ2 and further to 20.23 calls for ℒ3. This progression indicates that higher-level tasks within DeepResearch-9K necessitate more extensive web interaction and present greater challenges related to entity obfuscation, demanding more efficient search strategies to locate relevant information.
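Per-level tool-call statistics of this kind can be computed from trajectory records. The record format below, a list of (level, call-count) pairs, is an assumption for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict

def mean_tool_calls(trajectories):
    """Average number of search-tool invocations per difficulty level.

    `trajectories` is a hypothetical list of (level, n_tool_calls)
    records; the real DeepResearch-9K trajectory format may differ.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for level, n_calls in trajectories:
        totals[level] += n_calls
        counts[level] += 1
    return {lvl: totals[lvl] / counts[lvl] for lvl in totals}

# Figures reported for the benchmark: L1 ≈ 4.30, L2 ≈ 10.74, L3 ≈ 20.23 calls.
```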

Analysis of the DeepResearch-9K dataset reveals that search tool call frequency varies with task difficulty, increasing from ℒ1 to ℒ3.

LLM-as-Judge: Automating the Assessment of Agent Reasoning

Utilizing a Large Language Model (LLM) as a judge offers a method for automated evaluation of agent responses that circumvents the limitations of manual assessment. This approach enables the processing of large datasets and frequent evaluations, providing a scalable solution for performance monitoring and improvement. By automating the correctness assessment, developers can rapidly iterate on agent designs and identify areas for optimization without being constrained by the time and cost associated with human review. This is particularly valuable in scenarios requiring continuous evaluation, such as A/B testing different agent configurations or tracking performance drift over time.

DeepSeek-V3 is utilized as the foundational language model for automated evaluation due to its demonstrated capabilities in complex reasoning tasks and nuanced understanding of natural language. This model’s architecture and training data enable it to effectively assess the correctness and coherence of agent responses against established ground truths. Specifically, DeepSeek-V3’s parameter size and training regime contribute to its ability to discern subtle differences in response quality, providing a reliable basis for quantitative performance measurement. Its integration into the judging process allows for scalable and consistent evaluation, mitigating the subjectivity inherent in manual assessment and ensuring statistically significant results, as evidenced by its baseline accuracy of 20.18% on the DeepResearch-9K dataset.

Evaluation using an LLM-as-judge methodology on the DeepResearch-9K dataset is designed to verify that observed improvements in agent performance are attributable to enhanced reasoning and information retrieval, rather than superficial factors. Initial validation testing established a baseline accuracy of 20.18% when utilizing DeepSeek-V3 as the judging model against this dataset; therefore, any subsequent increase in score demonstrates a quantifiable gain in the agent’s core capabilities as measured by the LLM-as-judge system. This provides a standardized metric for comparing different agent configurations and tracking progress on the DeepResearch-9K benchmark.
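A judging loop of this kind can be sketched in a few lines. The prompt template below is illustrative, not the benchmark's actual rubric, and `judge_fn` is an injected callable standing in for whatever model serves as judge (DeepSeek-V3 in this work).

```python
def llm_judge_accuracy(examples, judge_fn):
    """Score agent answers with an LLM judge.

    `examples` is a list of (question, reference_answer, agent_answer)
    triples; `judge_fn(prompt) -> str` wraps the judging model. Both the
    record format and the prompt wording are assumptions for this sketch.
    """
    correct = 0
    for question, gold, predicted in examples:
        prompt = (
            "You are grading a research agent.\n"
            f"Question: {question}\n"
            f"Reference answer: {gold}\n"
            f"Agent answer: {predicted}\n"
            "Reply with exactly CORRECT or INCORRECT."
        )
        if judge_fn(prompt).strip().upper().startswith("CORRECT"):
            correct += 1
    return correct / len(examples)
```

Because the verdict is reduced to a single token, the metric stays cheap to parse and reproducible across judging runs.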

Refining Agent Capabilities: A Synergistic Training Paradigm

The SFT+RL training paradigm combines supervised fine-tuning (SFT) with reinforcement learning (RL) to iteratively improve agent performance. Initially, the model undergoes SFT, leveraging a dataset of demonstrated optimal behaviors to establish a foundational policy. Subsequently, RL techniques, such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO), are employed to further refine this policy through trial-and-error interaction with an environment. The RL phase utilizes reward signals to incentivize desired actions and penalize undesirable ones, allowing the agent to learn from its experiences and optimize its behavior beyond the scope of the initial supervised data. This combined approach leverages the benefits of both methodologies: SFT provides a strong starting point, while RL enables adaptation and improvement in complex scenarios.
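The two-stage loop can be outlined schematically. Every callable below (the policy's `fit` method, the rollout, reward, and update functions) is a hypothetical stand-in, not part of the DeepResearch-R1 framework's actual API.

```python
def sft_then_rl(policy, demos, rollout, reward_fn, update_fn, rl_iters=3):
    """Two-stage SFT+RL loop (schematic; all callables are hypothetical).

    Stage 1 fits the policy to demonstrated trajectories; stage 2
    collects rollouts, scores them, and applies a policy-gradient
    update (PPO or GRPO in the paradigm described above).
    """
    # Stage 1: supervised fine-tuning on demonstrations
    for prompt, target in demos:
        policy.fit(prompt, target)
    # Stage 2: RL refinement through environment interaction
    for _ in range(rl_iters):
        trajectory = rollout(policy)
        reward = reward_fn(trajectory)
        update_fn(policy, trajectory, reward)
    return policy
```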

Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are both policy-gradient methods used within reinforcement learning to iteratively improve agent policies. PPO stabilizes training by employing a clipped surrogate objective, limiting each policy update step to prevent drastic changes that could destabilize learning. GRPO, by contrast, dispenses with a learned value function: for each prompt it samples a group of responses and scores each one by its reward relative to the group's mean, reducing the variance of the gradient estimates. Both algorithms collect experience through interaction with an environment, compute advantages from observed rewards, and update the policy to favor actions with higher estimated returns, ultimately refining the agent's decision-making process.
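The two update rules can be illustrated concretely. Below is a scalar sketch of PPO's clipped surrogate and GRPO's group-relative advantage normalization, simplified from the full batched implementations.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO's clipped surrogate for a single action (scalar sketch).

    The probability ratio is clipped to [1 - eps, 1 + eps] so one
    update cannot move the policy too far from the behavior policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)

def grpo_advantages(group_rewards):
    """GRPO's group-relative advantages.

    Each sampled response's reward is normalized against the mean and
    standard deviation of its group, avoiding a learned value network.
    """
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in group_rewards]
```

In the zero-variance case (all responses in the group equally rewarded), GRPO's advantages collapse to zero and the update carries no signal, which is why group sampling with diverse outcomes matters in practice.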

Qwen2.5-3B and Llama-3.2-3B have been used as base language models within the agent training and evaluation framework. Performance metrics show that Llama-3.2-3B, trained with the Proximal Policy Optimization (PPO) algorithm, achieved a peak accuracy of 22.50%. This result represents an improvement over the DeepSeek-V3 baseline under the same conditions, indicating the efficacy of the Llama-3.2-3B architecture and the PPO training methodology for this task.

Towards General Intelligence: Beyond the Benchmarks

Recent advancements in artificial intelligence have yielded agents capable of sophisticated web-based tasks, and their efficacy extends beyond the datasets used for initial training. Agents honed on the DeepResearch-9K benchmark consistently exhibit improved performance across a spectrum of related challenges, notably BrowseComp, HotpotQA, and GAIA. This transfer of learning suggests a robust understanding of information seeking and reasoning, rather than mere memorization of training examples. The ability to generalize to these diverse benchmarks, which assess capabilities like multi-hop reasoning, complex question answering, and web navigation, demonstrates a significant step toward building agents that can autonomously tackle real-world information needs with greater reliability and adaptability.

The capacity of these agents to navigate complex tasks extends beyond static datasets through the implementation of web simulation environments. By interacting with simulated web interfaces, agents can practice information retrieval and decision-making in a dynamic and unpredictable setting, significantly improving their robustness. Crucially, integration with frameworks like Search-R1 provides a structured approach to web interaction, enabling agents to formulate search queries, parse results, and iteratively refine their strategies. This synergistic combination of simulated experience and a robust search architecture doesn’t simply improve performance on specific benchmarks; it cultivates a broader adaptability, allowing these agents to tackle previously unseen challenges with greater efficiency and reliability – a crucial step towards truly generalizable artificial intelligence.

Recent advancements in agent training have demonstrated a notable leap in generalizability, exemplified by the performance achieved with the Llama-3.2-3B model on the ℒ3 benchmark. This model attained an accuracy of 23.73%, a figure that directly aligns with the performance observed on the more complex BrowseComp-Plus dataset. This parity suggests that training on datasets like DeepResearch-9K, combined with models such as Llama-3.2-3B, doesn’t simply optimize for specific tasks, but cultivates a broader capacity for reasoning and information retrieval across diverse web-based challenges. The ability to transfer knowledge effectively, achieving comparable results on distinct benchmarks, represents a significant step toward building agents capable of robust and adaptable performance in real-world scenarios.

The creation of DeepResearch-9K highlights a fundamental principle in system design: structure dictates behavior. This dataset isn’t merely a collection of research tasks; it’s a meticulously crafted environment intended to expose the limitations of current Large Language Models in complex reasoning. As David Hilbert famously stated, “We must be able to answer the question: what are the ultimate foundations of mathematics?”, a sentiment mirrored in the pursuit of robust agentic capabilities. DeepResearch-9K, much like Hilbert’s program, attempts to establish a clear foundation for evaluating and improving the ability of agents to perform multi-hop reasoning and synthesize information from extensive web searches. The dataset’s challenge lies not in the individual steps, but in the orchestration of those steps, a holistic view essential for scalable intelligence.

What Lies Ahead?

The introduction of DeepResearch-9K exposes a familiar truth: scaling model parameters alone does not confer genuine understanding. Current large language models, while adept at surface-level pattern matching, falter when confronted with the systemic demands of deep research: the iterative process of question formulation, evidence gathering, and synthesis. The benchmark highlights not a failure of search, but a failure of integration. The ability to retrieve information is, in itself, insufficient; the architecture must accommodate a robust mechanism for contextualizing, validating, and ultimately believing, or disbelieving, the retrieved data.

Future work must address this architectural deficiency. A promising avenue lies in exploring systems that explicitly model epistemic states – confidence levels, source credibility, and the recognition of conflicting evidence. Modifying one component in isolation (improving search, for instance) will yield only incremental gains if the underlying framework cannot reconcile disparate information. The challenge isn’t simply to build a more powerful search engine, but to create a system capable of building, maintaining, and revising a coherent worldview.

Ultimately, DeepResearch-9K serves as a reminder that intelligence is not about knowing more, but about organizing what is known. The benchmark implicitly demands a shift in focus, from optimizing for raw performance to prioritizing structural integrity. The true measure of progress will not be a higher score on this, or any other, dataset, but a demonstrable capacity for robust, adaptive reasoning in the face of complexity.


Original article: https://arxiv.org/pdf/2603.01152.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-04 04:53