Remembering to Think: A New Test for Agent Reasoning

Author: Denis Avetisyan


Researchers have created a benchmark to rigorously evaluate how well AI agents can maintain context and reason over extended interactions.

A formalized memory system unlocks agentic capabilities, allowing for the construction of complex behaviors and adaptive responses to dynamic environments – a foundational step towards truly autonomous systems.

AMA-Bench introduces a new evaluation framework and AMA-Agent, a memory system leveraging causality graphs and retrieval-augmented generation to enhance long-horizon reasoning in AI agents.

Current benchmarks inadequately assess long-horizon memory – a critical capability for large language model agents operating in complex, real-world scenarios. To address this gap, we introduce ‘AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications’, a new benchmark comprising both real and synthetic agent trajectories paired with expert-curated question answering. Our analysis reveals that existing memory systems struggle due to limitations in causal reasoning and reliance on lossy similarity-based retrieval, prompting us to propose AMA-Agent – a system leveraging a causality graph and tool-augmented retrieval that achieves an 11.16% performance gain on AMA-Bench. Can these advancements pave the way for more robust and reliable autonomous agents capable of sustained, complex problem-solving?


The Memory Bottleneck: Beyond Chatbots and Towards True Agency

The trajectory of Large Language Models (LLMs) extends far beyond their initial implementation as conversational chatbots. Current development focuses on transforming these models into fully autonomous agents capable of independent action and complex problem-solving. This evolution, however, necessitates a significant upgrade to their memory capabilities. Unlike traditional chatbots which often operate within the confines of a single interaction, autonomous agents require the ability to retain and process information over extended periods, learning from past experiences and adapting to changing circumstances. A robust memory system isn’t merely about increasing storage capacity; it demands efficient mechanisms for indexing, retrieving, and applying relevant knowledge to inform decision-making in dynamic environments. Consequently, research is increasingly concentrated on architectures that allow LLMs to not simply access information, but to actively curate, refine, and utilize it as a core component of their autonomous functionality.

Conventional memory systems, often relying on fixed-size context windows or simplistic retrieval mechanisms, demonstrably falter as autonomous agents engage in extended dialogues or complex tasks. These limitations impede an agent’s capacity to effectively reason over long horizons, as crucial information from earlier interactions gets lost or diluted. The inability to reliably recall and integrate past experiences hinders consistent decision-making and prevents the development of genuinely adaptive behavior. Consequently, agents struggle with tasks requiring sustained attention to detail, nuanced understanding of evolving situations, or the ability to learn from prior mistakes – effectively creating a ‘memory bottleneck’ that restricts their overall intelligence and problem-solving capabilities.

Simply increasing the parameter count of Large Language Models will not resolve the inherent memory limitations facing truly autonomous agents. While scaling improves performance on existing tasks, it doesn’t address the core challenge of retaining and effectively utilizing information over extended periods – a necessity for complex, long-horizon reasoning. Current LLM architectures treat context as a fixed-size window, discarding valuable information as interactions progress. A fundamental rethinking of memory management is therefore crucial, moving beyond passive storage to systems capable of actively filtering, summarizing, and retrieving relevant knowledge. This necessitates exploring mechanisms like hierarchical memory structures, external knowledge bases, and sophisticated attention mechanisms that allow agents to dynamically prioritize and recall information, effectively augmenting their internal parameters with a persistent, accessible external memory.
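The idea of augmenting a fixed context window with a persistent, queryable external store can be illustrated with a deliberately minimal sketch. The `ExternalMemory` class and its keyword-overlap scoring below are hypothetical simplifications of this author's own devising, not the paper's mechanism; real systems would use embeddings or structured indices.

```python
import re
from dataclasses import dataclass, field

def _tokens(text: str) -> set:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

@dataclass
class ExternalMemory:
    """Toy persistent memory: notes survive beyond any context window,
    and retrieval ranks them by word overlap with the query."""
    notes: list = field(default_factory=list)

    def write(self, text: str) -> None:
        self.notes.append(text)

    def retrieve(self, query: str, k: int = 2) -> list:
        q = _tokens(query)
        # Rank notes by how many query words they share (descending).
        return sorted(self.notes, key=lambda n: -len(q & _tokens(n)))[:k]

mem = ExternalMemory()
mem.write("User prefers metric units")
mem.write("Flight booked to Berlin on May 3")
mem.write("Budget capped at 500 EUR")
print(mem.retrieve("what is the travel budget?"))
```

Even this naive store shows the architectural point: relevance-ranked recall from an unbounded external buffer, rather than whatever happens to survive in a sliding context window.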

Agent trajectories, unlike those of reasoning or chatbot applications, are characterized by causal grounding, diverse symbolic artifacts, and a high density of objective information.

AMA-Agent: Architecting Memory for Intelligent Systems

AMA-Agent addresses memory management in agentic systems by providing a structured framework designed for both information retention and efficient retrieval. This framework moves beyond basic storage by focusing on maintaining a complete history of agent interactions and observations. The system prioritizes preserving the context surrounding information, allowing the agent to access not just what happened, but why and how it happened. This comprehensive approach facilitates improved reasoning, decision-making, and task completion compared to methods relying solely on compressed or similarity-matched data. The architecture is intended to be adaptable across diverse agent applications and modalities, supporting long-term memory and continuous learning.

The AMA-Agent framework utilizes a Causality Graph as its central data structure for memory management. This graph explicitly represents objective information derived from agent interactions and, critically, the causal relationships between those pieces of information. Nodes within the graph represent factual statements or observations, while directed edges denote causal influence; for example, an action taken by the agent and its subsequent observed effect. This structure allows the agent to not simply recall past events, but to understand why those events occurred, facilitating more robust reasoning, planning, and prediction capabilities beyond those offered by methods reliant solely on temporal or semantic similarity.
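The structure described above – facts as nodes, directed edges denoting causal influence – can be sketched in a few lines. This is an illustrative toy of this author's construction, not AMA-Agent's actual implementation; node and edge contents are invented examples.

```python
from collections import defaultdict

class CausalityGraph:
    """Toy causality graph: nodes are factual statements,
    a directed edge (cause -> effect) means 'causally led to'."""
    def __init__(self):
        self.caused_by = defaultdict(list)  # effect -> its direct causes

    def add_edge(self, cause: str, effect: str) -> None:
        self.caused_by[effect].append(cause)

    def why(self, fact: str) -> list:
        """Trace causal ancestors of `fact`, nearest first (BFS)."""
        chain, frontier, seen = [], [fact], set()
        while frontier:
            node = frontier.pop(0)
            for parent in self.caused_by[node]:
                if parent not in seen:
                    seen.add(parent)
                    chain.append(parent)
                    frontier.append(parent)
        return chain

g = CausalityGraph()
g.add_edge("agent clicked 'checkout'", "payment form displayed")
g.add_edge("payment form displayed", "card declined error")
print(g.why("card declined error"))
```

The `why` traversal is what a similarity-based retriever cannot do: given an observed effect, it recovers the chain of events that produced it, in causal order.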

Traditional memory systems for agents often rely on techniques like compressing past interactions or retrieving information based on similarity to the current query. AMA-Agent distinguishes itself by utilizing a structured knowledge representation, specifically a Causality Graph, to store and access information. This structured approach allows the agent to not only recall what happened, but also why, preserving the causal relationships between events. By representing information as interconnected nodes and edges, AMA-Agent enables more robust context understanding and improved reasoning capabilities, as the agent can infer new knowledge based on established causal links rather than simply pattern matching.

The AMA-Agent transforms trajectories into a causality graph and employs tool-augmented search for information retrieval.

AMA-Bench: A Rigorous Testbed for Agent Memory

AMA-Bench is a newly developed benchmark suite created to specifically assess the performance of memory systems within the context of agent-based applications. Unlike general language model benchmarks, AMA-Bench focuses on the retrieval and utilization of information crucial for agents completing tasks. The suite is designed to provide a standardized method for comparing different memory architectures and retrieval techniques, enabling quantitative evaluation of their efficacy in agent workflows. This targeted approach allows researchers and developers to isolate and improve the memory components critical for building more effective and reliable autonomous agents.

The AMA-Bench benchmark suite utilizes a dual-dataset approach to comprehensively evaluate agent memory systems. The real-world subset consists of question-answer pairs meticulously annotated by subject matter experts, providing a realistic assessment of performance on complex informational tasks. Complementing this, a synthetically generated subset allows for controlled experimentation and scalability testing; by varying parameters within the synthetic data, researchers can systematically analyze how memory systems perform under different conditions and at increasing data volumes, independent of the limitations of real-world data availability.
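The controlled-experimentation role of the synthetic subset can be made concrete with a small generator sketch. Everything here – the "needle" access-code fact, the step format, the QA pair – is a hypothetical illustration of how one *might* build length-controllable trajectories, not AMA-Bench's actual generation pipeline.

```python
import random

def make_synthetic_trajectory(n_steps: int, seed: int = 0):
    """Generate a toy agent trajectory of controllable length containing
    one 'needle' fact, plus a QA pair probing whether a memory system
    can recover it (illustrative only)."""
    rng = random.Random(seed)
    steps = [f"step {i}: agent observed item {rng.randint(0, 999)}"
             for i in range(n_steps)]
    # Bury a single objective fact at a random position.
    needle_pos = rng.randrange(n_steps)
    code = rng.randint(1000, 9999)
    steps[needle_pos] = f"step {needle_pos}: agent stored access code {code}"
    qa = {"question": "What access code did the agent store?",
          "answer": str(code)}
    return steps, qa

steps, qa = make_synthetic_trajectory(100)
```

Because `n_steps` is a free parameter, the same probe can be replayed at any trajectory length, isolating how retrieval quality degrades with scale.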

Evaluation on the AMA-Bench benchmark demonstrates that AMA-Agent achieves an average accuracy of 0.5722. This performance surpasses that of the strongest Retrieval-Augmented Generation (RAG) baseline, HippoRAG2, which attained an accuracy of 0.4480 on the same benchmark. Furthermore, AMA-Agent also outperforms the leading memory method, MemoRAG, which achieved an average accuracy of 0.4606. These results indicate that AMA-Agent represents a significant improvement in performance when evaluated against established baselines and current state-of-the-art memory techniques within agent-centric applications.

The model demonstrates consistent performance across diverse agent task families within the AMA-Bench benchmark.

Hybrid Retrieval: Extracting Insight Beyond Simple Search

AMA-Agent distinguishes itself through a novel approach to information extraction: Hybrid Tool-Augmented Retrieval. Unlike conventional methods that rely solely on pre-trained models or keyword searches, this system dynamically integrates external tools into the retrieval process. This allows the agent to not simply find information, but to actively process and refine it, yielding more accurate and contextually relevant results. By combining the strengths of large language models with specialized tools – such as calculators, search engines, or APIs – AMA-Agent overcomes limitations inherent in static knowledge bases and significantly improves its capacity to handle complex queries and nuanced information needs. This hybrid methodology represents a substantial step towards more robust and adaptable information extraction systems.

The AMA-Agent distinguishes itself through a strategic integration of external tools to refine information retrieval. Rather than relying solely on internal knowledge, the agent actively queries specialized resources – such as search engines, knowledge bases, or APIs – to augment its understanding of a given query. This external consultation isn’t simply about gathering more data; it’s a focused process designed to filter noise and prioritize information directly relevant to the task at hand. By cross-referencing and validating information from these external sources, the agent significantly improves the precision and reliability of its retrieved data, enabling more accurate and insightful responses to complex prompts. This hybrid approach represents a shift from passive knowledge recall to active information synthesis, ultimately bolstering the agent’s capacity for nuanced and contextually aware information extraction.
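The routing idea – consult a specialized tool rather than rely on recall alone – can be sketched as follows. The router heuristic, the `calculator`, and the `keyword_search` corpus are all invented for illustration; the paper's actual tool set and dispatch logic are not specified here.

```python
def calculator(expr: str) -> str:
    """Very restricted arithmetic evaluator, for the sketch only."""
    allowed = set("0123456789+-*/(). ")
    if set(expr) <= allowed:
        return str(eval(expr))
    raise ValueError("unsupported expression")

def keyword_search(query: str, corpus: list) -> str:
    """Return the corpus document sharing the most words with the query."""
    words = set(query.lower().split())
    return max(corpus, key=lambda doc: len(words & set(doc.lower().split())))

def tool_augmented_answer(query: str, corpus: list) -> str:
    """Route the query to an external tool instead of answering from recall."""
    if any(ch.isdigit() for ch in query) and any(op in query for op in "+-*/"):
        return calculator(query)
    return keyword_search(query, corpus)

corpus = ["The meeting was moved to Friday", "Server logs rotate nightly"]
print(tool_augmented_answer("12 * (3 + 4)", corpus))   # routed to the calculator
print(tool_augmented_answer("when is the meeting", corpus))
```

The point of the hybrid design is visible even at this scale: arithmetic goes to a tool that computes exactly, while open questions fall back to retrieval, so neither path is forced to do the other's job.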

Evaluations using the AMA-Bench benchmark reveal that AMA-Agent achieves a Recall score of 0.6238, signifying a substantial capacity for identifying and retrieving pertinent information when faced with intricate tasks. This metric demonstrates the system’s effectiveness in not simply accessing data, but in pinpointing precisely the information necessary to address complex queries. The achieved Recall indicates a high degree of accuracy in minimizing missed relevant details, suggesting AMA-Agent’s potential to significantly improve performance in applications requiring comprehensive and nuanced information gathering – a critical capability for advanced AI systems operating in real-world scenarios.
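For readers unfamiliar with the metric, recall is simply the fraction of relevant items that the system actually retrieved; a score of 0.6238 means roughly 62% of the evidence needed for a query was surfaced. The item sets below are made-up examples.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant items that were actually retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

relevant = {"fact_a", "fact_b", "fact_c", "fact_d"}
retrieved = {"fact_a", "fact_c", "fact_x"}
print(recall(retrieved, relevant))  # 2 of 4 relevant items found -> 0.5
```

Note that recall penalizes only missed evidence; retrieving the irrelevant `fact_x` costs nothing here, which is why recall is usually read alongside precision.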

Towards True Intelligence: The Memory-Driven Future of Autonomous Systems

The development of genuinely intelligent autonomous systems hinges significantly on their capacity to effectively manage and utilize memory. Unlike traditional artificial intelligence, which often relies on immediate data processing, human intelligence excels at retaining, organizing, and retrieving past experiences to inform present decisions. Replicating this capability in machines requires more than simply increasing storage capacity; it demands sophisticated architectures capable of long-term dependency modeling, efficient knowledge representation, and robust handling of noisy or incomplete information. A system’s ability to learn from extended interactions, adapt to changing environments, and reason about complex scenarios is directly proportional to the quality of its memory management – making it a cornerstone for achieving true autonomy and general intelligence.

The development of genuinely intelligent autonomous systems hinges on advanced memory capabilities, and recent work with AMA-Agent and AMA-Bench establishes a significant step forward. This novel framework offers a standardized platform for evaluating and constructing memory architectures designed to handle complex, real-world scenarios. By providing a robust benchmark and a performant agent, researchers gain a crucial foundation for exploring innovative approaches to memory management. AMA-Bench’s comprehensive suite of tests allows for rigorous assessment of an agent’s ability to retain, update, and abstract information over extended sequences, ultimately accelerating the creation of more scalable and reliable autonomous systems capable of tackling increasingly intricate challenges.

Recent advancements in autonomous agent technology are being rigorously tested and quantified through benchmarks like AMA-Bench, with the AMA-Agent demonstrating promising results across key cognitive abilities. The agent achieves a Causal Inference score of 0.6145, indicating its capacity to understand cause-and-effect relationships, alongside a State Updating score of 0.5305, reflecting its ability to accurately maintain an internal representation of the world. Furthermore, the agent’s State Abstraction capability, measured at 0.4719, showcases its capacity to generalize from specific instances. Critically, AMA-Agent doesn’t falter as complexity increases; it sustains consistently high accuracy even when processing sequences of up to 128,000 tokens – a substantial leap towards handling the extended reasoning and long-term dependencies necessary for true intelligence in complex environments.

Performance is primarily determined by the memory architecture, with scaling the model backbone contributing only marginal improvements.

The pursuit of robust agent memory, as detailed in this work concerning AMA-Bench, inherently demands a willingness to challenge existing limitations. This mirrors the spirit of rigorous inquiry – a systematic dismantling of assumptions to reveal underlying truths. Paul Erdős famously stated, “A mathematician knows a lot of things, but a physicist knows the deep things.” The study meticulously constructs a benchmark not merely to measure long-horizon reasoning, but to actively stress-test the boundaries of current LLM agent capabilities. By deliberately seeking out failure points via the Causality Graph and retrieval-augmented generation, the researchers are effectively reverse-engineering the very essence of memory and reasoning, akin to a physicist probing the fundamental laws governing a system. The entire undertaking embodies a commitment to understanding not just what works, but why it works – and, crucially, what happens when it doesn’t.

Beyond Recall: Charting the Course

The construction of AMA-Bench, and of systems like AMA-Agent, reveals a predictable truth: merely having a long-term memory isn’t the challenge. The bottleneck resides in discerning what deserves to be remembered and, more crucially, how to reconstruct a coherent causal narrative from fragmented recollections. Current LLM agents excel at pattern matching, but struggle when faced with novelty – a predictable failure given their training data. True agency demands not just recall, but a capacity to extrapolate from incomplete information, to build predictive models of the world, even – and perhaps especially – when those models are demonstrably wrong.

The emphasis on causality graphs within AMA-Agent is a sensible direction, yet it implicitly concedes a limitation: the world isn’t neatly organized into such structures. The next iteration of this work will undoubtedly require confronting the inherent messiness of real-world interactions – the spurious correlations, the hidden variables, and the sheer unpredictability of complex systems. Benchmarking against perfectly defined tasks only postpones the inevitable reckoning with genuine ambiguity.

Ultimately, the value of benchmarks like AMA-Bench lies not in achieving ever-higher scores, but in precisely identifying the points of failure. The system doesn’t become intelligent by conquering the test; it becomes intelligent by revealing the test’s inadequacies. It’s a process of iterative demolition, a dismantling of assumptions. And that, one suspects, is where the real progress lies.


Original article: https://arxiv.org/pdf/2602.22769.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-01 07:49