The Self-Improving AI: Agents That Learn and Remember

Author: Denis Avetisyan


A new approach to artificial intelligence focuses on building agents that actively seek out and retain knowledge, enhancing their performance over time without traditional parameter updates.

Memory agents, rather than passively storing information, actively cultivate and refine knowledge, embodying a shift from static recall to dynamic, self-directed learning.

This review details U-Mem, a cost-aware memory agent leveraging Thompson Sampling to achieve state-of-the-art autonomous learning with large language models.

Existing memory agents enhance large language models with external knowledge, yet remain largely passive recipients of information, limiting their ability to proactively address uncertainty or overcome knowledge gaps. This paper, ‘Towards Autonomous Memory Agents’, introduces U-Mem, a novel agent designed to actively acquire, validate, and curate knowledge using a cost-aware cascade and semantic-aware Thompson sampling. Experimental results demonstrate that U-Mem consistently outperforms prior memory baselines – improving HotpotQA and AIME25 by up to 14.6 and 7.33 points, respectively – even surpassing reinforcement learning-based optimization. Could this approach unlock truly lifelong learning capabilities for large language models, enabling them to adapt and improve continuously without parameter updates?


The Illusion of Context: Why LLMs Struggle to Remember

Despite the remarkable advancements in Large Language Models (LLMs), their ability to process information is inherently constrained by a finite context window – the maximum length of text an LLM can consider at one time. This limitation poses a significant hurdle when dealing with complex tasks requiring reasoning over extensive knowledge, as LLMs struggle to retain and effectively utilize information exceeding this window. While increasing context window size is a current area of research, it quickly becomes computationally expensive and doesn’t fully address the problem of information loss or dilution within the window. Essentially, even the most powerful LLMs can exhibit diminished performance when faced with tasks demanding long-term memory or comprehensive understanding beyond their immediate contextual grasp, highlighting a fundamental architectural challenge in achieving truly robust artificial intelligence.

Effective reasoning often demands synthesizing information exceeding the immediate grasp of any single processing unit; thus, architectures are evolving beyond simple context windows. These systems don’t simply ingest vast datasets, but instead prioritize selective retrieval – pinpointing and incorporating only the most relevant knowledge when needed. This approach tackles the computational burden of processing lengthy inputs while also addressing the challenge of maintaining semantic consistency as information accumulates. Instead of a static snapshot of data, these agents build dynamic knowledge bases, allowing them to adapt and refine their understanding as new information becomes available, mirroring the human capacity to draw upon long-term memory and integrate it with present circumstances.

Current methods attempting to overcome limited context windows in large language models face significant hurdles. Simply increasing the amount of retrieved information proves computationally expensive, demanding substantial resources for processing and analysis as the knowledge base expands. Beyond cost, a critical challenge lies in preserving semantic coherence; as models integrate information from numerous sources, the risk of introducing contradictions or losing the thread of reasoning increases dramatically. The tendency to prioritize breadth over depth can lead to superficial connections and ultimately diminish the quality of generated outputs. Effectively balancing the need for extensive knowledge with the capacity to synthesize it meaningfully remains a key obstacle in building truly robust and scalable language-based agents.

The inherent limitations of current large language models regarding context length demand a shift in how artificial intelligence constructs and utilizes memory. Traditional approaches, reliant on processing entire inputs at once, become computationally unsustainable and semantically fragile when dealing with extensive knowledge bases. Consequently, research is pivoting toward agent architectures that prioritize selective information retrieval and integration – systems capable of dynamically accessing and synthesizing relevant data as needed. This emerging paradigm envisions agents possessing robust, adaptable memory systems that mimic human cognitive processes, allowing them to reason effectively over prolonged interactions and complex scenarios without being constrained by fixed context windows. The focus is no longer solely on model size, but on building intelligent systems that know what they need to know, and can access it efficiently.

U-Mem demonstrates significantly lower average token usage compared to ReasoningBank, ReMe, MemRL, and a baseline without memory, indicating improved efficiency in processing information.

U-Mem: A Necessary Decoupling of Memory and Mind

U-Mem is a Memory Agent architecture that addresses the limitations of Large Language Models (LLMs) regarding knowledge retention and reasoning over large datasets. Traditional LLMs are constrained by their context window size, hindering their ability to effectively utilize extensive information. U-Mem overcomes this by decoupling knowledge storage from the LLM itself, utilizing an external memory store. This allows the LLM to access and process information beyond its immediate context window, enabling more comprehensive reasoning and improved performance on tasks requiring broad knowledge bases. The architecture facilitates the augmentation of LLM capabilities with persistent, retrievable knowledge, effectively expanding their working memory and improving their ability to handle complex queries and tasks.

The Cost-Aware Knowledge Extraction Cascade within U-Mem is designed to optimize knowledge acquisition by dynamically prioritizing information sources based on associated computational costs. This cascade doesn’t simply retrieve data; it evaluates the resource expenditure – measured in API calls, token usage, and processing time – required to obtain specific knowledge. Lower-cost sources are preferentially queried first, and the system employs a tiered approach where higher-cost sources are only accessed if the necessary information isn’t found in the more efficient tiers. This prioritization is crucial for maintaining operational efficiency, particularly when dealing with large knowledge domains and limited computational resources, and allows U-Mem to maximize the amount of relevant information gathered within a given budget.
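The tiered escalation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tier names, costs, and stored answers below are invented for the example, and real cost would be measured in API calls and tokens rather than a single scalar.

```python
def make_tier(name, cost, answers):
    """Return a (name, cost, lookup_fn) tuple; lookup_fn may miss (return None)."""
    return (name, cost, lambda q: answers.get(q))

def cascade_extract(query, tiers, budget):
    """Query tiers cheapest-first; escalate to a pricier tier only on a miss,
    and only while the remaining budget covers that tier's cost."""
    spent = 0.0
    for name, cost, lookup in sorted(tiers, key=lambda t: t[1]):
        if spent + cost > budget:
            break  # next tier would blow the budget; give up
        spent += cost
        answer = lookup(query)
        if answer is not None:
            return answer, spent, name
    return None, spent, None

# Illustrative tiers: a free cache, a cheap model, an expensive model.
tiers = [
    make_tier("local_cache", 0.0, {"capital_fr": "Paris"}),
    make_tier("small_llm", 1.0, {"capital_fr": "Paris", "aime_hint": "casework"}),
    make_tier("large_llm", 10.0, {"aime_hint": "casework", "rare_fact": "42"}),
]

answer, spent, source = cascade_extract("rare_fact", tiers, budget=20.0)
```

A query the cache can answer costs nothing; only a query that misses the cheap tiers escalates to the expensive one, and a tight budget stops the escalation early.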

Contrastive Reflection within U-Mem’s Cost-Aware Knowledge Extraction Cascade functions by analyzing the outcomes of knowledge acquisition attempts. Both successful extractions – those that demonstrably improve LLM performance on downstream tasks – and failed attempts are used to adjust the cascade’s parameters. Specifically, the system maintains a record of feature vectors associated with both positive and negative examples, utilizing this data to refine the scoring function that determines which knowledge sources are prioritized. This process effectively trains the cascade to identify and extract information more efficiently, minimizing resource expenditure while maximizing the relevance and utility of acquired knowledge. The system updates these feature vectors based on the observed impact on LLM reasoning, creating a feedback loop that continuously optimizes knowledge extraction strategies.

U-Mem employs an External Memory Store to address the limitations of Large Language Model (LLM) context windows. Traditional LLMs are constrained by a fixed input size, hindering their ability to process and retain extensive knowledge. By offloading knowledge storage to a separate, scalable memory store – independent of the LLM’s parameters – U-Mem circumvents this constraint. This decoupling allows the LLM to access and reason over a significantly larger knowledge base than would be possible within its context window, enabling more complex and informed responses without requiring the entire knowledge base to be present in the current input. The External Memory Store facilitates persistent knowledge retention and retrieval, enhancing the LLM’s long-term reasoning capabilities and reducing computational costs associated with repeatedly providing the same information within the context window.
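A toy version of such a store makes the decoupling concrete. Production systems use learned dense embeddings and a vector index; the bag-of-words vectors and linear scan below are stand-in assumptions so the sketch stays self-contained.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a learned embedding: word-count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ExternalMemory:
    """Knowledge lives here, outside the model's parameters and context window."""

    def __init__(self):
        self.entries = []  # (text, embedding) pairs

    def store(self, text):
        self.entries.append((text, embed(text)))

    def retrieve(self, query, k=2):
        # Only the top-k matches are handed to the LLM, not the whole store.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

mem = ExternalMemory()
mem.store("thompson sampling balances exploration and exploitation")
mem.store("hotpotqa requires multi hop reasoning")
mem.store("the context window bounds what an llm can attend to")
```

The store can grow without bound while the prompt stays small: each query pulls in only the few entries relevant to it.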

U-Mem is an architecture leveraging a unified memory space to facilitate efficient data sharing and processing across different modules.

Navigating the Knowledge Landscape: Exploration and Exploitation

U-Mem employs Thompson Sampling, a Bayesian algorithm, to dynamically regulate the balance between exploring novel information and exploiting existing knowledge during retrieval. This probabilistic approach maintains a distribution over the potential reward – or utility – associated with each memory. At each retrieval step, a sample is drawn from these distributions to select a memory; memories with higher estimated rewards are more likely to be selected, representing exploitation. However, Thompson Sampling inherently allocates some probability to selecting memories with lower estimated rewards, facilitating exploration of potentially valuable, yet currently uncertain, information. This adaptive strategy allows U-Mem to efficiently refine its knowledge base and improve performance over time by continuously updating the reward estimates based on observed outcomes.
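The core loop is small enough to show directly. This is a generic Thompson sampling sketch, not U-Mem's code: each memory keeps a Beta posterior over its usefulness, every posterior is sampled, and the argmax is retrieved, so uncertain memories still win occasionally. The memory names and the simulated reward are illustrative assumptions.

```python
import random

class ThompsonMemorySelector:
    def __init__(self, memory_ids):
        # Beta(1, 1) prior: maximal uncertainty before any feedback.
        self.posteriors = {m: [1.0, 1.0] for m in memory_ids}

    def select(self, rng):
        # Sample a plausible usefulness for every memory, pick the best draw.
        samples = {m: rng.betavariate(a, b) for m, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, memory_id, helped):
        # Success increments alpha, failure increments beta.
        a_b = self.posteriors[memory_id]
        a_b[0 if helped else 1] += 1.0

rng = random.Random(0)
sel = ThompsonMemorySelector(["proof_sketch", "stale_note"])
for _ in range(200):
    m = sel.select(rng)
    # Simulated environment: one memory is genuinely useful, one never is.
    sel.update(m, helped=(m == "proof_sketch"))
```

After a couple hundred rounds the posterior mean of the useful memory dominates, and retrieval concentrates on it without ever fully abandoning the alternative.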

Semantic Relevance within U-Mem’s Thompson Sampling framework utilizes vector embeddings to quantify the relationship between the current task and stored memories. These embeddings, generated from both the task description and memory content, are used to calculate a similarity score – typically using cosine similarity – which directly influences the probability of a memory being retrieved. Higher similarity scores indicate a stronger semantic connection, increasing the likelihood that the memory will be sampled during the retrieval process, thereby prioritizing information most pertinent to the present task. This ensures the exploration phase focuses on potentially relevant memories, while exploitation leverages previously successful, semantically-aligned experiences.
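One simple way to fold relevance into the sampler, sketched under assumptions rather than taken from the paper, is to weight each memory's sampled usefulness by its cosine similarity to the task embedding. The 2-dimensional embeddings and Beta parameters below are invented for illustration.

```python
import math, random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_select(task_vec, memories, rng):
    """memories: {id: (embedding, alpha, beta)}; returns the chosen memory id.
    Each Thompson draw is scaled by semantic similarity to the task."""
    best, best_score = None, -1.0
    for mid, (emb, a, b) in memories.items():
        score = cosine(task_vec, emb) * rng.betavariate(a, b)
        if score > best_score:
            best, best_score = mid, score
    return best

memories = {
    "geometry_trick": ((1.0, 0.0), 5.0, 1.0),  # on-topic and proven useful
    "parsing_note":   ((0.0, 1.0), 5.0, 1.0),  # equally proven, but off-topic
}
```

With identical usefulness posteriors, the on-topic memory is selected almost every time, so exploration stays focused on semantically plausible candidates.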

Utility-Driven Retrieval within U-Mem functions by assigning a utility score to each retrievable memory, reflecting its predicted contribution to task performance. This scoring is determined by evaluating the memory’s relevance to the current state and its potential to reduce prediction error or improve decision-making. During retrieval, memories with higher utility scores are prioritized, increasing the probability of selecting information that directly addresses the present challenge. This contrasts with purely similarity-based retrieval and allows the system to proactively seek out memories demonstrably useful for achieving optimal results, even if those memories are not the most semantically similar to the current input.
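The contrast with similarity-only retrieval is easiest to see side by side. The utility numbers below stand for measured downstream gains (e.g. accuracy deltas when a memory was injected); both memories and all values are illustrative assumptions.

```python
def retrieve(memories, key, k=1):
    """Rank memories by the given scoring key and return the top k."""
    return sorted(memories, key=key, reverse=True)[:k]

memories = [
    {"id": "near_duplicate", "similarity": 0.97, "utility": 0.05},
    {"id": "hard_won_fix",   "similarity": 0.61, "utility": 0.80},
]

by_similarity = retrieve(memories, key=lambda m: m["similarity"])
by_utility = retrieve(memories, key=lambda m: m["utility"])
# A near-duplicate of the query wins on similarity alone, but the memory
# that actually moved task performance wins under utility-driven retrieval.
```

The design choice this illustrates: a memory that merely restates the query adds little, while a less similar memory encoding a hard-won correction can be exactly what the task needs.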

U-Mem distinguishes between two primary memory stores: Global Procedural Memory and Local Corrective Memory. Global Procedural Memory functions as a repository for generalized knowledge and skills acquired through experience, providing a broad base of information applicable to a range of tasks. In contrast, Local Corrective Memory is dedicated to storing specific, targeted fixes for previously encountered errors or suboptimal performance. This allows the system to rapidly adapt to novel situations by applying previously learned corrections rather than relearning from scratch, improving efficiency and robustness in dynamic environments. The interaction between these two memory types enables both long-term knowledge retention and immediate performance optimization.
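The interaction between the two stores can be sketched as a two-step lookup, with a known failure mode taking priority over the generic strategy. The stored entries are illustrative assumptions, not content from the paper.

```python
class DualMemory:
    def __init__(self):
        self.procedural = {}  # global: generalized skills, task_type -> strategy
        self.corrective = {}  # local: targeted fixes, error_signature -> patch

    def recall(self, task_type, error_signature=None):
        # A remembered fix for this exact failure beats the generic strategy.
        if error_signature in self.corrective:
            return self.corrective[error_signature]
        return self.procedural.get(task_type, "no strategy stored")

mem = DualMemory()
mem.procedural["multi_hop_qa"] = "decompose into single-hop sub-questions"
mem.corrective["dropped_second_hop"] = "re-check that every sub-answer is used"
```

A fresh task falls through to the broad procedural strategy; a task exhibiting a previously seen error signature gets the stored correction immediately, without relearning.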

A strong positive correlation (r = 0.888) between task similarity (measured by AMCS) and performance gain demonstrates that U-Mem benefits most from tasks closely related to those it has previously learned.

Demonstrating Robustness: A Benchmark of Performance

To rigorously assess its capabilities, U-Mem underwent evaluation across a suite of challenging benchmarks designed to probe different facets of AI agent performance. These included HotpotQA, a question-answering dataset requiring multi-hop reasoning; AIME, a set of competition-level mathematics problems; AdvancedIF, which tests the ability to follow complex instructions; and HelpSteer3, which evaluates helpfulness and response quality. This comprehensive benchmarking strategy ensured a robust and nuanced understanding of U-Mem’s strengths and weaknesses, providing a solid foundation for comparing its performance against established baseline models and state-of-the-art techniques in the field.

Rigorous evaluation demonstrates that U-Mem consistently surpasses the performance of existing baseline models across a suite of challenging benchmarks. Specifically, when paired with the Qwen2.5-7B language model, U-Mem achieves substantial gains in complex reasoning and information retrieval tasks; improvements reach 14.6 points on the HotpotQA benchmark, which requires multi-hop reasoning over knowledge sources, and 7.33 points on the AIME25 benchmark of competition-level mathematics problems. These results indicate a significant advancement in AI agent performance, showcasing U-Mem’s ability to effectively leverage and retain information for improved decision-making and task completion.

Evaluations on the HotpotQA benchmark demonstrate U-Mem’s capacity for complex reasoning and knowledge integration, achieving an accuracy of 52.4%. This result surpasses the performance of previously established state-of-the-art methods, indicating a substantial advancement in multi-hop question answering capabilities. HotpotQA, known for its challenging questions requiring synthesis of information from multiple supporting documents, serves as a rigorous testbed for AI agents. The achieved accuracy suggests U-Mem effectively navigates this complexity, identifying and combining relevant details to provide accurate responses and represents a significant step towards more reliable and insightful AI systems.

Evaluations on the AIME25 benchmark demonstrate U-Mem’s competitive performance on competition-level mathematical reasoning, achieving an accuracy of 18.67%. This result is particularly noteworthy because U-Mem not only matches but, in several instances, surpasses the performance of established reinforcement learning (RL) baselines. AIME25 problems demand long chains of precise reasoning, and U-Mem’s ability to retain and reuse relevant solution strategies across attempts positions it as a strong contender for building more capable reasoning systems.

To ensure reliable and impartial assessment, U-Mem’s performance isn’t simply judged by arbitrary metrics; evaluations are meticulously grounded in established ‘Ground Truth’ datasets, representing definitive answers or ideal responses. Beyond simple accuracy, a ‘Preference Score’ system is employed, quantifying the degree to which outputs are favored over alternatives – mirroring human judgment and capturing nuanced quality. This dual approach – objective verification against known standards coupled with a preference-based assessment – provides a robust and comprehensive evaluation framework, allowing for meaningful comparisons against existing models and validating the improvements demonstrated by U-Mem across various benchmarks. The combination ensures that observed gains reflect genuine enhancements in AI agent capability, rather than statistical anomalies or biases inherent in single-metric evaluations.

The demonstrated performance of U-Mem across multiple demanding benchmarks suggests a substantial advancement in the capabilities of artificial intelligence agents. Rigorous evaluation on datasets like HotpotQA, AIME, and AdvancedIF revealed consistent outperformance compared to existing models, with gains of up to 14.6 points on complex reasoning tasks. This improvement isn’t merely incremental; U-Mem achieves state-of-the-art accuracy on HotpotQA and matches or surpasses reinforcement learning baselines on AIME25, indicating a heightened capacity for both knowledge retrieval and nuanced understanding. The ability to consistently deliver enhanced performance, measured through both Ground Truth and Preference Scores, positions U-Mem as a promising architecture for building more effective and intelligent systems capable of tackling complex challenges.

Evaluation of U-Mem on the HotpotQA dataset demonstrates scaling trends with increased model parameters.

Towards Lifelong Learning: The Future of Autonomous Agents

Current research endeavors are concentrating on expanding the capabilities of U-Mem by substantially increasing its capacity to accommodate vastly larger knowledge bases and tackle increasingly intricate tasks. This scaling process isn’t merely about increasing storage; it requires innovations in memory access and retrieval algorithms to maintain efficiency and prevent bottlenecks as the knowledge base grows. The objective is to enable U-Mem to process and integrate information from diverse sources, reason about complex scenarios, and ultimately, perform tasks demanding a higher level of cognitive ability. Successfully scaling U-Mem promises to unlock the potential for more robust and adaptable autonomous agents capable of operating effectively in dynamic and unpredictable real-world environments.

The development of genuinely autonomous agents hinges on their capacity for continuous learning and adaptation, moving beyond static knowledge to embrace ongoing refinement. Current AI systems often struggle with situations not explicitly encountered during training; however, future research prioritizes mechanisms allowing agents to assimilate new information and adjust their behavior in real-time. This necessitates exploring techniques like meta-learning – enabling agents to learn how to learn – and incremental learning, where knowledge is accumulated without catastrophic forgetting. Such approaches will allow agents to navigate dynamic and unpredictable environments, improving performance over time and exhibiting a form of cognitive flexibility essential for true autonomy. Ultimately, the ability to learn and adapt continuously will define the next generation of intelligent systems, enabling them to function effectively across a widening range of complex tasks and scenarios.

Combining U-Mem with reinforcement learning presents a compelling avenue for advancing artificial intelligence. While reinforcement learning excels at learning through trial and error, it often struggles with complex, long-horizon tasks requiring substantial background knowledge. U-Mem can address this limitation by providing a structured, readily accessible knowledge base that informs the agent’s decision-making process. This synergistic approach allows the agent to leverage past experiences and generalize more effectively, potentially accelerating learning and improving performance in novel situations. The integration isn’t merely about supplementing reinforcement learning; it’s about creating a system where procedural knowledge – learned through interaction – and declarative knowledge – stored in U-Mem – work in concert, fostering a more robust and adaptable intelligence capable of tackling increasingly challenging real-world problems.

The development of this unified memory framework represents a significant step towards realizing truly adaptable and intelligent agents capable of thriving in complex, real-world scenarios. By consolidating knowledge representation and reasoning into a single, accessible system, these agents move beyond pre-programmed responses and towards genuine understanding. This allows for not only the recall of past experiences but also the flexible application of that knowledge to novel situations, fostering robust problem-solving capabilities. The potential impact extends to a wide range of applications, from personalized robotics and assistive technologies to advanced automation and scientific discovery, ultimately paving the way for systems that can learn, evolve, and operate autonomously within dynamic environments.

U-Mem demonstrates substantially faster training than reinforcement learning (GRPO) while achieving comparable or improved performance.

The pursuit of autonomous agents, as demonstrated by U-Mem, isn’t about constructing perfect systems, but cultivating resilience within complex ecosystems. This work, focusing on cost-aware knowledge acquisition and efficient memory management, acknowledges the inevitable entropy inherent in any dynamic system. As John von Neumann observed, “There are no best practices – only survivors.” U-Mem doesn’t attempt to prevent memory failures or information overload; instead, it learns to navigate them through continuous exploration and selective retention, echoing the principle that order is merely a temporary reprieve before the next cascade of events. The agent’s capacity for autonomous learning highlights a shift from rigid architectures to adaptable systems that thrive amidst uncertainty.

What Lies Ahead?

U-Mem, like all such constructions, merely postpones the inevitable entropy. It builds a better forgetting curve, a more efficient decay. The achievement isn’t intelligence, but prolonged functionality in the face of overwhelming data. The paper demonstrates a capacity to select what to remember, but selection implies a future justification – a prophecy of relevance. What happens when that prophecy fails? The cost-awareness is a palliative, not a solution; a budgeting of inevitable loss. Every successful retrieval is a temporary reprieve from the void.

The focus on Thompson sampling, while effective, feels like an attempt to impose order on a fundamentally chaotic process. The agent learns what to remember, but not why remembering matters. Future work will inevitably grapple with the question of value – a value not defined by immediate reward, but by some more abstract, long-term coherence. Can an agent develop a sense of its own epistemic horizons? Or is it destined to endlessly optimize for locally defined goals, a sophisticated echo chamber?

The real challenge isn’t building agents that remember more, but agents that know when to stop remembering. A truly autonomous system will need to cultivate a graceful surrender – a willingness to let go, to forget, to admit the limits of its own knowledge. Documentation of this process, of course, will be sparse. No one writes prophecies after they come true.


Original article: https://arxiv.org/pdf/2602.22406.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-01 09:33