Author: Denis Avetisyan
A new approach reframes agent self-improvement as the reliable accumulation of skills, focusing on verifiable evidence and controlled generalization.
This work introduces ASG-SI, a system for agentic reinforcement learning built on audited skill graphs, verifiable rewards, experience synthesis, and continual memory.
Despite advances in agentic reinforcement learning, ensuring the security and reproducibility of self-improving large language models remains a critical challenge. This paper introduces ‘Audited Skill-Graph Self-Improvement for Agentic LLMs via Verifiable Rewards, Experience Synthesis, and Continual Memory’, a framework that reframes agent self-improvement as the iterative compilation of auditable, reusable skills grounded in verifiable evidence. By decomposing rewards, synthesizing experiences for robust testing, and controlling memory, ASG-SI facilitates measurable progress and independent auditability. Could this approach pave the way for truly governable and reliable autonomous AI agents operating over extended horizons?
Deconstructing the Limits of Traditional Reinforcement Learning
Contemporary reinforcement learning algorithms often falter when confronted with tasks demanding extended sequences of actions and intricate planning. These systems frequently develop brittle policies – solutions that perform well within the narrow confines of their training environment but degrade rapidly when faced with even slight variations. This limited generalization stems from an over-reliance on memorizing specific state-action pairings rather than learning robust, underlying principles. Consequently, agents struggle to adapt to novel situations or unseen environmental changes, necessitating extensive retraining for each new challenge. The inherent difficulty in exploring vast state spaces within long-horizon tasks further exacerbates this problem, hindering the development of policies capable of consistently achieving desired outcomes beyond the immediately observable consequences of an action.
Simply increasing the parameters of a reinforcement learning model doesn’t address the fundamental challenges of complex tasks; scaling alone tends to reproduce the same brittleness. A more robust solution lies in prioritizing the development of reusable skills – modular components of behavior that can be combined and adapted across different situations. Crucially, these skills must also be verifiable, meaning their performance can be rigorously assessed and trusted before deployment. This emphasis on composability and auditability represents a shift from training monolithic agents to building systems where capabilities are explicitly defined, tested, and assembled, ultimately fostering more reliable and generalizable intelligence.
A fundamental limitation of current reinforcement learning approaches lies in their reliance on monolithic training – agents learned as single, end-to-end systems. Increasingly, research suggests a more robust path forward involves compositional architectures, where complex behaviors emerge from the assembly of simpler, reusable skills. This paradigm shifts the focus from training a single, all-encompassing policy to building and verifying individual capabilities, much like assembling building blocks. Crucially, this compositional approach isn’t solely about modularity; it demands rigorous auditing of each skill to ensure reliability and safety before integration. By prioritizing verifiable behavior at each component level, developers can create agents that are not only more adaptable to novel situations, but also demonstrably predictable and trustworthy – addressing a critical challenge in deploying AI systems in real-world applications.
The Skill Graph: Architecting for Auditable Intelligence
The Audited Skill Graph (ASG) within ASG-SI represents skills as discrete, explicitly defined nodes connected by relationships indicating dependencies and interfaces. Each skill is characterized not simply by its function, but by a formal specification of its inputs, outputs, and preconditions for execution. This explicit definition allows the system to verify the compatibility of skills before composing them into more complex behaviors. Dependencies are represented as directed edges, indicating which skills require the output of others to operate, creating a structured and auditable network of capabilities. The interfaces define the format and type of data exchanged between skills, ensuring interoperability and enabling the system to reason about skill compatibility and potential integration pathways.
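To make this concrete, here is a minimal Python sketch of such a graph. All names here (`Skill`, `SkillGraph`, `can_feed`) are hypothetical illustrations, not the paper’s actual API; the point is that interface compatibility is checked before a dependency edge is ever added:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A node in the skill graph: a capability with an explicit interface."""
    name: str
    inputs: dict          # input name -> expected type, e.g. {"query": str}
    outputs: dict         # output name -> produced type
    preconditions: tuple = ()   # predicates that must hold before execution

@dataclass
class SkillGraph:
    skills: dict = field(default_factory=dict)   # name -> Skill
    edges: set = field(default_factory=set)      # (producer, consumer) pairs

    def add_skill(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def can_feed(self, producer: str, consumer: str) -> bool:
        """True only if the producer's outputs satisfy every consumer input."""
        p, c = self.skills[producer], self.skills[consumer]
        return all(p.outputs.get(name) == typ for name, typ in c.inputs.items())

    def connect(self, producer: str, consumer: str) -> None:
        """Add a dependency edge only after verifying interface compatibility."""
        if not self.can_feed(producer, consumer):
            raise TypeError(f"{producer} cannot feed {consumer}: interface mismatch")
        self.edges.add((producer, consumer))
```

Failing loudly at composition time is the point of this design: an interface mismatch surfaces before execution rather than during it.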
The Audited Skill Graph (ASG) architecture promotes modularity by representing skills as discrete nodes with defined inputs and outputs, facilitating independent development and testing. Reusability is achieved through the graph’s ability to connect existing skill modules in novel combinations, avoiding redundant implementation. Compositional generalization arises from the system’s capacity to learn new skills by composing known skills, enabling adaptation to previously unseen tasks without requiring entirely new training data; this compositional approach accelerates learning as the system builds upon existing knowledge rather than learning from scratch for each new problem.
The Skill Compiler functions by analyzing successful task trajectories – sequences of states and actions resulting in goal achievement – to identify and isolate repeatable behavioral patterns. This process involves extracting state-action pairs and generalizing them into reusable skill primitives. Normalization is achieved through a consistent representation of skill interfaces, defining clear input expectations and output consequences. The compiler outputs these normalized skill primitives as nodes within the Audited Skill Graph (ASG), effectively building the ASG from observed successful behaviors and providing the foundational components for subsequent learning and generalization.
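A rough sketch of the mining step follows, under the assumption that trajectories are recorded as lists of (state, action) pairs; the actual compiler additionally normalizes skill interfaces, which is omitted here:

```python
from collections import Counter

def mine_skill_primitives(trajectories, min_len=2, max_len=4, min_support=3):
    """Find action subsequences that recur across successful trajectories.

    trajectories: list of successful episodes, each a list of (state, action)
    Returns candidate primitives: action tuples occurring in at least
    min_support distinct episodes.
    """
    support = Counter()
    for episode in trajectories:
        actions = [action for _, action in episode]
        seen = set()   # count each pattern at most once per episode
        for n in range(min_len, max_len + 1):
            for i in range(len(actions) - n + 1):
                seen.add(tuple(actions[i:i + n]))
        support.update(seen)
    return [pattern for pattern, count in support.items() if count >= min_support]
```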
Grounding Trust: Verifying and Auditing Agent Skills
The ASG-SI framework utilizes a Verifier-Auditor component to assess candidate skills through controlled replay. This process involves re-executing the agent’s skill under defined conditions, allowing for detailed observation and data collection. The output of this replay is compiled into an Evidence Bundle, a structured record of the skill’s execution, including relevant states, actions, and observations. This bundle serves as the foundational data for subsequent verification, auditing, and reward signal reconstruction, providing a comprehensive audit trail of the agent’s capabilities and behavior.
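A plausible shape for this record, sketched in Python with hypothetical names (`StepRecord`, `EvidenceBundle`, a gym-style environment, and a `skill.act` method are all assumptions, not the paper’s API):

```python
import time
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    """One step of a replayed skill: everything needed to audit it later."""
    observation: object
    action: object
    reward: float
    internal_state: dict

@dataclass
class EvidenceBundle:
    """Structured record of one controlled replay of a skill."""
    skill_name: str
    seed: int
    steps: list = field(default_factory=list)
    total_reward: float = 0.0
    created_at: float = field(default_factory=time.time)

def replay_and_record(skill, env, seed):
    """Re-execute a skill under fixed conditions and capture every step."""
    bundle = EvidenceBundle(skill_name=skill.name, seed=seed)
    obs = env.reset(seed=seed)        # hypothetical gym-style environment
    done = False
    while not done:
        action, internal = skill.act(obs)   # hypothetical skill API
        obs, reward, done, _ = env.step(action)
        bundle.steps.append(StepRecord(obs, action, reward, internal))
        bundle.total_reward += reward
    return bundle
```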
Replay-based verification operates by subjecting agent skills to repeated execution under controlled, deterministic conditions. This process allows for rigorous testing of skill correctness by comparing observed behavior against expected outcomes, and validates safety by identifying potentially hazardous actions before deployment. The methodology involves re-executing the skill with the same initial conditions and inputs, enabling precise analysis of each step and the identification of deviations from the intended functionality. By systematically replaying skills, the system can detect edge cases, logical errors, and unintended consequences, thereby increasing confidence in the agent’s reliable and safe operation. This contrasts with single-execution testing, which may not expose subtle flaws or vulnerabilities present in the skill’s implementation.
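Building on the replay sketch above, a verification pass could repeat execution under several fixed seeds and accept the skill only if every run matches expectations; `is_unsafe` here stands in for a domain-specific safety predicate that the real system would supply:

```python
def verify_skill(skill, env, expected_reward, seeds=(0, 1, 2), tol=1e-6):
    """Replay a skill under several fixed seeds and compare outcomes.

    Passes only if every replay reproduces the expected reward and no
    recorded step triggers the safety predicate.
    """
    reports = []
    for seed in seeds:
        bundle = replay_and_record(skill, env, seed)   # from the sketch above
        unsafe = [s for s in bundle.steps if is_unsafe(s.action)]  # hypothetical check
        ok = abs(bundle.total_reward - expected_reward) < tol and not unsafe
        reports.append({"seed": seed, "ok": ok, "unsafe_steps": len(unsafe)})
    return all(r["ok"] for r in reports), reports
```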
Verifiable reward mechanisms within ASG-SI are designed to provide a transparent and auditable signal for both agent learning and performance evaluation. These mechanisms operate by associating rewards with an Evidence Bundle, a record of skill execution generated during replay-based verification. The integrity of reward gains is quantified using an ‘Evidence-based reconstruction rate’ metric, which assesses the proportion of reward that can be accurately reconstructed from the evidence contained within the bundle. A high reconstruction rate indicates a strong correlation between observed skill execution and the assigned reward, bolstering confidence in the learning process and providing a reliable audit trail for skill validation and safety assessment.
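As an illustration only, the metric could be computed by re-deriving each step’s reward from its recorded evidence and measuring how much of the claimed total survives; `reconstruct_step_reward` is a stand-in for the verifier’s actual reconstruction logic:

```python
def reconstruction_rate(bundle, reconstruct_step_reward):
    """Fraction of claimed reward that can be re-derived from recorded evidence.

    reconstruct_step_reward: function that recomputes a step's reward from its
    recorded observation and action alone; returns None if it cannot.
    (A sketch: assumes non-negative rewards for the ratio to be meaningful.)
    """
    reconstructed = 0.0
    for step in bundle.steps:
        predicted = reconstruct_step_reward(step.observation, step.action)
        if predicted is not None and abs(predicted - step.reward) < 1e-6:
            reconstructed += step.reward
    claimed = bundle.total_reward
    return reconstructed / claimed if claimed else 1.0
```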
Effective operation of the Verifier-Auditor component within ASG-SI necessitates robust memory operations to accurately track skill execution. These operations involve recording the complete state of the environment and the agent’s actions during skill replay, forming the basis of the generated Evidence Bundle. Specifically, memory stores must capture input observations, agent actions, resulting rewards, and internal state variables at each step of the skill execution. The fidelity of this recorded data directly impacts the accuracy of skill verification and the reliability of reconstructed reward signals; incomplete or inaccurate memory capture will lead to flawed auditing and potentially unsafe skill deployment. The system utilizes these memory operations to establish a verifiable audit trail, enabling reconstruction of the skill’s behavior and ensuring the integrity of learned rewards.
Toward a Future of Continual Learning and Robust Adaptation
The ASG-SI framework demonstrates a compelling capacity for continual learning, a crucial ability for agents operating in dynamic environments. Unlike traditional approaches prone to ‘catastrophic forgetting’ – where learning new skills erases previously acquired ones – ASG-SI facilitates the seamless integration of novel competencies without compromising existing knowledge. This is achieved through the framework’s emphasis on modular skill representation and composition; new skills are added as independent modules, building upon, rather than overwriting, the established skillset. Consequently, an agent can progressively acquire a diverse repertoire of abilities, adapting to evolving demands and demonstrating sustained performance across a lengthening sequence of tasks, effectively mimicking a capacity for lifelong learning.
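One way to picture this, as a hedged sketch rather than the paper’s implementation, is an append-only skill library in which integrating a new skill never deletes what came before:

```python
class SkillLibrary:
    """Append-only skill store: integration extends, never overwrites."""

    def __init__(self):
        self.versions = {}   # skill name -> list of versions, oldest first

    def integrate(self, name, skill):
        """Register a new skill (or a new version) without erasing history."""
        self.versions.setdefault(name, []).append(skill)
        return len(self.versions[name])   # version number just added

    def latest(self, name):
        return self.versions[name][-1]

    def rollback(self, name):
        """Earlier versions remain available if a new one regresses."""
        if len(self.versions[name]) > 1:
            self.versions[name].pop()
        return self.versions[name][-1]
```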
To overcome the limitations of relying solely on real-world experience, agents can benefit from synthesized experiences – artificially generated data designed to broaden an agent’s understanding of its environment. Techniques like DreamGym facilitate this by creating diverse and challenging scenarios that complement limited real-world data, effectively augmenting the training process. This approach is particularly crucial for improving generalization – the ability to perform well in unseen situations – and robustness, allowing the agent to maintain performance even when faced with unexpected disturbances or variations. By strategically generating data that explores edge cases and rare events, synthesized experiences proactively address potential weaknesses, leading to more adaptable and reliable artificial intelligence systems.
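DreamGym’s internals are beyond the scope of this overview, but the general pattern of experience synthesis can be sketched generically: perturb real scenarios with a domain-specific function to cover edge cases the real data misses. Everything below is illustrative, not DreamGym’s API:

```python
import random

def synthesize_experiences(seed_scenarios, perturb, n_variants=5, rng=None):
    """Generate synthetic training scenarios by perturbing real ones.

    seed_scenarios: scenarios observed in the real environment
    perturb: domain-specific function mapping (scenario, rng) -> variant,
             e.g. shifting object positions or injecting rare events
    """
    rng = rng or random.Random(0)
    synthetic = []
    for scenario in seed_scenarios:
        for _ in range(n_variants):
            synthetic.append(perturb(scenario, rng))
    return synthetic
```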
The architecture enables agents to tackle intricate challenges not by learning monolithic policies, but through the strategic assembly of pre-verified skill modules. This decomposition allows for greater flexibility; complex tasks are broken down into manageable components, each possessing a clearly defined purpose and audited for safe and reliable execution. Consequently, an agent can rapidly adapt to novel situations by recombining existing skills in innovative ways, or by learning and integrating new, specialized modules without requiring complete retraining. This modular approach not only accelerates the learning process but also enhances robustness, as failures within one skill are less likely to cascade and compromise the entire system – fostering a more resilient and adaptable artificial intelligence.
Ensuring artificial intelligence systems operate safely requires more than just performance metrics; it demands verifiable adherence to predefined constraints. Outcome-Driven Constraint Violation Benchmarks address this need by shifting the focus from simply achieving a goal to how that goal is achieved. These benchmarks don’t merely assess success or failure, but meticulously track any violations of critical safety guidelines during task execution. This approach allows researchers to rigorously test agents across a spectrum of scenarios, identifying potential risks and vulnerabilities before deployment. By quantifying constraint violations – such as collisions, exceeding speed limits, or entering restricted zones – these benchmarks provide a clear and objective measure of an agent’s robustness and reliability, paving the way for trustworthy and responsible AI systems.
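A minimal harness in this spirit, with a hypothetical gym-style environment and agent, tracks each constraint predicate alongside task success rather than success alone:

```python
def run_constraint_benchmark(agent, env, constraints, max_steps=1000):
    """Score an episode by outcome AND by how the outcome was achieved.

    constraints: mapping of name -> predicate over (state, action) that
    returns True when the constraint is violated (e.g. "collision").
    """
    violations = {name: 0 for name in constraints}
    state, done, success = env.reset(), False, False
    for _ in range(max_steps):
        action = agent.act(state)
        for name, violated in constraints.items():
            if violated(state, action):
                violations[name] += 1
        state, _, done, info = env.step(action)   # hypothetical gym-style API
        if done:
            success = info.get("goal_reached", False)
            break
    return {"success": success, "violations": violations}
```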
Scaling Intelligence: AgentRL, Agent Lightning, and the Path Forward
AgentRL and Agent Lightning supply the training infrastructure that allows the ASG-SI framework to scale to increasingly intricate scenarios. These systems extend agent capabilities beyond simple, single-turn interactions, allowing for sustained engagement in multi-turn conversations and the concurrent handling of multiple, diverse tasks. By providing robust reinforcement learning algorithms and optimized training pipelines, AgentRL and Agent Lightning enable ASG-SI agents to learn and refine their strategies across a broader spectrum of challenges, demonstrating a significant step toward building truly versatile and intelligent systems capable of operating in dynamic, real-world environments. This scalability is not merely about handling more data; it’s about fostering a capacity for continuous learning and adaptation, ensuring the agent’s performance improves with increasing complexity.
The ASG-SI framework extends beyond simple task completion to encompass sophisticated Tool Learning, allowing agents to dynamically integrate and utilize external resources. This capability is not simply about accessing tools, but ensuring their valid application; the system rigorously assesses tool usage through a suite of metrics. Specifically, ‘schema-correctness rate’ confirms that tools receive inputs in the expected format, while ‘argument-type correctness’ validates that the provided data types align with tool requirements. Critically, ‘tool-output utilization consistency’ measures whether the agent effectively incorporates the tool’s response into its reasoning process, preventing ignored or misinterpreted outputs. This multi-faceted evaluation ensures that the agent doesn’t just use tools, but leverages them reliably and meaningfully to achieve its objectives, paving the way for complex problem-solving capabilities.
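Given a log of tool calls annotated by the evaluation harness, the three rates reduce to simple ratios; the field names below are illustrative, not the paper’s exact schema:

```python
def tool_use_metrics(calls):
    """Compute the three tool-usage rates over a log of tool calls.

    Each call is a dict with boolean fields recorded by the harness:
      schema_ok   - inputs matched the tool's expected schema
      types_ok    - argument types matched the tool's requirements
      output_used - the tool's response appeared in later reasoning
    """
    n = len(calls)
    if n == 0:
        return {}
    return {
        "schema_correctness_rate": sum(c["schema_ok"] for c in calls) / n,
        "argument_type_correctness": sum(c["types_ok"] for c in calls) / n,
        "tool_output_utilization_consistency": sum(c["output_used"] for c in calls) / n,
    }
```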
SWE-Bench-CL serves as a compelling demonstration of ASG-SI’s practical utility, specifically within the challenging domain of software engineering. This benchmark suite, designed to evaluate agents on complex, multi-stage coding tasks, reveals ASG-SI’s capacity to not merely generate code snippets, but to engage in complete software development lifecycles – from understanding requirements and designing solutions, to implementing, testing, and debugging. The framework’s performance on SWE-Bench-CL highlights its potential to automate significant portions of the software engineering process, promising increased efficiency and reduced development time. By successfully navigating tasks requiring complex reasoning and the integration of multiple skills, ASG-SI showcases a pathway towards intelligent agents capable of augmenting human developers and tackling real-world software challenges.
Ongoing development centers on bolstering agent robustness through automated skill discovery and a more refined auditing process. Current systems often rely on manually defined skills, limiting adaptability; future iterations aim to allow agents to independently identify and acquire necessary competencies. Crucially, evaluating these self-discovered skills necessitates a robust verification method, and researchers propose ‘Verifier reproducibility’ as a key metric. This assesses the consistency with which an independent verifier – or audit process – confirms the agent’s performance across multiple attempts, ensuring reliability and minimizing spurious successes. By prioritizing both automated skill acquisition and verifiable performance, the framework seeks to create agents capable of consistently delivering accurate results in dynamic and complex environments.
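As a sketch of the idea, reproducibility can be measured as the agreement rate of repeated audits; `verify` stands in for one run of an independent verifier:

```python
def verifier_reproducibility(verify, skill, n_trials=10):
    """Agreement rate of an independent verifier over repeated audits.

    verify: function returning True/False for one audit of the skill.
    A score near 1.0 means the verdict is stable across attempts; values
    near 0.5 indicate the audit itself is noisy and should not be trusted.
    """
    verdicts = [verify(skill) for _ in range(n_trials)]
    majority = max(set(verdicts), key=verdicts.count)
    return verdicts.count(majority) / n_trials
```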
The pursuit of robust agentic systems, as detailed in the research, inherently demands a willingness to challenge established boundaries. ASG-SI’s focus on verifiable rewards and skill composition echoes this sentiment: a system isn’t understood until its components are rigorously tested and reassembled. As G. H. Hardy noted, “The essence of mathematics lies in its elegance and logical rigor.” This principle extends to agent design; a truly intelligent system isn’t simply complex, but demonstrably sound, built on a foundation of auditable evidence. The paper’s commitment to evidence integrity isn’t merely a technical detail; it’s an acknowledgment that knowledge, like a mathematical proof, must withstand scrutiny to be considered valid.
Uncharted Territory
The pursuit of agentic self-improvement, as framed by this work, isn’t about building smarter machines; it’s about reverse-engineering intelligence itself. The ASG-SI system represents a pragmatic attempt to build a legible system – one where the accumulation of skills isn’t a black box, but a traceable, verifiable process. However, the question of ‘what constitutes verifiable evidence’ remains stubbornly open. The current approach focuses on rewards, but reality, as always, is more nuanced. The true test will be whether these ‘audited skills’ generalize beyond the contrived environments, and more importantly, whether that generalization remains predictable.
A critical limitation lies in the assumption of skill atomicity. The boundaries between skills are, at best, convenient fictions. The most interesting behaviors will likely emerge from unexpected skill compositions – emergent properties that are difficult, if not impossible, to anticipate during the audit phase. This isn’t a bug; it’s a feature of complex systems. The code is always more elegant – and more surprising – than the documentation.
Future work must confront the inherent tension between control and creativity. Building governable agents requires rigorous auditability, but imposing excessive constraints risks stifling the very ingenuity the system seeks to cultivate. The challenge, then, isn’t simply to teach agents how to learn, but to establish the conditions under which they can discover things we haven’t yet conceived. Because, ultimately, reality is open source – it’s just that no one has finished reading the code.
Original article: https://arxiv.org/pdf/2512.23760.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/