Author: Denis Avetisyan
A new framework, AgentMath, dramatically improves the mathematical reasoning abilities of large language models by letting them actively use and learn from code execution.

AgentMath leverages tool-augmented agents and asynchronous reinforcement learning to achieve state-of-the-art performance on complex mathematical benchmarks.
Despite recent advances in large reasoning models, complex mathematical problem-solving remains computationally expensive and prone to errors. This paper introduces AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent, a novel framework that bridges the gap between language understanding and computational precision. By integrating language models with code execution through an agentic reinforcement learning paradigm, and by leveraging innovations in training efficiency, AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks. Could this approach unlock a new generation of scalable and robust mathematical reasoning agents capable of tackling even more complex problems?
The Inevitable Augmentation: Beyond Pattern Matching
Despite their impressive ability to generate human-quality text, traditional Large Language Models often falter when confronted with tasks demanding rigorous computation or intricate, multi-step problem-solving. These models, trained primarily on vast datasets of text, excel at identifying patterns and relationships within language, but lack the inherent capacity for precise calculation or systematic deduction. Consequently, even relatively simple arithmetic problems, or scenarios requiring logical sequencing, can present significant challenges. The limitations stem from the fundamental architecture of LLMs, which operate by predicting the most probable continuation of a given text sequence – a process fundamentally different from the symbolic manipulation required for reliable reasoning. This means that while LLMs can discuss a complex problem, they frequently struggle to solve it accurately, highlighting the need for approaches that extend their capabilities beyond pattern recognition.
The continued pursuit of larger language models, while yielding impressive gains in fluency and pattern recognition, is increasingly recognized as an insufficient path towards true artificial general intelligence. Current architectures, fundamentally limited by their reliance on parametric knowledge, struggle with tasks demanding precise computation, logical deduction, or access to constantly evolving information. A paradigm shift is therefore essential, one that moves beyond simply scaling model size and instead focuses on augmenting LLMs with external tools – calculators, search engines, code interpreters, and specialized APIs. This integration isn’t merely about adding functionalities; it’s about creating a synergistic system where the LLM’s strengths in natural language understanding and reasoning are combined with the reliability and precision of executable code and external data sources, effectively bypassing the limitations inherent in storing all knowledge within the model’s parameters.
A robust architecture for advanced reasoning involves more than simply increasing model size; it demands a synergistic integration of natural language processing with the precision of executable code. This framework allows the language model to leverage external tools – calculators, search engines, or specialized algorithms – not merely as data sources, but as active components within the reasoning process. Instead of attempting to solve complex problems solely through statistical correlations within its parameters, the model can formulate a plan, decompose the problem into manageable steps, and then execute those steps using appropriate tools, verifying results at each stage. This approach effectively extends the model’s cognitive reach, enabling it to tackle tasks requiring accuracy and systematic computation that are beyond the capabilities of even the most powerful, scaled language models. The result is a system that doesn’t just appear intelligent, but demonstrably reasons and solves problems with a level of reliability previously unattainable.

The Necessary Symbiosis: AgentMath Takes Form
AgentMath utilizes Large Language Models (LLMs) as its primary reasoning component, but crucially integrates a Code Interpreter to perform and verify calculations. This architecture moves beyond the limitations of LLMs, which can struggle with arithmetic and logical operations, by offloading these tasks to a dedicated computational tool. The Code Interpreter receives instructions formulated by the LLM, executes the corresponding code (typically Python), and returns the results back to the LLM for incorporation into its reasoning process. This allows AgentMath to handle quantitative problems and complex calculations with a significantly higher degree of accuracy and reliability than LLMs operating independently. The framework is designed such that the LLM directs the Code Interpreter, effectively using it as a verifiable computational resource within its overall reasoning chain.
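The reason-act loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the helper names (`run_code`, `generate_step`, `solve`) and the tag format are hypothetical, and a real system would execute code in a sandbox rather than a bare `exec`.

```python
# Sketch of an LLM-directed code-interpreter loop. All names are
# illustrative stand-ins, not AgentMath's real API.
import io
import contextlib

def run_code(code: str) -> str:
    """Execute a Python snippet and capture stdout, standing in for a
    sandboxed Code Interpreter. Errors are returned as text so the
    LLM can react to them."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue().strip()
    except Exception as e:
        return f"Error: {e}"

def solve(problem: str, generate_step, max_turns: int = 8) -> str:
    """Alternate LLM reasoning with code execution until an answer is
    produced. `generate_step` is a hypothetical callable wrapping the
    LLM: it maps the current context to a (kind, text) pair, where
    kind is "code" or "answer"."""
    context = problem
    for _ in range(max_turns):
        kind, text = generate_step(context)
        if kind == "answer":
            return text
        # kind == "code": run it and feed the result back into context
        result = run_code(text)
        context += f"\n<code>{text}</code>\n<output>{result}</output>"
    return "No answer within turn budget"
```

The key design point is that interpreter output is appended back into the context, so the model's next step is conditioned on verified computation rather than on its own arithmetic.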
Supervised Fine-tuning (SFT) with tool-augmented data is a critical initialization step for AgentMath. This process involves training the Large Language Model (LLM) on a dataset specifically designed to demonstrate effective interaction with the integrated Code Interpreter. The training data consists of problem statements paired with demonstrations of the LLM formulating a plan, generating appropriate code for execution by the interpreter, and then utilizing the interpreter’s output to refine its reasoning and arrive at a solution. This targeted SFT teaches the agent not only to call the Code Interpreter, but also to strategically integrate its use within a broader reasoning process, improving both the accuracy and reliability of its responses by grounding them in verifiable computation.
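A tool-augmented SFT example of the kind described might look as follows. The field names and tag scheme are purely illustrative assumptions, not the dataset's real schema; the point is that each training example interleaves reasoning, code, and interpreter output into one supervised token stream.

```python
# Hypothetical shape of one tool-augmented SFT example. Field names
# and tags are illustrative, not the paper's actual format.
sft_example = {
    "problem": "Find the remainder when 7**2024 is divided by 100.",
    "trajectory": [
        {"type": "thought",
         "text": "Powers of 7 mod 100 cycle with period 4; compute directly."},
        {"type": "code", "text": "print(pow(7, 2024, 100))"},
        {"type": "output", "text": "1"},
        {"type": "answer", "text": "1"},
    ],
}

def to_training_text(example: dict) -> str:
    """Flatten a trajectory into the single tagged token stream the
    LLM is fine-tuned on."""
    parts = [example["problem"]]
    for step in example["trajectory"]:
        parts.append(f"<{step['type']}>{step['text']}</{step['type']}>")
    return "\n".join(parts)
```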
AgentMath builds upon Long Chain-of-Thought (CoT) reasoning by integrating a code execution capability directly into the agent’s deliberation process. Traditional CoT relies on generating textual reasoning steps; AgentMath allows the agent to not only formulate a plan involving computation, but also to execute that plan using a Code Interpreter and incorporate the results back into its reasoning chain. This enables the agent to handle complex calculations, data manipulation, and verification tasks that would be impractical or unreliable through textual reasoning alone, effectively extending the scope and accuracy of CoT-based problem solving.

The Inevitable Optimization: Refining the System’s Core
Reinforcement Learning (RL) is utilized to enhance the agent’s decision-making process regarding Code Interpreter usage. This involves training the agent to strategically select when and how to employ the Code Interpreter to solve problems, moving beyond simple tool invocation. The RL framework defines a reward function that incentivizes correct problem-solving while penalizing inefficient or unnecessary Code Interpreter calls. Through iterative training, the agent learns an optimal policy – a mapping from problem states to actions involving the Code Interpreter – maximizing its ability to solve complex tasks effectively and reliably. This refined policy dictates not only whether to use the Code Interpreter, but also how to formulate queries and interpret results for optimal performance.
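A reward of the kind described, rewarding correctness while penalizing excessive interpreter calls, can be sketched as below. The weights and the call budget are illustrative assumptions; the paper's actual reward design may differ.

```python
# Sketch of a correctness-plus-tool-efficiency reward. The budget of
# free calls and the penalty weight are assumed values for
# illustration only.
def reward(answer: str, reference: str, n_tool_calls: int,
           free_calls: int = 4, call_penalty: float = 0.05) -> float:
    """1.0 for a correct final answer, 0.0 otherwise, minus a small
    penalty for each interpreter call beyond the free budget."""
    correct = 1.0 if answer.strip() == reference.strip() else 0.0
    excess = max(0, n_tool_calls - free_calls)
    return correct - call_penalty * excess
```

Under this shaping, the policy is pushed toward solutions that are not just correct but economical in their interpreter usage.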
Asynchronous Rollout Scheduling and Agentic Partial Rollout are employed to accelerate Reinforcement Learning training by decoupling the policy evaluation and improvement steps and strategically managing computational resources. Asynchronous scheduling allows for the parallel execution of rollouts, reducing idle time and overall latency. Agentic Partial Rollout further optimizes resource allocation by selectively evaluating portions of the rollout trajectory, focusing on the most informative segments. Combined, these techniques achieve a 4.0-5.0x speedup in training compared to conventional batch-synchronous rollout methods, which require complete trajectory evaluation before policy updates.
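The benefit of asynchronous scheduling can be illustrated with a toy simulation: rollouts finish at different times, and the scheduler consumes them in completion order rather than waiting for the slowest worker in a batch. This is a conceptual sketch with simulated durations, not the paper's training infrastructure.

```python
# Toy illustration of asynchronous rollout consumption. Trajectory
# generation times vary; completed rollouts reach the learner
# immediately instead of stalling on the batch's slowest worker.
import asyncio
import random

async def rollout(worker_id: int) -> tuple[int, float]:
    """Simulate one trajectory whose generation time varies."""
    duration = random.uniform(0.01, 0.05)
    await asyncio.sleep(duration)
    return worker_id, duration

async def async_schedule(n_workers: int) -> list[int]:
    """Consume rollouts in completion order, not submission order.
    In a real trainer the policy update would happen inside the loop."""
    tasks = [asyncio.create_task(rollout(i)) for i in range(n_workers)]
    finished = []
    for task in asyncio.as_completed(tasks):
        wid, _ = await task
        finished.append(wid)
    return finished

order = asyncio.run(async_schedule(8))
```

A batch-synchronous scheduler would instead block until all eight tasks complete before any learning step, so its latency is set by the slowest trajectory; the asynchronous version hides that tail.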
The Group Relative Policy Optimization (GRPO) algorithm streamlines agent training by removing the requirement for value function approximation, a common component in reinforcement learning that introduces complexity and potential instability. GRPO achieves this simplification by scoring each sampled response relative to its group and optimizing the policy directly, resulting in a more stable learning process. Eliminating the computational overhead of value function estimation and updates also reduces per-step latency, complementing the rollout optimizations that together yield the 4.0-5.0x training speedup over conventional batch-synchronous methods.
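The group-relative scoring at the heart of GRPO is straightforward: sample several responses per prompt, score them, and normalize each reward against the group's mean and standard deviation, so no learned critic is needed. A minimal sketch of that computation, with an illustrative epsilon for numerical safety:

```python
# Group-relative advantage computation as in GRPO: each response's
# advantage is its reward standardized within its sampling group.
# The epsilon is an illustrative choice.
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of sampled responses.
    Positive advantage -> better than the group average."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean rather than a value network's estimate, the advantage signal is available immediately after scoring, with no critic to train or stabilize.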

The Inevitable Demonstration: Performance and the Path Forward
AgentMath establishes a new benchmark in automated mathematical problem-solving, demonstrably exceeding prior performance on challenging competition benchmarks. Utilizing the AgentMath-30B-A3B model, the system achieves an impressive 90.6% accuracy on the AIME24 dataset, alongside 86.4% on AIME25 and 73.8% on the notoriously difficult HMMT25. These results signify a substantial leap forward in the field, indicating the potential for artificial intelligence to not only compute solutions but to effectively navigate the complexities of competitive mathematical reasoning. This level of proficiency suggests a future where AI tools can serve as powerful aids for students and researchers alike, pushing the boundaries of mathematical exploration and discovery.
AgentMath distinguishes itself through a robust self-correction capability, allowing the system to not merely generate solutions, but to critically evaluate and refine its own work. This isn’t simply about avoiding errors; the framework actively identifies flaws in its reasoning process and within the code it generates to arrive at an answer. When inconsistencies or logical errors are detected, AgentMath initiates a corrective cycle, re-examining its steps and revising its approach until a consistent and accurate solution is achieved. This iterative self-assessment significantly enhances the reliability of the system, particularly in complex mathematical problem-solving where subtle errors can easily propagate through multiple stages of calculation. The ability to autonomously debug and improve its performance represents a key advancement towards more dependable and trustworthy artificial intelligence in mathematical domains.
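The verify-and-revise cycle described above reduces to a small loop: propose a solution, check it, and on failure feed the diagnostic back for another attempt. The sketch below is a hypothetical skeleton; `propose` and `check` stand in for the LLM and the verification step (e.g. re-executing generated code), and are not the paper's API.

```python
# Sketch of a self-correction loop. `propose` and `check` are
# hypothetical callables standing in for the LLM and the verifier.
def self_correct(problem, propose, check, max_attempts: int = 3):
    """propose(problem, feedback) -> candidate answer;
    check(candidate) -> (ok: bool, feedback: str).
    Retry with accumulated feedback until the check passes."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(problem, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate
    return None  # give up after the attempt budget is exhausted
```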
Continued development of AgentMath centers on broadening its functional capabilities through the integration of a more diverse toolkit. Researchers aim to move beyond current computational and symbolic manipulation resources, potentially incorporating access to specialized databases, advanced numerical solvers, and even probabilistic reasoning engines. Simultaneously, investigations are underway to refine the agent’s reasoning processes, exploring techniques like hierarchical planning, abductive reasoning, and the ability to construct and evaluate multiple solution pathways concurrently. These enhancements are anticipated to not only improve performance on existing mathematical challenges, but also to enable the agent to tackle problems demanding greater creativity, adaptability, and nuanced understanding of mathematical principles.

The pursuit of robust mathematical reasoning, as demonstrated by AgentMath, isn't about achieving perfect stability; it's about embracing the inevitable evolution of complex systems. Long stability, in this context, would be a false signal, masking the subtle ways the agent adapts, or fails to adapt, to novel mathematical challenges. As Marvin Minsky observed, "You can't always get what you want, but you can't get what you don't ask for." AgentMath, by actively probing the space of possible solutions through tool augmentation and asynchronous training, asks for more than a static language model could ever deliver. It doesn't seek to eliminate error, but to cultivate a system capable of learning from its own unexpected shapes, recognizing that the most interesting failures often reveal the path toward genuine intelligence.
What Lies Ahead?
AgentMath, and systems of its kind, represent a familiar pattern: the pursuit of competence through accretion. One adds layers – code execution, reinforcement learning, asynchronous training – hoping to coax intelligence from the underlying substrate. Yet, the benchmarks themselves become the limitations. Success on a curated dataset merely postpones the inevitable encounter with novelty, with problems genuinely unlike those previously solved. The architecture isn't structure; it's a compromise frozen in time, a brittle response to a transient landscape.
The true challenge isn’t building agents that can reason, but understanding what constitutes genuine mathematical understanding. Current approaches excel at mimicry, at pattern completion. But the leap from correct answer to conceptual grasp remains elusive. The field fixates on scaling performance; it should concern itself with the nature of failure. Each solved problem reveals not a path to intelligence, but a deeper appreciation of ignorance.
Technologies change, dependencies remain. The promise of tool-augmented agents is not seamless integration, but rather the elegant management of inevitable friction. Future work will likely focus on dynamic tool selection, adaptive training regimes, and perhaps, a belated acknowledgement that the most powerful tools are those discarded, the assumptions abandoned. The ecosystem will grow, regardless. The question isn’t whether it will succeed, but what form its failures will take.
Original article: https://arxiv.org/pdf/2512.20745.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-12-25 15:33