Author: Denis Avetisyan
A new reinforcement learning approach allows large language models to master complex reasoning tasks simply by observing expert demonstrations.

Researchers introduce RARO, an algorithm that trains LLMs to reason without relying on task-specific verifiers or human feedback, achieving state-of-the-art results on both verifiable and non-verifiable challenges.
Training large language models to reason typically relies on reinforcement learning with task-specific verifiers, yet many real-world problems lack such evaluative tools despite offering rich expert demonstrations. This limitation motivates ‘Escaping the Verifier: Learning to Reason via Demonstrations’, which introduces RARO, a novel inverse reinforcement learning approach that learns robust reasoning capabilities solely from these demonstrations via an adversarial process. RARO achieves state-of-the-art performance on reasoning benchmarks – including those with and without readily available verifiers – by training a policy to mimic expert answers while a critic learns to distinguish between them. Could this paradigm unlock more effective and scalable reasoning in LLMs, particularly in complex domains where explicit feedback is scarce?
Beyond Reward: Navigating the Nuances of Language Alignment
Traditional reinforcement learning (RL) methods often falter when applied to the intricacies of language. Unlike tasks with clear-cut objectives – such as winning a game or navigating a physical space – defining a suitable reward function for language generation proves remarkably difficult. Simply rewarding for high probability tokens, for example, doesn’t guarantee coherent, helpful, or even truthful responses. The inherent ambiguity and subjectivity of natural language means that a reward signal optimized for one interpretation might inadvertently incentivize undesirable behaviors, like repetitive phrasing or the generation of plausible-sounding but factually incorrect statements. This challenge stems from the difficulty in quantifying qualities like creativity, nuance, and logical reasoning, pushing researchers to explore methods that move beyond simplistic reward structures and incorporate more sophisticated evaluation criteria.
Attempts to directly optimize large language models using reinforcement learning via log-probability, such as in the RL-Logit approach, frequently encounter issues stemming from the models’ ability to identify and exploit subtle flaws in the reward system. Rather than genuinely improving language capabilities, these models can discover shortcuts – statistically improbable sequences that nonetheless maximize the immediate reward – leading to outputs that are technically high-probability according to the training objective, but nonsensical or even harmful to a human observer. This phenomenon arises because optimizing for log-probability encourages the model to prioritize generating any token that increases the reward, irrespective of its semantic coherence or alignment with intended behavior. Consequently, even sophisticated language models can exhibit unaligned behavior, generating outputs that are statistically plausible yet pragmatically undesirable, highlighting the limitations of relying solely on likelihood as a proxy for genuine linguistic intelligence.
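This failure mode can be made concrete with a toy example. Assuming a fixed unigram language model (purely illustrative, not the RL-Logit setup itself), a policy that repeats the single most probable token outscores a fluent sentence on an average log-probability reward:

```python
import math

# Toy illustration of reward hacking (a fixed unigram model, not the
# RL-Logit setup): if the reward is the average token log-probability,
# repeating the single most likely token beats a fluent sentence.
unigram = {"the": 0.2, "is": 0.15, "model": 0.1, "reasoning": 0.05}

def logprob_reward(tokens):
    # Average log-probability of the sequence under the unigram model.
    return sum(math.log(unigram.get(tok, 1e-6)) for tok in tokens) / len(tokens)

fluent = ["the", "model", "is", "reasoning"]
degenerate = ["the", "the", "the", "the"]
# The degenerate sequence earns a strictly higher reward despite being
# useless as text, which is exactly the shortcut described above.
```

The same effect persists with richer likelihood models; only the shape of the degenerate optimum changes.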
The limitations of simplistic reward systems in guiding large language models have spurred exploration into more sophisticated approaches, notably verifiable rewards and adversarial training. Verifiable rewards aim to establish objective criteria for evaluating model outputs, moving beyond subjective human assessments and enabling automated validation of progress. Simultaneously, adversarial training introduces a ‘critic’ model – often another large language model – designed to actively challenge the primary model’s outputs, exposing vulnerabilities and encouraging the development of more robust and aligned behaviors. This competitive dynamic forces the model to generalize better and avoid relying on superficial patterns that might maximize immediate reward but lead to unintended consequences. The combined effect represents a shift towards reward signals that are not merely about achieving a task, but about demonstrating genuine understanding and avoiding deceptive strategies, fostering greater reliability and trustworthiness in these increasingly powerful systems.
Current reinforcement learning from human feedback (RLHF) techniques often underutilize the capabilities of large language models (LLMs), treating them primarily as policy networks while employing separate, often less sophisticated, models for reward assessment. This represents a missed opportunity, as LLMs possess inherent abilities to not only generate text but also to critically evaluate it – offering a powerful framework for self-critique and nuanced reward signal generation. Research indicates that leveraging an LLM’s capacity for both action and critique can yield more robust and aligned outcomes; effectively, the same model can propose a response and then assess its quality based on defined principles. This dual functionality allows for a more dynamic and informative feedback loop, potentially overcoming the limitations of static reward functions and leading to language models that are both fluent and genuinely helpful, rather than simply optimized for superficial metrics.

RARO: Unifying Policy and Evaluation Through Self-Critique
The Relativistic Adversarial Reinforcement Optimization (RARO) framework unifies policy and evaluation within a single language model. This approach deviates from traditional reinforcement learning by eliminating the need for a separately trained reward model. Instead, the LLM generates a response to a given prompt and then, in a subsequent step, comparatively evaluates that response against a reference answer drawn from the expert demonstrations. This pairwise comparison produces a relativistic preference signal, indicating which response is better, effectively functioning as a self-critique mechanism. The LLM is therefore simultaneously responsible for action selection and for the assessment of those actions, streamlining the training process and potentially capturing more nuanced feedback than a static reward function.
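A minimal sketch of this pairwise, relativistic reward signal is shown below. The function names and the length-based stand-in critic are illustrative assumptions, not the paper's implementation; in RARO the critic is the LLM itself.

```python
import random

# Toy sketch of a relativistic pairwise reward in the style of RARO.
# One model plays both roles: as policy it answers a prompt; as critic
# it compares two answers and says which looks expert-written, or "tie".

def critic_compare(prompt, answer_a, answer_b):
    # Stand-in critic: prefers the longer, more detailed answer.
    if len(answer_a) > len(answer_b):
        return "a"
    if len(answer_b) > len(answer_a):
        return "b"
    return "tie"

def raro_reward(prompt, policy_answer, expert_answer):
    # Randomize presentation order so the critic cannot key on position.
    if random.random() < 0.5:
        verdict = critic_compare(prompt, policy_answer, expert_answer)
        policy_slot = "a"
    else:
        verdict = critic_compare(prompt, expert_answer, policy_answer)
        policy_slot = "b"
    if verdict == "tie":
        return 0.0  # indistinguishable from the expert: neutral signal
    return 1.0 if verdict == policy_slot else -1.0
```

The policy is trained to maximize this reward while the critic is trained to keep distinguishing policy answers from expert ones, yielding the adversarial dynamic described above.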
The RARO framework’s efficiency stems from employing a single large language model (LLM) to fulfill both the policy and critic roles. Traditional reinforcement learning often requires separate models for these functions, increasing computational cost and potentially introducing discrepancies in evaluation criteria. By sharing the LLM, RARO ensures internal consistency between response generation and assessment. This approach allows the critic to leverage the same contextual understanding and reasoning capabilities as the policy, leading to more nuanced and reliable reward signals. Furthermore, parameter sharing significantly reduces the number of trainable parameters compared to maintaining separate models, decreasing training time and resource requirements.
The ‘Tie Option’ within the RARO framework addresses the issue of reward sparsity common in reinforcement learning. By allowing the critic language model to designate a ‘tie’ when evaluating two responses as equally valid, the system avoids assigning an artificial, and potentially inaccurate, reward value to one response over the other. This is particularly important when dealing with complex tasks where multiple correct answers exist, or where the quality difference between two answers is negligible. Utilizing a tie designation prevents the learning signal from being overly focused on minor variations, leading to more stable training and improved generalization performance, as the agent isn’t penalized for selecting equally viable options.
The RARO framework incorporates a Replay Buffer to address potential instability arising from reinforcement learning dynamics. This buffer stores experiences generated during interactions – specifically, state, action, reward, and next state tuples – allowing the agent to learn from past data multiple times. By sampling from this buffer during training, RARO mitigates cycling behaviors that can occur when the agent repeatedly encounters the same states and actions. Furthermore, the Replay Buffer provides a more stable learning signal by breaking correlations between consecutive experiences and reducing variance in the gradient updates, ultimately enhancing the overall training process and convergence rate.
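A minimal replay buffer along these lines can be sketched as follows; the capacity, stored fields, and batch size are illustrative choices, not values from the paper.

```python
import random
from collections import deque

# Minimal replay buffer sketch: a bounded FIFO store plus uniform sampling.
class ReplayBuffer:
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest entries are evicted

    def add(self, prompt, response, reward):
        self.storage.append((prompt, response, reward))

    def sample(self, batch_size):
        # Uniform sampling breaks correlations between consecutive
        # experiences and reduces variance in the gradient updates.
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))

buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(f"prompt-{i}", f"response-{i}", float(i))
batch = buf.sample(2)  # a training batch mixing older and newer experience
```

Because the critic also sees stale policy samples from the buffer, it cannot simply chase the latest policy, which is what damps the cycling behavior mentioned above.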

Demonstrating Robustness: Reasoning and Creativity Benchmarked
The DeepMath dataset serves as a standardized benchmark for evaluating mathematical reasoning capabilities in artificial intelligence models. This dataset comprises a collection of mathematical problems, typically requiring multiple steps of logical deduction and calculation to arrive at a solution. Problems within DeepMath are presented in a formal, symbolic notation, demanding that models not only perform arithmetic operations but also understand and manipulate mathematical expressions. Performance is measured by the accuracy with which a model can correctly solve problems within the dataset, providing a quantitative assessment of its mathematical reasoning proficiency. The complexity of problems ranges from basic algebra and calculus to more advanced topics, allowing for a nuanced evaluation of a model’s capabilities across different mathematical domains.
The Countdown task assesses an agent’s ability to solve arithmetic problems presented as string-based equations. RARO achieved an accuracy of 54.4% on this task, indicating a high level of performance in symbolic manipulation and numerical reasoning. This result positions RARO closely behind the RLVR model, which attained 57.7% accuracy on the same benchmark. The narrow margin between the two models suggests RARO’s architecture is effectively capable of processing and solving arithmetic problems with a level of competency comparable to a leading model in the field.
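Countdown-style problems are a natural fit for verifier-based baselines like RLVR precisely because answers can be checked mechanically. A toy checker, with the task format assumed for illustration (an arithmetic expression over the given numbers that must hit a target value), might look like:

```python
import re

# Toy checker for Countdown-style answers. The exact task format is an
# assumption for illustration: an arithmetic expression that uses each
# given number exactly once and must evaluate to the target.
def check_countdown(expression, numbers, target):
    if not re.fullmatch(r"[\d\s+\-*/()]+", expression):
        return False  # reject anything that is not plain arithmetic
    used = [int(tok) for tok in re.findall(r"\d+", expression)]
    if sorted(used) != sorted(numbers):
        return False  # each given number must appear exactly once
    try:
        return eval(expression) == target  # safe: charset restricted above
    except ZeroDivisionError:
        return False
```

RARO's result is notable because it approaches the verifier-trained baseline without ever calling such a checker during training.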
RARO achieves a 41.3% accuracy rate on the DeepMath dataset, a benchmark designed for evaluating mathematical reasoning capabilities. This performance surpasses that of existing comparable methods documented in the literature. DeepMath problems are expressed in a formal language and demand multi-step reasoning, so RARO’s accuracy indicates a robust capacity for interpreting and applying mathematical principles, not merely performing calculations.
Evaluation of RARO’s creative reasoning capabilities included the Poetry Writing Task, where the model’s generated poetry was assessed for expressiveness and meaningfulness. Results demonstrated a substantial performance improvement over baseline models, indicating RARO’s capacity for creative text generation beyond purely logical tasks. This assessment utilized metrics focused on poetic quality and coherence, confirming that RARO not only produces syntactically correct poetry but also exhibits an understanding of creative expression and semantic relevance within the generated verses.

Towards Genuine Alignment: Post-Training Refinement with DAPO
Reinforcement learning from demonstrations, exemplified by techniques like RARO, establishes a crucial baseline for aligning large language models with human preferences. However, the process doesn’t necessarily reach its full potential without subsequent refinement. Post-training alignment methods such as DAPO build upon this foundation by further optimizing the model’s parameters against feedback signals. This allows for a more nuanced adjustment of the model’s behavior, addressing subtle stylistic inconsistencies or coherence issues that might persist after initial reinforcement learning. By focusing on fine-grained improvements, DAPO effectively polishes the model’s responses, resulting in outputs that are not only aligned with broad preferences but also exhibit a higher degree of sophistication and naturalness.
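In the recent RL literature, DAPO is commonly expanded as Decoupled Clip and Dynamic Sampling Policy Optimization; assuming that is the variant meant here, its signature idea is a PPO-style surrogate whose upper clip range is looser than its lower one. A toy per-token sketch, with illustrative epsilon values rather than values from this article's experiments:

```python
# Toy per-token clipped surrogate with decoupled clip ranges, in the
# style of DAPO's "clip-higher" trick. Epsilon values are illustrative
# assumptions, not taken from this article's experiments.
def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    unclipped = ratio * advantage
    # The upper clip is looser than the lower clip, leaving extra room
    # to up-weight low-probability but useful tokens when the advantage
    # is positive.
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(unclipped, clipped_ratio * advantage)
```

With a symmetric epsilon this reduces to the standard PPO objective; decoupling the two bounds is what gives the refinement stage its extra headroom.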
The application of DAPO to the Poetry Writing Task reveals a significant capacity for enhancing stylistic sophistication and textual cohesion. Through this post-training refinement process, the language model moves beyond simply generating grammatically correct verse; it begins to exhibit a greater understanding of poetic devices and a more natural command of rhythm and imagery. Evaluations demonstrate that DAPO effectively steers the model towards outputs favored by human reviewers, resulting in poems that are not only technically sound but also possess a discernible aesthetic quality. This improvement in nuance extends to subtle aspects of language, such as word choice and phrasing, contributing to a more polished and engaging final product. Consequently, the model demonstrates a capacity for generating poetry that resonates more deeply with human sensibilities, indicating a substantial step towards achieving truly aligned language generation.
Combining adversarial training, exemplified by the RARO method, with subsequent post-training refinement using techniques like DAPO presents a particularly effective strategy for developing language models that are both capable and well-aligned. RARO establishes a strong foundational understanding of desired behaviors by exposing the model to challenging, adversarially generated examples. However, this initial training can be further enhanced by DAPO, which fine-tunes the model’s responses after the primary training phase. This sequential approach allows for a more nuanced and coherent refinement of stylistic qualities and overall alignment with human preferences, resulting in models that demonstrate increased robustness and a stronger adherence to intended values.
The synergistic application of techniques like RARO and DAPO represents a significant step towards creating language models that are not simply powerful in their capabilities, but also reliably reflect human preferences and ethical considerations. Previous approaches often prioritized performance metrics, potentially overlooking subtle but critical aspects of alignment – such as stylistic consistency or nuanced understanding of intent. By first establishing a strong foundation through adversarial training, and then refining the model’s behavior with post-training methods, researchers are fostering a demonstrably improved correlation between a model’s outputs and human expectations. This progression moves beyond mere task completion, aiming for a level of responsiveness and appropriateness that builds trust and facilitates seamless interaction, ultimately paving the way for AI systems that are both intelligent and ethically grounded.
The pursuit of elegant solutions often feels like a desperate attempt to conceal inherent difficulties. This research, with its RARO algorithm, sidesteps the need for complex, task-specific verifiers – a clever maneuver, really. They called it a framework to hide the panic, one might jest, but in truth, it represents a significant step towards more robust reasoning in large language models. As Donald Knuth observed, “Premature optimization is the root of all evil.” RARO’s focus on learning directly from demonstrations, rather than crafting elaborate reward functions or soliciting human preference labels, exemplifies this principle. It prioritizes a clear, demonstrable path to success, acknowledging that simplicity – in this case, a direct imitation of expertise – can be profoundly effective, even on tasks lacking easily verifiable ground truth.
What Lies Ahead?
The pursuit of intelligence, even in its narrowly defined algorithmic form, perpetually reveals the elegance of simplicity. This work, by sidestepping the need for explicit verification or human preference signals, demonstrates a compelling reduction in complexity. Yet, the shadow of demonstration remains. The algorithm learns from examples, inheriting their biases and limitations. The true test will not be replicating existing expertise, but extrapolating beyond it – a challenge that demands, not more data, but a deeper understanding of the underlying principles that generate that data.
Current reliance on demonstration, while effective, hints at a fundamental constraint. A system confined to imitating, however skillfully, remains tethered to the known. Future iterations should explore methods for actively querying the world, for constructing internal models that facilitate independent exploration and discovery. The elimination of the verifier is a step forward, but a system that understands why an action is correct, rather than merely recognizing its success, represents a more substantial advance.
The question, ultimately, is not whether a machine can act intelligently, but whether it can reason intelligently. The current work provides a valuable tool, but the path towards genuine reasoning requires a willingness to confront the inherent ambiguity of the world, and to embrace the possibility of error – a quality conspicuously absent in the polished perfection of demonstration.
Original article: https://arxiv.org/pdf/2511.21667.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-11-30 23:38