Unlocking Reasoning: How to Train Smarter AI with Less Data

Author: Denis Avetisyan


A new framework, Miner, boosts the reasoning capabilities of large language models by cleverly repurposing existing data and turning uncertainty into a powerful learning signal.

The system cultivates robust behavior through intrinsic rewards (sequence-level uncertainty derived from a prior policy) that reinforce exploration of initially successful but still tentative actions. Token-level focal credit assignment then selectively amplifies learning signals from the critical elements within those sequences, while advantage calibration to a predefined threshold keeps the combined signal balanced, ultimately improving data efficiency without disrupting established learning.

Miner leverages uncertainty-driven rewards from positive homogeneous prompts to achieve data-efficient reinforcement learning in large reasoning models.

Current reinforcement learning methods struggle with data efficiency when training large reasoning models on seemingly perfect yet uninformative prompts. This limitation motivates ‘Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models’, which introduces a novel framework that repurposes a model’s own uncertainty as a self-supervised reward signal. By dynamically focusing learning on critical, uncertain tokens, Miner achieves state-of-the-art performance and significant gains on reasoning benchmarks without external supervision or increased inference costs. Does exploiting latent uncertainty represent a sufficient pathway towards scalable and efficient reinforcement learning for increasingly complex language models?


The Fragile Facade of Reasoning

Despite their remarkable ability to generate human-quality text, large language models frequently falter when confronted with tasks demanding complex reasoning. This isn’t simply a matter of lacking knowledge; even with vast datasets informing their responses, these models exhibit brittle performance – meaning they are easily derailed by slight variations in problem framing or unexpected inputs. This limitation stems from a reliance on statistical correlations within the training data, rather than a genuine understanding of underlying principles. Consequently, their reasoning capabilities often lack the robustness needed to generalize beyond the specific examples encountered during training, hindering their application to novel situations or problems requiring abstract thought. The models may excel at mimicking reasoning patterns but struggle when asked to apply them flexibly or creatively, revealing a gap between superficial fluency and true cognitive ability.

Applying traditional reinforcement learning to enhance reasoning in large language models presents significant hurdles, primarily due to the immense data requirements and the notorious ‘credit assignment problem’. These models often necessitate countless examples to learn even basic reasoning skills, a limitation given the vastness of potential logical pathways. More critically, when a model eventually arrives at a correct conclusion, determining which specific steps within its reasoning process were genuinely crucial – and which were merely coincidental or redundant – proves exceptionally difficult. This ambiguity hinders effective learning; the model struggles to refine its approach, as it cannot accurately pinpoint which actions deserve reinforcement and which should be avoided, ultimately limiting its ability to generalize to novel, complex problems. Consequently, alternative strategies are being explored to overcome these limitations and build more robust reasoning capabilities.

By introducing intrinsic rewards, Miner reduces redundant rollouts in reinforcement learning, achieving peak performance comparable to traditional methods in half the training steps and up to a 23% performance increase on Qwen3-4B-Base.

Cultivating Internal Validation

Miner represents a departure from traditional Reinforcement Learning from Human Feedback (RLHF) methods by incorporating uncertainty-driven intrinsic rewards as a primary learning signal, with the aim of diminishing dependency on scarce and costly external feedback. Instead of relying solely on external validation, Miner quantifies the model’s own uncertainty during the reasoning process and uses that quantification as a reward. This self-generated signal supports a more efficient learning loop, allowing the model to improve based on its internal assessment of confidence and to supplement, or even substitute for, externally provided rewards. The resulting framework sits within the Reinforcement Learning with Verifiable Rewards (RLVR) family, augmenting the usual verifiable outcome signal with the model’s internal uncertainty estimates.

The Miner framework builds a self-supervised learning loop around Positive Homogeneous (PH) prompts: prompts for which every sampled rollout is already correct and which therefore provide no learning signal under standard outcome-based rewards. For these rollouts, Miner computes an ‘Uncertainty-Driven Intrinsic Reward’ from the model’s internal probability distribution over tokens at each reasoning step, so that sequences the prior policy produced correctly but with low confidence receive larger rewards. This intrinsic reward mechanism lets Miner learn from its own predictions without external human feedback, effectively treating the model’s awareness of its own uncertainty as the learning signal, and it yields a continuous, differentiable quantity suitable for reinforcement learning.
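
To make this concrete, here is a minimal sketch of how a sequence-level uncertainty score might be derived from a prior policy’s token probabilities and used as an intrinsic reward for an already-correct rollout. The tensor shapes, the mean-NLL formulation, and the direct use of that score as the reward are illustrative assumptions, not Miner’s exact recipe.

```python
import torch

def sequence_uncertainty(prior_logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Sequence-level uncertainty: mean negative log-likelihood that a prior
    (reference) policy assigns to the tokens of a sampled rollout.

    prior_logits: [seq_len, vocab_size] logits from the prior policy
    token_ids:    [seq_len] ids of the tokens actually generated
    """
    log_probs = torch.log_softmax(prior_logits, dim=-1)
    token_log_probs = log_probs.gather(1, token_ids.unsqueeze(1)).squeeze(1)
    return -token_log_probs.mean()  # higher value = less confident rollout

# Toy usage: a 5-token rollout over a 10-token vocabulary that has already
# been verified as correct (hypothetical setup).
prior_logits = torch.randn(5, 10)
token_ids = torch.randint(0, 10, (5,))
intrinsic_reward = sequence_uncertainty(prior_logits, token_ids)  # rewards tentative successes
```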

Miner utilizes Adaptive Advantage Calibration and Token-Level Focal Credit Assignment to effectively distribute uncertainty-driven intrinsic rewards during reinforcement learning. Adaptive Advantage Calibration dynamically scales the intrinsic rewards based on the current advantage estimate, preventing reward saturation and stabilizing training. Token-Level Focal Credit Assignment then refines gradient propagation by down-weighting gradients from confidently predicted tokens and focusing them on tokens where the model exhibits higher uncertainty, thereby highlighting crucial reasoning steps and improving learning efficiency. This targeted credit assignment ensures that the model prioritizes refining its understanding in areas where it is most needed, maximizing the impact of the intrinsic reward signal.
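
A rough sketch of how these two mechanisms could be combined in a single policy-gradient step is shown below; the focal exponent, the clipping-to-threshold form of the calibration, and the REINFORCE-style loss are illustrative choices rather than Miner’s actual formulation.

```python
import torch

def focal_token_weights(token_probs: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # Focal-style weighting: tokens the model already predicts with high
    # confidence (p near 1) are down-weighted, so uncertain tokens keep
    # most of the learning signal. `gamma` is an illustrative hyperparameter.
    return (1.0 - token_probs) ** gamma

def calibrate_advantage(advantage: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    # Keep the combined (extrinsic + intrinsic) advantage within a preset
    # bound so the intrinsic term cannot dominate the update.
    return advantage.clamp(min=-threshold, max=threshold)

# Toy per-rollout loss: focal-weighted token log-likelihoods scaled by a
# calibrated advantage, as in a REINFORCE-style update.
token_probs = torch.tensor([0.95, 0.40, 0.10, 0.80])  # model confidence per generated token
token_logps = torch.log(token_probs)
advantage = calibrate_advantage(torch.tensor(1.8))    # e.g. extrinsic + intrinsic advantage
loss = -(focal_token_weights(token_probs) * token_logps).sum() * advantage
```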

Ablation studies demonstrate that the proposed innovations (intrinsic reward, focal weighting, and advantage calibration) enable Miner to achieve significant and consistent performance gains (+5 absolute points) over baseline algorithms, as evidenced by both performance dynamics under sufficient inference budgets and parallel test-time scaling across 10 runs (±1 standard deviation).

The Metrics of Self-Awareness

In the Miner framework, the Negative Log-Likelihood (NLL) functions as the primary metric for quantifying model uncertainty during the learning process. NLL is calculated based on the probability assigned by the model to the correct answer; a lower probability corresponds to a higher NLL value. Consequently, higher NLL values directly indicate greater uncertainty in the model’s prediction. This uncertainty is then leveraged as a component of the intrinsic reward signal, incentivizing the model to explore and refine its reasoning in areas where it exhibits the most uncertainty. The scale of the NLL is therefore critical, as it directly modulates the strength of the intrinsic motivation driving the learning process.
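
In symbols, one standard way to write this quantity is given below; whether Miner sums or averages over tokens, and whether the probabilities come from the current or a frozen prior policy, is left open here.

```latex
% Sequence-level negative log-likelihood of a generated response y given prompt x,
% accumulated over its T tokens. A larger value means the policy assigned lower
% probability to its own output, i.e. greater uncertainty.
\mathrm{NLL}(y \mid x) = -\sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)
```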

Training stability in Miner is assessed by monitoring KL divergence and entropy. The KL divergence tracks how far the current policy’s token distribution drifts from the prior policy; a sharp rise indicates the agent is exploring reasoning paths far from its reference behavior, which can destabilize learning. Entropy quantifies how spread out the policy’s token distribution is: persistently high entropy points to indecisive, erratic generation, while collapsing entropy signals premature exploitation. Tracking both metrics during training allows the intrinsic reward weighting or learning rate to be adjusted so that the rewards consistently guide the model toward reliable, predictable reasoning.
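
As a sketch of these diagnostics, the snippet below computes the KL divergence between a current and a reference token distribution together with the current distribution’s entropy; treating single decoding steps this way, and the choice of reference, are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def kl_and_entropy(p_logits: torch.Tensor, q_logits: torch.Tensor):
    """Training diagnostics: KL(p || q) between a current distribution p and a
    reference distribution q, plus the entropy of p. Both inputs are raw logits
    over the same support (e.g. the vocabulary at one decoding step)."""
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)
    p = p_log.exp()
    kl = (p * (p_log - q_log)).sum(dim=-1)  # KL divergence in nats
    entropy = -(p * p_log).sum(dim=-1)      # entropy of p in nats
    return kl, entropy

# Toy usage: one decoding step of the current policy vs. the prior policy.
kl, ent = kl_and_entropy(torch.randn(10), torch.randn(10))
```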

Evaluation of Miner utilized the Pass@K metric, which assesses the probability of a correct solution appearing within the top K generated responses. Results indicate an absolute improvement of +4.58 in Pass@1 – the probability of the first generated response being correct – and a +4.23 absolute improvement in the overall Pass@K metric when applied to the Qwen3-4B model. This demonstrates that Miner’s methodology effectively enhances the likelihood of generating correct solutions, both as a first attempt and within a broader set of potential responses.
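
For reference, Pass@K is commonly estimated with the standard unbiased estimator over n sampled responses per problem; the sketch below implements that estimator, though the paper’s exact evaluation script may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: given n sampled responses of which c are
    correct, the probability that at least one of a random size-k subset
    is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so every subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 3 of them correct.
print(pass_at_k(n=16, c=3, k=1))  # 0.1875
print(pass_at_k(n=16, c=3, k=8))  # ≈ 0.9
```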

Applying the Miner and GRPO algorithms to Qwen3-4B demonstrates a swift transition from exploration (characterized by increasing entropy and gradient norms) to exploitation, resulting in improved performance on downstream benchmarks (as shown in Fig. 8).

Scaling the Seeds of Self-Improvement

To rigorously assess Miner’s potential, the framework was implemented utilizing both a smaller language model, Qwen3-4B-Base, and a larger counterpart, Qwen3-8B-Base. This deliberate scaling approach allowed researchers to evaluate Miner’s adaptability and performance consistency across varying model sizes. By testing Miner on both a computationally efficient, smaller model and a more complex, larger model, the study aimed to determine if the framework’s benefits extend beyond specific model architectures and parameter counts, demonstrating its broad applicability within the landscape of large language models and reasoning tasks.

Evaluations reveal that Miner consistently enhances reasoning capabilities regardless of model scale. Specifically, when implemented with the Qwen3-8B model, Miner achieves a notable performance increase, demonstrating a +2.37 absolute improvement in Pass@1 – a metric measuring single-attempt accuracy – and a substantial +6.66 absolute gain in Pass@K, which assesses accuracy across multiple attempts. These results indicate Miner’s ability to effectively refine reasoning processes within larger language models, leading to a significant boost in overall problem-solving performance and reliability, even as model dimensions expand.

Evaluations reveal Miner to be a highly efficient reasoning enhancement technique when contrasted with the established DAPO baseline. Specifically, Miner achieves a noteworthy +2.48 absolute improvement in Pass@1 – a metric assessing single-attempt success – and a +2.14 absolute gain in Pass@K, which measures success across multiple attempts. These results underscore Miner’s ability to significantly boost the probability of correct responses, demonstrating its potential for optimizing large language model performance without requiring substantial computational resources or architectural changes. The observed gains suggest Miner represents a practical advancement in reasoning capabilities for various language-based applications.

Despite utilizing models of varying sizes (Qwen3-4B and Qwen3-8B), Miner consistently improves Pass@K performance as K increases, indicating its effectiveness at surfacing correct solutions as the sample budget grows.

The pursuit of data efficiency, as demonstrated by Miner, reveals a fundamental truth about complex systems. It isn’t simply about maximizing signal, but about acknowledging the inevitable entropy. The framework transforms uncertainty – a natural byproduct of reasoning within large language models – into a usable reward. This echoes Claude Shannon’s insight: “The most important thing in communication is to convey the meaning, not the signal.” Miner doesn’t attempt to eliminate uncertainty, but rather to harness it, recognizing that information isn’t about perfection, but about managing imperfection. The system, therefore, doesn’t build reasoning; it cultivates it from the inherent noise, and its continued success will depend on how well that noise keeps being managed.

What’s Next?

The pursuit of data efficiency in large language models, as demonstrated by this work, isn’t a quest for optimization – it’s an exercise in controlled entropy. Miner doesn’t so much solve the problem of sample complexity as it reshapes the landscape, finding signal in what was previously discarded. But every refined reward function is a new prophecy of failure. What unforeseen biases are now encoded in the model’s perception of ‘uncertainty’? The garden will always grow weeds, and the careful pruning of one variety merely encourages another.

The focus on ‘forgiveness’ between components – allowing the model to tentatively explore, to almost err without catastrophic consequence – hints at a deeper truth. Resilience lies not in isolation, but in the graceful degradation of the whole. Future work shouldn’t solely chase higher scores, but investigate the topology of these error landscapes. How can a model learn not just what is correct, but how to fail productively, to recover with elegance?

Ultimately, the true challenge isn’t building intelligent systems, but cultivating them. Miner offers a valuable tool for tending this garden, but it’s a reminder that the most sophisticated architecture is still vulnerable to the unpredictable forces of complexity. The question remains: can a system designed to learn from its mistakes also learn to anticipate, and even embrace, its own inevitable limitations?


Original article: https://arxiv.org/pdf/2601.04731.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-10 15:58