The Logic of Language: Guiding Models Towards Truth

Author: Denis Avetisyan


Researchers are developing new techniques to train language models to not just generate text, but to actively reason and verify their conclusions.

The study demonstrates that the choice of training technique for the Qwen-2.5-MATH-1.5B model significantly affects its performance on the Minerva dataset, as evidenced by variations in Pass@256 versus Pass@1 – quantifiable measures of successful problem-solving under different sampling budgets.

A novel reinforcement learning approach using α-divergences balances precision and diversity in formal theorem proving with language models.

While reinforcement learning has become standard for equipping large language models with reasoning abilities, it often comes at the cost of diminished diversity in generated solutions. This paper, ‘Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity’, argues this stems from an implicit optimization towards high-probability outcomes, neglecting potentially valid alternatives. We introduce a method that explicitly defines a target distribution by filtering incorrect answers, then utilizes α-divergences to balance precision and coverage during training. Demonstrating state-of-the-art performance on a Lean theorem-proving benchmark, our approach raises the question: can explicitly shaping the target distribution unlock more robust and creative reasoning in language models?


The Illusion of Fluency: Reasoning and Distributional Misalignment

Despite their remarkable ability to generate human-quality text, large language models frequently falter when confronted with tasks demanding complex reasoning. This isn’t a deficit in linguistic skill, but rather a misalignment between the model’s generative process and the distribution of desirable outputs. Essentially, these models are optimized for predicting the next word in a sequence, not necessarily for arriving at logically sound or factually accurate conclusions. Consequently, improving performance requires more than simply increasing model size; it necessitates innovative training techniques that better guide the model towards producing responses that are not only fluent but also consistent with desired reasoning patterns and knowledge constraints. This pursuit of ‘alignment’ aims to shape the probability distribution of generated text, pushing it away from plausible but incorrect answers and towards those that reflect robust, verifiable reasoning.

Increasing the size of language models, while consistently yielding performance gains, is proving insufficient to address fundamental limitations in reliability and trustworthiness. Researchers are discovering that simply adding more parameters doesn’t automatically translate to improved reasoning or a reduction in harmful outputs; instead, it often amplifies existing biases and inconsistencies. Consequently, the focus is shifting toward more sophisticated training methodologies – including techniques like reinforcement learning from human feedback and contrastive learning – and robust evaluation metrics that move beyond simple accuracy. These advanced approaches aim to better align model behavior with human values and expectations, fostering a system where generated content is not only fluent but also demonstrably truthful, safe, and consistently aligned with the intended purpose, necessitating a move from scale alone to a holistic refinement of model development and assessment.

Language models often face a critical dilemma: striving for highly precise, predictable outputs can stifle creativity and exploration, while encouraging diverse responses risks sacrificing accuracy and factual consistency. This inherent trade-off stems from the training objectives themselves; methods optimized for minimizing prediction error tend to favor the most probable tokens, leading to conservative and sometimes repetitive text. Conversely, techniques designed to promote varied generation – such as increasing sampling temperature or employing nucleus sampling – can introduce irrelevant or nonsensical content. Consequently, developers continually navigate this tension, seeking strategies to balance the need for reliable, grounded responses with the desire for engaging and imaginative text generation, recognizing that a truly versatile language model requires both precision and diversity, not one at the expense of the other.
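
To make this trade-off concrete, the snippet below sketches the two sampling knobs mentioned above – temperature scaling and nucleus (top-p) sampling – on a toy next-token distribution. The numbers and helper functions are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

# Hypothetical next-token logits over a tiny vocabulary (illustrative only).
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])

def temperature_probs(logits, temperature):
    # Lower temperature sharpens the distribution (precise but repetitive);
    # higher temperature flattens it (diverse but error-prone).
    z = logits / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def nucleus_probs(probs, top_p=0.9):
    # Keep the smallest set of tokens whose cumulative mass reaches top_p,
    # then renormalize; this trims the unreliable low-probability tail.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    trimmed = np.zeros_like(probs)
    trimmed[keep] = probs[keep]
    return trimmed / trimmed.sum()

print(temperature_probs(logits, 0.5))                      # mass concentrates on the top token
print(temperature_probs(logits, 1.5))                      # mass spreads across alternatives
print(nucleus_probs(temperature_probs(logits, 1.0), 0.9))  # tail tokens zeroed out
```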

Our α-DPG method balances exploration and exploitation by achieving high precision and coverage, positioning it along a Pareto frontier compared to other reinforcement learning approaches that either focus on limited regions or sacrifice quality for diversity.

Constructing Verifiable Intelligence: Distributional Policy Optimization

Distributional Matching with Verifiable Rewards (DMVR) establishes a training framework where policies are optimized to align with predefined target distributions. This is achieved by employing a binary Verifier component that evaluates the correctness of generated outputs. The Verifier provides a discrete signal – either accepting or rejecting an output – which is then used to guide the policy’s learning process. Unlike traditional reward functions that might offer nuanced feedback, the Verifier provides a clear, boolean assessment, effectively creating a supervised signal for policy improvement. This approach allows for training policies that not only maximize a reward but also adhere to specific distributional constraints, ensuring outputs conform to desired characteristics and reducing the likelihood of generating undesirable or invalid results.
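
One natural way to write the target distribution implied by this construction – using our own notation, which may differ from the paper's – is $\pi^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\mathbf{1}\left[V(x, y) = 1\right]$, where $\pi_{\mathrm{ref}}$ is the reference model and $V$ is the binary verifier: the reference distribution restricted to verified outputs and renormalized.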

Rejection Sampling Fine-tuning (RS-FT) and Reinforcement Learning with Verifier Rewards (RLVR) both integrate the binary verification process central to Distributional Matching with Verifiable Rewards (DMVR). In RS-FT, the verifier acts as a filter, accepting only outputs deemed correct and using these for subsequent fine-tuning iterations, effectively biasing the model towards verifiable solutions. RLVR, conversely, utilizes the verifier’s output as a reward signal within a reinforcement learning framework; a positive verification results in a reward, guiding the policy to generate more verifiable outputs. Both methods directly leverage the verifier to shape the policy’s output distribution, although they differ in their implementation – RS-FT uses a deterministic acceptance/rejection mechanism, while RLVR employs a gradient-based optimization approach guided by the verifier reward.
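
The sketch below illustrates, under our own naming conventions (the callables `model_sample` and `verifier` are hypothetical stand-ins, not APIs from the paper or any library), how a single verification pass can feed either training mode: RS-FT consumes the verdict as a hard filter, RLVR as a binary reward.

```python
def collect_training_signal(model_sample, verifier, prompts, n_samples=8):
    # One verification pass, two uses of the result:
    #   RS-FT keeps only verified (prompt, output) pairs for supervised fine-tuning;
    #   RLVR keeps every sample and attaches the binary verdict as a reward.
    sft_data, rl_data = [], []
    for x in prompts:
        for y in model_sample(x, n_samples):
            ok = bool(verifier(x, y))
            if ok:
                sft_data.append((x, y))                 # hard accept/reject filter
            rl_data.append((x, y, 1.0 if ok else 0.0))  # reward signal for policy gradients
    return sft_data, rl_data

# Toy usage with stand-in callables (a real setup would call an LLM and a Lean checker):
sft, rl = collect_training_signal(
    model_sample=lambda x, n: [f"{x}: candidate {i}" for i in range(n)],
    verifier=lambda x, y: y.endswith("0"),
    prompts=["theorem_1"],
    n_samples=4,
)
```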

Exclusive reliance on a verifier reward during policy training can negatively impact the diversity of generated outputs. This occurs because the policy is optimized to maximize reward from the verifier, leading it to consistently select high-probability outputs that are easily verified. This reinforces mode-seeking behavior, where the model converges on a limited subset of the possible output space, and diminishes the exploration of less common, yet potentially creative, alternatives. Consequently, the resulting text tends to be predictable and lacks the variability expected from more robust generative models, even if those models are less directly optimized for verifiability.

Training curves demonstrate that incorporating high KL or entropy regularization into the DR-GRPO baseline improves sequence entropy and reward, although some runs were prematurely terminated and resumed.

The α-Divergence: Balancing Precision and Exploration

Alpha-DPG generalizes Distributional Policy Gradient (KL-DPG) by incorporating a family of $f$-divergences parameterized by $\alpha$. Traditional KL-DPG utilizes the Kullback-Leibler divergence, which represents a specific case within this broader framework. By adjusting the $\alpha$ parameter, Alpha-DPG enables a continuous transition between the forward Kullback-Leibler divergence ($\alpha$ approaches 0) and the reverse Kullback-Leibler divergence ($\alpha$ approaches 1). This parameterization allows for the definition of a divergence measure that prioritizes either matching the modes of the target distribution or ensuring coverage of the support, offering a tunable balance between precision and diversity in policy optimization.
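
For reference, one common parameterization of the $\alpha$-divergence family is

$$D_{\alpha}(p \,\|\, q) \;=\; \frac{1}{\alpha(1-\alpha)}\left(1 - \sum_{y} p(y)^{\alpha}\, q(y)^{1-\alpha}\right),$$

which recovers a Kullback-Leibler divergence at each endpoint of $\alpha \in (0, 1)$ and, up to a constant factor, the squared Hellinger distance at $\alpha = \tfrac{1}{2}$. Conventions differ across the literature, so the paper's assignment of the forward and reverse KL endpoints to particular $\alpha$ values may not match this form exactly.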

Alpha-DPG leverages the properties of $\alpha$-divergences to modulate the balance between coverage and precision in policy optimization. Traditional Distributional Policy Gradient methods often utilize the Kullback-Leibler (KL) divergence, which inherently favors solutions prioritizing precision over broad exploration. By interpolating between the forward KL divergence and its reverse counterpart – controlled by the $\alpha$ parameter – Alpha-DPG enables a tunable trade-off. A lower $\alpha$ value emphasizes coverage of the target distribution, encouraging exploration, while a higher $\alpha$ value prioritizes precision by minimizing divergence from the target. This interpolation allows for fine-grained control over the policy update, enabling the generation of policies that can achieve a desired balance between exploring diverse actions and exploiting optimal strategies.
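
A toy numerical example (ours, not from the paper) shows the two endpoint behaviours that $\alpha$ interpolates between: forward KL rewards covering every mode of the target, while reverse KL rewards committing to a single mode and avoiding mass the target does not support.

```python
import numpy as np

# Target p puts almost all mass on two "correct" modes; q_mode commits to one of
# them, while q_cover spreads mass over both but also leaks onto the unlikely tail.
p       = np.array([0.495, 0.495, 0.01])
q_mode  = np.array([0.98,  0.01,  0.01])
q_cover = np.array([0.35,  0.35,  0.30])

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

for name, q in [("mode-seeking q", q_mode), ("mass-covering q", q_cover)]:
    # Forward KL(p||q) punishes q for missing mass where p is large (coverage);
    # reverse KL(q||p) punishes q for placing mass where p is small (precision).
    print(f"{name}: forward KL = {kl(p, q):.2f}, reverse KL = {kl(q, p):.2f}")

# Forward KL prefers the covering model; reverse KL prefers the mode-seeking one.
# Intermediate alpha values trade off smoothly between these two preferences.
```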

Experimental results demonstrate that the Alpha-DPG approach achieves Pareto-optimal performance with respect to both coverage and precision: the trained models consistently lie on the Pareto frontier, meaning no competing method improves one of these metrics without sacrificing the other. Comparative analysis shows that Alpha-DPG dominates existing methods in this regard. Furthermore, evaluations indicate a statistically significant improvement in the diversity of generated samples compared to the GRPO algorithm, suggesting a more comprehensive exploration of the policy space.

Alpha-DPG models achieve a Pareto-optimal trade-off between precision and coverage, as demonstrated by bootstrap variance estimates.

Formal Verification: A Gold Standard for Reasoning Evaluation

The pursuit of reliable benchmarks for evaluating language model reasoning has led to the integration of formal proof assistants, notably Lean. Unlike traditional datasets which may contain ambiguous or subjective answers, Lean provides a system for constructing mathematically rigorous proofs, establishing verifiable ground truth. This allows for a precise assessment of a model’s ability to not simply generate plausible text, but to engage in logically sound deduction. By framing reasoning tasks as formal proofs, researchers can move beyond statistical metrics and evaluate whether a model truly understands the underlying principles. The use of Lean, therefore, represents a shift towards more robust and trustworthy evaluation, offering a gold standard against which language models can be rigorously tested and improved, ultimately driving progress towards artificial general intelligence.
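
For readers unfamiliar with Lean, the toy snippets below (ours, far simpler than the benchmark problems) show the kind of statement the kernel either accepts or rejects, yielding exactly the binary ground truth the verifier needs.

```lean
-- Illustrative Lean 4 examples: the kernel checks each proof term and either
-- accepts it (verified) or rejects it, with no intermediate grades.
example : 2 + 2 = 4 := rfl

theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```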

Assessing the reasoning capabilities of large language models requires more than simple accuracy scores; the Pass@K metric offers a nuanced approach by measuring the probability that a model generates at least one correct solution within $K$ attempts. When integrated with formal verification tools like Lean, Pass@K becomes particularly powerful. Lean provides a definitive ground truth – a mathematically verifiable proof of correctness – against which the model’s generated samples are evaluated. This combination allows researchers to move beyond subjective assessments and quantify a model’s ability to consistently produce logically sound outputs, even if it doesn’t succeed on every single try. A higher Pass@K score, therefore, indicates a greater likelihood of finding a valid solution, offering a robust benchmark for comparing different models and tracking improvements in their reasoning abilities.
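
The standard unbiased estimator of Pass@K from $n$ samples with $c$ verified successes (Chen et al., 2021) can be computed as below; the paper may report the metric with different sample counts or estimation details.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that a uniformly chosen size-k subset of the n samples
    # contains at least one of the c verified successes:
    # 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 256 samples for one problem, 3 of which verify in Lean.
print(pass_at_k(256, 3, 1))    # Pass@1  ~= 0.0117
print(pass_at_k(256, 3, 256))  # Pass@256 = 1.0
```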

Recent advances in enhancing language model outputs center on reinforcement learning algorithms that refine the generation process and on regularization techniques that preserve quality. Group Relative Policy Optimization (GRPO) delivers notable performance gains by estimating advantages from verifier rewards normalized within a group of responses sampled for the same prompt, steering the model toward the desired distribution without a learned value function. High-KL regularization, in contrast, encourages diversity and guards against overconfident yet incorrect predictions by penalizing policies that drift too far from the reference model. The two approaches differ in emphasis – GRPO shapes the optimization signal itself, while high-KL regularization constrains how far the output distribution can move – yet both contribute to more robust and reliable performance on complex reasoning tasks, improving the overall quality and trustworthiness of the generated content.
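
The sketch below illustrates the group-relative advantage idea together with a KL penalty toward a reference model. It is a simplified schematic under our own assumptions (no clipping, a crude per-sample KL estimate), not the paper's or any library's implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style advantages: normalize verifier rewards within the group of
    # responses sampled for the same prompt, so no learned critic is needed.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def surrogate_objective(advantages, logp_new, logp_old, logp_ref, beta=0.1):
    # Simplified policy-gradient surrogate: importance-weighted advantages minus
    # a KL-style penalty toward the reference model, which discourages collapse
    # onto a few high-reward modes (the role of high-KL regularization above).
    adv = np.asarray(advantages, dtype=float)
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    kl_to_ref = np.asarray(logp_new) - np.asarray(logp_ref)  # crude per-sample estimate
    return float(np.mean(ratio * adv - beta * kl_to_ref))

# Four proofs sampled for one prompt; two pass the verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # verified samples get positive advantage, unverified negative
print(surrogate_objective(adv,
                          logp_new=[-2.0, -2.2, -2.1, -1.9],
                          logp_old=[-2.1, -2.1, -2.1, -2.1],
                          logp_ref=[-2.0, -2.0, -2.0, -2.0]))
```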

Kimina-Prover-Distill-1.7B achieves high verification rates, as demonstrated by its Pass@k curve on a held-out set of 200 problems.

The pursuit of verifiable reasoning, as demonstrated in the paper’s exploration of Reinforcement Learning from Verifiable Rewards, echoes a fundamental tenet of computational elegance. Donald Davies observed, “The computer is not just a tool, but a medium.” This statement resonates with the work’s focus on shaping the language model’s output distribution using α-divergences. The paper meticulously crafts a system where the ‘medium’ – the LLM – doesn’t merely appear to reason, but does so within a formally verifiable framework. By prioritizing mathematical discipline in the training process, the research establishes a foundation where solutions aren’t judged by empirical success alone, but by their inherent correctness – a principle central to Davies’ vision of computing’s potential.

What’s Next?

The pursuit of verifiable intelligence within large language models reveals a curious tension. This work, by framing theorem proving as a distributional problem and leveraging α-divergences, offers a pragmatic step forward. However, the fundamental question remains: Let N approach infinity – what remains invariant? The balance struck between precision and diversity, while adjustable via the α parameter, feels less like a solved problem and more like a carefully managed tradeoff. The elegance of formal verification demands more than simply achieving high scores on a benchmark; it demands a demonstrable guarantee of correctness, a property currently approximated, not possessed.

Future work must address the limitations inherent in relying on reward signals derived from theorem provers themselves. The Lean Theorem Prover, while powerful, is still a tool built on human intuition. Can a truly autonomous system, divorced from such biases, navigate the landscape of mathematical truth? The exploration of alternative divergence measures, beyond α-divergences, and a deeper investigation into the properties of the target distributions themselves, seem crucial.

Ultimately, the goal is not merely to build a language model that acts like a mathematician, but one that is a mathematician – one that operates on the bedrock of logical necessity, not probabilistic approximation. This necessitates a shift in perspective – from training models to proving models – and a relentless focus on the invariant properties that define true intelligence.


Original article: https://arxiv.org/pdf/2512.05962.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
