Author: Denis Avetisyan
New research explores training large language models to automatically generate formal counterexamples, pushing the boundaries of automated reasoning.

This paper introduces a framework for LLM-based formal counterexample generation using data synthesis and multi-reward learning to overcome challenges in theorem proving and formal verification.
While artificial intelligence has made strides in formalizing mathematical proof construction, the complementary skill of rigorously disproving false statements – finding counterexamples – remains largely underexplored. This limitation motivates the work ‘Learning to Disprove: Formal Counterexample Generation with Large Language Models’, which introduces a novel framework for training large language models to generate formally verifiable counterexamples. By combining symbolic data synthesis – creating diverse training instances by systematically modifying existing theorems – with a multi-reward learning scheme, this approach substantially enhances both the efficiency and effectiveness of LLM-driven counterexample search. Could this paradigm shift in formal reasoning unlock new capabilities in automated theorem proving and beyond?
The Elusive Truth: Challenging Automated Counterexample Discovery
The bedrock of automated theorem proving and formal verification rests upon the swift and reliable identification of counterexamples – instances that disprove a proposed statement. This process, however, has historically presented a significant challenge. Unlike pattern recognition tasks where abundant data fuels learning, verifying mathematical claims often requires exploring a vast, complex search space with limited guidance. A successful counterexample not only demonstrates the falsehood of a theorem but also provides insight into why it fails, demanding a level of logical reasoning and precision that is difficult to achieve computationally. The efficiency with which these disproving examples can be generated directly determines the scalability and practicality of automated verification systems; slow or unreliable counterexample generation hinders progress in areas ranging from software security to hardware design.
While large language models demonstrate impressive abilities in natural language and even some mathematical reasoning, applying them to automated theorem proving reveals critical limitations. These models, trained on vast datasets of text and code, often lack the precision and logical depth necessary to navigate the complexities of formal mathematics. Current LLM-based approaches frequently generate counterexamples that appear plausible but fail to satisfy the rigorous requirements of a formal proof, exhibiting difficulties with nuanced definitions, axiomatic systems, and the need for absolute certainty. The inherent probabilistic nature of LLM predictions contrasts sharply with the deterministic demands of mathematical verification, leading to outputs that, while syntactically correct, are often semantically flawed in a formal context – a subtle but crucial distinction hindering their effectiveness in this domain.
The efficacy of large language models in automated counterexample discovery is significantly constrained by a lack of sufficient training data and the challenge of sparse reward signals. Unlike tasks with abundant labeled examples, formal mathematics operates with a relatively small corpus of proven theorems and attempted proofs, limiting the model’s ability to generalize. Furthermore, even when a model generates a potential counterexample, determining its validity often requires extensive verification, providing only infrequent positive feedback – a signal indicating a successful discovery. This scarcity of instructive signals makes it difficult for the model to learn effectively and refine its counterexample generation strategies, hindering progress in automated theorem proving and verification systems. The model essentially operates in a landscape where correct answers are rare, making the learning process considerably more difficult than in areas with more readily available and frequent feedback.

Strategic Mutation: Expanding the Problem Space
The Mutation Strategy generates novel counterexample problems through a targeted process of hypothesis removal from established, provable theorems. This technique operates by systematically discarding individual hypotheses within a theorem’s antecedent – the ‘if’ part of an ‘if-then’ statement – effectively creating a modified theorem. The resulting mutated theorem, while no longer provable under the original conditions, then serves as a new problem instance for counterexample search. This approach differs from random problem generation as it is guided by existing logical structures, ensuring the generated problems retain a degree of relevance to the original search space and are syntactically valid.
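As a concrete illustration (the theorem below is a hypothetical example chosen for brevity, not one taken from the paper), consider antisymmetry of `≤` on the natural numbers in Lean 4. Dropping one of its two hypotheses yields a false statement, and the resulting counterexample is itself machine-checkable by proving the negation:

```lean
-- Provable seed theorem: both hypotheses together entail the conclusion.
theorem seed (a b : Nat) (h₁ : a ≤ b) (h₂ : b ≤ a) : a = b :=
  Nat.le_antisymm h₁ h₂

-- Mutation: discard h₂. The weakened statement is false, and the
-- counterexample (a, b) = (0, 1) is verified by proving the negation.
theorem mutated_is_false : ¬ ∀ a b : Nat, a ≤ b → a = b :=
  fun h => absurd (h 0 1 (Nat.zero_le 1)) (by decide)
```

The mutated statement thus becomes a fully formal disproof exercise: the search succeeds exactly when Lean accepts the negation proof.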
The mutation strategy expands the counterexample search space by systematically altering the problem definition based on existing, provable theorems. This is achieved by discarding hypotheses within those theorems, creating variations of the original problem that retain core logical structure but introduce new potential failure points for the tested LLM. This broadened search isn’t random; it’s guided by the structure of proven theorems, ensuring the generated counterexamples are logically related to the initial problem. Consequently, solutions that were previously obscured due to the limitations of the original search space become accessible, increasing the probability of identifying edge cases and vulnerabilities within the LLM’s reasoning capabilities.
Expanding the problem space through mutation allows Large Language Models (LLMs) to improve counterexample search efficiency by concentrating computational resources on a more targeted, yet broader, set of possibilities. Traditional search methods can be hampered by initially narrow or incomplete problem definitions, leading to failed attempts even when a solution exists. This technique increases the probability of success by generating variations of existing problems, effectively increasing the likelihood that a solvable instance will be encountered during the search process. Empirical results demonstrate a quantifiable improvement in success rates when LLMs are applied to these mutated problem spaces compared to searches performed on the original, unmutated problems.
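The enumeration itself is mechanically simple. A schematic sketch (the representation of hypotheses as strings and all names here are illustrative, not the paper's implementation):

```python
def mutate_by_hypothesis_removal(hypotheses, conclusion):
    """Yield mutated problems: each variant drops exactly one hypothesis
    from the antecedent while keeping the conclusion fixed (schematic)."""
    for i in range(len(hypotheses)):
        yield hypotheses[:i] + hypotheses[i + 1:], conclusion

# Example: mutating "a <= b and b <= a implies a = b" produces two
# weakened candidate problems for the counterexample search.
variants = list(mutate_by_hypothesis_removal(["a <= b", "b <= a"], "a = b"))
```

Each yielded variant is then handed to the LLM as a fresh disproof target, which is how the mutation ratio reported later (synthesized problems per seed theorem) arises.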

Formal Rigor: Validation Through Theorem Proving
The Lean 4 theorem prover serves as a foundational component in formal verification by rigorously assessing the validity of generated counterexamples. Unlike traditional testing methods, Lean 4 employs a formal logic system to mathematically prove or disprove claims about a system’s behavior. This involves translating the counterexample – a specific input demonstrating a potential failure – into a statement that Lean 4 can analyze. The prover then attempts to construct a formal proof demonstrating the counterexample’s correctness, effectively confirming the existence of a bug. If Lean 4 cannot find such a proof, it indicates the counterexample is likely invalid, potentially stemming from an error in the counterexample generation process or a misunderstanding of the system’s specifications. This deterministic and mathematically sound verification process provides a significantly higher level of assurance than empirical testing.
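In practice this check can be scripted: the candidate disproof is written to a file and handed to the Lean compiler, whose exit code signals acceptance. The sketch below is an assumption about how such a pipeline might look (real setups typically invoke Lean through `lake` inside a project with Mathlib on the search path):

```python
import subprocess
import tempfile

def check_with_lean(lean_source: str, lean_cmd=("lean",)) -> bool:
    """Ask the Lean 4 compiler to elaborate a candidate disproof.

    Returns True iff Lean accepts the file (exit code 0). The command
    is configurable because production pipelines usually run
    `lake env lean` inside a project (an assumption in this sketch).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_source)
        path = f.name
    result = subprocess.run([*lean_cmd, path], capture_output=True)
    return result.returncode == 0
```

Because the verdict is binary and deterministic, this check doubles as the reward oracle during training.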
Large language models, specifically Deepseek-Prover-V2 7B and Qwen3 8B, are utilized to automate aspects of the formal verification process. Deepseek-Prover-V2 7B focuses on generating formal proofs – rigorous, logically structured arguments – while Qwen3 8B contributes through informal reasoning capabilities which assist in evaluating the plausibility of potential proof steps. These models are not intended to replace theorem provers entirely, but rather to act as assistants, suggesting proof strategies and identifying potentially correct solutions for subsequent verification by a formal system like Lean 4. The models’ output is assessed based on its logical validity and completeness, with successful proofs contributing to the overall verification pipeline.
Autoformalization addresses the challenge of adapting human-readable problem statements into the formal languages required by theorem provers like Lean 4. This process involves translating natural language descriptions of mathematical problems – including axioms, definitions, and conjectures – into a logically precise and machine-interpretable format. Techniques employed in autoformalization include parsing natural language, identifying key mathematical concepts, and constructing formal representations using the syntax and semantics of the target theorem proving system. Successful autoformalization is critical for enabling automated theorem proving on problems expressed in natural language, bridging the gap between human intuition and machine verification.
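For instance, the informal claim “there is no largest natural number” might be autoformalized as follows (one of several reasonable renderings, shown purely as an illustration):

```lean
-- Informal: "There is no largest natural number."
-- One candidate autoformalization, with n + 1 as the witness:
theorem no_largest_nat : ∀ n : Nat, ∃ m : Nat, n < m :=
  fun n => ⟨n + 1, Nat.lt_succ_self n⟩
```

The difficulty lies in choosing the right formal vocabulary: small changes in quantifier placement or type choice can yield a statement that is syntactically valid yet subtly different from the informal intent.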
The Multi-Reward Function enhances the learning process in automated theorem proving by providing dual incentives. It assigns positive rewards not only when a valid proof is generated for a mutated theorem – a variation of the original problem – but also when the system successfully identifies and discards a false hypothesis. This dual-reward mechanism encourages both constructive proof generation and the ability to recognize invalid statements, improving the overall efficiency and robustness of the learning algorithm by penalizing unproductive lines of inquiry and focusing resources on viable proof paths.
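A minimal sketch of such a reward function (the weights and the exact decomposition of the signal are assumptions for illustration, not the paper's specification):

```python
def multi_reward(proof_verified: bool,
                 discarded_hypothesis: str,
                 false_hypothesis: str,
                 w_proof: float = 1.0,
                 w_hyp: float = 0.5) -> float:
    """Combine two incentives: a formally verified disproof of the
    mutated theorem, and correct identification of the false
    (discarded) hypothesis. Weights are illustrative."""
    reward = 0.0
    if proof_verified:                       # Lean 4 accepted the disproof
        reward += w_proof
    if discarded_hypothesis == false_hypothesis:
        reward += w_hyp                      # the right hypothesis was flagged
    return reward
```

Shaping the reward this way gives partial credit for recognizing *which* hypothesis is at fault even when the full disproof fails, densifying an otherwise sparse signal.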

Scaling Reason: Towards Robust Automated Validation
The generation of effective counterexamples – crucial for both debugging and validating automated reasoning systems – benefits significantly from a strategically implemented mutation strategy coupled with formal verification. This approach doesn’t simply test for errors, but actively creates challenging scenarios by subtly altering problem instances. These mutations, guided by the principles of formal verification, introduce variations designed to expose weaknesses in the reasoning process. The resulting refined counterexample generation isn’t merely about finding any failing case, but identifying those that most effectively demonstrate the limits of the system’s logic, leading to more robust and reliable automated reasoning frameworks. This synergistic combination unlocks a pathway to systematically stress-test and improve the core functionality of AI problem-solving capabilities.
The generation of valid proofs benefits significantly from the integration of reinforcement learning algorithms, particularly Group Relative Policy Optimization (GRPO). This approach allows the model to learn a policy for selecting proof steps, iteratively refining its strategy based on feedback from a reward signal. Unlike traditional methods that rely on pre-defined heuristics, GRPO scores each sampled attempt relative to the other attempts in its sampling group, enabling the system to explore a wider range of potential proof paths while adapting to the specific challenges posed by each problem. By effectively balancing exploration and exploitation, the model learns to prioritize steps that are likely to lead to a successful proof, resulting in improved performance and a greater capacity to solve complex mathematical problems. The algorithm’s ability to learn from its mistakes and adjust its strategy dynamically is crucial for navigating the intricate landscape of formal reasoning.
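The core of GRPO is that each sampled proof attempt is scored relative to its own sampling group, removing the need for a learned value function. A minimal sketch of the advantage computation (following the common formulation; details are illustrative):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each completion's reward
    against the mean and std-dev of its sampling group."""
    mu = statistics.fmean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    if sigma == 0:            # degenerate group: no learning signal
        return [0.0] * len(group_rewards)
    return [(r - mu) / sigma for r in group_rewards]
```

Attempts that beat their group average get positive advantages and are reinforced; a group where every attempt fails (or every attempt succeeds) contributes no gradient, which is why diverse mutated problems matter.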
Rigorous evaluation of the framework on the challenging CounterMath dataset reveals a substantial leap in automated reasoning capabilities. The system achieves an impressive Pass@1 rate of 95%, indicating that it successfully solves 95% of the presented problems on the first attempt – a significant improvement over existing methods. This translates to solving 95 more problems than the strongest baseline model, demonstrating the effectiveness of the integrated mutation strategy and formal verification techniques. The results highlight not only the system’s ability to find solutions, but also its efficiency in doing so, paving the way for more robust and reliable automated reasoning systems in various domains.
The developed framework demonstrably enhances automated reasoning capabilities, achieving Pass@4 and Pass@9 rates of 69% and 63% respectively on a challenging mathematical dataset. This translates to solving 69 and 63 more problems than the current state-of-the-art baseline, highlighting a substantial improvement in problem-solving efficacy. Crucially, the system’s performance isn’t solely about quantity; a mutation ratio ranging from 1.65 to 2.48 indicates the approach effectively synthesizes new, diverse training data, enabling the model to generalize beyond the initial dataset and refine its reasoning process. This data synthesis capability is pivotal, suggesting the framework’s potential extends beyond simply achieving higher scores and towards fostering a more adaptable and robust automated reasoning system.
The developed framework extends beyond mathematical problem solving, offering a powerful methodology applicable to a wider range of computational challenges. Specifically, the integration of mutation-guided formal verification holds significant promise for program verification, where ensuring code correctness is paramount. By systematically introducing and validating changes, the approach can rigorously identify and eliminate bugs, leading to more reliable software. Furthermore, this methodology contributes to the development of more robust artificial intelligence systems; validating the logical consistency of AI models – and their decision-making processes – is crucial for deploying trustworthy and safe AI in critical applications, enhancing their resilience to unforeseen inputs and adversarial attacks.
The pursuit of formal counterexamples, as detailed in this work, demands a ruthless pruning of complexity. The framework presented prioritizes succinctness in symbolic data synthesis, aiming for minimal, yet conclusive, refutations of proposed theorems. This echoes Claude Shannon’s sentiment: “The most important thing in communication is to convey information with the least possible redundancy.” The multi-reward learning scheme, while sophisticated, ultimately serves to refine the model’s ability to isolate the core elements necessary for a counterexample, discarding extraneous details. The success of this approach demonstrates that genuine understanding, and therefore effective reasoning, isn’t built upon layers of intricacy, but upon the elegance of essential truth.
What’s Next?
The presented work addresses a practical bottleneck – data scarcity – in formal reasoning. Yet, resolving one constraint invariably reveals others. The multi-reward scheme, while demonstrably effective, introduces complexity. Future iterations must assess whether gains in counterexample generation justify increased algorithmic overhead. Clarity is the minimum viable kindness, and simpler solutions remain desirable.
A critical, largely unexplored area concerns generalization. Performance on benchmark theorems does not guarantee robustness against adversarial examples – subtly altered statements designed to expose weaknesses in the LLM’s reasoning. This is not a failing, but a necessary challenge. True intelligence – even simulated – is defined not by what it knows, but by what it doesn’t know, and how it responds.
Ultimately, the field seeks not merely to find counterexamples, but to understand failure. The current framework is a proficient search algorithm. The next step requires imbuing it with something approaching diagnostic capability. The ability to articulate why a theorem fails, not simply that it does, represents a substantive – and likely elusive – advance.
Original article: https://arxiv.org/pdf/2603.19514.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-23 23:35