DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
Huajian Xin1,2 Daya Guo1 Zhihong Shao1 Z.Z. Ren1 Qihao Zhu1 Bo Liu1
Chong Ruan1 Wenda Li3 Xiaodan Liang2,4∗
1 DeepSeek  2 Sun Yat-sen University  3 University of Edinburgh  4 MBZUAI
Abstract
1 Introduction
In modern mathematics, the increasing complexity of proofs presents substantial challenges for
peer review. This complexity has led to the acceptance of erroneous proofs, with critical flaws
often detected only after considerable time. To address these issues, formal mathematical languages
such as Lean [De Moura et al., 2015, Moura and Ullrich, 2021], Isabelle [Paulson, 1994], and
Coq [The Coq Development Team] have been developed. These languages enable the creation of
computer-verifiable proofs [Avigad, 2023]. However, crafting formal proofs demands significant
effort and specialized expertise, and poses challenges even for seasoned mathematicians. Consequently,
the significance of automated theorem proving is on the rise [Shulman, 2024].
To reduce the effort involved in writing formal mathematical proofs, several approaches [Polu and
Sutskever, 2020, Jiang et al., 2021, Han et al., 2021, Polu et al., 2022, Lample et al., 2022, Jiang et al.,
2022a, Yang et al., 2024] have been developed, primarily focusing on search algorithms that explore
potential solutions for proposed theorems. However, these methods often struggle with the vast search
spaces required for complex theorems, rendering them ineffective for more intricate proofs [Loos
et al., 2017]. Recently, advances in large language models (LLMs) have introduced a novel strategy for automated theorem proving.
∗ Corresponding author.

2 Related Work
Automated Theorem Proving. Traditional ATP systems based on first-order logic [Bibel, 2013, Schulz, 2002, Kovács and Voronkov, 2013] struggle with the theorems commonly found in modern proof assistants such as Lean [De Moura et al., 2015], Isabelle
[Paulson, 1994], and Coq [The Coq Development Team]. The advent of recent deep learning models
and model-guided search techniques has reinvigorated the field [Bansal et al., 2019]. This modern
approach has not only enhanced the capabilities of ATP systems but also expanded their applicability
in solving more intricate mathematical problems.
ATP with Neural Models. With the development of deep learning, several approaches have been
proposed to combine neural models with ATP [Loos et al., 2017]. A series of ATP approaches adopts
tree search algorithms guided by neural models [Polu and Sutskever, 2020, Han et al., 2021, Polu
et al., 2022, Jiang et al., 2022a, Yang et al., 2024]. These approaches primarily utilize reinforcement
learning techniques to enhance the accuracy of the model [Kaliszyk et al., 2018, Crouse et al., 2021,
Wu et al., 2021, Lample et al., 2022]. Since the search space is vast, the search process consumes considerable time and computing resources.
Another series of ATP approaches harnesses the power of large language models. These approaches
typically involve language models that are fine-tuned with open-source proof data and interact with
verifiers via a state-action transition program [Polu and Sutskever, 2020, Jiang et al., 2021, Han et al.,
2021, Polu et al., 2022, Lample et al., 2022, Jiang et al., 2022a, Yang et al., 2024]. This process
iteratively generates proof steps and verifies their correctness with formal verifiers. It then generates
the next proof steps based on the proof states returned by the formal verifiers. Although these
approaches achieve high performance, they are computationally intensive. To enhance efficiency,
recent research leverages language models to generate complete formal proofs directly [First et al.,
2023, Jiang et al., 2022b, Zhao et al., 2023, Xin et al., 2023], thus bypassing the iterative interaction
during proof generation.
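To make this contrast concrete, the following sketch illustrates the step-wise interaction loop that these methods build on; whole-proof generation replaces the loop with a single model call followed by one verification. The policy and verifier objects are assumed interfaces for illustration, not the API of any particular system:

def stepwise_proof_loop(theorem, policy, verifier, max_steps=128):
    # Hypothetical interfaces: `policy` is the language model and `verifier`
    # wraps a proof assistant; neither name comes from the systems cited above.
    state = verifier.init_state(theorem)        # initial proof state: the goal
    for _ in range(max_steps):
        tactic = policy.next_tactic(state)      # model proposes one proof step
        state = verifier.apply(state, tactic)   # verifier checks it, returns the new state
        if state is None:                       # step rejected by the verifier
            return None
        if state.is_solved():                   # no goals remain: proof complete
            return state.proof()
    return None                                 # step budget exhausted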
Autoformalization for Formal Mathematics. Due to the limited availability of formal corpora
for training, the performance of current large language models (LLMs) is also constrained. Thus,
some approaches propose autoformalization [Wu et al., 2022, Jiang et al., 2022b], which involves
converting natural language descriptions into formal statements that can be verified by proof assistants.
Several studies have generated synthetic datasets of formal proofs using rule-based transformations
of existing theorems [Wu et al., 2020, Wang and Deng, 2020, Xiong et al., 2023]. While effective,
these methods are constrained by their reliance on predefined rules and lack flexibility for broader
applications. Recent methodologies adopt large language models to translate natural language
problems into formal statements [Huang et al., 2024]. However, these datasets remain smaller than
needed and are limited to small mathematical benchmarks, leading to only minor improvements
in training outcomes for language models. In this paper, we aim to synthesize formal proofs via
autoformalization at a much larger scale to boost the performance of a neural prover.
3 Approach
In this section, we introduce our approach, which consists of four key processes as depicted in
Figure 1. The initial phase concentrates on generating formal mathematical statements from a broad
collection of informal math problems; these statements still require proofs. Next, the autoformalized statements
are filtered through model scoring and hypothesis rejection methods to select high-quality statements.
These statements are then proved by a model called DeepSeek-Prover, with their correctness verified
by the formal verifier Lean 4 (leanprover/lean4 v4.7.0-rc2), yielding validated formal statements and proofs. These data
serve as synthetic data for fine-tuning the DeepSeek-Prover. After enhancing DeepSeek-Prover,
we repeat the entire previously described process. This cycle continues until the improvements in
DeepSeek-Prover become marginal. Notably, to enhance proof efficiency, we prove concurrently
both the original statements and their negations. This method has the advantage of swiftly discarding
the original statement when it is invalid by proving its negation. The details of each phase will be
described in the subsequent sections.
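As a high-level sketch of this loop (in Python, with hypothetical helper names standing in for the pipeline stages; this illustrates the cycle, not the released implementation), the process can be summarized as:

def iterative_pipeline(informal_problems, prover, fine_tune, max_rounds=8):
    # `autoformalize`, `score_ok`, `proves_false`, `prove_or_refute`, and
    # `fine_tune` are assumed stand-ins for the stages depicted in Figure 1.
    for _ in range(max_rounds):
        # 1. Autoformalization: informal problems -> candidate formal statements.
        candidates = [prover.autoformalize(p) for p in informal_problems]
        # 2. Filtering: model scoring plus hypothesis rejection (Section 3.1).
        statements = [s for s in candidates
                      if prover.score_ok(s) and not prover.proves_false(s)]
        # 3. Statement proving: race each statement against its negation.
        proofs = [r for s in statements
                  if (r := prover.prove_or_refute(s)) is not None]
        # 4. Fine-tuning: train on the newly verified statement-proof pairs.
        prover = fine_tune(prover, proofs)
        # 5. Repeat; in practice the loop stops when gains become marginal.
    return prover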
3.1 Autoformalization
The generation of formal proof data fundamentally relies on the availability of a substantial corpus
of formal statements. In practice, however, amassing a large collection of manually crafted formal
statements is challenging. Fortunately, the internet is replete with math-related problems expressed in
natural language. By autoformalizing these informal mathematical problems, we can generate a vast repository of formal statements.

Figure 1: An overview of the pipeline. (1) Autoformalization: DS-Prover translates informal math problems into formal math statements. (2) Model scoring and hypothesis rejection: the candidates are filtered into high-quality formal math statements. (3) Statements proving: DS-Prover searches for proofs, which the formal verifier checks, yielding formal statements with correct proofs. (4) Fine-tuning prover: the synthesized data is used to fine-tune DS-Prover. (5) Repeat: the updated prover reruns the whole process.
We have observed that problems with explicit conditions and well-defined goals are typically easier
to formalize compared to advanced mathematical topics that necessitate intricate definitions and
constructions. Consequently, this paper primarily examines high school and undergraduate-level
competition problems, with a particular emphasis on algebra and number theory, and to a lesser
extent, combinatorics, geometry, and statistics. Despite their apparent simplicity, these problems often
involve complex solution techniques, making them excellent candidates for constructing proof data
to improve theorem-proving capabilities in Large Language Models (LLMs). To compile our dataset,
we employed web scraping and careful data cleaning techniques to extract problems from online
resources featuring high school and undergraduate exercises, exams, and competitions, resulting in a
dataset of 869,659 high-quality natural language math problems.
Specifically, we initialized the DeepSeek-Prover using the DeepSeekMath-Base 7B model [Shao
et al., 2024]. Initially, the model struggled to convert informal math problems into formal statements.
To address this, we fine-tuned the DeepSeek-Prover model using the MMA dataset [Jiang et al.,
2023], which comprises formal statements from Lean 4's mathlib (commit 64528268b3c2cf578639bc479828882a9ecd3a82) that were back-translated into
natural language problem descriptions by GPT-4. We then instructed the model to translate these
natural language problems into formal statements in Lean 4 using a structured approach.
Prompt:
Mathematical Problem in Natural Language:
{$informal_statement_with_answers}
Translate the problem to Lean 4 (only the core declaration):
```lean4

Response:
{$formal_statement}
```
The quality of the autoformalized statements was found to be suboptimal due to two main issues.
Firstly, many formal statements were overly simplistic. To address this, we developed scoring criteria
and provided examples from miniF2F-valid as few-shot examples to guide the DeepSeek-Prover
model in evaluating the content and quality of these statements using a chain-of-thought approach.
Manual review of these scores confirmed that the model’s evaluations closely matched human intuition
and expectations. Specifically, the model was instructed (see Appendix A.1 for the detailed prompt)
to classify the quality of each formal statement into categories: "excellent," "good," "above average,"
"fair," or "poor." Statements rated as "fair" or "poor" were subsequently excluded.
The second issue pertains to formal statements that, although provable, are based on inconsistent
hypotheses leading to vacuous conclusions, rendering the conclusions meaningless in mathematics.
For example, consider the following model-generated statement:
example (θ : ℝ) (h₀ : ∀ z : ℂ, z ^ 2 = -1 ∧ z ^ 3 = -1 ∧ z ^ 6 = 1)
    (h₁ : Real.tan θ = 2 * Real.sqrt 3) : θ = 5 * Real.pi / 3
Here, the hypothesis z² = −1 ∧ z³ = −1 ∧ z⁶ = 1 for all complex numbers z is clearly false, making
any derived conclusions meaningless. To eliminate such cases from our dataset, we implemented a
hypothesis rejection method. This involves using the DeepSeek-Prover model to attempt proving the
formal statement with 'False' as the conclusion. A successful proof indicates an invalid hypothesis,
prompting exclusion of the statement. An example is shown below:
example (θ : ℝ) (h₀ : ∀ z : ℂ, z ^ 2 = -1 ∧ z ^ 3 = -1 ∧ z ^ 6 = 1)
    (h₁ : Real.tan θ = 2 * Real.sqrt 3) : False := by
  simpa using h₀ 1
By applying this dual strategy of model scoring and hypothesis rejection, we curated a refined set of
712,073 high-quality formal statements, providing a robust foundation for further proof synthesis.
After creating a substantial corpus of high-quality formal statements, we employed the model to
search for proofs of these statements. Traditionally, language models have been used predominantly
in a brute-force manner to prove theorems—repeatedly attempting until a valid proof is found or
computational resources are exhausted. This approach is inefficient for our purposes. Typically,
language models are applied to human-curated formal statements that are carefully crafted and
generally true and provable; however, in our task of proving autoformalized statements, many of
the statements produced by the model may be incorrect. Indeed, it is unreasonable to expect the
model to validate a false proposition within any reliable proof system. This issue becomes more
pronounced during large-scale autoformalization, where we observed that at least 20% of the formal
statements generated by our model, even after quality filtering, were incorrect, leading to significant
computational waste if addressed with brute force.
To minimize resource wastage on unprovable statements and improve the efficiency of the proof
search process, we exploited the logical symmetry between a statement and its negation to accelerate
proof synthesis. We implemented dual concurrent proof searches for each synthetic statement—one
for the statement Γ ⊢ P and another for its negation Γ ⊢ ¬P . The search terminates as soon as a
valid proof is found for either, conclusively demonstrating the unprovability of the other. Each proof
search stream attempts up to k proofs unless a valid proof emerges sooner.
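A sketch of this dual search in Python (negate, sample_proof, and verify are assumed helpers for goal negation, model sampling, and Lean verification; a production version would also cancel the losing stream rather than let it run to completion):

import concurrent.futures

def prove_or_refute(statement, negate, sample_proof, verify, k=64):
    # Race two whole-proof searches: one for Γ ⊢ P, one for Γ ⊢ ¬P.
    def search(stmt, label):
        for _ in range(k):                  # up to k attempts per stream
            proof = sample_proof(stmt)      # one model-generated candidate proof
            if verify(stmt, proof):         # checked by the formal verifier
                return label, stmt, proof
        return None                         # budget exhausted without a proof

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(search, statement, "proved"),
                   pool.submit(search, negate(statement), "refuted")]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            if result is not None:          # the first valid proof on either
                return result               # side settles the statement
    return None                             # neither P nor ¬P proved within budget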
All validated proofs, whether they justify the original theorems or their negations, are then aggregated
to further train the DeepSeek-Prover. Thus, this dual approach serves as a form of data augmentation,
enriching the dataset with both propositions and their negations—even if the original propositions
were not correctly formalized by the model.
Since the entire pipeline heavily relies on the DeepSeek-Prover, enhancing the model’s performance
after each iteration is crucial. To achieve this, we consistently fine-tune the model with newly
generated data. The updated model is then utilized for subsequent autoformalization iterations. The
key insight from this iterative process is that the model incrementally improves in strength and efficacy after each cycle of refinement and application, so the theorem-proof pairs it generates become increasingly higher in quality with each iteration. The cycle continues until no further gains are observed, ensuring that DeepSeek-Prover consistently enhances its performance and ultimately produces superior theorem-proof pairs.
4 Experiments

Baselines. We compare against the following baselines:
• GPT-3.5 and GPT-4 [Achiam et al., 2023], developed by OpenAI, are advanced generative
AI models known for their effectiveness in diverse tasks, including code generation. Although not explicitly designed for theorem proving, their extensive scale and parameter count
confer significant capabilities. In contrast, DeepSeekMath is a specialized model, explicitly
pre-trained for mathematical content. We utilized both GPT-4 (specifically the GPT-4-turbo
0409 version) and DeepSeekMath to generate complete proofs for given theorems using a
methodology similar to ours.
• GPT-f [Polu and Sutskever, 2020], utilizing a GPT-2-inspired architecture [Radford et al.,
2019], implements an iterative best-first search method to progressively generate and validate
proof steps within a formal proof setting until a proof is either completed or resources are
depleted. This methodology has been further advanced by Proof Artifact Co-Training
[Han et al., 2021], ReProver [Yang et al., 2024], Llemma [Azerbayev et al., 2023], and
COPRA [Thakur et al., 2023], which employ either specialized fine-tuned models or
versatile general-purpose models such as GPT-3.5 and GPT-4 for the generation of proof
steps.
This study addresses complex mathematical problems in algebra and number theory. We evaluate the
theorem-proving efficacy of our model using the miniF2F [Zheng et al., 2021] and FIMO [Liu et al.,
2023] benchmarks. The metric pass@k is employed to denote the scenario where at least one valid
proof is discovered among the first k attempts generated by the model.
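Written out explicitly (a standard formulation of this metric, not notation taken from the paper), over a benchmark set of theorems $\mathcal{T}$:

$$\text{pass@}k \;=\; \frac{1}{|\mathcal{T}|}\sum_{t \in \mathcal{T}} \mathbf{1}\!\left[\exists\, i \in \{1,\dots,k\} : \text{attempt}_i(t)\ \text{is a valid proof of}\ t\right]$$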
Results on MiniF2F. The miniF2F benchmark consists of 244 validation and 244 test problems,
ranging from basic arithmetic to competition-level problems, e.g., problems from the American
Invitational Mathematics Examination (AIME), the American Mathematics Competitions (AMC),
and the International Mathematical Olympiad (IMO). We use the Lean 4 version of miniF2F released by the LeanDojo project (https://github.com/yangky11/miniF2F-lean4).
Table 1 compares various state-of-the-art methods on the miniF2F dataset. DeepSeek-Prover outperforms all of them, with cumulative scores of 60.2% on miniF2F-valid and 52.0% on miniF2F-test, significantly higher than other methods, including GPT-4, which scores 25.41% and 22.95%, respectively. Even
the best tree search method, Hypertree Proof Search with a 600M model, achieves only up to 58.6%
on miniF2F-valid and 41.0% on miniF2F-test. DeepSeek-Prover's scalability is evident: its performance improves with increased computational budget, rising from 30.03% with greedy decoding to 50.0% with 65,536 generation attempts, demonstrating its effectiveness in handling complex proof scenarios. Examples of proved theorems of miniF2F can be found in Appendix A.3.1.
Results on FIMO. The FIMO benchmark comprises 149 formal problems sourced from the IMO shortlist and translated into Lean 4. Our method successfully proved 4 theorems with 100
attempts per theorem, whereas GPT-4 failed to prove any. By increasing the number of attempts per
theorem to 4,096, we successfully proved an additional theorem. Examples of proved theorems of
FIMO can be found in Appendix A.3.2.
Table 1: Comparison with state-of-the-art methods on the miniF2F dataset.
Table 2: Improvement in pass rates for miniF2F at pass@128 in models trained on formal proofs,
including those derived from human-authored theorems in Lean 4’s mathlib and automatically
formalized theorems.
To demonstrate the effectiveness of the model in filtering out low-quality statements, we fine-tuned
the DeepSeekMath-Base model using an equal amount of high-score proof data and low-score proof
data to verify the quality of the data, as shown in Table 3. The table shows that the model trained
on high-score proof data outperformed the model trained on low-score proof data by 4.5%. This
enhancement underscores the utility of the model in accurately scoring and effectively filtering out
lower-quality statements.
Table 3: Improvement in pass rates for miniF2F at pass@128 in models trained on differently scored
proof data.
Table 4 demonstrates a distinct correlation between the number of iterations in data synthesis and
enhanced performance in theorem proving. This evidence underscores the success of our iterative
enhancement strategy in augmenting theorem-proving capabilities. Successive iterations not only
refine the model’s ability to handle complex proofs but also significantly increase the quality and
quantity of the synthetic data produced.
Table 4: Improvement in pass rates for miniF2F at pass@128 in models across successive training
iterations, facilitated by the incremental integration of synthesized data via autoformalization.
Our investigation into synthetic theorem proving data reveals a clear correlation between dataset size
and model efficacy, as illustrated in Table 5. By examining subsets of the eight million generated
proof data points, we observed that performance on the miniF2F benchmark improves steadily as the dataset size grows exponentially. This pattern highlights the pivotal importance of large-scale datasets, built by automatically formalizing natural language questions, for boosting model proficiency.
These findings emphasize the significant potential and necessity of systematic data construction for
progressing in the field of automated theorem proving.
Table 5: Improvement in pass rates for miniF2F at pass@128 in models trained with a larger fraction
of synthesized data via autoformalization.
5 Case Studies
This section presents two case studies to demonstrate the application of our methods in autoformalizing theorems. It showcases both successful proofs and the identification of inconsistencies during the
Hypothesis Rejection stage.
5.1 Autoformalized Theorem with Complete Proof
Example a. Problem: Prove that the determinant of the following matrix is zero.
$$\begin{bmatrix} 1 & \cos(a-b) & \cos(a) \\ \cos(a-b) & 1 & \cos(b) \\ \cos(a) & \cos(b) & 1 \end{bmatrix}$$
This approach effectively translates the algebraic expression of the matrix and its determinant into a
formal language using Lean. The autoformalization captures the essence of the original mathematical
statement by defining a specific 3 × 3 matrix dependent on real numbers a and b, and asserts that its
determinant is zero. The formalization employs the Matrix.det function to compute the determinant,
utilizing the ![...] notation for lists of lists in Lean to represent the matrix rows.
5.2 Autoformalized Theorem with Inconsistent Hypotheses

The initial autoformalization incorrectly assumes that the condition D² = 154 universally applies to
all non-zero real numbers a, b, and c. This assumption is not supported by the problem statement,
which does not claim universal applicability. Instead, the formalization should aim to either identify
specific values of a, b, and c that satisfy D² = 154 or demonstrate that no such values exist.
The model successfully identifies this inconsistency and provides a counterexample to demonstrate
the absurdity of the hypothesis:
example (D : ℝ) (h₀ : ∀ a b c : ℝ, a ≠ 0 ∧ b ≠ 0 ∧ c ≠ 0 →
    Matrix.det ![![a, b, c], ![1, 4, 9], ![3, 1, 2]] = D) : False := by
  have h₁ := h₀ 1 2 3
  have h₂ := h₀ 1 4 9
  simp [Matrix.det_fin_three] at h₁ h₂
  linarith
These examples illustrate the model’s capability to verify proofs and identify hypothesis inconsisten-
cies effectively. Further details can be found in Appendix A.2.
6 Conclusion
In this paper, we presented a method to generate extensive synthetic proof data from high-school and
undergraduate-level mathematical competition problems. By translating natural language problems
into formal statements, filtering out low-quality ones, and using iterative proof generation, we created
8 million proof data points and significantly improved the DeepSeekMath 7B model’s performance
in ATP when trained on this synthetic data. Our model outperforms GPT-4 and other methods on
benchmarks like miniF2F and FIMO. By open-sourcing our dataset and model, we aim to advance
research in automated theorem proving and enhance the capabilities of large language models in
formal mathematical reasoning. Currently, our work mainly focuses on algebra and number theory at the high school and undergraduate levels. In future work, we aim to expand the diversity of
mathematical problems addressed, enhancing the general applicability of our methods in ATP.
Broader Impact
The research presented in this paper has the potential to significantly advance automated theorem
proving by leveraging large-scale synthetic proof data generated from informal mathematical prob-
lems. This remarkable advancement can enhance the capabilities of large language models in formal
theorem proving, contributing to more reliable mathematical proof verification and providing valuable
educational resources for students and researchers. By directly releasing the code, model, and data,
we aim to ensure the responsible use of our work, fostering further innovation and maintaining high
standards of data privacy and intellectual property compliance.
References
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
J. Avigad. Mathematics and the formal turn, 2023.
Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023.
K. Bansal, S. Loos, M. Rabe, C. Szegedy, and S. Wilcox. HOList: An environment for machine learning of higher-order logic theorem proving. In International Conference on Machine Learning, pages 454–463. PMLR, 2019.
W. Bibel. Automated theorem proving. Springer Science & Business Media, 2013.
M. Crouse, I. Abdelaziz, B. Makni, S. Whitehead, C. Cornelio, P. Kapanipathi, K. Srinivas, V. Thost,
M. Witbrock, and A. Fokoue. A deep reinforcement learning approach to first-order logic theorem
proving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages
6279–6287, 2021.
L. De Moura, S. Kong, J. Avigad, F. Van Doorn, and J. von Raumer. The Lean theorem prover (system description). In Automated Deduction – CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1–7, 2015, Proceedings 25, pages 378–388. Springer, 2015.
E. First, M. N. Rabe, T. Ringer, and Y. Brun. Baldur: Whole-proof generation and repair with large
language models, 2023.
J. M. Han, J. Rute, Y. Wu, E. W. Ayers, and S. Polu. Proof artifact co-training for theorem proving
with language models. arXiv preprint arXiv:2102.06203, 2021.
Y. Huang, X. Lin, Z. Liu, Q. Cao, H. Xin, H. Wang, Z. Li, L. Song, and X. Liang. MUSTARD: Mastering uniform synthesis of theorem and proof data. arXiv preprint arXiv:2402.08957, 2024.
A. Q. Jiang, W. Li, J. M. Han, and Y. Wu. LISA: Language models of Isabelle proofs. In 6th Conference on Artificial Intelligence and Theorem Proving, pages 378–392, 2021.
A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski, T. Odrzygóźdź, P. Miłoś, Y. Wu, and M. Jamnik.
Thor: Wielding hammers to integrate language models and automated theorem provers. Advances
in Neural Information Processing Systems, 35:8360–8373, 2022a.
A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample.
Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint
arXiv:2210.12283, 2022b.
A. Q. Jiang, W. Li, and M. Jamnik. Multilingual mathematical autoformalization. arXiv preprint
arXiv:2311.03755, 2023.
C. Kaliszyk, J. Urban, H. Michalewski, and M. Olšák. Reinforcement learning of theorem proving.
Advances in Neural Information Processing Systems, 31, 2018.
L. Kovács and A. Voronkov. First-order theorem proving and vampire. In International Conference
on Computer Aided Verification, pages 1–35. Springer, 2013.
G. Lample, T. Lacroix, M.-A. Lachaux, A. Rodriguez, A. Hayat, T. Lavril, G. Ebner, and X. Martinet.
Hypertree proof search for neural theorem proving. Advances in neural information processing
systems, 35:26337–26349, 2022.
C. Liu, J. Shen, H. Xin, Z. Liu, Y. Yuan, H. Wang, W. Ju, C. Zheng, Y. Yin, L. Li, et al. FIMO: A challenge formal dataset for automated theorem proving. arXiv preprint arXiv:2309.04295, 2023.
S. Loos, G. Irving, C. Szegedy, and C. Kaliszyk. Deep network guided proof search. arXiv preprint
arXiv:1701.06972, 2017.
L. de Moura and S. Ullrich. The Lean 4 theorem prover and programming language. In Automated Deduction – CADE-28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings 28, pages 625–635. Springer, 2021.
L. C. Paulson. Isabelle: A Generic Theorem Prover. Springer-Verlag, 1994.
S. Polu and I. Sutskever. Generative language modeling for automated theorem proving. arXiv
preprint arXiv:2009.03393, 2020.
S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever. Formal mathematics
statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
S. Schulz. E – a brainiac theorem prover. AI Communications, 15(2–3):111–126, 2002.
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
M. Shulman. Strange new universes: Proof assistants and synthetic foundations, 2024.
A. Thakur, Y. Wen, and S. Chaudhuri. A language-agent approach to formal theorem-proving. arXiv
preprint arXiv:2310.04353, 2023.
The Coq Development Team. Coq. URL https://coq.inria.fr.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. Advances in neural information processing systems, 30, 2017.
M. Wang and J. Deng. Learning to prove theorems by learning to generate theorems. Advances in
Neural Information Processing Systems, 33:18146–18157, 2020.
M. Wu, M. Norrish, C. Walder, and A. Dezfouli. TacticZero: Learning to prove theorems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021.
Y. Wu, A. Q. Jiang, J. Ba, and R. Grosse. INT: An inequality benchmark for evaluating generalization in theorem proving. arXiv preprint arXiv:2007.02924, 2020.
Y. Wu, A. Q. Jiang, W. Li, M. Rabe, C. Staats, M. Jamnik, and C. Szegedy. Autoformalization with
large language models. Advances in Neural Information Processing Systems, 35:32353–32368,
2022.
H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y. Huang, J. Xiong, H. Shi, E. Xie, et al. LEGO-Prover: Neural theorem proving with growing libraries. arXiv preprint arXiv:2310.00656, 2023.
J. Xiong, J. Shen, Y. Yuan, H. Wang, Y. Yin, Z. Liu, L. Li, Z. Guo, Q. Cao, Y. Huang, et al. TRIGO: Benchmarking formal mathematical proof reduction for generative language models. arXiv preprint arXiv:2310.10180, 2023.
K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar. LeanDojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36, 2024.
X. Zhao, W. Li, and L. Kong. Decomposing the enigma: Subgoal-based demonstration learning for
formal theorem proving. arXiv preprint arXiv:2305.16366, 2023.
K. Zheng, J. M. Han, and S. Polu. MiniF2F: A cross-system benchmark for formal Olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021.
A Appendix / supplemental material
A.1 Prompts
Specifically, we use the following format for scoring the quality of the formalized statements:
Prompt:
To evaluate whether a formal Lean4 statement will be of interest to the
community, consider the following criteria:
4. Community Needs and Gaps: Does the statement fill an identified need
or gap within the Lean4 community or the broader mathematical community?
Addressing these needs directly correlates with interest.
A.2 Examples of Autoformalization
Example a. Problem in Natural Language: For a real number a and a function f defined on real numbers, where f(x) = x³ − ax − 1, if for all x, f(x) ≤ 0 implies x is either less than −1 or greater than 1, then a must equal 3.
Autoformalized Theorems with Complete Proofs:
example (a : ℝ) (f : ℝ → ℝ) (h₀ : ∀ x, f x = x ^ 3 - a * x - 1) :
    (∀ x, f x ≤ 0 → x ∈ Set.Iio (-1) ∪ Set.Ioi 1) → a = 3 := by
  intro h₁
  have h₂ := h₁ 0
  simp [h₀] at h₂
  have h₃ := h₁ 1
  simp [h₀] at h₃
  have h₄ := h₁ (-1)
  simp [h₀] at h₄
  linarith
Analysis: This is a simple example to illustrate common cases in the autoformalization procedure. The
model is required to recognize natural language descriptions in the problem statement and formalize
them into proper mathematical operators. In this case, the statement "either less than −1 or greater
than 1" describes a union of interval (−∞, −1) ∪ (1, +∞), which has been accurately formalized to
Set.Iio (-1) ∪ Set.Ioi 1, where Set.Iio (-1) denotes an infinite-open interval ends at −1,
and Set.Ioi 1 denotes an open-infinite interval begins at 1.
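As a quick sanity check of this formalization (a minimal sketch, assuming the relevant Mathlib imports), a point such as −2 indeed satisfies the left disjunct:

example : (-2 : ℝ) ∈ Set.Iio (-1) ∪ Set.Ioi 1 := by
  -- Membership in the union unfolds to -2 < -1 ∨ 1 < -2, and norm_num
  -- discharges the true left disjunct.
  norm_num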
Example b. Problem in Natural Language: Define two functions, F and f, where F(k) is the binomial coefficient of (k + 2) choose 2, and f(k) is the binomial coefficient of (k + 1) choose 1. For any natural number n greater than or equal to 2, there exists a natural number k such that f(k) equals (n − 1) times n divided by 2. Furthermore, there exists a natural number m such that F(m) equals (n − 1) times n divided by 2, implying m equals k.
Autoformalized Theorems with Complete Proofs:
example :
    let F := fun k => Nat.choose (k + 2) 2;
    let f := fun k => Nat.choose (k + 1) 1;
    ∀ n : ℕ, 2 ≤ n → ∃ k : ℕ, f k = (n - 1) * n / 2 →
      (∃ m : ℕ, F m = (n - 1) * n / 2 → m = k) := by
  simp [Nat.choose, Nat.mul_sub_left_distrib, Nat.sub_sub, Nat.mul_one]
  aesop
Analysis: The formalization for this combinatorial problem involving binomial coefficients is semantically faithful and effective. The use of Nat.choose accurately represents the binomial coefficients,
and the logical structure with existential quantifiers (∃) and implications (→) aligns well with the
problem’s requirements.
Example c. Problem in Natural Language: Given three 3-dimensional vectors a, b, and c, where the
dot product of a and b equals −3, the dot product of a and c equals 4, and the dot product of b and c
equals 6. Prove that the dot product of b with 7c − 2a equals 48.
Autoformalized Theorems with Complete Proofs:
example (a b c : ℝ × ℝ × ℝ)
    (h₀ : a.1 * b.1 + a.2.1 * b.2.1 + a.2.2 * b.2.2 = -3)
    (h₁ : a.1 * c.1 + a.2.1 * c.2.1 + a.2.2 * c.2.2 = 4)
    (h₂ : b.1 * c.1 + b.2.1 * c.2.1 + b.2.2 * c.2.2 = 6) :
    b.1 * (7 * c.1 - 2 * a.1) + b.2.1 * (7 * c.2.1 - 2 * a.2.1)
      + b.2.2 * (7 * c.2.2 - 2 * a.2.2) = 48 := by
  linarith [h₀, h₁, h₂]
A.3.1 Examples of Proved Theorems of miniF2F
nlinarith [mul_self_nonneg (1 + x - 1)]
Example b. Problem in Natural Language: Ms. Blackwell gives an exam to two classes. The mean
of the scores of the students in the morning class is 84, and the afternoon class’s mean score is 70.
The ratio of the number of students in the morning class to the number of students in the afternoon
class is 3/4. What is the mean of the scores of all the students? Show that it is 76.
Formal Proof:
theorem amc12b_2021_p4 (m a : ℕ) (h₀ : 0 < m ∧ 0 < a)
    (h₁ : ↑m / ↑a = (3 : ℝ) / 4) :
    (84 * ↑m + 70 * ↑a) / (↑m + ↑a) = (76 : ℝ) := by
  have h₂ := h₀.1.ne'
  have h₃ := h₀.2.ne'
  field_simp at h₂ h₃ ⊢
  ring_nf
  norm_num
  rw [div_eq_inv_mul] at h₁
  field_simp at h₁
  linarith
Example c. Problem in Natural Language: For how many positive integers m does there exist at
least one positive integer n such that m · n ≤ m + n? Show that it is infinitely many.
Formal Proof:
theorem amc12a_2002_p6 (n : ℕ) (h₀ : 0 < n) :
    ∃ m, m > n ∧ ∃ p, m * p ≤ m + p := by
  simp_all only [ge_iff_le, gt_iff_lt, mul_one, mul_add, mul_comm, mul_assoc,
    mul_left_comm]
  use n + 1
  constructor
  exact Nat.lt_succ_self n
  use 1
  ring_nf
  nlinarith
A.3.2 Examples of Proved Theorems of FIMO
obtain ⟨_, _, _, h₁, h₂, h₃⟩ := h₀
simp at *
linarith
We verify the generated Lean 4 code with the following code as the prefix:
import Mathlib.Algebra.Algebra.Basic
import Mathlib.Algebra.Order.Floor
import Mathlib.Algebra.Associated
import Mathlib.Algebra.BigOperators.Basic
import Mathlib.Algebra.BigOperators.Order
import Mathlib.Algebra.BigOperators.Pi
import Mathlib.Algebra.GeomSum
import Mathlib.Algebra.Group.Pi.Basic
import Mathlib.Algebra.Group.Commute.Basic
import Mathlib.Algebra.GroupPower.Basic
import Mathlib.Algebra.GroupPower.Identities
import Mathlib.Algebra.Order.Floor
import Mathlib.Algebra.QuadraticDiscriminant
import Mathlib.Algebra.Ring.Basic
import Mathlib.Analysis.Asymptotics.AsymptoticEquivalent
import Mathlib.Analysis.NormedSpace.Basic
import Mathlib.Analysis.SpecialFunctions.Log.Basic
import Mathlib.Analysis.SpecialFunctions.Log.Base
import Mathlib.Combinatorics.SimpleGraph.Basic
import Mathlib.Data.Complex.Basic
import Mathlib.Data.Complex.Exponential
import Mathlib.Data.Finset.Basic
import Mathlib.Data.Fintype.Card
import Mathlib.Data.Int.Basic
import Mathlib.Data.Int.GCD
import Mathlib.Data.Int.ModEq
import Mathlib.Data.Int.Parity
import Mathlib.Data.List.Intervals
import Mathlib.Data.List.Palindrome
import Mathlib.Data.Multiset.Basic
import Mathlib.Data.Nat.Basic
import Mathlib.Data.Nat.Choose.Basic
import Mathlib.Data.Nat.Digits
import Mathlib.Data.Nat.Factorial.Basic
import Mathlib.Data.Nat.ModEq
import Mathlib.Data.Nat.Multiplicity
import Mathlib.Data.Nat.Parity
import Mathlib.Data.Nat.Prime
import Mathlib.Data.PNat.Basic
import Mathlib.Data.PNat.Prime
import Mathlib.Data.Polynomial.Basic
import Mathlib.Data.Polynomial.Eval
import Mathlib.Data.Real.Basic
import Mathlib.Data.Real.Irrational
import Mathlib.Data.Real.NNReal
import Mathlib.Data.Real.Sqrt
import Mathlib.Data.Set.Finite
import Mathlib.Data.Sym.Sym2
import Mathlib.Data.ZMod.Basic
import Mathlib.Dynamics.FixedPoints.Basic
import Mathlib.LinearAlgebra.AffineSpace.AffineMap
import Mathlib.LinearAlgebra.AffineSpace.Independent
import Mathlib.LinearAlgebra.AffineSpace.Ordered
import Mathlib.LinearAlgebra.FiniteDimensional
import Mathlib.Logic.Equiv.Basic
import Mathlib.Order.Filter.Basic
import Mathlib.Order.LocallyFinite
import Mathlib.Order.WellFounded
import Mathlib.Topology.Basic
import Mathlib.Topology.Instances.NNReal
import Aesop
set_option maxHeartbeats 0
set_option trace.aesop true
set_option trace.aesop.proof true
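For concreteness, a minimal verification harness around this prefix might look as follows (a sketch: it assumes the import block above is stored in a file named prefix.lean and that checking happens inside a Lean 4 project with Mathlib as a dependency, via lake env lean; this is an illustration, not the paper's released tooling):

import pathlib
import subprocess
import tempfile

PREFIX = pathlib.Path("prefix.lean").read_text()  # the shared import block above

def verify_lean(candidate: str, timeout_s: int = 300) -> bool:
    # Concatenate the shared prefix with one generated theorem and proof,
    # then run the Lean 4 compiler on it inside the project environment.
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(PREFIX + "\n\n" + candidate)
        path = f.name
    try:
        result = subprocess.run(["lake", "env", "lean", path],
                                capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False                 # treat timeouts as failed attempts
    return result.returncode == 0    # accepted only on a clean exit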