Abstract
The CADE ATP System Competition (CASC) is the annual evaluation of fully automatic, classical logic, automated theorem proving (ATP) systems—the world championship for such systems. CASC-J12 was the 29th competition in the CASC series. Nineteen ATP systems competed in the various divisions. This paper presents an outline of the competition design and a commentated summary of the results.
Introduction
Automated theorem proving (ATP) deals with the task of proving theorems from axioms—the derivation of conclusions that follow inevitably from known facts (Robinson and Voronkov, 2001). The converse task of disproving conjectures is another facet of interest (Claessen and Sörensson, 2003; Blanchette and Nipkow, 2010). The axioms and conjectures are written in an appropriately expressive logic and the solutions (proofs and models) are often similarly written in logic (Sutcliffe, 2023). The CADE ATP system competition (CASC; Sutcliffe, 2016) is the annual evaluation of fully automatic, classical logic, ATP systems—the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, motivate the development and implementation of robust ATP systems that can be easily and usefully deployed in applications, provide an inspiring environment for personal interaction between ATP researchers, and expose ATP systems within and beyond the ATP community. CASC evaluates the performance of the ATP systems in terms of the number of problems solved, the number of acceptable solutions output, and the average time taken for problems solved, in the context of a bounded number of eligible problems and specified time limits.
CASC is held at each International Conference on Automated Deduction (CADE) and International Joint Conference on Automated Reasoning (IJCAR)—the major forums for the presentation of new research in all aspects of automated deduction. CASC-J12 was held on 4 July 2024, as part of the 12th IJCAR (IJCAR 2024) in Nancy, France. It was the 29th competition in the CASC series; see Sutcliffe and Desharnais (2023) and citations therein, and the CASC website (https://tptp.org/CASC), for information about previous CASCs.
CASC-J12 was organized by Geoff Sutcliffe, and overseen by a panel consisting of Cláudia Nalon, Christoph Wernhard, and Christoph Weidenbach. The CASC panel has a few, but important, responsibilities:
- Advise (by email) on contentious issues that arise before the competition, and that cannot be agreeably resolved by the organizers and entrants.
- With the help of the organizers, adjudicate the acceptability of the sample proofs and models that are submitted before the competition.
- If possible, be physically present at the start of the competition, so entrants can see who they are and raise any issues they are concerned about.
- Supply a seed digit for the random problem selection process.
- Be contactable during the competition to advise on contentious issues that arise and that cannot be agreeably resolved by the organizers and entrants.
- With the help of the organizers, adjudicate the acceptability of the winners’ proofs and models.
- If possible, be physically present at the end of the competition for resolving possible open issues, confirming the final system ranking in each division, and assisting in winner announcements.
The competition was run on computers provided by the StarExec project (Stump et al., 2014) at the University of Miami. The CASC-J12 website provides access to all the resources used before, during, and after the event: https://tptp.org/CASC/J12.
The design and organization of CASC have evolved over the years to a sophisticated state. An outline of the CASC-J12 design and organization is provided here; the details are in Sutcliffe (2024) and on the CASC-J12 web site. Important changes for CASC-J12 were (for readers already familiar with the general design of CASC):
- The FNT division went on hiatus.
- The ICU division was added.
The CASC rules, specifications, and deadlines are absolute. Only the panel has the right to make exceptions. It is assumed that all entrants have read the documentation related to the competition, and have complied with the competition rules. Noncompliance with the rules can lead to disqualification. A catch-all rule is used to deal with any unforeseen circumstances: No cheating is allowed. The panel is allowed to disqualify entrants due to unfairness and to adjust the competition rules in case of misuse.
The rest of this paper is organized as follows: Section 2 describes the competition divisions and the ATP systems that entered the various divisions. Sections 3 and 4 describe the competition infrastructure and the requirements for the ATP systems. Section 5 describes how the systems are evaluated. Section 6 provides a commentated summary of the results. Section 7 contains short descriptions of three of the ATP systems. Section 8 discusses ideas and plans for introducing proof and model verification into CASC. Section 9 concludes and discusses plans for future CASCs.
A Tense Note: Attentive readers will notice changes between the present and past tenses in this paper. Many parts of CASC are established and stable—they are described in the present tense (the rules are the rules). Aspects that were particular to CASC-J12 are described in the past tense so that they make sense when reading this after the event.
Divisions and Systems
CASC is divided into divisions according to problem and system characteristics, in a coarse version of the Thousands of Problems for Theorem Provers (TPTP) problem library’s Specialist Problem Classes (SPCs; Sutcliffe and Suttner, 2001). Each division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in a division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. Table 1 catalogs the divisions and problem categories of CASC-J12. The example problems can be viewed online. Sections 3.2 and 3.3 explain what problems are eligible for use in each division and category.
Table 1. Divisions and Problem Categories.
Table 2. The ATP Systems and Entrants.
Note. ATP = automated theorem proving.
Systems that cannot be entered into the competition divisions (e.g., the system requires special hardware or the entrant is an organizer) can be entered into the demonstration division. The demonstration division uses the same problems as the competition divisions, and the entry specifies which competition divisions’ problems are to be used.
Nineteen ATP systems competed in the various divisions of CASC-J12. The division winners from the previous CASC (CASC-29) and the Prover9 1109a system were automatically entered into the demonstration division, to provide benchmarks against which progress can be judged. The systems, the divisions in which they were entered, and their entrants, are listed in Table 2. A division acronym in italics indicates the system was in the demonstration division. System descriptions are in the competition proceedings (Sutcliffe, 2023) and on the CASC-J12 website.
Computers
The StarExec computers used for the competition have two octa-core Intel(R) Xeon(R) E5-2667 v4 CPUs running at 2.10 GHz, 256 GB memory, and the CentOS Linux release 7.4.1708 (Core) operating system with Linux kernel 3.10.0-693.el7.x86_64. StarExec uses Linux’s
Systems can use all the cores on the CPU, which can be advantageous in divisions where a wall clock time limit is used (see Section 3.5). StarExec copies the systems and problems to the compute nodes before starting execution, so that there are no network delays. The StarExec computers used for CASC are the same as are publicly available to the TPTP community, which allows system developers to test and tune their systems in exactly the same environment as is used for the competition.
Demonstration division systems can run on the competition computers, or the computers can be supplied by the entrant. The CASC-J12 demonstration division systems all used the competition computers.
Problems for the TPTP-based Divisions
The problems for the THF, TFA, TFN, FOF, and UEQ divisions were taken from the TPTP problem library (v9.0.0; Sutcliffe, 2017). The TPTP version used for CASC is released after the competition has started, so that new problems in the release have not been seen by the entrants. The problems have to meet certain criteria to be eligible for use:
- The TPTP tags problems that are designed specifically to be suited or ill-suited to some ATP system, calculus, or control strategy as biased. They are excluded from the competition.
- The problems must be syntactically nonpropositional.
- The TPTP uses system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library to compute problem difficulty ratings in the range 0.00 (easy) to 1.00 (unsolved; Sutcliffe and Suttner, 2001). Problems with ratings in the range 0.21 to 0.99 are eligible—the upper bound of 0.99 excludes problems that cannot be solved by any system and thus do not differentiate between systems; the lower bound of 0.21 was chosen (many years ago, and it has worked successfully) to exclude problems that would be solved by most of the systems and thus also do not differentiate between systems.
- Problems of lesser and greater ratings are made eligible if there are not enough problems with ratings in that range (a sketch of this rule is given after this list). In the CASC-J12 TFN division, 59 problems with a rating of 0.00–0.20 and 48 problems with a rating of 1.00 were made eligible, because there were only 55 eligible problems with a rating of 0.21 to 0.99. The organizer considered making these additional 107 problems eligible to be acceptably useful: solving easy problems would be encouraging for weaker systems, and solving hard problems would be encouraging for stronger systems. See Section 6.3 for the results on these additional problems.
- Systems can be submitted before the competition so that their performance data is used in computing the problem ratings—problems that are newly solved get a rating <1.00 and thus become eligible (until the rating drops below 0.21). The rating calculation also uses performance data from ATP systems that are not entered into the competition, which can produce ratings that make some problems eligible for selection but easy or unsolvable for the systems in the competition. Using problems that are solved by all or none of the competition systems does not affect the competition rankings, and has the benefit of placing the systems’ performances in the context of the state of the art in ATP, but it does reduce the differentiation between the systems in the competition.
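The eligibility rule, including the padding used for the TFN division, can be summarized in a small sketch. This is an illustration only: the function name, data layout, and the per-division minimum are assumptions, not CASC's actual selection tooling.

```python
# Minimal sketch of the rating-based eligibility rule described above.
# `ratings` maps problem names to TPTP difficulty ratings in [0.0, 1.0];
# the names and structure here are illustrative.

def eligible_problems(ratings, minimum_needed, lo=0.21, hi=0.99):
    """Return the eligible problems, padding with easier/harder ones if too few."""
    in_range = [p for p, r in ratings.items() if lo <= r <= hi]
    if len(in_range) >= minimum_needed:
        return in_range
    # Too few problems in the preferred range (as happened in the TFN division):
    # also admit problems below the lower bound and unsolved (1.00) problems.
    padding = [p for p, r in ratings.items() if r < lo or r == 1.00]
    return in_range + padding
```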
In order to ensure that no system receives an advantage or disadvantage due to the specific presentation of the problems in the TPTP, the problems are obfuscated by stripping out all comment lines (in particular, the problem header), randomly reordering the formulae (
The numbers of problems used in each division and problem category are constrained by the number of eligible problems, the number of systems entered in the divisions, the number of CPUs available, the time limits, and the time available for running the competition live in one conference day, that is, in about 6 h. The numbers of problems used are set within these constraints according to the judgment of the organizer. The problems used are randomly selected from the eligible problems based on a seed supplied by the competition panel:
- The selection is constrained so that no division or category contains an excessive number of very similar problems, according to the “very similar problems” (VSP) lists distributed with the TPTP problem library (Sutcliffe, 2000).
- In order to combat excessive tuning toward problems that were already in the preceding released TPTP version, the selection is biased to select problems that are new in the TPTP version used, until 50% of the problems in each problem category have been selected or there are no more new problems to select, after which random selection from old and new problems continues (a sketch of this process follows this list). The number of new problems used depends on how many new problems are eligible and on the limitation on very similar problems.
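The following sketch illustrates the seeded selection with the new-problem bias and the very-similar-problems limit described above. The data structures, the per-group limit of two, and the function name are assumptions made for illustration; this is not the official selection script.

```python
import random

# `problems` is a list of (name, is_new) pairs for one problem category,
# `vsp_groups` maps a problem name to its "very similar problems" group, and
# `per_group_limit` caps how many problems of one VSP group may be used.

def select_problems(problems, needed, seed, vsp_groups, per_group_limit=2):
    rng = random.Random(seed)               # seed supplied by the panel
    new = [p for p in problems if p[1]]
    old = [p for p in problems if not p[1]]
    rng.shuffle(new)
    rng.shuffle(old)

    selected, group_counts = [], {}

    def try_add(problem):
        name = problem[0]
        group = vsp_groups.get(name, name)  # problems with no VSP group stand alone
        if group_counts.get(group, 0) >= per_group_limit:
            return False                    # too many very similar problems already
        group_counts[group] = group_counts.get(group, 0) + 1
        selected.append(name)
        return True

    # Prefer new problems until they make up 50% of the selection (or run out) ...
    while new and len(selected) < needed // 2:
        try_add(new.pop())
    # ... then continue with random selection from the remaining old and new problems.
    rest = new + old
    rng.shuffle(rest)
    while rest and len(selected) < needed:
        try_add(rest.pop())
    return selected
```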
Table 3 gives the numbers of eligible problems, the maximal numbers that could be used after taking into account the limitation on very similar problems, and the number of problems used in each division and category. With the exception of the SLH division (which is a special case), nowhere near 50% new problems were selected. This is due to a slump in the contributions of new problems to the TPTP—nowhere near 50% new problems were available (see Section 9 for a plea to the community to submit problems).
Table 3. Numbers of Eligible and Used Problems.
The problems are given to the ATP systems in TPTP format, with include directives, in increasing order of TPTP difficulty rating.
For the SLH division of CASC-29 (the previous CASC), Isabelle’s Sledgehammer system was used to generate 8400 problems, of which 1000 appropriately difficult problems were selected based on performance data (Sutcliffe, 2023; Sutcliffe and Desharnais, 2024). For the SLH division of CASC-J12, the same problem set was used, but 1000 different problems were selected. The same CPU limit was imposed. This was announced in advance of CASC-J12, so that developers could tune their systems using the CASC-29 problems. Of the 1000 problems selected for CASC-J12, 401 had not been solved in the testing done before CASC-29. If many of them were solved in the competition, that would indicate progress. See Section 6.6 for results on these problems. The problems were given in a roughly estimated increasing order of difficulty.
Problems for the ICU Division
For the ICU division, each entrant had to submit 10 FOF theorems (axioms with a provable conjecture). The problems had to be provided in decreasing order of desired use in the division, that is, probably from hardest to easiest for other systems. The problems had to all be different, as assessed by the competition organizer. Problems from the TPTP problem library had their include directives expanded. The problems were given in reverse order of desired use, so that the “easier” problems were used before the “harder” ones.
It was expected that each entrant would submit problems that are easy enough for that entrant’s system, but difficult for the other entrants’ systems. Most of the entrants chose existing TPTP problems with high difficulty ratings—only the entrants of Drodi and E submitted non-TPTP problems: seven for Drodi and five for E. Fifteen of the 68 TPTP problems had a rating of 1.00, and another 36 had ratings above 0.80. Interestingly the entrant of the demonstration division system Connect++ chose problems all with a rating of 1.00, suggesting that he was setting a challenge for the other systems without having much hope of Connect++ solving the problems.
The design decision that “each entrant had to submit 10 FOF theorems” meant that any entrant or group that entered multiple systems would be able to submit problems in their collective interest. In CASC-J12 the entrants of CSE_E, CSG_E, and CSI_E were in this situation. Analysis of those systems’ results in Section 6.7 indicates that there was no collusion. To prevent possible collusion in the future there will be a limit on the number of problems submitted by individual entrants and their colleagues.
Time Limits
In the TPTP-based divisions, a time limit is imposed for each problem. The minimal time limit for each problem is 120 s. The maximal time limit for each problem is constrained by the same factors that constrain the number of problems that are used, taking into account the phenomenon that ATP systems solve most problems quickly and very few slowly (Sutcliffe and Suttner, 2001; Sutcliffe, 2024). This phenomenon is also evident in the performance plots from the competition, and from the ratios of total times taken to solved times given in Section 3.1. The time limit is chosen within the range allowed according to the judgment of the organizer and is announced at the competition. In CASC-J12, a 180 s wall clock time limit was imposed for each problem, and no CPU time limit was imposed (so that it could be advantageous to use all the cores on the CPU).
In the SLH division, a CPU time limit is imposed for each problem. The limit is between 15 and 90 s, which is the range of CPU time that can be usefully allocated for a proof attempt in the Sledgehammer context. The time limit is chosen within the range allowed according to the judgment of the organizer and is announced at the competition. In CASC-J12, a 30 s CPU time limit was imposed for each problem.
In the ICU division, a wall clock time limit is imposed for each problem. The limit is between 300 and 600 s, which is a range that gives the systems sufficient time (4800 s CPU time on the octa-core CPUs) to attempt the difficult problems submitted. The time limit is chosen within the range allowed according to the judgment of the organizer and is announced at the competition. In CASC-J12, a 600 s wall clock time limit was imposed for each problem.
System Entry, Delivery, and Execution
Systems can be entered at only the division level and can be entered into more than one division. A system that is not entered into a division is assumed to perform worse than the entered systems, for that type of problem—wimping out is not an option. Entering many similar versions of the same system is deprecated, and entrants can be required to limit the number of system versions that they enter. Systems that rely essentially on running other ATP systems without adding value are deprecated; such systems might be disallowed or moved to the demonstration division.
The ATP systems entered into CASC are delivered to the competition organizer as StarExec installation packages, which the organizer installs and tests on StarExec. Source code is delivered separately, under the trusting assumption that the installation package corresponds to the source code. After the competition all competition division systems’ StarExec and source code packages are made available on the CASC web site. This allows anyone to use the systems on StarExec, and to examine the source code. An open source license is encouraged, to allow the systems to be freely used, modified, and shared. Many of the StarExec packages include statically linked binaries that provide further portability and longevity of the systems.
The ATP systems must be fully automatic. They are executed as black boxes, on one problem at a time. Any command line parameters have to be the same for all problems in each division. The ATP systems must be sound and are tested for soundness by submitting nontheorems to the systems in the THF, TFA, FOF, UEQ, SLH, and ICU divisions, and theorems to the systems in the TFN division. Claiming to have found a proof of a nontheorem or a disproof of a theorem indicates unsoundness. Happily, no systems were found to be unsound before CASC-J12.
System Evaluation
The ATP systems are ranked at the division level. For each ATP system, for each problem, four items of data are recorded: whether or not the problem was solved, the CPU and wall clock times taken (as measured by StarExec’s
In addition to the ranking data, three other performance measures are presented in the results (a sketch of their computation is given after this list):

- The state-of-the-art contribution (SotAC) quantifies the unique abilities of each system (excluding the previous year’s winners that are earlier versions of competing systems). For each problem solved by a system, its SotAC for the problem is the fraction of systems that do not solve the problem. A system’s overall SotAC is the average SotAC over the problems it solved that were not solved by all the systems.
- The efficiency balances the number of problems solved with the time taken. It is the average solution rate over the problems solved, multiplied by the fraction of problems solved (the solution rate for one problem is the reciprocal of the time taken to solve it). Efficiency is computed for both CPU time and wall clock time, to measure how efficiently the systems use one core and multiple cores, respectively.
- The core usage measures the extent to which the systems take advantage of multiple cores. It is the ratio of CPU time to wall clock time used. Core usage below 1.0 is typically the result of the problem being solved in early (pre)processing before multicore search is started. The results present the average core usage and the number of problems solved with core usage >1.0. The competition ran on octa-core computers, thus the maximal core usage was 8.0. While high core usage can be seen as a strength of an ATP system, the ability to solve problems quickly before a multicore search is started is also a strength—the number of such problems is simply the difference between the number solved and the number solved with core usage >1.0.
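The three measures can be stated compactly in code. The sketch below assumes per-problem results are available as (solved, CPU seconds, wall clock seconds) tuples, and it omits the exclusion of previous winners from the SotAC calculation; both are simplifications made for illustration.

```python
# Assumes results[system][problem] = (solved, cpu_seconds, wall_seconds).

def sotac(results, system):
    systems = list(results)
    scores = []
    for problem, (solved, _, _) in results[system].items():
        if not solved:
            continue
        non_solvers = sum(1 for s in systems
                          if not results[s].get(problem, (False, 0, 0))[0])
        if non_solvers > 0:                 # skip problems solved by all systems
            scores.append(non_solvers / len(systems))
    return sum(scores) / len(scores) if scores else 0.0

def efficiency(results, system, num_problems, wall_clock=False):
    times = [wall if wall_clock else cpu
             for solved, cpu, wall in results[system].values() if solved]
    if not times:
        return 0.0
    rates = [1.0 / max(t, 0.001) for t in times]    # solution rate = 1 / time taken
    return (sum(rates) / len(rates)) * (len(times) / num_problems)

def core_usage(results, system):
    ratios = [cpu / wall for solved, cpu, wall in results[system].values()
              if solved and wall > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0
```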
The demonstration division results are presented along with the competition divisions’ results, but might not be comparable with those results. The demonstration division is not ranked.
Results
The result tables below give the number of problems solved with a solution output (with a “()” bracketed number of problems solved without a solution output), the average time taken over the problems solved, the SotAC, the (micro-) efficiency, the core usage/the number of problems solved with core usage >1.0, the number of new problems solved, and the number of problems solved in each problem category. In each table, the best value for each performance measure in the competition division is
The THF Division
Table 4 summarizes the results of the THF division. Vampire 4.9 won the division, with a significant improvement over Vampire 4.8. The main reason for the stronger performance was better optimization of the strategy schedules for the TPTP problems. A short system description of Vampire is provided in Section 7, further explaining the improvements.
Table 4. THF Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
In addition to solving the most problems, Vampire had the lowest average time and the highest SotAC and efficiency, solved 203 problems before starting to use multiple cores, and had reasonable core usage for the other problems. E and cvc5 solved the most problems before starting to use multiple cores, but E had good average core usage while cvc5 did not try to use multiple cores (the data shows that cvc5 used more than one core for one problem, but that is not due to intentional parallelism). Zipperposition made the most use of the multiple cores. It was noticed that E solved 13 problems in the last second of wall clock time, but used less than a second of CPU time. The developer commented: “What is weird is that for many problems outside the SEV/SEU domain, this effect does not seem to happen. But from a process point of view, there should be no difference.” The category rankings are almost completely aligned with the overall ranking, which indicates that the systems are not particularly tuned to problems with or without equality.
Leo-III was experimentally configured for the competition with only E as a subsystem. The developer explained: “There is a tradeoff between more (orthogonal) provers and more time/instances per prover.” In CASC-29, Leo-III 1.7.8 included cvc4 as a subsystem, and solved 302 problems, with 110 (36% of 302) solved by the cvc4 subsystem and 130 (43% of 302) by the E subsystem. In CASC-J12, 203 problems (76% of 268) were solved by the lone E subsystem, indicating that using the single subsystem might be an effective configuration.
The individual problem results show that 10 problems were unsolved, 55 problems were solved by all the systems, and 29 problems were solved by only one system (12 problems were solved by only the two versions of Vampire, and are counted as unique solutions for Vampire 4.9). Of the 29 unique solutions, 15 were by Vampire 4.9, seven by E, five by Zipperposition, and two by cvc5. A portfolio (sharing the time between several systems) of these four systems, with 45 s allocated to each, would solve 484 problems.
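The portfolio numbers quoted here (and in the later division sections) can be reproduced from per-problem solve times with a simple computation: a problem counts as solved by the portfolio if at least one system solves it within its allocated slice of the time limit. The sketch below assumes a solve_times mapping that is not part of the published results format; the names are illustrative only.

```python
# `solve_times` maps each system to a {problem: wall_clock_seconds} dict of
# the problems it solved; `allocations` maps each system to its time slice.

def portfolio_solved(solve_times, allocations):
    """Count problems solved when each system gets its allocated slice of the time limit."""
    solved = set()
    for system, budget in allocations.items():
        for problem, seconds in solve_times[system].items():
            if seconds <= budget:
                solved.add(problem)
    # Note: a real portfolio could hand time left over by one system to the
    # next one; this simple version ignores that.
    return len(solved)

# For example, the THF portfolio above splits the 180 s limit four ways:
# portfolio_solved(solve_times,
#                  {"Vampire 4.9": 45, "E": 45, "Zipperposition": 45, "cvc5": 45})
```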
The TFA Division

Table 5 summarizes the results of the TFA division. The winner was Vampire 4.9, and as in the THF division, there was a significant improvement over Vampire 4.8. The ranking of the systems is the same as in CASC-29, with the Princess system missing from the bottom of the ranking. The SotAC, efficiency, and category rankings are aligned with the overall ranking. All the systems other than cvc5 made good use of the multiple cores.
Table 5. TFA Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
There were 10 new problems in the division, all proving the equivalence of two syntactically different programs that generate the same sequence of integers (Gauthier et al., 2023). cvc5 solved the majority of the new problems, which were challenging for the other systems. cvc5 also had an interesting performance plot: 43 problems were solved very quickly, and another 37 problems were solved in 14–18 s. The 37 problems are all TFI software verification problems, generated by the Toccata system (Bobot et al., 2014), and the same cvc5 strategy solved 36 of them. This illustrates how a diverse strategy schedule can be useful, and that the order in which the strategies are used has a significant impact.
The individual problem results show that 12 problems were unsolved, 75 problems were solved by all the systems, and 43 problems were solved by only one system (14 problems were solved by only the two versions of Vampire, and are counted as unique solutions for Vampire 4.9). Of the 43 unique solutions, 22 were by Vampire 4.9, 18 by cvc5, and three by Vampire 4.8. A simple portfolio of Vampire 4.9 and cvc5 1.1.3, with 90 s allocated to each, would solve 183 problems. A more careful allocation of time, with 26 s allocated to Vampire 4.9, 124 s to cvc5 1.1.3, and 30 s to Vampire 4.8, would solve 187 problems.
Over the years, the TFA division has struggled to attract entrants, with the same few systems being entered each year—this was noted in the CASC-29 report (Sutcliffe and Desharnais, 2024). Theorem proving in typed first-order logic with arithmetic is an important area, with applications in areas such as mathematics (Korovin et al., 2023), software verification (Bobot et al., 2015), and interactive theorem proving (Paulson and Blanchette, 2010). CASC will continue to have a TFA division in order to encourage the development of ATP in this area. Conversations during IJCAR 2024 suggested that there is work in progress to add arithmetic capabilities to two of the other regular CASC entrants—fingers crossed for CASC-30.
The TFN Division

Table 6. TFN Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
Table 6 summarizes the results of the TFN division. While the Vampires solved the most problems, neither output any acceptable models. In CASC-29, Vampire 4.8 produced empty saturations due to aggressive simplification in preprocessing. This was discussed by the panel after CASC-29, resulting in 18 saturations not being counted. Vampire 4.8 also produced some unsatisfiable models due to syntactic errors in the output, but the CASC-29 panel allowed those models to be counted. These errors were not fixed in Vampire 4.9, and the entrant conceded: “we failed to finish our satisfiability-certificates homework since last year,” and, when told the models would not be counted in CASC-J12, added: “Given these two disqualifications span our entire non-theorem capability, Vampire will effectively be demonstration-only in the non-theorem divisions. This is fine by us.” cvc5 suffered the same fate because it outputs empty models in some cases, for example, when the --mbqi strategy is used. The cvc5 developer accepted his fate: “Unfortunately [I] don’t have a lot of bandwidth to resolve the issues for this year’s CASC, so feel free to use cvc5 in a way you see fit.” In contrast, in CASC-29, iProver 3.8 had 34 models not counted: 18 empty saturations (it uses the same CNF conversion code as Vampire) and 26 Herbrand formulae for transformed problems with a signature different from that of the input problem. In CASC-J12, iProver 3.9 had turned off the aggressive simplification in the CNF conversion, and no longer produces empty saturations. However, some models were still for transformed problems, and these 18 models were not counted. The upshot is that iProver won the division.
As explained in Section 3.2, 59 problems with a rating of 0.00–0.20 and 48 problems with a rating of 1.00 were made eligible for the TFN division. Of those, 51 rated 0.00 and 44 rated 1.00 were selected. All the problems with a rating of 0.00 were solved (without model output) by both Vampires; 5 were not solved by iProver, and 10 were not solved by cvc5. Of the 44 problems with a rating of 1.00, only one was solved (by all the systems except cvc5). One might expect all the problems with a rating of 0.00 and none of the problems with a rating of 1.00 to be solved, but this is not the case because the ratings were calculated using earlier versions of the systems; clearly, progress is not monotonic.
The individual problem results (without requiring model output) show that 43 problems were unsolved, 40 problems were solved by all the systems, and 14 problems were solved by only one system (all 14 problems were solved by only the two versions of Vampire, and are counted as unique solutions for Vampire 4.9). A portfolio approach cannot improve the individual systems’ results.
The inability of the model-finding systems to output acceptable models is disappointing. The reason seems to be a lack of developer time, rather than a lack of motivation. The systems have the information they need to output a model when satisfiability is established, and “all” that is necessary is to output the model in an acceptable format. Section 8 summarizes some more formal specifications of an acceptable model that will be used in CASC-30. Hopefully, this step will motivate developers towards outputting models.
The FOF Division

Table 7. FOF Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
Table 7 summarizes the results of the FOF division. The dominance of Vampire increased further compared to CASC-29: in CASC-29, Vampire 4.8 solved 451 problems, E 3.1 solved 393 problems, CSE_E 1.5 solved 346 problems, and iProver 3.8 solved 355 problems. The numbers of problems solved generally suggest that the CASC-J12 problems were harder than the CASC-29 problems, but Vampire 4.9 still solved more problems than Vampire 4.8 did in CASC-29. The main reasons for Vampire 4.9’s stronger performance were work tackling the problem of overfitting in strategy schedule construction (Bártek et al., 2024) and submitting the system in time for problem rating (as discussed below).
The performance plots for the FOF division clearly separate the systems into three groups: Vampire 4.9, the “hoi polloi,” and the “also rans” (two of which were demonstration systems). In the middle group, E and the CS?_E systems stand out with better performances.

The CS?_E systems have a unique architecture: CS? and E are applied to the given problem sequentially. If either prover solves the problem, then the proof process completes. If neither CS? nor E can solve the problem, then some clauses with no more than two literals, especially unit clauses, inferred by CS? are fed to E as lemmas, along with the original clauses, for further proof search. CSE_E solved 14 problems that E did not, two by the CSE part and 12 by the E part using lemmas generated by the CSE part. CSG_E solved nine problems that E did not, all by E using lemmas generated by the CSG part. CSI_E solved two problems that E did not. All the problems that the CS?_E systems solved but E did not are hard problems: the 14 by CSE_E have ratings from 0.55 to 0.97 with an average of 0.80; the nine by CSG_E have ratings from 0.41 to 0.97 with an average of 0.72; and the two by CSI_E have TPTP difficulty ratings 0.85 and 0.94. The CS? parts use the SCS inference rule (Xu et al., 2018), which can have very many parents—in one proof there is an inference step with 636 parents! The ability of the CS?_E systems to solve hard problems has been a focus of the developers’ research (Liu et al., 2023). A short system description of the CS?_E systems is provided in Section 7.
CSI_E and iProver each solved one problem close to the time limit, but did not finish proof output by the time limit.
The top-performing systems had the best SotACs and efficiencies. CSG_E solved the most problems before starting to use multiple cores and seldom used multiple cores. The highest core usages came from Drodi, iProver, and GKC. The performances in the two problem categories are well aligned with the overall performance.
The individual problem results show that 18 problems were unsolved, 24 problems were solved by all the systems, and 41 problems were solved by only one system (nine problems were solved by only the two versions of Vampire, and are counted as unique solutions for Vampire 4.9). Of the 41 unique solutions, 37 were by Vampire 4.9, three by cvc5 1.1.3, and one by Vampire 4.8. Not much could be gained by a portfolio: with 35 s allocated to Vampire 4.9, 36 s to Vampire 4.8, and 109 s to E, 475 problems would be solved.
To assess the extent to which Vampire benefited from “submitting the system in time for problem rating,” the TSTP data used for computing the problem ratings has been examined. Note that the system versions used to obtain the data (and referred to here) were in some cases the same as used in the competition and in some cases a preceding version; the TSTP had been updated with the latest versions of the systems available by the system submission deadline for problem rating—1 May 2024. In particular, the Vampire version used for problem rating was a special prerelease of the version used in the competition. Of the 500 problems used in the competition, the prerelease of Vampire was the only system to have solved 74 of them, and six problems had not been solved by Vampire but had been solved by one of the other systems in the division. In the competition, Vampire 4.9 solved 60 of the 74 and one of the six problems. The runner-up, E, solved 11 of the 74 and three of the six. Those numbers give Vampire a net advantage of 47 problems over E (60 − 11 = 49 on the first set, less 3 − 1 = 2 on the second), that is, around 10% of the division’s problems, which is a significant advantage. The converse numbers gave E no net advantage over Vampire. This clearly shows the benefit of submitting a system before the competition so that its performance data is used in computing the problem ratings. However, there were clearly other improvements in Vampire that contributed to its strong performance. A short system description of Vampire is provided in Section 7, further explaining the improvements.
Connect++ was the only newcomer in CASC-J12, and it met the developer’s expectation stated in the online system description: “It is not expected at this stage to be competitive.” On the flip side, entering CASC had a positive impact on the development of Connect++, and a short system description of Connect++ is provided in Section 7. The CASC rules explicitly state: “A system that is not entered into a division is assumed to perform worse than the entered systems,” that is, Connect++ is assumed to be better than known ATP systems that were not entered. There is one well-known FUNny ATP system that has been absent from CASC for many years, and the developer has been threatening to enter again if the EPR division is brought back from hiatus (see Section 9).
The UEQ Division
Table 8 summarizes the results of the UEQ division. After many years of effort and hope, Vampire finally ascended to the UEQ throne, thus deposing “King Nicholas of Twee,” who had succeeded the Waldmeister in 2021 (Sutcliffe and Desharnais, 2022). As reported in Sutcliffe and Desharnais (2024), the original Vampire developer had made some bold statements during and after the running of CASC-29 about winning the division in CASC-J12 …his claims were justified! The new techniques used for UEQ in Vampire 4.9 are described in the short system description in Section 7. Twee 2.5.0 improved slightly over the CASC-29 winning version. Twee’s missing proof was caused by an
Table 8. UEQ Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
The SotAC and efficiency values were mostly aligned with the division ranking, with the exception of Drodi, whose higher efficiency resulted from very few solutions with a high time taken—only 26 solutions (23% of 115) took more than 20 s. This is in contrast to iProver, which solved 112 problems (55% of 202) in more than 20 s, with a correspondingly low efficiency. Vampire and Twee 2.5.0 solved the most problems before starting to use multiple cores and had good core usage during multicore search. iProver, Drodi, and GKC solved very few problems before starting to use multiple cores and thus had the highest core usage.
The individual problem results show that nine problems were unsolved, 77 problems were solved by all the systems, and 22 problems were solved by only one system (four problems were solved by only the two versions of Twee, and are counted as unique solutions for Twee 2.5.0). Of the 22 unique solutions, nine were by Vampire, eight by E, and five by Twee 2.5.0. A simple portfolio of these three systems, with 60 s allocated to each, would solve 283 problems.
The SLH Division

Table 9 summarizes the results of the SLH division. Vampire’s improved performance on higher-order problems (see Section 6.1) is shown again here. The problems were apparently harder in CASC-J12 than in CASC-29: in CASC-29 E 3.1 solved 467 problems. As noted in Section 3.3, 401 of the 1000 problems had not been solved in the testing done before CASC-29, and if many of them were solved in CASC-J12 that would indicate progress in the field. Of the 401 problems, 19 were solved in CASC-J12: 13 by cvc5, five by Vampire, and one by E 3.2.0 (i.e., none were solved by more than one system). That possibly indicates progress, but at best slow progress.
Table 9. SLH Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
The individual problem results show that 446 problems were unsolved, 258 problems were solved by all the systems, and 110 problems were solved by only one system (30 problems were solved by only the two versions of E, and are counted as unique solutions for E 3.2.0). Of the 110 unique solutions, 39 were by each of Vampire and cvc5, 31 by E 3.2.0, and one by E 3.1. A portfolio approach works well here—a portfolio of cvc5, E 3.2.0, and Vampire, with 10 s allocated to each, would solve 524 problems.
One of the goals of the SLH division is to encourage system developers to tune their systems to the needs of Sledgehammer users. As noted in Section 3.3, the problems were from the same problem set as for CASC-29 (but different problems were selected), so that developers could tune their systems using the CASC-29 problems. After CASC-J12 the developers were asked if they had done that tuning. One answered: “I’ve not done any tuning over the last two years, being too busy working on finalising the integration of higher-order logic,” and another (who had claimed after CASC-29 that: “We will be ready for SLH @ CASC-J12 in two months from now”) answered: “The schedule(s)
The use of Sledgehammer problems for testing and evaluating ATP systems has led to a practical development whereby Sledgehammer will provide its ATP subsystems with persistent storage associated with streams of homogeneous problems. This will allow the ATP systems to save search data between invocations, and use that data to incrementally improve their performance on the stream of problems. This approach was exploited by Vampire 4.7’s dynamic strategy weighting in the LTB division of CASC-J11 (Sutcliffe and Desharnais, 2023), and is used in the incremental strategy merging approach developed for E (McKeown and Sutcliffe, 2024).
The ICU Division

The ICU division was created in response to evidence of CASC being a possible cause of incremental development of ATP systems (Sutcliffe and Desharnais, 2024), which in itself might be contributing to a slowing of progress in ATP (Sutcliffe et al., 2024). The new division also supports CASC’s goal of stimulating ATP research by providing a new type of challenge that could lead to some of the development effort being moved from breadth to depth (Sutcliffe and Desharnais, 2024). It certainly produced some rather interesting competition and results.
Table 10. ICU Division Results.
Note. ATP = automated theorem proving; SotAC = state-of-the-art contribution.
Table 11. ICU Matrix.
Note. ATP = automated theorem proving.
Table 10 summarizes the results of the ICU division, and Table 11 shows how many of the problems submitted by each system’s developer were solved by each system (the columns correspond to the problem submissions from the corresponding system rows). As might have been expected, Vampire was dominant, solving all 10 of its own problems and also all 10 of the iProver, Drodi, and CSG_E problems. Only CSI_E and E managed to solve a Vampire problem, but in turn Vampire solved six and five of the CSI_E and E problems, respectively. As noted in Section 3.4, Connect++ was not expected to solve the problems it submitted, and indeed only Vampire managed to solve one. The problems submitted by CSI_E were the most effective at being solvable by itself but unsolvable by other systems—itself and E solved them all, and Vampire was the only other system that solved any of them. The standalone E solved 42 of the 43 problems that CSI_E solved, suggesting that CSI_E’s solutions are largely thanks to the E component. The contribution of the E component to the performance of CSI_E is undoubted, but an examination of the proofs reveals that seven of CSI_E’s proofs are different from the standalone E proofs, that is, the CSI component of CSI_E helps the E component find proofs. The three CS?_E systems solved different numbers of each other’s problems: CSI_E solved five of its own problems, eight CSE_E problems, and 10 CSG_E problems; CSE_E solved five of its own problems, seven CSI_E problems, and no CSG_E problems; CSG_E solved none of its own problems (!), three CSI_E problems, and three CSE_E problems.
The SotAC values align with the division ranking, but the efficiency values are markedly unaligned. The performance plots for the ICU division 8 show the different numbers of problems solved quickly, resulting in this misalignment. The core usage numbers show that multicore search was used for most problems, and the core usage was high for most of the systems.
The individual problem results show that nine problems were unsolved, one problem was solved by all the systems, and 14 problems were solved by only one system. All 14 unique solutions were by Vampire. A portfolio approach works well here—a combination of CSI_E and Vampire, with 300 s allocated to each, would solve 71 problems.
This was the first ICU division, and only two entrants submitted non-TPTP problems—Drodi and E (see Section 3.4). CSI_E and E were the most effective on the non-TPTP problems submitted. In E’s case, the strategy of submitting non-TPTP problems was effective, but, as is evident from Table 11, CSI_E’s strategy of submitting carefully selected TPTP problems was also effective. Choosing problems that are solvable by one system and not solvable by other systems requires some effort from entrants, to understand the other systems and to test them on candidate problems. Making that effort has benefits beyond CASC, as it leads to an understanding of other systems’ techniques that work well for certain types of problems, which might subsequently lead to an integration of those techniques into the entrant’s own system.
System Descriptions

Each year the competition report features short system descriptions of systems that stand out in some way. For CASC-J12, the salient systems are Vampire 4.9, for its strong overall performance and particularly its improved performance in the THF, TFN, FOF, and UEQ divisions; the CS?_E family of systems, for generally strong performance, particularly from CSI_E; and Connect++, for being an interesting new development. These short descriptions of the systems were written by their entrants.
Proof and Model Verification
Over the years it has often been suggested that CASC should have a “verified track,” in which solutions are counted only if they have been approved by an independent verifier. Many other communities’ competitions include such tracks, for example, the SAT competition since 2016 (Balyo et al., 2017), the Confluence competition since its inception in 2012 (Middeldorp et al., 2021), and the Termination competition’s certified category since 2007 (Giesl et al., 2019). To date, CASC has not tried to have a certified track, because there has not been an adequate proof or model format that system developers have commonly adopted and adhered to.
Empirical testing, as is done by most ATP system developers in their preparations for CASC, provides a weak assurance of soundness. The soundness testing done in CASC is also empirical (see Section 4), and this has been effective in detecting occasional instances of unsoundness (e.g., Sutcliffe and Desharnais, 2022, 2023). However, there remains a need to verify the logical correctness of solutions in CASC. In addition to logical verification, the utility of solutions for users in their applications should be considered. Solutions must be well-formed so they can be parsed and verified, and comprehensible to the applications (including humans) that need to use the solutions. In the light of these observations, verification can be incrementally imposed in CASC:
1. Solutions must be written in an agreed language. For CASC, the language must be a TPTP language (Sutcliffe, 2023). A parser based on the BNF definition of the TPTP languages (Van Gelder and Sutcliffe, 2006) can be used to check conformance.
2. Solutions must be written in an agreed-upon concrete format. For derivations, the established TPTP format (Sutcliffe et al., 2006) has already been adopted by many ATP systems. For models, the existing format for finite models (Sutcliffe et al., 2006) has been adopted by many ATP systems, but the forthcoming new format for interpretations (Steen et al., 2023) will support a broader range of model types.
3. Solutions must be structurally correct. For proofs, the parents of each inferred formula must be documented and exist in the derivation, the derivation must be acyclic, refutations must have false roots, assumptions must be discharged, etc. (a minimal sketch of such checks is given after this list). The GDV derivation verifier (Sutcliffe, 2006) can be used to check conformance. For models, the domains must be specified (either explicitly or implicitly, as is the case for Herbrand interpretations), and symbol mappings must be provided (either explicitly or implicitly). The AGMV model verifier can help (Steen et al., 2023).
4. Solutions must be for the given problem. For proofs, the leaves of a derivation must come from the problem. This can be checked by treating the leaves of the derivation as inferred from the problem and proceeding to step (5). For models, symbol mappings must be provided for all the symbols in the problem. Symbol mappings may additionally be provided for symbols added in the derivation, e.g., definitions, Skolem functions, etc. This can be done by comparing the signatures of the problem and model.
5. Solutions must be logically correct. This is where most attention has been paid in proof verification (possibly at the expense of the preceding requirements, which are simply assumed), for example, McCune and Shumsky-Matlin (2000), Wetzler et al. (2014), and Andreotti et al. (2023). There are several proof verification systems that can be used, including GDV (Sutcliffe, 2006), Dedukti (Dowek, 2022), and GAPT (Ebner et al., 2016). Verification of the logical properties of models has not received adequate attention, but tools such as AGMV and Dolmen (Bury and Bobot, 2023) are available.
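As an illustration of the structural checks in item (3), the sketch below verifies that every recorded parent exists and that the derivation is acyclic, assuming the derivation is given as a mapping from formula names to lists of parent names (leaves have no parents). It is a toy illustration of the idea, not a substitute for GDV.

```python
def structurally_ok(derivation, roots):
    """Check that all parents exist and that the derivation is acyclic."""
    # Every recorded parent must itself be a node of the derivation.
    for node, parents in derivation.items():
        for parent in parents:
            if parent not in derivation:
                return False, f"{node} has unknown parent {parent}"

    # Depth-first search from the roots to detect cycles.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in derivation}

    def visit(node):
        if colour[node] == GREY:
            return False                     # back edge: cycle detected
        if colour[node] == BLACK:
            return True
        colour[node] = GREY
        if not all(visit(p) for p in derivation[node]):
            return False
        colour[node] = BLACK
        return True

    for root in roots:
        if not visit(root):
            return False, "derivation contains a cycle"
    return True, "ok"
```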
For CASC-30, proofs will have to pass checking for the first three items above, that is, they must be written in the TPTP language and format, and pass structural testing. Models will have to be written in the new TPTP format, but given the recency of the new format only the first two items above will be checked. Logical verification is left for CASCs beyond that. A long-range plan is emerging to run CASC only at CADE conferences, and have competitions based on other ATP-related topics at the IJCAR conferences. In particular, a proof verifier/verifiability competition (the “ProoVer” competition) is being planned for IJCAR in 2027. ProoVer will evaluate the capabilities of proof verifiers, and the extent to which ATP systems’ proofs can be verified.
Conclusion

CASC-J12 was the 29th annual evaluation of fully automatic, classical logic, ATP systems. CASC-J12 fulfilled its objectives by evaluating the relative capabilities of ATP systems, and stimulating development and interest in ATP. The highlights of CASC-J12 were: significant improvement in Vampire, the ability of the CS?_E systems to solve hard problems, continued weakness in model outputs, and interesting competition in the new ICU division.
While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Changes being planned for CASC-30 are:
- Proofs will have to be in TPTP format and pass parsing and structural verification tests.
- Models will have to be in the new TPTP format and pass parsing tests.
- The ICU division will limit the number of different problems submitted by individual entrants and their colleagues.
CASC-30 will be run either on the StarExec Miami cluster (which will be maintained, unlike the StarExec Iowa cluster that will be shutting down), or on StarExec-ARC that runs in Amazon’s EKS Kubernetes environment (Fuenmayor et al., 2024).
CASC’s evaluation is principally based on ATP systems’ abilities to solve problems. Evaluation of other performance measures would also be interesting (Sutcliffe et al., 2024). These include measures such as resource usage, stability modulo perturbations of the input, and verifiability of proofs/models (see Section 8). Evaluation of nonperformance measures is often ignored, but for users might be just as important. These include measures such as the range of logics covered, ease of building and deploying, portability to different hardware and operating system environments, an easy-to-access API, availability of source code, quality of source code and its documentation, licensing that permits a required level of use or modification, availability of user documentation, and (maybe most importantly!) developer support. These are possible topics for other future competitions.
As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue contributing problems of all types.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
