We define a distance function on propositional formulas in CNF as a measure of non-isomorphism of formulas: the larger the distance between two formulas is, the further they are from being isomorphic. This distance induces a metric on isomorphism classes of formulas. We show how this distance can be used for solving, namely for per-instance algorithm selection where there is a “portfolio” of solvers and there is a “meta-solver” that chooses a solver from the portfolio for a given input formula.
Two propositional formulas in CNF are isomorphic if one of them can be obtained from the other by renaming variables and flipping literals. It is natural to view isomorphism as “similarity” of formulas, but this similarity measure is all or nothing: any two formulas are either similar or not. How can we modify this measure to make it gradual? In this paper, we define a distance function on formulas such that the smaller the distance between two formulas is, the closer they are to being isomorphic. This distance function is a pseudometric on formulas and, thereby, it induces a metric on isomorphism classes of formulas.
Application to Solving. Before we describe the distance function and its properties, we briefly note how this distance can be used for solving; more details appear later in this section, and a formal description is given in Sections 5 and 6. We consider the following variant of per-instance algorithm selection [11]. There are a “dataset” D of formulas and a “portfolio” P of algorithms such that for every formula ϕ ∈ D, at least one algorithm in P performs fast on ϕ. There is a “meta-algorithm” S that takes a formula ψ as input and tries to select an algorithm from P that performs fast on ψ. How does S work?
Under certain conditions on D and P (we specify these conditions in Section 6), if an algorithm A from P performs fast on a formula ϕ, then A performs fast on any formula close to ϕ. This suggests the following behavior of S: find a formula ϕ ∈ D closest to ψ and select an algorithm A from P that performs fast on ϕ. If the distance between ψ and ϕ is small enough, then A performs fast on ψ as well. Otherwise (D contains no formula close to ψ), the meta-algorithm S gives up and returns “can’t select”.
Distance Function. Precise definitions are given in Sections 2 and 3; here we give the main idea. The distance between two formulas is defined by combining isomorphism of formulas and symmetric difference of sets. Let ϕ1 and ϕ2 be propositional formulas in CNF (that is, ϕ1 and ϕ2 are sets of clauses). The distance between them, denoted δ(ϕ1, ϕ2), is the minimum cardinality of the symmetric difference ϕ1′ △ ϕ2′, where the minimum is taken over all formulas ϕ1′ and ϕ2′ isomorphic to ϕ1 and ϕ2 respectively. This definition has two equivalent forms:
δ(ϕ1, ϕ2) is the minimum number of clauses that can be removed from ϕ1 and ϕ2 so that the remaining subformulas are isomorphic to each other;
δ(ϕ1, ϕ2) is the minimum number of clauses that can be added to ϕ1 and ϕ2 so that the resulting superformulas are isomorphic to each other.
By definition, δ is preserved under isomorphism: if ϕ1, ϕ1′ and ϕ2, ϕ2′ are pairs of isomorphic formulas, then δ(ϕ1, ϕ2) = δ(ϕ1′, ϕ2′). In Section 3, we show that δ is a pseudometric on formulas. This pseudometric induces a metric on isomorphism classes of formulas.
Note that this approach to defining a distance function is essentially a replica of the approach used by Gromov in the definition of the Gromov–Hausdorff distance on metric spaces [5,10]. The Gromov–Hausdorff distance between metric spaces X and Y is the minimum Hausdorff distance between the images of X and Y under distance-preserving maps f and g from these spaces to a metric space M, where the minimum is taken over all M, f, and g.
Computing the Distance. In Section 4 we address the problem of computing the distance function δ. Although δ is NP-hard to compute in general, it is easier to determine whether this distance is “small”. Namely, for every integer d, the problem of verifying the equality δ(ϕ1, ϕ2) = d is Turing reducible to the graph isomorphism problem. The latter can be solved in quasi-polynomial time [4] and, in practice, it is often solved efficiently using Nauty tools [14].
What distance is computed in the per-instance algorithm selection approach described above? The meta-algorithm determines whether the dataset contains a formula lying within a small distance from the input formula ψ. Assuming that a “small distance” means a constant distance, this can be done by a polynomial-time oracle algorithm with an oracle for graph isomorphism.
Per-Instance Algorithm Selection. Section 6 shows how the distance δ can be used for per-instance algorithm selection; see the brief sketch above. The meta-algorithm S defined in this section is called the fixed-distance selector. Taking a formula ψ and an integer d as input, S checks whether the dataset D contains a formula ϕ such that δ(ψ, ϕ) ≤ d. If so, S selects an algorithm A that performs fast on ϕ and returns it. Otherwise, S says “can’t select”. We formulate conditions on D and P that guarantee fast performance of the selected algorithm A on the input formula ψ.
The key point of this approach is that every algorithm has the following property:
If A performs fast on a formula ϕ, then A performs fast on any formula close to ϕ.
To formalize this property, we define a class of parameters of formulas that we call parameters of constant variation (Section 5). By a parameter we mean any computable function from the set of formulas to the non-negative integers. A parameter π is a parameter of constant variation if the following holds:
π is preserved under isomorphism;
for every formula ϕ, a slight variation of ϕ can cause only a slight change of π(ϕ): the addition of one clause to ϕ can change π(ϕ) by only a constant value that does not depend on ϕ.
The class of constant-variation parameters includes incidence treewidth, maximum deficiency, and many other parameters π such that the satisfiability problem parameterized by π is fixed-parameter tractable [15]. This class can be characterized in terms of the distance δ. Namely, π is a parameter of constant variation if and only if |π(ϕ1) − π(ϕ2)| ≤ c · δ(ϕ1, ϕ2) for some number c and all formulas ϕ1 and ϕ2. This characterization connects the running time of parameterized algorithms from the portfolio P with the distance δ.
Extension to CSP Instances. In Section 7 we briefly discuss an extension of the distance function δ from CNF formulas to instances of the constraint satisfaction problem.
Formulas and Isomorphism
Propositional Formulas in CNF. A literal is a propositional variable or its negation; each of them is the complement of the other. The complement of a literal a is denoted by ā. A clause is a nonempty finite set of literals that contains no pair of complements. A propositional formula in Conjunctive Normal Form, called a formula or CNF formula for short, is a finite set of clauses.
Let ϕ be a formula. We write |ϕ| to denote the number of clauses of ϕ and var(ϕ) to denote the set of variables appearing in ϕ (with or without negation). The width of a clause C is the number of literals in C. If ψ is a formula such that ϕ ⊆ ψ, we call ϕ a subformula of ψ.
An assignment for a formula ϕ is a function α from var(ϕ) to {true, false}. Let α be such an assignment and x ∈ var(ϕ). We say that x is true under α if α(x) = true; otherwise x is false under α. Symmetrically, we say that x̄ is true under α if α(x) = false; otherwise x̄ is false under α. A clause C is thought of as the disjunction of its literals; we say that α satisfies C if at least one literal in C is true under α. A formula is thought of as the conjunction of its clauses; if α satisfies all clauses of ϕ, then we say that α satisfies ϕ and call α a satisfying assignment for ϕ. The empty formula, denoted by ⊤, is considered to be satisfied by any assignment. The number of satisfying assignments for ϕ is denoted by #ϕ [9].
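Under one common encoding (an assumption of this sketch, not fixed by the paper): a literal is a nonzero integer, x and −x being complements; a clause is a frozenset of literals; a formula is a frozenset of clauses. The definitions above then translate directly:

```python
from itertools import product

def variables(phi):
    """var(phi): the set of variables appearing in phi."""
    return {abs(lit) for clause in phi for lit in clause}

def satisfies(alpha, phi):
    """Does assignment alpha (a dict variable -> bool) satisfy every clause of phi?"""
    return all(
        any(alpha[abs(lit)] == (lit > 0) for lit in clause)
        for clause in phi
    )

def count_models(phi):
    """#phi: the number of satisfying assignments, by exhaustive enumeration."""
    vs = sorted(variables(phi))
    return sum(
        satisfies(dict(zip(vs, bits)), phi)
        for bits in product([False, True], repeat=len(vs))
    )

# (x1 or x2) and (not x1 or x2): satisfied exactly when x2 is true
phi = frozenset({frozenset({1, 2}), frozenset({-1, 2})})
```

Note that the empty formula ⊤ correctly gets one satisfying (empty) assignment under this enumeration.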
We write SAT to denote the satisfiability problem for propositional formulas in CNF: given a formula, does it have a satisfying assignment? The counting version of SAT, denoted by #SAT, is the following function problem: given a formula ϕ, find #ϕ.
Isomorphisms. Informally, two formulas are isomorphic if one of them can be obtained from the other by renaming variables and flipping literals. A more formal definition is given below.
Let ϕ1 and ϕ2 be formulas. Let lit(ϕ1) and lit(ϕ2) be the corresponding sets of literals: lit(ϕi) consists of the variables of ϕi and their negations. Let f be a bijection from lit(ϕ1) to lit(ϕ2). We call f an isomorphism from ϕ1 to ϕ2 if the following holds:
f and its inverse preserve the complement relation: for every literal a ∈ lit(ϕ1), the literal f(ā) is the complement of f(a);
f and its inverse preserve the clauses: for every clause C ⊆ lit(ϕ1), we have C ∈ ϕ1 if and only if {f(a) : a ∈ C} ∈ ϕ2.
Formulas ϕ1 and ϕ2 are isomorphic, written ϕ1 ≅ ϕ2, if there is an isomorphism from ϕ1 to ϕ2. For every formula ϕ, we write [ϕ] to denote the isomorphism class of ϕ, i.e., the set of all formulas isomorphic to ϕ. Thus, [ϕ] can be viewed as ϕ up to isomorphism. Note that isomorphism of formulas is sometimes defined using renamings only [3], yet in this paper we adhere to the more general version.
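With the same integer-literal encoding as before (an assumption for illustration), checking that a given literal bijection witnesses an isomorphism amounts to verifying the two conditions above:

```python
def is_isomorphism(f, phi1, phi2):
    """Check that the literal bijection f (a dict literal -> literal)
    is an isomorphism from phi1 to phi2 as defined above."""
    # lit(phi1): literals of phi1 together with their complements
    lits1 = {l for clause in phi1 for lit in clause for l in (lit, -lit)}
    # f must be a bijection defined exactly on lit(phi1)
    if set(f) != lits1 or len(set(f.values())) != len(f):
        return False
    # f preserves the complement relation (-x encodes the complement of x)
    if any(f[-a] != -f[a] for a in lits1):
        return False
    # f maps the clauses of phi1 exactly onto the clauses of phi2
    return {frozenset(f[lit] for lit in clause) for clause in phi1} == phi2

# rename x1 to x2 with a flip, and x2 to x1
phi1 = frozenset({frozenset({1, 2}), frozenset({-1, 2})})
phi2 = frozenset({frozenset({-2, 1}), frozenset({2, 1})})
f = {1: -2, -1: 2, 2: 1, -2: -1}
```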
Distance Function on Formulas
Let X be a set. A function d from X × X to the non-negative reals is a distance function if d(x, x) = 0 and d(x, y) = d(y, x) for all x, y ∈ X. Consider the following two conditions on d: for all x, y, z ∈ X,
if d(x, y) = 0 then x = y;
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
The distance function d is a metric on X if d satisfies both (1) and (2). If d satisfies (2), but not necessarily (1), then d is a pseudometric on X.
To define a distance function on formulas, we recall the notation for the symmetric difference operation. The symmetric difference of sets A and B is denoted by A △ B: A △ B = (A ∖ B) ∪ (B ∖ A).
Definition 1. Let δ be the following function that maps pairs of CNF formulas to nonnegative integers: δ(ϕ1, ϕ2) = min |ϕ1′ △ ϕ2′|, where the minimum is taken over all formulas ϕ1′ and ϕ2′ such that ϕ1′ ≅ ϕ1 and ϕ2′ ≅ ϕ2.
Proposition 1. The distance function δ is a pseudometric.
Proof: Consider the distance function d△ on finite sets defined as follows: d△(A, B) = |A △ B| for all finite sets A and B. It is well known that d△ is a metric. Indeed, we have A △ C ⊆ (A △ B) ∪ (B △ C) and, hence, the triangle inequality for d△ holds. Now the triangle inequality for δ follows from the triangle inequality for d△. □
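The containment A △ C ⊆ (A △ B) ∪ (B △ C) behind this proof can be checked exhaustively on a small universe (Python's `^` is set symmetric difference):

```python
from itertools import product

# Exhaustively check |A ^ C| <= |A ^ B| + |B ^ C| for all subsets
# A, B, C of a small universe
universe = [0, 1, 2]
subsets = [frozenset(s for s, keep in zip(universe, bits) if keep)
           for bits in product([0, 1], repeat=len(universe))]
violations = sum(
    len(A ^ C) > len(A ^ B) + len(B ^ C)
    for A in subsets for B in subsets for C in subsets
)
```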
Suprema and Infima. We say that a formula ϕ is embedded into a formula ψ if ϕ is isomorphic to some subformula of ψ. A formula ψ is called a supremum of formulas ϕ1 and ϕ2 if the following holds:
both ϕ1 and ϕ2 are embedded into ψ;
for every formula ψ′ such that both ϕ1 and ϕ2 are embedded into ψ′, we have |ψ| ≤ |ψ′|.
Similarly, we say that a formula 𝜃 is an infimum of formulas ϕ1 and ϕ2 if
𝜃 is embedded into both ϕ1 and ϕ2;
for every formula 𝜃′ such that 𝜃′ is embedded into both ϕ1 and ϕ2, we have |𝜃′| ≤ |𝜃|.
Proposition 2. Let ψ and 𝜃 be respectively a supremum and an infimum of formulas ϕ1 and ϕ2. Then δ(ϕ1, ϕ2) = 2|ψ| − |ϕ1| − |ϕ2| (2) and δ(ϕ1, ϕ2) = |ϕ1| + |ϕ2| − 2|𝜃| (3).
Proof: This follows straightforwardly from the definition of δ and properties of the symmetric difference. □
Induced Metric on Isomorphism Classes. It follows from the definition of δ that if ϕ1 ≅ ϕ1′ and ϕ2 ≅ ϕ2′, then δ(ϕ1, ϕ2) = δ(ϕ1′, ϕ2′). Thus, δ induces a distance function on isomorphism classes of formulas: this function, denoted by Δ, is defined by Δ([ϕ1], [ϕ2]) = δ(ϕ1, ϕ2). Proposition 1 implies that Δ is a metric on the set of isomorphism classes.
Example. Let ϕ1 and ϕ2 be the following formulas: Consider the following subformula 𝜃 of ϕ1: The formula 𝜃 is isomorphic to the following subformula of ϕ2 (with the isomorphism mapping each literal of 𝜃 to the corresponding literal of that subformula). Thus, 𝜃 is embedded into both ϕ1 and ϕ2. It is easy to check that no formula with more than two clauses is embedded into both ϕ1 and ϕ2. Hence, 𝜃 is an infimum of ϕ1 and ϕ2. Using Proposition 2, namely equality (3), we can find the distance between ϕ1 and ϕ2:
The distance can also be found using a supremum. Consider the following formula ψ that contains ϕ1 as a subformula: The formula ϕ2 is embedded into ψ with a suitable isomorphism. Since ϕ1 and ϕ2 cannot be embedded into a formula with fewer than five clauses, ψ is a supremum of ϕ1 and ϕ2. By equality (2), we have
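On small formulas, δ can be computed mechanically via equality (3) by searching for the largest common embedded subformula. The brute-force sketch below (not the paper's algorithm) assumes the integer-literal encoding of clauses and tests isomorphism by trying every variable bijection combined with every pattern of flips, so it is usable only on tiny formulas:

```python
from itertools import combinations, permutations, product

def isomorphic(phi1, phi2):
    """Brute-force isomorphism test over all renamings and flips."""
    if len(phi1) != len(phi2):
        return False
    v1 = sorted({abs(l) for c in phi1 for l in c})
    v2 = sorted({abs(l) for c in phi2 for l in c})
    if len(v1) != len(v2):
        return False
    for perm in permutations(v2):
        for flips in product([1, -1], repeat=len(v1)):
            m = {x: s * y for x, y, s in zip(v1, perm, flips)}
            image = {frozenset(m[abs(l)] * (1 if l > 0 else -1) for l in c)
                     for c in phi1}
            if image == phi2:
                return True
    return False

def delta(phi1, phi2):
    """delta(phi1, phi2) = |phi1| + |phi2| - 2k, where k is the largest
    size of a common embedded subformula (equality (3))."""
    for k in range(min(len(phi1), len(phi2)), -1, -1):
        for s1 in combinations(phi1, k):
            for s2 in combinations(phi2, k):
                if isomorphic(frozenset(s1), frozenset(s2)):
                    return len(phi1) + len(phi2) - 2 * k
    # the loop always returns: the empty formula embeds into everything

# toy formulas: removing one clause from phi1 leaves a formula
# isomorphic to phi2, so delta(phi1, phi2) = 1
phi1 = frozenset({frozenset({1, 2}), frozenset({-1, 2})})
phi2 = frozenset({frozenset({1, 2})})
```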
Distance Computation
Given formulas ϕ1 and ϕ2, how hard is it to compute δ(ϕ1, ϕ2)? We show that the problem of computing the distance δ is at least as hard as the hardest problems in NP. On the positive side, this general problem can be restricted to a more tractable case: determine whether the distance is “small”, where “small” means a constant not depending on the input formulas.
To make these claims more precise and to prove them, we consider the following computational problems:
The embedding problem: given formulas ϕ1 and ϕ2, determine whether ϕ1 is embedded into ϕ2.
The distance verification problem: given formulas ϕ1, ϕ2, and an integer d, determine whether δ(ϕ1, ϕ2) = d.
The distance computation problem: given formulas ϕ1 and ϕ2, find the distance δ(ϕ1, ϕ2).
Proposition 3. The embedding problem is NP-complete.
Proof: It is obvious that the embedding problem is in NP: a subformula of ϕ2 together with an isomorphism from ϕ1 to it is a polynomial-size certificate. To prove the completeness, we reduce the Hamiltonian path problem to the embedding problem. Without loss of generality, we can assume that instances of the Hamiltonian path problem are graphs without isolated vertices.
In our reduction, every graph G without isolated vertices is represented by a formula denoted by F(G) and defined as follows. Let G be a graph on vertices v1, …, vn and with edges e1, …, em. The formula F(G) has n + m variables x1, …, xn, y1, …, ym, where each xi represents the vertex vi and each yj represents the edge ej. Using these variables, we form the m clauses of F(G): for each edge ej with endpoints vs and vt, there is a corresponding clause built from xs, xt, and yj. Note that the map G ↦ F(G) has the following property: for all graphs G and H without isolated vertices, H is isomorphic to a subgraph of G if and only if F(H) is embedded into F(G).
Now we define a reduction of the Hamiltonian path problem to the embedding problem as follows. Let G be a graph with n vertices (all of them non-isolated) and m edges. Let H be a Hamiltonian path on n vertices: a graph on vertices u1, …, un with an edge between ui and ui+1 for every 1 ≤ i ≤ n − 1. The reduction maps G to the pair (F(H), F(G)). It is easy to see that G has a Hamiltonian path if and only if H is isomorphic to a subgraph of G. Therefore, G has a Hamiltonian path if and only if F(H) is embedded into F(G). It remains to note that the pair (F(H), F(G)) can be computed from G in polynomial time. □
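The reduction can be sketched as follows. The exact per-edge clause is not reproduced here, so the code uses one natural guess, the clause consisting of the two endpoint variables and the edge variable; this shape is an assumption, and the paper's clause may differ in detail:

```python
def formula_of_graph(edges, n):
    """Build F(G) for a graph on vertices 1..n with the given edge list.
    Vertex i is represented by variable i; edge j by variable n + j.
    Assumed clause shape per edge (u, v): {x_u, x_v, y_j}."""
    return frozenset(
        frozenset({u, v, n + j})
        for j, (u, v) in enumerate(edges, start=1)
    )

def hamiltonian_path_instance(edges, n):
    """Map G to the pair (F(H), F(G)), where H is a path on n vertices."""
    path_edges = [(i, i + 1) for i in range(1, n)]
    return formula_of_graph(path_edges, n), formula_of_graph(edges, n)
```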
Proposition 4. The distance verification problem is NP-hard.
Proof: We show that the embedding problem is reducible to the distance verification problem. The reduction is a polynomial-time algorithm that takes an instance (ϕ1, ϕ2) of the embedding problem as input and outputs the instance (ϕ1, ϕ2, d) with d = |ϕ2| − |ϕ1|, i.e., the instance asking to verify the equality δ(ϕ1, ϕ2) = |ϕ2| − |ϕ1|. (4) If ϕ1 is embedded into ϕ2, then ϕ1 is an infimum of ϕ1 and ϕ2, so δ(ϕ1, ϕ2) = |ϕ1| + |ϕ2| − 2|ϕ1| = |ϕ2| − |ϕ1| by Proposition 2 and, hence, (4) holds.
Conversely, suppose that (4) holds. By Definition 1, δ(ϕ1, ϕ2) = |ϕ1′ △ ϕ2′| for some formulas ϕ1′ and ϕ2′ such that ϕ1′ ≅ ϕ1 and ϕ2′ ≅ ϕ2. Hence, using (4), we obtain |ϕ1′ △ ϕ2′| = |ϕ2′| − |ϕ1′|. By properties of the symmetric difference, the latter equality holds only if ϕ1′ is a subformula of ϕ2′. Therefore, ϕ1 is embedded into ϕ2, which completes the proof. □
Proposition 5. The distance computation problem is at least as hard as the hardest problems in NP.
Proof: There is a trivial polynomial-time Turing reduction of the distance verification problem (shown above to be NP-hard) to the distance computation problem. □
Proposition 6. The distance verification problem can be solved by an oracle algorithm A with the following properties:
the oracle of A answers questions of whether two given formulas are isomorphic;
there exists a polynomial p such that on every instance, the running time of A is at most p(s) · s^d, where s is the size of the instance and d is the integer given in the instance.
Proof: The distance verification problem can be solved using the equivalent definitions of δ given by Proposition 2. For instance, we can use equality (3), which expresses δ in terms of infima: δ(ϕ1, ϕ2) = |ϕ1| + |ϕ2| − 2|𝜃|, where 𝜃 is an infimum of ϕ1 and ϕ2. In fact, this equality shows that distance verification is equivalent to “infimum verification”: instead of verifying whether δ(ϕ1, ϕ2) = d, we can verify whether ϕ1 and ϕ2 have an infimum 𝜃 that contains k clauses, where d = |ϕ1| + |ϕ2| − 2k (if no such integer k exists, the answer is immediately negative). The algorithm A takes (ϕ1, ϕ2, d) as input and uses brute force to check whether such an infimum exists. Namely, A takes the following two steps.
First, A enumerates pairs (ϕ1′, ϕ2′), where ϕ1′ is a k-subset of ϕ1 and ϕ2′ is a k-subset of ϕ2. For each such pair, A asks the oracle whether ϕ1′ and ϕ2′ are isomorphic. If the oracle’s answer is positive, an infimum of cardinality at least k exists. The number of queries to the oracle is at most C(|ϕ1|, k) · C(|ϕ2|, k) ≤ s^d: choosing a k-subset amounts to choosing the clauses left out, and the two left-out sets have |ϕ1| − k and |ϕ2| − k clauses, with (|ϕ1| − k) + (|ϕ2| − k) = d. Second, in a similar way, A checks whether there exist a (k+1)-subset of ϕ1 and a (k+1)-subset of ϕ2 such that these subsets are isomorphic. If no such subsets exist, there is an infimum of cardinality exactly k. The total number of queries to the oracle in the two steps does not exceed 2s^d, and each query is processed in polynomial time. Therefore, the overall running time is at most p(s) · s^d for some polynomial p. □
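The two brute-force steps can be sketched as follows, with the oracle abstracted as a callable. The integer-literal encoding and the equality-based stand-in oracle used in the demo are assumptions for illustration, not the real isomorphism oracle:

```python
from itertools import combinations

def verify_distance(phi1, phi2, d, iso_oracle):
    """Decide whether delta(phi1, phi2) = d by checking that phi1 and phi2
    have an infimum with exactly k clauses, where d = |phi1| + |phi2| - 2k.
    iso_oracle(a, b) answers whether formulas a and b are isomorphic."""
    total = len(phi1) + len(phi2)
    if (total - d) % 2 or total - d < 0:
        return False  # no integer k with d = |phi1| + |phi2| - 2k
    k = (total - d) // 2
    if k > min(len(phi1), len(phi2)):
        return False

    def common_subformula_of_size(j):
        return any(
            iso_oracle(frozenset(a), frozenset(b))
            for a in combinations(phi1, j)
            for b in combinations(phi2, j)
        )

    # step 1: an infimum of cardinality at least k exists;
    # step 2: no common embedded subformula has k + 1 clauses
    return common_subformula_of_size(k) and not common_subformula_of_size(k + 1)

# demo: plain equality is a crude stand-in for the isomorphism oracle
same = lambda a, b: a == b
phi1 = frozenset({frozenset({1}), frozenset({2})})
phi2 = frozenset({frozenset({2}), frozenset({3})})
```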
The algorithm in Proposition 6 uses an oracle for deciding the formula isomorphism problem (given two formulas, are they isomorphic?). It is well known that this problem is polynomially equivalent to the graph isomorphism problem [3]. It is easy to modify the algorithm A so that its oracle for the formula isomorphism problem is replaced with an oracle for the graph isomorphism problem. Let A′ be the modified algorithm; the running time of A′ remains the same as that of A up to a polynomial factor.
The complexity class GI consists of problems that have a polynomial-time Turing reduction to the graph isomorphism problem; see, for example, [12]. It is known that the graph isomorphism problem can be solved in quasi-polynomial time [4] and, in practice, it is often solved efficiently using Nauty tools [14].
Proposition 7. For every integer d, the restriction of the distance verification problem to instances (ϕ1, ϕ2, d) with this fixed d is in GI.
Proof: For each such restriction, the algorithm A′ described above runs in polynomial time and is, therefore, a polynomial-time Turing reduction to the graph isomorphism problem. □
Parameters of Constant Variation
By a parameter of formulas we mean any computable function from the set of formulas to the non-negative integers. Natural examples of parameters are the number of clauses, the number of variables, and the number of satisfying assignments. In this section we define a class of parameters called parameters of constant variation. This type of parameter is needed for the next section, where we describe how the distance δ can be used for solving. We give only two examples of constant-variation parameters, namely incidence treewidth and maximum deficiency, but many parameters for which the correspondingly parameterized version of SAT is fixed-parameter tractable are parameters of constant variation.
Definition 2 (parameter of constant variation).
We say that a parameter π is a parameter of constant variation if it has the following two properties:
for all formulas ϕ1 and ϕ2, if ϕ1 ≅ ϕ2 then π(ϕ1) = π(ϕ2);
there exists a number c such that for every formula ϕ and every formula ϕ′ obtained from ϕ by adding one clause, we have |π(ϕ′) − π(ϕ)| ≤ c.
The number c is called a bound on the change.
Thus, π is a parameter of constant variation if it is preserved under isomorphism and the addition of one clause to ϕ can change π(ϕ) by only a constant value that does not depend on ϕ. The number of clauses in a formula is a trivial example of a parameter of constant variation. The number of variables, by contrast, is not a parameter of constant variation. Indeed, consider formulas ϕ and ϕ′ where ϕ′ is obtained from ϕ by adding one clause with arbitrarily many new variables. Then there is no constant upper bound on the difference between |var(ϕ)| and |var(ϕ′)|.
To further illustrate the notion of constant-variation parameters, we give two examples: incidence treewidth and maximum deficiency. Both parameters are well known in connection with fixed-parameter tractability of SAT.
Incidence Treewidth. For every formula ϕ, the incidence graph of ϕ is a bipartite graph G defined as follows. The set of vertices of G is partitioned into two subsets: one of them is the set of variables of ϕ and the other is the set of clauses of ϕ. There are only edges between variables and clauses: a variable x is connected by an edge with a clause C if and only if x appears in C (with or without negation). The treewidth of the incidence graph of ϕ is called the incidence treewidth of ϕ and is denoted by tw(ϕ).
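Under the same assumed encoding of formulas (clauses as frozensets of integer literals), the incidence graph can be built as a plain adjacency dictionary:

```python
def incidence_graph(phi):
    """Bipartite incidence graph of phi as an adjacency dict with
    variable nodes ('v', x) and clause nodes ('c', i); an edge connects
    them whenever variable x occurs, negated or not, in clause i."""
    adj = {}
    for i, clause in enumerate(sorted(map(sorted, phi))):
        c = ('c', i)
        adj.setdefault(c, set())
        for lit in clause:
            v = ('v', abs(lit))
            adj.setdefault(v, set())
            adj[c].add(v)
            adj[v].add(c)
    return adj

g = incidence_graph(frozenset({frozenset({1, 2}), frozenset({-1, 2})}))
```

Here both variables occur in both clauses, so every node of the graph has degree two.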
There are various algorithms that solve SAT and #SAT when an input formula is given along with a tree decomposition of its incidence graph. For example, the algorithm from [16] solves #SAT in time exponential in w and polynomial in l, where w is the width of the input tree decomposition and l is the size of the input. Also there are algorithms that build a tree decomposition of a given graph; one of them is given in [1]: it approximates a tree decomposition of minimum width. The combination of these two algorithms solves #SAT in time exponential in tw(ϕ) and polynomial in the size of ϕ.
Proposition 8. Incidence treewidth is a parameter of constant variation.
Proof: If two formulas are isomorphic, then their incidence graphs are isomorphic too and, therefore, incidence treewidth is preserved under isomorphism. We prove that incidence treewidth has the second property from Definition 2 with c = 1: for every formula ϕ and every formula ϕ′ obtained from ϕ by adding one clause, we have |tw(ϕ′) − tw(ϕ)| ≤ 1.
Consider a tree decomposition of the incidence graph of ϕ: a labeled tree T in which every vertex is labeled by a bag of vertices. We show how to extend it to a tree decomposition of the incidence graph of ϕ′. Let C be the “new” clause added to ϕ and let z1, …, zk be the “new” variables, i.e., the variables appearing in C and not occurring in ϕ. These “new” elements must be added to T. The variables are added in a simple way: (1) select any leaf of T, (2) connect it with k new vertices, and (3) assign the one-element bags {z1}, …, {zk} to the new vertices. Finally, we extend all bags (including the bags of the new vertices) by adding the clause C.
It is easy to see that the result of our extension, the labeled tree T′, is a tree decomposition of the incidence graph of ϕ′. Moreover, by our construction, we have w′ ≤ w + 1, where w and w′ are the widths of T and T′ respectively. Taking a tree decomposition T of width tw(ϕ) and using the inequality tw(ϕ) ≤ tw(ϕ′) (the incidence graph of ϕ is a subgraph of the incidence graph of ϕ′), we obtain |tw(ϕ′) − tw(ϕ)| ≤ 1, which completes the proof. □
Maximum Deficiency. The parameter of formulas called maximum deficiency was introduced in [8]; it relates satisfiability of a formula to matchings in its incidence graph.
Let ϕ be a formula and let M be a matching in its incidence graph. For each edge in M that connects a variable x and a clause C, we can assign x a truth value that makes C true. Therefore, if all clauses of ϕ are matched by M, then ϕ is satisfiable. The maximum deficiency of a formula ϕ, denoted by md(ϕ), is the number of clauses unmatched by a maximum matching in the incidence graph of ϕ (this number remains the same whatever maximum matching is taken). Since a maximum matching can be found in polynomial time, md(ϕ) is computable in polynomial time. Equivalently, due to Hall’s marriage theorem, maximum deficiency can be defined as follows: md(ϕ) = max (|ψ| − |var(ψ)|), where the maximum is taken over all subformulas ψ of ϕ. Note that md(ϕ) ≥ 0 for all ϕ because ⊤ is a subformula of every formula. The satisfiability problem parameterized by maximum deficiency is fixed-parameter tractable: the algorithms in [7,17] solve SAT in time exponential in md(ϕ) and polynomial in l, where l is the size of ϕ.
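As an illustration, the following sketch computes maximum deficiency directly from the matching definition, using Kuhn's augmenting-path algorithm on the clause–variable incidence graph (the frozenset-of-integer-literals encoding is an assumption):

```python
def maximum_deficiency(phi):
    """md(phi): clauses left unmatched by a maximum matching in the
    incidence graph, via Kuhn's augmenting-path algorithm (clauses on
    the left, variables on the right)."""
    clauses = [sorted({abs(l) for l in c}) for c in phi]
    match = {}  # variable -> index of the clause currently matched to it

    def try_augment(i, seen):
        # try to match clause i, rematching earlier clauses if needed
        for x in clauses[i]:
            if x in seen:
                continue
            seen.add(x)
            if x not in match or try_augment(match[x], seen):
                match[x] = i
                return True
        return False

    matched = sum(try_augment(i, set()) for i in range(len(clauses)))
    return len(clauses) - matched

# three clauses over two variables: one clause must stay unmatched
phi = frozenset({frozenset({1}), frozenset({-1, 2}), frozenset({-2})})
```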
Proposition 9. Maximum deficiency is a parameter of constant variation.
Proof: First, maximum deficiency is preserved under isomorphism. Second, it is obvious that the addition of one clause to a formula ϕ can increase md(ϕ) by at most one and, therefore, the second property from Definition 2 holds with c = 1. □
Characterization in Terms of Distance. The proposition below gives an equivalent definition of parameters of constant variation in terms of the distance δ.
Proposition 10. Let π be a parameter and let c be a number. The following two statements are equivalent:
π is a parameter of constant variation and c is a bound on the change;
for all formulas ϕ1 and ϕ2, |π(ϕ1) − π(ϕ2)| ≤ c · δ(ϕ1, ϕ2). (6)
Proof: First, suppose that (6) holds. Then ϕ1 ≅ ϕ2 implies δ(ϕ1, ϕ2) = 0 and, hence, π(ϕ1) = π(ϕ2), which means that π is preserved under isomorphism. If ϕ′ is obtained from ϕ by adding one clause, then δ(ϕ, ϕ′) ≤ 1 and, therefore, we have the second property from Definition 2: |π(ϕ′) − π(ϕ)| ≤ c.
Second, suppose that π is a parameter of constant variation and c is a bound on the change. Let ϕ1 and ϕ2 be arbitrary formulas. By the definition of δ and Proposition 2, there exist four formulas ϕ1′, ϕ2′, ψ, 𝜃 such that
ϕ1′ ≅ ϕ1 and ϕ2′ ≅ ϕ2;
𝜃 is a subformula of both ϕ1′ and ϕ2′;
both ϕ1′ and ϕ2′ are subformulas of ψ;
δ(ϕ1, ϕ2) = |ψ| − |𝜃|.
Thus, ψ is obtained from 𝜃 by the addition of δ(ϕ1, ϕ2) clauses. Since the addition of one clause can change π by at most c, we have |π(ϕ1′) − π(ϕ2′)| ≤ |π(ϕ1′) − π(𝜃)| + |π(𝜃) − π(ϕ2′)| ≤ c(|ϕ1′| − |𝜃|) + c(|ϕ2′| − |𝜃|) = c(|ψ| − |𝜃|), which is inequality (6). □
Note that inequality (6) has the same form as the inequality in the definition of Lipschitz functions (also called Lipschitz continuous functions or Lipschitz maps) [5]. Thus, loosely speaking, π is a parameter of constant variation if and only if π is a Lipschitz function.
Fixed-Distance Selector
There are many techniques for designing a “meta-algorithm” that solves SAT using a given “portfolio” of algorithms [11]. We show how the distance δ can be used in a variant of the per-instance algorithm selection technique. Namely, we describe an incomplete meta-algorithm, called a fixed-distance selector, that takes a formula ψ as input and tries to select an algorithm from the portfolio that performs fast on ψ. Then we formulate conditions on portfolios and prove that if the conditions are met, then the selected algorithm indeed performs efficiently on ψ.
Datasets and Portfolios. We consider algorithms that take formulas as inputs, for example, algorithms that solve SAT or #SAT. Let D be a finite set of formulas called a dataset. With every formula ϕ ∈ D, we associate a non-empty finite set of algorithms that perform fast on ϕ (we do not define here what “fast” means; this is done in the conditions formulated below). Note that the sets associated with different formulas may or may not overlap. Given a dataset D, the set of all algorithms associated with its formulas is called the portfolio corresponding to D. Thus, for every formula ϕ ∈ D, there is at least one algorithm in the portfolio that performs fast on ϕ. Likewise, every algorithm in the portfolio performs fast on at least one formula from the dataset.
Fixed-Distance Selection. We call the following simple meta-algorithm the fixed-distance selector and denote it by S. The algorithm uses a dataset D and the corresponding portfolio P. Let ψ be a formula and let d be a non-negative integer. Taking ψ and d as input, the selector S either outputs an algorithm or returns “can’t select”:
Find a formula ϕ ∈ D such that δ(ψ, ϕ) ≤ d and δ(ψ, ϕ) is the minimum over all formulas of D. If there is no such formula ϕ, return “can’t select”.
Select any algorithm A from the set associated with ϕ and return it.
We do not specify how to find the formulas of D closest to ψ and how to choose one of them. The straightforward way of finding ϕ is exhaustive search: for each d′ ∈ {0, 1, …, d}, enumerate all formulas ϕ ∈ D and verify whether δ(ψ, ϕ) = d′ (the distance verification problem). It follows from Proposition 6 that each verification can be done using a polynomial-time oracle algorithm with an oracle for graph isomorphism. Thus, this exhaustive search requires at most (d + 1) · N runs of the oracle algorithm, where N is the cardinality of the dataset.
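A minimal sketch of the selector loop, with the dataset, portfolio, and distance function supplied by the caller; the function name and the toy stand-in distance used in the demo are illustrative assumptions, not the paper's definitions:

```python
def fixed_distance_selector(psi, d, dataset, portfolio, delta):
    """Sketch of the selector S. `dataset` is a list of formulas,
    `portfolio` maps each dataset formula to a non-empty list of
    algorithms that perform fast on it, and `delta` is the distance."""
    best = None
    for phi in dataset:
        dist = delta(psi, phi)
        if dist <= d and (best is None or dist < best[0]):
            best = (dist, phi)
    if best is None:
        return "can't select"
    # select any algorithm associated with the closest formula
    return portfolio[best[1]][0]

# toy demo: a crude stand-in distance (difference in clause counts)
toy_delta = lambda a, b: abs(len(a) - len(b))
d1 = frozenset({frozenset({1})})
d2 = frozenset({frozenset({1}), frozenset({2})})
portfolio = {d1: ["solver_a"], d2: ["solver_b"]}
psi = frozenset({frozenset({3}), frozenset({4})})
```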
What is the role of the input number d, should it be small or large? Proposition 11 gives an upper bound on the running time of the selected algorithm A on ψ. This bound depends on d: the smaller d is, the faster A is. On the other hand, assuming that N is fixed, if d is too small, this increases the chance that S returns “can’t select”. Thus, a good tradeoff for choosing d is needed.
Algorithms Matching with Parameters. An efficient performance of an algorithm A on a given class of formulas is typically due to the fact that A makes use of some “hidden structure” of formulas of this class. Different algorithms exploit different hidden structures and, thereby, they are successful on different classes of formulas. Hidden structures can be utilized explicitly or implicitly. For example, the fixed-parameter tractable algorithms mentioned in Section 5 make explicit use of certain properties of the incidence graph: the existence of a tree decomposition of small width and a small maximum deficiency. The CDCL solvers in [2,13] can be viewed as examples of implicit use of community structure.
A hidden structure is often characterized by a parameter π such that the value π(ψ) determines how fast an algorithm A runs on ψ: the smaller π(ψ) is, the faster A runs on ψ. For most such parameters, the running time of A on ψ is exponential in π(ψ) and polynomial in the size of ψ [6,15]. We say that A matches with π if there exist a number b and a polynomial q such that for every formula ψ, the running time of A on ψ is at most b^π(ψ) · q(l), where l is the size of ψ.
Note two differences between an algorithm A matching with a parameter π and a fixed-parameter algorithm. First, the input to A is a formula, while the input to a fixed-parameter algorithm is a formula along with a value of the parameter. Second, the upper bound for A has the exponential factor b^π(ψ) instead of an arbitrary factor f(π(ψ)) as in the case of a fixed-parameter algorithm.
Conditions on Datasets and Portfolios. Let D be a dataset and P be the corresponding portfolio. Let a, b, and c be numbers and let q be a polynomial. We say that the tuple (a, b, c, q) is a bound for D and P if for every formula ϕ ∈ D and every algorithm A associated with ϕ, there exists a parameter π such that the following three conditions are met:
Condition 1. π(ϕ) ≤ a.
Condition 2. The algorithm A matches with π, and the corresponding base and polynomial are bounded from above by b and q respectively: for every formula ψ, the running time of A on ψ does not exceed b^π(ψ) · q(l), where l is the size of ψ.
Condition 3. The parameter π is a parameter of constant variation and the corresponding bound on the change is less than or equal to c.
Proposition 11. Let D be a dataset and P be the corresponding portfolio used by the fixed-distance selector S described above. Suppose that the tuple (a, b, c, q) is a bound for D and P. Given a formula ψ and an integer d as input, let A be the algorithm returned by S on this input. Then the running time of A on ψ is at most b^(a + cd) · q(l), where l is the size of ψ.
Proof: Let ϕ ∈ D be the formula chosen by S as a formula closest to the input formula ψ. By Conditions 2 and 3, there is a parameter π such that
π is a parameter of constant variation with a bound on the change of at most c;
A runs on ψ in time at most b^π(ψ) · q(l).
By Proposition 10 we have π(ψ) ≤ π(ϕ) + c · δ(ψ, ϕ) ≤ π(ϕ) + cd. By Condition 1 we have π(ϕ) ≤ a. Therefore, π(ψ) ≤ a + cd and the running time of A on ψ is at most b^(a + cd) · q(l). □
Thus, A performs efficiently on ψ: assuming that d in the bound is fixed, there exists a polynomial q′ such that the running time of A on ψ does not exceed q′(l).
Concluding Remarks: Possible Extensions
Isomorphism of Formulas. There are several equivalent definitions of the distance δ (Proposition 2), and all of them use isomorphism of formulas as the central notion. On the other hand, these definitions do not require exactly our notion of isomorphism: two formulas are isomorphic if one of them can be obtained from the other by renaming variables and flipping literals. For example, we could define the distance based on renaming only, or flipping only, or another equivalence relation on formulas. In fact, we could define the distance relative to any equivalence relation on formulas: if I is such an equivalence relation, then δ_I denotes the corresponding distance function on formulas. This distance could be of interest for solving if, first, the satisfiability status is preserved under I and, second, the problem of I-equivalence testing is “easier” than the problem of satisfiability testing (just as the graph isomorphism problem is supposedly easier than the satisfiability problem).
Extension to CSP Instances. The main definitions and results of this paper can be extended from formulas in CNF to instances of the constraint satisfaction problem (CSP instances). The adjustments are obvious: clauses are replaced with constraints and isomorphism of formulas is replaced with isomorphism of CSP instances.
There are several variants of isomorphism of CSP instances; we describe the most common one. Let V be a finite set of variables that take values from a finite domain D. A constraint of arity k is a pair (s, R), where s is a k-ary tuple of variables from V and R is a k-ary relation on D. A CSP instance is a triple (V, D, C), where C is a finite set of constraints over V and D. Two CSP instances (V1, D1, C1) and (V2, D2, C2) are isomorphic if there are bijections f : V1 → V2 and g : D1 → D2 such that the set of constraints C2 is obtained from C1 by replacing each variable v and each value a with their images f(v) and g(a) respectively.
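Given candidate bijections f and g, checking that they witness such an isomorphism is mechanical. The sketch below assumes instances encoded as triples (V, D, C) with each constraint a pair of a variable tuple and a relation given as a set of value tuples:

```python
def csp_isomorphic_via(f, g, inst1, inst2):
    """Check that the variable bijection f and the value bijection g
    witness an isomorphism between the two CSP instances."""
    V1, D1, C1 = inst1
    V2, D2, C2 = inst2
    # f and g must be bijections between the variable sets and domains
    if sorted(f) != sorted(V1) or sorted(f.values()) != sorted(V2):
        return False
    if sorted(g) != sorted(D1) or sorted(g.values()) != sorted(D2):
        return False
    # replacing each variable and value by its image must yield C2
    image = {
        (tuple(f[v] for v in scope),
         frozenset(tuple(g[a] for a in row) for row in rel))
        for scope, rel in C1
    }
    return image == {(scope, frozenset(rel)) for scope, rel in C2}

# a binary "not equal"-style constraint, renamed on both levels
inst1 = (['x', 'y'], [0, 1], [(('x', 'y'), {(0, 1), (1, 0)})])
inst2 = (['u', 'w'], ['a', 'b'], [(('u', 'w'), {('a', 'b'), ('b', 'a')})])
f = {'x': 'u', 'y': 'w'}
g = {0: 'a', 1: 'b'}
```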
Using this variant of isomorphism, the definitions and propositions of Section 3 are easily carried over from formulas in CNF to CSP instances. Thus, the distance between two CSP instances is the minimum number of constraints that can be removed from them so that the remaining sub-instances are isomorphic to each other. Or, equivalently, this distance is the minimum number of constraints that can be added to them so that the resulting “super-instances” are isomorphic to each other. The definitions and propositions regarding parameters of constant variation, their characterization in terms of the distance, and fixed-distance selection are carried over to CSP instances as well.
Acknowledgement
We thank the anonymous reviewers for their targeted comments that helped us significantly improve the manuscript.
References
1.
Amir, E., Approximation algorithms for treewidth, Algorithmica 56(4) (2010), 448–479. doi:10.1007/s00453-008-9180-4.
2.
Ansótegui, C., Bonet, M.L., Giráldez-Cru, J. and Levy, J., Structure features for SAT instances classification, Journal of Applied Logic 23 (2017), 27–39. doi:10.1016/j.jal.2016.11.004.
3.
Ausiello, G., Cristiano, F., Fantozzi, P. and Laura, L., Syntactic isomorphism of CNF Boolean formulas is graph isomorphism complete, in: Proceedings of the 21st Italian Conference on Theoretical Computer Science, CEUR Workshop Proceedings, Vol. 2756, CEUR-WS.org, 2020, pp. 190–201.
4.
Babai, L., Graph isomorphism in quasipolynomial time [extended abstract], in: Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, ACM, 2016, pp. 684–697.
5.
Burago, D., Burago, Y. and Ivanov, S., A Course in Metric Geometry, Graduate Studies in Mathematics, Vol. 33, American Mathematical Society, 2001.
6.
Cygan, M., Fomin, F.V., Kowalik, L., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M. and Saurabh, S., Parameterized Algorithms, Springer, 2015.
7.
Fleischner, H., Kullmann, O. and Szeider, S., Polynomial-time recognition of minimal unsatisfiable formulas with fixed clause-variable difference, Theoretical Computer Science 289(1) (2002), 503–516. doi:10.1016/S0304-3975(01)00337-1.
8.
Franco, J.V. and Gelder, A.V., A perspective on certain polynomial-time solvable classes of satisfiability, Discrete Applied Mathematics 125(2–3) (2003), 177–214. doi:10.1016/S0166-218X(01)00358-4.
9.
Gomes, C.P., Sabharwal, A. and Selman, B., Model counting, in: Handbook of Satisfiability, 2nd edn, Frontiers in Artificial Intelligence and Applications, Vol. 336, IOS Press, 2021, pp. 993–1014, Chapter 25.
10.
Gromov, M., Metric Structures for Riemannian and Non-Riemannian Spaces, Progress in Mathematics, Vol. 152, Birkhäuser, 1999, based on the 1981 French original.
11.
Hoos, H.H., Hutter, F. and Leyton-Brown, K., Automated configuration and selection of SAT solvers, in: Handbook of Satisfiability, 2nd edn, Frontiers in Artificial Intelligence and Applications, Vol. 336, IOS Press, 2021, pp. 481–507, Chapter 12.
12.
Köbler, J., Schöning, U. and Torán, J., The Graph Isomorphism Problem: Its Structural Complexity, Progress in Theoretical Computer Science, Birkhäuser/Springer, 1993.
13.
Li, C., Chung, J., Mukherjee, S., Vinyals, M., Fleming, N., Kolokolova, A., Mu, A. and Ganesh, V., On the hierarchical community structure of practical Boolean formulas, in: Proceedings of the 24th International Conference on Theory and Applications of Satisfiability Testing, SAT 2021, Lecture Notes in Computer Science, Vol. 12831, Springer, 2021, pp. 359–376. doi:10.1007/978-3-030-80223-3_25.
14.
McKay, B.D. and Piperno, A., Practical graph isomorphism, II, Journal of Symbolic Computation 60 (2014), 94–112. doi:10.1016/j.jsc.2013.09.003.
15.
Samer, M. and Szeider, S., Fixed-parameter tractability, in: Handbook of Satisfiability, 2nd edn, Frontiers in Artificial Intelligence and Applications, Vol. 336, IOS Press, 2021, pp. 693–736, Chapter 17.
16.
Slivovsky, F. and Szeider, S., A faster algorithm for propositional model counting parameterized by incidence treewidth, in: Proceedings of the 23rd International Conference on Theory and Applications of Satisfiability Testing, SAT 2020, Lecture Notes in Computer Science, Vol. 12178, Springer, 2020, pp. 267–276. doi:10.1007/978-3-030-51825-7_19.
17.
Szeider, S., Minimal unsatisfiable formulas with bounded clause-variable difference are fixed-parameter tractable, Journal of Computer and System Sciences 69(4) (2004), 656–674. doi:10.1016/j.jcss.2004.04.009.