Abstract
Deep Deductive Reasoning refers to the training, and then executing, of deep learning systems to perform deductive reasoning in the sense of formal, mathematical logic. We discuss why this is an interesting and relevant problem to study, and explore how hard it is as a deep learning problem. In particular, we present some of the progress made on this topic in recent years, review some of the theoretical limitations that can be derived from existing literature, and discuss negative results we have obtained regarding improving on the state of the art.
Introduction
Deductive reasoning over formal logics is arguably one of the oldest and most central pursuits of artificial intelligence research.
An important aspect of Knowledge Representation (KR) research is to find correct and scalable algorithms for deductive reasoning over KR-relevant logics. However, the mathematical definitions often do not easily translate into algorithms. Indeed, algorithms for more expressive logics tend to be so involved that they often need correctness proofs. Furthermore, they tend to be of high computational complexity, if the problem of computing logical consequences is at all decidable. Indeed, some prominent logics, such as first-order predicate logic, are not decidable but only semi-decidable, and some higher-order logics, including some non-monotonic logics of importance to KR, are not even semi-decidable [12]. Among the decidable logics are many from the Description Logics family [4] that are relevant for KR-based data management, and some of which have become W3C standards for the Semantic Web research field [20]. Their complexities can be very high, e.g., the OWL 2 DL standard is based on the Description Logic SROIQ, for which the standard reasoning tasks are N2ExpTime-complete.
Deep Deductive Reasoning (DDR) [14] refers to the training, and then executing, of deep learning (DL) systems to perform deductive reasoning. There are several reasons why the study of DDR is a worthwhile undertaking. Fundamentally, DDR presents us with a set of related problems of differing difficulty, at scales that can be chosen arbitrarily, and with mostly easy-to-obtain training data, and thus makes it possible to explore the capabilities and limitations of DL in what resembles a controlled laboratory setting. A different perspective on the problem is motivated from a cognitive science viewpoint, in that KR and DL can be understood as simple formal abstractions of cognition and of the human brain, respectively: while we humans are capable of performing deductive reasoning, we would like to understand if and how this can be done with DL systems. A third perspective is more applied: DDR, if it could be made to work with high accuracy and at scale, may lead to improvements over conventional deductive reasoning algorithms through better tolerance of errors and noise in the data (which conventional algorithms do not handle well) and to improved reasoning speed (in the light of high computational complexities). At the same time, capturing deductive reasoning in DL systems opens avenues for neurosymbolic architectures that truly combine learning and reasoning, without recourse to hybrid architectures.
The rest of the paper is organized as follows. In Section 2, we will give a brief overview of existing DDR research. In Section 3 we will discuss theoretical limitations on DDR that can be derived from the literature. In Section 4 we report on some of our failed attempts to improve DDR, which we include to emphasize how difficult the problem is, and in order to aid other researchers by reporting on paths that appear to be dead ends. In Section 5 we conclude.
Brief literature review
There are a number of particularly important features when assessing DDR approaches and systems:
– The specific logic that is being reasoned over, and in some cases also the reasoning task, as different reasoning tasks have differing complexities for some logics. For example, in logic programming [24], often only the derivation of logical consequences that are ground atoms (facts) is considered.
– The way the deductive reasoning problem is cast as a deep learning problem. In order to do DDR, the deductive reasoning problem needs to be transformed into a problem suitable for DL, and there appear to be two different principled ways to do this. The first is to phrase DDR as a classification problem: given a knowledge base and a candidate statement, decide whether the statement is a logical consequence of the knowledge base. The second is to phrase it as a generative problem: given a knowledge base, produce its logical consequences. (The sketch following this list contrasts the two phrasings.)
– Transferability. In deductive reasoning, the names of identifiers (constants, predicates, etc.) are insubstantial in that a consistent renaming within a knowledge base does not change the deductive reasoning results or procedures. Symbolic algorithms are likewise independent of identifier names. Ideally, DDR systems would have the same capability, i.e., transfer to inputs over new and previously unseen vocabularies.
– Scalability, which has at least three distinct variants: scalability in terms of input size, as application-oriented knowledge bases sometimes have hundreds, and sometimes millions or even billions of logical statements, so that for DDR systems it is important to note what input sizes can be ingested while accuracy remains high; scalability in terms of training time needed, which of course is related to the input sizes considered; and scalability in terms of the runtime of the trained DDR system.
– Finally, the accuracy of a DDR system, which is probably best assessed in terms of precision, recall, and F-measure.
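To make the two problem phrasings concrete, the following minimal Python sketch (our own illustration, not taken from any of the surveyed systems) contrasts them as interface signatures; KnowledgeBase and Statement are hypothetical placeholder types.

```python
from typing import Protocol, Set


class Statement:
    """A single logical statement, e.g., an RDF triple or a DL axiom."""


class KnowledgeBase:
    """A finite set of logical statements."""


class ClassificationDDR(Protocol):
    def entailed(self, kb: KnowledgeBase, candidate: Statement) -> bool:
        """Decide whether `candidate` is a logical consequence of `kb`."""


class GenerativeDDR(Protocol):
    def consequences(self, kb: KnowledgeBase) -> Set[Statement]:
        """Produce (ideally all) logical consequences of `kb`."""
```

The classification phrasing requires candidate consequences to be supplied from outside, whereas the generative phrasing must construct them, which is why the two differ substantially in difficulty.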
Considering these dimensions, we provide a comparison of existing DDR approaches of which we are aware, summarized in Table 1.
In [19], the goal was to use LSTMs and CNNs to approximate stream reasoning with C-SPARQL (Continuous SPARQL). C-SPARQL can only process a limited number of triples per second; as queries become more complicated, the time to provide answers increases, and when the number of incoming triples per second exceeds the limit, C-SPARQL produces incorrect results. Reasoning time thus grows with the amount of data, which motivates an approximation method. The results show that for highly complicated queries the LSTM provides better results, while the CNN performs slightly better on other queries and requires less time than the LSTM.

A Recursive Reasoning Network (RRN) is proposed as the deep learning model for ontology reasoning in [27]. Knowledge bases with the same ontology but different facts are used for training, so regarding transferability, the authors note that the trained model can be used on different databases with similar vocabularies (unseen knowledge bases, but over the same ontology); we therefore consider it non-transferable in our sense.

In [8], a model based on Logic Tensor Networks (LTNs) is applied for reasoning over first-order predicate logic (FOL). The authors claim that this model is well suited for simple reasoning, and also that, after training, it can reason over combinations of axioms it was not trained on.

A bidirectional GRU-based model is proposed in [36] for noise-tolerant symbolic reasoning. The results show that accuracy is high for classes with thousands of instances; for classes with few instances, on the other hand, the model does not perform well, as is to be expected in deep learning. In [35], a model is proposed that, besides producing inferences, also provides explanations for how the inferences are generated; the model consists of a parallel path using bidirectional GRUs.

A non-generative but transferable model based on memory networks is developed in [16]. For this purpose, RDF triples are stored and the model is trained to embed the knowledge graph so that it can predict the query result. Note that this is the only approach to date that has been shown to be highly transferable, with high accuracy and moderate scalability. In [13], an LSTM-based noise-resistant model is presented for learning completion-rule reasoning over the description logic EL+; we discuss it in more detail in Section 4.
Table 1. Comparison of DDR approaches.
We note, in particular, that none of the current systems is both generative and able to transfer. Furthermore, scalability is generally below many requirements in practice. The exception may be [54] – however this system makes use of only small TBoxes (which are the actual logical axioms), i.e., only the ABox (essentially, the non-axiom facts) is large.
While it is not our focus herein, we would like to mention that some studies have focused on investigating reasoning within propositional logic. For instance, in [47] efforts were made to fine-tune GPT-2 and GPT-3 to emulate resolution rules applicable to propositional logic programs, and in [18] the issue of entailment within propositional logic is addressed.
Also related is research in automated theorem proving involving DL, which can be categorized into two main groups. Firstly, there are hybrid approaches that combine neural network encodings with conventional deduction methods [28,29,34]. Secondly, there exist fully deep learning-based models such as the one proposed by [45], which presents a DNN architecture customized for automated proof synthesis, estimating the probability of applying an inference rule at each step of the proof. Additionally, in [10,46], a graph neural network-based model is proposed for the classification and resolution of SAT problems.
Overall it seems fair to say that, at present, research results do not yet meet realistic application requirements, in particular for KR applications.
Theoretical limitations
Deductive reasoning problems have a long history as complete problems for complexity classes, and those in knowledge representation tend to possess high complexity. It is therefore natural to expect that DDR is subject to severe theoretical limitations, and in this section we discuss what limitations can in fact be derived from the existing literature.
Deep learning can be studied through several abstract models. The most direct and literal is the analytic model (see e.g., [6,52]), in which neural networks are a class of functions of real variables, studied in terms of approximation, convergence, etc., with computational considerations kept to a minimum. An advantage of this point of view is its ability to see specifics of different architectures that would be washed out in other models, letting us more often compare predictions against empirical results. However, it is not well suited to study hardness in absolute terms: typically one can prove facts such as rapid convergence-in-probability of stochastic gradient descent, or lack thereof, but this only concerns the performance of one algorithm (e.g., stochastic gradient descent) at a time.
The other obvious model is boolean circuit computation [50]. We represent neural networks as discrete functions acting on boolean vectors, and so all considerations are quite different. Typically some proxy such as “majority threshold circuits of constant depth” (TC^0) stands in for the class of functions computable by neural networks.
The circuit model is more germane to the study of DDR, because deductive reasoning problems are discrete, but there is another issue that bears mentioning. Whereas a Turing machine has a finite number of states, a circuit family, with enough circuits to process inputs of any length, is an infinite one [51] because a different circuit is required for each input length. It is thus possible to decide some nonrecursive languages with circuits. This feature, called “nonuniformity”, is a plague on lower-bound proofs; proving problems too hard for circuits is generally harder than proving them too hard for Turing machines with similar resource bounds. We do not understand whether this reflects a significant phenomenon or just a gap in our knowledge.
When a decision (classification) problem is not in a nonuniform complexity class, such as P/poly, we can conclude that no family of networks of the corresponding size computes it exactly, regardless of how the networks were obtained or trained.
We have neglected the issue that deep learning is inherently random and fuzzy. Since most known facts about hardness of logic concern exact decision problems, with no provision for error, the easiest type of limitations to prove are those of exactly-correct networks. These results mostly just confirm the intuition that deep learning cannot circumvent worst-case computational hardness.
For deep learning in general, one might argue that we do not have a computational model for the knowledge in a dataset of arbitrary real-world examples because thinking about the example generator as a Turing machine is misleading – we do not know whether the universe can be modeled as a Turing machine.
Even for DDR, perhaps the bottleneck really is just the uninsightful ways we can automatically generate ground truth, and if we trained on example deductions made by human experts we could do better.
If this is so, then perhaps we ought to model neural networks as if training data is arbitrary. With this approximation made, the only circuit-based limits to their power are nonuniform ones; but then there is no possibility of stochastic models performing better than exact ones, since randomness and error tolerance give no advantage to nonuniform computers.
It is almost trivial to prove that logics with undecidable entailment problems are out of reach for exact DDR, as long as the entire pipeline of training and inference is an effective, i.e., computable, procedure.
Weaker logics provide more interesting problems. It is widely conjectured, though hard to prove (it implies P ≠ NP), that NP-complete problems such as propositional satisfiability are not in P/poly, i.e., cannot be decided by polynomial-size circuit families, and hence not by polynomial-size networks.
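For concreteness, the argument just sketched can be written out as follows; here ENT_L denotes the entailment decision problem of a logic L (our notation), and the implication is a routine consequence of the definitions rather than a result specific to DDR.

```latex
% If entailment in L is NP-hard and NP is not contained in P/poly,
% then entailment in L is not in P/poly either:
\[
  \mathrm{ENT}_{\mathcal{L}}\ \text{is NP-hard}
  \;\wedge\;
  \mathsf{NP} \not\subseteq \mathsf{P/poly}
  \;\Longrightarrow\;
  \mathrm{ENT}_{\mathcal{L}} \notin \mathsf{P/poly},
\]
% and hence no family $(N_n)_{n\in\mathbb{N}}$ of networks whose size is
% polynomial in the input length $n$ decides $\mathrm{ENT}_{\mathcal{L}}$
% exactly on all inputs.
```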
One pertinent compromise model between uniform and nonuniform computation is that of uniformly generated circuit families, in which a single efficient algorithm produces the circuit for each input length; this matches the practice of obtaining networks of different sizes from one algorithmic training procedure.
Overall, the verdict of theory (as of the time of writing) is that the limitations of deep learning for logic problems are not well understood. Since these problems occur at many levels of the complexity hierarchy, they provide an interesting bellwether for the hardness of deep learning in general. Indeed, all the results we have been able to cite above have been excessively general; if there is something particularly challenging for practical neural systems about reasoning, as compared to other computationally hard tasks, this remains to be explained by future research.
Practical hardness
In principle, DDR should represent a deep learning problem that is readily investigated. Using symbolic reasoners it is straightforward to produce vast amounts of training data algorithmically. Given the promises of deep learning as a cure-all for all manner of complex problems, it should be expected that, at least for simple logics, solutions for DDR should be findable without too much hassle. It turns out that this is not the case at all, even when staying within the bounds of the theoretical limitations discussed above. In this section, we report on two of our investigations, and their (relative) failures. Given the frequent emphasis in published research on positive results, we take this opportunity to discuss negative results, i.e., failure to solve DDR in some settings.
LSTMs for EL+ reasoning
We briefly discuss the investigation reported in detail in [13] regarding the use of LSTMs for EL+ reasoning.
It was our intention to avoid, as much as possible, embeddings or any distance adjustments that might help the system learn answers based on embedding similarity without first learning the reasoning structure.
In fact, a primary feature of our network is its lack of complexity. We take a very simple logic, reason over knowledge bases in that logic, and then extract so-called supports from the reasoning process.
LSTMs are a type of recurrent neural network that work well with data that has long-term dependencies [26]. We choose to use an LSTM to learn the reasoning patterns of the EL+ completion algorithm.

Fig. 1. Piecewise architecture (top); deep architecture (middle); flat architecture (bottom).
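As an illustration of the “flat” variant, the following is our own minimal PyTorch sketch, not the exact architecture of [13]; the hidden size and the fixed number of output tuples are illustrative assumptions.

```python
import torch
import torch.nn as nn


class FlatReasoner(nn.Module):
    """Minimal sketch: one LSTM reads the encoded knowledge base (a
    sequence of scaled floats) and a linear head regresses a fixed-size
    block of encoded consequence tuples."""

    def __init__(self, hidden: int = 128, out_tuples: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_tuples * 4)  # 4 floats per statement

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, 1), with entries scaled into [0, 1]
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])  # (batch, out_tuples * 4)


# usage sketch: one knowledge base with 5 encoded 4-tuple statements
model = FlatReasoner()
kb = torch.rand(1, 5 * 4, 1)  # 5 statements, 4 scaled floats each
pred = model(kb)              # flat vector of predicted encodings
```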
We would like to emphasize the distinction between our method, which we prefer to call an encoding, and a traditional embedding. Our encoding adds no additional information to the names beyond the transformation from integer to scaled floating point number. (It can be directly reversed with a slight loss of precision due to integer conversion.) The semantics of each encoded predicate name with regard to its respective knowledge base is structural, given by its position in the 4-tuple (just as it would be in an EL+ statement).
Every data set we use has a fixed number of names that it can contain for both roles and concepts, and every concept or role has an arbitrary number of name identifiers. Identifier labels are stripped, and the concept and role numbers are shuffled and substituted with random integers to remove any possibility of the names injecting a learning bias. These labels are remembered for later retrieval and output, but the LSTMs do not see them. Knowledge bases have a maximum number of role and concept names, which is used to scale all of the integer values for names into the range [0, 1].
To input knowledge bases into the LSTM, we first apply the encoding defined above to each of its statements, then concatenate these encodings end-to-end to obtain a much longer vector. For instance, a knowledge base of five statements, each encoded as a 4-tuple, becomes a single vector of twenty scaled floats.
Translation rules: each (normalized) EL+ statement is mapped to a 4-tuple of encoded names, with the position within the tuple indicating the structural role of each name.
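A minimal sketch of the encoding as we understand it from the description above; the function name, the tuple layout, and the treatment of the vocabulary size are our own illustrative choices.

```python
import random


def encode_kb(statements, num_names):
    """Encode a knowledge base as one flat vector of scaled floats.

    `statements` is a list of 4-tuples of integer name IDs; the position
    within each tuple carries the structural meaning of the statement.
    """
    # Consistently rename: shuffle the name IDs to strip any bias that
    # the original numbering might carry. The mapping is remembered so
    # that outputs can be decoded again later.
    ids = list(range(num_names))
    random.shuffle(ids)
    rename = {old: new for old, new in enumerate(ids)}

    encoded = []
    for stmt in statements:
        for name in stmt:
            # Scale each renamed integer into [0, 1]; this is reversible
            # up to integer-conversion precision.
            encoded.append(rename[name] / (num_names - 1))
    return encoded


# usage: two 4-tuple statements over a vocabulary of 10 names
vec = encode_kb([(0, 3, 3, 7), (2, 5, 9, 1)], num_names=10)
print(len(vec))  # 8 floats: the two statements concatenated end-to-end
```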
As already mentioned (and we will not replicate the results tables from [13]), our attempts failed insofar as we only managed to do a little better than a random guesser. Of course this does not mean that the problem is not solvable, but it appears that a priori reasonable set-ups do not readily solve it.
Pointer Networks for RDFS reasoning
Artificial Neural Networks (ANNs) have long been used to learn sequence-to-sequence problems. In particular, Recurrent Neural Networks (RNNs) have achieved state-of-the-art performance on such problems. One key point in such applications is that the dimensions of the output dictionary must be known beforehand. In combinatorial problems such as the Travelling Salesman Problem (or in generative DDR), each input sequence may contain any number of entities, and the size of the output dictionary depends on the input sequence itself (i.e., not only on its length). This distinguishes these problems from the general setting of sequence-to-sequence learning.
To understand the task of (generative) DDR, we shall look at it through the lens of RDF, or more precisely RDFS [23]. It is a logic where each statement consists of a subject, a predicate, and an object, jointly referred to as a triple; a set of triples forms a graph whose nodes are subjects and objects and whose labeled edges are predicates.
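To illustrate the generative task, the following is a tiny fixpoint materialization over two RDFS entailment rules (rdfs11, subclass transitivity, and rdfs9, type propagation), using shorthand predicate names rather than full URIs. This is the kind of ground truth that a symbolic reasoner produces and that a generative DDR system would have to learn to reproduce.

```python
def rdfs_closure(triples):
    """Naive fixpoint over two RDFS rules:
    rdfs11: (a subClassOf b), (b subClassOf c) -> (a subClassOf c)
    rdfs9:  (x type a), (a subClassOf b)       -> (x type b)
    """
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s1, p1, o1) in closure:
            for (s2, p2, o2) in closure:
                if p1 == p2 == "subClassOf" and o1 == s2:
                    new.add((s1, "subClassOf", o2))
                if p1 == "type" and p2 == "subClassOf" and o1 == s2:
                    new.add((s1, "type", o2))
        if not new <= closure:
            closure |= new
            changed = True
    return closure


graph = {("cat", "subClassOf", "mammal"),
         ("mammal", "subClassOf", "animal"),
         ("tom", "type", "cat")}
inferred = rdfs_closure(graph) - graph
# {('cat','subClassOf','animal'), ('tom','type','mammal'),
#  ('tom','type','animal')}
```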
Pointer Networks, introduced by Vinyals, Fortunato, and Jaitly [49], learn the conditional probability of an output sequence in which each token points to an item in the input sequence, using attention to select input tokens as output tokens. On combinatorial optimization problems such as the TSP, convex hull, and Delaunay triangulation, the authors show that the Pointer Network achieves state-of-the-art performance, and it therefore seems a natural choice for exploring generative DDR in this setting. Since, in RDFS reasoning, the input triples define the node-edge-node sections of a graph, and the output of the reasoning establishes previously undefined relationships among nodes using information already provided in the input, the Pointer Network becomes a candidate architecture: its general working principle involves pointing to input tokens as output tokens. This prompted the study reported in [15], with positive initial results. However, as we report in more detail below, these results do not appear to stand up to additional scrutiny.
A Pointer Network has an encoder-decoder architecture with an attention mechanism that uses the positions of tokens in the input sequence to represent the output tokens. The positions of the different tokens in an input sequence are encoded, and during decoding the output sequence is obtained as a selection of input tokens expressed by their respective positions. For more implementation details we refer the reader to [49]. The completion of a graph can then be cast as producing, for each inferred triple, three pointers into the input sequence.
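The core pointing step follows the content-based attention of [49]; the sketch below, in PyTorch, uses our own names and dimensions. The key point is that the attention distribution over the input positions is itself the output distribution, so no fixed output vocabulary is needed.

```python
import torch
import torch.nn as nn


class PointerAttention(nn.Module):
    """One decoding step of a pointer network: instead of projecting
    onto a fixed output vocabulary, the attention distribution over
    the input positions *is* the output distribution."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_enc = nn.Linear(dim, dim, bias=False)
        self.w_dec = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, enc, dec):
        # enc: (batch, seq_len, dim) encoder states over the input tokens
        # dec: (batch, dim) current decoder state
        scores = self.v(torch.tanh(self.w_enc(enc) + self.w_dec(dec).unsqueeze(1)))
        return scores.squeeze(-1).softmax(dim=-1)  # (batch, seq_len)


attn = PointerAttention(dim=16)
enc = torch.randn(2, 10, 16)    # 10 input tokens per sequence
dec = torch.randn(2, 16)
probs = attn(enc, dec)
pointer = probs.argmax(dim=-1)  # index of the input token pointed to
```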
We used the Apache Jena API and an RDF dataset containing approximately 16K graphs to generate the corresponding inferences for each graph, which supervise the model. The dataset is split into 60% training, 20% validation, and 20% test. The dataset is first encoded on the URI level, where a dictionary is created with all the unique URIs. Thereafter, the encoded graphs and inferences are fed to train the model. We have tested with different learning rates between 0.001 and 0.1, and maximum graph lengths between 200 and 700. (We have also experimented with encoding the dataset on the token level, where each token within the URIs is encoded, but the produced output sequences then contained a large number of invalid URIs in which tokens from different URIs were combined.)
To evaluate the approach, the metric needs to be calculated on the triple level instead of the URI level, because a correct prediction requires all URIs of a triple to be correct. In other words, a partially correct prediction for a triple does not count as a correct prediction. In our analysis we have found that the Pointer Network's capability of producing correct triples is very limited: it resulted in less than 10% accuracy in terms of predicting triples. Unsurprisingly, we have found precision and recall to be insignificant as well. Given these negative findings, we also returned to our initial study reported in [15], which showed better results; however, these were computed not on the triple level but on the token level, and it appears that token padding (used to arrive at same-length inputs) played a major role, so the results simply did not hold on the triple level.
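For clarity, triple-level precision and recall can be computed as in the following minimal sketch (our own formulation): a predicted triple counts only if subject, predicate, and object all match a gold triple together.

```python
def triple_level_scores(predicted, gold):
    """Precision/recall over whole triples: a prediction counts only
    if subject, predicate, and object are all correct together."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


p, r = triple_level_scores(
    predicted={("a", "subClassOf", "c"), ("a", "subClassOf", "z")},
    gold={("a", "subClassOf", "c"), ("x", "type", "b")},
)
# p == 0.5, r == 0.5: the partially wrong triple does not count
```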
Any fit-for-purpose ANN-based model must learn the underlying relationships between subjects and objects of a triple when trying to perform deep deductive reasoning. For instance, in the case of a transitive property, where entities A and B stand in a relation and B and C stand in the same relation, the model should be able to conclude that A and C stand in that relation as well. Moreover, it should learn this property in a symbol-invariant fashion: if the same property is fed into the model with a different set of symbols, ideally the model should identify it too. Our speculation for the Pointer Network's poor performance is that the model tries to learn relationships between concrete symbols, which is a difficult job, and consequently fails to remember what it has seen before. We believe the Pointer Network approach works on the Travelling Salesman Problem because there is only one relationship to consider among all nodes (cities) and the number of considered cities is small. In DDR, by contrast, the number of nodes (subjects and objects) is naturally rather large, and, to make things even more difficult, there are as many relations to consider as there are RDFS rules in play; the Pointer Network appears to fail to grasp the relationships because their number is large compared to the overall input size. Failing to remember what it has seen before matters because, in RDFS graphs, the order of triples does not matter: in the transitive-property example, after the model sees the first part of a transitive chain, if the second part only arrives a few triples later, the model fails to connect the two parts.
Conclusions
We have presented the state of the art in DDR, discussed theoretical limitations, and argued that (even within the theoretical limits) it is a hard deep learning problem in practice based on our own prior experiences.
For the theoretical limitations, we have shown that very expressive logics are out of bounds (with 100% accuracy), and even NP-complete ones may be out of bounds (depending on how the P vs. NP question is eventually resolved). Of course, this does not mean that investigating DDR for NP-complete or harder logics should not be done, but it does mean that any DDR system for such logics will always be approximate; it can of course still be useful [25,43].
In terms of practical hardness, we note that there is currently no well-performing system even for comparatively simple polynomial logics such as RDFS or EL+.
As for practical usefulness, it should be mentioned that RDFS is a prominent logic in the context of Knowledge Graphs, Ontologies, and the Semantic Web [20], and that a generative system capable of transfer and scalability up to non-trivial input sizes with high accuracy (say, even inputs of a few thousand triples) would likely already be of practical use.
Deep Deductive Reasoning at scale, and with logics more expressive than RDFS, would of course open up many more opportunities by providing a path to a deep integration of symbolic formal logical reasoning with deep learning.
As such, DDR remains a highly interesting area of research, where practical applications seem within grasp. However, we have also seen that the right approaches have not been found yet. Advancing will require new ideas and investigations on a broad front, with many different deep learning architectures and approaches.
Acknowledgement
The authors acknowledge partial funding by the National Science Foundation under grants including 2033521.
Footnotes
– which is nondeterministic double exponential, i.e., nondeterministic 2^(2^(p(n))) time for some polynomial p.
– Propositional reasoning can in fact be represented [...].
– There are exceptions. Some neural architectures, e.g., recurrent, can feed variable-length inputs to a single network [...]; we can easily represent its different “length modes” as different circuits, but this makes “constant depth” implausible and introduces uniformity (see below). Constant depth may also be implausible for convolutional networks: recognizing objects in images generally requires more down-sampling layers for higher resolution, i.e., at least logarithmically more layers for larger inputs. The class of languages accepted by logarithmically-deep threshold circuits is called TC^1.
– a fact used to link deductive reasoning with MLPs in the pre-DL era [5, ...].
– variously defined; see e.g., [11, ...].
– in the network input width and in the amount of training data.
– We mean that, for some [...].
– This is because we can use random bits to train a network, then simulate it on the input, and the result is a polynomial-time stochastic algorithm that is correct if given the right training data; but whenever such an algorithm exists, there is also a deterministic polynomial-space algorithm [...], which we can pipeline with the data-generator.
– e.g., because the Turing machine in question would be so intricate that its size is not negligible compared to our inputs, which calls into question the whole framework of “asymptotic resource use in terms of input size”.
– This is an idealization, because human experts only work up to a certain input size; on the other hand, they can solve many reasoning problems not captured by any known decidable logics.
– [..., Section 3.2] for Turing machines with polynomial randomness. We can also prove that if any problem is solvable with bounded two-sided error by choosing poly-size circuits at random from an arbitrary distribution, that problem can be solved deterministically by poly-size circuits.
– see the abstract for this claim.
– We just don't see how pre-compiled embeddings would be helpful for deep deductive reasoning if reasoning should transfer in the sense discussed in Section 2: for embeddings to reflect similarity relevant for transferable deductive reasoning, they would have to capture similarity in a sense conveyed by the input theory, which is not known before runtime, i.e., which is in particular not available for training the embeddings.
– I.e., the system learns to produce support, and then learns to create output from both support and input knowledge base.
– On a related note, this serves to show that preprints that have not been peer reviewed always need to be taken with caution.