
Abstract

Motivated by the analysis of chemical metal compounds and their properties for catalysis, we developed a gradient boosting model that explores graph structures to perform prediction tasks. Taking advantage of the iterative nature of boosting, our novel approach, called PathBoost, explores the graphs to identify the relevant paths and simultaneously fits a prediction model. Advantages of PathBoost include automatic variable selection, as only relevant paths are kept in the model, and explainability, as a measure of variable importance is provided. The novel algorithm is applied to the tmQM dataset, where the molecules are represented as graphs with atoms as nodes and bonds as edges. The goal is to predict specific quantum properties of the molecules, such as the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) gap. These properties usually require heavy computational power to be computed, while our model aims to provide comparable results using far fewer resources.

Keywords

Catalysis, gradient boosting, graphs, HOMO/LUMO gap, prediction

1 Introduction

Understanding the relationship between the structure of a molecule and its properties is a cornerstone of numerous domains, such as medicinal chemistry (Dara et al., 2022) and materials science (Butler et al., 2018). These molecular structures embody graph-based information, where atoms are represented as nodes and the bonds as edges. Predicting molecular properties from these intricacies presents analytical challenges due to the data’s high dimensionality, non-linearity and inherent complexity.

The relationship between a molecular structure and a property of interest can be computed through complex quantum-mechanics computations (Balcells and Skjelstad, 2020), but the computations require many computer hours, resulting in an extremely time-consuming and environmentally unfriendly process. A different strategy consists of learning the relationships by exploiting a statistical learning approach, which is trained on existing data and used to predict unseen cases. Many approaches have been developed for this purpose, including graph neural networks (GNNs, see, e.g. Jørgensen et al., 2018; Schütt et al., 2018; Kneiding et al., 2023) and boosting (e.g., Kudo et al., 2004; Saigo et al., 2009; Fei and Huan, 2010; Pan et al., 2017). The former (GNNs) provide very good results in terms of prediction and are therefore regarded as the state of the art. Nevertheless, they are black boxes that do not allow one to infer new information from the trained model (Scarselli et al., 2009). The inherent opacity of GNNs poses significant challenges to comprehending the mechanism regulating the property of interest, making their practical use particularly unattractive in chemistry applications (Wu et al., 2021).

In this article, we choose to follow a boosting strategy instead. Advantages of the proposed boosting procedure include variable selection, explainability (variable importance) and resistance towards overfitting. In contrast to the boosting approaches currently available in the literature to handle graph-structured input data, which first search for the optimal patterns and then apply a boosting algorithm (see, e.g., Kudo et al., 2004), we simultaneously perform the search for the relevant paths and the evaluation of their effects on the response. Our algorithm, indeed, updates the statistical learning model while exploring the graph. This is facilitated by the specific characteristics of the molecules we are investigating, namely the transition metal-organic compounds, which have a specific metal centre that we can use as the starting point for our search. The presence of distinct centres inspired us to explore parallelization during the training phase. We then devised an aggregation strategy that prioritizes genuinely relevant centres, rather than those that appear important merely due to the independent training performed on each centre. Our work is motivated by the analysis of the tmQM dataset (Balcells and Skjelstad, 2020), which contains the molecular structures of 60 799 transition metal–organic compounds. The goal is to predict one specific property of the compounds, the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) gap. The HOMO/LUMO gap is the difference between the HOMO and LUMO energies and it is important for identifying promising molecules for catalysis and other applications.

The article is structured as follows: In Section 2 we describe our case of study and introduce the data; in Section 3 we describe the gradient boosting approach and our specific version for graph exploration and regression; the novel approach is illustrated and investigated through synthetic data in Section 4; and finally evaluated on the real data in Section 5. Section 6 concludes the article.

2 Motivating application

This work is motivated by the analysis of transition metal complexes. They make up a distinct category of chemical compounds that exhibit an extensive array of physical properties and consist of a metal centre, around which the molecular structure is developed. While transition metal complexes have been explored and utilized extensively in the past, more work is needed to fully understand their electronic properties. Knowing the latter, in particular, may lead to scientifically intriguing and practically beneficial discoveries (Khomskii, 2014), involving anti-diabetic compounds (Sodhi, 2019), bio-energy fuels (Yun, 2016) and electronic devices (Wang et al., 2012). In practice, in this study we use the data provided by Balcells and colleagues in their tmQM dataset series (Balcells and Skjelstad, 2020; Kneiding et al., 2023, 2024). The graph data are publicly available at https://github.com/uiocompcat/tmQMg.

The most relevant challenge in implementing a statistical learning algorithm in this context is arguably the data structure. As we mentioned in the introduction, the molecules are represented in the form of graphs, with the nodes representing the atoms and the edges the bonds. This means that the input information is not in the standard form of an n × p matrix. The 60 799 molecules have completely different sizes and the space of the possible paths is in practice too large for an exhaustive enumeration.

One important property of the molecular structure of the transition metal compounds that helps in our work is the presence of a uniquely identified metal centre. This property allows our algorithm to start the search for relevant structures from a specific point, the metal centre itself. The dataset contains compounds with 30 different metal centres, with different numbers of instances (see Figure S1 in the Supplementary Materials). Moreover, the relevant structures affecting the HOMO/LUMO gap of the molecule are not so far from the metal centre, supporting a sequential search that starts from the metal centre.

Here, the goal is to explore the molecular structure of the existing compounds and to use this information to predict the HOMO/LUMO gap of a new compound. The HOMO/LUMO gap is a continuous variable and we assume that its conditional distribution is Gaussian. Note that we aim to only use the topology of the molecular graph, without considering further information available for the nodes and the edges. This will be material for future work (see also Section 6).

3 PathBoost: Path-based boosting

In this section, we introduce PathBoost, as a scalable boosting algorithm for learning an additive regression model based on labelled paths. By transforming each graph into a vector of counts, where each count refers to a specific labelled path, the approach is able to deal with the issue of various-sized graphs, providing an interpretable representation of the original input graph. Moreover, the additive structure of the model maintains the interpretability of the transformed input in its predicted output. To enable efficient learning of the proposed model, we exploit the fundamentally iterative nature of boosting to sequentially build up the input space or the representation of the graphs, as part of the learning procedure.

3.1 Preliminaries and notation

At a general level, the considered problem is an instance of graph regression, where we want to predict a target response, $Y \in ℝ$, based on an input graph, $G \in \mathcal{G}$. We focus on node-labelled undirected graphs defined as a triple $G = (V, E, L)$, where $V = \{v_{1}, \dots, v_{d}\}$ is the set of nodes (or vertices), $E \subset V \times V$ is the set of (undirected) edges between the nodes and $L : V \to \mathcal{L}$ is a mapping function that assigns each node a label from a finite set of labels, $\mathcal{L} = \{l^{(1)}, \dots, l^{(K)}\}$. In terms of our application, each graph corresponds to a molecule where the nodes of the graph are the atoms that compose the molecule and the edges are the bonds between them. In this context, the label of a node is its atomic symbol. As an example, for a node $v_{i}$ representing a carbon atom we would have $L (v_{i}) = \mathrm{C}$ (see Figure 1).

Figure 1

An illustration of how the feature matrix is expanded during training for a toy dataset containing three molecules. The first two feature columns represent the initial set of features Pt and Zr, as defined by the anchor node (i.e. the metal centre). After Pt is selected, the existing paths obtained by extending Pt by one node, namely (Pt, Br), (Pt, C) and (Pt, Si), are added to the feature matrix; the feature value of each new path is the number of times the path occurs in the molecule.

The graph regression problem amounts to learning a function $f : \mathcal{G} \to ℝ$ to predict a global response $Y$, given a set of training data $T = {\{(G_{i}, y_{i})\}}_{i = 1}^{n}$ consisting of joint observations of the response $y_{i}$ and the associated graph $G_{i} = (V_{i}, E_{i}, L_{i})$. Importantly, the number of nodes in each graph, $d_{i} = |V_{i}|$, may vary. To account for this, we are going to model the response of a graph as a function of its paths. A path of length $m - 1$ in a graph is defined as a sequence of nodes $(v_{j_{1}}, v_{j_{2}}, \dots, v_{j_{m}})$, where each included node occurs only once and each successive pair of nodes is connected by an edge. Moreover, we use the term labelled path to refer to the associated labels of the nodes involved in the path, as denoted by

$$l_{u} = (L (v_{j_{1}}), L (v_{j_{2}}), \dots, L (v_{j_{m}})),$$

where $u$ is the index or identifier of the labelled path. Moreover, we will use $M [G_{i}, l_{u}]$ to denote the number of times that the labelled path $l_{u}$ is found in graph $G_{i}$. Returning to our application, a path represents a sequence of connected unlabelled atoms within a molecule, while the corresponding labelled path also provides information about the specific atomic elements. As an example, for a path of length two, denoted by $(v_{j_{1}}, v_{j_{2}}, v_{j_{3}})$, we could have the corresponding labelled path (Pt, C, C) (see Figure 1).
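As an illustration of the count $M [G_{i}, l_{u}]$, the following minimal Python sketch enumerates the simple paths that start from a chosen node and tallies their label sequences. The adjacency-dictionary representation, the function name `count_labelled_paths` and the toy molecule are assumptions made for this example only; they are not taken from the dataset or from the authors' code.

```python
from collections import Counter

def count_labelled_paths(adj, labels, start, max_len):
    """Count labelled simple paths that begin at `start`.

    adj     : dict node -> list of neighbouring nodes (undirected graph)
    labels  : dict node -> label (here, an atomic symbol)
    max_len : maximum path length, measured in edges
    Returns a Counter mapping each label tuple l_u to its count M[G, l_u].
    """
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # a node may occur only once in a path
                extend(path + [nxt])

    extend([start])
    return counts

# Hypothetical toy molecule: Pt bonded to two C atoms and one Br atom;
# the two C atoms are also bonded to each other.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
labels = {0: "Pt", 1: "C", 2: "C", 3: "Br"}
M = count_labelled_paths(adj, labels, start=0, max_len=2)
```

In this toy graph the labelled path (Pt, C, C) is found twice, once through each carbon atom, because the two node sequences are distinct even though their labels coincide.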

Finally, and motivated by our application, we will assume that there is a single designated anchor node, denoted by $v^{*}$, for each graph and we will focus on paths that start from this particular node. For a given set of training data, $T$, we will use $V^{*} = {\{v_{i}^{*}\}}_{i = 1}^{n}$ to denote the associated set of anchor nodes. Moreover, in our particular application, the anchor node has a separate set of possible labels, $\mathcal{L}^{*}$, which is disjoint from the common label set, $\mathcal{L}$, that is shared by the remaining nodes, that is, $\mathcal{L}^{*} \cap \mathcal{L} = \emptyset$. As an example, for the toy scenario in Figure 1, we would have $\mathcal{L}^{*} = \{\mathrm{Pt}, \mathrm{Zr}\}$.

3.2 Additive path-based model

The key challenge in graph regression, where the input graphs are of varying size, is to extract or construct features from the graphs that are informative for predicting the target response. By transforming each graph into a numerical vector of equal length, $h : \mathcal{G} \to ℝ^{p}$, one obtains a matrix of tabular input data that can be plugged into supervised machine learning methods. A relevant set of features can be constructed manually using domain knowledge, prior to learning the model, or it can be learned automatically from the data as part of the learning procedure. The latter approach is more generally known as representation learning (Bengio et al., 2013), as the goal is to learn a compact representation of raw (and often ‘unstructured’) input data that captures as much information as possible of the original data with respect to some criterion. In particular, in supervised learning it is essential to learn a representation that is informative for predicting the target response. The field of representation learning is dominated by deep learning approaches and, in particular, the class of GNNs is the primary tool for graph-based prediction (Wu et al., 2021). While deep learning approaches are often superior in terms of predictive accuracy on non-tabular data, a main drawback is that the inferred representation, defined through the mapping function $h (\cdot)$, is in general difficult to interpret, as the individual components of the graph become entangled within the representation.

In this work we propose a hybrid feature learning approach where we define our feature space over all possible labelled paths, starting from the anchor node and define the associated features simply as the number of times the labelled path occurs in a graph. More specifically, using the above notation, each graph G_i will be mapped to a vector of counts with respect to a given collection of labelled paths, denoted by $P = {\{l_{u}\}}_{u = 1}^{p}$ , that is $h (G_{i}; P) = (M [G_{i}, l_{1}], \dots, M [G_{i}, l_{p}])$ and the regression model will thus base its prediction on the above vector of labelled path counts, $f : h (G) \to ℝ$ . Similarly, we will use $X_{T, P} = {(h (G_{i}; P))}_{i = 1}^{n}$ to denote the n × p count matrix for a given set of paths $P$ over the set of graphs contained in $T$ , that is, each column represents the count of a path and each row, denoted by $x_{i} = h (G_{i}; P)$ , represents the counts of the considered paths for a graph $G_{i} \in T$ . The proposed mapping thus provides a path-based representation that deals with the issue of various-sized graphs. Furthermore, given a vector of path-based counts, we will assume that f(·) follows an additive structure:

$$f (G_{i}; β) = \sum_{u = 1}^{p} \sum_{c = 0}^{c_{u}^{\max}} β_{u, c} \cdot \mathbb{1} [M [G_{i}, l_{u}] = c], \tag{3.1}$$

where $β_{u} = \{β_{u, 0}, \dots, β_{u, c_{u}^{\max}}\}$ contains the count-specific coefficients of path $l_{u}$ and $c_{u}^{\max}$ denotes the maximum number of times that specific path is observed in a graph in the training data. Thus, for a given input graph, $β_{u, c}$ is added to the prediction of the model if path $l_{u}$ is observed $c$ times in the input graph. As the prediction of the above model can be decomposed into a sum of path-specific terms, it greatly facilitates the interpretation of the output produced by the model.
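To make the additive structure of Equation (3.1) concrete, the sketch below evaluates the model for one graph from its path counts. The dictionary layout, the function name `predict_additive` and all coefficient values are hypothetical and chosen only for illustration; treating a count that never occurred in training as contributing zero is likewise an assumption of this sketch.

```python
def predict_additive(counts, beta):
    """Evaluate f(G; beta) as the sum over paths u of beta[u][c], with c = M[G, l_u].

    counts : dict labelled path -> count in the graph (a missing path means 0)
    beta   : dict labelled path -> dict mapping a count c to its coefficient
    """
    total = 0.0
    for path, coefs in beta.items():
        c = counts.get(path, 0)
        # Assumption: a count never seen in training contributes nothing.
        total += coefs.get(c, 0.0)
    return total

# Hypothetical coefficients for two paths.
beta = {
    ("Pt",): {0: -0.5, 1: 0.5},
    ("Pt", "C"): {0: 0.0, 1: 0.125, 2: 0.25},
}
counts = {("Pt",): 1, ("Pt", "C"): 2}
prediction = predict_additive(counts, beta)  # 0.5 + 0.25
```

Each path contributes exactly one term to the sum, which is what allows a prediction to be read off path by path.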

3.3 Path-based gradient boosting

In the assumed model class, the learning problem essentially consists of identifying the informative paths among all possible paths and fitting the path-specific coefficients, $β = {\{β_{u}\}}_{u = 1}^{p}$ . Assume that we are given an n × p feature matrix that covers all possible labelled paths within a given training set $T$ , as we only need to consider the unique labelled paths within $T$ . Our goal is then to train a model (that is, estimate β) by minimizing some specified loss function, $l (\cdot, \cdot)$ , with respect to $T$ ,

$$\underset{β}{\arg \min} \sum_{i = 1}^{n} l (y_{i}, f (G_{i}; β)) . \tag{3.2}$$

The above estimation problem is very challenging due to its high-dimensional nature. In practice, we assume that only a small fraction of the labelled paths are informative for predicting the target response and the key to efficient learning will rely on the ability to identify those paths as part of the learning procedure. To this end, we will employ boosting, which is a learning technique tailored to this type of setting.

More specifically, our algorithm is based on gradient boosting, which assumes a differentiable loss function and solves Equation (3.2) by iteratively approximating the negative gradient of the loss function with simple models, referred to as base learners. Then, to avoid overfitting, the algorithm is stopped at a certain iteration (m_stop), which is typically estimated by cross-validation, and the base learners are assembled into a final boosting model. In terms of the type of base learners, we focus on tree stumps, which make only a single split on a single variable. By using tree stumps we will avoid interactions between paths, which ultimately results in an additive model of the form in Equation (3.1). In addition, tree stumps have the property of being ‘weak’, meaning that a single stump improves the model only slightly, a property that has been proven to be fundamental for boosting (Bühlmann and Yu, 2003). For more information about the general technique of boosting, we refer the reader to the overview article by Mayr et al. (2014).

3.3.1 Path-based Gradient Boosting with Expanding Feature Space

The main obstacle to a straightforward boosting procedure, as the one described above, is that the number of possible labelled paths quickly becomes too large for an exhaustive enumeration, even if we restrict the feature space to only the labelled paths observed in the given training set $T$ . For this reason, we modify the learning algorithm by introducing an adaptive approach that sequentially builds up the feature space. The key idea is illustrated in Figure 1. We start from an initial feature space defined over the anchor nodes. When the algorithm selects a path for the first time, it adds to the input matrix every one-node extension of the selected path that is present in the training data. If the algorithm selects a previously selected path, the input matrix is not updated during that particular iteration. At each step, the graph representations are summarized by an n × p_m matrix, where p_m is the size of the available matrix at iteration m, that is, the number of paths added up to that point.

We refer to the general algorithm that implements the above idea as PathBoost. Algorithm 1 shows the details of a generic version of PathBoost for regression and the key steps can be summarized as follows.

Initialization. The number of iterations is set to zero. The algorithm explores the training set and adds a column in the initial feature matrix for each label of the anchor node, $v^{*}$. As a result, we have an initial input matrix of dimension n × p₀, where $p_{0} = |\mathcal{L}^{*}|$, that only contains paths of length zero. In our particular application, p₀ will thus simply be the number of different metal centres. The model is initialized as the average of the response.

Boosting update (i-iv). The algorithm follows the usual component-wise boosting update scheme. The negative gradient is computed based on the loss function with respect to the model from the previous iteration (step [i]). Then, a base learner, in the form of a stump, is fitted to the negative gradients based on the current set of features, so that the optimal feature or path to split on is selected (step [ii]). The leaf values of the selected stump are estimated (step [iii]) and the model is updated with the new base learner (step [iv]).

Expand the input matrix (v-vi). This is the novel step of our boosting algorithm. Once a feature (or path) has been selected in step (ii), the algorithm checks whether the path has been selected before or not. In the former case, the boosting iteration simply updates the effect of an already activated path feature and the algorithm moves on to the next iteration. In the latter case, the algorithm explores the graphs in the training data and adds to the current set of paths every existing path that extends the considered path by one node.

Stop. Once the desired number of iterations has been reached, the algorithm stops and outputs the current estimate of the model, which can be expressed in the form of Equation (3.1).
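The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes squared-error loss (so the negative gradient is the residual), represents each graph as a hypothetical `(adjacency, labels, anchor)` triple, and uses helper names (`match`, `extensions`, `fit_stump`, `pathboost`, `predict`, `star`) and toy data invented for this example.

```python
import numpy as np

def match(graph, lpath):
    """Node sequences that start at the anchor and spell the labelled path."""
    adj, labels, anchor = graph
    if labels[anchor] != lpath[0]:
        return []
    frontier = [[anchor]]
    for lab in lpath[1:]:
        frontier = [p + [v] for p in frontier for v in adj[p[-1]]
                    if v not in p and labels[v] == lab]
    return frontier

def count_path(graph, lpath):
    return len(match(graph, lpath))

def extensions(graphs, lpath):
    """One-node extensions of `lpath` observed in at least one training graph."""
    found = set()
    for g in graphs:
        adj, labels, _ = g
        for p in match(g, lpath):
            found.update(lpath + (labels[v],) for v in adj[p[-1]] if v not in p)
    return found

def fit_stump(X, r):
    """Best single-split stump on residuals r: (gain, column, threshold, left, right)."""
    best = (0.0, None, None, 0.0, 0.0)
    base = np.sum((r - r.mean()) ** 2)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            left, right = r[mask].mean(), r[~mask].mean()
            sse = np.sum((r[mask] - left) ** 2) + np.sum((r[~mask] - right) ** 2)
            if base - sse > best[0]:
                best = (base - sse, j, t, left, right)
    return best

def pathboost(graphs, y, n_iter=300, lr=0.1):
    # Initialization: one column per anchor label, model set to the mean response.
    paths = sorted({(labels[a],) for _, labels, a in graphs})
    X = np.array([[count_path(g, p) for p in paths] for g in graphs], dtype=float)
    f0 = y.mean()
    pred = np.full(len(y), f0)
    stumps, expanded = [], set()
    for _ in range(n_iter):
        r = y - pred                                   # (i) negative gradient
        gain, j, t, left, right = fit_stump(X, r)      # (ii)-(iii) stump and leaves
        if j is None:                                  # no split improves the fit
            break
        pred += lr * np.where(X[:, j] <= t, left, right)   # (iv) model update
        stumps.append((paths[j], t, lr * left, lr * right))
        if paths[j] not in expanded:                   # (v)-(vi) expand on first selection
            expanded.add(paths[j])
            new = sorted(extensions(graphs, paths[j]) - set(paths))
            if new:
                cols = [[count_path(g, p) for p in new] for g in graphs]
                X = np.hstack([X, np.array(cols, dtype=float)])
                paths += new
    return f0, stumps, paths

def predict(model, graph):
    f0, stumps, _ = model
    return f0 + sum(l if count_path(graph, p) <= t else r for p, t, l, r in stumps)

def star(metal, n_c):
    """Hypothetical toy molecule: a metal centre bonded to n_c carbon atoms."""
    adj = {0: list(range(1, n_c + 1)), **{i: [0] for i in range(1, n_c + 1)}}
    return adj, {0: metal, **{i: "C" for i in range(1, n_c + 1)}}, 0

graphs = [star(m, k) for m in ("Pt", "Zr") for k in range(4)]
y = np.array([0, 1, 2, 3, 0, 0, 0, 0], dtype=float)   # response driven by (Pt, C) only
model = pathboost(graphs, y)
train_pred = np.array([predict(model, g) for g in graphs])
```

In this toy run the path (Pt, C) only becomes available after (Pt) has been selected once, which mirrors the expansion step. Paths extending (Zr) are never added in this example, because (Zr) is never the selected feature.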

3.3.2 Path Importance

As a boosting algorithm, PathBoost has a natural way to evaluate the importance of a feature in the resulting model. At each iteration, a base learner is fitted on the negative gradients based on the available set of features and the results are compared in terms of the reduction of the loss function. Thus, at each iteration, we compute the reduction in terms of a loss function for each path.

Algorithm 1

PathBoost

The simplest way to define feature (or path) importance is to sum up the contribution of each selected path (that is, the one that leads to the largest reduction of the loss function) over all iterations. In the end, for each path, we will have a total reduction of the considered loss given by the inclusion of that path in the model. A different approach is to evaluate the difference in terms of the loss function not in absolute terms but relative to the second-best option. In this way, a path that could be substituted in the model by another path (with correlated feature values) does not get an inflated importance compared to almost equally important alternatives. On the other hand, the importance of two similarly informative paths may be underestimated, as they are interchangeable, and a less relevant path may then be assigned a higher importance.

We will consider both of the above variants and we will rescale the path importance measure with respect to the most important path, such that final values range from 0 (no importance) to 100 (most important), as is standard in the statistical learning framework (Hastie et al., 2009, Ch. 10.13.1). In the adaptive setting, we note that only a small fraction of typically shorter paths are available for the algorithm to split on early in the learning phase, when most of the reduction in loss occurs. Thus, path importance measures based on loss reduction will implicitly favour shorter paths.
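The two variants and the rescaling can be sketched as below. The input format (one dictionary of candidate loss reductions per iteration), the function name `path_importance` and the numbers are assumptions made for illustration.

```python
def path_importance(gains_per_iter, relative=False):
    """Accumulate importance for the path selected at each iteration.

    gains_per_iter : list of dicts, one per boosting iteration, mapping every
                     candidate path to the loss reduction of its best stump
    relative       : if True, credit only the margin over the second-best path
    Returns importances rescaled so that the most important path scores 100.
    """
    importance = {}
    for gains in gains_per_iter:
        ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
        best_path, best_gain = ranked[0]
        runner_up = ranked[1][1] if relative and len(ranked) > 1 else 0.0
        importance[best_path] = importance.get(best_path, 0.0) + best_gain - runner_up
    top = max(importance.values())
    if top == 0:
        return {p: 0.0 for p in importance}
    return {p: 100.0 * v / top for p, v in importance.items()}

# Hypothetical loss reductions over two iterations.
gains = [
    {("Ni",): 4.0, ("Ni", "N"): 3.0},
    {("Ni",): 1.0, ("Ni", "N"): 2.0, ("Ni", "N", "C"): 0.5},
]
absolute = path_importance(gains)                 # (Ni): 100, (Ni, N): 50
relative = path_importance(gains, relative=True)  # both paths: 100
```

With these made-up numbers the absolute variant ranks (Ni) clearly first, whereas the relative variant treats the two paths as equally important because each wins its iteration by the same margin.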

3.3.3 Implementation Details

While the general method is fully described above, here we present the most relevant details of our particular implementation of PathBoost, which is available in the supplementary material or online at https://github.com/Claudio-Me/Path-Boosting-Public.git.

Stopping criterion. Instead of penalizing model complexity explicitly, we control it through early stopping. Given a fixed learning rate, η, the tuning parameter that controls the bias-variance trade-off is then the number of iterations, m_stop, which can be found by cross-validation. Because the cross-validation error curve is not monotone, and to avoid getting stuck in a local minimum, we introduce a quantity called patience. Its value defines for how many consecutive subsequent iterations the cross-validation error must remain larger than the current one for the current iteration to be identified as a ‘suitable’ tuning parameter value, that is, chosen to be used as m_stop.

Patience. In theory, the value of patience should be chosen as large as possible to ensure the finding of the global minimum. In practice, this is not feasible from a computational point of view and not necessarily desirable from a modelling point of view. In fact, a larger number of iterations dramatically increases the length of the considered paths, causing memory issues and overly complex models, without much improvement in the prediction ability. As the cross-validation error is a monotonically decreasing function of patience, we suggest using the elbow rule to choose the value in practice (see also Section 5.3).
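One way to read the patience rule is sketched below: the chosen iteration is the first one whose cross-validation error is not improved upon during the following `patience` iterations. The function name `select_m_stop`, the fallback to the overall minimum and the error values are assumptions of this sketch rather than details of the actual implementation.

```python
def select_m_stop(cv_errors, patience):
    """Return the first iteration (1-based) whose cross-validation error stays
    below the errors of the next `patience` iterations."""
    for m, err in enumerate(cv_errors):
        window = cv_errors[m + 1 : m + 1 + patience]
        if len(window) == patience and all(e > err for e in window):
            return m + 1
    # Assumed fallback: no iteration satisfies the rule, so take the overall minimum.
    return min(range(len(cv_errors)), key=cv_errors.__getitem__) + 1

# Hypothetical, non-monotone cross-validation error curve.
cv = [1.0, 0.8, 0.7, 0.72, 0.69, 0.71, 0.73, 0.75, 0.74]
m_stop_short = select_m_stop(cv, patience=1)  # stops at the local minimum, iteration 3
m_stop_long = select_m_stop(cv, patience=3)   # looks past it, iteration 5
```

In this example the larger patience reaches a lower cross-validation error (0.69 instead of 0.70) at the cost of two more boosting iterations, which is the trade-off the elbow rule is meant to balance.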

Path importance. Both approaches for measuring path importance, as described in Section 3.3.2, are implemented in PathBoost, as the two approaches complement each other in providing a more complete overview of the importance of the considered paths. In addition, a useful feature, specific to the considered setting, is that we can analyze the path importance locally in terms of a specific class of compounds that share the same metal centre or globally across all classes of compounds. In the latter case, we sum up the local importance measures such that each local contribution to the global importance is weighted by the size of the respective classes in the training data, to not overestimate the importance of paths associated with compounds that are less common in the data.
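The global aggregation can be sketched as a weighted sum, assuming that the weight of each metal-centre class is its share of the training data. This is one plausible reading of the description above; the function name `global_importance` and the numbers are hypothetical.

```python
def global_importance(local_importance, class_sizes):
    """Combine per-centre path importances into one global measure.

    local_importance : dict metal centre -> dict path -> local importance
    class_sizes      : dict metal centre -> number of training compounds
    """
    total = sum(class_sizes.values())
    combined = {}
    for centre, importances in local_importance.items():
        weight = class_sizes[centre] / total  # assumed weighting: class share
        for path, value in importances.items():
            combined[path] = combined.get(path, 0.0) + weight * value
    return combined

# Hypothetical local importances for two metal centres.
local = {
    "Ni": {("Ni",): 100.0, ("Ni", "N"): 40.0},
    "Pt": {("Pt",): 100.0},
}
sizes = {"Ni": 300, "Pt": 100}
combined = global_importance(local, sizes)
```

Here the path (Pt) is locally the most important one for its class, but it contributes only a quarter of that value globally because Pt compounds make up a quarter of the hypothetical training set.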

Prediction. When predicting the response for a new observation, we need to match the paths of the graph of the new compound to those used by the model. This can be done in two ways: Either by finding all possible paths in the new graph and then checking if they are included in the model or by looking for all paths included in the model and checking if (and how many) are present in the new observation. We choose to implement the latter, as we can guide the search by looking at specific paths and therefore reduce the computational time.

4 Simulations

Before applying our algorithm to the real data, we evaluate its behaviour in a controlled setting that mimics the characteristics of the tmQM dataset. This allows us to illustrate how the algorithm works and assess its performance.

4.1 Synthetic data

We generate synthetic data by following a simple procedure: We first identify some specific paths of the molecular structures that are present in a sufficiently large number (300) of compounds contained in the tmQM dataset; we then generate a response that only depends on these selected paths, adding some noise. Specifically, the response is generated from the simple model

$$y = \sum_{u = 1}^{p} β_{u} M (\cdot, l_{u}) + ϵ, \tag{4.1}$$

where $ϵ \sim N (0, σ^{2})$, $M (\cdot, l_{u})$ denotes how many times the $u$th path $l_{u}$ is present in the compounds and

$$β_{u} = \frac{b_{u}}{d_{u} + 1}, \quad \text{with } b_{u} \sim \text{Uniform} (2, 3), \tag{4.2}$$

and $d_{u}$ is the length of the path. We added the latter term to simulate the chemists’ conjecture that the relevant structures are close to the metal centre.
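The data-generating mechanism of Equations (4.1) and (4.2) can be sketched with NumPy as follows. The count matrix, the seed and the function name `simulate_response` are hypothetical; in the study the counts come from real tmQM compounds, not from a made-up matrix.

```python
import numpy as np

def simulate_response(M, path_lengths, sigma2, rng):
    """Generate y = M @ beta + eps with beta_u = b_u / (d_u + 1).

    M            : (n, p) matrix of path counts, one column per relevant path
    path_lengths : d_u for each column (a metal centre alone has length 0)
    sigma2       : noise variance
    """
    d = np.asarray(path_lengths, dtype=float)
    b = rng.uniform(2.0, 3.0, size=d.shape)   # b_u ~ Uniform(2, 3)
    beta = b / (d + 1.0)                      # Equation (4.2)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=M.shape[0])
    return M @ beta + eps, beta               # Equation (4.1)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
# Hypothetical counts of (Ni), (Ni, N) and (Ni, N, C) in four compounds.
M_counts = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [1, 2, 1],
                     [1, 1, 3]], dtype=float)
y_sim, beta_sim = simulate_response(M_counts, [0, 1, 2], sigma2=0.2, rng=rng)
```

Dividing by $d_{u} + 1$ bounds the coefficient of a path of length $d_{u}$ between $2 / (d_{u} + 1)$ and $3 / (d_{u} + 1)$, so paths farther from the metal centre have systematically smaller effects.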

We generate data in three different scenarios. Each scenario is run 200 times (replications), keeping the values of the regression coefficients fixed (they are generated only once). In every replication, the number of iterations, m_stop, is computed via five-fold cross-validation and we set patience = 3.

Scenario 1. A very simple scenario, in which the relevant paths are the metal centre (Ni), the path (Ni, N) and the path (Ni, N, C). The number of compounds that contain at least one of these paths is 384 (307 of which are used for training and 77 for testing) and we discarded all other compounds. The regression coefficients obtained from Equation (4.2) are $β_{(\mathrm{Ni})} = 2.13$, $β_{(\mathrm{Ni}, \mathrm{N})} = 1.42$ and $β_{(\mathrm{Ni}, \mathrm{N}, \mathrm{C})} = 0.92$. The noise is generated with σ² = 0.2. The task is very simple and it is mainly provided for illustrative purposes.

Scenario 2. An intermediate scenario, in which the relevant paths extend farther from the metal centre, namely (Ni, N, C), (Ni, N, C, C), (Ni, N, C, C, C), (Ni, N, C, C, C, Br). Their related regression coefficients, from Equation (4.2), are 0.75, 0.53, 0.55 and 0.47, respectively. This scenario is a bit more challenging, because our algorithm must select some paths that are not directly relevant (namely [Ni] and [Ni, N]) before activating the directly relevant ones as selectable features. We use this scenario to check whether our algorithm is able to explore beyond the initial set of paths until it reaches the relevant paths. The sample size in this scenario is 311: in addition to the 253 compounds that include at least one of the aforementioned paths, we have 48 compounds whose artificial $y$ is only affected by the noise ($ϵ \sim N (0, 0.2)$).

Figure 2

Average true positive rate (TPR) as a function of the number of iterations. The shadowed area shows the range (interval between the smallest and highest value) of TPR in the 200 replications. Left panel: Scenario 1. Right panel: Scenario 2

Scenario 3. Finally, in the most complex scenario, we consider 48 paths (see Section S.2 of the Supplementary Materials for the complete list). These paths are chosen such that the one-step-longer paths are present in approximately half of the compounds that contain the previous path. In other words, if a metal centre (A) is present in 64 compounds, we select (A, B), (A, B, C) and (A, B, C, D) that are approximately present in 32, 16 and 8 compounds, respectively. In this scenario the total sample size is 253, with 202 compounds randomly assigned to the training set and 51 to the test set. As this scenario is the closest to the real data, we evaluate the algorithm with different signal-to-noise ratios: σ² ∈ {0.2, 0.5, 0.8, 1.1, 1.4, 1.7}.

4.2 Results

4.2.1 Scenario 1

As expected, the algorithm performs well in terms of variable selection in this highly simplified setting. The most complicated path, (Ni, N, C), is an extension of its sub-paths (Ni) and (Ni, N), meaning the algorithm is in a sense guided towards it. As we can see in the left panel of Figure 2, the algorithm always selects and updates the correct paths in the first iterations, after which it also starts selecting noisy paths. When we apply the five-fold cross-validation procedure, the average m_stop is 30.38, with a considerable variability (SE = 8.29) due to the small sample size. In these iterations, the average number of selected paths is 7.42 (SE = 3.99), so PathBoost tends to include a few noisy paths. The test error, computed on a test set generated in the same way, is a bit larger (average 0.48, with SE = 0.10) than the oracle one (we used σ² = 0.2).

4.2.2 Scenario 2

In this scenario, it is much more difficult for the algorithm to find the true paths. Although the relevant paths are nested in this case as well, they are farther from the metal centre and their effects are much smaller. Recall that the size of the coefficients β is inversely proportional to the path length (see Equation 4.2) and here we have one relevant path of length five. In addition, the algorithm necessarily needs to select some irrelevant paths, as the shortest relevant path (Ni, N, C) can only be reached by first selecting (Ni) and (Ni, N). This is perhaps not a very realistic scenario, as it is unlikely that nested paths do not influence the response while the longer path does, but it is a useful setup for investigating the ability of our algorithm to explore the feature space.

Table 1

Scenario 2. Summary of the results obtained by PathBoost in the 200 replications of the experiment. The average number of times the path is updated and its average importance are reported for the 10 most important paths, with the related standard error in brackets. The paths over the dashed line are generated with a direct effect on the response, those under the dashed line without.

Path	Number of updates, mean (SE)	Importance, mean (SE)
(Ni, N, C)	14.89 (2.06)	19.07 (0.42)
(Ni, N, C, C)	24.31 (3.72)	22.81 (1.33)
(Ni, N, C, C, C)	14.02 (1.80)	32.58 (1.34)
(Ni, N, C, C, C, Br)	1.26 (0.86)	0.05 (0.03)
- - - - - - - - - - - - - - - - - - - -
(Ni)	1.00 (0.00)	100.00 (0.00)
(Ni, N)	10.28 (1.93)	23.61 (0.27)
(Ni, N, C, C, Br)	1.36 (0.67)	0.08 (0.05)
(Ni, N, C, C, O)	1.59 (1.05)	0.05 (0.03)
(Ni, O)	1.06 (0.91)	0.04 (0.05)
(Ni, N, C, C, Cl)	0.92 (0.82)	0.02 (0.02)

The above mentioned difficulties are reflected in the true positive rate (TPR), which is considerably lower than in the previous scenario, as shown in the right panel of Figure 2. From this figure, we can also see that in all 200 replications of the experiment the algorithm selects the noisy paths (Ni) and (Ni, N) in the first two iterations and then selects the relevant path (Ni, N, C) at iteration three. Interestingly, the algorithm manages to select paths (Ni, N, C, C) and (Ni, N, C, C, C) very early, such that at iteration five the algorithm has already selected 3/4 of the relevant paths without any false positives except for the two initial ones that it had to select (resulting in a TPR of 3/5). The final relevant path, (Ni, N, C, C, C, Br), is very rare, in that it is present in only two observations and it has a minimal effect (i.e., very hard to discover). When only considering the iterations until the cross-validated stop value, we can see that (Ni, N, C, C, C, Br) is rarely selected (on average only 1.26 times, see Table 1) and relatively late, as the measurement of importance suggests.

After capturing the information of the relevant paths, the algorithm starts selecting noisy paths quite early, noticeably around the 17th iteration. The average stopping iteration m_stop obtained by five-fold cross-validation in this scenario is 82.82 (SE = 20.85), meaning that PathBoost tends to select a considerable number of noisy paths. On average, our algorithm selects 19.07 paths (SE = 5.92), so there are on average more than 15 false positives. The selection of too many variables is typical of boosting algorithms (see, e.g. Strömer et al., 2022). Importantly, however, Table 1 shows that the algorithm correctly focuses on modelling the effect of the relevant paths, and only updates the effect of noisy paths a few times. The notable exception is (Ni, N), but this is due to the fact that it shares information with all the relevant paths (which are its ‘child’, ‘grandchild’ and ‘great-grandchild’).

Table 1 also shows the importance of the paths computed by PathBoost. The importance is relative to the most important path (whose importance is set to 100), which in this case is the metal centre (Ni). While the metal centre is not directly associated with the response in our data-generating mechanism, it obviously includes part of the information belonging to the paths starting with it and it is naturally selected in the first iteration. More interestingly, PathBoost seems to correctly infer that it should not focus on it and only selects it once in each replication. Excluding the metal centre, the most interesting result here in terms of path importance is that PathBoost correctly assigns the largest importance to the relevant paths. The exceptions are (Ni, N, C, C, C, Br), which not surprisingly receives a small importance value, and (Ni, N), which has a large value because it is the parent of the relevant (and very important) (Ni, N, C). Another consequence of the fact that (Ni, N), in a sense, ‘steals’ importance from its child and grandchild nodes is that (Ni, N, C), (Ni, N, C, C) and (Ni, N, C, C, C) have increasing importance despite being generated with decreasing importance. Basically, the ‘spurious’ importance of (Ni, N) is subtracted from its descendants in proportion to the distance.

Figure 3

Scenario 3. Left panel: TPR as a function of the number of iterations. Right panel: Test error (mean squared error) as a function of the irreducible error. In both panels the shadowed area shows the interval between the smallest and the highest value (range) obtained in the 200 replications.

Finally, the error is relatively large, being on average 0.92 (SE = 0.19), compared to a generated irreducible error of σ² = 0.2. This may again be partially explained by the selection of noisy paths close to the metal centre, in particular path (Ni, N).

4.2.3 Scenario 3

This is the most realistic scenario, with a higher level of difficulty for the algorithm to find the true generating model. This is reflected in a higher value of the stopping iteration obtained by the five-fold cross-validation procedure (average m_stop = 312.32, with SE = 32.40). The increased number of iterations also led to an increase in the number of selected paths (on average 92.47, with SE = 10.83), including noisy ones (see left panel of Figure 3). As in Scenario 1, PathBoost correctly identifies the relevant paths in the first 20–25 iterations and then starts to also include noisy paths. This is partially caused by collinearity problems. For example, some irrelevant paths, (Sc, N, Si), (Y, C, B), (Y, C, N), (Ni, N, C, C, F), are perfectly correlated with relevant paths. A correlation larger than 0.7 with a relevant path also influences the selection of the noisy paths (Ni, N, C, C, C, Br), (Ni, N, N, S) and (Ni, N, C, C, C, Cl). The average test error obtained in the 200 replications, for the model with the stopping iteration selected by cross-validation, is 0.90 (SE = 0.06), which is larger but not dramatically larger than the oracle one (0.2).

We also evaluated the predictive performance of PathBoost with an increasing noise value, that is, with a reduced signal-to-noise ratio. The results are reported in the right panel of Figure 3. As expected, the test error increases as the value of σ² in Equation 4.1 is increased. Although there is a higher variability in predictive performance under a reduced signal-to-noise ratio (shadowed area), the average test error (in terms of MSE) grows linearly as a function of the unexplainable variance. Our results with σ² = 0.2, therefore, seem generalizable to contexts with lower signal-to-noise ratios.

5 Main application

We are now ready to apply PathBoost to our case study. As described in Section 2, the dataset contains 60 799 molecules, with 30 different metal centres. The HOMO/LUMO gap is the continuous response. We randomly split the data into a training set (48 639 compounds) and a test set (12 160).

5.1 Implementation details

As is common in regression problems, here we use the squared error as the loss function. In terms of the boosting algorithm, we use tree stumps as base learners and a learning rate η = 0.3 to further ‘weaken’ them. The selected learning rate is the default value in many boosting implementations, including XGBoost (Chen and Guestrin, 2016).
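As a generic illustration of the ingredients named above (squared-error loss, tree stumps and a learning rate η = 0.3), the following is a minimal sketch of plain L2 boosting on tabular features. It is not the PathBoost implementation: the function names `fit_stump` and `l2_boost`, the column-wise input format and the default of 50 iterations are assumptions made for this example, and no graph exploration takes place.

```python
def mean(values):
    return sum(values) / len(values)


def fit_stump(x, r):
    """Best single split of the residuals r on one feature column x.

    Returns (sse, threshold, left_value, right_value),
    or None when x is constant and cannot be split.
    """
    best = None
    for t in sorted(set(x))[:-1]:  # every observed value except the largest
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        cl, cr = mean(left), mean(right)
        sse = (sum((ri - cl) ** 2 for ri in left)
               + sum((ri - cr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, cl, cr)
    return best


def l2_boost(columns, y, n_iter=50, eta=0.3):
    """Gradient boosting with squared-error loss and stumps.

    columns: list of feature columns, each as long as y (assumed layout).
    """
    pred = [mean(y)] * len(y)  # start from the constant model
    model = []
    for _ in range(n_iter):
        # For squared error, the negative gradient is the residual.
        r = [yi - pi for yi, pi in zip(y, pred)]
        fits = [(fit_stump(x, r), j) for j, x in enumerate(columns)]
        fits = [(s, j) for s, j in fits if s is not None]
        if not fits:
            break
        (sse, t, cl, cr), j = min(fits, key=lambda f: f[0][0])
        # Shrink the stump by the learning rate before adding it.
        pred = [p + eta * (cl if xi <= t else cr)
                for p, xi in zip(pred, columns[j])]
        model.append((j, t, eta * cl, eta * cr))
    return pred, model
```

With η = 0.3 each update removes only 30% of the residual that the selected stump could explain, which is the sense in which the learning rate further weakens the base learners.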

Since gradient boosting is based on sequential improvements, it is usually not possible to parallelize the algorithm. In our case, however, each path grows from a specific anchor node or metal centre and paths that do not share the metal centre do not interact. In practice, this essentially means that our boosting algorithm fits a different sub-model for each specific class of compounds, where the class of compounds is defined by its metal centre. Importantly, this insight enables us to implement a form of parallelization that facilitates training on larger datasets. Specifically, we use the following computational trick: We first fit PathBoost separately for each sub-model, resulting in 30 initial learning paths. From these, at each boosting iteration, we then select the best update to iteratively build up a final learning path covering all sub-models without any additional fitting. More details of this fitting scheme can be found in Section S.4 of the Supplementary Material.

Note that the above procedure is different from fitting 30 separate boosting models: by selecting the best update at each iteration, the algorithm potentially avoids updating paths that do not contribute to the general aim but are only useful for explaining the variation within a potentially irrelevant class of compounds, for example because all relevant information may already have been captured within that class. Moreover, it only requires the computation of a single m_stop, that is, it avoids performing 30 separate time-consuming cross-validation procedures.
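The merging step can be sketched as follows. This is one possible reading of the scheme, not the fitting procedure of Section S.4 of the Supplementary Material: it assumes that each sub-model has stored the loss reduction achieved by each of its updates, and that, because sub-models are fitted on disjoint sets of compounds, these reductions do not depend on the order in which other sub-models are updated. The function name `merge_learning_paths` and the toy numbers are hypothetical.

```python
import heapq


def merge_learning_paths(gains):
    """Greedily interleave per-sub-model learning paths.

    gains: dict mapping a sub-model name (e.g. a metal centre) to the list
           of loss reductions of its updates, in the order they were fitted.
    Returns the merged sequence of (name, step) pairs.
    """
    # Max-heap on the gain of the next pending update of each sub-model.
    heap = [(-g[0], name, 0) for name, g in gains.items() if g]
    heapq.heapify(heap)
    merged = []
    while heap:
        _, name, step = heapq.heappop(heap)
        merged.append((name, step))
        nxt = step + 1
        if nxt < len(gains[name]):
            heapq.heappush(heap, (-gains[name][nxt], name, nxt))
    return merged


# Toy example with three sub-models (invented numbers).
toy = {"Ni": [5.0, 1.0, 0.5], "Zn": [3.0, 2.0], "Pt": [0.2]}
print(merge_learning_paths(toy))
# [('Ni', 0), ('Zn', 0), ('Zn', 1), ('Ni', 1), ('Ni', 2), ('Pt', 0)]
```

Truncating the merged sequence at a single stopping iteration then yields one model covering all sub-models, and a sub-model whose remaining updates bring little gain (here Pt) is simply reached late or not at all.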

To simplify the search for relevant paths, we also decided to limit the maximum path length to six. This is consistent with the conjecture that HOMO/LUMO gaps are mainly influenced by the characteristics of the molecules’ centres. Preliminary analyses (see Figure S3.2 in the Supplementary Materials) corroborated this idea. This simplification allows us to avoid long chains of carbon atoms, greatly speeding up the algorithm at a minimum cost (for more information, see Section 5.3).
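To make the effect of the length cap concrete, the sketch below enumerates all label paths that start at an anchor node and contain at most a given number of edges. It rests on assumptions that are not taken from the article: a path is treated as a sequence of atom labels along a walk that never revisits a node, the molecule is stored as an adjacency dictionary, and all candidate paths are listed up front (PathBoost instead explores paths during boosting). The function name `enumerate_paths` and the toy molecule are hypothetical.

```python
def enumerate_paths(adj, labels, anchor, max_len=6):
    """Label paths starting at `anchor` with at most `max_len` edges.

    adj:    dict node -> list of neighbouring nodes
    labels: dict node -> atom symbol
    """
    found = set()
    stack = [(anchor, (anchor,))]
    while stack:
        node, visited = stack.pop()
        found.add(tuple(labels[v] for v in visited))
        if len(visited) - 1 == max_len:  # path length = number of edges
            continue
        for nb in adj[node]:
            if nb not in visited:  # assumed: no node is visited twice
                stack.append((nb, visited + (nb,)))
    return found


# Toy molecule: Ni bonded to N and O; N continues into a C-C chain.
labels = {0: "Ni", 1: "N", 2: "C", 3: "C", 4: "O"}
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
print(sorted(enumerate_paths(adj, labels, anchor=0, max_len=2)))
# [('Ni',), ('Ni', 'N'), ('Ni', 'N', 'C'), ('Ni', 'O')]
```

Under this convention the metal centre alone is a path of length zero, and lowering `max_len` prunes every branch of the search that would otherwise run along long carbon chains.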

Regarding the tuning of the algorithm, the optimal number of iterations has been found by five-fold cross-validation, using a patience value equal to 69 following the elbow rule (this choice is discussed in Section 5.3, see also Figure 6). The m_stop value obtained with this procedure is 15 948, resulting in a total of 64 186 considered paths, of which 9 700 were included in the final model.
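The role of the patience parameter can be illustrated with a short sketch. It assumes the common early-stopping convention, in which training stops once the cross-validation error has failed to improve for `patience` consecutive iterations and the best iteration seen so far is returned; the exact definition used by PathBoost is given in Section 3.3.3 and may differ in detail. The function name `stop_with_patience` and the error values are invented for the example.

```python
def stop_with_patience(cv_errors, patience):
    """Return the 1-based iteration with the lowest CV error observed
    before `patience` consecutive iterations pass without improvement."""
    best_err = float("inf")
    best_iter = 0
    for it, err in enumerate(cv_errors, start=1):
        if err < best_err:
            best_err, best_iter = err, it
        elif it - best_iter >= patience:
            break  # no improvement for `patience` iterations
    return best_iter


cv_errors = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70]
print(stop_with_patience(cv_errors, patience=2))  # 3
print(stop_with_patience(cv_errors, patience=3))  # 6
```

A larger patience lets the search pass over a temporary plateau (iterations 4 and 5 in the toy sequence) at the cost of more iterations, which is the trade-off behind choosing its value.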

Figure 4

Main application. Left panel: Test error as a function of the iterations. The vertical line shows the value of m_stop obtained by five-fold cross-validation, the horizontal line the null model (only metal centres). Right panel: Scatter plot contrasting observed and predicted HOMO/LUMO gaps in the test set (at iteration m_stop).

5.2 Results

Figure 4 summarizes the results of the case study. We obtained a test mean square error around 4 × 10⁻⁴, which is comparable to current competitors (see Section 5.3). Interestingly, from the left panel of Figure 4 we can also see that the test error continues to decrease after the stopping iteration has been reached. Therefore, in theory, our model would have profited from some additional iterations. However, the improvement after the selected iteration is limited and this early stop does not influence the quality of the prediction too much. Also note that, with very few exceptions, the predicted and the observed values of the HOMO/LUMO gap in the test set are in general reasonably similar (Figure 4, right panel).

As mentioned in Section 3, one relevant feature of PathBoost is the ability to evaluate the paths’ importance. Figure 5 reports the top 30 paths ranked by their importance. The left panel shows that, unsurprisingly, the most relevant features are the metal centres. As we saw in Scenario 2 of the simulation study, this is partly an artefact of the search algorithm, which always starts from the metal centre and, therefore, assigns to it the first large improvement. However, it may also be related to the characteristics of the problem at hand: the closer an atom is to the metal centre, the larger its chance of affecting the HOMO/LUMO gap. More interestingly, the right panel shows the top 30 paths by importance when excluding the metal centres. Notably, only a few paths of length one are included in the list, although the two most important paths have this length. In contrast, paths that capture ring structures, such as (Zn, C, C, C, C, C), (Zn, C, C, C, C, N), (Ru, N, C, C, C) and (Pt, N, C, C, C, C), are among the top 30, even though they can only enter the model at a later stage (the algorithm needs to pass through the partial cycles to reach them). This does not contradict the chemists’ conjecture that the most relevant paths are those close to the molecular centre (all atoms in a ring structure are indeed close to the molecular centre), but it suggests that particular atomic structures, such as rings, may be key factors for the prediction of the HOMO/LUMO gap.

Figure 5

Main application. Top 30 paths by importance (left panel) and top 30 paths by importance when excluding the paths of length zero (right panel). Here the importance of a path is defined as the sum of the overall improvements with respect to the previous step, scaled so that the most important path has importance 100.
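The importance measure described in this caption can be sketched in a few lines. The sketch assumes that the boosting run has logged, for every iteration, which path was updated and by how much the loss improved over the previous step; the function name `path_importance`, the log format and the numbers are hypothetical, and the deviance-based computation of Section 3.3.2 is not reproduced here.

```python
from collections import defaultdict


def path_importance(updates):
    """updates: iterable of (path, improvement) pairs, one per iteration.

    Sums the improvements per path and rescales them so that the most
    important path has importance 100. Expects at least one positive gain.
    """
    total = defaultdict(float)
    for path, gain in updates:
        total[path] += gain
    top = max(total.values())
    return {path: 100.0 * gain / top for path, gain in total.items()}


log = [
    (("Ni",), 8.0),
    (("Ni", "N"), 1.0),
    (("Ni", "N", "C"), 2.0),
    (("Ni", "N"), 1.0),
    (("Ni", "N", "C"), 2.0),
]
print(path_importance(log))
# {('Ni',): 100.0, ('Ni', 'N'): 25.0, ('Ni', 'N', 'C'): 50.0}
```

In the toy log the metal centre is updated only once but still ranks first, because a single large first improvement outweighs several small later ones.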

Figure 6

Main application. Left panel: Cross-validation error as a function of the patience. Right panel: Test error with the results of a few values of patience (namely 15, 41, 69, 157, 498) highlighted by vertical lines.

5.3 Remarks

In addition to the main results, we can gain further insights into the algorithm’s performance by analysing the following points.

Choice of patience. As shown in Figure 6, the impact of the choice of patience on the test error is not very large, in particular beyond a value of 69. As suggested in Section 3.3.3, we used the elbow rule to select its value, and ended up with patience = 69 (which corresponds to 15 948 iterations). In general, from this analysis, it seems that a more drastic choice (e.g., patience = 15) would not have harmed the accuracy dramatically, yet would have sped up the computations considerably. Importantly, when comparing the variable importance plot obtained with patience = 69 to that obtained at a later stage (25 415 iterations), we did not find any difference in the top 30 list of paths. This means that the reduced test error is only related to small modifications of the effect of existing paths or that the paths added in a later stage are not very influential.

Alternative variable importance. As mentioned in Section 3.3.2, PathBoost also allows us to measure the importance of a variable in terms of relative gain (with respect to the second-best option) rather than absolute gain. The results are reported in Figure S5 of the Supplementary Material, and the differences are minimal. Many of the most important paths listed in Figure 5 are also present in this version, although in different positions. The main difference is that this list contains shorter paths and does not include ring structures, most probably because the information provided by the full cycle is not much larger than that of the same cycle without the last atom.

Comparison with 30 separate implementations of PathBoost. As mentioned in Section 5.1, one could potentially fit 30 separate boosting models by applying PathBoost to the subsets of molecules that share the same metal centre. In our case, the results are very similar in terms of prediction ability, without the need to run 30 separate cross-validation procedures. See Section S.6 of the Supplementary Material for further details on this comparison.

Comparison with other approaches. To give an idea of the performance of our algorithm relative to its competitors, we look at the results provided by Kneiding et al. (2023). In their study, two algorithms that also explore the molecular structure to predict the HOMO/LUMO gap were applied to the same data. The errors in terms of mean absolute error (MAE in Hartree) of these two methods, SchNet (Schütt et al., 2018) and EdgeUpdate (Jørgensen et al., 2018), are of order 10⁻² (1.2 × 10⁻² and 1.02 × 10⁻², respectively) and they are therefore comparable to ours (MAE = 1.5 × 10⁻²). In contrast to our algorithm, though, SchNet and EdgeUpdate do not provide any variable importance measure, making it more challenging to identify the structural characteristics driving the predictions. Kneiding et al. (2023) also showed that a message passing neural network (MPNN) developed by Gilmer et al. (2017) reduces the MAE by one order of magnitude (6.02 × 10⁻³). This improvement is achieved by feeding the model with additional information about the nodes (atoms) and edges (bonds). As for SchNet and EdgeUpdate, there is no variable importance measurement or any other tool to improve explainability. For a comparison of our method with plain applications of XGBoost (after a pre-processing step to tabulate the data), see Section S3.3 in the Supplementary Materials.

6 Conclusions

In this article, we proposed a boosting algorithm, PathBoost, that simultaneously explores a graph and fits a model to predict the HOMO/LUMO gap of a chemical compound. Although the algorithm has been developed for this particular application, it may be easily extended to other contexts in which the input data are provided as a graph with a designated anchor node. The main characteristic of the algorithm, namely the iterative expansion of the feature space, could potentially be very useful also outside the considered context of graph-based input data. In general, a similar type of boosting approach could be applied in situations where there is a potentially very high-dimensional feature space, yet each potential feature can be considered an extension or combination of some initial simple seed features, for example, in search of relevant interaction terms based on an initial set of input variables.

Concerning the target application, instead, it would be useful to increase the prediction ability of the model by incorporating the characteristics of the atoms and edges. Currently, our algorithm is only able to explore the molecular connectivity (graph topology), without considering quantities such as the atomic number of the atoms, the strength of the chemical bonds and so on. The inclusion of this additional information has been shown to be useful in the context of neural networks, where Kneiding et al. (2023) managed to reduce the error of a neural network by one order of magnitude. It seems very reasonable that the same will be true for a boosting algorithm and we are planning to extend PathBoost in this direction.

Considering the strength of a chemical bond in a boosting algorithm such as PathBoost can also be useful in the definition of molecular representation. By only including edges with a relevant impact on the result, indeed, our algorithm can suggest which bonds should be represented/plotted and which should stay implicit. Since this use is more connected to variable selection than to prediction ability, though, one needs to be particularly careful with the choice of the tuning parameter.

One advantage of our algorithm is the ability to evaluate the variable importance. Our current implementation, based on the deviance, is simple and generally effective, but may face some difficulties in specific situations. For example, in the case of nested relevant paths, it is not guaranteed that the variable importance is correctly allocated among the paths (shorter paths tend to be favoured). Handling cases like this is not straightforward and the algorithm would benefit from future work on this issue.

Finally, one simplification directly connected to our chemistry application is the presence of a single anchor node, the metal centre. In theory, the algorithm can be extended to situations in which this assumption is relaxed, that is, working with multiple anchor nodes (though well-defined and limited in number). While the graph exploration could proceed similarly as here, additional care should be devoted to the paths’ definition, to avoid duplications when the paths contain multiple anchor nodes.

Footnotes

Acknowledgements

The authors would like to thank: Hannes Kneiding for his help with the data, Andreas Mayr for initial discussion and the two anonymous reviewers for their helpful comments.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: CM: CompSci (EU Horizon 2020 MSCA, n. 945371). JP: Integreat (RCN, n. 332645). DB: catLEGOS (RCN, n. 325003), Hylleraas Centre (RCN, n. 262695), NOTUR (n. NN4654K, NS4654K). RDB: Integreat (RCN, n. 332645), Plumbin’ (RCN, n. 323985).

Supplementary materials

References

Balcells and Skjelstad (2020) tmQM dataset—quantum geometries and properties of 86k transition metal complexes. Journal of Chemical Information and Modeling, 60, 6135–46.

Bengio, Courville and Vincent (2013) Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1798–1828.

Bühlmann and Yu (2003) Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98, 324–39.

Butler, Davies, Cartwright, Isayev and Walsh (2018) Machine learning for molecular and materials science. Nature, 559, 547–55.

Chen and Guestrin (2016) XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pages 785–94.

Dara, Dhamercherla, Jadav, Babu and Ahsan (2022) Machine learning in drug discovery: A review. Artificial Intelligence Review, 55, 1947–99.

Fei and Huan (2010) Boosting with structure information in the functional space: An application to graph classification. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pages 643–52.

Gilmer, Schoenholz, Riley, Vinyals and Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML). Pages 1263–72.

Hastie, Tibshirani and Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition. New York: Springer.

Jørgensen, Jacobsen and Schmidt (2018) Neural message passing with edge updates for predicting properties of molecules and materials. 32nd Conference on Neural Information Processing Systems, Montreal, Canada.

Khomskii (2014) Transition Metal Compounds. Cambridge University Press.

Kneiding, Lukin, Lang, Reine, Pedersen, De Bin and Balcells (2023) Deep learning metal complex properties with natural quantum graphs. Digital Discovery, 2, 618–33.

Kneiding, Nova and Balcells (2024) Directional multiobjective optimization of metal complexes at the billion-system scale. Nature Computational Science, 4, 263–73.

Kudo, Maeda and Matsumoto (2004) An application of boosting to graph classification. Advances in Neural Information Processing Systems, 17, 729–36.

Mayr, Binder, Gefeller and Schmid (2014) The evolution of boosting algorithms. Methods of Information in Medicine, 53, 419–27.

Pan, Wu, Zhu, Long and Zhang (2017) Boosting for graph classification with universum. Knowledge and Information Systems, 50, 53–77.

Saigo, Nowozin, Kadowaki, Kudo and Tsuda (2009) gboost: A mathematical programming approach to graph classification and regression. Machine Learning, 75, 69–89.

Scarselli, Gori, Tsoi, Hagenbuchner and Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks and Learning Systems, 20, 61–80.

Schütt, Sauceda, Kindermans, Tkatchenko and Müller (2018) SchNet–A deep learning architecture for molecules and materials. Journal of Chemical Physics, 148, 241722.

Sodhi (2019) Metal complexes in medicine: An overview and update from drug design perspective. Cancer Therapy and Oncology International Journal, 14, 25–32.

Strömer, Staerk, Klein, Weinhold, Titze and Mayr (2022) Deselection of base-learners for statistical boosting—with an application to distributional regression. Statistical Methods in Medical Research, 31, 207–24.

Wang, Kalantar-Zadeh, Kis, Coleman and Strano (2012) Electronics and optoelectronics of two-dimensional transition metal dichalcogenides. Nature Nanotechnology, 7, 699–712.

Wu, Pan, Chen, Long, Zhang and Philip (2021) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32, 4–24.

Yun (2016) Use of transition metal compounds in solar and biomass energy. Nano Energy, 30, 52–9.


A path-based boosting algorithm for exploring transition metal compounds
