Abstract
Empirical performance evaluation is crucial for algorithm configuration and performance optimization. Prior work showed that comparing the running time of two algorithms can be accelerated by evaluating them on strategically selected instances. We explore this approach in the context of automated algorithm configuration, adapting prior methods to leverage empirical performance models and introducing two active learning-inspired methods. We evaluate these methods on two performance comparison situations arising during configuration, achieving speedups of 5 to 3,000 times over the random instance sampling method of state-of-the-art configurators. We then integrate the best methods into the model-based configurator sequential model-based algorithm configurator (SMAC). In two of five running time optimization scenarios, we nearly double the performance gain of SMAC. An ablation study confirms that instance selection drives this improvement, indicating substantial potential for advancing algorithm configuration.
Keywords
Introduction
Many state-of-the-art solvers for non-deterministic polynomial-time hard (NP-hard) problems, such as Boolean satisfiability (SAT) (Falkner et al., 2015; Xu et al., 2008) or mixed integer programming (MIP) (Hutter et al., 2010; Xu et al., 2011), come with parameters that enable users to adapt the inner workings of the solver to the problem instances they are trying to solve. This process of algorithm configuration has traditionally been conducted manually or through simple search procedures such as random search, which is still the approach used in many applications, despite the ready availability of tools for efficiently automating the configuration process. The question of finding which parameter values should be used for an algorithm to perform well on a given set of problem instances is known as the automated algorithm configuration (AAC) problem.
The AAC problem can be formally defined as follows (see, e.g., Hoos, 2012).
Given a target algorithm
Note that typically the running time of the target algorithm is limited by an upper bound defined by means of a cutoff time
Comparing the performance of two configurations of a given algorithm is a key element of procedures for solving the AAC problem, since such comparisons are performed many times during the configuration process. However, in an automated algorithm configurator, the most computationally expensive task is to evaluate the quality of candidate parameter configurations. Executing time-consuming runs of the target algorithm on different problem instances to determine which parameter settings achieve the best performance requires substantial resources, and time is often wasted on less promising configurations as well as on instances that require a long running time to solve, regardless of the configuration utilized. With the increasing focus on sustainability, the computational resources and the environmental impact associated with the use of artificial intelligence methods should be put under scrutiny, providing additional incentives not only to configure algorithms but also to reduce the computational cost of AAC. Several lines of research attempt to tackle this problem, mainly focusing on the idea of discarding configurations that are not sufficiently promising. For anytime algorithms, such as machine learning methods, there has been work on early stopping less promising runs based on learning curves (see, e.g., Domhan et al., 2015; Luo et al., 2019), while adaptive capping mechanisms, such as the ones included in paramILS and irace (Hutter et al., 2009; López-Ibáñez et al., 2016), permit the early stopping of evaluations of configurations unlikely to be competitive with previously evaluated ones. Those lines of research are focused on the idea of discarding configurations deemed insufficiently promising.
On the other hand, in our previous work (Matricon et al., 2021), inspired by the field of active learning, we explored the idea of selecting on which instance to compare two given algorithms and introduced the per-set efficient algorithm selection (PSEAS) problem. This problem appears during AAC, albeit in a slightly different form. Rather than selecting an algorithm, the configurator needs to select a specific configuration of an algorithm among others. Building on research from several areas, we try to identify instances that help discriminate between the compared configurations. We argue that carefully selecting instances and avoiding long evaluations that provide only a limited amount of information allow the configurator to decide faster whether or not it should reject less promising configurations.
The PSEAS problem is solved by using empirical performance data from other algorithms as a prior to estimate which instances have more discrimination potential. The comparison of two configurations of a single algorithm during configuration lacks access to this prior knowledge. However, model-based algorithm configurators, such as the sequential model-based algorithm configurator (SMAC), learn an empirical performance model that can serve as a prior for instance selection. Moreover, access to a model allows us to build on active learning methods, since they also address the question of which instance should be evaluated next, albeit with a slightly different goal. Our contributions are as follows: We define two phases of comparison within the configuration process. The first phase takes place on a subset of instances on which configurations have been run before, and the second on instances never seen before. We adapt the two best-performing selection methods from our previous work (Matricon et al., 2021) with the performance model used in model-based configurators, add two methods inspired by the active learning literature (Gu et al., 2014), and evaluate them on five benchmarks. Our empirical evaluation shows that, depending on the problem instances and their running time distribution, the decision to stop evaluating a less promising configuration can be reached 5 to 3,000 times faster than with random sampling, the method used in current state-of-the-art configurators. We implement within SMAC (Hutter et al., 2011b) a statistical test to discard configurations as soon as there is sufficient statistical evidence for doing so, as well as the selection mechanisms that showed the best result on our initial evaluation. We evaluate the resulting configurator on five configuration scenarios from the literature. We find that the method shows much potential, almost doubling over the improvement reached by vanilla SMAC on two out of five scenarios. We perform an ablation study to verify that the selection mechanism is indeed responsible for the observed improvements. We find that when used in a stand-alone fashion, the statistical testing mechanism we use degrades the performance of SMAC, confirming that the origin of the improvement lies in the selection, or the combined effect of both new mechanisms.
The remainder of this article is organized as follows: Section 2 explains the comparison phases within an automated algorithm configurator and the methods we adapted, Section 3 describes the setup of our computational experiments, the datasets, and the implementation decision we made, Section 4 shows the obtained result when evaluating the methods separately on the two previously described phases without integrating them into the full configuration process, and Section 5 shows the results obtained as a result of these enhancements.
Comparison of Two Configurations
We want to efficiently compare two configurations of a single algorithm. To do so, we need to gather sufficient statistical evidence while using the least possible amount of computing time.
Instance Selection
Following the definition of AAC in the Introduction,
When comparing a challenger configuration
In both cases, we seek to iteratively choose an instance

Workflow of our comparison strategy.
In phase 1,
In phase 2, we also have a subset
In prior work (Matricon et al., 2021), we have used selection methods to decide which of two given solvers for an NP-hard problem performs best. To do so, each instance is assigned a score designed to reflect the relevance of choosing that instance both in terms of information obtained and cost incurred. The highest-scoring instance is chosen iteratively until one solver is deemed to have shown better performance than the other. Since we are working with similar types of solvers, we expect that a similar approach would be promising in our situation. We assign scores to instances from
Differing from the context of PSEAS, we have access to an EPM trained on all pairs
We adapted two of the best methods tested on PSEAS to support the partial-information context. Note that these methods do not take advantage of the model in phase 1, while in phase 2, they are using the predictions given by the model as if they were ground truth. We did not adapt the information-based method, as it relies on assumptions regarding the performance distribution that could not be made in the current context. For PSEAS, this method relied on distributions estimated on the runs of
Baseline: Uniform Random Sampling
This is equivalent to assigning every instance the same score, and thus sampling an instance uniformly at random.
Discrimination
This method, originally inspired by the work of Gent et al. (2014), aims to choose the instance that best discriminates between the best and other configurations. We say that a configuration
Variance
This approach is based on the intuition that an instance with high variance is likely to discriminate between two configurations. To also take into account the cost of running this instance, we divide the variance by the mean running time of the instance. Note that, according to the literature (Matricon et al., 2021), the underlying distribution of running times follows a Cauchy distribution and would thus not have a well-defined mean or variance. However, due to running times being bounded by 0 and the cutoff time, it is a truncated Cauchy distribution, which is well-behaved and has a mean and a variance. Our score is thus the relative variance
Uncertainty–Diversity–Density
This method is inspired by the work of Gu et al. (2014) from the active learning literature mentioned earlier. We decided to take the core ideas for their classification model and adapt it to our regression model. We named this approach uncertainty–diversity–density (UDD), because it is based on a combination of three scores: uncertainty, diversity, and density. All three scores are scaled and translated to the interval
Uncertainty
This corresponds to UDD with
Stopping Criterion
At each phase, we need to decide when we consider that sufficient statistical evidence has been gathered. In Algorithm 1, this decision appears in lines 7 and 12. Based on previous works (López-Ibáñez et al., 2016; Matricon et al., 2021), we use a Wilcoxon matched-pairs signed-rank test (Conover, 1998) with a significance level of 0.05.
Experimental Setup
Our first goal is to include the instance selection methods in a configurator at both phases and to assess the impact of these modifications on the performance. However, directly including them without decomposing the mechanism into smaller, more easily analyzed components would give us little to no information about which of the components show the desired impact.
Experiments
We divided our evaluation into two main sections, each subdivided into two research questions. First, we evaluated our methods outside the configurator on artificially generated running time data. We conducted experiments to evaluate the performance of the selection methods, separately for phase 1 and phase 2 defined earlier. We designed two sets of experiments aiming to answer the following questions: How does the selection method perform when comparing a new configuration to the incumbent on the subset of instances for which we already collected information throughout the configuration run as seen in phase 1? How does the selection method perform when comparing a new configuration to the incumbent on all instances, selecting instances for which we did not collect information throughout the configuration run as seen in phase 2?
Then, we included the best-performing methods in a state-of-the-art configurator and evaluated their performance. Since SMAC does not include any statistical test to decide when to stop the evaluation of a challenger configuration, we conducted an ablation study to evaluate the impact of the test we newly introduced without instance selection. This allows us to answer the following questions: Do sophisticated instance selection mechanisms allow us to improve over picking instances uniformly at random? How does the introduction of a statistical test during the comparison impact the performance of the configurator—in our case SMAC?
Datasets
We used configuration scenarios taken from the algorithm configuration library AClib (Hutter, López-Ibáñez et al., 2014) or derived from these. Table 1 provides some details regarding the datasets we used. The number of clusters is computed using the mean-shift (Comaniciu & Meer, 2002) implementation of
Characteristics of the Benchmark Instance Sets.
Characteristics of the Benchmark Instance Sets.
We evaluated our method in two NP-hard problems that have been well-studied in the algorithm configuration literature: SAT and MIP. For each, we chose two widely used datasets from AClib and added a more recent and harder dataset to test the limits of our methods.
For SAT, we used CF and IBM from AClib and generated a new set of cryptography instances based on the work of Nejati and Ganesh (2019) (we used the sha256 encoding, 16 to 60 rounds, and an input size of
For MIP, we used RCW2, REG200 from AClib and added a more difficult dataset based on the work of König et al. (2021), which is comprised of challenging neural network verification problems. For this last dataset, we set the cutoff time to 9,000 s, such that 70% of the instances can be solved by the default configuration before reaching this time limit. For these scenarios, we chose CPLEX, since it is well known in the literature and also prominently used in AClib.
Evaluation Inside a Configurator
We chose two TSP scenarios and three MIP scenarios. Because we had to run many different versions of the configurators, we selected well-studied scenarios with a relatively low time budget. For TSP, we used two datasets from AClib (EAX and LKH on rue-1000-3000) due to their short configuration time (see, e.g., Pushak & Hoos, 2020). For MIP, we used three datasets from AClib (REG200, CLS and RCW2) well-known from the literature (see, e.g., Hutter et al., 2010). For this scenario, we use a cutoff time of 300 s (following Cáceres et al., 2017), since our method is more suited for situations in which runs might be cut off upon reaching the time limit. Using a cutoff time of
Implementation Details
Our implementations are available on GitHub.
1
We used Python 3.9 with the libraries
The UDD method requires a distance function
Since the discrimination and UDD methods have parameters, we tuned those with a simple grid search on a separate scenario (Kissat with the SWGCP dataset from AClib). For discrimination, we evaluated values in
Evaluation Outside of a Configurator
To carry out our empirical investigation, a dataset of configurations and their associated performance scores were required. To obtain such a set, we generated 100 random configurations uniformly at random for each solver and ran them on all instances of the datasets included in the respective configuration scenarios. This allowed us to collect performance data on many pairs of problem instances and algorithm configurations. We used the same random forest model as in SMAC (Hutter et al., 2011b) as an EPM. We trained this EPM on the previously described performance dataset. To evaluate how efficient our methods will be along a configuration run, we trained the EPM on various amounts of performance data: the number of known configurations is in
Evaluation Inside the Configurator
This evaluation required us to include the selection method in the configurator. As our first evaluation is based on the inner working of SMAC, we included the methods in SMAC3 version
We included in SMAC a Wilcoxon matched-pairs signed-rank test (Conover, 1998) with a significance level of 0.05 between the runtime of the incumbent
Due to the large number of statistical tests involved, we need to account for the problem of multiple testing. First, we do not perform tests before running
Due to the large computation time required to evaluate every possible combination of methods between phase 1 and phase 2, we had to carefully select a subset of possible experiments; specifically, we only considered the best-performing methods from the first set of experiments at each phase of the configuration. Because our method involves adding a Wilcoxon test to stop comparisons early, we also evaluate its impact separately, to gain further insights into the observed behavior. To compare the performance of two versions of the configurator, we want to look at the expected best performance of the best found configuration. Since a user would typically run the configurator several times and select the configuration found to perform best on the given training data (which corresponds to the so-called standard protocol), we apply the following protocol: we run the configuration eight times with seeds from 1 to 8, repeatedly sample five runs uniformly at random from that set of 8, and identify the best of these according to performance on the training set. We used 1,000 such samples to estimate the probability distribution of the quality of the result produced by each configurator on each configuration scenario. We then compared the medians of these empirical distributions. This is similar to procedures used in the literature (see, e.g., Anastacio & Hoos, 2020; Pushak & Hoos, 2020).
Execution Environment
The first set of experiments was run on a high-performance compute cluster running CentOS Linux operating system version 8.5. Each node is equipped with two Intel Xeon E5-2683 CPUs with 16 cores and 40 MB cache each, as well as 94 GB RAM. The second set of experiments was run on a high-performance cluster running Rocky Linux operating system version 9.3. Each node is equipped with two AMD EPYC 7543 CPUs with 32 cores and 256 MB of cache each as well as 1 TB of RAM.
Evaluation Outside the Configuration Process
This section describes the results obtained from our first set of experiments. The goal of these experiments was to evaluate how well the selection methods perform at both phases described in Section 2, independently of the whole configuration procedure around. We show aggregated results here, but the raw results and scripts to generate more visualizations are available in our Git repository.
Comparing Configurations on Known Instances
To answer the first question—How does the selection method perform to compare a new configuration to the incumbent on the subset of instances for which we already collected information throughout the configuration run as seen in phase 1?—we consider phase 1 (see Section 2.1). We populate
Figure 2 shows the collected accuracy over the time spent to make the comparison for two examples. Figure 2(a) is a case in which the discrimination and variance methods are significantly more accurate than the three others at any given time, while UDD and uncertainty show lower accuracy than random sampling. Figure 2(b) is a case in which discrimination and variance methods start with an advantage over random but quickly reach the same accuracy; once again, UDD and uncertainty perform substantially worse.

Mean accuracy of the Wilcoxon test (
Figure 3 summarizes the previously described curves by computing the area under the curve (AUC) for all tested amounts of prior data; the higher the AUC, the faster and more accurately the decision can be taken. This visualization allows us to examine how the methods compare and also illustrate the impact of the prior data used on the empirical performance model. In all our scenarios, we can see a clear correlation between the amount of configurations in

Area under the curve of the mean accuracy of the Wilcoxon test (
Regarding the selection methods, randomly sampling instances performs well, but in most cases, the discrimination and variance approaches are superior.
The results on the IBM dataset exacerbate the tendencies observed on others. Compared to datasets where little to no instances remain unsolved within the given cutoff time by a well-chosen configuration, timeouts occur on about a third of the IBM dataset. On the other hand, half of the instances are solvable within a few seconds. This large variation in running times means that selecting the wrong instance can have a dramatic effect on the overall running time and would explain why random sampling does not perform as well on this scenario as on the others. The fact that the UDD and Uncertainty scores do not account for the expected running time on an instance also penalizes them strongly.
To answer the second question—How does the selection method perform to compare a new configuration to the incumbent on all instances, selecting instances for which we did not collect information throughout the configuration run as seen in phase 2?—we consider phase 2 (see Section 2.1). We populate
For each method and each considered pair of

Time used (in s) before deciding that one configuration is better than the other based on a Wilcoxon test (
To evaluate the performance of the selection methods, we computed the median time used to run the instances selected by each of the methods for each prior data and reported it in Table 2. The statistical significance of differences in the medians was tested with a permutation test (significance level of
Median Time in Seconds for Each Method Over Every Tested Prior Data, With Lowest Medians Boldfaced (Statistical Significance According to a Permutation Test With
The results shown in this section indicate that the best-performing methods to discriminate between two configurations of the same algorithm within a limited amount of time are the ones based on the running time variance on each instance and on their discrimination power. We notice that both methods inspired by the active learning literature are not performing as well. While we wanted to assess these methods on our problem, this was to be expected, since they were designed with a different goal in mind. Indeed, the field of active learning focuses on improving the accuracy of the model, whereas we only use a model to avoid having to run each configuration on each instance. Improving the accuracy of this model can serve our purpose, but it is not our final goal.
We note that the experiments reported here made use of randomly chosen configurations of a given algorithm. As a result, the variation in running times between these configurations is much larger than that expected during an actual configuration run, which focuses on high-performance configurations. While this certainly does not invalidate our results, it implies that we should not expect speedups as large as the ones observed in Table 2 when including our methods inside a configurator.
Evaluation Inside a Configurator
As previously shown, applying instance selection and performing a statistical test allows us to spend less time on comparing the performance of two configurations through two expected behaviors: early stopping of the evaluation of less promising configurations and performance comparisons on less time-consuming instances. In this section, we include the instance selection mechanism inside a model-based configurator in order to evaluate if the previously observed results can be translated to the performance of the configurator itself. To do so, we expanded the prominent sequential model-based configurator SMAC. However, since SMAC does not include a statistical test, the two aspects of our methods have to be evaluated separately. First, we evaluate SMAC-IS (SMAC with Instance Selection), a version of SMAC in which we added at both phases of the both parts of the instance selection method, namely a Wilcoxon matched-pairs signed-rank test (Conover, 1998) with a significance level
Impact of the Instance Selection Methods
To answer our first question, we implemented the instance selection mechanisms inside SMAC at the two phases identified earlier and named this new version SMAC-IS. Table 3 shows the median performance values of the best configurations distribution. We validate the statistical significance of the differences with a Mann–Whitney
Median Performance of SMAC-IS With the Selection Methods Random (Rand), Variance-Based (Var), and Discrimination-Based (Disc) at Both Phases.
Median Performance of SMAC-IS With the Selection Methods Random (Rand), Variance-Based (Var), and Discrimination-Based (Disc) at Both Phases.
Note. Boldfaced values are better than those for vanilla SMAC. The lowest median is underlined. All underlined medians are significantly different from others based on a Mann–Whitney
Compared to vanilla SMAC, SMAC-IS improved three of our five scenarios. In particular, the EAX on rue-1000-3000 scenario (Table 3(a)), which showed an improvement by simply adding the statistical test, displays even further improvement with most instance selection methods; at best, from a default performance of
For CPLEX on REG200 (Table 3(e)), SMAC-IS improves slightly over SMAC, but for CPLEX on CLS (Table 3(c)) it does not, despite being able to find a better configuration than the default. At the other end of the spectrum, for LKH on rue-1000-3000 (Table 3(b)) SMAC-IS returns configurations that perform even worse than the default values in half of the cases.
To evaluate the impact of the statistical test, we examined the performance of SMAC-IS with random sampling at both phases, which corresponds to vanilla SMAC with a Wilcoxon test to discriminate between the performance of the incumbent and one of the challenger configurations at both phases of the configuration. We dub this variant SMAC-W (SMAC with Wilcoxon test). We show the median of those distributions in Table 4(a). Similarly to the previous results, we validated the statistical significance of the differences using a Mann–Whitney
Comparison of SMAC and SMAC-W, Respectively Without and With a Wilcoxon Test to Decide Whether a Challenger Configuration Should be Kept Longer.
Comparison of SMAC and SMAC-W, Respectively Without and With a Wilcoxon Test to Decide Whether a Challenger Configuration Should be Kept Longer.
In all except one scenario, the use of a statistical test for early stopping of the comparison has a negative effect. To further investigate these results, we looked at how many times the challenger replaces the incumbent and on how many instances the configurations are evaluated. These results are shown in Table 4(b). We note that the number of accepted incumbents during a run is significantly lower when using the test. Vanilla SMAC accepts the incumbent up to twice as often than SMAC-W for CPLEX on CLS. Moreover, since incumbents get rejected more quickly, the number of instances on which the configurations are evaluated does not increase as quickly in SMAC-W as in SMAC. We can expect that running on a smaller number of instances prevents the configurator from seeing the full range of instances on which the algorithm should perform well, leading to overfitting. This is especially evident for the LKH scenario, on which the expected performance of SMAC-W is worse than the default on the test set. We also noticed that the only case in which the number of instances seen during configuration is higher for SMAC-W corresponds to the only scenario in which SMAC-W performs better than SMAC. Based on those results, we can answer our research question and state that in most cases, adding a statistical test to SMAC hinders its performance.
Since the instance selection mechanism did not allow us to improve over SMAC on all scenarios, we looked into the characteristics of each scenario to better understand what might allow instance selection to reach its full potential.
When we look at each selection phase separately, there is no clear trend in terms of which method performs best at any of those. One expectation was that for scenarios with a low running time, the overhead induced by our methods would hinder the process, but the instance selection performed slightly better than SMAC on one out of our two scenarios with short running times (CPLEX on CLS and REG200), so this hypothesis does not hold in our experiments. Another expectation was that the homogeneity of the dataset would strongly impact the ability to select the right instances and to accurately decide which configurations to drop. However, the best and worst outcomes were obtained on the same dataset, rue-1000-3000, on which we found nine clusters of instances when applying a simple mean shift algorithm, which is the highest number among our datasets. Moreover, two seemingly homogeneous datasets, namely CLS and REG200, show very different outcomes. However, the number of clusters does not capture how far those clusters are from each other, which would impact the difficulty to select representative instances.
Thus, based on our results, we do not see a clear trend regarding what kinds of scenarios would benefit (or not) from our instance selection mechanism. We note, however, that in two out of five scenarios, we were able to nearly double the improvement obtained by SMAC. This improvement demonstrates that in some scenarios, selecting the instances on which to run the configurations at hand can significantly improve the performance of a general-purpose algorithm configurator.
Conclusion
Inspired by the success of instance selection when comparing algorithms (Matricon et al., 2021), we adapted four methods from several fields (Gu et al., 2014; Matricon et al., 2021) that could be applied to select instances in the context of AAC. We identified two steps of AAC procedures at which the selection mechanism could be applied and designed two sets of experiments to assess the performance gains thus obtainable. In the first, we consider a situation in which the performance of an incumbent configuration on a set of instances is known and we want to determine whether the challenger configuration, whose performance is unknown, performs better on this set. In the second, two similarly performing configurations have to be evaluated on unknown instances. Our results show that in both cases, there is considerable potential in the use of those methods, in particular the ones based on the variability in running time or on discrimination power.
Based on those encouraging results, we included the two best selection mechanisms identified in the first phase of our study at both identified steps of the configuration process within the prominent and state-of-the-art SMAC3 configuration system. On half of the considered scenarios, selecting on which instances to run the first and second phases, on top of performing a Wilcoxon test to decide when to stop the comparison between the current incumbent and a challenger configuration, makes it possible to achieve better performing configurations within the same configuration budget, sometimes reaching major improvement compared to SMAC. However, we not yet found a clear way to decide which instance selection method to apply or which scenarios have the potential to benefit from them. Moreover, we studied the impact of solely adding the Wilcoxon test and found that, in most scenarios, using the test degrades the configuration process of SMAC. We note that on the scenarios we have studied, use of the test lowers the number of accepted challengers, likely discarding well-performing configurations by mistake, and tends to slow down the addition of new instances to the pool of instances on which configurations are evaluated. This second point could potentially lead to a form of over-fitting. Those observations confirm that the selection mechanism is the one leading to the observed improvements.
This work opens the door to a more principled way of deciding on which instances the configurations should be evaluated. While more research is needed to decide which specific method to apply in practice, selecting instances during AAC shows great potential.
Footnotes
Author Contributions
Marie Anastacio was leading the project. She wrote code, planned and ran experiments, analysed the results and wrote the paper. Théo Matricon wrote a large part of the code, participated in designing the experiments and analysing the results, and assisted with writing the paper. Holger H. Hoos supervised the project, advised on the design of experiments and methodology, participated in analysing the results and with writing the paper.
Funding
The authors acknowledge support through an Alexander von Humboldt Professorship in Artificial Intelligence held by Holger H. Hoos.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
