Sage Journals: Discover world-class research

Abstract

This paper investigates the general problem of comparing multidimensional simulation output with a given data set (e.g., real-world historical data). This problem frequently arises in verification, validation, and calibration of simulation models with spatial output statistics as in weather/climate, epidemic, swarm/crowd, social systems, communication networks, and many other applications where the simulation output is distributed across various locations or geographical regions. In the case of univariate simulation output, two-sample statistical hypothesis tests such as the t-test are commonly used. For simulation models with multidimensional and spatial output statistics, the Hotelling’s two-sample test is widely used as the benchmark method in the simulation literature. However, the Hotelling’s test assumes that the two samples come from multivariate Gaussian distributions with equal covariance matrices, which may not be the case in many applications. To address this gap, this paper proposes a double-bootstrap method based on the Wasserstein distance for comparing two multidimensional samples. Unlike the Hotelling’s test and other parametric approaches, the proposed method does not require restrictive distributional assumptions, enabling a wider range of applications and contributing to verification, validation, and calibration of simulation models with multidimensional output. Computational experiments are performed to assess the test power, and the results indicate that the proposed method outperforms the Hotelling’s test and various other approaches. The proposed method’s applicability is illustrated through two examples related to random walk of swarm particles on a two-dimensional space and a realistic engineering application involving simulation of unmanned aerial vehicle (UAV) communication systems.

Keywords

Stochastic simulation, verification and validation, bootstrapping, multivariate two-sample testing, Wasserstein distance, optimal transport

1. Introduction

Stochastic simulation is a common analysis tool for modeling a wide range of man-made and natural systems, namely manufacturing,¹ healthcare,² supply chains,³ marketing,⁴ military,⁵ education,⁶ and weather/climate.⁷ Regardless of the application area, the need to compare the simulation output against a given set of data arises in different steps of the simulation modeling and analysis process.⁸ For instance, model validation requires comparing the simulation output against historical data from the real-world system under study, simulation models of queueing systems are verified by comparing the simulation output against an expected distribution such as analytical results obtained from queueing theory, and calibrating simulation models involves finding a parameter configuration whose outputs closely align with historical data and expected behavior of the system being modeled. This paper deals with the general problem of comparing multidimensional simulation output with a given data set (hereafter referred to as the target data). The multidimensional case is common in simulations with spatial output statistics such as temperature distribution or infection spread across geographical regions in weather and epidemic simulations as exemplified in Figure 1.

Figure 1.

Examples of spatial simulation output: (a) Temperature distribution predictions across the United States by the National Weather Service’s High-Resolution Rapid Refresh (NWS HRRR) simulation model. (b) Infections across neighborhood tabulation areas in an agent-based simulation of Covid-19 spread in New York City.⁹

In the field of simulation, two-sample statistical hypothesis testing is predominantly used for performing such comparisons to determine whether the simulation output differs from the target data.⁸ For the case of univariate simulation output (e.g., time in queue or number in system), parametric and nonparametric tests such as the two-sample t, paired-t, F-test, Wilcoxon signed-rank or rank sum test, and the Mann–Whitney U test are commonly used. The choice of the test depends on the type of statistics being compared (e.g., equality of the means or medians) and whether the two samples meet the test’s underlying assumptions such as normality, equality of variances, and sample size requirements.⁸ For multivariate and multidimensional simulation output, the Hotelling’s two-sample $T^{2}$ test¹⁰ is the benchmark and most common approach in the simulation literature.^11,12 However, a critical downside of the Hotelling’s test is that, similar to other parametric approaches, it requires assumptions regarding the form of the underlying distribution of the two samples. In particular, the Hotelling’s test assumes the two samples come from multivariate Gaussian distributions with equal covariance matrices. If those assumptions are true, the Hotelling’s test is shown to be one of the most powerful tests asymptotically.¹³ However, in many simulation applications, the spatial output statistics do not follow Gaussian distributions and/or the two samples being compared do not come from distributions with equal variances. This significantly limits the applicability of the Hotelling’s test and necessitates alternative methods to support model verification, validation, and calibration in such cases.
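For readers who want to see where the equal-covariance assumption enters, the following is a minimal Python sketch of the textbook two-sample Hotelling $T^{2}$ test with a pooled covariance matrix. It is an illustration rather than the implementation used in the experiments of this paper; the function name `hotelling_t2_two_sample` is hypothetical, and the code assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats


def hotelling_t2_two_sample(x, y):
    """Textbook two-sample Hotelling T^2 test (illustrative sketch).

    x, y: arrays of shape (n1, d) and (n2, d).
    Returns (t2, f_stat, p_value).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, d = x.shape
    n2 = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled covariance: this is where the equal-covariance assumption enters.
    s_pooled = ((n1 - 1) * np.cov(x, rowvar=False)
                + (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    s_pooled = np.atleast_2d(s_pooled)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s_pooled, diff)
    # Under H0 and Gaussianity, the scaled T^2 follows F(d, n1 + n2 - d - 1).
    f_stat = (n1 + n2 - d - 1) / (d * (n1 + n2 - 2)) * t2
    p_value = stats.f.sf(f_stat, d, n1 + n2 - d - 1)
    return float(t2), float(f_stat), float(p_value)
```

Because the statistic depends on the data only through the difference of the sample means and the pooled covariance, two samples with identical means yield a statistic of zero no matter how their spreads differ.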

This paper introduces a double-bootstrap procedure for comparing two multidimensional samples, which enables statistical comparison of spatial simulation output versus target data in situations where the Hotelling’s test is not applicable due to violation of its underlying assumptions (i.e., normality and equality of variances). More specifically, the proposed method employs the Wasserstein distance as a probability metric to quantify the distance between two probability distributions, and two levels of bootstrapping are performed to estimate the variance of the distance measure (first bootstrap) and the cut-off point for the test statistic (second bootstrap). The contributions and advantages of the proposed method are:

The proposed method outperforms existing methods in detecting scale differences (i.e., difference in spatial variance) between two multidimensional samples. This is an important contribution because many existing tests, such as the Hotelling’s test, are more sensitive to location differences (i.e., difference in spatial average) but have low or no power for scale differences.

The proposed method does not require restrictive distributional assumptions such as normality, equality of the covariance matrix, or other distributional assumptions of existing parametric methods including the Hotelling’s test. This is an important advantage as it expands practical applications.

The proposed method addresses the general paucity of Wasserstein distance-based bootstrap tests as reported in a literature survey.¹⁴

While the implementation presented in this paper considers the classical Wasserstein distance where the metric space is $R^{d}$ endowed with the Euclidean metric, the proposed double-bootstrap procedure can accommodate any other generalized definition of the Wasserstein distance as well as other metrics of “distance” between probability distributions.¹⁵ Similarly, any method for calculation of the Wasserstein distance can be used, allowing the performance of the proposed method to improve as faster methods become available in the future. This also makes the proposed method applicable to both binned and unbinned data.

The main limitation of the proposed method lies in its high computational time as bootstrapping and Wasserstein distance calculation are both computationally intensive. This limitation is further investigated in the computational experiments presented in this paper and implications for real-world applications are discussed.

The remainder of this paper is organized as follows. The following section reviews the related literature. The proposed Wasserstein distance-based double-bootstrap procedure is then presented. Computational experiments are then performed to compare the power of the proposed method with the Hotelling’s test and several other multidimensional two-sample tests. In addition, two simulation applications with spatial output statistics are provided to show the efficacy of the proposed method and its advantage over the benchmark Hotelling’s test. Finally, conclusions and future extensions are discussed.

2. Literature review

The Hotelling’s two-sample $T^{2}$ test is the benchmark method in the simulation literature for verification and validation of multivariate response models as reported in several review and tutorial papers.^11,12,16 Computational experiments show that the Hotelling’s test is a robust method for validating geostatistical simulations in terms of the spatial average, variance and regional semi-variogram.¹⁷ The Hotelling’s test is also incorporated into multi-criteria simulation-optimization approaches.¹⁸ In another paper, a systematic procedure for monitoring batch processes in semiconductor manufacturing is presented that provides a combined health index of each batch through the Hotelling $T^{2}$ analysis.¹⁹

Besides the field of simulation, the need to test whether two d-dimensional independent samples have the same underlying distribution is a common problem that arises in many other research areas and applications, namely astronomy,²⁰ psychology,²¹ biology,^22,23 medical,²⁴ and educational research.²⁵ The literature on multidimensional two-sample testing contains numerous parametric and nonparametric approaches that are less common in simulation research and practice. The review here provides examples of some of the main classes of approaches bearing in mind that it is impossible to cover the breadth of this massive body of work in a short note like this. Table 1 provides a summary comparison of the approaches reviewed here that are also included in the computational experiments in the following section.

Table 1.

Positioning of the proposed method and a summary comparison with selected existing approaches.

Method	Distributional assumptions	Dimension limits	Data type
Hotelling’s $T^{2}$ test¹⁰	Gaussian with equal covariance matrix	Any dimension	Unbinned data
Peacock test²⁶	Nearly distribution-free	Two-dimensional	Unbinned data
Fasano and Franceschini²⁷	Nearly distribution-free	Three dimensions	Unbinned data
Székely and Rizzo²⁸	Distribution-free	Any dimension	Unbinned data
Aslan and Zech²⁹	Distribution-free	Any dimension	Unbinned data
Gretton et al.³⁰	Distribution-free	Any dimension	Unbinned data
The proposed method	Distribution-free	Any dimension	Binned and unbinned data

Some of the univariate tests have been extended to multivariate settings. For example, the two-sample Kolmogorov–Smirnov test was generalized to two dimensions by Peacock,²⁶ and later to the three-dimensional case by Fasano and Franceschini,²⁷ both of which are summarized in Table 1. Multivariate two-sample tests analogous to the Wald–Wolfowitz runs test have been introduced.^31,32 There are also multivariate rank-based tests.^33–35 Several multivariate nonparametric methods have been proposed that can be used for high-dimension, small-sample-size settings.^32,36,37 Distribution-free tests based on the idea of optimal non-bipartite matching³⁸ and shortest Hamiltonian path³⁹ have also been proposed.

More recently, three other families of tests have emerged and gained popularity. The first comprises kernel-based tests, such as the one proposed by Gretton et al.³⁰ The second is a class of tests based on energy statistics,²⁸ i.e., functions of interpoint distances, which have thorough theoretical foundations although the results are often only asymptotic. While these two classes of tests seem fairly different at first glance, it has been shown that there is a close relationship between the energy distance and kernels.⁴⁰ Two energy tests by Székely and Rizzo²⁸ and Aslan and Zech²⁹ as well as the kernel-based test by Gretton et al.³⁰ are summarized in Table 1 and included in the test power comparison experiments presented later in this paper. The third family of tests involves Wasserstein two-sample testing. In a survey of these tests,¹⁴ their connection to multivariate methods involving energy statistics and kernel-based tests as well as univariate methods like the Kolmogorov–Smirnov test is discussed. The survey paper also points out the lack of attention in the current literature to Monte Carlo techniques like bootstrap tests, which is another gap addressed by the proposed method in this paper besides its contributions to the simulation field as highlighted in the previous section.

3. The proposed method

The proposed method uses the Wasserstein distance to measure the distance between two multidimensional samples and employs a double-bootstrap method to estimate the critical value for the test. This section first provides a brief introduction to Wasserstein distance and then describes the proposed double-bootstrap procedure.

3.1. Wasserstein distance

Wasserstein distances frequently appear in mathematics, optimization, and probability and statistics, and originate from the inquiry into how to optimally transport a pile of earth into a hole of equal volume but different shape⁴¹ and a later formulation by Kantorovich⁴² that allows splitting among multiple destination holes. Wasserstein distances are used in statistics in several ways as categorized in a review paper.⁴³ In particular, the Wasserstein distance is commonly used as a probability metric to quantify the distance between probability distributions.¹⁵ In simple words, when comparing two distributions, one can be seen as a mass of earth and the other as a collection of holes, where the Wasserstein distance measures the least amount of work needed to fill the holes with earth, in other words, the minimum effort required to reconfigure the probability mass of one distribution to reconstruct the other distribution. The following definition involves the classical case where the metric space is $R^{d}$ endowed with the usual Euclidean metric. However, it can be generalized to laws defined on much more general spaces. Since the proposed double-bootstrap method is applicable regardless of how the Wasserstein distance is defined or calculated, an assessment of other cases is out of the scope of this work as it would unnecessarily add to the length and technical complexity of this paper with no clear added practical value for simulationists.

Given an exponent $p \geq 1$ , the $p$ -Wasserstein distance between probability measures $μ$ and $ν$ on $R^{d}$ with finite $p$ -moments is defined as follows:

W_{p} (μ, ν) = \left( \inf_{γ \in Γ (μ, ν)} \int_{R^{d} \times R^{d}} \| x - y \|^{p} \, d γ (x, y) \right)^{1 / p},

where $Γ (μ, ν)$ is the set of all joint probability measures $γ$ on $R^{d} \times R^{d}$ whose marginals are $μ$ and $ν$, i.e., satisfying $γ (A \times R^{d}) = μ (A)$ and $γ (R^{d} \times B) = ν (B)$ for all Borel subsets $A, B \subset R^{d}$. Elements $γ \in Γ (μ, ν)$ are called couplings of $μ$ and $ν$, that is, joint distributions on $R^{d} \times R^{d}$ with prescribed marginals $μ$ and $ν$ on each axis. It is important to note that $W_{p}$ represents a proper distance in that it (1) is nonnegative, (2) is symmetric in $μ$ and $ν$, and (3) satisfies the triangle inequality. In the literature, the term optimal transport refers to the optimization problem defining the Wasserstein distance. In the discrete case of the formulation by Kantorovich,⁴² an intuitive interpretation of the above analytic definition is that, given a $γ \in Γ (μ, ν)$ and any pair of locations $(x, y)$, the value of $γ (x, y)$ represents the proportion of $μ$’s mass at $x$ that ought to be transported to $y$ in order to reconfigure $μ$ into $ν$, and the interpretation of $W_{p} (μ, ν)$ as the minimal effort required to recover $ν$’s mass distribution from that of $μ$ follows by quantifying the cost of moving a unit of mass from $x$ to $y$ by $\| x - y \|^{p}$. Below, the discrete optimization problem is defined more formally as it is related to the Wasserstein distance calculations in the computational experiments presented in the following sections. The discretized view of the Wasserstein distance also naturally lends itself better to spatial simulation output statistics that are distributed over pre-defined geographical regions such as counties or states.

Let $μ$ and $ν$ be only accessible through two discrete samples $X = \{x_{i}\}_{i = 1}^{n_{1}}$ and $Y = \{y_{i}\}_{i = 1}^{n_{2}}$, respectively. The corresponding empirical distributions are given by $\hat{μ} = \sum_{i = 1}^{n_{1}} π_{i} δ_{x_{i}}$ and $\hat{ν} = \sum_{i = 1}^{n_{2}} β_{i} δ_{y_{i}}$, where $δ_{x}$ is the Dirac function at location $x \in R^{d}$ and $π_{i}$ and $β_{i}$ are probability masses, i.e., $\sum_{i = 1}^{n_{1}} π_{i} = \sum_{i = 1}^{n_{2}} β_{i} = 1$. Let $\hat{Γ} = \{γ \in (R^{+})^{n_{1} \times n_{2}} \mid γ 1_{n_{2}} = \hat{μ}, γ^{T} 1_{n_{1}} = \hat{ν}\}$ be the set of probabilistic couplings between the two empirical distributions where $1_{n}$ denotes an $n$-dimensional vector of ones. Then, the discrete optimization problem can be expressed as follows:

γ^{*} = \arg\min_{γ \in \hat{Γ}} \langle γ, C \rangle, \qquad (1)

where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot product (i.e., $\langle A, B \rangle = \mathrm{Tr} (A^{T} B)$) and $C$ is the cost matrix whose entries are given by $\| x_{i} - y_{j} \|^{p}$.
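The discrete problem in Equation (1) is a linear program, so for small samples it can be solved with a general-purpose solver. The Python sketch below is only an illustration under that observation: it uses `scipy.optimize.linprog`, not the faster method used in the experiments of this paper, and the function name `discrete_wasserstein` and its arguments are hypothetical. It scales poorly (the constraint matrix has $n_{1} n_{2}$ columns) and is meant for checking small cases.

```python
import numpy as np
from scipy.optimize import linprog


def discrete_wasserstein(x, y, a=None, b=None, p=2):
    """Solve the discrete optimal transport LP and return (W_p, coupling).

    x, y: point locations, shapes (n1, d) and (n2, d).
    a, b: probability masses on x and y (uniform if omitted).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x.reshape(x.shape[0], -1)
    y = y.reshape(y.shape[0], -1)
    n1, n2 = x.shape[0], y.shape[0]
    a = np.full(n1, 1.0 / n1) if a is None else np.asarray(a, dtype=float)
    b = np.full(n2, 1.0 / n2) if b is None else np.asarray(b, dtype=float)
    # Cost matrix C_ij = ||x_i - y_j||^p (Euclidean norm).
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** p
    # The coupling is flattened row-major: entry (i, j) sits at index i * n2 + j.
    a_eq = np.zeros((n1 + n2, n1 * n2))
    for i in range(n1):
        a_eq[i, i * n2:(i + 1) * n2] = 1.0   # row sums equal a
    for j in range(n2):
        a_eq[n1 + j, j::n2] = 1.0            # column sums equal b
    b_eq = np.concatenate([a, b])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    gamma = res.x.reshape(n1, n2)
    # Guard against a tiny negative objective caused by round-off.
    return max(res.fun, 0.0) ** (1.0 / p), gamma
```

The returned distance is the optimal objective value raised to the power $1 / p$, matching the definition of $W_{p}$ given earlier in this section.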

3.2. The double-bootstrap procedure

Bootstrapping is a simulation procedure based on resampling of the data. The interested reader can refer to one of several reference books for more information about such techniques.⁴⁴ Bootstrapping methods have also been used in the field of stochastic simulation for input modeling, parameter estimation, and output analysis,^45–47 including when comparing ordinal statistics of two samples, such as percentiles, for which no standard parametric test exists.

Algorithm (1) presents the proposed double-bootstrap method, which can be summarized as follows. The two samples are first combined under the null hypothesis that they come from the same underlying multivariate distribution. Pooling the samples is also expected to produce tighter confidence intervals than if the data were not pooled. Then, two bootstrap procedures are performed. Each iteration of the first bootstrap procedure involves drawing (with replacement) two samples from the pooled data and calculating the Wasserstein distance between them. At the end of the first bootstrap runs, a $t$-like statistic is formed using the variance of the bootstrapped Wasserstein distances. In order to compute the threshold cutoff value for the test, a second bootstrap procedure is performed by replicating the above process to obtain multiple observations of the $t$-like statistic, which are then used to calculate the $100 (1 - α)$th percentile of the test statistic. The outcome of the test at an $α$ level of significance is then determined by comparing the test statistic formed based on the original samples with the cutoff value produced at the conclusion of the second bootstrap procedure. The code for a sample implementation in MATLAB including a minimal working example is available on the author’s website.
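The summary above can be sketched in Python as follows. This is one possible reading of the verbal description, not a transcription of Algorithm (1) or of the MATLAB implementation: in particular, the exact form of the $t$-like statistic (here, the observed distance divided by the standard deviation of the first-level bootstrap distances), the function names, and the default replication counts are all assumptions made for illustration. The distance is passed in as a callable; the usage lines plug in SciPy's one-dimensional `wasserstein_distance` purely to keep the example fast.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def _t_like(x, y, dist, n_inner, rng):
    """Assumed t-like statistic: dist(x, y) over the bootstrap std. dev."""
    pooled = np.concatenate([x, y])          # pool under the null hypothesis
    inner = np.empty(n_inner)
    for b in range(n_inner):                 # first-level bootstrap
        xb = pooled[rng.integers(0, len(pooled), len(x))]
        yb = pooled[rng.integers(0, len(pooled), len(y))]
        inner[b] = dist(xb, yb)
    return dist(x, y) / inner.std(ddof=1)


def double_bootstrap_test(x, y, dist, alpha=0.05,
                          n_outer=100, n_inner=50, seed=None):
    """Return (t_obs, cutoff, reject) for H0: same underlying distribution."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    t_obs = _t_like(x, y, dist, n_inner, rng)
    pooled = np.concatenate([x, y])
    t_null = np.empty(n_outer)
    for b in range(n_outer):                 # second-level bootstrap
        xb = pooled[rng.integers(0, len(pooled), len(x))]
        yb = pooled[rng.integers(0, len(pooled), len(y))]
        t_null[b] = _t_like(xb, yb, dist, n_inner, rng)
    cutoff = np.quantile(t_null, 1.0 - alpha)
    return float(t_obs), float(cutoff), bool(t_obs > cutoff)


# Usage with a stand-in 1-D distance; a multidimensional distance can be
# substituted without changing the procedure.
demo_rng = np.random.default_rng(0)
sample_a = demo_rng.normal(0.0, 1.0, 200)
sample_b = demo_rng.normal(0.0, 3.0, 200)   # same mean, larger scale
result = double_bootstrap_test(sample_a, sample_b, wasserstein_distance, seed=1)
```

The cost is roughly `n_outer * n_inner` distance evaluations, which is why the speed of the distance calculation dominates the running time of such a procedure.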

4. Computational experiments

In statistics, hypothesis tests are assessed based on two metrics: power and validity. The test power refers to the probability of correctly rejecting the null hypothesis when the two samples are from different distributions (i.e., 1 – type II error). The test validity (which is related to type I error) concerns the probability of incorrectly rejecting the null hypothesis when the two samples are from the same distribution. When these probabilities cannot be derived analytically, as is the case in this paper due to the complexity of the Wasserstein distance and the relaxation of distributional assumptions, the test power and validity are assessed asymptotically via computational experiments. This section presents the results of three sets of experiments. In the first set of experiments, the proposed method’s power is assessed by repeating the test many times on samples taken from different distributions to compute the probability of rejection. The proposed method’s power is compared with several existing two-sample tests, namely the Hotelling’s $T^{2}$ test,¹⁰ Peacock test,²⁶ Fasano and Franceschini test,²⁷ two more recent tests based on energy statistics by Székely and Rizzo²⁸ and Aslan and Zech,²⁹ and a kernel-based test by Gretton et al.³⁰ In the second set of experiments, the proposed method’s validity is assessed by repeating the test many times under the null hypothesis (i.e., with samples taken from the same distribution) to show that the empirical type I error aligns with the nominal type I error ($α$ or level of significance), supporting that the proposed method is asymptotically valid. Finally, in the third set of experiments, the computational times of the methods are compared.
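Both quantities are estimated the same way: repeat the test on freshly drawn samples and record the proportion of rejections. The Python sketch below illustrates that loop. The helper names `rejection_rate` and `ks_pvalue` are hypothetical, and the univariate Kolmogorov–Smirnov test from SciPy is used only as a quick stand-in for whichever two-sample test is being assessed.

```python
import numpy as np
from scipy import stats


def rejection_rate(test_pvalue, draw_x, draw_y,
                   n_rep=200, alpha=0.05, seed=None):
    """Proportion of replications in which the test rejects H0.

    test_pvalue: callable (x, y) -> p-value.
    draw_x, draw_y: callables taking a numpy Generator and returning a sample.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        x = draw_x(rng)
        y = draw_y(rng)
        if test_pvalue(x, y) < alpha:
            rejections += 1
    return rejections / n_rep


def ks_pvalue(x, y):
    """Stand-in test: univariate two-sample Kolmogorov-Smirnov."""
    return stats.ks_2samp(x, y).pvalue


# Same distribution for both samples: estimates the empirical type I error.
type_one = rejection_rate(ks_pvalue,
                          lambda r: r.normal(0.0, 1.0, 300),
                          lambda r: r.normal(0.0, 1.0, 300), seed=1)
# Different scale for the second sample: estimates the power.
power = rejection_rate(ks_pvalue,
                       lambda r: r.normal(0.0, 1.0, 300),
                       lambda r: r.normal(0.0, 2.0, 300), seed=1)
```

With a valid test the first estimate should sit near `alpha`, and the second should increase as the difference between the two distributions grows.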

4.1. Statistical power comparisons

The methods are assessed under scale (spatial variance) difference scenarios, where two bivariate Normal distributions are compared: one with independent standard normal margins and the other with independent margins with a mean of zero (i.e., same location parameters for the two samples) but a different scale (variance) for one of the margins. The reason behind considering two-dimensional Gaussian distribution scenarios is threefold: (1) this enables a baseline for comparison, by comparing with the Hotelling’s $T^{2}$ test, which is the asymptotically most powerful test but assumes multivariate Gaussian distributions with equal variances; (2) some of the existing tests, such as the Peacock test, are not applicable to higher dimensions; and (3) this makes the results relevant to when the two samples are spatial outputs of stochastic simulation models. Simulation output statistics typically represent averages of large numbers of more basic observations within a replication or the result of batching individual replications, hence the outputs follow an approximately Normal distribution by appealing to the central limit theorem.⁴⁸

Comparison results for a sequence of increasing differences in scale are presented in Figure 2. In line with the common approach in the literature, the proportion of times a test rejected the null hypothesis is used as an estimate of its power. The rejection percentages reported are based on 200 experiments (i.e., 200 tests) under each respective parameter setting at a 5% level of significance (similar results were obtained at other significance levels). In all experiments, a sample size of 5000 is considered. Such sample sizes are common in many applications such as biology (e.g., comparison of cell distributions, which often involves thousands of cells), astronomy (e.g., comparison of mass or light density in large object clusters), social sciences (e.g., multivariate comparisons based on census data collected from a large population), as well as many comparisons that arise in the field of stochastic simulation, where the simulationist has full control over the sample size through the number of simulation replications and large sample sizes are therefore often used to reduce sampling error. In addition, this ensures a sufficient sample size under discretization of the space, which is utilized for calculation of the Wasserstein distance in the implementation used in these experiments via the discrete optimization problem in Equation (1) considering the Euclidean distance to compute the cost matrix C (i.e., with the exponent $p = 2$). More specifically, a grid is superposed over the two samples used in the test with its upper and lower limits determined based on the maximum and minimum values among both samples. The Wasserstein distance is shown to provide robust results even with a coarse grid.⁴⁹ The results presented here correspond to the case where each dimension of the grid is divided into 10 equal-length intervals, while additional experiments with various grid sizes led to the same general findings as those reported here. It is worth noting that a gridding approach is not a requirement as any other method for calculating the Wasserstein distance can be incorporated into the proposed double-bootstrap procedure. Here, the discretized approach is used as it naturally lends itself better to spatial simulation output statistics, which are often distributed over discretized and pre-defined geographical regions such as counties or states.
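The gridding step described above can be sketched in Python as follows. This is an illustrative reading of the description, not the implementation used in the experiments: the function name `grid_samples` is hypothetical, and representing each cell by its center is an assumption. The grid limits come from the minimum and maximum over both samples, and each dimension is split into `bins` equal-length intervals (10 in the reported results).

```python
import numpy as np


def grid_samples(x, y, bins=10):
    """Bin two samples on a shared grid.

    x, y: arrays of shape (n1, d) and (n2, d).
    Returns (centers, wx, wy): cell centers of shape (bins**d, d) and the
    proportion of each sample falling in every cell.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    both = np.vstack([x, y])
    # Shared limits: min and max over both samples, per dimension.
    limits = list(zip(both.min(axis=0), both.max(axis=0)))
    hx, edges = np.histogramdd(x, bins=bins, range=limits)
    hy, _ = np.histogramdd(y, bins=bins, range=limits)
    centers_1d = [(e[:-1] + e[1:]) / 2.0 for e in edges]
    # "ij" indexing plus row-major reshape matches the order of hx.ravel().
    mesh = np.meshgrid(*centers_1d, indexing="ij")
    centers = np.stack(mesh, axis=-1).reshape(-1, x.shape[1])
    return centers, hx.ravel() / hx.sum(), hy.ravel() / hy.sum()
```

The two weight vectors play the role of the probability masses in the discrete optimal transport problem, with the cost matrix computed between cell centers. Its size depends on the number of cells, not on the number of observations.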

Figure 2.

The results of test power comparisons. The horizontal axis is the difference between an entry in the covariance matrix for the two bivariate normal distributions being compared.

As shown in Figure 2, the proposed method reports the highest probability of rejection under all scenarios and shows higher power in detecting small differences in scale as compared to existing methods. For instance, the proposed method’s probability of rejection is more than twice that of four of the methods at the smallest scale difference of 0.05 considered in these experiments. In addition, the proposed method converges to a 100% rejection probability faster, reaching it at a scale difference of 0.2, while all other methods except the Aslan and Zech energy test report a lower rejection probability. It is worth clarifying that the poor performance of the Hotelling’s $T^{2}$ test under these scale difference scenarios was expected as this test assumes multivariate Gaussian distributions with equal spatial variances, which significantly limits its applicability for verification, validation, and calibration of simulation models as will be shown in two simulation applications in the following section.

4.2. Type I error and validity of the proposed method

The validity of the proposed double-bootstrap procedure under the above parameter configurations is empirically investigated. To assess the type I error, 300 independent cases are simulated under the null hypothesis. As in previous experiments, a sample size of 5000 is used. For each case, the test statistic was calculated and compared to the bootstrapped critical value under different $α$ levels varied from 0.025 to 0.975 in increments of 0.025. Figure 3 shows the empirical type I error as a function of $α$ (the nominal type I error) where the two samples being compared are drawn from the same bivariate Gaussian distribution with independent standard normal margins. It is observed that the diagonal line fits the points well. Similar results were obtained under all other parameter settings. Therefore, the empirical results support that the double-bootstrap procedure is asymptotically valid.

Figure 3.

Comparison of the empirical and nominal type I error as the level of significance ( $α$ ) changes.

4.3. Computational time comparisons and practical implications

Figure 4 shows the computational time as the sample size changes. The average execution times are based on 200 experiments under a scale difference of 0.05 from Figure 2. All experiments are implemented in MATLAB and are run on a typical computer with 1.60 GHz Core i5 CPU and 32GB of RAM. The proposed method has the highest execution time for sample sizes 500, 1000, and 5000. This is expected as bootstrapping and Wasserstein distance calculation are computationally costly even though the fast method proposed by Pele and Werman⁵⁰ is used in these experiments for the calculation of the Wasserstein distance. However, as the sample size increases, the execution time of the other methods grows exponentially while it remains fairly flat for the proposed method. This is an advantage of the proposed method and is primarily due to the gridding (or binning) approach used, which transforms each sample into a multidimensional histogram. This effectively makes the Wasserstein distance calculation independent of the sample size. Note that calculating the Wasserstein distance between two given multidimensional histograms takes the same amount of time regardless of the sample size used to generate the two histograms in the first place. The slight increase in the proposed method’s execution time for higher sample sizes is due to the additional computation needed to perform resampling and gridding in each iteration of the bootstrapping. As shown in Figure 4, the proposed method has a shorter execution time than the method by Aslan and Zech²⁹ for sample sizes greater than 10,000, and shorter execution time than the tests by Gretton et al.³⁰ and Székely and Rizzo²⁸ for sample sizes greater than 20,000.

Figure 4.

Computational time (log scale) as a function of the sample size.

4.3.1. Implications for practice

The statistical power and computational time comparisons in Figures 2 and 4 highlight the advantages and limitations of the proposed method and provide important insights for practical applications. Due to its high computational cost, the proposed method is not suitable for applications involving real-time decision-making (e.g., when comparing the simulation output generated by a digital twin with real-world data from the corresponding physical system). However, for applications involving offline decision-making where computational time becomes less relevant, the proposed method is recommended due to its high statistical power. In addition, for applications involving large sample sizes (e.g., large-scale agent-based simulation of swarms or human populations), the proposed method becomes attractive as its computational cost is less sensitive to increases in sample size. In special cases, additional information may be available on the two samples being compared. For example, if the two samples are known to be from two multivariate Gaussian distributions with equal variances, then the Hotelling’s $T^{2}$ test¹⁰ is recommended as it is one of the most powerful tests asymptotically for detecting location differences (i.e., difference between the means of two multidimensional samples). However, if such information is not available, and as long as computational time is not a binding constraint, then the proposed double-bootstrapping method is recommended as it does not require restrictive distributional assumptions.

5. Sample applications with spatial simulation output

This section illustrates the applicability of the proposed method under two simulation applications with spatial output, namely simulation of swarms and unmanned aerial vehicle (UAV) communication systems. Unlike the previous section that focused on performance comparisons where many instances of each problem setting are generated to compute probabilities of rejection, this section aims to mimic how the proposed method will actually be used by simulationists in reality, and thus each problem setting is solved only once as is done in real-world applications. For example, when assessing the validity of a particular model, the simulationist would perform a statistical test to compare the simulation output with real-world data.

In addition, the only other method included in the results presented in this section is the Hotelling’s test. The focus on the Hotelling’s test is driven by two reasons: (1) it is the benchmark method in the field of simulation for comparing multidimensional simulation output, and (2) it has the fastest computational time based on the results from the previous section. Therefore, it is important to show how the Hotelling’s test performs in general applications where the simulation output may not follow any theoretical distribution. If the Hotelling’s test performs well in such applications despite potential deviations from its underlying assumptions, then the use of the proposed method is not justified due to its high computational cost. However, the sample applications in this section show that the Hotelling’s test can lead to incorrect decisions. Since the two examples represent offline decision-making problems where computational time is not a binding constraint for the decision-maker, the proposed method becomes the preferred choice because of its high statistical power. Note that the other five methods are excluded from the results in this section because they reported a lower statistical power than the proposed double-bootstrap method and a higher computational time than the Hotelling’s test, hence they would clearly represent suboptimal choices when the computational time is not a deciding factor.

Finally, it is worth noting that there is an extensive stream of research in the statistics literature on how quickly an empirical distribution $\hat{\mu}_{n}$, formed from $n$ independent samples from a distribution $\mu$, approaches $\mu$ in the Wasserstein distance of any order $p$, i.e., how quickly $W_{p}(\mu, \hat{\mu}_{n}) \to 0$, which is indeed the case in the large-$n$ limit. The interested reader is referred to two such studies⁵¹,⁵² that propose tight upper bounds in general settings. However, a key distinction in the field of simulation is that the simulationist often has control over the sample size and can increase the size of the simulated data for higher accuracy. Given the focus of this paper on simulation applications, additional results under various sample sizes are not presented, as they would lengthen the paper with no clear added practical value for simulationists.
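The convergence of an empirical distribution to its parent distribution in the Wasserstein distance can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not part of the paper: it considers only the one-dimensional case with order $p = 1$ and a Uniform(0, 1) parent distribution, for which the distance can be computed exactly from the quantile representation $W_{1} = \int_{0}^{1} |F_{n}^{-1}(u) - u| \, du$. The function names are hypothetical.

```python
import random


def w1_to_uniform01(sample):
    """Exact W1 distance between the empirical distribution of `sample`
    and Uniform(0, 1), using W1 = integral over u of |F_n^{-1}(u) - u|."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs):
        lo, hi = i / n, (i + 1) / n  # the empirical quantile equals x on (lo, hi]
        if x <= lo:
            total += ((lo - x) + (hi - x)) / 2 * (hi - lo)
        elif x >= hi:
            total += ((x - lo) + (x - hi)) / 2 * (hi - lo)
        else:
            total += ((x - lo) ** 2 + (hi - x) ** 2) / 2
    return total


def mean_w1(n, reps=20, seed=0):
    """Average W1 distance over `reps` independent samples of size n."""
    rng = random.Random(seed)
    return sum(
        w1_to_uniform01([rng.random() for _ in range(n)]) for _ in range(reps)
    ) / reps


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, mean_w1(n))
```

In this sketch the average distance shrinks as $n$ grows, which is the behavior that the cited bounds quantify in far more general settings.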

5.1. Comparing random walk models in swarm simulations

The experiments in this section represent a situation where a sample of observations from a multidimensional random walk process is available and the goal is to identify which parameter configuration for the underlying random walk model the observed data might have come from. This is done by experimenting with different parameter configurations for the random walk model, where a sample is generated under each candidate parameter setting and a multidimensional two-sample test is performed to detect whether there is a statistical difference between the observed and simulated samples. A statistical difference would then indicate that the parameter setting under consideration does not provide an appropriate model for the observed data. Such comparisons are commonly needed when validating simulation models based on real-world data, for example, when determining an appropriate model for swarming behavior of animals or migration of cancer cells into surrounding tissue.

This example considers a random walk in which particles start at the initial position $(x_{0}, y_{0})$. Each step from one position to the next has a step size $SS \sim \text{Uniform}(a, b)$ and a step angle $\theta \sim \text{Uniform}(0, m\pi)$, where $0 \leq m \leq 2$. The target data set consists of the final positions of 10,000 particles after 10,000 steps of the random walk generated by setting $x_{0} = y_{0} = 0$, $SS \sim \text{Uniform}(0.7, 1.3)$, and $m = 0.5$. In the experiments analyzed here, the random walk is simulated under different step size distributions and the resulting final positions are compared with the target data. Figure 5 schematically shows how the output distributions of the particles’ final positions differ under the models considered in the comparisons. Model 0 is the model that was used to generate the target data with $SS \sim \text{Uniform}(0.7, 1.3)$. Models 1–4 use a wider range for the step size, while models 5–8 use a narrower range, leading to both shape and variance differences without affecting the samples’ mean location. The specific $SS$ parameter choices for the alternative models are provided in Table 2. In all models, $x_{0} = y_{0} = 0$ and $m = 0.5$, with the same number of particles and steps as in model 0.
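The random walk described above can be sketched in a few lines of Python. This is a minimal illustration rather than the author’s implementation. It assumes that each step displaces a particle by $(SS \cos\theta, SS \sin\theta)$ with the angle measured from the positive $x$ axis, and that step sizes and angles are drawn independently. It also uses far fewer particles and steps than the target data so that it runs quickly. The function names are hypothetical.

```python
import math
import random


def simulate_final_positions(n_particles, n_steps, a, b, m=0.5,
                             x0=0.0, y0=0.0, seed=None):
    """Return the final (x, y) position of each particle."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_particles):
        x, y = x0, y0
        for _ in range(n_steps):
            step = rng.uniform(a, b)               # SS ~ Uniform(a, b)
            theta = rng.uniform(0.0, m * math.pi)  # theta ~ Uniform(0, m*pi)
            # Assumed displacement rule: move `step` units in direction theta.
            x += step * math.cos(theta)
            y += step * math.sin(theta)
        finals.append((x, y))
    return finals


def mean_and_variance(values):
    """Sample mean and unbiased sample variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


if __name__ == "__main__":
    # Reduced sizes for speed; the paper uses 10,000 particles and 10,000 steps.
    target = simulate_final_positions(500, 200, 0.7, 1.3, seed=1)  # model 0
    wide = simulate_final_positions(500, 200, 0.5, 1.5, seed=2)    # model 1
    for name, sample in (("model 0", target), ("model 1", wide)):
        mx, vx = mean_and_variance([p[0] for p in sample])
        print(name, "mean x:", round(mx, 2), "variance x:", round(vx, 2))
```

Under these assumptions, the two samples should have nearly the same mean position and differ mainly in spread, which is the kind of difference the comparisons in this example are designed to probe.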

Figure 5.

A schematic illustration of the spatial output for the random walk models. The highlighted region represents the distribution of the target data to be tested against.

Table 2.

Statistical comparisons for the random walk example.

Model no.	Step size ( $SS$ )	Test outcome when compared to the target data generated by model 0
		Proposed method	Hotelling’s $T^{2}$ test
Model 1	$U (0.50, 1.50)$	Reject	Failed to Reject
Model 2	$U (0.55, 1.45)$	Reject	Failed to Reject
Model 3	$U (0.60, 1.40)$	Reject	Failed to Reject
Model 4	$U (0.65, 1.35)$	Reject	Failed to Reject
Model 5	$U (0.75, 1.25)$	Reject	Failed to Reject
Model 6	$U (0.80, 1.20)$	Reject	Failed to Reject
Model 7	$U (0.85, 1.15)$	Reject	Failed to Reject
Model 8	$U (0.90, 1.10)$	Reject	Failed to Reject

The proposed method correctly rejects the null hypothesis that the samples generated by models 1–8 and the target data generated by model 0 come from the same spatial distribution.

As shown in Table 2, the proposed method correctly rejects the null hypothesis that the samples generated by models 1–8 and the target data generated by model 0 come from the same distribution, while the Hotelling’s $T^{2}$ test fails to reject in all cases. Once again, the poor performance of the Hotelling’s test is expected, as this test assumes multivariate Gaussian distributions with equal covariance matrices and is most powerful in detecting location differences between multidimensional samples. However, as shown in Figure 5, the random walk models considered here have the same location (spatial mean) but differ in scale (spatial variance), which causes the Hotelling’s test to fail.
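A short calculation shows why the models share a mean location but differ in variance. It is a sketch under assumptions that the text does not state explicitly: steps are independent, the step size $SS$ is independent of the angle $\theta$, and each step moves a particle by $SS\cos\theta$ along the $x$ axis. With $m = 0.5$, so that $\theta \sim \text{Uniform}(0, \pi/2)$,

$$E[\cos\theta] = \frac{2}{\pi}\int_{0}^{\pi/2}\cos\theta \, d\theta = \frac{2}{\pi}, \qquad E[\cos^{2}\theta] = \frac{2}{\pi}\int_{0}^{\pi/2}\cos^{2}\theta \, d\theta = \frac{1}{2}.$$

For $SS \sim \text{Uniform}(a, b)$, $E[SS] = (a+b)/2$ and $E[SS^{2}] = E[SS]^{2} + (b-a)^{2}/12$. After $N$ steps, the $x$ coordinate $X_{N}$ of the final position therefore satisfies

$$E[X_{N}] = N \, E[SS] \, \frac{2}{\pi}, \qquad \mathrm{Var}(X_{N}) = N\left(\frac{1}{2}E[SS^{2}] - \frac{4}{\pi^{2}}E[SS]^{2}\right).$$

Every model in Table 2 has $a + b = 2$, so $E[SS] = 1$ and $E[X_{N}] = 2N/\pi$ in all cases, whereas

$$\mathrm{Var}(X_{N}) = N\left(\frac{1}{2} + \frac{(b-a)^{2}}{24} - \frac{4}{\pi^{2}}\right)$$

increases with the width $b - a$. The per-step variance is approximately 0.096 for model 8, 0.110 for model 0, and 0.136 for model 1. By symmetry, the same expressions hold for the $y$ coordinate.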

5.2. Comparing spatial output in a simulation of UAV communication systems

This example is adapted from a realistic engineering application⁵³ involving the simulation of UAV-to-UAV communications in an urban landscape. As schematically shown in Figure 6, the problem setting involves a set of UAVs that operate within a height range $H = [h^{(-)}, h^{(+)}]$, where $h^{(-)}$ and $h^{(+)}$ are the minimum and maximum flying heights, respectively, and $\Delta_{h} = h^{(+)} - h^{(-)}$ denotes the size of the height range. The position of UAV $k$ is denoted by $w_{k} = (x_{k}, y_{k}, h_{k})$, where $x_{k}$ and $y_{k}$ are the coordinates along the $x$ and $y$ axes, respectively, and $h_{k} \in H$.

Figure 6.

A schematic illustration of the UAV-to-UAV communication system. Each UAV operates in its designated sub-region. The red arrow indicates signal obstruction by buildings while the green arrow indicates Line of Sight (LoS) between two UAVs.

The urban area where the UAVs operate is a 2.4 km $\times$ 1.75 km region in Osaka, Japan, as shown in Figure 7. The building data are obtained from OpenStreetMap (OSM), which provides crowd-sourced map data available at https://www.openstreetmap.org/. The urban area is divided into 25 sub-regions arranged as a $5 \times 5$ grid, with one UAV operating in each sub-region (i.e., a total of 25 UAVs). The configuration analyzed here involves a receiver UAV (UAV-Rx) that operates in the sub-region at the center of the urban area. The UAV-Rx receives signals from the UAVs located in the other sub-regions, which act as simultaneously transmitting UAVs (UAV-Txs). The simulation output of interest is the spatial distribution of the origin of the strongest signal received by UAV-Rx. In other words, the distribution assigns to each sub-region the probability that the strongest signal received by UAV-Rx comes from the UAV-Tx operating in that sub-region. Due to signal attenuation (i.e., weakening of the power density of the signal as it propagates through space) and obstruction by buildings, this probability distribution depends on the distances between the sub-regions and their building profiles as well as on the height range $H = [h^{(-)}, h^{(+)}]$ for the flying UAVs. Figure 8 shows an example of signal obstruction by buildings.
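The output distribution described above can be estimated from simulation replications by tallying which sub-region produced the strongest received signal in each replication. The following sketch assumes a hypothetical data layout that the paper does not specify: one dictionary per replication, mapping a transmitter sub-region index to the received signal strength in dBm. The numbers in the usage example are made up for illustration.

```python
from collections import Counter


def strongest_source_distribution(replications):
    """Estimate P(strongest signal comes from sub-region r) for each r.

    `replications` is a list of dicts {sub_region_index: strength_in_dBm},
    one dict per simulation replication (hypothetical layout).
    """
    # For each replication, find the sub-region with the largest dBm value.
    winners = Counter(max(rx, key=rx.get) for rx in replications)
    total = len(replications)
    return {region: count / total for region, count in winners.items()}


if __name__ == "__main__":
    # Three illustrative replications with two transmitters (made-up values).
    reps = [
        {1: -70.2, 2: -65.0},
        {1: -60.1, 2: -80.3},
        {1: -72.0, 2: -66.5},
    ]
    print(strongest_source_distribution(reps))
```

Sub-regions that never produce the strongest signal are simply absent from the returned dictionary and can be treated as having estimated probability zero.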

Figure 7.

The urban area in Osaka, Japan. The 3D buildings are generated using OpenStreetMap data.

Figure 8.

Signal obstruction by buildings. The same scenario is shown from two different angles, where a green arrow indicates Line of Sight (LoS) and a red arrow indicates Non-Line of Sight (NLoS) due to obstruction by buildings.

The simulation model is implemented in MATLAB and can be summarized as follows. In each simulation replication, the UAVs are generated (one UAV per sub-region) by randomly sampling their latitude and longitude within the respective sub-region. UAV heights are randomly generated within the given height range $H = [h^{(-)}, h^{(+)}]$, using the following logic to ensure that UAV positions do not fall inside a building. If there is no building at the sampled coordinate $(x_{k}, y_{k})$ for UAV $k$, or if the coordinate overlaps a building that is shorter than $h^{(-)}$, then the UAV height $h_{k}$ is sampled from a uniform distribution between $h^{(-)}$ and $h^{(+)}$. If the sampled coordinate overlaps a building with height $h_{B}$ such that $h^{(-)} \leq h_{B} < h^{(+)}$, then $h_{k}$ is sampled from a uniform distribution between $h_{B}$ and $h^{(+)}$. Otherwise, the sampled coordinate overlaps a building that is taller than $h^{(+)}$ and is therefore infeasible; in such cases, a new coordinate is sampled and the process is repeated until a feasible position is obtained. MATLAB’s SiteViewer is used to visualize the OSM data and generate the 3D buildings. Once all 25 UAVs are generated, MATLAB’s Communications Toolbox is used to determine three-dimensional signal obstruction by buildings and to compute the signal strength (in dBm) received by UAV-Rx from the UAV-Txs, which is then used to determine the sub-region from which the strongest signal originates.
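The position-sampling logic can be sketched as a rejection loop. The sketch below is written in Python rather than MATLAB and is not the author’s code. The function `building_height_at` is a hypothetical callable that returns the building height at a coordinate (0.0 where there is no building), the sub-region is assumed to be an axis-aligned rectangle, and a building whose height equals $h^{(+)}$ exactly is treated as infeasible, a boundary case that the description above leaves open.

```python
import random


def sample_uav_position(x_range, y_range, h_min, h_max, building_height_at,
                        rng, max_tries=10_000):
    """Sample a feasible (x, y, h) position for one UAV in its sub-region."""
    for _ in range(max_tries):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        h_b = building_height_at(x, y)  # hypothetical lookup; 0.0 if no building
        if h_b < h_min:
            # No building, or one below the height range: use the full range.
            return x, y, rng.uniform(h_min, h_max)
        if h_b < h_max:
            # Building ends inside the height range: fly above its roof.
            return x, y, rng.uniform(h_b, h_max)
        # Building reaches or exceeds h_max: infeasible coordinate, resample.
    raise RuntimeError("no feasible position found within max_tries")


if __name__ == "__main__":
    # Toy building profile (made up): a 30 m block, a 100 m tower, open space.
    def toy_height(x, y):
        if x < 50:
            return 30.0
        if x < 60:
            return 100.0
        return 0.0

    rng = random.Random(0)
    print(sample_uav_position((0, 100), (0, 100), 25.0, 35.0, toy_height, rng))
```

The `max_tries` guard is an addition of this sketch; it prevents an endless loop when a sub-region is entirely covered by buildings taller than the maximum flying height.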

For the statistical comparisons performed in this example, the target data set is generated by setting the height range to $H = [25, 35]$ meters (model 0). The four alternative models to be compared against the target distribution differ in their height range parameters, namely $H = [15, 25]$ in model 1, $H = [35, 45]$ in model 2, $H = [45, 55]$ in model 3, and $H = [55, 65]$ in model 4. Figure 9 visualizes the target data and the spatial output under each alternative model based on 200 simulation replications. As shown in Table 3, the proposed method correctly rejects the null hypothesis that the samples generated by models 1–4 and the target data generated by model 0 come from the same distribution, while the Hotelling’s test fails to reject in two out of the four cases (models 1 and 2). Once again, the results support the applicability of the proposed method and show that the Hotelling’s test can fail to detect sample differences in a realistic example.

Figure 9.

Target data and spatial simulation output distributions. The height range parameters for the models are: (a) $H = [25, 35]$ in model 0, which is used to generate the target data; (b) $H = [15, 25]$ in model 1; (c) $H = [35, 45]$ in model 2; (d) $H = [45, 55]$ in model 3; (e) $H = [55, 65]$ in model 4.

Table 3.

Statistical comparisons for the UAV-to-UAV communication example.

Model no.	Height range ( $H$ )	Test outcome when compared to the target data generated by model 0
		Proposed method	Hotelling’s $T^{2}$ test
Model 1	$[15, 25]$	Reject	Failed to Reject
Model 2	$[35, 45]$	Reject	Failed to Reject
Model 3	$[45, 55]$	Reject	Reject
Model 4	$[55, 65]$	Reject	Reject

The proposed method correctly rejects the null hypothesis that the samples generated by models 1–4 and the target data generated by model 0 come from the same distribution.

6. Conclusion

In this paper, a novel method for comparing two multidimensional samples was proposed to enable validation, verification, and calibration of simulation models with spatial output in situations where the benchmark Hotelling’s two-sample $T^{2}$ test fails due to violation of its assumptions (i.e., normality and equality of covariance matrices). The proposed method employs the Wasserstein distance as a probability metric to quantify the distance between two probability distributions and involves two levels of bootstrapping to estimate the variance of the distance measure (first bootstrap) and the cut-off point for the test statistic (second bootstrap). The results of computational experiments for the case of Gaussian distributions showed that the proposed method significantly outperformed the Hotelling’s test as well as various other existing methods under scale differences. The efficacy of the proposed method and its advantage over the Hotelling’s test were also illustrated in two simulation applications involving spatial output, namely swarm movement and UAV communication systems. As for limitations, the computational time comparisons showed that the proposed method has a higher execution time than the other methods, especially for smaller sample sizes, as bootstrapping and Wasserstein distance calculation are computationally intensive. This limits the applicability of the proposed method to offline decision-making problems where computational time is not a binding constraint (as opposed to real-time decision-making, which requires fast processing).
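To make the idea of two bootstrap levels concrete, the following sketch shows a generic studentized double-bootstrap two-sample test. It is not the procedure defined in Section 3.2 and should be read as an illustration under several assumptions: the data are one-dimensional with equal sample sizes (so that the Wasserstein distance of order 1 reduces to the mean absolute difference of sorted values), the inner bootstrap is used to estimate a standard error that scales the distance, and the outer bootstrap resamples from the pooled data to approximate the null distribution. It reports a p-value, which is equivalent to comparing the statistic against a bootstrap cut-off point. All function names are hypothetical.

```python
import random
import statistics


def w1(xs, ys):
    """1-D Wasserstein-1 distance between two samples of equal size."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def resample(data, size, rng):
    """Draw `size` points from `data` with replacement."""
    return [rng.choice(data) for _ in range(size)]


def studentized_w1(xs, ys, n_inner, rng):
    """Inner bootstrap: estimate the standard error of W1 and scale by it."""
    reps = [w1(resample(xs, len(xs), rng), resample(ys, len(ys), rng))
            for _ in range(n_inner)]
    se = statistics.stdev(reps)
    d = w1(xs, ys)
    if se == 0.0:
        return 0.0 if d == 0.0 else float("inf")
    return d / se


def double_bootstrap_pvalue(xs, ys, n_outer=99, n_inner=25, seed=0):
    """Outer bootstrap: null distribution of the studentized statistic."""
    rng = random.Random(seed)
    t_obs = studentized_w1(xs, ys, n_inner, rng)
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_outer):
        # Under H0 both samples come from one distribution, so resample pooled.
        bx = resample(pooled, len(xs), rng)
        by = resample(pooled, len(ys), rng)
        if studentized_w1(bx, by, n_inner, rng) >= t_obs:
            exceed += 1
    return (1 + exceed) / (n_outer + 1)


if __name__ == "__main__":
    rng = random.Random(42)
    x = [rng.gauss(0.0, 1.0) for _ in range(40)]
    y = [rng.gauss(0.0, 3.0) for _ in range(40)]  # same mean, larger scale
    print("p-value:", double_bootstrap_pvalue(x, y))
```

The cost grows with the product of the two bootstrap sizes, which is consistent with the higher execution time noted above. A multidimensional version would additionally need an optimal transport solver for each distance evaluation.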

The work presented here reveals the potential of bootstrap methods based on the Wasserstein distance. Such methods can offer key advantages that extend their applicability beyond existing methods. For example, they do not require assumptions about equal sample sizes or variances, the underlying distributions being compared, or the form of the cost function used for calculating the Wasserstein distance. The latter also enables generalizations to other definitions and probability measures. In addition, while the discussion in this paper focused on the two-sample problem, the proposed method can also be used to perform goodness-of-fit tests: given a sample $X$, one can draw a sample from the specified distribution under consideration and compare it with $X$.

There are several possibilities for extensions and future research. For example, while it is possible to extend the proposed method to more than two populations using the Bonferroni multiple comparisons method, the development of Wasserstein distance-based bootstrapping tests for comparing more than two samples without the need for Bonferroni correction would be an interesting area for future research. This is because performing multiple tests reduces the overall test power due to the Bonferroni adjustments, especially when the number of candidate samples to be compared is large.⁵⁴ From an experimental point of view, other forms of the Wasserstein distance and other metrics can be defined on the space of probability measures and incorporated into the general double-bootstrap procedure proposed in this paper. For instance, simulation experiments can be performed to evaluate the effect of varying the exponent $p$ in the cost matrix defined by $\|x - y\|^{p}$ or to compare the test power under alternative forms of the cost function. The best-performing cost function is expected to vary depending on the specific application at hand. Therefore, future applied papers that use the proposed method could also investigate ways to optimize the cost function for their specific application or problem setting.
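Two of the quantities mentioned above are simple to compute and are sketched here for concreteness: the cost matrix $\|x - y\|^{p}$ for a chosen exponent $p$, and the per-test significance level under a Bonferroni correction. The sketch assumes the Euclidean norm and uses hypothetical function names. The example values (8 comparisons at an overall level of 0.05) are illustrative and are not taken from the paper’s experiments.

```python
import math


def cost_matrix(xs, ys, p=2):
    """C[i][j] = ||x_i - y_j||^p, using the Euclidean norm (an assumption)."""
    return [[math.dist(x, y) ** p for y in ys] for x in xs]


def bonferroni_level(alpha, n_comparisons):
    """Per-test significance level that keeps the overall level at alpha."""
    return alpha / n_comparisons


if __name__ == "__main__":
    xs = [(0.0, 0.0), (1.0, 1.0)]
    ys = [(3.0, 4.0), (1.0, 2.0)]
    for p in (1, 2):
        print("p =", p, cost_matrix(xs, ys, p=p))
    print("per-test level:", bonferroni_level(0.05, 8))
```

Larger values of $p$ penalize long-distance transport more heavily, which is one reason the best-performing exponent may depend on the application.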

It is hoped that this paper and its future extensions along the above lines will enhance validation, verification, and calibration of simulation models with spatial output, and draw more attention to Wasserstein distance-based bootstrapping methods.

Footnotes

Acknowledgements

The author is indebted to the Bio-Inspired Networking Lab at the Graduate School of Information Science and Technology at Osaka University (Japan) for hosting him during his sabbatical leave when this research took place. The author is thankful to Dr Takeshi Hirai, whose research on UAV communication systems inspired the example presented in this paper.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

ORCID iD

Ashkan Negahban

Author biography

Ashkan Negahban is an Associate Professor of Engineering Management at The Pennsylvania State University, Great Valley School of Graduate Professional Studies (USA). He received his PhD and master’s degrees from Auburn University (USA) and his BS from the University of Tehran (all in Industrial and Systems Engineering). His research involves stochastic simulation methods, primarily agent-based and discrete-event simulation. He also conducts research related to novel simulation-based learning environments in STEM education. His email address is aun85@psu.edu.

References

1. Negahban, Smith. Simulation for manufacturing system design and operation: literature review and analysis. J Manuf Syst 2014; 33: 241–261.
2. Mielczarek, Uzialko-Mydlikowska. Application of computer simulation modeling in the health care sector: a survey. SIMULATION 2012; 88: 197–216.
3. Oliveira, Jin, Lima, et al. The role of simulation and optimization methods in supply chain risk management: performance and review standpoints. Simul Model Pract Theory 2019; 92: 17–44.
4. Negahban, Yilmaz. Agent-based simulation applications in marketing research: an integrated review. J Simul 2014; 8: 129–142.
5. Naseer, Eldabi, Jahangirian. Cross-sector analysis of simulation methods: a survey of defense and healthcare. Trans Govern People Process Policy 2009; 3: 181–189.
6. Negahban. Simulation in engineering education: the transition from physical experimentation to digital immersive simulated environments. SIMULATION 2024; 100: 695–708.
7. Mehan, Guo, Gitau, et al. Comparative study of different stochastic weather generators for long-term climate data simulation. Climate 2017; 5: 26.
8. Sargent. Verification and validation of simulation models. J Simul 2013; 7: 12–24.
9. Speir, Negahban. Analyzing COVID-19 control strategies in metropolitan areas: a customizable agent-based simulation tool. In: Proceedings of the 2020 Winter Simulation Conference, Orlando, FL, 14–18 December.
10. Hotelling. The generalization of Student’s ratio. In: Kotz, Johnson (eds) Breakthroughs in Statistics. New York: Springer, 1991, pp. 45–53.
11. Balci. Validation, verification, and testing techniques throughout the life cycle of a simulation study. Ann Oper Res 1994; 53: 121–173.
12. Sargent, Goldsman, Yaacoub. A tutorial on the operational validation of simulation models. In: Proceedings of the 2016 Winter Simulation Conference, Washington, DC, 11–14 December, pp. 163–177. New York: IEEE.
13. Chen, Friedman. A new graph-based two-sample test for multivariate and object data. J Am Stat Assoc 2017; 112: 397–409.
14. Ramdas, Trillos, Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 2017; 19: 47.
15. Gibbs. On choosing and bounding probability metrics. Int Stat Rev 2002; 70: 419–435.
16. Balci, Sargent. Validation of multivariate response models using Hotelling’s two-sample T² test. SIMULATION 1982; 39: 185–192.
17. Emery. Statistical tests for validating geostatistical simulation algorithms. Comput Geosci 2008; 34: 1610–1620.
18. Mebarki, Castagna. An approach based on Hotelling’s test for multicriteria stochastic simulation-optimization. Simul Pract Theory 2000; 8: 341–355.
19. Chao, Tseng, Wong, et al. Systematic applications of multivariate analysis to monitoring of equipment health in semiconductor manufacturing. In: Proceedings of the 2008 Winter Simulation Conference, Miami, FL, 7–10 December 2008, pp. 2330–2334. New York: IEEE.
20. Koen, Siluyele. Multivariate comparisons of the period–light-curve shape distributions of Cepheids in five galaxies. Monthly Notic Royal Astron Soc 2007; 377: 1281–1286.
21. Hakstian, Roed, Lind. Two-sample T² procedure and the assumption of homogeneous covariance matrices. Psychol Bullet 1979; 86: 1255–1263.
22. McDonald, Dunn. Statistical tests for measures of colocalization in biological microscopy. J Microscopy 2013; 252: 295–302.
23. Chen, Qin. A two-sample test for high-dimensional data with applications to gene-set testing. Ann Stat 2010; 38: 808–835.
24. Boyett, Shuster. Nonparametric one-sided tests in multivariate analysis with medical applications. J Am Stat Assoc 1977; 72: 665–668.
25. Morse. Minsize2: a computer program for determining effect size and minimum sample size for statistical significance for univariate, multivariate, and nonparametric tests. Educ Psychol Meas 1999; 59: 518–531.
26. Peacock. Two-dimensional goodness-of-fit testing in astronomy. Monthly Notic Royal Astron Soc 1983; 202: 615–627.
27. Fasano, Franceschini. A multidimensional version of the Kolmogorov–Smirnov test. Monthly Notic Royal Astron Soc 1987; 225: 155–170.
28. Székely, Rizzo. Energy statistics: a class of statistics based on distances. J Stat Plan Infer 2013; 143: 1249–1272.
29. Aslan, Zech. Statistical energy as a tool for binning-free, multivariate goodness-of-fit tests, two-sample comparison and unfolding. Nucl Instrum Method Phys Res Sect A 2005; 537: 626–636.
30. Gretton, Borgwardt, Rasch, et al. A kernel method for the two-sample-problem. Adv Neural Inform Proces Syst 2006; 19: 513–520.
31. Weiss. Two-sample tests for multivariate distributions. Ann Math Stat 1960; 31: 159–164.
32. Friedman, Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann Stat 1979; 7: 697–717.
33. Randles, Peters. Multivariate rank tests for the two-sample location problem. Commun Stat Theory Method 1990; 19: 4225–4238.
34. Hettmansperger, Oja. Affine invariant multivariate multisample sign tests. J Royal Stat Soc Series B 2018; 56: 235–249.
35. Choi, Marden. An approach to multivariate rank tests in multivariate analysis of variance. J Am Stat Assoc 1997; 92: 1581–1590.
36. Henze. A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat 1988; 16: 772–783.
37. Hall, Tajvidi. Permutation tests for equality of distributions in high-dimensional settings. Biometrika 2002; 89: 359–374.
38. Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. J Royal Stat Soc Series B: Stat Methodol 2005; 67: 515–530.
39. Biswas, Mukhopadhyay, Ghosh. A distribution-free two-sample run test applicable to high-dimensional data. Biometrika 2014; 101: 913–926.
40. Sejdinovic, Sriperumbudur, Gretton, et al. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann Stat 2013; 41: 2263–2291.
41. Monge. Mémoire sur la théorie des déblais et des remblais. Histoire De L’académie Royale Des Sciences De Paris 1781: 666–704.
42. Kantorovich. On the translocation of masses. J Math Sci 2006; 133: 1381–1382.
43. Panaretos, Zemel. Statistical aspects of Wasserstein distances. Ann Rev Stat Appl 2019; 6: 405–431.
44. Conover. Practical Nonparametric Statistics. 3rd ed. New York: John Wiley, 1980.
45. Negahban. Simulation-based estimation of the real demand in bike-sharing systems in the presence of censoring. Europ J Oper Res 2019; 277: 317–332.
46. Negahban. Estimating the true arrival, balking, and reneging processes from censored transactional data: a simulation-based approach. SIMULATION 2022; 98: 597–614.
47. Negahban, Smith. The effect of supply and demand uncertainties on the optimal production and sales plans for new products. Int J Prod Res 2016; 54: 3852–3869.
48. Boesel, Nelson, Kim. Using ranking and selection to “clean up” after simulation optimization. Oper Res 2003; 51: 814–825.
49. Vissio, Lucarini. Evaluating a stochastic parametrization for a fast–slow system using the Wasserstein distance. Nonlin Proces Geophys 2018; 25: 413–427.
50. Pele, Werman. Fast and robust earth mover’s distances. In: IEEE 12th international conference on computer vision, Kyoto, Japan, 29 September–2 October 2009, pp. 460–467. New York: IEEE.
51. Weed, Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 2019; 25: 2620–2648.
52. Lei. Convergence and concentration of empirical measures under Wasserstein distance in unbounded functional spaces. Bernoulli 2020; 26: 767–798.
53. Negahban, Hirai. Simulation-based performance analysis of UAV-to-UAV communications in a realistic urban landscape considering 3D blockage effects. In: Proceedings of the 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall), Washington, DC, 7–10 October 2024.
54. Negahban. A framework for comparing stochastic simulation models against multidimensional data using the Wasserstein distance. J Simul. 2025. DOI: 10.1080/17477778.2025.2486664.

A Wasserstein distance-based double-bootstrap method for comparing spatial simulation output
