
Abstract

Social network analysis (SNA) is a tool for the operations researcher to understand, monitor, and exploit social and military structures which are key in the intelligence community. However, in order to study and influence a network of interest, the network must first be characterized; preferably to a known network model that captures a mixture of graphical properties exhibited by the social network of interest. In this work, we present a novel statistical method for both characterizing networks via a Binomial-Pareto maximum-likelihood approach and simulating the characterized network using a graph of mixed Barabási–Albert (BA, scale-free) and Erdös–Rényi (ER, randomness) properties. Characterization is performed through a combination of hypothesis tests and method of moments parameter estimation on Pareto and Doubly Truncated Binomial distributions. Application on real-world networks suggests that such networks may be characterized with a mixture of scale-free and random properties as modeled through BA and ER graphs. We demonstrate that our simulation methods are able to capture the degree distribution and density of the networks examined. These results demonstrate that this work establishes a statistical framework upon which network characterization and simulation may be accomplished, thus enabling the adaptation of such methods when generating, manipulating, and observing networks of interest.

Keywords

Bootstrapping, network simulation, network characterization, power law, scale-free networks, Barabási–Albert, Erdös–Rényi, degree distribution

1. Introduction

Social network analysis (SNA) is a tool for the operations researcher to understand, monitor, and exploit social, political, and military structures which have been key intelligence considerations through the ages. Indeed, analysts have for centuries charted alliances, coalitions, and military orders of battle in order to derive factors providing operational advantages. SNA tools, with their mathematical and statistical foundations, enable analysts to monitor, understand, and exploit network structures. In addition, the nature of warfare has evolved. While the classic image of the end of a war includes the unconditional surrender of a foe, this has not been the case in modern warfare. Warfare has shifted from “War between the people” (nations fighting to a distinct victory) to “War among the people” (non-nation states waging indefinite warfare).¹ The recent Joint Concept for Human Aspects of Military Operations (JC-HAMO)² recognizes that such wars “among the people,” coupled with the need to “win the peace,” require the consideration of human effects in executing military operations. JC-HAMO goes on to point out that in planning conflict and building the peace, it is necessary to consider the social, cultural, physical, informational, and psychological elements that create a desired effect on behavior. Building on the use of SNA in counterterrorism studies, SNA is an analysis approach that can aid future joint planning. An understanding of SNA approaches will continue to be of value in the intelligence preparation of the battlespace, but with the adoption of the joint concept on human effects in planning operations, it will also be a planning analysis tool. Today, the intelligence and operational communities widely use SNA to analyze relationships between individuals within groups of interest.³ In some cases, SNA is used in this context to monitor and possibly influence a network of interest.
From a security and defense standpoint, individuals within the network may represent friendlies or potential threats, or as applied to non-humans, could represent relations among assets or critical resources such as a network of computers, the power grid, or water sources. Regardless of the network itself, it is not possible to understand, monitor, and potentially act upon a network of interest without first characterizing the underlying structure of the network. Thus, such relational data are present in many critical resources, computer infrastructures, and organizational social groups of interest, and most commonly, such relational data are represented structurally as a network. Using the network structure, the data can be graphed via nodes and connected by edges in order to visualize and analyze the data. Examples include agents (nodes) and the operatives with which they communicate (represented via edges) in an organized crime network, individuals executing code on a computer network which is tracked through the computer’s Internet protocol (IP) address (node) and its connections (via edges) to other computers, or drones in a cooperative network. Invariant to the network of interest, continual observation and analysis of the data are enabled if the associated graph of the data (represented via nodes and edges) can be characterized to a specific or known network model for reproduction, simulation, and augmentation during analysis. A known network model allows the data to be studied and acted upon to observe behavior in its graphical form.

These network models are built upon known graphical properties of the network, mostly related to how nodes are connected to each other, such as scale-free, random, or small world properties. The scale-free property has been observed to represent many social, physical, biological, and man-made phenomena⁴⁻⁷ and was first discovered in networks as described by De Solla Price.⁸ One well-known graphical model that exhibits the scale-free property was proposed by Barabási and Albert.⁴ Their model is governed by the concept of preferential attachment in which new entities are more likely to form connections with those that are already well connected in the network. Random properties can be found in graphical models such as that proposed by Erdös and Rényi,⁹ and the small world property, in which nodes are closely connected, can be found in graphical models such as that proposed by Watts and Strogatz.¹⁰

Although other, more complicated models for graphical properties have been established in the literature, such as random graphs,¹¹ the Barabási–Albert (BA) model has been widely studied due to its simplicity and its scale-free property. More recently, however, and despite the claim that the scale-free property is inherent in many real-world networks, Broido and Clauset¹² have shown that scale-free networks are indeed rare, and most networks are at most partially scale-free. This finding is what has motivated our research: many real-world networks may only partially possess properties such as the scale-free property. Identification of the scale-free property for even a portion of the network is important, particularly if the remainder of the network can be characterized through noise or randomness (or other identifiable and tractable properties).

Characterization of relational data into a network model must, therefore, be capable of identifying and capturing several network properties. Most characterization into a known network model is accomplished mathematically, mainly through applications of graph matching and classification methods.¹³ Many such methods have been developed.¹⁴ For instance, the methods found in Broido and Clauset¹² may be used to determine if the entire network is scale-free; however, as for most mathematical methods, this is an involved computational process which becomes compounded if the method is applied to the various sub-networks of a network. That is, under the idea that a network is only partially scale-free, such methods would need to iteratively test for structures associated with the scale-free property on each sub- or partial network. We propose that a shorter, more rapid test based upon statistical theory may provide a means to test both the entire network and, through simple construction, identify portions of the network that could in fact follow known properties such as the scale-free property.

Statistical methods use known graph properties to characterize a network. These methods are useful because once a network is represented as a graph, the distribution of graphical properties such as the graph nodal degrees may take on a specific structure. For instance, networks that are scale-free possess a unique property in which the distribution of the degrees of the nodes of the network follows the power law distribution. Furthermore, the degree distribution for well-known network models such as the BA and ER is known. In particular, it has been shown theoretically that the degree distribution of the BA⁴ model follows a Pareto distribution with a parameter of $β = 2$ . Although there have been attempts at statistically testing for preferential attachment within a network,¹⁵ there appears to be no statistical test of hypothesis published for the degree distribution of the BA model under the linear preferential attachment model.¹¹ Similarly, the degree distribution of the ER model follows a Binomial distribution.⁹,¹⁶ While goodness-of-fit methods for fitting a Pareto distribution as well as its more general form, the power law, exist in a general sense,⁶,¹⁷,¹⁸ we propose a simple hypothesis testing approach that directly investigates graphical properties following theoretically derived statistical distributions. Specifically, our method allows for the assumption of a network’s degree distribution following the expected degree distribution of the BA and ER graph. Consequently, it is possible to characterize the scale-free property of the network in addition to the randomness property of a network.

Therefore, this paper establishes a statistical framework upon which network characterization and simulation may be accomplished by presenting a novel statistical method for both characterizing networks via a Binomial-Pareto maximum-likelihood approach and simulating the characterized network using a graph of mixed BA and ER properties (scale-free and randomness). The network characterization method is used specifically to (1) test a set of observed network data for its scale-free property through tests of hypotheses based on the Pareto distribution and (2) model any inherent noise (randomness) by fitting a Doubly Truncated Binomial distribution using method of moments estimation. This framework is useful for researchers who want to characterize and monitor networks, as it provides a means by which to computationally model networks and to detect changes and evolutions within the network more efficiently through identification of network properties via statistical tests. As such, statistical methods can be applied to characterize a network, or a portion of a network, as a mixture of network properties. Even characterizing a portion of the network may be of use, as an entire network may not be of interest as much as portions of the network. Furthermore, the network can be partitioned into those portions exhibiting known properties and that portion for which properties need to be developed and newly characterized.

In order to demonstrate our proposed statistical framework for network characterization and simulation using multiple network properties as discussed, we derive the associated tests of hypotheses, demonstrate their usefulness for network characterization, and then illustrate our proposed method for simulating a network with a mixture of properties. This paper is organized as follows: first, the BA model and the representation of its degree distribution via the Pareto distribution are described. Then, we derive the test of hypothesis for each of the parameters of the Pareto distribution followed by the Union-Intersection Test (UIT) for simultaneously testing these parameters. A simulation of the BA network is conducted in order to calculate the power of the individual tests as well as the UIT for determining if a network or a portion of a network follows the scale-free property. Then, we demonstrate our statistical framework for network characterization and simulation by first describing a method for characterizing a network as a mixture of network properties, specifically demonstrated for a mixture of scale-free and randomness properties as exhibited via the degree distributions associated with a BA network (scale-free) and an ER network (randomness). Characterization is performed through a combination of hypothesis tests on the Pareto distribution and method of moments parameter estimation on the Doubly Truncated Binomial distribution. We then follow up with a proposed method for simulating a characterized network as a function of both the BA and ER networks. Finally, we apply our framework on real-world networks in an attempt to test the extent to which the networks exhibit both the scale-free and randomness properties and to study how well our methods for characterization and simulation describe the empirical networks as a mixture of properties. We conclude with a discussion of our statistical framework and its implications for network modeling.

2. Background

In this section, we will review previous works in network analysis and modeling. Specifically, we will introduce the works of Barabási and Albert⁴ and Erdös and Rényi⁹ and discuss the two basic types of network model used in our approach. In addition, we will review a method of estimating a parameter of the Doubly Truncated Binomial distribution used in our model. We then introduce a select number of network measures that will be used to study the goodness-of-fit of our model.

2.1. Network models

2.1.1. BA model

The BA model is based on two mechanisms that govern the scale-free property of real-world networks: (1) networks expand continuously by the addition of new nodes and (2) new nodes attach preferentially to other nodes that are already well connected. The BA model operates by first starting with an initial number of nodes, $m_{0}$ , each having no edges. This is followed by an iterative process of adding a single node with $m$ edges, where the edges are connected to an existing node $i$ with degree $d_{i}$ based on the linear preferential attachment probability, $π (d_{i})$ , where:

π (d_{i}) = \frac{d_{i}}{\sum_{\forall j} d_{j}}

(1)

is the probability that node $i$ will be attached to the new node.
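The growth process described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `simulate_ba_degrees` is hypothetical, the first new node is attached uniformly at random because Equation (1) is undefined while all degrees are zero, and duplicate targets are redrawn so that no multi-edges are created. All three choices are assumptions.

```python
import random


def simulate_ba_degrees(m0, m, t, seed=None):
    """Return the degree list of a graph grown by linear preferential attachment."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    rng = random.Random(seed)
    degrees = [0] * m0  # m0 initial nodes, no edges
    # Node i appears d_i times in `stubs`, so a uniform draw from `stubs`
    # selects node i with probability d_i / sum_j d_j, as in Equation (1).
    stubs = []
    for _ in range(t):
        if not stubs:
            # Assumption: every degree is zero on the first step, so
            # Equation (1) is undefined; attach uniformly at random instead.
            targets = set(rng.sample(range(len(degrees)), m))
        else:
            targets = set()
            while len(targets) < m:  # redraw duplicates (no multi-edges)
                targets.add(rng.choice(stubs))
        new_node = len(degrees)
        degrees.append(m)  # the new node arrives with m edges
        for i in targets:
            degrees[i] += 1
            stubs.extend((i, new_node))
    return degrees


degrees = simulate_ba_degrees(m0=3, m=2, t=1000, seed=42)
print(len(degrees), sum(degrees))  # N = m0 + t nodes, total degree 2 * m * t
```

The two printed totals correspond to the graph size and total degree that a process of this kind must produce after $t$ steps.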

The nodal degree of the BA scale-free graph can be derived using the mean field theory as described by Barabási and Albert.⁴ Let $m_{0}$ be the number of nodes initially included in the graph and $m$ be the number of edges added at each iteration over $t$ iterations, where $m \leq m_{0}$ . Then, the size of the graph is $N = m_{0} + t$ , the total number of edges $E = mt$ , and consequently, the total degree of the graph $\sum_{i = 1}^{N} d_{i} = 2 mt$ .

The probability density function (PDF) for a given node having degree $x$ is as follows:

f (x) = 2 m^{2} x^{- 3} \frac{t}{N}

where $t = N - m_{0}$ , $x \in [0, \infty)$ .

Thus, for finite $N$ , the degree distribution is written as $f_{d_{i}} (x) = 2 (m \sqrt{(N - m_{0}) / N})^{2} x^{- 3}$ which implies that the degree distribution of the BA graph follows a $Pareto (m \sqrt{(N - m_{0}) / N}, 2)$ distribution. Notice that as the size of the graph increases, $N \to \infty$ , the distribution of the degree converges to a $Pareto (m, 2)$ distribution. The Pareto distribution, so named in honor of the early works done by Pareto and Busino,¹⁹ is a form of the power law. Estimating the parameters of the Pareto distribution can be accomplished using the Maximum-Likelihood Estimation (MLE), and from there, a test of hypothesis on the parameters can be performed.
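The finite-size scale parameter and its convergence to $m$ can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and the density is evaluated under the usual Pareto support assumption that it is zero below the scale parameter.

```python
import math


def ba_pareto_scale(m, m0, N):
    """Finite-size Pareto scale m * sqrt((N - m0) / N) for a BA graph."""
    return m * math.sqrt((N - m0) / N)


def pareto_pdf(x, scale, beta=2.0):
    """Pareto(scale, beta) density; assumed to be zero for x < scale."""
    if x < scale:
        return 0.0
    return beta * scale**beta * x ** (-(beta + 1))


# The scale approaches m = 2 from below as N = 2^k grows.
for k in (5, 10, 15):
    print(k, ba_pareto_scale(m=2, m0=2, N=2**k))
```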

2.1.2. ER model

The first commonly used random graph generating model was proposed by Erdös and Rényi,⁹ where a graph is generated by an algorithm that connects any pair of nodes via an edge with probability $p$ , and in which each edge is independent of every other edge. Under this model, every undirected simple graph of $N$ nodes and $m$ edges has the same probability, $p^{m} (1 - p)^{\binom{N}{2} - m}$ . Consequently, the distribution of the degrees over all possible realizations of the ER network follows the Binomial distribution:

P (X = x) = \binom{N - 1}{x} p^{x} {(1 - p)}^{N - 1 - x}, x = 0, 1, \dots, N - 1

and the Poisson distribution:

P (X = x) = \frac{{(Np)}^{x}}{x!} e^{- Np}, x = 0, 1, \dots

as $N$ approaches infinity. One downside to the ER network is that it is not scale-free.⁴ However, given its history, the ER model can effectively describe a truly random network which can be characterized as noise. In addition, the ER model is widely studied in the literature as a baseline when making comparisons for network metrics and classification. For this research, we explore smaller sized networks, and therefore examine the Binomial distribution form of the degree distribution for the ER model.
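To see why the Binomial form matters for smaller networks, the two mass functions can be compared directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the expected degree $Np$ is held at 4 purely for the example, and both mass functions are evaluated on the log scale to avoid overflow.

```python
import math


def er_degree_binomial(x, N, p):
    """Binomial(N - 1, p) mass at degree x, computed on the log scale."""
    log_pmf = (math.lgamma(N) - math.lgamma(x + 1) - math.lgamma(N - x)
               + x * math.log(p) + (N - 1 - x) * math.log(1 - p))
    return math.exp(log_pmf)


def er_degree_poisson(x, N, p):
    """Poisson(N * p) mass at degree x, computed on the log scale."""
    lam = N * p
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def max_gap(N, p):
    """Largest pointwise difference between the two mass functions."""
    return max(abs(er_degree_binomial(x, N, p) - er_degree_poisson(x, N, p))
               for x in range(N))


# Hold N * p = 4 fixed (an illustrative choice) and let N grow.
for k in (5, 8, 10):
    N = 2**k
    print(N, max_gap(N, 4 / N))
```

The gap shrinks as $N$ grows, which is consistent with using the Binomial form when the network is small.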

2.1.3. Mixed models

When modeling complex real-world networks, it may be useful to consider a mixture model. A mixture model allows each node in a network to be distributed according to a mixture of components belonging to distributions of interest. For instance, a network may be a mixture of scale-free and random components. One example of this generative model is the Stochastic Block Model (SBM) as defined by Nowicki and Snijders.²⁰ For our purpose, we are interested in a mixture of the BA and ER models since they contain known network characteristics. Therefore, following the ideas of Schmidt and Morup²¹ who discuss the likelihood of a network taking a certain form given the components to which the nodes belong, we will be estimating the network model to which each node belongs using an SBM for degree distributions of the BA and ER network models.

2.2. Estimating the parameters of the Doubly Truncated Binomial distribution

It is well known that the degree distribution of an ER graph is a Binomial distribution.⁹,¹⁶ However, our approach only examines connected graphs; therefore, a degree of zero is impossible. In addition, our mixture approach will also force the ER representation of the network on the lower portion of its degree distribution, essentially placing an upper bound on the degree distribution. Shah²² derived a method of moments estimator, $\hat{p}$ , for the probability parameter, $p$ , of a Doubly Truncated Binomial distribution utilizing the first three sample moments. Shah has shown that given the first three sample raw moments denoted $a_{1}, a_{2},$ and $a_{3}$ , the estimate for $p$ is as follows:

\hat{p} = \frac{a_{3} - (k + N - h + 1) a_{2} + k (N - h + 1) a_{1}}{(N - 2) a_{2} + [(N - 1) (h - k) - N (N - 2)] a_{1} + N (N - h) (k - 1)}

(2)

where $k$ is the lower bound, $h$ is the number of discrete groupings truncated on the upper side, and $N$ is the sample size. Although this estimator is not the MLE, Shah has also shown that the efficiency of this method of moments estimator relative to the MLE is quite high at 0.9294. We opt for the method of moments estimate as it allows the use of the sample moments as opposed to the population moments of the distribution, which are unknown in our application.
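Equation (2) can be transcribed directly and checked against exact moments. The sketch below is illustrative: the function names are hypothetical, and the check assumes that the retained support is $k, \dots, N - h$ and that $N$ acts as the number of Binomial trials in the moment calculation. How $N$ maps onto a network quantity in practice is not settled by this example.

```python
import math


def shah_p_hat(a1, a2, a3, k, h, N):
    """Method of moments estimate of p, transcribed from Equation (2)."""
    numerator = a3 - (k + N - h + 1) * a2 + k * (N - h + 1) * a1
    denominator = ((N - 2) * a2
                   + ((N - 1) * (h - k) - N * (N - 2)) * a1
                   + N * (N - h) * (k - 1))
    return numerator / denominator


def truncated_binomial_moments(N, p, k, h):
    """Exact first three raw moments of Binomial(N, p) restricted to k..N-h.

    Assumption: the support convention k..N-h is chosen for this check.
    """
    support = range(k, N - h + 1)
    weights = [math.comb(N, x) * p**x * (1 - p) ** (N - x) for x in support]
    total = sum(weights)
    return tuple(sum(x**r * w for x, w in zip(support, weights)) / total
                 for r in (1, 2, 3))


# With exact moments, the estimator should return the p that generated them.
a1, a2, a3 = truncated_binomial_moments(N=10, p=0.3, k=1, h=4)
print(shah_p_hat(a1, a2, a3, k=1, h=4, N=10))
```

In applications the exact moments are replaced by sample moments of the observed degrees, so the estimate is then subject to sampling error.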

2.3. Network measures

Although our work focuses on the degree distribution of a network, we now define several other network measures that we compute for comparison in our data example in a later section. Closeness is a measure of how close a node is to all other nodes in the graph, and it is defined as the inverse of the sum of pairwise distances $d (n_{i}, n_{j})$ between the nodes:

C_{C} (n_{i}) = \frac{1}{(\sum_{j = 1}^{N} d (n_{i}, n_{j}))}

(3)

Betweenness measures the involvement of a node within the direct interactions between two other nodes (nodes in the middle). Node betweenness index is the sum of the probabilities of the shortest path that goes through node $i$ between all node pairs $j < k$ , where $j \neq i, k \neq i$ , and can be calculated as follows:

C_{B} (n_{i}) = \sum_{j < k} \frac{g_{jk} (n_{i})}{g_{jk}}

(4)

where $g_{jk} (n_{i})$ is the number of shortest paths between $j$ and $k$ that contain node $i$ , and $g_{jk}$ is the total number of shortest paths between $j$ and $k$ .

The clustering coefficient for a given node measures the number of connections among the node’s neighbors,¹⁰ and is related to the transitivity concept in the social network literature, where transitivity implies the idea of “a friend of a friend is a friend.” It is defined as the proportion of local relationships among neighbors compared to the potential that all of the neighbors are connected. The mathematical formulation is given by:

C_{CL} (n_{i}) = \frac{| \{e_{j, l} : n_{j}, n_{l} \in K_{n_{i}}, e_{j, l} \in E\} |}{\frac{k_{n_{i}} (k_{n_{i}} - 1)}{2}}

(5)

where $K_{n_{i}} = \{n_{j} : e_{i, j} \in E\}$ is the neighborhood of node $n_{i}$ , $e_{i, j}$ is the edge between $i$ and $j$ , and $k_{n_{i}} = | K_{n_{i}} |$ is the size of the neighborhood. The quantity in the denominator is the maximum number of edges possible in a graph of size $k_{n_{i}}$ . Another version of the group clustering coefficient was discussed by Wasserman and Faust¹⁶ as a measure of triadic closure in a graph given by the ratio of existing triangular relationships or triads over the potential triads and is given by:

C_{CL} = \frac{6 \times (\text{\# of triangles})}{\text{total \# of length-two paths}}

(6)
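Equations (3), (5), and (6) can be illustrated on a small graph. The sketch below is a minimal example, not a library implementation: the function names are hypothetical, the graph is assumed to be connected and stored as a dictionary of neighbor sets, and length-two paths in Equation (6) are assumed to be counted as ordered paths. Betweenness is omitted to keep the example short.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj, i):
    """Equation (3): inverse of the summed distances from node i."""
    return 1.0 / sum(bfs_distances(adj, i).values())


def neighbor_links(adj, i):
    """Number of edges among the neighbors of node i."""
    return sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])


def local_clustering(adj, i):
    """Equation (5): realized over possible edges among neighbors of i."""
    k = len(adj[i])
    return neighbor_links(adj, i) / (k * (k - 1) / 2) if k > 1 else 0.0


def transitivity(adj):
    """Equation (6), assuming ordered length-two paths in the denominator."""
    triangles = sum(neighbor_links(adj, i) for i in adj) / 3  # one per corner
    paths2 = sum(len(nb) * (len(nb) - 1) for nb in adj.values())
    return 6 * triangles / paths2


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(closeness(adj, 2), local_clustering(adj, 2), transitivity(adj))
```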

3. Methodology

In this section, we will introduce a network characterization method that utilizes the degree distribution of the BA and ER networks. The characterization method is a combination of statistical tests on the Pareto distribution and a method of moments estimator for the Doubly Truncated Binomial distribution. The results of these tests from the network characterization are then used in simulating a realization of the network.

3.1. Statistical test of hypothesis for the Pareto distribution

Given a random sample of size $N$ from a $Pareto (m, β)$ distribution, it is easily shown that (1) the log-likelihood function is $L (m, β | \tilde{x}) = N β \ln m + N \ln β - (β + 1) \sum_{i = 1}^{N} \ln (x_{i})$ , where $x_{i} \in [m, \infty)$ , (2) the MLE for $m$ is ${\hat{m}}_{MLE} = x_{(1)}$ , the smallest observed value, and (3) the MLE for $β$ is as follows:

{\hat{β}}_{MLE} = N {[\sum_{i = 1}^{N} \ln \frac{x_{i}}{x_{(1)}}]}^{- 1} = N {[\ln \frac{\prod_{i = 1}^{N} x_{i}}{x_{(1)}^{N}}]}^{- 1}
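Both estimators are simple to compute from an observed sample. The following sketch is illustrative; the function name `pareto_mle` is hypothetical, and the input is assumed to be a list of positive values that are not all equal.

```python
import math


def pareto_mle(xs):
    """Return (m_hat, beta_hat) for a Pareto(m, beta) sample."""
    N = len(xs)
    m_hat = min(xs)  # smallest observed value
    log_sum = sum(math.log(x / m_hat) for x in xs)
    if log_sum == 0.0:
        raise ValueError("beta_hat is undefined when all values are equal")
    return m_hat, N / log_sum


# Small worked example: the log ratios are 0, 1, 2, so beta_hat = 3 / 3.
print(pareto_mle([1.0, math.e, math.e**2]))
```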

3.1.1. Statistical test of hypothesis for parameter m of a Pareto(m, β) distribution

Let $W (x) = x_{(1)}$ , the MLE for $m$ and a sufficient statistic.²³ The PDF for random variable $X_{(1)}$ , the smallest observed value in a sample of size $N$ from a $Pareto (m, β)$ , can be shown to be $f_{X_{(1)}} (x) = m^{N β} N β x^{- β N - 1} I_{[m, \infty)} (x)$ which is a $Pareto (m, N β)$ distribution. Consider a one-sided hypothesis test for a particular value of $m$ , call it $m_{1}$ , through the hypothesis $H_{0} : m \leq m_{1}$ versus $H_{A} : m > m_{1}$ where $β$ is known. The likelihood ratio test (LRT) can be obtained by:

λ^{*} (W (x)) = \frac{\max_{m \leq m_{1}} L^{*} (m | W (x))}{\max_{m} L^{*} (m | W (x))} = {(\frac{m_{1}}{x_{(1)}})}^{N β}

which has a rejection region of $\{x : λ^{*} (W (x)) \leq c\} = \{x : x_{(1)} \geq (m_{1} / c^{\frac{1}{N β}})\}$ . Therefore, a level $α$ hypothesis test where $α = P (X_{(1)} \geq m_{1} / c^{\frac{1}{N β}})$ will reject $H_{0}$ if:

x_{(1)} \geq (\frac{m_{1}}{α^{\frac{1}{N β}}})

(7)
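The size of the test in Equation (7) can be checked by simulation on independent Pareto draws. The sketch below is an illustration under assumptions: the function names are hypothetical, samples are drawn by inverting the Pareto distribution function, and the draws are independent, which is not the case for simulated BA degrees.

```python
import random


def m_test_critical_value(m1, beta, N, alpha=0.05):
    """Critical value from Equation (7): m1 / alpha^(1 / (N * beta))."""
    return m1 / alpha ** (1.0 / (N * beta))


def reject_m(xs, m1, beta, alpha=0.05):
    """Reject H0: m <= m1 when the sample minimum reaches the critical value."""
    return min(xs) >= m_test_critical_value(m1, beta, len(xs), alpha)


def sample_pareto(m, beta, N, rng):
    """Independent Pareto(m, beta) draws via the inverse distribution function."""
    return [m * (1.0 - rng.random()) ** (-1.0 / beta) for _ in range(N)]


def rejection_rate(m_true, m1, beta, N, reps, seed=0):
    """Monte Carlo estimate of the probability of rejecting H0."""
    rng = random.Random(seed)
    hits = sum(reject_m(sample_pareto(m_true, beta, N, rng), m1, beta)
               for _ in range(reps))
    return hits / reps


# At the boundary m = m1 the rejection rate should be close to alpha = 0.05.
print(rejection_rate(m_true=2.0, m1=2.0, beta=2.0, N=30, reps=2000))
```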

A detailed power analysis for this test was conducted by Mohd-Zaid.²⁴ The power of the test on $H_{0} : m \leq m_{1}$ versus $H_{A} : m > m_{1}$ was computed for a $Pareto (m, 2)$ with $m \in [1, 7]$ at increments of 0.2 for each BA network of size $N \in \{2^{k} : k = 5, 6, \dots, 15\}$ using 1000 iterations of the test described in Equation (7) with $α = 0.05$ . Power was computed as the proportion of times $H_{0}$ was rejected out of the 1000 iterations. Note that the true value of $m$ for the BA network depends on $N$ for smaller networks, such that $m_{c} = m \sqrt{(N - m_{0}) / N}$ , and therefore, power should be equal to $α$ at this value of $m_{c}$ . The power of the test converges to a steady state as $k$ increases but drops to $α$ as $m_{1}$ approaches $m_{c}$ , which indicates that the Type-II error in a small neighborhood of $m_{c}$ is fairly high. In addition, as soon as $m_{1} > m_{c}$ , the power of the test quickly approaches zero, which is expected, as shown in an example with $m = 2$ in Figure 1.

Figure 1.

Power curve for the test on $m$ for $m = 4$ .

3.1.2. Hypothesis test for parameter β of a Pareto(m, β) distribution

Consider a simple two-sided test, $H_{0} : β = β_{0}$ versus $H_{A} : β \neq β_{0}$ , which is identical to the simple test $H_{0} : β = β_{0}$ versus $H_{A} : β = β_{1}$ described in Appendix 1. It should also be noted that in cases where $m$ is known, the statistic $T$ becomes $\ln (\prod_{i = 1}^{N} x_{i} / m^{N})$ and the transformation $2 β_{0} T$ has the distribution of $χ_{2 N}^{2}$ .
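The known-$m$ case can be checked by simulation: a $χ_{2 N}^{2}$ variable has mean $2 N$ and variance $4 N$ , so the simulated values of $2 β_{0} T$ should match those moments when the draws are independent. The sketch below is illustrative, with hypothetical function names and inverse-transform sampling as an assumed way of generating Pareto draws.

```python
import math
import random
import statistics


def t_statistic_known_m(xs, m):
    """T = ln(prod(x_i) / m^N), written as a sum of log ratios."""
    return sum(math.log(x / m) for x in xs)


def simulate_z(m, beta0, N, reps, seed=0):
    """Simulated values of 2 * beta0 * T for independent Pareto(m, beta0) samples."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xs = [m * (1.0 - rng.random()) ** (-1.0 / beta0) for _ in range(N)]
        values.append(2.0 * beta0 * t_statistic_known_m(xs, m))
    return values


z = simulate_z(m=2.0, beta0=2.0, N=20, reps=3000)
# Compare with the chi-square moments: mean 2N = 40, variance 4N = 80.
print(statistics.mean(z), statistics.variance(z))
```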

The power of the test for $H_{0} : β = 2$ versus $H_{A} : β \neq 2$ was computed for $β \in [1, 3]$ at increments of 0.02 for each BA network of size $N \in \{2^{k} : k = 5, 6, \dots, 15\}$ . A plot of the power curve for this test is given in Figure 2, and it is apparent that the power of the test improves rapidly as the size of the network gets larger. Moreover, for a fixed $k$ , we obtained identical power curves for different values of $m$ , so we conclude that the power of this test is invariant to $m$ .

Figure 2.

Power curve for the test on $β$ .

3.1.3. UIT for both parameters of Pareto(m, β)

A UIT can be formed if the null hypothesis can be expressed as an intersection. In this case, we can test for $m$ and $β$ simultaneously by forming the hypotheses:

\begin{matrix} H_{0} & : \{m \leq m_{1} \cap β = β_{0}\} \\ H_{A} & : \{m > m_{1} \cup β > β_{0} \cup β < β_{0}\} \end{matrix}

(8)

If we define $C (x) = \inf_{γ \in \{m, β\}} λ_{γ} (x)$ , where $λ_{γ} (x)$ is the LRT for the individual tests, then $λ_{γ} (x)$ is a level $α$ test, and the UIT based on $C (x)$ is a level $α$ test. Therefore, the rejection region for Equation (8) is given by:

\{x : x_{(1)} \geq \frac{m_{1}}{α^{\frac{1}{N β_{0}}}} \text{ or } T \leq \frac{z_{α} \sqrt{(N - 1)} + (N - 1)}{β_{0}} \text{ or } T \geq \frac{z_{1 - α} \sqrt{(N - 1)} + (N - 1)}{β_{0}}\}

If $P_{C} (q)$ and $P_{λ} (q)$ are the power functions for the tests based on $C$ and $λ$ , respectively, then $P_{C} (q) \leq P_{λ} (q)$ for every $q \in \{(m, β)\}$ . Hence, the power of the UIT will be bounded by the power of the individual LRTs at the specified parameter value $q$ . Therefore, the power of the UIT at each parameter value $q$ is simply the maximum of the power of each individual test at that value. The power of the UIT can be visualized using a surface plot with respect to $(m, β, P_{C} (q))$ on the $(x, y, z)$ axes, respectively.
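The rejection region for Equation (8) can be sketched as a single decision function. This is a minimal illustration under assumptions: the function name `uit_reject` is hypothetical, and $T$ is taken to be $\sum_{i} \ln (x_{i} / x_{(1)})$ , the unknown-$m$ analogue of the statistic in section 3.1.2, since its full definition is given in Appendix 1. Normal quantiles are taken from the Python standard library.

```python
import math
from statistics import NormalDist


def uit_reject(xs, m1, beta0, alpha=0.05):
    """Reject H0: (m <= m1 and beta = beta0) if any component test rejects."""
    N = len(xs)
    x_min = min(xs)
    # Assumption: T uses the sample minimum in place of an unknown m.
    T = sum(math.log(x / x_min) for x in xs)
    z = NormalDist()
    center, spread = N - 1, math.sqrt(N - 1)
    reject_on_m = x_min >= m1 / alpha ** (1.0 / (N * beta0))
    reject_low = T <= (z.inv_cdf(alpha) * spread + center) / beta0
    reject_high = T >= (z.inv_cdf(1 - alpha) * spread + center) / beta0
    return reject_on_m or reject_low or reject_high


# An idealized Pareto(2, 2) sample built from evenly spaced quantiles.
N = 200
xs = [2.0 * (1 - (i + 0.5) / N) ** -0.5 for i in range(N)]
print(uit_reject(xs, m1=2.0, beta0=2.0))  # hypotheses match the sample
print(uit_reject(xs, m1=2.0, beta0=3.0))  # beta0 misspecified
```

Because the three conditions are joined by "or", the function rejects whenever any single component rejects, which mirrors the union structure of the rejection region.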

As shown by Mohd-Zaid,²⁴ the power of the UIT is an improvement over the individual tests except when $β_{0} = 2$ and $m_{1}$ is greater than the true $m$ , at which point power levels off at 0.05. Therefore, if a given distribution is truly not from a Pareto distribution, the probability of rejecting $H_{0}$ will be higher for $β_{0} = 2$ when $m_{1}$ is hypothesized lower than the true value, $m \sqrt{(N - m_{0}) / N}$ , as opposed to one that is higher. However, if we hypothesize $m_{1}$ to be the true value, it is inconsequential whether $β_{0}$ is overestimated or underestimated, as either will result in roughly the same probability of rejection. In addition, similar to the power of the individual tests, the power of the UIT improves as $k$ increases, particularly on the $β_{0}$ axis. That is, for smaller $k$ , the combination of $m_{1}$ and $β_{0}$ affects the power of the test much more than when $k$ is large, at which point only $β_{0}$ affects the power of the test.

3.2. Correction for the test on simulated BA network

Network data were simulated to calculate the power of each test. A data set comprised of degree distributions of simulated BA networks for various parameter and size combinations was generated using the igraph package in R.²⁵

The parameter selection for the simulation is listed in Table 1, where 1000 independent networks were generated for each of the 44 combinations of graph parameter and sizes.

Table 1.

Parameters for network simulation.

Parameters	Size
$m^{*} \in \{1, 2, 4, 6\}$	$N = 2^{k}; k \in \{5, 6, \dots, 14, 15\}$
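The design in Table 1 can be enumerated as a grid, which makes the count of 44 combinations explicit. The variable names below are hypothetical, and the snippet only lists the design; it does not generate the networks.

```python
from itertools import product

M_STAR = (1, 2, 4, 6)       # edges added per new node
K_VALUES = range(5, 16)     # network sizes N = 2^k for k = 5, ..., 15
REPLICATES = 1000           # independent networks per combination

design = [(m_star, 2**k) for m_star, k in product(M_STAR, K_VALUES)]
print(len(design), len(design) * REPLICATES)  # combinations, total networks
```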

From this point forward, we will let $Z = 2 β_{0} T$ and let $\bar{z}$ be the sample mean of $Z$ for convenience. Recall that $Z$ has the distribution $χ_{2 (N - 1)}^{2}$ . Upon inspection of the simulated data, the evolution of the network generation causes the degree of the nodes to be correlated to one another due to the preferential attachment nature of the BA graph. The correlation causes the expected value of $Z$ to be slightly biased and the variance to be much smaller than the expected variance when compared to the $χ_{2 (N - 1)}^{2}$ distribution. This is not unexpected as Li et al.²⁶ have observed that the preferential attachment model causes biases in the structure of the graph where high-degree nodes are interconnected, and Bubeck et al.²⁷ have theoretically shown that the resulting graph of preferential attachment is very dependent on the starting graph. In addition, Mohd-Zaid et al.²⁸ have empirically shown that the degree distribution of simulated BA networks does not follow the theoretical result originally derived by Barabási and Albert.⁴ Therefore, we simulated another set of data to study the behavior of the bias in the expected value and variance and to derive a test for the scale-free property that corrects for this bias.

Although there is a bias in the sample mean and variance from the empirical distribution of $Z$ (Figure 3), it seems to converge as the network size increases. Furthermore, both of the ratios, $\bar{z} / E [Z]$ and $s_{z}^{2} / Var [Z]$ , can be modeled by the five parameter Biexponential (denoted as $f (k)$ ) and four parameter Gompertz (denoted as $g (k)$ ) models, respectively, for each $m$ as a function of $k$ with an $R^{2} \geq 0.99$ for both models (Table 3). The estimates of the parameters for the correcting scalars in Table 2 using $β_{0} = 2$ are given in Table 3. Therefore, the level $α$ test from Equation (12) becomes as follows:

\begin{matrix} T \leq \frac{z_{α} \sqrt{(N - 1) g (k)} + (N - 1) f (k)}{β_{0}} & \text{for } β_{0} > β_{1} \\ \text{or} & \\ T \geq \frac{z_{1 - α} \sqrt{(N - 1) g (k)} + (N - 1) f (k)}{β_{0}} & \text{for } β_{0} < β_{1} \end{matrix}

(9)

which rejects $H_{0} : β = β_{0}$ for the alternative $H_{A} : β \neq β_{0}$ .

Figure 3.

Top: ratio of $s_{z}^{2} / Var [Z]$ versus network size. Bottom: ratio of $\bar{z} / E [Z]$ versus network size.

Table 2.

Biexponential and Gompertz models for $\bar{z} / E [Z]$ and $s_{z}^{2} / Var [Z]$ , respectively.

Correcting scalars	Nonlinear model
$\bar{z} / E [Z]$	$f (k) = a + b \exp \{- ck\} + d \exp \{- hk\}$
$s_{z}^{2} / Var [Z]$	$g (k) = a + (b - a) \exp \{- \exp \{- c (k - d)\}\}$

Table 3.

Parameter estimates for $f (k)$ and $g (k)$ .

$m$	$f (k)$					$g (k)$
$m$	$a$	$b$	$c$	$d$	$h$	$a$	$b$	$c$	$d$
1	0.8653	313.24	0.6947	−313.66	0.6958	0.1168	0.1859	0.6626	4.7800
2	0.9264	2.6131	0.5890	−5.5817	0.8596	−0.1465	0.1229	0.4391	1.6905
4	0.9610	3.4200	0.5532	−11.951	0.8753	−0.0064	0.0710	0.4558	4.9327
6	0.9733	4.1794	0.5390	−18.966	0.8754	−0.0084	0.0503	0.4259	5.3396
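The correcting scalars and the corrected bounds of Equation (9) can be evaluated directly from Tables 2 and 3. The sketch below is illustrative: the dictionary and function names are hypothetical, the parameter values are copied from Table 3 (fitted with $β_{0} = 2$ ), and applying them with any other $β_{0}$ would be an assumption beyond what the table supports.

```python
import math
from statistics import NormalDist

# Table 3 estimates, keyed by m: (a, b, c, d, h) for f and (a, b, c, d) for g.
F_PARAMS = {
    1: (0.8653, 313.24, 0.6947, -313.66, 0.6958),
    2: (0.9264, 2.6131, 0.5890, -5.5817, 0.8596),
    4: (0.9610, 3.4200, 0.5532, -11.951, 0.8753),
    6: (0.9733, 4.1794, 0.5390, -18.966, 0.8754),
}
G_PARAMS = {
    1: (0.1168, 0.1859, 0.6626, 4.7800),
    2: (-0.1465, 0.1229, 0.4391, 1.6905),
    4: (-0.0064, 0.0710, 0.4558, 4.9327),
    6: (-0.0084, 0.0503, 0.4259, 5.3396),
}


def f_correction(m, k):
    """Biexponential model for the mean ratio (Table 2)."""
    a, b, c, d, h = F_PARAMS[m]
    return a + b * math.exp(-c * k) + d * math.exp(-h * k)


def g_correction(m, k):
    """Gompertz model for the variance ratio (Table 2)."""
    a, b, c, d = G_PARAMS[m]
    return a + (b - a) * math.exp(-math.exp(-c * (k - d)))


def corrected_bounds(m, k, beta0=2.0, alpha=0.05):
    """Lower and upper critical values for T from Equation (9)."""
    N = 2**k
    z = NormalDist()
    center = (N - 1) * f_correction(m, k)
    spread = math.sqrt((N - 1) * g_correction(m, k))
    lower = (z.inv_cdf(alpha) * spread + center) / beta0
    upper = (z.inv_cdf(1 - alpha) * spread + center) / beta0
    return lower, upper


print(f_correction(2, 10), g_correction(2, 10), corrected_bounds(2, 10))
```

Since $g (k)$ is well below one for every row of Table 3, the corrected interval for $T$ is much narrower than the uncorrected one, which reflects the smaller empirical variance described in the text.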

The power of each test of hypothesis in Equation (9) for the BA network was evaluated assuming an $α = 0.05$ level of significance for each test. In this implementation of the test on $m$ , the estimate $x_{(1)}$ will always be equal to $m^{*} \in \{1, 2, 4, 6\}$ due to the way the BA network is simulated, where the smallest degree possible for any generation of the graph is $m^{*}$ . This fact causes the test to behave differently than it would have with a theoretical distribution as shown in section 3.1.1. Therefore, we can consider the test with two possible true values: $m = m^{*}$ or $m = m^{*} \sqrt{(N - m^{*}) / N}$ . The power of the test described by Equation (7) on $H_{0} : m \leq m_{1}$ versus $H_{A} : m > m_{1}$ is computed similarly to the process in section 3.1.1. With $m = m^{*}$ , the power curve suggests that for $k \geq 11$ ( $k \geq 9$ for $m^{*} = 1$ and $k \geq 10$ for $m^{*} = 2$ ), the power of the test converges to a steady state similar to the general test for $m$ in section 3.1.1. However, the power of the test is very poor for $k \leq 10$ ( $k \leq 8$ for $m^{*} = 1$ and $k \leq 9$ for $m^{*} = 2$ ), where the power drops to zero even before $m_{1}$ approaches $m$ .

If $m = m^{*} \sqrt{(n - m^{*}) / n}$ , the test has a Type-I error of 100% for $m^{*} = 4, 6$ which implies that the test will always reject $H_{0}$ even when $H_{0}$ is true. Therefore, when implementing the test on a simulated BA network in such cases, letting $m = m^{*} \sqrt{(n - m^{*}) / n}$ instead of $m = m^{*}$ essentially renders the test unusable.

Using the appropriate values from Table 3 for $H_{0} : β = 2$ versus $H_{A} : β \neq 2$ , the power of the test was computed for $β$ similar to that of section 3.1.2. Again, the power of the test improves as $k$ increases, but unlike the general test for $β$ from Equation (12) and plotted in Figure 2, as $m$ increases, the power also improves considerably. Another noticeable difference is that the power increases at a much faster rate across all values of $k$ .

Due to the findings in Li et al.,²⁶ Bubeck et al.,²⁷ and Mohd-Zaid et al.,²⁸ the power of the test for $β_{0}$ with the assumption that $β_{0} = 2.16$ and 2.45 was also computed where the estimates for the correcting scalars were computed using the appropriate $β_{0}$ values. However, the result did not differ from when $β_{0} = 2$ which suggests that the power of the test is invariant of $β_{0}$ and is strictly a function of the difference from $β_{0}$ . This implies that regardless of the true exponent of the degree distribution for the BA network, the proposed test in Equation (9) is able to test a given network power law assumption if the transformation of the degree distribution, $2 β_{0} T$ , under the assumed $β_{0}$ follows that of the corrected transformation of the BA test as in Equation (9).

The power of the UIT for BA networks is an improvement over the individual tests except where it is stationary at 0.05 when $β_{0} = 2$ and $m_{1}$ is greater than the true value. In addition, due to the tighter variance of the empirical distribution of $2 β_{0} T$ , the power improves much faster when compared to the unadjusted UIT test for Pareto parameters in section 3.1.3. The power also varies with respect to the true value of $m$ , where higher power is observed for larger values of $m$ . The results are plotted in Figure 4 for various network sizes. Interestingly, Figure 4 shows that if $β_{0} = 2$ is hypothesized for a non-BA network, then a misspecification of $m_{1}$ that is lower than the true value of $m$ will result in a higher probability of rejection than an overspecification of $m_{1}$ . However, if our hypothesized $m_{1}$ is equal to the true value, then it does not matter if $β_{0}$ is misspecified, as it will result in the same probability of rejection. In addition, similar to the power of the individual tests, the power of the UIT improves as $k$ increases, particularly on the $β_{0}$ axis (for differences in $β_{0}$ ).

Figure 4.

Contour plot of the power for the BA UIT for $k = 5, 11$ from left to right, respectively, and $m = 2, 4$ from top to bottom, respectively.

4. Mixture model approach for characterizing scale-free networks

We will now propose a method that takes advantage of well-known degree distributions for the BA and ER networks. Each network model creates networks with degree distributions that can be approximated by the well-defined Pareto and Binomial distributions as previously described. However, we take a stepwise approach by first testing a proportion of a network’s degree distribution using our proposed UIT. This is performed in order to characterize said proportion of the network and model it using the BA model by assuming that it has the power law characteristics. The remainder of the degree distribution is then assumed to follow a Doubly Truncated Binomial distribution from the ER randomness property, and the parameters of the distribution are then estimated through the maximum-likelihood approach.

The UIT from section 3.2 is used by hypothesizing a value for $m_{1}$ such that the test is not rejected. However, since $m_{1}$ is the lower bound of the support for the distribution, any $m_{1} > 1$ leaves degree values that are smaller than the support, which will cause $T$ to be biased. Therefore, the UIT on this set of $m_{1}$ needs to be performed on the truncated degree distribution, which essentially reduces the original network to a sub-network that is associated with the truncated degree distribution. This sub-network contains the “main hubs” of the original network that connect the entire network. In essence, it becomes a test of the network’s remaining central nodes. In order to do so, the parameter estimates for $f (k)$ and $g (k)$ have to be obtained for the appropriate $m_{1}$ values. This is performed by interpolating the values of $f (k)$ and $g (k)$ between the ranges of $k$ and $m$ from Table 3, which gives us a surface over the $k$ and $m$ axes.
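
One way to picture the interpolated surface is sketched below for the mean-ratio correction $f$. The text does not state which interpolation scheme was used, so linear interpolation over $m$ at a fixed $k$ is an assumption of this sketch, as are the names `f_tab` and `f_surface`; the parameter values are those of Table 3.

```python
import math

# Table 3 biexponential estimates (a, b, c, d, h) for the tabulated m values.
F_PARAMS = {
    1: (0.8653, 313.24, 0.6947, -313.66, 0.6958),
    2: (0.9264, 2.6131, 0.5890, -5.5817, 0.8596),
    4: (0.9610, 3.4200, 0.5532, -11.951, 0.8753),
    6: (0.9733, 4.1794, 0.5390, -18.966, 0.8754),
}


def f_tab(k, m):
    """f(k) evaluated with the Table 3 row for a tabulated m."""
    a, b, c, d, h = F_PARAMS[m]
    return a + b * math.exp(-c * k) + d * math.exp(-h * k)


def f_surface(k, m1):
    """Linearly interpolate f over m at a fixed k (assumed scheme)."""
    grid = sorted(F_PARAMS)
    if not grid[0] <= m1 <= grid[-1]:
        raise ValueError("m1 lies outside the tabulated range of m")
    if m1 in F_PARAMS:
        return f_tab(k, m1)
    lo = max(m for m in grid if m < m1)   # nearest tabulated m below m1
    hi = min(m for m in grid if m > m1)   # nearest tabulated m above m1
    w = (m1 - lo) / (hi - lo)
    return (1 - w) * f_tab(k, lo) + w * f_tab(k, hi)
```

For example, `f_surface(5.0875, 3)` averages the $m = 2$ and $m = 4$ curves at the $k$ value of a 34-node network. The same construction would apply to $g$.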

For the lower portion of the degree distribution, where the degrees are smaller than $m_{1}$ , we obtain the method of moments estimates of the parameters of the Doubly Truncated Binomial distribution by assuming that the remaining degrees have random characteristics that can be modeled by the ER model. Since we are only considering connected networks, the smallest degree possible for any given network is 1, which implies that we have a support of $[1, m_{1}]$ . In order to utilize the ER model, we need to obtain an estimate of the edge probability, $p$ . If we define the remaining degree distribution as $d = \{x_{1}, \dots, x_{n}\}$ with each $x_{i} \in [1, m_{1}]$ , then Equation (2) gives the estimate:

$$\hat{p} = \frac{a_{3} - (1 + n - h + 1) a_{2} + (n - h + 1) a_{1}}{(n - 2) a_{2} + ((n - 1) (h - 1) - n (n - 2)) a_{1}} \tag{10}$$

such that $a_{1}$ , $a_{2}$ , and $a_{3}$ are the sample mean, variance, and third central moment of $d$ , respectively, and $h = n - m_{1} + 1$ .

In order to investigate the goodness-of-fit of our characterization method described above, we propose a network simulation algorithm that takes into account the estimated parameters of the degree distribution. Schmidt and Morup²¹ proposed a Markov Chain Monte Carlo method to derive a model for the infinite mixture model using a limit of a finite parametric model. The SBM creates graphs with communities which we will use in the simulation of a characterized network. This algorithm generates a network of a given size that preserves the degree distribution of the characterized network. Algorithm 1 outlines the process for network generation using an SBM approach.²⁰ Using SBM, a network is generated via a probability matrix that defines the edge probability between a specified number of components (here as BA and ER) and the edge probability within each component. In this process, the edge probability matrix of the BA network of size $(N \times π_{BA})$ is first computed using 100 simulated networks using the parameters from the characterization. Here, $π_{BA}$ is the proportion of the degree distribution characterized as having a BA network degree distribution. One caveat to our approach is that we treat each node in the BA probability matrix as a single component of size one in the SBM. This defines the BA portion of the network. The ER portion is defined by a single cell that reflects the edge probability $p_{ER}$ . The probability of connectivity between the components can be arbitrary, but for our purpose, we assign $p_{ER}$ as the probability of connectivity which indicates that any node in the BA portion of the network can connect to a noise node by chance.

Algorithm 1.

MNM SBM function.

1: Given $θ = \{N, π_{BA}, m, p_{ER}\}$
2: for $i$ in 1 to 100 do
3: Create Barabási–Albert network, $G_{BA_{i}}$ , of size $N \times π_{BA}$ using linear preferential attachment with edge parameter $m$
4: end for
5: Compute edge probability matrix $E_{BA}$ by adding the number of edges $(a, b)$ across all $G_{BA_{i}}$ and dividing by 100
6: Define vector $v_{ER}$ s.t. each element in $\{v_{1}, \dots, v_{(N \times π_{BA})}\}$ is equal to $p_{ER}$ ; these are the edge probabilities between nodes in $G_{BA}$ and $G_{ER}$
7: Define a probability matrix $E$ by adding $v_{ER}$ as the last column and row of $E_{BA}$ and defining the bottom right entry of $E$ as $p_{ER}$
8: Generate graph $G$ using SBM with probability matrix $E$ and component sizes $\{1_{1}, \dots, 1_{N \times π_{BA}}, N \times (1 - π_{BA})\}$
9: Return graph $G$ with $N$ nodes
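
The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results: the BA generator is a simple stand-in for a library routine, $N \times π_{BA}$ is rounded to the nearest integer, and the names `ba_edges` and `mnm_sbm` are hypothetical. Because every BA node is its own block of size one, the SBM draw reduces to one independent Bernoulli trial per node pair.

```python
import random
from itertools import combinations


def ba_edges(n, m, rng):
    """Edge set of one BA graph on nodes 0..n-1 (linear preferential attachment)."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    targets = list(range(m))   # the first new node links to the m seed nodes
    repeated = []              # each node appears once per incident edge
    edges = set()
    for new in range(m, n):
        edges.update((t, new) for t in targets)
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()         # m distinct targets, drawn proportionally to degree
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges


def mnm_sbm(N, pi_ba, m, p_er, n_ref=100, seed=None):
    """Sketch of Algorithm 1; returns the edge list of one simulated graph."""
    rng = random.Random(seed)
    n_ba = round(N * pi_ba)    # assumed rounding of N x pi_BA
    # Steps 2-5: count each edge over n_ref BA graphs to estimate E_BA.
    counts = {}
    for _ in range(n_ref):
        for edge in ba_edges(n_ba, m, rng):
            counts[edge] = counts.get(edge, 0) + 1
    # Steps 6-8: BA-BA pairs use E_BA; every pair involving an ER node uses p_ER.
    edges = []
    for a, b in combinations(range(N), 2):
        p = counts.get((a, b), 0) / n_ref if b < n_ba else p_er
        if rng.random() < p:
            edges.append((a, b))
    return edges
```

With the Karate Club parameters reported in the text, a call would look like `mnm_sbm(34, 0.65, 3, 0.002, seed=1)`, which places nodes 0 to 21 in the BA portion and nodes 22 to 33 in the ER portion.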

We now illustrate an application of the characterization and simulation methods on the well-known Zachary’s²⁹ Karate Club data set. This network describes the relationship between 32 students and 2 teachers in a karate school. Members are connected if they attended a common activity such as going to the same class. This network is undirected and weighted; however, we do not consider the edge weights for our purpose. The club is split into two factions, each led by one of the teachers. In the characterization part, applying our UIT on the degree distribution of the network resulted in a best fit of $m_{1} = 3$ , with 65% (22 nodes) of the network having a degree of at least three. We then fitted the Doubly Truncated Binomial distribution on the remainder of the degree distribution, where the degree of the nodes is less than three, resulting in ${\hat{p}}_{ER} = 0.002$ .
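
The split of the degree distribution at $m_{1}$ can be illustrated with a short sketch. The degree sequence below is the 34-node, 78-edge version of the Karate Club network as commonly distributed with network libraries; it is an assumption of this example rather than data taken from the text, and `split_degrees` is a hypothetical helper name.

```python
# Karate Club degree sequence (34 nodes, 78 edges), assumed for illustration.
KARATE_DEGREES = [16, 9, 10, 6, 3, 4, 4, 4, 5, 2, 3, 1, 2, 5, 2, 2, 2,
                  2, 2, 3, 2, 2, 2, 5, 3, 3, 2, 4, 3, 4, 4, 6, 12, 17]


def split_degrees(degrees, m1):
    """Split a degree sequence at m1 into the BA (upper) and ER (lower) parts."""
    upper = [d for d in degrees if d >= m1]   # tested with the UIT
    lower = [d for d in degrees if d < m1]    # fitted with the truncated binomial
    pi_ba = len(upper) / len(degrees)
    return upper, lower, pi_ba


upper, lower, pi_ba = split_degrees(KARATE_DEGREES, m1=3)
mean_degree = sum(KARATE_DEGREES) / len(KARATE_DEGREES)
```

Under this degree sequence, 22 of the 34 nodes have degree at least three, which agrees with the 65% reported above, and the mean degree agrees with the empirical value of 4.5882.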

In the simulation part, we then simulated 1000 networks using Algorithm 1 with parameter set $\{N = 34, π_{BA} = 0.65, m = 3, p_{ER} = 0.002\}$ . The mean degree, mean betweenness, mean closeness, clustering coefficient, and density were computed for each network. This created a bootstrapped distribution of the measures, which can then be compared to the empirical network measures as a goodness-of-fit of the characterization and simulation steps. Figure 5 illustrates the distribution of the measures from the simulated networks in comparison to the measures of the empirical network. The true mean degree, clustering coefficient, and density were all captured within the 95% confidence bound of the bootstrapped measures. Although mean betweenness and mean closeness were not captured, the empirical measures did not fall far outside the 95% confidence bound. Table 4 outlines the mean squared error (MSE), bias, and variance of the bootstrapped distributions with respect to the empirical values. We see that the MSE, bias, and variance for mean closeness are fairly small, indicating that although the bootstrapped interval did not capture the true value, the estimates are extremely close; therefore, our network model estimates for BA and ER describe the network reasonably well.

Figure 5.

Violin plots of the bootstrapped network measures of simulated Karate Club network.

Table 4.

MSE, bias, and variance of bootstrapped network measures for Karate Club.

Network measures	Empirical	MSE	Bias	Variance
Mean degree	4.5882	0.2633	−0.3549	0.1374
Mean betweenness	23.2353	68.1823	7.4369	12.8755
Mean closeness	0.0129	3.4007e−06	−1.6881e−03	5.5108e−07
Clustering coefficient	0.2556	0.0051	0.0523	0.0024
Density	0.1390	2.4181e−04	−0.0108	1.2616e−04

MSE: mean squared error.
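
The entries of Table 4 are linked by the usual decomposition of MSE into squared bias plus variance. The sketch below shows how such summaries could be computed from a list of simulated values. It assumes that bias is the mean of the simulated values minus the empirical value, that variance uses the divisor $n$, and that the 95% bound is a simple percentile interval; `bootstrap_fit` is a hypothetical name.

```python
import math


def bootstrap_fit(simulated, empirical, level=0.95):
    """MSE, bias, variance, and interval capture for one network measure."""
    n = len(simulated)
    mean = sum(simulated) / n
    bias = mean - empirical                                   # assumed sign convention
    variance = sum((s - mean) ** 2 for s in simulated) / n    # divisor n
    mse = sum((s - empirical) ** 2 for s in simulated) / n
    ordered = sorted(simulated)
    tail = (1 - level) / 2
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1 - tail) * (n - 1))]
    return {
        "mse": mse,
        "bias": bias,
        "variance": variance,
        "captured": lower <= empirical <= upper,
    }


# Toy values standing in for simulated mean degrees (not the paper's output).
fit = bootstrap_fit([4.0, 4.2, 4.4, 4.6], empirical=4.5882)
```

As a check against Table 4, the mean degree row gives $(-0.3549)^{2} + 0.1374 \approx 0.2633$ , which matches the reported MSE.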

5. Application on real-world networks

We applied our characterization and simulation methods to real-world data sets in order to see the usability of our methods.

As previously stated, identification of the scale-free property for even a portion of a network is important for characterizing and monitoring networks, and our approach allows those networks or sub-networks to be modeled through a BA graph. The data sets summarized in Table 5 consist of networks of various sizes that are available in the literature³⁶,⁴⁰ from a variety of fields that are believed to be scale-free. It should be noted that all of these networks were treated as undirected networks for the analysis.

Table 5.

Real-world data description.

Network	Brief description	Type	$k : N = 2^{k}$	Reference
Karate Club	Social network of friendships	Undirected	5.0875	Zachary²⁹
Office	Interactions in a small business office	Directed	5.3219	Killworth and Bernard³⁰
Dolphin Social Network	Social network of dolphins	Undirected	5.9542	Lusseau et al.³¹
Les Miserables	Coappearance of Les Miserables characters	Undirected	6.2668	Knuth³²
Political Blogs	Hyperlinks between US politics weblogs	Directed	10.5411	Adamic and Glance³³
Facebook Social Circles	Social network of friendships	Undirected	11.97978	McAuley and Leskovec³⁴
High-Energy Theory Collaborations	Coauthorships between scientists	Undirected	13.0295	Newman³⁵
Astrophysics Collaborations	Coauthorships between scientists	Undirected	14.0281	Newman³⁵
Internet	Internet structure	Undirected	14.4870	Newman³⁶
High-Energy Physics Theory Citations	Network of paper citations	Directed	14.7612	Gehrke et al.³⁷ and Leskovec et al.³⁸
High-Energy Physics Phenomenology Citations	Network of paper citations	Directed	15.0762	Gehrke et al.³⁷ and Leskovec et al.³⁸
Condensed Matter Collaborations	Coauthorships between scientists	Undirected	15.3028	Newman³⁵
Google Webgraphs	Network of hyperlinks between webpages	Directed	19.7401	Leskovec et al.³⁹

The UIT was then conducted by hypothesizing that $m_{1} \in \{1, 2, 3, 4, \dots\}$ in order to see whether or not a portion of any of the networks can be hypothesized to follow the BA model. Recall that, as in the Karate Club example, hypotheses of $m_{1} > 1$ resulted in degree values that were outside of the support, which caused $T$ to be biased. Therefore, the UIT on this set of $m_{1}$ needed to be performed on the right portion of the truncated degree distribution, which essentially reduced each original network to a sub-network that was associated with the truncated degree distribution. These sub-networks contained the “main hubs” of the original networks that connected the entire network. In essence, it became a test of the network’s remaining central nodes. In order to do so, the parameter estimates for $f (k)$ and $g (k)$ had to be obtained for the appropriate $m_{1}$ values. Since it is impractical to do so for every single value in $\{2, 3, 4, \dots\}$ due to computation time, we took the approach proposed by Clauset et al.⁶ and estimated the $m_{1}$ value that gives the best fit for $β_{0} = 2$ for each of the networks. From there, the correction parameters of $f (k)$ and $g (k)$ for the obtained $m_{1}$ values were estimated. Interpolating the values between the ranges of $k$ and $m$ then gave us a surface over the $k$ and $m$ axes for each correction function, $f (k, m)$ and $g (k, m)$ . Applying the UIT to the sub-networks resulted in seven of the thirteen networks being characterized as (not significantly different from) a BA network, each with its respective truncation point $m_{1}$ and each having an absolute z-statistic smaller than the critical value of 1.96 (Table 6). These sub-networks accounted for between 18% and 80% of the entire networks. Although removing the periphery nodes resulted in one sub-network that accounts for only 18% of its original network size, the actual size of that sub-network is still fairly large (4944 nodes).

Table 6.

z-statistic and ${\hat{β}}_{MLE}$ for real-world networks.

Network	${\hat{β}}_{MLE} \mid m_{1}$	$N \mid m_{1}$ (%)	$z$	$m_{1}$	Sub-network
Karate Club	1.874	22 (65)	1.524	3	BA
Office	2.001	32 (80)	0.947	8	BA
Dolphin Social Network	2.059	41 (66)	−1.212	4	BA
Les Miserables	2.023	41 (53)	−1.024	6	BA
Political Blogs	2.030	202 (17)	−2.233	54	Not BA
Facebook Social Circles	1.890	869 (22)	−0.3979	66	BA
High-Energy Physics Theory Collaborations	2.037	2053 (35)	0.9750	5	BA
Astrophysics Collaborations	1.987	2743 (17)	−5.678	28	Not BA
Internet	2.741	15123 (66)	−70.21	2	Not BA
High-Energy Physics Theory Citations	1.969	4944 (18)	−0.9927	41	BA
High-Energy Physics Phenomenology Citations	1.971	7007 (20)	3.253	36	Not BA
Condensed Matter Collaborations	2.002	3636 (9)	−3.700	21	Not BA
Google Webgraphs	1.993	31372 (4)	4.095	41	Not BA

BA: Barabási–Albert.

All p-values are <0.0001.

Bold indicates not significantly different from Barabási–Albert network based on $z_{best} | m$ .
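
The decision rule behind the last column of Table 6 is a two-sided comparison of each z-statistic with the critical value 1.96. A minimal sketch that applies it to the tabulated z-statistics is given below; the variable and function names are illustrative assumptions.

```python
Z_CRITICAL = 1.96  # two-sided critical value at the 0.05 level

# z-statistics transcribed from Table 6.
TABLE_6_Z = {
    "Karate Club": 1.524,
    "Office": 0.947,
    "Dolphin Social Network": -1.212,
    "Les Miserables": -1.024,
    "Political Blogs": -2.233,
    "Facebook Social Circles": -0.3979,
    "High-Energy Physics Theory Collaborations": 0.9750,
    "Astrophysics Collaborations": -5.678,
    "Internet": -70.21,
    "High-Energy Physics Theory Citations": -0.9927,
    "High-Energy Physics Phenomenology Citations": 3.253,
    "Condensed Matter Collaborations": -3.700,
    "Google Webgraphs": 4.095,
}


def classify(z, critical=Z_CRITICAL):
    """Label a sub-network BA when the UIT does not reject at the given level."""
    return "BA" if abs(z) < critical else "Not BA"


labels = {name: classify(z) for name, z in TABLE_6_Z.items()}
n_ba = sum(label == "BA" for label in labels.values())
```

Applied to these values, the rule labels seven sub-networks as BA, in agreement with the Sub-network column.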

As previously motivated in the introduction, there are scenarios where the study of smaller networks is of interest. Using Office, Les Miserables, and Dolphin as examples, since they are all fairly small networks with at most 77 nodes, we apply the same simulation procedure as we did on the Karate network with the networks’ respective parameters as shown in Table 7. Note that, although not shown here, we also found that the Dolphin network could have been characterized with $m_{1} = 3$ . However, the best MLE was obtained when $m_{1} = 4$ . The combined characterization and simulation on these three networks resulted in the goodness-of-fit measures listed in Table 8, which also repeats the Karate results. The table shows that the mean degree and density of all four networks were captured by the bootstrapped measures from their respective simulated networks. In addition, the clustering coefficients of the Karate and Dolphin networks were captured. Furthermore, empirical mean betweenness and mean closeness were captured for the Les Miserables network. Although mean closeness was not captured for three of the networks, it is important to note that the MSEs for the simulated networks are all of the order of 10⁻⁶ or smaller, indicating that the simulated measures are very close. The betweenness measure was not captured very well, which may be because our simulation method comprises two network generators that focus on scale-free and randomness properties. In order to capture the betweenness property, a network model that models that specific property may be required. Nevertheless, it is encouraging that a framework built upon network degree can capture many other network measures as well.

Table 7.

Office, Les Miserables, and Dolphin Social Network characterized parameters.

Network	$N$	$π_{BA}$	$m_{1}$	${\hat{p}}_{ER}$
Karate	34	0.65	3	0.0020
Office	40	0.80	8	0.0083
Dolphin	64	0.66	4	0.0009
Les Miserables	77	0.53	6	0.0014

Table 8.

Empirical measures and goodness-of-fit of simulated networks.

Empirical	Degree	Between	Closeness	Clustering coefficient	Density
Karate	4.5882	23.235	0.0129	0.2557	0.1390
Office	11.90	14.900	0.0147	0.4090	0.3051
Dolphins	5.1290	71.887	5.0367e−03	0.3088	0.0841
Les Miserables	6.5974	62.3636	5.1229e−03	0.4989	0.0868
MSE	Degree	Between	Closeness	Clustering coefficient	Density
Karate	0.2633	68.1823	3.4007e−06	0.0051	2.4181e−04
Office	0.4879	21.7197	1.6933e−06	0.0141	3.2079e−04
Dolphins	0.3582	182.50	7.8668e−07	0.0027	9.6271e−05
Les Miserables	0.1223	435.75	4.6309e−07	0.0287	2.1168e−05
Bias	Degree	Between	Closeness	Clustering coefficient	Density
Karate	−0.3549	7.4369	−1.6881e−03	0.0523	−0.0108
Office	−0.5052	4.4604	−1.2468e−03	0.1169	0.0130
Dolphins	0.507	−12.498	8.4056e−04	−0.0439	8.3109e−03
Les Miserables	0.1755	16.779	−5.4869e−04	−0.1652	2.3093e−03
Variance	Degree	Between	Closeness	Clustering coefficient	Density
Karate	0.1374	12.8755	5.5108e−07	0.0024	1.2616e−04
Office	0.2327	1.8248	1.3883e−07	4.3688e−04	1.5299e−04
Dolphins	0.1012	26.301	8.0146e−08	7.5269e−04	2.7199e−05
Les Miserables	0.0915	154.23	1.6204e−07	1.4174e−03	1.5835e−05

MSE: mean squared error.

Italicized bold indicates empirical network measure was captured by the 95% confidence bound of bootstrapped network measure.

In addition to the goodness-of-fit measures based on the distribution of simulated network measures, we also compared the degree distribution of each simulated network with the degree distribution of the empirical network using the Wasserstein test⁴¹ at $α = 0.05$ . The Wasserstein test has the null hypothesis $H_{0} :$ that the degree distributions are identical versus $H_{A} :$ that the degree distributions are not identical. We then record the number of instances in which $H_{0}$ was not rejected out of the 1000 simulations to determine the level of accuracy of the simulation method in generating a network with a statistically similar degree distribution. Table 9 shows that, with the exception of the Dolphin network at 70.5%, the simulation method was able to preserve the degree distribution fairly well, with accuracy ranging from 89% to 100%. However, recall that the Dolphin network can also be characterized using $m_{1} = 3$ , which may explain the lower accuracy. In summary, our framework for characterization and simulation was successful in capturing many features of the networks fairly well and provides insight for the analysis of these networks.

Table 9.

Percent of instances out of 1000 where the simulated degree distribution is similar to the empirical degree distribution based on the Wasserstein test.

Network	%
Karate	100
Office	89
Dolphin	70.5
Les Miserables	100
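
A two-sample comparison of degree distributions of this kind can be sketched as follows. The text does not state how the null distribution of the Wasserstein statistic was obtained, so the permutation scheme below is an assumption of this sketch, as are the function names; the distance itself is the standard one-dimensional 1-Wasserstein distance, computed as the area between the two empirical distribution functions.

```python
import random
from bisect import bisect_right


def wasserstein_1d(x, y):
    """1-Wasserstein distance between two samples (area between empirical CDFs)."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))
    dist = 0.0
    for left, right in zip(points, points[1:]):
        fx = bisect_right(xs, left) / len(xs)   # empirical CDF of x on [left, right)
        fy = bisect_right(ys, left) / len(ys)   # empirical CDF of y on [left, right)
        dist += abs(fx - fy) * (right - left)
    return dist


def permutation_pvalue(x, y, n_perm=999, seed=None):
    """Permutation p-value for H0: both samples share one distribution."""
    rng = random.Random(seed)
    observed = wasserstein_1d(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if wasserstein_1d(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Under this sketch, a simulated network would count toward the percentages in Table 9 whenever the p-value for its degree sequence against the empirical degree sequence is at least 0.05.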

6. Conclusion

The purpose of this paper is to establish a statistical framework to characterize and simulate a network using a mixture of network properties, and our semiparametric mixture model approach for characterizing and simulating a network shows promising results. Our simple approach of first characterizing the network and then simulating the network using the fitted parameters was able to capture the degree distribution and density of the empirical networks. For some networks, it was also able to capture other characteristics of the empirical networks, such as the clustering coefficient. Our approach also opens the possibility of creating larger mixtures of network properties by including other network models that are said to characterize other real-world networks, such as the Watts–Strogatz model, or of testing another network model as the basis of the characterization step. While useful in and of itself, the approach also provides a basis to generate test cases in the characterized class to be used in design of experiments and other operational analyses. Such mixture network models might be able to truly characterize any real-world network and are the focus of future research. It is now also possible to adopt the result obtained by Blaha et al.⁴² to visually display the real network using our approach, which might give more insight into the real network. This can be done by studying a collection of simulated characterized networks and using the methodology of Blaha et al.⁴² to study the most efficient way to visualize the network so that it provides as much insight as possible to the network analyst. Although the data sets used in this research were not focused specifically on military and defense applications, our approach and findings can be applied to such applications. For instance, the central actors identified in sub-networks are often of interest in SNA defense applications such as identifying key or leading figures in organizations of interest. Similarly, changes in degree distribution may indicate the loss or addition of devices on a cyber network. Therefore, providing analysts with these statistical tools to model and track such changes is paramount to the intelligence preparation of the battlespace and in the planning of military operations.

Footnotes

Appendix 1 Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Fairul Mohd-Zaid

Author biographies

Fairul Mohd-Zaid is a Mathematical Statistician at the United States Air Force Research Laboratory, Airman Systems Directorate. He has been conducting research in the areas of network analysis, statistical visualization, and topological data analysis with other research interests in multi-sensor image fusion and multivariate analysis.

Christine Schubert Kabban is a full Professor of Statistics at the Air Force Institute of Technology. She has been researching and practicing statistics for over 20 years in clinical, engineering, and statistical fields. Her current work focuses in applications to structural health monitoring, target detection, and autonomous systems and networks with hierarchical and complex multi-dimensional data.

Richard F Deckro is a Distinguished Professor of Operations Research in the Department of Operational Sciences and Director of the Future Operations Investigation Laboratory (FOIL) at the Air Force Institute of Technology. Professor Deckro’s research, teaching and consulting interests are in the areas of social network analysis; complex networks; military operations research; network models; and decision analysis. He is a Fellow of the Military Operations Research Society and is the sitting Vice Chair of the NATO Science and Technology Organization’s System Analysis & Studies (SAS) panel.

Wright Shamp is a Biostatistician at Johnson & Johnson Surgical Vision. His research interests include statistical process control, nonparametric density estimation, and Bayesian estimation.

References

1. Smith. The utility of force: the art of war in the modern world. New York: Vintage Books, 2008.

2. Joint Concept for Human Aspects of Military Operations (JC-HAMO). Washington, DC: Office of the Chairman of the Joint Chiefs of Staff, 2016, https://nsiteam.com/social/wp-content/uploads/2017/01/20161019-Joint-Concept-for-Human-Aspects-of-Military-Operations-Signed-by-VCJCS.pdf

3. Havig, McIntire, Geiselman, et al. Why social network analysis is important to Air Force applications. Proc SPIE 2012; 8389: 83891E.

4. Barabási, Albert. Emergence of scaling in random networks. Science 1999; 286: 509–512.

5. Newman. Power laws, Pareto distributions and Zipf’s law. Contemp Phys 2005; 46: 323–351.

6. Clauset, Shalizi, Newman. Power-law distributions in empirical data. SIAM Rev 2009; 51: 661–703.

7. Zhao, Yang, Zhang, et al. Emergence of scaling in human-interest dynamics. Sci Rep 2013; 3: 3472.

8. De Solla Price. Networks of scientific papers. Science 1965; 149: 510–515.

9. Erdös, Rényi. On random graphs I. Publ Math: Debrecen 1959; 6: 290–297.

10. Watts, Strogatz. Collective dynamics of “small-world” networks. Nature 1998; 393: 440–442.

11. Fienberg. A brief history of statistical models for network analysis and open challenges. J Comput Graph Stat 2012; 21: 825–839.

12. Broido, Clauset. Scale-free networks are rare, 2018, https://arxiv.org/abs/1801.03400

13. Conte, Foggia, Sansone, et al. Thirty years of graph matching in pattern recognition. Int J Pattern Recogn 2004; 18: 265–298.

14. Schubert Kabban, Mohd-Zaid, Deckro. Modern methods for characterization of social networks through network models. In: Scala, Howard (eds) Handbook of military and defense operations research. London: Chapman & Hall/CRC, 2020, pp. 171–192.

15. De Blasio, Seierstad, Aalen. Frailty effects in networks: comparison and identification of individual heterogeneity versus preferential attachment in evolving networks. J Roy Stat Soc C: App 2011; 60: 239–259.

16. Wasserman, Faust. Social network analysis: methods and applications. Cambridge: Cambridge University Press, 1994.

17. Castillo, Hadi. Fitting the generalized Pareto distribution to data. J Am Stat Assoc 1997; 92: 1609–1620.

18. Choulakian, Stephens. Goodness-of-fit tests for the generalized Pareto distribution. Technometrics 2001; 43: 478–484.

19. Pareto, Busino. Ecrits sur la courbe de la repartition de la richesse (Reunis et presentes par G Busino, originally published in 1896; Travaux De Droit, D’économie, De Sociologie et de Sciences Politiques). Genève: Droz, 1965.

20. Nowicki, Snijders TAB. Estimation and prediction for stochastic blockstructures. J Am Stat Assoc 2001; 96: 1077–1087.

21. Schmidt, Morup. Nonparametric Bayesian modeling of complex networks: an introduction. IEEE Signal Proc Mag 2013; 30: 110–128.

22. Shah. On estimating the parameter of a doubly truncated binomial distribution. J Am Stat Assoc 1966; 61: 259–263.

23. Casella, Berger. Statistical inference. Pacific Grove, CA: Duxbury Thomson Learning, 2002.

24. Mohd-Zaid. A statistical approach to characterize and detect degradation within the Barabási-Albert network. PhD Thesis, Air Force Institute of Technology, Wright-Patterson AFB, OH, 2016.

25. Csardi, Nepusz. The igraph software package for complex network research. InterJ Complex Syst 2006; 1695: 1–9.

26. Zhang, Small. Emergence of scaling and assortative mixing through altruism. Physica A 2011; 390: 2192–2197.

27. Bubeck, Mossel, Rácz. On the influence of the seed graph in the preferential attachment model. IEEE T Netw Sci Eng 2015; 2: 30–39.

28. Mohd-Zaid, Schubert Kabban, Deckro, et al. Parameter specification for the degree distribution of simulated Barabási–Albert graphs. Physica A 2017; 465: 141–152.

29. Zachary. An information flow model for conflict and fission in small groups. J Anthropol Res 1977; 33: 452–473.

30. Killworth, Bernard. Informant accuracy in social network data. Hum Organ 1976; 35: 269–286.

31. Lusseau, Schneider, Boisseau, et al. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol 2003; 54: 396–405.

32. Knuth. The Stanford GraphBase: a platform for combinatorial computing. New York: ACM, 1993.

33. Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery (LinkKDD’05), Chicago, Illinois, 21–25 August 2005, pp. 36–43. New York: ACM.

34. McAuley, Leskovec. Discovering social circles in ego networks, 2012, https://arxiv.org/abs/1210.8182?context=cs

35. Newman. The structure of scientific collaboration networks. Proc Natl Acad Sci U S A 2001; 98: 404–409.

36. Newman. Network data, 2013, http://www-personal.umich.edu/~mejn/netdata/

37. Gehrke, Ginsparg, Kleinberg. Overview of the 2003 KDD cup. ACM SIGKDD Explor Newslett 2003; 5: 149–151.

38. Leskovec, Kleinberg, Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery in data mining (KDD’05), Chicago, IL, 21–24 August 2005, pp. 177–187. New York: ACM.

39. Leskovec, Lang, Dasgupta, et al. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 2009; 6: 29–123.

40. Leskovec, Krevl. SNAP datasets: Stanford large network dataset collection, 2014, http://snap.stanford.edu/data

41. Ramdas, Trillos, Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 2017; 19: 47.

42. Blaha, Arendt, Mohd-Zaid. More bang for your research buck: toward recommender systems for visual analytics. In: Proceedings of the 5th workshop on beyond time and errors: novel evaluation methods for visualization (BELIV’14), Paris, 10 November 2014, pp. 126–133. New York: ACM.

Network characterization and simulation via mixed properties of the Barabási–Albert and Erdös–Rényi degree distribution
