Estimation of the Hyperexponential Density with Applications in Sensor Networks

Abstract

This paper solves the problem of estimation of the parameters of a hyperexponential density and presents a practical application of the solution in sensor networks. Two novel algorithms for estimating the parameters of the density are formulated. In the first algorithm, an objective function is constructed as a function of the unknown component means and an estimate of the cumulative distribution function (cdf) of the hyperexponential density. The component means are obtained by minimizing this objective function, using quasi-Newtonian techniques. The mixing probabilities are then computed using these known means and linear least squares analysis. In the second algorithm, an objective function of the unknown component means, mixing probabilities, and an estimate of the cdf is constructed. All the 2M parameters are computed by minimizing this objective function, using quasi-Newtonian techniques. The developed algorithms are also compared to the basic EM algorithm, and their relative advantages over the EM algorithm are discussed. The algorithms developed are computationally efficient and easily implemented, and hence, are suitable for low-power and sensor nodes with limited storage and computational capacity. In particular, we demonstrate how the structure of these algorithms may be exploited to be effectively utilized in practical situations, and are hence ideal for sensor networks.

Keywords

Hyperexponential Density Bursty Traffic Engineering Parameter Estimation Nonlinear Optimization Sensor networks Rainfall Prediction

1. Introduction

The advent of sensor networks has spurred research to develop new methods for distributed data collection and processing. Estimation of parameters of physical quantities from data collected over a geographic area is one such task. Mixture density is a powerful family of probability density functions and is useful in representing some physical quantities for such an application. The vast majority of the literature on the subject of mixture densities pertains to Gaussian mixture models (GMM). By comparison, mixtures of exponentials have received far less attention. Notwithstanding, mixtures of exponentials have significant applicability in a number of fields, including computer networks, biology, engineering, and medicine. The class of general mixtures of exponential distributions has relevance in linear systems theory and differential equations (Chauveau [1]). Cao et al [2] discuss modeling of rainfall and contract valuation with hyperexponential densities. Specifically, in the paper by Cao et al [2], the amount of rainfall in an area is modeled using a mixture of two exponential densities. A further exposition of generalized hyperexponential distribution functions may also be found in Botta et al [3]. The focus of this paper is on a subset of exponential probability densities, which will be henceforth referred to as hyperexponential densities. We develop two algorithms for estimating the parameters of hyperexponential densities. The strength of these algorithms lies in the fact that the methods presented require little computational power and that the computations can be decomposed efficiently. These features enable an effective solution that is appropriate for sensor networks as outlined in section I-A.

The probability distribution function (pdf) of an M-component hyperexponential density is

f (x) = \sum_{i = 1}^{M} π_{i} λ_{i} e^{- λ_{i} x},

(1)

where the mixing proportions are given by $π = [π_{1}, \dots, π_{M}]^{T}$ , and the means of each component are $[\frac{1}{λ_{1}}, \dots, \frac{1}{λ_{M}}]$ . Given N independent, identically distributed (iid) samples of data, the problem dealt with here is to estimate the parameters $λ = [λ_{1} \dots λ_{M}]^{T}$ and π of the hyperexponential density. In the following subsection, sensor networks are introduced and we demonstrate the applications of hyperexponential densities in a sensor network environment.

1.1. Sensor Networks

We now introduce applications in a sensor network environment, which can be modeled accurately by hyperexponential densities. Sensor networks (Akyildiz et al [4] and Rentala et al [5]) consist of many sensor nodes arranged in either an ordered or random fashion. Figure 1 shows such a sensor network. Data is transmitted via a multi-hop approach from the sensor nodes to special nodes called base stations. Sensor nodes are ideal for sensing environmental changes, especially in hostile or remote locations or over large geographical areas. One important category of environmental observation is rainfall observation in forestry and agriculture. The relevance of the statistical properties of rainfall is expounded in Hansen and Ines [6]. In particular, the quantity of rainfall in an area is decidedly important due to its effect on processes in nature, such as, solute leaching, soil erosion, and crop water stress response. Long-tailed distributions as discussed in the previous sections are useful for modeling the amount of rainfall in an area. Moreover, the amount of rainfall is commonly modeled as a two-component hyperexponential density with pdf

f (x) = \frac{α}{β_{1}} e^{- \frac{x}{β_{1}}} + \frac{1 - α}{β_{2}} e^{- \frac{x}{β_{2}}},

(2)

where α, β₁ and β₂ are estimated parameters ([2] and [6]).

Figure 1

Sensor network.

Sensor nodes fitted with soil moisture sensors are ideal for measuring rainfall and collecting data that would allow for statistical modeling of rainfall. In large, remote, forested areas or large agricultural fields or both, sensor networks would be valuable for collecting rainfall data. Typically, many sensor nodes are scattered or positioned over a large area and these nodes function until their batteries are expended. Due to the limited battery supply in each sensor node, transmissions by sensor nodes should be kept to a minimum. Furthermore, each sensor node has a relatively meager amount of memory (usually about 4K) and processing ability. Therefore, it is also desirable to reduce the complexity of calculations performed by a sensor node.

In the following sections, we develop algorithms for estimating the parameters of a hyperexponential density, which can also be used to obtain the parameters of equation 2 in our rainfall example. The algorithms are also designed so as to work well in a sensor network environment with energy and computational power constraints. That is to say, the data communication and computational power requirements of the sensor nodes are very minimal. In the following subsection, additional motivation for studying hyperexponential densities is presented, as we look at applications in modeling long-tailed distributions.

1.2. Long-tailed Distributions

In recent years, the volume of traffic on large-scaled networks and across the Internet has increased tremendously, necessitating serviceable statistical models for traffic flow and analysis. Statistical models are required for the design of underlying IP-based transport layer protocols, efficient data gathering and evaluation, and statistical analysis of corresponding random processes and variables (Markovitch and Krieger [7]). The work of Leland et al [8] demonstrates that both LANs and WANs exhibit long range dependencies, and hence, classical Poisson-based methods are insufficient for such large-scale networks. Typical Poisson methods predict early fluctuations in network traffic and hinge on the assumption that these anomalies will smooth out over a long period of time. In reality, these fluctuations occur over a wide range of time scales generating high variability and self-similar behavior (Schwefel [9]). The notion of self-similarity and fractal geometry was introduced by Mandelbrot [10] to describe naturally occurring processes using mathematical formulae. Self-similar processes are structurally similar over many different time scales. This phenomenon leads to long range dependencies in network traffic. Indeed, the concept of self-similarity and long range dependencies go hand-in-hand.

Ostensibly, an alternative to classical Poisson methods is sought. To this end, long-tailed distributions have shown much success. A distribution is a long-tailed distribution (also referred to as a heavy-tailed distribution) if its complementary cumulative distribution function (ccdf), F^c, decays slower than exponentially, i.e. if

lim_{x \to \infty} e^{α x} F^{c} (x) = \infty,

(3)

for all α > 0 (Feldmann and Whitt [11]). The ccdf is defined as the complement of the cumulative distribution function (cdf), as follows

F^{c} (x) = 1 - F (x),

(4)

where F(x) is the cdf. Conversely, a distribution is a short-tailed distribution if its ccdf decays exponentially, i.e. if

lim_{x \to \infty} e^{α x} F^{c} (x) = 0.

(5)

for some α > 0. Of note is a special case of long-tailed distributions called the power-tail distribution. A distribution is said to be power-tail distribution if

F^{c} (x) \sim α x^{- β} as x \to \infty,

(6)

where α and β are positive constants and the operator ∼ is defined such that f(x) ∼ g(x) implies that

lim_{x \to \infty} \frac{f (x)}{g (x)} = 1.

(7)

Two common examples of long-tailed distributions that are used widely in network performance analysis are the Pareto and Weibull distributions. In addition, the Pareto distribution is a power-tailed distribution; however, the Weibull distribution is not.

Recent studies have demonstrated that long-tailed distributions model many network characteristics aptly. For example, long-tailed distributions have been valuable in modeling World Wide Web (WWW) traffic, since the said traffic often originates from heterogeneous sources [7]. Long-tailed distributions have also been shown to be successful in modeling file transfer protocol (FTP) connections and intervals between connection requests [11]. As effective as the long-tailed distribution has been in capturing the statistical characteristics of large scale networks, there are some deficiencies of this model. Most notably, long-tailed distributions are generally very difficult to analyze. An example is the task of analyzing the performance figures of the basic M/G/1 queue which becomes quite involved if the service-time distribution is Pareto. Furthermore, unlike many short-tailed distributions, expressions for the Laplace transforms of long-tailed distributions are quite complex. Laplace transforms are generally useful for analyzing the distributions by numerical transform inversion. Standard non-parametric estimators such as the histogram, projections, and kernel estimators are not suitable for long-tailed distributions and often exhibit poor estimates or “spikes” in the tail region. It is well-known that kernel estimators suffer from spurious noise appearing in the tail of the estimator [12]. Markovitch and Krieger [7] have explored some non-parametric procedures for estimating and approximating long-tailed distributions. Two of these procedures include transformation functions that map a long-tailed density into a pdf with compact support, and polygrams. Polygrams are histograms with variable bin widths.

The aforementioned problems associated with long-tailed distributions do not appear in short-tailed distributions. Accordingly, a natural proposal for alleviating the problems of long-tailed distributions is to approximate the said long-tailed distributions with corresponding short-tailed distributions. This is justifiable by the fact that although long-tailed distributions do not have compact support, for practical purposes there is only a finite interval for which the distribution in question takes on meaningful values. Therefore, one approach for approximating long-tailed distributions is to effectively truncate the long-tailed distribution, confining it to a finite interval. Intuitively, this method encompasses the essential features of the long-tailed distribution. In spite of this, the truncation approach may not be a suitable technique since the resulting function has discontinuities for values in the real, positive domain. An alternative to approximating the long-tailed distributions by truncating is to use hyperexponential densities. Hyperexponential densities have exhibited much success in approximating long-tailed distributions and for constructing network performance models ([11] and [7]). Moreover, the analysis of hyperexponential densities is tractable and the Laplace transforms are simple expressions.

1.3. Organization

Two computationally efficient, tractable, and easily implemented algorithm for estimating the parameters of a hyperexponential density are developed and tested. The first algorithm — hereafter, referred to as Algorithm 1 — is a two-step procedure. The first step involves estimating the component rates γ of the hyperexponential distribution. The approach here is to develop equations that express π in terms of γ. These expressions are then substituted into an objective function which is a function of γ and an estimate for the cdf of the hyperexponential distribution. Minimizing this objective function yields the required estimates of λ. The second step estimates the steady state probabilities π given that the component means are known by making use of linear least squares analysis. The second algorithm — Algorithm 2 — estimates π and λ in one phase by minimizing an objective function that is a function of π, λ and an estimate for the cdf of the hyperexponential distribution. The relative merits of each algorithm are also compared and discussed.

Our algorithms are also compared to maximum-likelihood techniques and the expectation-maximization (EM) algorithm (Dempster et al [13]), in particular. Finally, we demonstrate how our algorithms are adapted to work well in a sensor network environment.

2. Algorithm 1

Most ML techniques for computing the parameters of hyperexponential densities intrinsically require estimation of 2M parameters. Likewise, any optimization procedure that uses or evaluates the hyperexponential density directly, also involves the estimation of 2M parameters. In this section, equations expressing π in terms of λ and functions of the samples of data are derived, thus reducing the number of unknown parameters to M. Making use of these equations, an objective function is formulated and the values of λ are calculated by minimizing this objective function. It is assumed that each mixing proportion is strictly positive, ensuring that each component plays a role in influencing the resulting pdf. Algorithm 1 essentially reduces the number of unknown parameters by incorporating additional information from the structure of the mixture pdf and from samples of data.

2.1. Expressions for π

Nonlinear transforms are employed in order to reduce the number of unknown parameters. Specifically, the Laplace transform of the exponential density is exploited. Define $α = [α_{1} \dots . . α_{M}]^{T}$ where each α_i is a distinct, real, positive value. The Laplace transform of the pdf in equation (1) scaled by a factor of $\frac{1}{α_{i}}$ is

E [{\frac{1}{α_{i}} e}^{- \frac{x}{α_{i}}}] = \frac{1}{α_{i}} \sum_{j = 1}^{M} π_{j} \int_{0}^{\infty} e^{- \frac{x}{α_{i}}} λ_{i} e^{- λ_{i} x} d x

(8)

= \sum_{j = 1}^{M} π_{j} \frac{1}{\frac{1}{λ_{j}} + α_{i}}

(9)

Let A be the matrix with elements $\frac{1}{\frac{1}{x_{j}} + α_{i}}$ at row i and column j. Also, let $α = [α_{1} \dots ., α_{M}]^{T}$ be the vector of expectations where $α_{i} = E [\frac{1}{α_{i}} e^{- \frac{x}{α_{i}}}]$ . Therefore, in matrix notation, equation (9) is

A π = a and

(10)

π = A^{- 1} a,

(11)

provided that A is nonsingular. Matrix A is a Cauchy matrix and hence, has certain nice properties (Boras [14]). For instance, the inverse of a Cauchy matrix can be represented as an explicit expression (Knuth [15]) and thus, B = A⁻¹ is an M × M matrix with elements

b_{i j} = \frac{\prod_{1 \leq k \leq M} (\frac{1}{λ_{j}} + α_{k}) (\frac{1}{λ_{k}} + α_{i})}{(\frac{1}{λ_{j}} + α_{i}) (\prod_{\binom{1 \leq k \leq M}{k \neq j}} (\frac{1}{λ_{j}} - \frac{1}{λ_{k}})) (\prod_{\binom{1 \leq k \leq M}{k \neq i}} (α_{i} - α_{k}))},

(12)

at row i and column j and where λ₁ ≠ λ₂ ≠ … ≠ λ_M ≠ 0 and α₁ ≠ α₂≠ … ≠ α_M. Clearly, each λ_i is distinct by the assumption that the mixing proportions for each component are all non-zero, and each α_i is distinct by assumption. Hence, A is clearly invertible. From equations (11) and (12) the steady state probabilities may be expressed as

π_{i} (λ) = \sum_{j = 1}^{M} b_{i j} a_{j} .

(13)

This gives a means of computing π if the component means and expectations in equation (9) are provided.

2.2. Determination of λ

The expressions for the mixing probabilities developed in the previous section afford a means of reducing the number of unknown variables to just M. In this section, an algorithm to obtain this M component mean vector is formulated by fitting a candidate cdf to the given exact cdf. The approach iteratively attempts to improve the fit of a real or estimated cdf to that of a cdf constructed from successive approximations of the parameter set. The cdf of a hyperexponential density is

F (d) = 1 - \sum_{k = 1}^{M} π_{k} e^{- λ_{k} x} .

(14)

Let $\tilde{λ} = [{\tilde{λ}}_{1}, \dots, {\tilde{λ}}_{M}]^{T}$ and $\tilde{π} (\tilde{λ}) = [{\tilde{π}}_{1} (\tilde{λ}), \dots, {\tilde{π}}_{M} (\tilde{λ})]^{T}$ be the current approximations for λ and π, respectively. Hence, an approximate or candidate cdf is given as

\tilde{F} (x, \tilde{λ}) = 1 - \sum_{k = 1}^{M} {\tilde{π}}_{k} (\tilde{λ}) e^{- {\tilde{λ}}_{k} x} .

(15)

Notice that the only unknown in this candidate cdf is $\tilde{λ}$ . The error of the fit of this candidate cdf at a single point x is defined to be

(F (x) - \tilde{F} (x, \tilde{λ}))^{2} .

(16)

This result can be extended over the entire domain of x — i.e. the set of all real numbers greater than or equal to zero (R⁺) — to give the total error by integrating equation (16) for all x ∊ R⁺. Unfortunately, this integral does not furnish a simple, tractable expression for the total error. Thus in practice, the integral would have to be numerically evaluated which is a somewhat expensive task. Realize that for practical purposes, the entire domain of R⁺ need not be considered. A viable approximation of the total error may be obtained by computing the error at a finite number of arbitrary points, [x₁, …, x_m] over the region of interest and summing up the error as follows

d (\tilde{λ}) = \sum_{k = 1}^{m} (F (x_{k}) - \tilde{F} (x_{k}, \tilde{λ}))^{2} .

(17)

Observe that $d (\tilde{λ}) \geq 0$ for all ${\tilde{λ}}_{i} > 0$ and 1  i  M. Therefore, $d (\tilde{λ})$ is bounded from below and has a global minimizer (Borwein and Lewis [16]). Moreover, this minimum is known to be zero and for an ideal candidate cdf, $d (\tilde{λ}) = 0$ . Obviously a better approximation of the total error is obtained if m is made large. The component means are obtained by minimizing (17) with respect to $\tilde{λ}$ . However, the objective function in question is not a convex function of $\tilde{λ}$ in general. Therefore, the Newtonian methods for minimization cannot be directly applied. Nonetheless, the problem of finding $\tilde{λ}$ is posed as a constrained nonlinear optimization problem. The first set of constraints ensure that λ_k > 0 for all 1  k  M. The second set are as a result of ensuring that π are valid, non-zero probabilities, i.e.

\sum_{k = 1}^{M} π_{k} (\tilde{λ}) = 1 and

(18)

0 < π_{k} (\tilde{λ}) < 1 for all 1 \leq k \leq M .

(19)

2.3. Estimation of λ from Statistical Data

In order to minimize the objective function developed, there are several quantities that need to be estimated from the samples of data. In this section, estimators for these quantities are devised. Given n samples of data w = {w₁… .,w_n}, and assumed distinct constants α_i, an estimator for a_i is

a_{i} = \frac{1}{n} \sum_{k = 1}^{n} e^{- α_{i} w_{k}} .

(20)

Define â to be the estimate of a from the samples of data.

The equations derived previously implicitly assume that the cdf F(x) is exact and available. Of course this is not the case, and an estimate for the cdf must be obtained from the samples of data. The cdf is a good choice for function-fitting and estimation in general for the hyperexponential density for a number of reasons. First, the cdf is a smooth, monotonically increasing function. Second, the cdf can be estimated easily and accurately from a finite number of samples of data. Third, given the chosen implementation for the cdf, a lookup for a value runs in O(log n) time which is quite fast. A piecewise continuous estimate of the cdf is obtained as follows. Sort the observations of data {w₁, …,w_n} to produce the values {y₁, …, y_n} such that y₁  y₂  …  y_n and {y₁, …, y_n} is a permutation of {x₁, …, x_n}. Hence, the estimate of the cdf is defined as,

\hat{F} (x) = {\begin{cases} 0, & x < y_{1} \\ \frac{i - 1}{n - 1}, & x = y_{i} \\ \frac{i - 1}{n - 1} + \frac{x - y_{i}}{(y_{i} + 1 - y_{i}) (n - 1)}, & y_{i} < x < y_{i + 1} \\ 1, & x > y_{n} . \end{cases}

(21)

Define $\tilde{λ}$ as the estimates of λ and substitute equation (21) into (17) giving the new objective function

\hat{d} (\hat{λ}) = \sum_{k = 1}^{m} (\hat{F} (z_{k}) - \hat{F} (z_{k}, \hat{λ}))^{2} .

(22)

Minimizing this objective function produces the estimates $\tilde{λ}$ .

2.4. Minimization Procedure

The current minimization problem involves finding the minimum of a nonlinear objective function subject to nonlinear constraints, and so the problem is a nonlinear programming task. Sequential quadratic programming methods are generally accepted as being the best nonlinear programming techniques (Palambros and Wilde [17]) and thus, are our chosen methods for minimizing (22). An abridged overview of sequential quadratic programming (SQP) methods is presented here. A further account on the details of SQP procedures and their derivations may be found in [17], Nocedal and Wright [18] and Fletcher [19]. Let {x₁, x₂, …} be a sequence of iterates of the minimization procedure which approach a solution x∗ that is a minimum. SQP algorithms function by approximating each iterate x_k by a quadratic programming subproblem. The next iterate is computed by minimizing this subproblem. Quadratic programming involves minimizing objective functions that are quadratic multivariate functions with linear constraints. Inequality constraints posed as nonlinear functions need to be linearized and approximated. The main difficulty is to formulate the quadratic subproblems in such a manner that good steps are made for each iterate and the overall SQP algorithm has good convergence properties and is efficient.

For the process of finding stationary points of a multivariate objective function subject to one or more constraints, it is common to use Lagrange multipliers [20]. Denote the vector of Lagrange multipliers as ξ. For convenience, the stationarity conditions are usually expressed in terms of the Lagrangian function which is defined in the most general case, as

L (x, ξ) = f (x) + ξ^{T} g (x) .

(23)

where g(x) represents the constraints. The Lagrangian function contains the first-order and second-order conditions for a point to be a local minimizer of the objective function subject to the given constraints. In our chosen implementation of an SQP algorithm, the Hessian of the Lagrangian function needs to be updated with each iteration. The chosen method for doing so is the BFGS update method. The next major step involves solving the quadratic problem for each iteration. To do so, an active set strategy [17] is employed. An active set strategy activates and deactivates constraints during each iteration as the algorithm progresses towards the minimum. The third and final major step is to form a new iterate

x_{k + 1} = x_{k} + β_{k} s_{k} .

(24)

where s_k is the search direction vector. The step size β_k is chosen by minimizing an appropriate function called the merit function. The merit function measures how appropriate and feasible each x_k and ξ_k are, and the function should have a local minimum at the solution. In our chosen implementation for minimization, we have chosen to enforce constraints for the probabilities in the second phase of computing the probabilities by the linear least squares analysis. In the first stage, $\hat{λ}$ are determined by unconstrained minimization.

2.5. Estimation of π from Statistical Data

In the preceding section, a method was developed for determining estimates of λ given an estimate for the cdf. A naive approach to attaining values for π is to exploit equation (13). However, due to the nature of the hyperexponential density and the equations involved, the values for the vector a cannot be estimated accurately enough from data for computing π. The values attained for π from equation (13) are very sensitive to the values derived for $\tilde{λ}$ . This means that small inaccuracies in $\tilde{λ}$ translate into large errors in π. This is mainly due to the inability of â to estimate a with sufficient accuracy. In contrast, the estimate for the cdf $\hat{F} (x)$ from data provides a very accurate representation of the exact cdf. This is yet another advantage of using the estimate of the cdf for function-fitting.

The cdf of the hyperexponential density is linear in π. In addition, the estimated cdf can be expressed in terms of the estimates for $\hat{λ}$ and the unknown parameters π as follows,

\sum_{k = 1}^{M} {\tilde{π}}_{k} e^{- {\hat{λ}}_{k} x} = 1 - \hat{F} (x) .

(25)

for all 0 < x < ∞. Let z = {z₁, …, Z_S} be an arbitrary set of positive, real constants such that

inf_{1 \leq i \leq S} z_{i} = inf_{1 \leq j \leq M} w_{j} and

(26)

sup_{1 \leq i \leq S} z_{i} = sup_{1 \leq j \leq M} w_{j} .

(27)

Define F(z) = [F(z₁),…,F(ZS)]^T and $\hat{π} = [{\hat{π}}_{1}, \dots, {\hat{π}}_{M}]^{T}$ . Let $\hat{C}$ be the S × M matrix with ${\hat{C}}_{i j} = e^{- λ_{j} z_{i}}$ being the element at the i^th row and j^th column. From these definitions, equation (25) can be written using matrix notation as

C \hat{π} = 1 - F (z)

(28)

leading to the following theorem.

Theorem 1. Equation (28) has a unique solution for the mixing proportions, given that the component means are known.

Proof. Equation (28) can be solved using linear least squares regression analysis, and is also a convex function (Boyd [21]). Therefore, this function has a global minimizer and has a unique solution for π.

From this theorem, $\tilde{π}$ is obtained by solving equation (28) using linear least squares regression analysis.

2.6. Summary of Algorithm 1

The following presents a summary of Algorithm 1, for obtaining estimates of the parameters for the hyperexponential density, given n samples of data.

Choose values for α such that each α_i is distinct and positive.

Obtain an initial estimate for λ.

Compute â using a and the samples of data.

Minimize $\hat{d} \hat{λ}$ to obtain a new estimate for $\hat{λ}$ .

Using the new estimate of $\hat{λ}$ , compute C and using linear least squares regression, obtain an estimate for π.

3. Algorithm 2

The second algorithm developed is similar to Algorithm 1 of the previous section. The essential difference in Algorithm 2 is that 2M parameters are estimated from the developed objective function. In the following subsection, the required objective function is developed.

3.1. Development of the Objective Function

As in the previous section, an approximate cdf is constructed as follows

\tilde{F} (x, \tilde{λ}, \tilde{π}) = 1 - \sum_{k = 1}^{M} {\tilde{π}}_{k} e^{- {\tilde{λ}}_{k} x} .

(29)

Notice that this cdf has three arguments as opposed to two, and that π is no longer described as a function of $\tilde{λ}$ . Following the procedure of the previous section, the error of fit of this candidate function is expressed as

(F (x) - \tilde{F} (x, \tilde{λ}, \tilde{π}))^{2} .

(30)

Similarly, the approximate total error of the fit is denoted by

d (\tilde{λ}, \tilde{π}) = \sum_{k = 1}^{m} (F (x_{k}) - \tilde{F} (x_{k}, \tilde{λ}, \tilde{π}))^{2} d x,

(31)

where m  M. The function $d (\hat{λ}, \tilde{π})$ has similar properties as $d (\tilde{λ})$ , in that $d (\tilde{λ}, \tilde{π}) \geq 0$ for all ${\hat{λ}}_{i} > 0$ and 1  i  M. Hence, $d (\tilde{λ}, \tilde{π})$ is also bounded from below and has a global minimizer.

The required parameters are obtained by minimizing the objective function in equation (31). As in the previous section, this is a constrained nonlinear optimization problem. The constraints are the same, i.e. that the component means are positive and that the mixing proportions are valid probabilities. The latter constraint can, however, be relaxed by introducing the softmax function (Bishop [20]). Let γ = {γ₁, …, γ_M} be a vector of real constants, and hence,

π_{i} (γ) = \frac{e^{γ_{i}}}{\sum_{j = 1}^{M} e^{γ_{j}}}

(32)

for all 1  i  M. Fix one of γ, γ_M (say) to be 0, which ensures that the transformation is one-to-one. Using this definition for π_i ensures that the constraints of the mixing proportions are met. Thus, the only applicable constraint is that λ_i > 0, for all 1  i  M. However, since the hyperexponential cdf is strictly monotonic, this constraint can effectively be disregarded, reducing the problem to a nonlinear unconstrained optimization problem. Let $γ = {γ_{1}, \dots, γ_{M}}$ be the estimates of γ and introduce the new objective function defined as

d_{u} (\hat{λ}, \hat{γ}) = \sum_{k = 1}^{m} (F (x_{k}) - F (x_{k}, \tilde{λ}, \tilde{π} (\hat{γ})))^{2} d x,

(33)

It is worth noticing that (33) is not a convex function of $\hat{γ}$ and $\hat{λ}$ , so the Newtonian method for optimization cannot be applied. Instead, the minimization should be performed using either quasi-Newtonian methods with the BFGS update method and inexact line searches, or using the Levenberg-Marquardt method for nonlinear regression analysis.

For Algorithm 2, the only quantity that needs to be estimated is the cdf. The approach for estimating the cdf is the same in both algorithms 1 and 2. In the next subsection, a summary of the algorithm is presented.

3.2. Summary of Algorithm 2

The following presents a summary of of Algorithm 2. Estimates of the parameters of the hyperexponential density are obtained, given n samples of data.

Obtain an initial estimate for $\hat{λ}$ and $\hat{π}$ .

Minimize $d_{u} (\hat{λ}, \hat{π})$ to obtain new estimates for $\hat{λ}$ , and $\hat{π}$ .

4. Discussion

In this section, we discuss the expediency of the algorithms developed to sensor networks. An account of the simulation experiments executed to demonstrate the algorithms is also presented, along with a discussion of the results of these experiments.

4.1. Applicability of Algorithms to Sensor Networks

The primary strength of Algorithms 1 and 2 is that they are well-suited to sensor networks. In order to illustrate how these algorithms are applicable to sensor networks, we return to the rainfall example in section I-A. Each sensor node collects rainfall data from its immediate environment. Hansen and lnes [6] discuss the difficulties in measuring rainfall. Taking soil moisture measurements, for example, is one suggested approach for determining the amount of rainfall. However, the exact details of measuring rainfall with sensor networks are beyond the scope of this paper. The individual sensor nodes do not possess the computing power to perform the complete parameter estimation procedure. In addition, transmission of all the data values collected at all sensor nodes to the base station is impractical, as the large number of transmissions needed would quickly drain the batteries of the sensor nodes. Therefore, an algorithm such as the EM algorithm which requires all the data values would not be applicable in this scenario. Nowak [22] introduces a distribued EM algorithm approach for performing density estimation in sensor networks. In [22], it is assumed that the local processing is much less expensive than communication. We do not compare explicitly our algorithms with those in [22], since Nowak performs the density estimation of Gaussian mixtures. However, the following observations are made. In our algorithms, there is very little processing to be done at individual sensor nodes, and the number of communications for each sensor node is on par with that presented in [22]. Hence, it is expected that our algorithms will use less battery power than that of [22].

Recall that in both Algorithms 1 and 2, an estimate of the cdf of the hyperexponential density, $\hat{F} (x)$ is required. However, the domain of $\hat{F} (x)$ is a set of fixed, pre-determined values of x. Computing an estimate of $\hat{F} (x)$ as in equation (21) would require each node to store all data points collected, which is infeasible. Instead, the estimate of the cdf is calculated as follows. Given n observations of data {w₁ … .w_n}, the empirical cdf is defined as

F_{n} (x) = \frac{1}{n} S (x)

(34)

where

S (x) = \sum_{i = 1}^{n} u (x - w_{i}) .

(35)

and u is the unit step function defined as

u (y) = {\begin{cases} 0 & if y < 0 \\ 1 & if y \geq 0 \end{cases}

(36)

Only the values of F_n (x) for a fixed domain of x are required, so let the required values of x be {x₁, …, x_m}. Hence, the necessary values that need to be recorded and updated by each sensor node are n and S(x_j) for each 1  j  m. For a specific sensor node k, denote S_k to be the value of S for node k, and similarly n_k denotes the value of n for node k. Each sensor node k collects data and increments its own value of n_k while updating S_k. At predefined epochs, for instance when n_k goes beyond a certain threshold value, the recorded values of S_k are transmitted to the base station. With this approach, each sensor node is required to record at most m+1 values. Updating and recording these values require very little computational power.

Upon transmitting the values of S_k(x_j) to the base station, the sensor node resets the values of S_k(x_j) for all j to zero and resets n_k to zero. Each sensor node transmits m values of S_k(x_j), but some values of S_k(x_j) may change more often and by greater amounts than others. For instance, after a period of collecting rainfall observations we might find that S_k(x₁) is large while S_k(x₂) and S_k(x₃) are still zero. Sending the values of just S_k(x₁) would reduce the size of the transmission for node k significantly. Therefore, we propose a format for the transmission packet as in Fig. 2 . The bit vector at the head of the packet identifies which of the values of x are being sent in the packet. For instance, if bits number 1, 2, and 5 are set in the bit vector in the transmission from sensor node k, and all other bits are unset, then there are 3 data values to follow in the packet. The data values are in consecutive order and are correspondingly S_k(x₁), S_k(x₂) and S_k(x₅). The proposed approach ensures that only values that are non-zero will be transmitted, thus minimizing the size of the transmissions.

Figure 2

Data packet format for transmission from sensor nodes.

The values of S_k(x_j) for each sensor node k and all 1  j  m, are transmitted to the base station along with the corresponding values of n_k. Let ${\hat{F}}^{n e w} (x_{j})$ and ${\hat{F}}^{o l d} (x_{j})$ be the new and current estimates of the cdf, respectively. Similarly, let n^new and n^old be the new and current values of n. Assuming that the base station receives values from r sensor nodes the estimate of the cdf is updated as follows,

{\hat{F}}^{n e w} (x_{j}) = \frac{n^{o l d}}{n^{n e w}} {\hat{F}}^{o l d} + \frac{1}{n^{n e w}} \sum_{k = 1}^{r} S_{k} (x_{j}),

(37)

and where

n^{n e w} = n^{o l d} + \sum_{k = 1}^{r} n_{k} .

(38)

If a value of S_k(x_j) is not received from a sensor node, assume that it has a value of zero in equation (37).

Equation (37) is used instead of equation (21) and thus, provides the necessary estimate of the cdf in the objective functions of Algorithms 1 and 2. Once an estimate of the cdf is obtained, the base station then performs the appropriate minimization procedure to estimate the parameters of the hyperexponential density. The problem presented here of estimating the amount of rainfall in an area is ideal for sensor networks, and the approach to solving this problem is well-suited to sensor networks. Each sensor node requires only a small amount of computational power and memory, while keeping the number of network transmissions to a minimum.

4.2. Error Analysis of Empirical cdf

The empirical cdf is commonly used as an estimator for the cdf in many applications including bootstrapping techniques and various EDF statistics such as the Cramer-von Mises and Anderson-Darling tests [23]. The virtues of the empirical cdf as an estimator stem from the fact that the empirical cdf converges uniformly to the actual cdf by the Gilvenko-Cantelli theorem [24]. Further properties of the empirical cdf are discussed extensively in [23] and [24]. The accuracy of the empirical cdf hinges on obtaining data points that range between the tails of the cdf. Extreme values which fall in the tails of the cdf will not produce good results as is the case with other estimators.

The sensor network implementation of our estimator requires the data to be discretized into several bins. The number of bins can be increased for increased accuracy. However, the final bin represents all data falling in the range [x_max, ∞). We can derive an upper bound for x_max to ensure that the probability of a random sample falling in [x_max, ∞) is below a threshold, say 1% or 0.01. Let β_max be the largest mean of all the exponential components of the mixture density. The highest probability of samples having a very large value is obtained when all other exponential components of the mixture occur with zero mixing probabilities. Therefore, we have

P [X > x_{max}] < exp (- \frac{x_{max}}{β_{max}}) .

(39)

If we prefer this probability to be less than 0.01, we have

P [X > x_{max}] < exp (- \frac{x_{max}}{β_{max}}) < 0.01.

(40)

or, equivalently, x_max = 4.7β_max. Therefore, a good guideline is that the largest threshold in the specification of intervals for the sensor network implementation, should be about 5 times the largest anticipated mean value of the exponential probability density components of the mixture.

The Gilvenko-Cantelli theorem states that if the data samples are truly iid and the size of the data sample is sufficiently large, a good estimate of the cdf will be obtained. Since the empirical cdf is not a point estimator, deriving the standard error of the estimate is not trivial. The question of how large a sample is required for a good estimate can be answered by the Dvoretzky, Kiefer, Wolfowitz inequality [23], which is stated as follows. If X₁, …, X_n are iid real-valued random variables with cdf F, and F_n is the empirical cdf, then for any d > 0 and any positive integer n.

P [d_{K} (F_{n}, F) > d] \leq C exp (- 2 n d^{2}),

(41)

where d_K is the well-known Kolmogorov-Smirnov statistic, and C is a universal constant. It can be shown that C = 2 is a good choice [23]. For illustrative purposes, if we restricted the probability of having a discrepancy of no more than 0.01 at a single point between F_n and F to be less than or equal to 0.01, then we would require approximately 200 data samples or more.

4.3. Summary of Simulation Experiments and Results

The algorithms discussed in the previous section were implemented and tested through simulation using different values for M, λ, and π. A subset of the simulation trials is discussed here. Both algorithms were tested on hyperexponential density using synthetically generated iid mixture samples. In addition, the results of both algorithms were compared to the basic implementation of the EM algorithm. Evaluation of the objective function d(λ) was performed by evaluating the approximate total error for equally spaced points over a meaningful domain. In Algorithm 1, the first phase of estimating λ is accomplished through the use of quasi-Newtonian methods with the BFGS method for the Hessian update and a safeguarded mixed quadratic and cubic polynomial interpolation and extrapolation method for line scarches. The implementation chosen is the fminunc MATLAB function [25]. For the second phase of Algorithm 1, constrained linear regression was performed using SQP methods and specifically the lsqlin MATLAB function. Algorithm 2 was implemented using quasi-Newtonian methods with the BFGS update formula, and a safeguarded mixed quadratic and cubic polynomial interpolation and extrapolation method for line searches. Once again, the fminunc MATLAB function [25] was employed.

The following is an outline of the simulation strategy utilized. In the first class of experiments, 50 experiments were performed using simulation data generated from a four component hyperexponential density having the following characteristics: λ = [1.0, 2.0, 3.0, 4.0] and P = [0.28, 0.14, 0.38, 0.2]. Also, for each of the 50 experiments, the same data samples were applied, but each experiment was executed using a random initial guess for λ from the range (0.5, 4.5). A total of 100 data samples was chosen for each experiment and each of the three algorithms was executed on these data samples for each 1 experiment. The initial values for P were chosen to be $\frac{1}{M}$ , where applicable. For the second class of experiments, 50 experiments were performed repeating all the steps and characteristics of the first class of experiments, except that for each of the 50 experiments different synthetic data was generated. Once again, the values for λ were chosen randomly from the range (0, 10). Finally, for the fourth and the final class of experiments, the same steps were followed as in the third class, however, this time for a six component hyperexponential density. The values for each of the six rates were chosen randomly in the range (0, 10). A total of 200 experiments were performed for the simulation exercise.

Sample results of the three algorithms are given in Figs. 3, 4, and 5. Each figure compares the generated pdf of the corresponding algorithm to the actual pdf. Each of the algorithms terminated in 1–10 iterations. From the simulation results, all of the three algorithms appear to produce similar results. In most cases, Algorithm 1 and the EM algorithm give similar results, and Algorithm 2 gives slightly better results in some cases. There is no simulation evidence to suggest that there are operating regions in which one algorithm is superior to the other, given the current approaches taken for optimization. Extensive simulation evidence indicates that Algorithm 2 gives the best results. in terms of accuracy of results, amongst the three algorithms compared.

Figure 3

Sample plot of pdf generated from Algorithm 1 using 100 samples of data.

Figure 4

Sample plot of pdf generated from Algorithm 2 using 100 samples of data.

Figure 5

Sample plot of pdf generated from the EM algorithm using 100 samples of data.

In terms of computation speed. Algorithms 1 and 2 offer significant advantages over the EM algorithm. The EM algorithm involves computations that include all N data samples per iteration. As a consequence, as the number of data samples increases, the computation time increases dramatically. Compare this to Algorithms 1 and 2, in which the computation time of each iteration increases as a function of the number of unknown parameters only. For Algorithms 1 and 2, the values obtained from the estimated cdf function are at predefined points, and need only be looked up once, and not for every iteration. This lookup is very cheap as well, since a binary search of running time O(log N) is all that is required. For these reasons, the computation times for both Algorithms 1 and 2 are significantly improved versus the EM algorithm, particularly when N is large and the algorithms converage slowly.

The EM algorithm is generally regarded as the standard technique for estimating the parameters of mixtures of probability densities. Ruhe [26], Hasselblad [27], and Jewell [28] present solutions for finding the parameters of positive sums of exponentials, by maximum-likelihood techniques. Gruet et al [29] present a technique for estimation of mixtures of exponentials based on MCMC methods. It is well-known that the EM algorithm has a slow rate of convergence and may converge to values on the boundary of the parameter space. Moreover, the EM algorithm is not well-suited to an application in sensor networks, as the algorithm would require a high amount of network traffic to transfer all the data.

5. Conclusion

The major contributions presented here are algorithms for computing the parameters of a hyperexponential density. Our algorithms are easily implemented yet computationally very efficient. There are numerous potential applications of this algorithm particularly in the areas of network traffic modeling and queuing theory; however, the algorithms presented here are particularly applicable in sensor network situations. In particular, due to the formulation of Algorithms 1 and 2, accurate estimates of the hyperexponential pdf are obtained while keeping computation and network transmissions at the sensor nodes to a minimum. The two developed algorithms as well as the EM algorithm all give very good estimates of the hyperexponential pdf. However, extensive simulation evidence demonstrates that Algorithm 2 gives the best numerical results of the three algorithms compared, in terms of the quality of the estimate of the hyperexponential pdf produced, which is what our goal is. In addition, evidence is presented to suggest that the algorithms developed are superior to the EM algorithm in terms of computation speed in cases where a large number of data samples are present or if the convergence speed of the algorithm declines. Both algorithms developed are efficient and easily implemented using standard tools of numerical optimization.

References

Chauveau

Martin

C. F.

van Rooij

A. C. M.

Ruymgaart

F. H.

, “Discrete signed mixtures of exponentials.” Communications in Statistics: Stochastic Models vol. 12, no. 2, pp. 245–263, 1996.

Cao

Wei

, “Precipitation modeling and contract valuation: A frontier in weather derivatives.” The Journal of Alternative Investments vol. 7, no. 2, pp. 93–99, 2004.

Botta

R. F.

Harris

C. M.

Marchal

W. G.

, “Characterizations of generalized hyperexponential distribution functions.” Communications in Statistics: Stochastic Models vol. 3, no. 1, pp. 115–148, 1987.

Akyildiz

I. F.

Sankarasubramanian

Cayirei

, “A survey on sensor networks,” Computer Networks vol. 38, no. 4, pp. 393–422, 2002.

Rentala

Musunuri

Gandham

Saxena

, “Survey on sensor networks,” Tech. Rep. UTDCS-33–02. The University of Texas at Dallas, TX, U.S.A., 2002.

Hansen

J. W.

Ines

A. V. M.

, “Stochastic disaggregation of monthly rainfall data for crop simulation studies,” Agricultural and Forest Methodology vol. 131, pp. 233–246,2005.

Markovitech

N. M.

Krieger

U. R.

“Nonparametric estimation of long-tailed density functions and its applications to the analysis of World Wide Web Traffic,”

Performance Evaluation vol. 42, pp. 205–222,2002.

Leland

W. E.

Taqqu

M. S.

Willinger

Wilson

. “On the self-similar nature of ethernet traffic (Extended Version.” IEEE/ACM Trans, on Networking vol. 2. no. 1, pp. 1–15, 1994.

Schwefel

“Modeling of packet arrivals using markov-modulated poisson processes with power-tail bursts.”

Master's thesis. Technische Universität Mnchen München, Garching, Germany, August 1997.

10.

Mandelbrot

B. B.

The Fractal Geometry of Nature. NY: W. H. Freeman, 1982.

11.

Feldmann

Whitt

, “Fitting mixtures of exponentials to long-tail distributions to analyze network performance models,” performance Evaluation vol. 31, pp. 245–279,1998.

12.

Silverman

B. W.

Density Estimation for Statistics and Data Analysis. FL: Chapman & Hall, 1986.

13.

Dempster

A. P.

Laird

N. M.

Rubin

D. B.

“Maximum-likelihood from incomplete data via the EM algorithm,”

J. Royal Statist. Soc., Ser. B vol. 39, pp. 1–38,1977.

14.

Boras

Studies in Displacement Structure Theory. Ph.D. dissertation, Stanford University, CA, June 1996.

15.

Knuth

D. E.

, Fundamental Algorithms - The Art of Computer Programming: Vol. 1. MA: Addison-Wesley, 1973.

16.

Borwein

J. M.

Lewis

A. S.

, Convex Analysis and Nonlinear Optimization: Theory and Examples, NY: Springer-Verlag, 2000.

17.

Papalambros

P. Y.

Wilde

D. J.

, Principles of Optimal Design. NY: Cambridge University Press, 1988.

18.

Nocedal

Wright

S. J.

, Numerical Optimization. NY: Springer-Verlag, 1999.

19.

Fletcher

, Practical Methods of Optimization. NY: John Wiley & Sons, 1987.

20.

Bishop

C. M.

. Neural Networks for Pattern Recognition. NY: Oxford University Press, 1998.

21.

Boyd

Vandenberghe

, Convex Optimization. NY: Cambridge University Press, 2004.

22.

Nowak

R. D.

, “Distributed EM algorithms for density estimation and clustering in sensor network,” IEEE Trans. Sig. Proc. vol. 51, no. 8, pp. 2245–2253, 2003.

23.

Lehmann

E. L.

Romano

J. P.

, Testing Statistical llypotheses: 3rd edition. NY: Springer-Verlag, 2005.

24.

van der Vaart

A. W.

, Asymptotic Statistics NY: Cambridge University Press 1998.

25.

“MATLAB Software,” http://mathworks.com.

26.

Ruhe

, “Fitting empirical data by positive sums of exponentials,” SIAM Journal Sci. Stat. Comput, vol. 1, no. 4, 481–497, 1980.

27.

Hasselblad

“Estimation of finite mixtures of distributions from the exponential family,”

Journal of the American Statistical Association vol. 64, pp. 1459–1471,1969.

28.

Jewell

N. P.

, “Mixtures of Exponential Distributions,” The Annals of Statisties vol. 10, no. 2, pp. 479–484, 1982.

29.

Gruet

M. A.

Philippe

Robert

C. P.

, “Estimation of exponential mixtures,” in Discretization and MCMC Convergence Assessment ( Robert

C. P.

ed.), ch. 8, pp. 161–173, NY: Springer-Verlag, 1998.