Low Multilinear Rank Approximation of Tensors and Application in Missing Traffic Data

Abstract

The problem of missing data in multiway arrays (i.e., tensors) is common in many fields such as bibliographic data analysis, image processing, and computer vision. We consider the problems of approximating a tensor by another tensor with low multilinear rank in the presence of missing data and possibly reconstructing it (i.e., tensor completion). In this paper, we propose a weighted Tucker model which models only the known elements for capturing the latent structure of the data and reconstructing the missing elements. To treat the nonuniqueness of the proposed weighted Tucker model, a novel gradient descent algorithm based on a Grassmann manifold, which is termed Tucker weighted optimization (Tucker-Wopt), is proposed for guaranteeing the global convergence to a local minimum of the problem. Based on extensive experiments, Tucker-Wopt is shown to successfully reconstruct tensors with noise and up to 95% missing data. Furthermore, the experiments on traffic flow volume data demonstrate the usefulness of our algorithm on real-world application.

1. Introduction

Missing data can be recognized as the problem that occurs in the data collection process. This problem is of considerable practical interest. Recently, it has been shown that data often have more than two modes of variation and are therefore represented as multiway arrays (i.e., tensors). For example, in internet network traffic flows analysis, the network traffic flows can be modeled as a fourth-order tensor with source IP, destination IP, port number, and time [1, 2]. If a sequence of traffic flows is missing, how can we recover them from the underlying information? Other examples include bibliographic data analysis [1, 2], image in-painting [3, 4], video in-painting [5], and data analysis [6]. Thus, it is necessary to develop a tool for reconstructing efficiently the tensors with missing data.

The low rank approximation or completion problem on matrix is studied in science and engineering to estimate missing values, such as computer vision, machine learning, and signal processing. Ruhe [7] proposed the low rank matrix approximation with missing data while considering the underlying decomposition factors. Srebro and Jaakkola [8] formulate the problem using singular value decomposition. Golub and Van Loan [9] proposed a manifold optimization approach based on truncated singular value decomposition (T-SVD) for matrix completion due to the nonuniqueness of the T-SVD, and the implementation of the algorithm was described in [10]. Ma et al. [11] applied fixed point and Bregman iterative method, and Cai et al. [12] proposed a singular value thresholding algorithm to solve the rank minimization problem.

Instead of flattening multiway arrays into matrices and using techniques for matrix, tensor-based methods can preserve the multiway nature of the data and extract the underlying factors in each mode of a higher-order tensor. The methods consist of two main types of methods: methods based on tensor decomposition and methods based on trace norm.

In this paper, we focus on low multilinear rank approximation of tensors with missing data based on Tucker decomposition [13], since Tucker decomposition is more general than CP and CP can be regarded as a special case of Tucker decomposition. Our goal is to capture the global structure of the data via a Tucker model in the presence of missing entries and recover them. In the two-way case, that is, the matrix case, the rank is a powerful tool to capture the global information. However, the properties of tensor ranks are quite different from matrix. There is no straightforward algorithm to determine the rank of a given tensor [14], and tensor rank is difficult to minimize in general since it is a nonconvex function (it is also a nonconvex function for matrix rank). Fortunately, a tensor can also be represented by Tucker model [13], whose rank can be represented by multilinear tensor rank (comparing with CP-rank, the definitions refer to Section 3). Therefore, the multilinear rank can be used for exploiting the global information instead of the CP-rank of a given tensor in low multilinear rank approximation problem.

Considering the missing tensor entries and noise in the observed elements, we formulate the low multilinear rank approximation problem as

f_{w} : \min_{S, X, Y, Z} {∥ 𝒲 * (𝒜 - 𝒮_{\times 1} X_{\times 2} Y_{\times 3} Z) ∥}_{F}^{2} + \frac{1}{2} λ {∥ 𝒮 ∥}_{F}^{2},

(1)

where $𝒜 \in ℝ^{I_{1} \times I_{2} \times I_{3}}$ is a given three-order tensor, $𝒮 \in ℝ^{R_{1} \times R_{2} \times R_{3}}$ is the core tensor, and $X \in ℝ^{I_{1} \times R_{1}}$ , $Y \in ℝ^{I_{2} \times R_{2}}$ , and $Z \in ℝ^{I_{3} \times R_{3}}$ are orthogonal mode matrices. The parameter λ ∈ [0, 1] is a regularization coefficient which can address the data sets corrupted by noise; The weighting function 𝒲, of the same size as 𝒜, is a nonnegative weight tensor defined as

\begin{matrix} 𝒲_{i_{1} i_{2} i_{3}} = {\begin{cases} 1, & if a_{i_{1} i_{2} i_{3}} is known, \\ 0, & if a_{i_{1} i_{2} i_{3}} is missing, \end{cases} \\ for all i_{k} = 1, \dots, I_{k}, k = 1,2, 3 . \end{matrix}

(2)

Tucker decomposition is not unique. Consider the three-way Tucker decomposition in (1). Let $U \in ℝ^{R_{1} \times R_{1}}$ , $V \in ℝ^{R_{2} \times R_{2}}$ , and $W \in ℝ^{R_{3} \times R_{3}}$ be orthogonal matrices; then

𝒮_{\times 1} X_{\times 2} Y_{\times 3} Z = ⟦ 𝒮_{\times 1} U_{\times 2} V_{\times 3} W; X U^{- 1}, Y V^{- 1}, Z W^{- 1} ⟧ .

(3)

A consequence of the property is that the solutions of f_w are not isolated. Thus, numerical methods might have difficulties converging to a solution [15]. That is, (1) does not have an isolated local minimum, which severely degrades the performance of standard optimization algorithms.

Considering the existing nonuniqueness problem of the Tucker decomposition, a novel method is proposed based on Grassmann manifold to minimize the weighted function in (1). The proposed method is termed as Tucker weighted optimization (Tucker-Wopt). Let the subspace spanned by the columns of X,Y, and Z be denoted by [X], [Y], and [Z]. We are not interested in the exact elements of the matrices X,Y, and Z but rather interested in the subspaces spanned by their columns, which are points on the Grassmann manifold. Then the function in (1) can be solved by nonlinear optimization on a product manifold.

The proposed Tucker-Wopt is related to the work in [10, 16], in which a manifold optimization approach based on T-SVD was proposed. The differences between the two methods are that (1) we extend the matrix case to tensor case by laying out the theoretical foundations and then build an efficient algorithm by using nonlinear optimization technique on Grassmann manifold and (2) that we focus on low multilinear rank approximation problem with missing data which is different and more complex than the matrix case. Another related work is CP-Wopt proposed by Kolda and Bader [14], which is also a first-order optimization method. However, we focus on the most general tensor decomposition (i.e., Tucker) and optimize the problem on Grassmann manifold for ensuring the uniqueness of the final solution. Our method is also different from the method based on Tucker decomposition presented in [17, 18], where the presented method is used for comparing with low rank tensor completion (LRTC), and their objective function does not consider the existed noise and the uniqueness of the optimal solution and is optimized by block coordinate descent method (BCD).

The proposed algorithm Tucker-Wopt is tested on both simulated data and real traffic volume data. Experiments show that Tucker-Wopt outperforms the traditional imputing-based methods especially when the ratio of missing data is high and the data sets are corrupted by noise. Tucker-Wopt is able to reconstruct the low multilinear rank tensor efficiently even when the ratio of missing data is up to 95%. Experiments on traffic flow volume data demonstrate the usefulness of our algorithm in real-world application.

This paper is organized as follows. Section 2 overviews related work of the existing tensor-based methods for missing data and the traditional methods for missing traffic volume data. Section 3 introduces the notation used in the paper. The proposed algorithm is described in Section 4. Numerical results on both simulated and real data are given in Section 5. Conclusions and possible future work are in Section 6.

2. Related Work

2.1. Existing Tensor-Based Methods for Missing Data

To estimate the missing entries in a low rank tensor, tensor completion methods which try to minimize the tensor rank have been proposed [17 –20]. Liu et al. proposed tensor completion based on trace norm, optimizing the objective function by block coordinate descent algorithm (BCD) [19] and alternating direction method of multipliers (ADMM) [17, 18]. Gandy et al. [20] built the objective function based on multi-n rank instead of trace norm, optimizing the objective function by ADMM.

For the methods based on tensor decomposition, CANDECOMP/paraFAC (CP) [21] decomposition and Tucker decomposition [13] are the two most widely used decomposition models. Smoliński et al. [22] proposed EM-Tucker3 method which incorporates EM in alternating least squares (ALS) procedure to estimate missing elements based on Tucker decomposition. Nevertheless, as the amount of missing data increases, the performance of the algorithm may suffer since the initialization and the intermediate models used to impute the missing values will increase the risk of converging to a less optimal solution. Tomasi and Bro [23] developed an INDADAC (Incomplete Data paraFAC) procedure that uses the Levenberg-Marquardt version of GN for fitting the CP model for data with missing values. The method is a second-order optimization and a first-order method has been developed called CP-WOPT [24] with the goal of scaling to larger problem sizes. CP-WOPT is always faster than the best alternative approach based on second-order optimization (INDAFAC) and even faster than EM-ALS with high percentage of missing data. However, just as Acar et al. [24] pointed, CP-WOPT could not fulfill the case of data with different structure for each mode. It may be possible to perform better using a more flexible model such as the Tucker model.

2.2. Existing Methods for Missing Traffic Volume Data

The spatial and temporal correlations of traffic volume data are critical for imputing the missing traffic volume data. The traditional methods involve the techniques such as historical (neighboring) imputation methods [25] and spline (including linear)/regression imputation methods [26]. These methods model the traffic data as vector pattern, which can cover few spatial-temporal information. Then spatial/temporal correlations are used to impute missing traffic data when few data are missing. Recently, the development of imputation techniques is moving on a statistically principled track. Bayesian Principal Component Analysis (BPCA) algorithm and Probabilistic Principal Component Analysis (PPCA) [27] model the traffic data as matrix pattern, which can cover more spatial-temporal information than vector pattern, and impute the missing traffic data from the whole matrix based on spatial-temporal correlations. These methods have been proved to be more effective with higher accuracy than the traditional imputation methods for missing traffic volume data.

As mentioned above, traditional methods mostly only exploit part of correlations, such as historical or temporal neighboring correlations. And the statistically principled methods usually utilize the temporal correlations of traffic data from day to day. In fact, there are many correlations existing in traffic volume. For example, the traffic data temporal correlations contain the relations from day to day, hour to hour, and so forth. In addition, the spatial correlations exist in the adjacent detectors data. However, these correlations are not explored fully or simultaneously in previous imputing methods. Considering the situation, the missing traffic volume data is solved by the tensor-based methods in this paper.

3. Notation

This work partially adopts the notation in Acar et al. [24] and Ishteva [15]. Throughout this paper, third-order tensors are denoted by calligraphic letters (𝒜, ℬ,…), matrices by boldface capitals (A,B,…), vectors by boldface lowercase letters (a,b,…), and scalars by lowercase letters (a, b,…). For example, an element of a third-order tensor 𝒜 is a_ijk = 𝒜_ijk. Some special scalars, such as upper bounds of indices, are denoted by capital letters (I, I₁, I₂, I₃, N, R,…). The symbol “×” stands for the Cartesian product of two sets, “⊗” stands for the Kronecker product, I is the identity matrix, and St (R, I) denotes the Stiefel manifold (the set of all columnwise orthonormal I × R matrices).

The elements of a third-order tensor are referred to by three indices. The mode-1 vectors of the tensor are defined to be its columns and the mode-2 vectors are its rows. In general, the mode-n vectors (n = 1, 2, 3,…, N) are obtained by varying the nth index, while keeping the other indices fixed. The number of linearly independent mode-n vectors is called mode-n rank. It is a generalization of column and row ranks of a matrix. Different from matrix rank, mode-n ranks are not equal to each other in general.

The mode-n products $𝒜_{\times n} M^{(n)}$ , n = 1, 2, 3 of a tensor $𝒜 \in ℝ^{I_{1} \times I_{2} \times I_{3}}$ with matrices $M^{(n)} \in ℝ^{J_{n} \times I_{n}}$ , are defined by

\begin{matrix} {(𝒜_{\times 1} M^{(1)})}_{j_{1} i_{2} i_{3}} = \sum_{i_{1}} a_{i_{1} i_{2} i_{3}} m_{j_{1} i_{1}}^{(1)}, \\ {(𝒜_{\times 2} M^{(2)})}_{i_{1} j_{2} i_{3}} = \sum_{i_{2}} a_{i_{1} i_{2} i_{3}} m_{j_{2} i_{2}}^{(2)}, \\ {(𝒜_{\times 3} M^{(3)})}_{i_{1} i_{2} j_{3}} = \sum_{i_{3}} a_{i_{1} i_{2} i_{3}} m_{j_{3} i_{3}}^{(3)}, \end{matrix}

(4)

where 1⩽i_n⩽I_n, 1⩽j_n⩽J_n.

An N-way tensor can be rearranged as a matrix; this is called matricization, also known as unfolding or flattening. The mode-n matricization of a tensor $𝒜 \in ℝ^{I_{1} \times I_{2} \times I_{3}}$ is defined as follows:

\begin{matrix} {(𝒜_{(1)})}_{i_{1}, (i_{2} - 1) I_{3} + i_{3}} = {(𝒜_{(2)})}_{i_{2}, (i_{3} - 1) I_{1} + i_{1}} \\ = {(𝒜_{(3)})}_{i_{3}, (i_{1} - 1) I_{2} + i_{2}} = a_{i_{1} i_{2} i_{3}}, \end{matrix}

(5)

where 1⩽i₁⩽I₁, 1⩽i₂⩽I₂, and 1⩽i₃⩽I₃.

Given two tensors 𝒜 and ℬ of the same size I₁ × I₂ × I₃, their Hadamard (elementwise) product is denoted by 𝒜*ℬ and defined as

{(𝒜 * ℬ)}_{i_{1} i_{2} i_{3}} = a_{i_{1} i_{2} i_{3}} b_{i_{1} i_{2} i_{3}},

(6)

and the scalar product of 𝒜 and ℬ is defined as

〈 𝒜, ℬ 〉 = \sum_{i_{1}} \sum_{i_{2}} \sum_{i_{3}} a_{i_{1} i_{2} i_{3}} b_{i_{1} i_{2} i_{3}} .

(7)

For a tensor 𝒜 of size I₁ × I₂ × I₃, its norm is $∥ 𝒜 ∥ = \sqrt{〈 𝒜, 𝒜 〉}$ . We also define a weighted norm as follows. Let 𝒜 and 𝒲 be two tensors of size I₁ × I₂ × I₃; then the 𝒲-weighted norm of 𝒜 is

{∥ 𝒜 ∥}_{𝒲} = ∥ 𝒲 * 𝒜 ∥ .

(8)

Let 𝒜^𝒲 be the I₁ × I₂ × I₃ observed tensor that stores all the observed values, such that

𝒜_{i_{1} i_{2} i_{3}}^{𝒲} = {(𝒜 * 𝒲)}_{i_{1} i_{2} i_{3}} = {\begin{cases} a_{i_{1} i_{2} i_{3}}, & if a_{i_{1} i_{2} i_{3}} is known, \\ 0, & if a_{i_{1} i_{2} i_{3}} is missing . \end{cases}

(9)

4. Tucker-Wopt Algorithm

In this section, we will present the Tucker-Wopt algorithm for the low multilinear rank approximation problem of tensors with missing data. For simplicity, we only consider third-order tensors. The generalization to tensors of order higher than three is feasible.

4.1. Cost Function

Let 𝒜 be a real-valued tensor of size I₁ × I₂ × I₃. From function (1), the low multilinear rank approximation problem of tensors with missing data can be formulated as

F (X, Y, Z) \equiv \min_{𝒮 \in R^{R_{1} \times R_{2} \times R_{3}}} {\frac{1}{2} {∥ 𝒜 - 𝒮_{\times 1} X_{\times 2} Y_{\times 3} Z ∥}_{𝒲}^{2} + \frac{1}{2} λ {∥ 𝒮 ∥}_{F}^{2}} .

(10)

Here, $𝒮 \in ℝ^{R_{1} \times R_{2} \times R_{3}}$ , $X \in ℝ^{I_{1} \times R_{1}}$ , $Y \in ℝ^{I_{2} \times R_{2}}$ , and $Z \in ℝ^{I_{3} \times R_{3}}$ are orthogonal matrices, normalized as X^TX = I, Y^TY = I, and Z^TZ = I, and I is the denoting the identity matrix. The parameter λ ∈ [0, 1] is a regularization coefficient which can address the data sets corrupted by noise. For the results in the paper, we choose λ = 0 when the data sets is not corrupted by noise while set λ as a positive value for the data sets which is corrupted by nosie.

4.2. Gradient Descent on the Grassmann Manifold

Because X,Y, and Z are orthogonal matrices, the function F(X,Y,Z) has the following invariance property:

F (X, Y, Z) = F (X Q^{(1)}, Y Q^{(2)}, Z Q^{(3)}),

(11)

where Q⁽ⁱ⁾ ∈ O_{R
_i}, O_{R
_i} are R_i-dimensional orthogonal matrix spaces, i = 1, 2, 3. This indicates that the function F(X,Y,Z) is defined on the product of three Grassmann manifolds M = St(R₁, I₁)/O_{R
₁} × St(R₂, I₂)/O_{R
₂} × St(R₃, I₃)/O_{R
₃}. The elements of M are difficult to comprehend since they are equivalent classes of matrices, but the minimum of F(X,Y,Z) can be found with high probability by standard gradient descent on the Grassmann manifold [28].

In fact, M = St(R₁, I₁)/O_{R
₁} × St(R₂, I₂)/O_{R
₂} × St(R₃, I₃)/O_{R
₃} is the product quotient manifold of the product Stiefel manifolds $\bar{M} = St (R_{1}, I_{1}) \times St (R_{2}, I_{2}) \times St (R_{3}, I_{3})$ , and the function F(X,Y,Z) can be considered as the projection of the function F(XQ⁽¹⁾,YQ⁽²⁾,ZQ⁽³⁾) on $\bar{M}$ to M, and F(XQ⁽¹⁾,YQ⁽²⁾,ZQ⁽³⁾) can symbol $\bar{F} (X, Y, Z)$ . Furthermore, $\bar{M} = St (R_{1}, I_{1}) \times St (R_{2}, I_{2}) \times St (R_{3}, I_{3})$ is the product embedded submanifold of the product Riemannian manifold ${\overset{ˇ}{M} = ℝ}^{I_{1} \times R_{1}} \times ℝ^{I_{2} \times R_{2}} \times ℝ^{I_{3} \times R_{3}}$ , and the function F(XQ⁽¹⁾,YQ⁽²⁾,ZQ⁽³⁾) can be considered as the constraint of the function F(XM⁽¹⁾,YM⁽²⁾,ZM⁽³⁾) on $\overset{ˇ}{M}$ to M, where M⁽ⁱ⁾ ∈ ℝ^I_i × R_i, i = 1, 2, 3, and F(XM⁽¹⁾,YM⁽²⁾,ZM⁽³⁾) can symbol $\overset{ˇ}{F} (X, Y, Z)$ . The domain of $\overset{ˇ}{F} (X, Y, Z)$ is a vector space, so that all classical techniques of real analysis are applicable.

Theorem 1. The relationship between gradient of F(X,Y,Z) and the gradient of $\overset{ˇ}{F} (X, Y, Z)$ is

g r a d F (X, Y, Z) = P_{(X, Y, Z)} g r a d \overset{ˇ}{F} (X, Y, Z),

(12)

where

P_{(X, Y, Z)} (E_{X}, E_{Y}, E_{Z}) = ((I - X X^{T}) E_{X} + X s h e w (X^{T} E_{X}), (I - Y Y^{T}) E_{Y} + Y s h e w (Y^{T} E_{Y}), (I - Z Z^{T}) E_{Z} + Z s h e w (Z^{T} E_{Z}))

(13)

and shew(A) = (A − A^T)/2.

The detailed proof of Theorem 1 is presented in the Appendix of this paper.

Let 𝒮₀ be the minimum point; the function (10) can be represented as

\begin{matrix} F (X, Y, Z) \equiv \frac{1}{2} {∥ 𝒜 - 𝒮_{0 \times 1} X_{\times 2} Y_{\times 3} Z ∥}_{𝒲}^{2} + \frac{1}{2} λ {∥ 𝒮_{0} ∥}_{F}^{2} \\ = \frac{1}{2} {∥ 𝒜_{(1)} - {X 𝒮}_{0 (1)} {(Y \otimes Z)}^{T} ∥}_{𝒲_{(1)}}^{2} + \frac{1}{2} λ {∥ 𝒮_{0 (1)} ∥}_{F}^{2} \\ = \frac{1}{2} {∥ 𝒜_{(2)} - {Y 𝒮}_{0 (2)} {(Z \otimes X)}^{T} ∥}_{𝒲_{(2)}}^{2} + \frac{1}{2} λ {∥ 𝒮_{0 (2)} ∥}_{F}^{2} \\ = \frac{1}{2} {∥ 𝒜_{(3)} - {Z 𝒮}_{0 (3)} {(X \otimes Y)}^{T} ∥}_{𝒲_{(3)}}^{2} + \frac{1}{2} λ {∥ 𝒮_{0 (3)} ∥}_{F}^{2} . \end{matrix}

(14)

The function (14) is equivalent to

F (X, Y, Z) \equiv \frac{1}{2} {∥ ℬ - 𝒞 ∥}^{2} + \frac{1}{2} λ {∥ 𝒮 ∥}_{F}^{2},

(15)

where ℬ = 𝒲*𝒜 and 𝒞 = 𝒲*(𝒮_0×1X_×2Y_×3Z), and (X,Y,Z) is on the product Grassmann manifolds M. If one lets (X,Y,Z) be on the product Riemannian manifold ${\overset{ˇ}{M} = ℝ}^{I_{1} \times R_{1}} \times ℝ^{I_{2} \times R_{2}} \times ℝ^{I_{3} \times R_{3}}$ , then the function (15) can be written as

\overset{ˇ}{F} (X, Y, Z) \equiv \frac{1}{2} {∥ ℬ - 𝒞 ∥}^{2} + \frac{1}{2} λ {∥ 𝒮 ∥}_{F}^{2} .

(16)

The gradient of the function $\overset{ˇ}{F} (X, Y, Z)$ follows directly from the expression below. In matrix notation, using ℬ and 𝒞 as function (14), one can get the gradient equation as

\begin{array}{l} grad \overset{ˇ}{F} (X, Y, Z) = ({grad}_{X} \overset{ˇ}{F}, {grad}_{Y} \overset{ˇ}{F}, {grad}_{Z} \overset{ˇ}{F}) \\ = [(C_{(1)} - B_{(1)}) (Z \otimes Y) S_{(1)}^{T}, {(C}_{(2)} - B_{(2)}) \times (Z \otimes X) S_{(2)}^{T}, (C_{(3)} - B_{(3)}) (Y \otimes X) S_{(3)}^{T}] . \end{array}

(17)

Combining (12) and (13), a closed-form expression for the gradient of F(X,Y,Z) is obtained.

In the following, we use a compact representation x for a pair (X,Y,Z) with X being an I₁ × R₁ matrix, Y an I₂ × R₂ matrix, and Z an I₃ × R₃ matrix. Also the gradient is represented by: gradF(x_k) = (grad_XF, grad_YF, grad_ZF).

4.3. Optimization on Manifold

Minimizing F(X,Y,Z) is a difficult task, because F is a nonconvex function. The basic idea is that the truncated higher-order singular value decomposition (HOSVD) [29, 30] of 𝒜^𝒲 provides a suboptimal initial approximation and that the minimum can be found with high probability by standard gradient descent after this initialization.

HOSVD [29, 30] is a generalization of the singular value decomposition (SVD) [9] which can give the best low rank approximation of a matrix. However, truncation of HOSVD results in a suboptimal solution of the best low multilinear rank approximation problem. But it can serve as a good starting point for iterative gradient descent algorithm.

Apply the HOSVD on the observed tensor 𝒜^𝒲

𝒜^{𝒲} = {\tilde{𝒮}}_{\times 1} {\tilde{X}}_{\times 2} {\tilde{Y}}_{\times 3} \tilde{Z},

(18)

for a so-called core-tensor $\tilde{𝒮} \in ℝ^{I_{1} \times I_{2} \times I_{3}}$ and three normalized orthogonal matrices $\tilde{X} \in ℝ^{I_{1} \times I_{1}}$ , $\tilde{Y} \in ℝ^{I_{2} \times I_{2}}$ , and $\tilde{Z} \in ℝ^{I_{3} \times I_{3}}$ . Then we truncate the three orthogonal matrices and get the first R₁, R₂, and R₃ columns of them. Define the initial matrices as $X_{0} = [{\tilde{x}}_{1}, \dots, {\tilde{x}}_{R_{1}}]$ , $Y_{0} = [{\tilde{y}}_{1}, \dots, {\tilde{y}}_{R_{2}}]$ , and $Z_{0} = [{\tilde{z}}_{1}, \dots, {\tilde{z}}_{R_{3}}]$ . Notice that we do not need to compute the scaled core tensor 𝒮₀, since we only require X₀, Y₀, and Z₀ for the following local optimization step. The x₀ = (X₀, Y₀, Z₀) stand for the matrices which are truncated from HOSVD. The manifold optimization algorithm starting at x₀ is described below.

For any step size τ, it is shown in Armijo [31] that this algorithm converges to a stationary point. The algorithm stops when the fit error ${∥ 𝒜 - \hat{𝒜} ∥}_{𝒲} / {∥ 𝒜 ∥}_{𝒲}$ goes below some threshold ε_tol; for example, 10⁻⁵ or the difference of the fit error between two iterations goes below some threshold δ_tol, for example, 10⁻⁶. The basic idea is that the fit error is a good indicator of the relative error on the whole set, ${∥ 𝒜 - \hat{𝒜} ∥}_{F} / {∥ 𝒜 ∥}_{F}$ .

The pseudocode of the Tucker-Wopt algorithm is given in Algorithm 1. In step 1, the initial mode matrix (X₀, Y₀, Z₀) is computed by truncated HOSVD. From step 2 to step 10, the problem of low multilinear rank tensor approximation with missing data is solved by the gradient descent algorithm on the Grassmann manifold and the optimal mode matrix (X_k, Y_k, Z_k) is obtained. Finally, the reconstructed tensor 𝒜 is obtained in step 11. Algorithm 2 gives the details of optimizing S in step 3 of Algorithm 1.

Algorithm 1:

Tucker-Wopt.

Algorithm 2:

5. Experiments

The proposed Tucker-Wopt algorithm for missing tensor value estimation is implemented with MATLAB on a Windows Workstation with a Dual-Core Intel(R) Xeon (TM) 2.8 GHZ CPU and 1 GB RAM. In this section, the performances of proposed method are evaluated both on simulated and real-world traffic data and compared with previous tensor-based method, such as Tucker-EM and CP-Wopt, for missing data estimation.

5.1. Performances on Noiseless Synthetic Data

To evaluate the performance of missing data estimation on noiseless data, we use the generated 3-mode I₁ × I₂ × I₃ test tensor 𝒜 with mode-n rank (r₁, r₂, r₃) as the synthetic test data. To impose these rank conditions, 𝒮 is r₁ × r₂ × r₃ core tensor with each entry being sampled independently from a standard Gaussian distribution 𝒩(0, 1). X, Y, and Z are I₁ × r₁, I₂ × r₂, and I₃ × r₃ factor matrices generated by setting each entry to a random number generated with 𝒩(0, 1). For simplicity, in this paper we set the mode-n ranks with the same value.

To show the superiority of proposed method, Tucker-Wopt is compared to EM-Tucker3 (implemented in the N-way Toolbox for MATLAB, version 3.10 [32]). Both of the two algorithms are tested in two types of missing data: randomly missing and structured missing data by randomly missing fibers. For Tucker-Wopt and EM-Tucker3, the stopping condition ${∥ 𝒜 - \hat{𝒜} ∥}_{𝒲} / {∥ 𝒜 ∥}_{𝒲}$ is set as 10⁻⁸, and the maximum number is set to 1000. Specially, λ is set to 0 in (15) for Tucker-Wopt.

5.1.1. Results with Randomly Missing Elements

In the case of randomly missing elements, we consider the performances on moderate-sized problems of size 50 × 50 × 50 and 100 × 100 × 100. For both sizes, we set the mode-n rank of the tensor to (5, 5, 5) and the missing rate to 20%, 40%, 60%, 80%, 90%, and 95%. The weight tensor 𝒲 is a binary tensor such that exactly (m × I₁ × I₂ × I₃) randomly selected missing elements are set to zero, where m ∈ (0, 1) defines the percentage of missing data. It is required that every slice of 𝒲 (in every direction) has at least one nonzero element, since the underlying Tucker model cannot be reconstructed. This is the generalization of the problem of coherence in the matrix completion problem that a matrix cannot be recovered if there is a whole missing row (or a column).

The convergence rates of Tucker-Wopt for tensors of size 50 × 50 × 50 are reported in Figure 1, where the fit error ${∥ 𝒜 - \hat{𝒜} ∥}_{𝒲} / {∥ 𝒜 ∥}_{𝒲}$ and the prediction error ${∥ 𝒜 - \hat{𝒜} ∥}_{F} / {∥ 𝒜 ∥}_{F}$ , with respect to the number of iterations of the manifold optimization step, are averaged over 10 runs. The results are shown for two percentages of missing data: 20% and 90%. We can see that the prediction error decays exponentially with the number of iterations in both instances. Also, the prediction error is very close to the fit error which justifies that the chosen stopping criterion is valid to our algorithm.

Figure 1:

Prediction and real fit errors versus the number of iterations of the manifold optimization step for tensors of size 50 × 50 × 50 and mode-n rank (5, 5, 5) with different percentages of missing data: 20% and 90%.

Experimental results also show that EM-Tucker3 converges slower in the case of low missing data rate and does not converge in the case of high missing data rate, while the proposed Tucker-Wopt converges faster in all cases of data sets with different missing ratio. Figure 2 illustrates the convergence of Tucker-Wopt and that of EM-Tucker3 for the tensors of size 50 × 50 × 50, mode-n rank as (5, 5, 5), and 90% missing data.

Figure 2:

Comparison of the convergence of Tucker-Wopt and EM-Tucker3 for the problems of size of 50 × 50 × 50, mode-n rank as (5, 5, 5), and 90% randomly missing data.

Figure 3 demonstrates the relative error ${∥ 𝒜 - \hat{𝒜} ∥}_{F} / {∥ 𝒜 ∥}_{F}$ for tensors of size 50 × 50 × 50 and 100 × 100 × 100, respectively, where the performances of Tucker-Wopt and EM-Tucker3 are compared, and mode-n rank is set as (5, 5, 5). All of the results are averaged over 10 instances. From Figure 3, it can be found that Tucker-Wopt performs significantly better than the EM-Tucker3 in the case of high missing data rate. Tucker-Wopt can recover the missing data even with 95% missing rate, while EM-Tucker3 produces a higher relative error. Although EM-Tucker3 performs better than our proposed method in the case of low missing data rate, our algorithm can also give comparable results to EM-Tucker3.

Figure 3:

Comparison of relative error for 50 × 50 × 50 and 100 × 100 × 100 tensors in (a) and (b), respectively, where mode-n rank is set as (5, 5, 5) for randomly missing elements with different percentages of missing data.

The proposed method outperforms EM-Tucker3 in most cases; the reason might be that Tucker-Wopt can capture the global information of tensors and keep the overall structure of data sets. EM-Tucker3 is based on EM framework that exploits the local information of tensors only, and therefore the relative errors rise sharply with the increasing of missing data ratio.

5.1.2. Results with Structured Missing Data

The setup for experiments with structured missing data is similar to that for randomly missing data. Let W be a I₂ × I₃ binary matrix, and (m × I₂ × I₃) randomly chosen elements of W are set to 0. The binary tensor 𝒲 is created by stacking I₁ copies of W together. Every slice of 𝒲 (in any direction) is guaranteed to have at least one non-zero element. Similar to matrix completion W is required to have no zero rows or columns. We set the missing rate to be 20%, 40%, 60%, 70%, 80%, 85%, and 90%. In the experiments for randomly missing fibers, we only consider missing fibers in the first mode for simplicity. The miss types in other modes can be easily generated. All results are averaged over 10 runs.

Figure 4 presents results for data sets of size 50 × 50 × 50 and 100 × 100 × 100 with missing fibers, respectively. Similar to the case of randomly missing elements, the relative error increases with the percentage of missing data. However, one notable difference is that the relative errors for structured missing data are in general greater than those for randomly missing data with comparable amounts of randomly missing elements. Also, in the case of randomly missing data, the Tucker-Wopt can recover the missing data up to 95% missing rate, while, in the case of structured missing data, it can recover the miss data up to 90%. This indicates that problems with structured missing data are more difficult than those with randomly missing elements.

Figure 4:

Comparison of relative error for structured missing data on 50 × 50 × 50 (a) and 100 × 100 × 100 (b) tensors, respectively, where mode-n rank is (5, 5, 5).

5.2. Performances on Noisy Observations

For the case with noisy observations, we also compare Tucker-Wopt with EM-Tucker3. To speed up the convergence, the iteration procedure is stopped when the difference of the fit error between adjacent iterations gets down to a threshold (set to 10⁻⁸). Other setups are the same as noiseless observations.

The proposed algorithm is tested on tensors of size 50 × 50 × 50, and, furthermore, it is tested with regularization with λ = 0.003 × (1 − m), where m is the missing rate of tensors. The value of λ used in this paper is based on our experience. Tensors are generated as above but corrupted by additive noise. The noises are independently and identically distributed according to a Gaussian distribution 𝒩(0, 1). Again, the mode-n rank of tensors is set to (5, 5, 5) and the missing rate is set from 20% to 90% with interval of 10%.

Figure 5 plots the relative errors of Tucker and EM-Tucker3 with noise. Similar to the problems without noise, both Tucker-Wopt with regularization and Tucker-Wopt without regularization outperform EM-Tucker3. For noisy data, the Tucker-Wopt with regularization performs better than the Tucker-Wopt without regularization.

Figure 5:

Comparison of relative error for noisy observation on 50 × 50 × 50 tensor, where mode-n rank is (5, 5, 5).

5.3. Compare with CP-WOPT

CP and Tucker decomposition are the two most well-known tensor decompositions, and, in fact, the former is a special case of the latter. Thus, for illustrating the generalization of Tucker-Wopt, we compare it with the CP-WOPT.

CP-WOPT [24] is implemented by the Tensor Toolbox [33]. Nonlinear conjugate gradient (NCG) is used with Hestenes-Stiefel updates [34] and the Moré-Thuente line search [35] provided in the Poblano Toolbox [36]. See Acar et al. [24] for more details.

The experimental parameters are set as follows. For Tucker-Wopt and CP-WOPT, the stopping condition which is the relative change of the function value F in (13) is set to 10⁻⁸, and the maximum number of iterations is set to 1000. Specially, in CP-WOPT, the tolerance on the 2-norm of the gradient divided by the number of entries in the gradient is set to 10⁻⁸.

Both algorithms are compared on 50 × 50 × 50 and 100 × 100 × 100 tensors. They are investigated with 20%, 40%, 60%, 80%, and 90% missing data. In CP-WOPT, similar to Acar et al. [24], the size of tensor is I × J × K and the number of factors is R. We generate factor matrices X, Y, and Z of sizes I × R, J × R, and K × R, respectively, by randomly choosing each element from 𝒩(0, 1) and then normalizing each column to unit length. The number of factors is 5 for all experiments. To keep the consistency of data sets, the factor matrices for Tucker-Wopt are set the same as that for the CP-WOPT. Mode rank is set to 5 for each mode, and the core tensor 𝒮 is superdiagonal with its diagonal elements being set to 1.

Figure 6 shows the relative error of different missing ratios for Tucker-Wopt and CP-WOPT on 50 × 50 × 50 and 100 × 100 × 100 tensors, respectively. All the results are averaged over 10 runs. Results show that Tucker-Wopt is not only suitable for the special Tucker model (i.e., CP) but it performs better than CP-WOPT.

Figure 6:

Comparison of relative error of Tucker-WOPT and CP-WOPT on 50 × 50 × 50 (a) and 100 × 100 × 100 (b) CP model with different missing ratios, respectively.

5.4. Traffic Flow Volume Data

Besides test on synthetic data, the proposed method is also applied to real-world missing traffic volume data estimation. Traffic volume data is important for appropriate traffic control strategies. However, some data are usually missed due to various reasons, such as the loss of data package during transmission in intelligent transportation systems (ITS) or detector malfunction [27]. An assumption is adopted that the traffic flows through the same loop have a high similarity form day to day [25]. With this assumption, the traffic flow volume data can be considered as low rank data. Then the missing traffic volume can be completed by our proposed method.

The traffic flow volume data at a site in Sacramento County downloaded from PeMS database (http://pems.dot.ca.gov/) are used for test. We use 16-days traffic flow data on one link which are recorded every 5 minutes and construct the data as a tensor of size 16 × 24 × 12, namely, 16 days, 24 hours, and 12 intervals per hour. Then the data is preprocessed as a Tucker model with multilinear rank (2, 2, 2).

Figure 7 illustrates the relative errors for different missing rates. Here, the definition of model error [24] is adopted for 2-component CP model as $∥ 𝒳 - \hat{𝒳} ∥ / ∥ 𝒳 ∥$ . Similarly, the definition is adopted for the Tucker model with multilinear rank (2, 2, 2). We can see that the model error for both Tucker and CP model is about 0.14 (i.e., when there is no missing data). Even though the relative error decreases when we increase the multilinear rank or components, the relative error goes down slowly.

Figure 7:

Comparison of relative error for problems of real traffic flow volume data with different percentages of missing data.

It can be seen in Figure 7 that the relative error is around 0.14 for lower missing data rate, and, furthermore, it increases very slowly with the missing data rate; for example, it is around 0.147 for 80% missing rate. When the missing rate is less than 80%, Tucker-Wopt performs comparable to EM-Tucker3 and CP-WOPT. However, the relative error of CP-WOPT increases sharply when the missing rate is higher than 90%, while Tucker-Wopt still can recover the missing data with a relative error around 0.168. Our proposed algorithm is more robust for the traffic flow volume data than CP-WOPT with high missing data rate.

6. Conclusions

In this paper, we present a low multilinear rank tensor approximation based on Tucker model, named as Tucker-Wopt, in the presence of missing data. As the nonuniqueness of Tucker model, we propose a novel gradient decent method on Grassmann manifold which is for ensuring the global convergence of a local optimal solution.

Experimental results demonstrate that our proposed Tucker-Wopt algorithm is promising to missing tensor data estimation. Since the global information of tensors can be captured by Tucker-Wopt while the EM-Tucker3 just exploits the local information of data sets, Tucker-Wopt performs better than EM-Tucker3 in most cases and can fulfill the extreme case where the ratio of missing data is very high. Tucker-Wopt can reconstruct the Tucker model successfully with large amounts of missing data, for example, 95% missing elements for tensors of size 50 × 50 × 50, while EM-Tucker3 can only address the same problem with missing data lower than 90%. Additionally, our algorithm generally converges faster than EM-Tucker3 especially when the percentage of missing data is high.

Experiments on data sets contaminated by noise also illustrate the robustness of our proposed algorithm. We also consider the practical use of Tucker-Wopt algorithm for missing traffic flow volume data estimation. Experiments demonstrate the robustness of our algorithm even though the missing rate is high.

For future studies, we will consider the problem of computation time since we would like to apply the proposed algorithm to large scale problems.

Footnotes

Appendix

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61271376, 61171118, and 91120010), the National Basic Research Program of China (973 Program: no. 2012CB725405), and Beijing Natural Science Foundation (4122067). The authors would like to thank Yong Li from Beijing University of Posts and Telecommunication and Bin Shen from Purdue University for fruitful discussions about parts of this paper.

References

Sun

Papadimitriou

, and Yu

P. S.

, “Window-based tensor analysis on high-dimensional and multi-aspect streams,” in Proceedings of the 6th International Conference on Data Mining (ICDM '06), pp. 1076–1080, December 2006.

Sun

Tao

, and Faloutsos

, “Beyond streams and graphs: dynamic tensor analysis,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pp. 374–383, August 2006.

Bertalmio

Sapiro

Caselles

, and Ballester

, “Image inpainting,” in Proceedings of the 27th Annual Conference on Computer Graphics (SIGGRAPH '00), pp. 417–424, July 2000.

Shen

Zhang

, and Zhang

, “Image inpainting via sparse representation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), pp. 697–700, April 2009.

Shih

T. K.

Tang

N. C.

Yeh

Chen

, and Lee

, “Video inpainting and implant via diversified temporal continuations,” in Proceedings of the 14th Annual ACM International Conference on Multimedia (MM '06), pp. 133–136, October 2006.

Mørup

Hansen

L. K.

, and Arnfred

S. M.

, “ERPWAVE-LAB. A toolbox for multi-channel analysis of time-frequency transformed event related potentials,” Journal of Neuroscience Methods, vol. 161, no. 2, pp. 361–368, 2007.

Ruhe

, “Numerical computation of principal components when several observations are missing,” Tech. Rep. UMINF-48-74, Department of Information Processing, Institute of Mathematics and Statistics, University of Umea, Umea, Sweden, 1974.

Srebro

and Jaakkola

, “Weighted low-rank approximations,” in Proceedings of the 20th International Conference on Machine Learning, pp. 720–727, August 2003.

Golub

G. H.

and Van Loan

C. F.

, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.

10.

Keshavan

R. H.

and Oh

, “Optspace: a gradient descent algorithm on the grassman manifold for matrix completion,” 2009, http://arxiv.org/abs/0910.5260.

11.

Goldfarb

, and Chen

, “Fixed point and Bregman iterative methods for matrix rank minimization,” Mathematical Programming, vol. 128, no. 1–2, pp. 321–353, 2011.

12.

Cai

J. F.

Candès

E. J.

, and Shen

, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.

13.

Tucker

L. R.

, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.

14.

Kolda

T. G.

and Bader

B. W.

, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009.

15.

Ishteva

, Numerical methods for the best low multilinear rank approximation of higher-order tensors [Ph.D. thesis], Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium, 2009.

16.

Keshavan

R. H.

Montanari

, and Oh

, “Matrix completion from a few entries,” 2009, http://arxiv.org/abs/0901.3150.

17.

Liu

Musialski

Wonka

, and Ye

J. P.

, “Tensor completion for estimating missing values in visual data,” IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208–220, 2012.

18.

Liu

Wonka

, and Ye

, “Sparse non-negative tensor factorization using columnwise coordinate descent,” Pattern Recognition, vol. 45, no. 1, pp. 649–656, 2012.

19.

Liu

Musialski

Wonka

, and Ye

J. P.

, “Tensor completion for estimating missing values in visual data,” in Proceedings of the IEEE International Conference on Computer Vision, p. 8, Kyoto, Japan, 2009.

20.

Gandy

Recht

, and Yamada

, “Tensor completion and low-n-rank tensor recovery via convex optimization,” Inverse Problems, vol. 27, no. 2, Article ID 025010, 2011.

21.

Carroll

J. D.

and Chang

, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.

22.

Smoliński

Walczak

, and Einax

J. W.

, “Exploratory analysis of data sets with missing elements and outliers,” Chemosphere, vol. 49, no. 3, pp. 233–245, 2002.

23.

Tomasi

and Bro

, “PARAFAC and missing values,” Chemometrics and Intelligent Laboratory Systems, vol. 75, no. 2, pp. 163–180, 2005.

24.

Acar

Dunlavy

D. M.

Kolda

T. G.

, and Mørup

, “Scalable tensor factorizations for incomplete data,” Chemometrics and Intelligent Laboratory Systems, vol. 106, no. 1, pp. 41–56, 2011.

25.

Chen

and Shao

, “Nearest neighbor imputation for survey data,” Journal of official statistics Sweden Statistiska centralbyrån, vol. 16, no. 2, pp. 113–131, 2000.

26.

Allison

P. D.

, Missing Data, Sage, Thousand Oaks, Calif, USA, 2001.

27.

Zhang

, and Hu

, “PPCA-based missing data imputation for traffic flow volume: a systematical approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, pp. 512–522, 2009.

28.

Absil

P.-A.

Mahony

, and Sepulchre

, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, USA, 2008.

29.

De Lathauwer

De Moor

, and Vandewalle

, “On the best rank-1 and rank-(R₁, R₂, …, R_N) approximation of higher-order tensors,” SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1324–1342, 2000.

30.

De Lathauwer

De Moor

, and Vandewalle

, “A multilinear singular value decomposition,” SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.

31.

Armijo

, “Minimization of functions having lipschitz continuous first partial derivatives,” Pacific Journal of Mathematics, vol. 16, no. 1, pp. 1–3, 1966.

32.

Andersson

C. A.

and Bro

, “The N-way toolbox for MATLAB,” Chemometrics and Intelligent Laboratory Systems, vol. 52, no. 1, pp. 1–4, 2000.

33.

Bader

B. W.

and Kolda

T. G.

, “MATLAB tensor toolbox version 2.4,” 2010, http://www.sandia.gov/∼tgkolda/TensorToolbox/.

34.

Nocedal

and Wright

S. J.

, Numerical Optimization, Springer, New York, NY, USA, 1999.

35.

Moré

J. J.

and Thuente

D. J.

, “Line search algorithms with guaranteed sufficient decrease,” ACM Transactions on Mathematical Software, vol. 20, no. 3, pp. 286–307, 1994.

36.

Dunlavy

D. M.

Kolda

T. G.

, and Acar

, “Poblano v1.0: a Matlab toolbox for gradient-based optimization,” Tech. Rep. SAND2010-1422, Sandia National Laboratories, Albuquerque, NM, USA, 2010.