Abstract
Nowadays, datasets containing a very large number of variables or features are routinely generated in many fields. Dimension reduction techniques are usually performed before statistically analyzing these datasets in order to avoid the effects of the curse of dimensionality. Principal component analysis is one of the most important techniques for dimension reduction and data visualization. However, missing values, which arise in almost every field, produce biased estimates and are difficult to handle, especially in the high dimension, low sample size setting. By exploiting a Lasso estimator of the population covariance matrix, we propose to regularize principal component analysis to reduce the dimensionality of datasets with missing data. The Lasso estimator of the covariance matrix is computationally tractable: it is obtained by solving a convex optimization problem. To illustrate the effectiveness of our method for dimension reduction, the principal component directions are evaluated by two metrics, the Frobenius norm and the cosine distance, and the performance is compared with that of other incomplete-data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to these alternatives in the context of discriminant analysis of real-world high-dimensional datasets.
Introduction
In the present era of Big Data, datasets with hundreds or even thousands of variables are generated and collected in many fields such as genomics, e-commerce, engineering, and education.1,2 When the number of variables becomes too large, we usually need a reduced representation of the variables to analyze or visualize the data, so dimension reduction techniques, which map data from a high-dimensional space to a lower-dimensional one, are often necessary before conducting statistical analysis. Principal component analysis (PCA) is one of the most popular such techniques. It has been widely used in statistical data analysis, pattern recognition, and image processing. 3
However, until recently, most PCA has been conducted under the assumption of a complete dataset, despite the fact that missing data are the rule rather than the exception in quantitative research. 4 Data can be missing for different reasons. For example, people may refuse to answer sensitive questions; in telescope data, missing observations may be caused by a cloudy sky or sporadic instrument failures. The problem of missing data is common in almost all research and may lead to biased estimates and decreased statistical power. 5
The traditional way to handle missing data is to delete any sample that has one or more missing observations. This complete case analysis is one of the methods commonly implemented in statistical software. When the missing rate is moderate or large, however, complete case analysis can be inefficient because a large number of observations are excluded. Another common method is mean substitution. Mean substitution assumes that a variable is missing completely at random, that is, its missingness depends neither on the true value of the missing variable nor on the values of other variables in the dataset. If this assumption is not met, mean substitution may lead to biased estimates. Additionally, variances are usually underestimated by mean substitution. 6 Multiple imputation is also a popular strategy for completing an incomplete dataset. In multiple imputation, each missing value is replaced by a set of plausible values that represent the uncertainty about the true value to impute. The imputed datasets are then analyzed using standard complete-data procedures, and the results of these analyses are combined. This approach is becoming increasingly popular because it has the potential to correct the bias that appears in complete case analysis and other alternatives. 7 Matrix factorization and nearest neighbor imputation have also been used in a host of disciplines to handle missing values.8,9
The problem of missing data has also been discussed in the context of PCA. In Audigier et al., 10 an imputation method based on the factorial analysis and principal component (PC) method is proposed, which performs well especially for categorical variables. In Serneels and Verdonck, 11 two approaches are presented to perform PCA on data that contain both outlying cases and missing elements: one approach is based on the eigendecomposition of the covariance matrix; the other is an expectation-robust algorithm for data where the number of variables exceeds the number of observations. Several methods for dealing with missing data in PCA are reviewed and compared in Ginkel et al., 5 together with their theoretical advantages and disadvantages.
Another aspect that seriously challenges PCA is the so-called high dimension, low sample size setting. The essential characteristic of a high-dimensional dataset is that its dimension p is close to, or even larger than, the number of observations N. Obviously, the statistical analysis of high-dimensional data transcends the “large N, fixed p” asymptotics of classical multivariate statistics. In the high dimension, low sample size setting, the sample covariance matrix S is well known to be an inconsistent and biased estimator of the population covariance matrix Σ.12,13 As a result, the PCs obtained by the eigendecomposition of S can deviate greatly from the true ones. By assuming that the principal loading vectors are sparse, Lee et al. proposed a modification of the conventional PCA that works in very high-dimensional spaces. 14 In Lee et al., 15 it is demonstrated that the PC scores of samples not included in the original PCA can be substantially biased toward 0 in the high-dimensional setting, and a bias-adjusted PC score prediction algorithm is proposed. A general asymptotic framework for studying the consistency properties of PCA is developed in Shen et al. 16 It allows one to rigorously characterize how PCA consistency evolves as a function of the dimension, the sample size, and the spike size.
In this paper, we aim to improve the performance of PCA when a dataset contains a small or moderate amount of missing values and falls in the high dimension, low sample size setting. For the incomplete dataset, we first estimate the covariance matrix Σ via Lasso estimation and then regularize the conventional PCA procedure. The good performance of the proposed method is demonstrated by comparison with other popular imputation methods through extensive simulations on synthetic and real-world datasets.
PCA and the covariance matrix estimation
Let
For a N × p data matrix X with N independent observations of the p-dimensional random vector
Using the above projection of X on the
In practice, the population covariance matrix Σ is usually unknown and we only have the data matrix
By decomposing
Regularized PCA for high-dimensional dataset with missing observations
An unbiased estimator of Σ with missing observations
In the last section, we showed that PCA is based on a spectral decomposition of the covariance matrix. Unfortunately, the sample covariance matrix S is not a good estimator of the population covariance matrix Σ in high dimensions. Furthermore, we cannot estimate Σ accurately for a dataset containing missing values. We therefore introduce an unbiased estimator of Σ following the approach proposed in Lounici, 18 which is based on fundamental results in random matrix theory.
Suppose an entry Xij in a data matrix X is missing at random and the entry will be observed with probability δ.
We begin the covariance matrix estimation by filling all the missing observations of variable xj with the mean value of the observed entries in the jth column. If a random variable xj is assumed to be zero-mean, the missing entries are filled with 0. After mean imputation, we have a complete data matrix XComplete. Its empirical covariance matrix
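This debiasing step can be sketched in numpy. The function name and interface below are ours, and the correction follows our reading of Lounici's construction (observed entries attenuate off-diagonal covariance terms by δ² and diagonal terms by δ); the paper's exact equations are not reproduced in this excerpt:

```python
import numpy as np

def debiased_cov(X, mask, delta):
    """Debiased covariance estimate from a zero-imputed, column-centered
    data matrix. `mask` marks observed entries; `delta` is the observing rate."""
    N, p = X.shape
    Xc = np.where(mask, X, 0.0)            # missing entries filled with 0
    S = Xc.T @ Xc / N                      # empirical covariance of filled data
    # Off-diagonal entries are attenuated by delta**2, diagonal entries
    # by delta, so rescale accordingly (cf. Lounici, 2014).
    Sigma = S / delta**2
    np.fill_diagonal(Sigma, np.diag(S) / delta)
    return Sigma
```

With zero-mean data and a moderate observing rate, the rescaled estimate recovers the population covariance on average, which the sample covariance of the zero-filled matrix does not.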
Based on the results from random matrix theory, 17 a Lasso estimator for Σ can be obtained by solving the following convex minimization problem. 18
The estimator
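The minimization problem itself is not reproduced in this excerpt. Assuming, as in Lounici, a nuclear-norm penalty over positive semidefinite matrices, the minimizer has a closed form: soft-threshold the eigenvalues of the debiased estimate. A sketch under that assumption:

```python
import numpy as np

def lasso_cov(Sigma_tilde, lam):
    """Minimize ||S - Sigma_tilde||_F^2 + lam * tr(S) over p.s.d. matrices S,
    assuming the penalty is the nuclear norm (= trace for p.s.d. matrices).
    The solution soft-thresholds the eigenvalues of Sigma_tilde by lam/2."""
    w, V = np.linalg.eigh((Sigma_tilde + Sigma_tilde.T) / 2)  # symmetrize
    w = np.maximum(w - lam / 2, 0.0)       # shrink eigenvalues, clip at zero
    return (V * w) @ V.T
```

The shrinkage discards small, noise-dominated eigenvalues, which is what makes the estimator stable when p is close to or larger than N.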
The RPCA algorithm
The proposed PCA algorithm handling high-dimensional datasets with missing observations is shown in Algorithm 1.

Algorithm 1.
1. Estimate δ by computing the proportion of the observed entries in the incomplete data matrix XIncomplete.
2. Impute XIncomplete with the mean of the observed entries for each variable (column-wise) and obtain the complete data matrix XComplete.
3. Normalize each variable in the data matrix to have zero mean and unit variance.
4. Compute the empirical covariance matrix.
5. Compute the debiased estimator.
6. Solve the convex minimization problem in equation (6) and obtain the estimator.
7. Eigen-decompose the estimator.
8. Project XComplete on the m leading eigenvectors.
9. Return XLow.
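The pipeline of Algorithm 1 can be sketched end to end as follows. This is a minimal illustration, not the paper's exact implementation: the debiasing and the spectral soft-thresholding follow our reading of Lounici's construction, missing entries are encoded as NaN, and `lam` is an illustrative tuning value rather than the paper's choice:

```python
import numpy as np

def rpca_scores(X, m, lam=0.1):
    """Regularized-PCA sketch for an incomplete data matrix X (NaN = missing).
    Returns the N x m matrix of low-dimensional scores."""
    mask = ~np.isnan(X)
    delta = mask.mean()                          # step 1: observing rate
    col_mean = np.nanmean(X, axis=0)
    Xc = np.where(mask, X, col_mean)             # step 2: mean imputation
    Xc = (Xc - Xc.mean(0)) / Xc.std(0)           # step 3: normalize
    S = Xc.T @ Xc / len(Xc)                      # step 4: empirical covariance
    Sigma = S / delta**2                         # step 5: debias off-diagonal
    np.fill_diagonal(Sigma, np.diag(S) / delta)  #         and diagonal
    w, V = np.linalg.eigh(Sigma)
    w = np.maximum(w - lam / 2, 0.0)             # step 6: soft-threshold
    order = np.argsort(w)[::-1]                  # step 7: sort eigenpairs
    U = V[:, order[:m]]                          # m leading eigenvectors
    return Xc @ U                                # step 8: project
```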
Simulations
First of all, we create a p × p population covariance matrix Σ of Toeplitz structure with entries
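The exact entries of Σ are not reproduced in this excerpt; a common Toeplitz choice, used here purely for illustration, is Σij = ρ^|i−j|:

```python
import numpy as np

def toeplitz_cov(p, rho=0.5):
    """Toeplitz covariance with entries rho**|i-j|. The paper's exact entries
    are not reproduced here; this decaying-correlation form is one common choice."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

A matrix of this form is a valid correlation matrix (symmetric and positive definite for 0 < ρ < 1), so Gaussian samples can be drawn from it directly.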
It has been shown in Algorithm 1 that the PC scores XLow are mainly determined by the
(i) The mean value substitution (MS) method
In this method, the missing data for each variable is replaced by the mean value of the observed data for that variable.
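A minimal numpy sketch of this substitution, with missing entries encoded as NaN:

```python
import numpy as np

def mean_substitute(X):
    """Replace each NaN with the column mean of the observed entries."""
    col_mean = np.nanmean(X, axis=0)     # per-variable mean, ignoring NaNs
    return np.where(np.isnan(X), col_mean, X)
```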
(ii) The multivariate imputation by chained equations (MICE) method
The MICE package in the R language generates multivariate imputations by chained equations. 23 Using this package, each incomplete variable is imputed by a separate model, and inference is then combined across the completed datasets. This method is becoming increasingly popular for handling missing data.
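The paper uses the R package; for readers working in Python, scikit-learn's IterativeImputer offers a chained-equations-style imputation (note this is a different implementation, not MICE itself, and the data below are purely illustrative):

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Each incomplete column is modeled from the others, cycling until convergence.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
X_imp = IterativeImputer(random_state=0).fit_transform(X)
```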
Note that we do not compare the proposed algorithm with complete case analysis because, for a moderate or large missing rate, nearly all of the samples would be deleted. This phenomenon becomes even more evident when the dimension p is close to or larger than the number of samples N.
To quantitatively measure the consistency between the true PC
to measure the similarity between the true PCs
Another measure to evaluate the similarity is the cosine distance. Given the true eigenvectors Uj and the estimated eigenvectors
And the cosine distance is defined as
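The exact formulas are not reproduced in this excerpt. A common way to compute the two metrics, shown here as an illustrative sketch (eigenvector signs are arbitrary, so absolute cosines are used):

```python
import numpy as np

def pc_metrics(U, U_hat):
    """Frobenius norm of the difference and per-direction cosine distance
    between true and estimated PC direction matrices (columns = directions)."""
    fn = np.linalg.norm(U - U_hat, "fro")
    cos = np.abs(np.sum(U * U_hat, axis=0)) / (
        np.linalg.norm(U, axis=0) * np.linalg.norm(U_hat, axis=0))
    return fn, 1.0 - cos                  # distance 0 means identical direction
```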
For different observing rates δ, the performance of the various methods for the synthetic data matrices XIncomplete with size
The Frobenius norm and the cosine distance (Fn, Cs) of the principal component directions for the synthetic dataset (N = 60, p = 60).
PCA: principal component analysis; RPCA: regularized PCA.
The Frobenius norm and the cosine distance (Fn, Cs) of the principal component directions for the synthetic dataset (N = 30, p = 60).
PCA: principal component analysis; RPCA: regularized PCA.

First two principal component score plot of (a) the original complete data, (b) the imputed data with MS+PCA, (c) the imputed data with MICE+PCA, and (d) the imputed data with RPCA. The settings are
Application to classification problems
Linear discriminant analysis (LDA) is a well-established supervised learning technique applicable in a variety of fields. 25 Classification of high-dimensional datasets is quite common but difficult for most classification algorithms, including LDA. 26 Techniques such as PCA and factor analysis are therefore essential for reducing the dimension before classifying a high-dimensional dataset. In this section, we show that LDA together with our RPCA algorithm can significantly reduce the misclassification rate when dealing with high-dimensional, incomplete datasets.
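As an illustration of such a reduce-then-classify pipeline, the following sketch chains an ordinary scikit-learn PCA (standing in for RPCA, which is not a library routine) with LDA on synthetic two-class Gaussian data; all parameter values are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# Two well-separated Gaussian classes in 30 dimensions, reduced to 5 PCs
# before fitting the LDA classifier.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((40, 30)),
               rng.standard_normal((40, 30)) + 1.0])
y = np.array([0] * 40 + [1] * 40)
clf = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis()).fit(X, y)
acc = clf.score(X, y)                     # training accuracy of the pipeline
```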
Synthetic data analysis
In LDA, one or more new samples are classified into one of the predefined classes
We generated the synthetic dataset from two multivariate Gaussian distributions
Averaged correct classification rate of LDA together with different dimension reduction methods (δ = 90%).
LDA: linear discriminant analysis; PCA: principal component analysis; RPCA: regularized PCA.
Averaged correct classification rate of LDA together with different dimension reduction methods (δ = 70%).
LDA: linear discriminant analysis; PCA: principal component analysis; RPCA: regularized PCA.
Real data analysis
The performance of our RPCA algorithm is also examined on two real-world datasets: the prognostic Wisconsin breast cancer dataset 27 and the ozone level detection dataset. 27 The breast cancer dataset is a complete dataset containing 569 samples of p = 30 variables. Each sample is assigned one of two class labels: malignant or benign. We retain each entry in the dataset with probability δ and randomly pick N training samples to train the LDA classifier. For different settings of δ and N, the numerical results are shown in Table 5. The predicting accuracy for each setting is averaged over 20 runs. We can see that the classification performance of the LDA classifier is relatively high when combined with the proposed dimension reduction algorithm.
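The entry-retaining step can be sketched as a Bernoulli mask (a hypothetical helper, not the paper's code):

```python
import numpy as np

def mask_entries(X, delta, seed=0):
    """Keep each entry with probability delta; set the rest to NaN."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    X[rng.random(X.shape) >= delta] = np.nan
    return X
```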
Averaged predicting accuracy of LDA together with different dimension reduction methods on the breast cancer dataset.
LDA: linear discriminant analysis; PCA: principal component analysis; RPCA: regularized PCA.
The ozone level detection dataset has 73 variables and 2534 samples. We randomly selected 160 samples labeled ozone day and an equal number of samples labeled normal day. All of the selected samples have missing data, and the proportion of observed data is calculated. Among the selected samples, we randomly choose 54 as training data, and the remaining 266 are used as testing data. The performance of each dimension reduction method is shown in Table 6, with predicting accuracy averaged over 10 simulation runs. Although the predicting accuracy of each method is not very high, our proposed RPCA algorithm achieves better performance in these high-dimensional settings.
The averaged predicting accuracy of LDA together with different dimension reduction methods on the ozone level dataset (N = 54, p = 73).
LDA: linear discriminant analysis; PCA: principal component analysis; RPCA: regularized PCA.
Conclusion
We have proposed an RPCA algorithm to reduce the dimension of datasets whose dimension is close to, or larger than, the sample size and in which some observations are missing for various reasons. The proposed algorithm is based on solving a convex optimization problem and on theoretical results from random matrix theory. One of its prominent advantages is that it circumvents the intricate imputation of missing data. Its good performance is validated on both synthetic and real-world datasets.
We would like to note that the proposed algorithm does not perform well when the missing rate of the dataset is relatively high (roughly
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported, in part, by the Science and Technology Program of Xuzhou (Grant No. KC18069) and the Postgraduate Research & Practice Program of Education & Teaching Reform of CUMT.
