Abstract

Solving the missing-value (MV) problem with small estimation errors in large-scale data environments is a notoriously resource-demanding task. The most widely used MV imputation approaches are computationally expensive because they explicitly depend on the volume and the dimension of the data. Moreover, as datasets and their user communities continuously grow, the problem can only be exacerbated. To deal with this problem, our previous work introduced a novel framework coined Pythia, which employs a number of distributed data nodes (cohorts), each of which contains a partition of the original dataset. To perform MV imputation, Pythia, based on specific machine and statistical learning structures (signatures), selects the most appropriate subset of cohorts to locally run a missing value substitution algorithm (MVA). This selection relies on the principle that a particular subset of cohorts maintains the most relevant partition of the dataset. In addition, because Pythia uses only part of the dataset for imputation and accesses different cohorts in parallel, it improves efficiency, scalability, and accuracy compared to a single machine (coined Godzilla), which uses the entire massive dataset to compute imputation requests. This article extends our previous work: we investigate in particular the robustness of the Pythia framework and show that Pythia is independent of any particular MVA and signature construction algorithm. To this end, we consider two well-known MVAs (namely, K-nearest neighbor and expectation–maximization imputation algorithms), as well as two machine and neural computational learning signature construction algorithms based on adaptive vector quantization and competitive learning. We provide comprehensive experiments to assess the performance of Pythia against Godzilla and showcase the benefits stemming from this framework.
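The cohort-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Cohort`, `relevance`, and `pythia_impute`, the fixed learning rate, the number of prototypes, and the averaging of estimates across selected cohorts are all assumptions made for illustration. Each cohort summarizes its partition with a few prototypes learned by a simple online vector quantization pass (standing in for a signature), the request is routed to the cohorts whose prototypes lie closest on the observed dimensions, and those cohorts run K-nearest neighbor imputation locally.

```python
import math
import random


def partial_distance(a, b, dims):
    """Euclidean distance restricted to the given dimensions."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))


class Cohort:
    """Hypothetical data node: one partition plus its signature (prototypes)."""

    def __init__(self, rows, n_prototypes=2, seed=0):
        self.rows = rows
        self.signature = self._build_signature(n_prototypes, seed)

    def _build_signature(self, k, seed, epochs=5, rate=0.1):
        # Assumed signature builder: online vector quantization, where the
        # closest prototype is nudged toward each row it wins.
        rng = random.Random(seed)
        prototypes = [list(r) for r in rng.sample(self.rows, k)]
        dims = range(len(self.rows[0]))
        for _ in range(epochs):
            for row in self.rows:
                winner = min(prototypes,
                             key=lambda p: partial_distance(p, row, dims))
                for i in dims:
                    winner[i] += rate * (row[i] - winner[i])
        return prototypes

    def relevance(self, query, observed):
        # Distance from the query to the nearest prototype; smaller is better.
        return min(partial_distance(p, query, observed)
                   for p in self.signature)

    def knn_impute(self, query, observed, missing, k=3):
        # Local MVA: average the missing dimensions over the k nearest rows.
        neighbours = sorted(
            self.rows,
            key=lambda r: partial_distance(r, query, observed))[:k]
        return {i: sum(r[i] for r in neighbours) / len(neighbours)
                for i in missing}


def pythia_impute(cohorts, query, top=1, k=3):
    """Route the request to the `top` most relevant cohorts and combine."""
    observed = [i for i, v in enumerate(query) if v is not None]
    missing = [i for i, v in enumerate(query) if v is None]
    selected = sorted(cohorts,
                      key=lambda c: c.relevance(query, observed))[:top]
    estimates = [c.knn_impute(query, observed, missing, k) for c in selected]
    filled = list(query)
    for i in missing:
        # Assumed aggregation rule: plain mean over the selected cohorts.
        filled[i] = sum(e[i] for e in estimates) / len(estimates)
    return filled


# Toy usage: two cohorts holding very different regions of the data space.
cohort_a = Cohort([[1.0, 1.0, 10.0], [1.2, 0.9, 10.0],
                   [0.8, 1.1, 10.0], [1.1, 1.0, 10.0]])
cohort_b = Cohort([[100.0, 100.0, 1000.0], [101.0, 99.0, 1000.0],
                   [99.0, 101.0, 1000.0], [100.5, 100.0, 1000.0]])
print(pythia_impute([cohort_a, cohort_b], [1.0, 1.05, None]))
```

In this toy setup only the first cohort is consulted for the request, so the second partition is never scanned. In a real deployment the local imputation step could be swapped for another MVA, such as expectation–maximization, without changing the routing logic.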



Scalable Data Quality for Big Data: The Pythia Framework for Handling Missing Values
