Visualized mixed-type data analysis via dimensionality reduction

Abstract

Visualization is a useful technique in data analysis, especially in the initial stage of data exploration. Because high-dimensional data cannot be viewed directly, dimensionality reduction techniques are usually used to reduce the data to a lower dimension, say two, for visualization. In previous studies, dimensionality reduction was investigated in the context of numeric datasets. Nevertheless, most real-world datasets are of mixed type, containing both numeric and categorical attributes. In this case, a traditional approach can neither handle the data directly nor output appropriate results. To address this problem, we propose a procedure for visualized analysis of mixed-type data via dimensionality reduction. Dissimilarity between categorical values is learned from the dataset and then used to measure the distance between mixed-type data points. In addition, we propose an approach to identifying significant features and visualizing patterns from the projection map chosen according to quality measures. Experiments on real-world datasets were conducted to demonstrate the feasibility of the proposed method.

Keywords

Dimensionality reduction, mixed-type data, data visualization, data analysis
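
The abstract describes learning dissimilarity between categorical values from the dataset and using it to measure distances between mixed-type points before projecting them to two dimensions. The following is a minimal sketch of that general idea, not the procedure proposed in the article. It rests on several assumptions: the dissimilarity between two categorical values is taken to be the total variation distance between their co-occurrence distributions over a second attribute, the numeric part uses min-max scaled Euclidean distance, and classical multidimensional scaling stands in for the dimensionality reduction step. All function names and the toy data are hypothetical.

```python
import numpy as np

def learn_value_dissimilarity(cat_col, context_col):
    """Assumed measure: total variation distance between the conditional
    distributions P(context | value) of each pair of categorical values."""
    values = sorted(set(cat_col))
    contexts = sorted(set(context_col))
    counts = np.zeros((len(values), len(contexts)))
    for v, c in zip(cat_col, context_col):
        counts[values.index(v), contexts.index(c)] += 1
    cond = counts / counts.sum(axis=1, keepdims=True)
    diss = 0.5 * np.abs(cond[:, None, :] - cond[None, :, :]).sum(axis=2)
    return values, diss

def mixed_distance_matrix(numeric, cat_col, values, diss):
    """Pairwise distance combining scaled numeric differences with the
    learned categorical dissimilarity (an illustrative combination rule)."""
    numeric = np.asarray(numeric, dtype=float)
    span = numeric.max(axis=0) - numeric.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (numeric - numeric.min(axis=0)) / span
    num_sq = ((scaled[:, None, :] - scaled[None, :, :]) ** 2).sum(axis=2)
    idx = np.array([values.index(v) for v in cat_col])
    cat_sq = diss[idx[:, None], idx[None, :]] ** 2
    return np.sqrt(num_sq + cat_sq)

def classical_mds(dist, dim=2):
    """Classical MDS: eigendecomposition of the double-centered squared
    distance matrix, keeping the top `dim` components."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigval, eigvec = np.linalg.eigh(gram)
    order = np.argsort(eigval)[::-1][:dim]
    return eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0, None))

# Hypothetical toy dataset: one categorical attribute, one context
# attribute used only to learn the dissimilarity, one numeric attribute.
drink = ["coke", "pepsi", "coffee", "coke", "pepsi", "coffee"]
category = ["soda", "soda", "hot", "soda", "soda", "hot"]
price = [[1.0], [1.1], [3.0], [1.2], [0.9], [3.2]]

values, diss = learn_value_dissimilarity(drink, category)
dist = mixed_distance_matrix(price, drink, values, diss)
coords = classical_mds(dist)  # one 2-D point per record, ready to plot
```

In this toy data, "coke" and "pepsi" always co-occur with the same context value, so their learned dissimilarity is zero, whereas a simple matching scheme would treat every pair of distinct values as equally far apart.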

