Hephaistos: A fast and distributed outlier detection approach for big mixed attribute data

Abstract

This paper tackles a new problem in outlier detection: how to promptly detect local outliers in large-scale mixed attribute data. The problem is challenging because no individual compute machine has access to the entire mixed attribute dataset. The proposed approach first applies a mechanism that removes the large volume of clearly non-noise points and extracts a cluster-based pre-noise set. It then analyzes the pre-noise set with a multi-step distributed LOF (local outlier factor) computing method on the Spark platform. The final output is an ordered list of LOF scores. Comprehensive experiments are conducted with large-scale benchmark datasets on the Spark platform. The results show that the approach outperforms state-of-the-art techniques (4X faster than baseline LOF and 2X faster than DLOF), and it is therefore expected to offer better guidance for local outlier detection in mixed attribute data.
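
The LOF scoring and ranking step named in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's distributed Spark method and omits the cluster-based pruning entirely. The function names, the `gamma` weight, and the mixed distance (Euclidean on numeric attributes plus a weighted count of categorical mismatches) are assumptions made for illustration only, and each neighbourhood is simplified to exactly `k` points, ignoring ties.

```python
import math

def mixed_distance(a, b, gamma=1.0):
    """Hypothetical mixed-attribute distance (an assumption, not the paper's):
    Euclidean distance on the numeric part plus gamma per categorical mismatch."""
    (num_a, cat_a), (num_b, cat_b) = a, b
    numeric = math.sqrt(sum((x - y) ** 2 for x, y in zip(num_a, num_b)))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * mismatches

def lof_scores(points, k, dist=mixed_distance):
    """Brute-force LOF for every point, using exactly k neighbours per point."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    neighbours = []
    for i in range(n):
        ranked = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbours.append(ranked[:k])
    # k-distance: distance to the k-th nearest neighbour.
    k_distance = [nbrs[-1][0] for nbrs in neighbours]
    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach_sum = sum(max(k_distance[j], d) for d, j in neighbours[i])
        lrd.append(len(neighbours[i]) / max(reach_sum, 1e-12))
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for _, j in neighbours[i]) / len(neighbours[i])
        for i in range(n)
    ]

# Toy data: five similar records and one record that differs in both parts.
points = [
    ((1.0, 1.0), ("a",)),
    ((1.1, 0.9), ("a",)),
    ((0.9, 1.1), ("a",)),
    ((1.0, 1.2), ("a",)),
    ((1.2, 1.0), ("a",)),
    ((8.0, 8.0), ("b",)),
]
scores = lof_scores(points, k=3)
# Ordered LOF list: indices sorted from most to least outlying.
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
print(ranking[0], round(scores[ranking[0]], 2))
```

Scores near 1 indicate a point whose local density matches that of its neighbours, and larger scores indicate stronger local outliers. The brute-force neighbour search is quadratic in the number of points, which is the cost that pruning and distribution are meant to reduce.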

Keywords

Mixed attribute data, clustering algorithm, local outlier detection, distributed framework, Spark platform

