Hierarchical clustering of heavy-tailed data using a new similarity measure

Abstract

Clustering is a primary technique for dividing data into groups when the model underlying the data is unknown. Controlling the entire clustering procedure is complicated and subject to several uncertainties. Choosing a similarity measure, which establishes how the similarity between two objects is to be measured, is one of the first decisions to be made. This research focuses on the influence of similarity measures in hierarchical clustering for uncovering patterns in heavy-tailed data. Stable distributions are the most important subclass of heavy-tailed distributions. A well-known similarity measure is based on the correlation between two objects. However, this measure cannot be used for heavy-tailed data, because the correlation is undefined when the variance is infinite, as it is for non-Gaussian stable distributions. We illustrate how to perform a hierarchical cluster analysis of heavy-tailed data by extending the correlation-based similarity measure. We introduce a new similarity measure based on the covariation coefficient. We evaluate the performance of the covariation similarity and compare it with other measures using external and internal criteria.
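
As a rough illustration of the idea, the sketch below clusters simulated heavy-tailed series with a covariation-based dissimilarity. It is a minimal sketch under stated assumptions, not the measure defined in the article. The estimator sum(x sign(y)) / sum(|y|) is the standard first-order fractional lower-order moment form of the covariation coefficient for jointly symmetric stable variables with index above 1. The symmetrization (signed geometric mean of the two directed coefficients, which reduces to the Pearson correlation in the Gaussian case), the conversion to a dissimilarity as one minus the similarity, the average linkage, the simulated data, and the function names are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def covariation_coefficient(x, y):
    """Moment-type estimate of the covariation coefficient of x on y.

    Uses sum(x * sign(y)) / sum(|y|), which needs only finite first
    moments (stable index alpha > 1), unlike the sample correlation.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * np.sign(y)) / np.sum(np.abs(y)))


def covariation_dissimilarity(data):
    """Pairwise dissimilarity between the columns of `data`.

    Assumption (not from the article): the two directed coefficients are
    combined as sign * sqrt(l_xy * l_yx) when their signs agree and 0
    otherwise, clipped to [-1, 1]; the dissimilarity is 1 - similarity.
    """
    n = data.shape[1]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            l_ij = covariation_coefficient(data[:, i], data[:, j])
            l_ji = covariation_coefficient(data[:, j], data[:, i])
            prod = l_ij * l_ji
            sim = np.sign(l_ij) * np.sqrt(prod) if prod > 0 else 0.0
            d[i, j] = d[j, i] = 1.0 - float(np.clip(sim, -1.0, 1.0))
    return d


def symmetric_stable(alpha, size, rng):
    """Symmetric alpha-stable draws via the Chambers-Mallows-Stuck formula."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1 / alpha)
            * (np.cos((1 - alpha) * u) / w) ** ((1 - alpha) / alpha))


# Hypothetical data: six series, three driven by each of two latent factors.
rng = np.random.default_rng(0)
n, alpha = 2000, 1.5
factors = symmetric_stable(alpha, (n, 2), rng)
noise = symmetric_stable(alpha, (n, 6), rng)
data = factors[:, [0, 0, 0, 1, 1, 1]] + 0.3 * noise

# Average-linkage tree on the condensed dissimilarities, cut into two groups.
tree = linkage(squareform(covariation_dissimilarity(data)), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

In this simulated setting the first three series share one factor and the last three share the other, so a two-cluster cut is expected to separate them. Any other linkage method or symmetrization could be substituted; the choice here is only for illustration.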

Keywords

Hierarchical clustering, dissimilarity measure, stable distribution, currency market
