Abstract
Data skew in parallel joins causes poor load balancing, which can lead to widely varying execution times across reducers in MapReduce. The performance of the join operation degrades severely when the datasets to be joined are heavily skewed. Previous work mainly addresses either input or output load imbalance among reducers, which is insufficient for effective load balancing. In this paper, we present a new data skew handling method based on Cluster Cost Partitioning (CCP) for optimizing parallel joins in MapReduce. We define a new cost model that considers the properties of both input and output to estimate the cost of the parallel join. CCP builds the join matrix from clusters rather than from the join keys of the input relations. Using the cost model, CCP identifies and splits heavy cells in the cluster join matrix, then assigns sets of non-heavy cells to reducers to balance the join load. For different applications, the input and output weight values in the cost model can be adjusted dynamically to describe the join costs more precisely. The experimental results demonstrate that CCP achieves more accurate load balancing among reducers.