Abstract

The MapReduce (MR) computing paradigm and its open-source implementation, Hadoop, have become the de facto standard for processing big data in distributed environments. Initially, Hadoop systems were homogeneous in three significant respects: users, workloads, and cluster hardware. However, with the growing variety of MR jobs and the addition of differently configured nodes to existing clusters, heterogeneity has become an essential part of Hadoop systems. These heterogeneity factors adversely affect the performance of a Hadoop scheduler and limit the overall throughput of the system. To overcome this problem, various heterogeneous Hadoop schedulers have been proposed in the literature. Existing surveys in this area mostly cover homogeneous schedulers and classify them by the quality-of-service parameters they optimize. Hence, there is a need to study heterogeneous Hadoop schedulers in terms of the heterogeneity factors they consider. In this survey article, we first discuss the heterogeneity factors that typically exist in a Hadoop system and then explore the challenges that arise when designing schedulers in the presence of such heterogeneity. Afterward, we present a comparative study of the heterogeneous scheduling algorithms available in the literature and classify them by those heterogeneity factors. Lastly, we examine the methods and environments used to evaluate the Hadoop schedulers discussed.

How Heterogeneity Affects the Design of Hadoop MapReduce Schedulers: A State-of-the-Art Survey and Challenges
