A New Hybrid Scheduling Approach to Improve MapReduce Performance in Hadoop Clusters

Abstract

Big Data generally refers to massive quantities of data that are difficult to manipulate and process using traditional data management tools. Hadoop is a solution that introduces new techniques for processing, storing, and managing these large data sets in a cluster of machines or nodes. MapReduce is one of the core components of Hadoop; it promotes parallel processing of workloads by dividing them into smaller tasks that are distributed across multiple nodes. The scheduling technique adopted can have a significant effect on the data locality rate across the cluster and on the overall performance of MapReduce. In this paper, we present a hybrid scheduling algorithm that combines locality index and dynamic job priority techniques when distributing tasks among nodes, to improve the performance of Hadoop MapReduce by raising the data locality rate and reducing the processing time of workloads. Experimental results showed that our proposed algorithm achieved a shorter processing time and a higher locality rate than the default Hadoop schedulers (FIFO, FAIR, and Capacity) while ensuring efficient resource utilization.
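To make the idea of combining a locality index with a dynamic job priority more concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm evaluated in this paper: the binary locality index, the linear aging rule, the weights, and every name in it (`Task`, `locality_index`, `dynamic_priority`, `pick_task`, `aging_rate`, `w_locality`, `w_priority`) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    job_id: str
    block_hosts: frozenset  # nodes holding a replica of the task's input block
    base_priority: float    # static priority of the owning job (hypothetical)
    waited: float           # seconds the job has been queued (hypothetical)


def locality_index(task, node):
    """Assumed binary index: 1.0 if the node stores the input block, else 0.0."""
    return 1.0 if node in task.block_hosts else 0.0


def dynamic_priority(task, aging_rate=0.01):
    """Assumed aging rule: priority grows linearly with wait time, capped at 1.0."""
    return min(1.0, task.base_priority + aging_rate * task.waited)


def pick_task(pending, node, w_locality=0.6, w_priority=0.4):
    """Return the pending task with the highest weighted score for this node."""
    return max(
        pending,
        key=lambda t: w_locality * locality_index(t, node)
        + w_priority * dynamic_priority(t),
    )


pending = [
    Task("job-a", frozenset({"node1", "node2"}), base_priority=0.2, waited=10),
    Task("job-b", frozenset({"node3"}), base_priority=0.5, waited=5),
    Task("job-c", frozenset({"node3"}), base_priority=0.1, waited=120),
]

# A node with a free slot asks for work; the scheduler scores every pending task.
print(pick_task(pending, "node1").job_id)
print(pick_task(pending, "node3").job_id)
```

In this sketch, a node that holds a task's input block is favored, but a job that has waited long enough can still win a slot on a node without local data, which is the general trade-off that a locality-plus-priority scheduler has to balance.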

Keywords

fuzzy approaches for big data; human-computer interfaces and intelligent systems; data analytics and optimization techniques in engineering
