Sage Journals: Discover world-class research

Abstract

Uploading all Internet of Things big data to a centralized cloud for data analytics is infeasible because of the excessive latency and bandwidth limitation of the Internet. A promising approach to addressing the challenges for data analytics in Internet of Things is “edge cloud” that pushes various computing and data analysis capabilities to multiple edge clouds. MapReduce provides an efficient way to deal with a large amount of data. When performing data analysis, a challenge is to predict the performance of MapReduce jobs. In this article, we propose and evaluate InSTechEM, which is an extended Internet of Things big data–oriented model for predicting MapReduce performance in multiple edge clouds. InSTechEM is able to predict MapReduce jobs’ total execution time in a general implementation scenario with varying reduce amounts and cluster scales. The proposed model is built based on historical job execution records and employs locally weighted linear regression techniques to predict the execution time of each job. By modifying the prediction model used in Hadoop 1 and extracting more representative features to represent a job, the InSTechEM model can effectively predict the total execution time of MapReduce applications with the average relative error of less than 10% in Hadoop 2 with Ceph as the storage system.

Keywords

Internet of Things big data Hadoop edge cloud MapReduce performance modeling performance prediction job estimation

Introduction

With the development and popularity of many Internet applications and services, the exponential growth in users data is described by the term Big Data.¹ It is dramatically crucial for big data practitioners to understand the methods of providing users with shorter response time and user-friendly experience based on the physical machines. At the same time, Internet of Things (IoT) consists of billions of sensors that range from nano-sensors to smart high-definition video cameras. However, migrating all IoT data to a centralized cloud for data analytics is infeasible because of the excessive latency and bandwidth limitation. For example, excessive latency may cause applications with real-time or near-real-time requirements, such as surveillance or smart transportation management, to fail in detecting suspicious objects or crucial traffic patterns in a timely manner. A promising approach to addressing the challenges for data analytics in IoT is developing the “edge cloud” that pushes various computing and data analysis capabilities to multiple edge clouds.

MapReduce² provides an efficient way for dealing with “Big Data.” Users specify the computation in terms of a map and a reduce function. The underlying runtime system automatically parallelizes the computation throughout the large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to efficiently use networks and disks.²

The MapReduce system is easy to use and Hadoop³ is an open-source implementation of MapReduce, providing easy access to parallel computing. A lot of Internet companies have deployed Hadoop clusters for big data processing. However, it is challenging to reasonably allocate resources when fulfilling a job. It is desirable to achieve the performance goal without wasting resources.⁴ Moreover, Hadoop parameter tuning, scheduling policy, and job performance optimization are important issues, which are closely related to job performance prediction. It is worth to mention that Amazon has offered Hadoop services for several years as Amazon Elastic MapReduce (EMR) in the Amazon public cloud; however, Hadoop clusters running in one centralized cloud for IoT data real-time analytics has serious bottleneck problems, as all the geo-distributed sensors need to transfer all the data to the centralized cloud.

In this article, we propose InSTechEM, an extended IoT big data–oriented model for better prediction of MapReduce job execution time in multiple edge clouds. Based on the job profile, input dataset size, and allocated resources, a MapReduce performance model is presented to predict the job completion time using multivariate local weighted linear regression.⁵

The main contributions of this article are summarized as follows:

We propose our IoT big data edge cloud architecture to address the challenges of data analytics in IoT using “edge cloud” that pushes various computing and data analysis capabilities to multiple edge clouds.

We propose InSTechEM, an extended IoT big data–oriented model for predicting MapReduce performances using the locally weighted linear regression (LWLR) model in multiple Edge Clouds Hadoop 2 environments. By extracting distinguishing features for job representations, InSTechEM selects a cluster scale as a crucial parameter and improves the prediction model used in Hadoop 1 to adapt to Hadoop 2.

We propose our IoT Big data Edge Cloud Architecture based on Ceph, even though many previous researches on MapReduce performance prediction are mostly based on hadoop distributed file system (HDFS). Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability.

We have validated the accuracy of the proposed MapReduce performance model using TestDFSIO and Sort benchmark applications in the IoT Big data Edge Cloud Architecture environment based on Hadoop 2 with Ceph as storage system.

The rest of the article is organized as follows. Section “Related work” gives related work. Section “IoT big data multiple edge clouds architecture” proposes IoT big data edge clouds architecture. Section “InSTechEM: an IoT big data–oriented extended MapReduce performance model in multiple edge clouds” presents the extended job execution prediction model InSTechEM. Section “Experiment and performance evaluation” evaluates the accuracy and effectiveness of the proposed approach. Section “Conclusion and future work” concludes the article and points out future work.

Related work

Hadoop performance modeling has received great attention recently, involving job optimizations, scheduling, predictions, and resource provisioning. Meanwhile, IoT technology is gradually driving edge computing and edge cloud research to become a hot topic. Therefore, we should consider deploying IoT-big data–oriented Hadoop platforms in multiple edge clouds.

A number of models have been proposed for predicting MapReduce performances. Herodotou⁶ developed an expensive and comprehensive mathematical model of each phase of MapReduce. The execution of a MapReduce job included mapping tasks and reducing tasks. A map task execution was divided into five phases, which included Read, Map, Collect, Spill, and Merge. Reduce task execution was divided into four phases, including Shuffle, Merge, Reduce, and Write.

Kambatla et al.⁷ used historical execution traces of MapReduce to profile jobs and predict performance. Many models^7–9 used test runs on small-scale settings to characterize the behaviors of large-scale settings. Lin et al.¹⁰ divided the job processing from the perspective of resource dimensions instead of the execution order and proposed a cost vector that contained the cost of disk I/O, network traffic, computational complexity, central processing unit (CPU), and internal sorting for predicting the execution durations of the map and reduce tasks. However, no prediction on reduce tasks were presented.

Song et al.¹¹ presented a dynamic lightweight Hadoop job analyzer, and a prediction module using locally weighted regression methods. Tian and Chen¹² and Chen et al.¹³ proposed a cost model that showed the relationships among the amount of input data, the available system resources (Map and Reduce slots), and the complexity of the Reduce function for the target MapReduce job. Based on the cost model, optimal resource provisioning could be obtained.

Performance prediction modeling for MapReduce applications with large-scale data is a very important issue. In Wang et al.,¹⁴ we used the LWLR algorithm and linear regression (LR) algorithm to establish three kinds of prediction models based on different characteristics to estimate the execution time of the applications that have large-scale data and run on the Hadoop framework. We also made comparison and improvement to the three models. Verma et al.¹⁵ proposed a framework automatic resource inference and allocation (ARIA), for a Hadoop deadline-based scheduler which extracts and utilizes the job profiles from the past executions. These job profiles were used to compute the lower and upper bounds on the job completion time. Furthermore, ARIA model gave the resource provisioning model.

Based on ARIA model, the Hewlett-Packard Development Company (HP) model¹⁶ added scaling factors and used a simple LR to predict the job execution when processing larger datasets. Zhang et al.¹⁷ employed ARIA model¹⁵ to predict the MapReduce job completion time periods in heterogeneous Hadoop cluster environments. The work presented in Zhang et al.¹⁸ used a set of micro-benchmarks to profile generic phases of the MapReduce processing pipeline of a given Hadoop cluster. Zhang et al.¹⁸ divided the map phase and reduce phase into six generic sub-phases and used a regression technique to predict the durations of these sub-phases. Then, the overall job execution time could be computed using the method presented in Verma et al.¹⁵

Building on the HP model,¹⁶ Khan et al.¹⁹ presented an improved HP model for Hadoop job execution prediction and resource provisioning. The improved HP model employed LWLR instead of a simple regression technique to predict the execution time of a Hadoop job with a variety of reduce tasks. Furthermore, it took multiple waves into consideration.

In our paper,⁴ we designed, implemented, and evaluated the InSTechAH, an autoscaling scheme for a Hadoop system in a private cloud, which attempted to improve the resource utilization in cloud data centers as well as maintain required quality of services by autoscaling and scheduling background analytics tasks. We evaluated our scheme partially at a real data trace and partially on simulations, with Hadoop as the parallel data analytics framework and OpenStack as the cloud management architecture, to show efficiency of the InSTechAH system.

In Engin et al.,²⁰ proactive content caching in 5G wireless networks was investigated and a big-data-enabled architecture was proposed. In this practical architecture, a vast amount of data is harnessed for content popularity estimation, and strategic contents are cached at 5G base stations to achieve higher user satisfaction and backhaul central cloud offloading.

In Mathew et al.,²¹ the authors presented Nebula: a dispersed cloud infrastructure that uses voluntary edge resources for both computation and data storage. They described the lightweight Nebula architecture that enables distributed data-intensive computing through a number of optimizations, including location-aware data and computation placement, replication, and recovery. The authors evaluated Nebula’s performance on an emulated volunteer platform that spanned over 50 PlanetLab nodes distributed across Europe, and showed how MapReduce can be deployed and run on Nebula, as the common data-intensive framework. They showed that Nebula MapReduce is robust for a wide array of failures and substantially outperforms other wide-area versions based on a Berkeley Open Infrastructure for Network Computing (BOINC)-like model.

S Ola et al.²² proposed that the main objective of mobile edge computing (MEC) solution was export of central cloud capabilities to user’s proximity for decreasing latency, augmenting available bandwidth, and decreasing traffic load on the core network. On the other hand, IoT obtains benefits from the proliferation in mobile phone usage. Many mobile applications have been developed to connect a world of “things” (e.g. wearables, home automation systems, sensors, and radio frequency identification (RFID) tags) with the Internet. Even if it was an incomplete solution to scalable IoT architecture, time-sensitive IoT applications (e.g. e-healthcare, real-time monitoring) would be optimized from using MEC architecture.

W Shi and D Schahram²³ described that the success of IoT and rich cloud services have helped creating the need for edge computing, in which data processing occurs in part at the network edge, rather than completely in the centralized cloud. Edge computing could address concerns such as latency, mobile devices’ limited battery life, bandwidth costs, security, and privacy. Stream processing frameworks (SPF, for example, Apache Storm) were solutions that facilitated and managed the execution of processing topologies that consisted of multiple parallelizable steps/tasks and involved continuous data exchanges. Streaming from the world of Cloud-centric Big Data processing often failed in addressing certain requirements of IoT systems. Apostolos et al.²⁴ described topology-aware SPF extensions, which can eliminate latency requirement violations and reduce cloud-to-edge bandwidth consumption to 1/3 comparing to Apache Storm.

In Cheng et al.,²⁵ the authors focus on enabling flexible and efficient edge analytics for large-scale IoT systems to process stream data both at the network edges and in the Cloud dynamically and transparently. As compared with existing centralized cloud stream processing platforms, GeeLytics is designed to support dynamic topology to achieve low latency results for efficient task sharing and scheduling while minimizing the bandwidth consumption between the network edges and the Cloud, as shown in Figure 1.

Figure 1.

System setup for GeeLytics.²⁵

As we see, the end points could be various sensors like cars, glasses, video cameras, and mobile phones, connected to the system via different types of edge networks (e.g. WiFi, ZigBee, or 4G, but maybe also fixed networks).²⁵ They are constantly reporting heterogeneous, dimensional, and unstructured data over time. The Cloud represents the central control point with powerful processing capability and large storage. Between the end points and the Cloud, there are a large number of edge nodes distributed at different locations. These edge nodes have certain processing power and are able to process immediately the incoming stream data, therefore reducing both the bandwidth required to send raw data to the Cloud and the delay to prepare a response for actuators.²⁵

IoT big data multiple edge clouds architecture

IoT big data edge cloud architecture

Mobile or static sensors such as video cameras, audio sensors, air sensors, environment sensors, and motion sensors are becoming ubiquitous in the IoT. In many smart cities, a large number of IoT sensors are now widely deployed at different locations, producing a huge amount of stream data. Although the generated data provide us great potential to observe our city environments, it still remains a big challenge to efficiently analyze real-time results from sensor data to make fast and smart decisions. Current big data processing platforms, such as Storm, Spark Streaming, and S4, are well designed to process stream data within a cluster in the centralized cloud, but they are not suitable for geographically distributed IoT systems in which data are naturally geo-distributed and low latency analytics results are expected to be shared across users and applications. Therefore, centralized cloud-based big data infrastructure suffer from inefficient data mobility due to the centralization of cloud resources, and hence, edge cloud infrastructures are highly unsuited for dispersed-data-intensive applications, where the data may be spread at multiple geographical locations.²¹ Therefore, we propose our IoT Big data Edge Cloud Architecture, where our approach to addressing the challenges for data analytics in IoT is “edge cloud” that pushes various computing and data analysis capabilities to multiple edge clouds.

As shown in Figure 2, when the IoT sensors collect a lot of data and transfer all the data to the remote centralized cloud-host big data platform, the centralized cloud will become a bottleneck with network delay. We cooperated with one famous CDN company that has many edge network data centers across China and constructed a multiple edge clouds-based big data platform in their edge network data centers to enable IoT sensors to transport their data to the nearest edge cloud. In every edge big data platform, we build many Hadoop clusters as computing nodes; at the same time, we use Ceph distributed storage architecture as storage node. When many IoT sensor groups collect a lot of data, the multiple edge clouds redirector will allocate different sensor groups to different edge clouds based on the nearest node allocation policy. After one sensor group is allocated to an edge cloud, the edge cloud will carry out data cleaning and resource allocation. The edge cloud’s resource allocator will determine the amount of cluster resource to be allocated to handle this sensor group’s IoT data based on the collected data size and data processing type, such as CPU-intensive or IO-intensive task. After that, most IoT data intermediate processing of the sensor group is carried out in the allocated cluster, and only some aggregation processing is sent to the centralized cloud to be completed there. This architecture is similar to the idea of CDN, which may greatly reduce the load of the centralized cloud by allowing the IoT data to be distributed to multiple edge clouds for processing.

Figure 2.

IoT big data multiple edge clouds architecture.

Ceph is a massively scalable, open-source, distributed storage system. It comprises an object store, block store, and a Portable Operating System Interface of UNIX (POSIX)-compliant distributed file system. The platform is capable of autoscaling to the exabyte level and beyond. It runs on commodity hardware and is self-healing and self-managing with no single point of failure. Ceph is in the Linux kernel and can be integrated with the OpenStack cloud operating system. In our architecture, we use Ceph as the storage of Hadoop, so there is no need for moving data to HDFS since all the data are already stored on Ceph. In addition, it is easy to realize computing node autoscaling in the edge cloud environment. The potential disadvantages is that while all the data reading and writing operation are passing through the network switch of Ceph, the switch node bandwidth may become the bottleneck. But as we use multiple edge clouds architecture, we can distribute different IoT sensors to different edge clouds and thus can avoid the occurrence of bottleneck of Ceph switch network node. Therefore, this architecture is very scalable and flexible.

InSTechEM: an IoT big data–oriented extended MapReduce performance model in multiple edge clouds

In IoT >big data multiple edge clouds architecture, different sensor groups are allocated into different edge clouds based on the nearest cloud node allocation strategy. In each edge cloud, we need to determine how large cluster resource to be allocated to handle IoT data from different sensor groups based on the collected data size and data processing type, such as CPU-intensive or input/output (IO)-intensive task. In order to achieve reasonable resource allocation in multiple edge clouds architecture, we can use many methods including Hadoop parameters tuning, scheduling policy and job performance optimization, and job performance prediction. In this article, we focus on MapReduce job performance prediction and propose InSTechEM as an extended IoT Big data–oriented model for MapReduce performance prediction using the LWLR model in multiple Edge Clouds Hadoop2 environment. In InSTechEM, more representative features are extracted to represent a job and different features are assigned different weights for better prediction, which can support IoT data process job prediction in multiple edge clouds architecture.

InSTechEM model parameters selection

The performance model relies on a set of parameters to predict the total job execution time. The key factors of job execution time include Hadoop parameters configuration, job setting, cluster scale, application type, and its workload.

A general MapReduce performance model can be formalized as

T = f (P, S, C, T, D)

where P is the impact of Hadoop parameter configuration, S is the impact of job setting, C represents cluster scale, T is application type, and W is its workload.

MapReduce programs have very different logic and time complexity; the performance functions should be different from application to application.¹³ However, they can share some general form, only differing in the setting of parameters. So we build models for each different application.

Cluster scale has influence on performance model, which is obvious that the same application running on different clusters may have different execution times. In this article, we keep the physical machine distribution unchanged and vary cluster scale by controlling the number of virtual machines to find the relationship between these characteristics in the performance model.

There are a lot of Hadoop parameters that are set to be constants in our experiments for simplification. As for job setting, we choose the most important one, reduce number, to construct the model.

A more specific model will be in the form of

T = f (R, C, D)

where R is the reduce number of jobs, C represents the number of working machines in the cluster, and D is input files sizes.

Apply LWLR model to enhance performance prediction

LWLR algorithm⁵ is a non-parametric learning algorithm while LR is a parametric learning algorithm. In parametric learning algorithm, there exist fixed specific parameters that will not be changed once established. Trained sample data are used to determine the parameters and then training samples are discarded in parametric learning algorithm. Non-parametric learning algorithms need to keep training sample all the time since for each prediction, they will learn a new set of parameters. That is, the parameters are variables.

When the capacity of the training set is large, non-parametric learning algorithms cost more storage space and are relatively slow in computation speed. However, it is important and difficult to select the appropriate feature for the parametric learning algorithm like LR model. Models built on very different features may lead to very different results.

Many performance models are presented using LR of different feature selection.^14–18

Since LWLR model is relaxed on feature sets and provides a better fit to real data, this article utilizes LWLR to predict the corresponding data.

LWLR is a memory-based method that performs a regression around a point of interest using only training data that are “local” to that point.⁵

More data points lead to better fit but dramatically increase computational cost.

As the below equations show, LWLR model assigns weight coefficient to each sample point. The rule is that points are weighted by proximity to the predicted x using a kernel. LWLR model tries to fit the parameter $θ$ that makes the sum of distance between sample points and query point minimal

\begin{matrix} minimize \sum_{i} w_{i} (y_{i} - θ^{T} x_{i})^{2} \\ w_{i} = \exp (- \frac{dis (x_{i}, x_{p})}{2 τ^{2}}) \end{matrix}

where $x_{i}$ is the ith sample point, $y_{i}$ is the ith measured point, $x_{p}$ is the predicted point, $w_{i}$ is the ith weight coefficient for $x_{i}$ , and $τ$ is the scope of neighbors which is a smoothing parameter.

Obviously, the weight coefficient is approximately 1 when a sample point is very close to the predicted point and approximately 0 while the sample point is far from the predicted point. The value of τ establishes the scope of neighbors and is crucial to LWLR.

Bound-based prediction in Hadoop 2

Bound-based performance model is proposed in ARIA¹⁵ for predicting the MapReduce job completion time. The improved HP model¹⁹ modifies the upper bound to get a narrower gap between lower and upper bounds. We apply this model to the Hadoop 2 environment. Table 1 lists all symbols’ meanings used in this part.

Table 1.

Symbols used in bounds-based prediction.

$T_{M}^{low}$	The lower bound duration of the map phase in the first wave
$T_{M}^{up}$	The upper bound duration of the map phase in the first wave
$T_{sh - w 1}^{low}$	The lower bound duration of the shuffle phase in the first wave
$T_{sh - w 1}^{up}$	The upper bound duration of the shuffle phase in the first wave
$T_{sh - w 2}^{low}$	The lower bound duration of the shuffle phase in the other wave
$T_{sh - w 2}^{up}$	The upper bound duration of the shuffle phase in the other wave
$T_{sh}^{low}$	The lower bound duration of the entire shuffle phase
$T_{r}^{low}$	The lower bound duration of the reduce phase
$T_{r}^{up}$	The upper bound duration of the reduce phase
$T_{job}^{low}$	The lower bound duration of the overall job
$T_{job}^{up}$	The upper bound duration of the overall job
$M_{avg}$	The average execution time of a map task
$M_{max}$	The max execution time of a map task
$S h_{avg - w 1}$	The average execution time of a shuffle task in the first wave
$S h_{max - w 1}$	The max execution time of a shuffle task in the first wave
$S h_{avg - w 2}$	The average execution time of a shuffle task in the other wave
$S h_{max - w 2}$	The max execution time of a shuffle task in the other wave
$R_{max}$	The max execution time of a reduce task
$R_{avg}$	The average execution time of a reduce task
$N_{M}$	Map number of the job
$N_{M - w 1}$	The number of map tasks that complete in the first wave
$N_{sh - w 1}$	The number of shuffle tasks that complete in the first wave
$N_{sh - w 2}$	The number of shuffle tasks that complete in the other wave
$N_{r}$	Reduce number allocated to the job
$P_{r}$	The number of max parallel reduce tasks
$P_{m}$	The number of max parallel map tasks
C	Number of machines that compose the cluster
$R M_{total}$	Total memory of node manager allocated in the yarn
$R M_{m}$	Memory of map task allocated in the MapReduce application
$R M_{r}$	Memory of reduce task allocated in the MapReduce application

Even if the same configured job is submitted in a given execution environment, the job execution time may vary every time. Network, IO performance, and non-determinism of task scheduling are all the factors that may impact the final result. Taking this into consideration, it is reasonable to predict the time bounds and the gap between lower and upper bounds indicating the range of possible completion time.

First, using LWLR model introduced before, we predict the average and maximum task durations of different execution phases including map, shuffle, and reduce phases. Then, we utilize the bound-based model to compute the upper bound and lower bound of execution time for different phases in the job process which are completed in multiple waves. The equations are listed below

\begin{matrix} T_{M}^{low} = M_{avg} \times N_{M - w 1} / P_{m} \\ T_{M}^{up} = M_{max} \times N_{M - w 1} / P_{m} \end{matrix}

\begin{matrix} T_{sh - w 1}^{low} = S h_{avg - w 1} \times N_{sh - w 1} / P_{r} \\ T_{sh - w 1}^{up} = S h_{max - w 1} \times N_{sh - w 1} / P_{r} \\ T_{sh - w 2}^{low} = S h_{avg - w 2} \times N_{sh - w 2} / P_{r} \\ T_{sh - w 2}^{up} = S h_{max - w 2} \times N_{sh - w 2} / P_{r} \end{matrix}

\begin{matrix} T_{r}^{low} = R_{avg} \times N_{r} / P_{r} \\ T_{r}^{up} = R_{max} \times N_{r} / P_{r} \end{matrix}

\begin{matrix} T_{job}^{low} = T_{M}^{low} + T_{sh - w 1}^{low} + T_{sh - w 2}^{low} + T_{r}^{low} \\ T_{job}^{up} = T_{M}^{up} + T_{sh - w 1}^{up} + T_{sh - w 2}^{up} + T_{r}^{up} \end{matrix}

In HDFS and Ceph file systems, a job with a new dataset is partitioned into $N_{M}$ map tasks. That is, the number of map tasks can be computed when the data scale and block size are fixed. The duration of the entire map phase depends on the number of map waves. In Hadoop 1, slots are used to perform tasks and organize resources on each node, which is different from the environment we use. Hadoop 2 uses yet another resource negotiator (YARN) to manage and schedule the cluster resources. The task directly applies for resources they need, while scheduler allocates the fine-grained resources according to the actual demand instead of simply allocating slots.

In this situation, max parallel map and reduce tasks are determined by max containers that can be allocated to the tasks. They can be determined using the following expressions

\begin{matrix} P_{m} = R M_{total} / R M_{m} \times C \\ P_{r} = min (R M_{total} / R M_{r}, N_{r} / C) \times C \end{matrix}

Usually, $P_{r}$ is same as the reduce number. In practice, the MapReduce application master will use container so the actual number of concurrent mappers and reducers will be less than the calculated value. Thus, the execution time of the map phase can be predicted using bound-based model. The duration of other phases can be obtained similarly.

It is worth noting that since there exists overlap between shuffle phase and map phase, we need to divide the shuffle phase into two parts: the overlapping portion with map stage and the non-overlapping part. We characterize the two shuffle phases.

If the reduce number allocated to a MapReduce job is less than the total max parallel reduce tasks, then the shuffle phase will be finished in one single execution. The lower bound time spent on the shuffle phase can be calculated as the following formula

T_{sh}^{low} = S h_{avg} \times N_{r} / P_{r}

Finally, the overall job execution prediction time can be calculated using the following equation

T_{job}^{avg} = (T_{job}^{low} + T_{job}^{up}) / 2

Extended job execution prediction model

By extracting more representative features to represent a job and assigning weights to different features, we extend the improved HP model proposed in Khan et al.¹⁹ for predicting the MapReduce job completion time.

In this article, LWLR is used to predict the durations of the map, shuffle, and reduce phases. In fact, LWLR can be used to predict any time and what we need to do is to change the corresponding representing values in Y.

The feature sets employed in the LWLR are the modeling parameters listed before. That is, we use input file sizes $D_{i}$ , reduce number $R_{i}$ and the number of working machines in the cluster $C_{i}$ to represent ith sample point $x_{i}$ . It can be expressed in the following equation

x_{i} = [\begin{matrix} D_{i} \\ C_{i} \\ R_{i} \end{matrix}] (i = 1, 2, \dots, m)

where m is the number of sample points.

Each variable $x_{i}$ needs to be scaled so they have similar measures of spread. Different features are standardized in the same range, which makes different dimensions comparable. Moreover, the program can run faster with normalized data. In this article, we use min-max algorithm to normalize samples.

Then, we define a matrix X to contain all the training dataset and a vector Y to express the time that corresponds to the sample point $x_{i}$ . For example, if you want to predict the average execution time of the shuffle task for a new instance, then $y_{i}$ represents the actual average execution time of the shuffle task of the ith sample point. Y can be extracted from the past job execution records

\begin{matrix} X = [x_{1}, x_{2}, \dots, x_{m}] \\ Y = {[y_{1}, y_{2}, \dots, y_{m}]}^{T} \end{matrix}

For the prediction of $T_{phase}$ , we calculate the weight for each sample point to find the most similar jobs. Then, we use a distance as the following expression shows

dis (x_{i}, x_{p}) = \sqrt{{(D_{i} - D_{p})}^{2} + {(C_{i} - C_{p})}^{2} + {(R_{i} - R_{p})}^{2}}

Finally, from standard weighted least-squares theory, we get the prediction time of new instance $x_{p}$ as below. Furthermore, we should take the practical significance of predicted value into consideration that prediction time of different phases is positive

\begin{matrix} T_{phase} = e^{T} {(X_{x}^{T} W X_{x})}^{- 1} X_{x}^{T} WY \\ subject to T_{phase} > 0 \end{matrix}

Here, $W = diag (w_{i})$ is the diagonal matrix where all the non-diagonal items are 0

\begin{matrix} e = (1, 0, 0, 0)^{T} \\ X_{x} = [\begin{matrix} 1 & {(X_{1} - x_{p})}^{T} \\ ⋮ & ⋮ \\ 1 & {(X_{n} - x_{p})}^{T} \end{matrix}] \end{matrix}

Experiment and performance evaluation

In this section, we will evaluate the performance and accuracy of InSTechEM prediction model.

Experimental setup

Hardware and Hadoop configuration

In the experiment, we set up our edge cloud on an OpenStack cloud platform in our lab, which is composed of nine physical machines and eight of them are compute nodes hosting virtual machine (VMs). Each compute node has a 12-core Intel 2.4 GHz CPU, 64 GB memory, 10 Gbps network bandwidth, and 2T disk capacity, and runs CentOS 7.1. Each VM runs Ubuntu 12.04 64bit. Each compute node hosts six VMs and each VM is allocated with two CPU cores, 5 GB memory, and 80 GB disk storage. We use Hadoop-2.7.1 and set the replication level of data block to 3 and the max container of map/reduce to 5 per node. A physical topological diagram of the experiment environment is shown in Figure 3. In our edge cloud experiment environment, Hadoop directly uses Ceph²⁶ instead of HDFS as the storage platform which takes VM as the unit. Ceph, a free-software storage platform, presents object, block, and file storage from a single distributed computer cluster. When deploying Hadoop, we manually configure the components to get to Ceph osds by S3 API. Ceph osd configuration is 1.8 TB disk (every pm has one disk) and we have five Ceph nodes. We build four Hadoop clusters. They all have the same VM placement and VM specification but with different numbers of VMs (i.e. different cluster scales). Number of VMs that compose the clusters is 12, 24, 36, and 48.

Figure 3.

Edge cloud big data platform experiment environment.

Tested programs

We employ two typical MapReduce applications. The TestDFSIO benchmark is a read and write test for file systems. It is helpful for tasks to discover performance bottlenecks in the network. Map tasks of TestDFSIO perform parallel read and write jobs, respectively, and the reduce task processes statistical information to get the throughput and average IO speed. The Sort benchmark simply uses the map/reduce framework to sort input data. The inputs and outputs must be sequence files where the keys and values are bytes writable.

Datasets

We test our performance model with data varying from 1 to 100 GB for TestDFSIO and data varying from 1 to 45 GB for Sort. TestDFSIO write job generates data for the read benchmark. RandomWriter³ tool in the Hadoop package is utilized to generate random data. These types of data are used by the Sort program.

Job profile information

We perform our experiments with TestDFSIO and the Sort benchmarks on the four Hadoop clusters, respectively, and gather the job profiles from the Hadoop log. Table 2 shows the job profile information of Sort that is carried out on 12 VMs. We vary the reduce number from 72 to 144.

Table 2.

Job profile information of Sort.

Data scale (GB)	Reduce number	Map task duration (s)		First shuffle duration (s)		Other shuffle duration (s)		Reduce duration (s)		Total time (s)
Data scale (GB)	Reduce number	Max	Avg	Max	Avg	Max	Avg	Max	Avg	Total time (s)
1	72	16	11	20	14	17	5	27	17	133
5	72	18	8	89	37	9	6	78	62	281
10	72	15	6	281	116	10	7	153	125	576
15	72	15	6	431	176	15	10	246	205	870
20	144	19	6	676	278	43	25	263	208	1471
25	72	19	6	800	351	71	61	440	343	1533
30	105	16	6	982	532	268	147	361	226	1912
35	105	16	5	1190	607	523	413	570	179	2333
40	72	32	6	1276	914	712	701	1359	538	2841
45	105	19	5	1600	990	850	624	544	217	3225

Job execution prediction

We implement and evaluate the performance model InSTechEM in three scenarios with Sort as shown in Figure 4. The blue line on the graph shows the InSTechEM solution that serves as a model for the job which represents predicted job execution time and the red dots are duration measurements. Shorter distance between the two lines indicates better quality that the model achieves. All of the four subgraphs show excellent fit. $τ$ is set to 3. (1) Keeping the reduce number and VM number as constants, we increase the data size and record the job execution time. We run benchmarks on Hadoop clusters with VM number 24, 36, and 48. Reduce number is set equal to VM number. Map number is decided by input data size. As can be seen from Figure 4(a) and (b), total job execution time grows approximately linearly with the increase in data size processed by fixed reduce task in given Hadoop cluster. (2) Keeping VM number as constant, we increase the data size with a varied reduce number from 72 to 144 in Hadoop cluster of 12 VMs. In Figure 4(c), big fluctuation of durations in linear growth can be observed no matter predicted time or actual time due to varied reduce number. Although some points do not get perfect prediction due to the big gap in reduce number among the training samples, most points obtain good prediction. (3) We use data size, reduce number, and VM number as features to represent sample points and all the data in the previous situation are collected to predict job execution time. Figure 4(d) shows the good fitness of this situation.

Figure 4.

Prediction of job execution time of Sort: (a) VM = 24, (b) VM = 36, (c) VM = 12, and (d) all data.

TestDFSIO only has 1 reduce task and we set 8 map tasks for 1 GB data size and 80 map tasks for other data sizes. We run TestDFSIO read and write benchmark on Hadoop clusters with VM number set to 24, 36, and 48. We compare predicted job execution time to the measured execution time as shown in Figure 5. Figure 5(a) shows the prediction of TestDFSIO write jobs under VM number fixed to 48 and Figure 5(b) shows the prediction time with all the training data of TestDFSIO write job. Both graphs show that the calculated values coincide with the experimental results. Similar results can be obtained from TestDFSIO read jobs as shown in Figure 5(c) and (d).

Figure 5.

Prediction of job execution time of TestDFSIO read and write: (a) VM = 48 (write), (b) all data (write), (c) VM = 48 (read), and (d) all data (write).

Accuracy of the MapReduce performance model

Cross validation

We perform a leave-one-out cross validation¹² to study the prediction accuracy of the model. That is, if there are N samples, each sample is used as a validation set separately and the rest N − 1 samples as the training set. We utilize average relative errors (ARE) over the N rounds of testing to formally assess the accuracy of performance model. ARE can be computed using the following equation

ARE = \frac{1}{N} \sum_{i = 1}^{N} \frac{| y_{i}^{p} - y_{i}^{m} |}{y_{i}^{m}}

where $y_{i}^{m}$ represents the measured job execution time in the ith round of testing and $y_{i}^{p}$ represents the predicted job execution time.

Table 3 shows ARE and $R^{2}$ in leave-one-out cross validation in different situations of Sort and TestDFSIO. $R^{2}$ is a measure for evaluating the goodness of fit in regression modeling. R² = 1 means a perfect fit, while R² > 90% indicates a very good fit. The obtained results confirm that the InSTechEM model performs well. Figure 6 shows the relative error of prediction from all data situations of Sort and TestDFSIO, respectively. For Sort running on clusters with VM number set to 12, 24, 36, and 48, we conduct 40 tests. For TestDFSIO running on cluster with VM number set to 24, 36, and 48, we carry out 33 tests. From Figure 6, we can see that the relative errors are high for some experiments. These particular experiments have the smallest input size 1 GB. Our trained model may have no similar samples and could not come with a good prediction. This could be solved by improving the training. As for Sort, when predicting job execution time of a small data size in a given Hadoop cluster, there exist a large number of errors due to the error prediction in shuffle phase and short total finish time. Prediction on large scale of data achieves better performance.

Table 3.

ARE and $R^{2}$ in leave-one-out cross validation.

Situation	Sort		TestDFSIO (read)		TestDFSIO (write)
Situation	ARE	$R^{2}$	ARE	$R^{2}$	ARE	$R^{2}$
VM = 12	13.4%	93.1%	−	−	−	−
VM = 24	12.4%	99.5%	1.3%	99.9%	7.3%	99.3%
VM = 36	10.1%	99.2%	1.3%	100.0%	10.5%	98.5%
VM = 48	14.7%	97.9%	4.3%	99.7%	17.7%	99.1%
All data	8.7%	98.4%	3.1%	99.8%	8.8%	98.9%

ARE: average relative error.

Figure 6.

Relative error of Sort and TestDFSIO.

Conclusion and future work

In this article, we proposed and evaluated the InSTechEM, an IoT Big data–oriented extended model (EM) for MapReduce performance prediction in multiple edge clouds architecture, which predicts MapReduce job total execution time in a more general scenario with various reduce numbers and cluster scales. The proposed model is built based on historical job execution records and employs LWLR technique. By extracting more representative features to represent a job, the extended model (EM) can effectively predict the total execution time of MapReduce applications with the average relative error of less than 10%.

With the growing number of IoT sensors, the multiple edge clouds will expand across many distributed locations. As future work, we plan to consider a scalable multi-tier edge clouds architecture to address the possible performance bottlenecks caused by a small number of edge clouds with one centralized cloud. At the same time, we will keep improving the MapReduce performance prediction model by studying details of MapReduce and Hadoop framework. Furthermore, the enhanced performance prediction model is utilized to estimate the amount of resources for Hadoop jobs with deadline requirements in multiple edge clouds.

Footnotes

Acknowledgements

The authors thank Professor Duan Qiang at the Pennsylvania State University Abington College, USA, for valuable suggestions and English correction of the manuscript.

Academic Editor: Chin-Tser Huang

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by 2016–2019 National Natural Science Foundation of China under Grant No. 61572137 (Multiple Clouds based CDN as a Service Key Technology Research), Shanghai 2016 Innovation Action Project under Grant No. 16DZ1100200 (Data-trade-supporting Big data Testbed), and 2015–2017 Shanghai Innovation Action Project under Grant No. 1551110700 (New media-oriented Big data analysis and content delivery key technology and application).

References

Snijders

Matzat

Reips

UD.

“Big data”: big gaps of knowledge in the field of internet science. Int J Internet Sci 2012; 7(1): 1–5.

Dean

Ghemawat

MapReduce: simplified data processing on large clusters. Oper Syst Des Implement 2004; 51(1): 107–113.

Hadoop, http://hadoop.apache.org

Wang

. InSTechAH: an autoscaling scheme for Hadoop in the private cloud. In: Proceedings of the 2015 IEEE international conference on services computing (SCC), New York, 27 June–2 July 2015, pp.395–402. New York: IEEE.

Ruppert

Wand

MP.

Multivariate locally weighted least squares regression. Ann Stat 1994; 22(3): 1346–1370.

Herodotou

Hadoop performance models. Technical report, CS-2011-05 (eprint arXiv: 1106.0940). Durham, NC: Computer Science Department, Duke University.

Kambatla

Pathak

Pucha

. Towards optimizing Hadoop provisioning in the cloud. In: Proceedings of the 2009 conference on hot topics in cloud computing, San Diego, CA, 15 June 2009. Berkeley, CA: USENIX Association.

Kavulya

Tan

Gandhi

. An analysis of traces from a production MapReduce cluster. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing, Melbourne, VIC, Australia, 17–20 May 2010, pp.94–103. Washington, DC: IEEE Computer Society.

Yin

Qiao

. Performance modeling and optimization of MapReduce programs. In: Proceedings of the 3rd international conference on cloud computing and intelligence systems (CCIS), Shenzhen University Science & Technology Building Convention Center, Shenzhen, China, 27–29 November 2014.

10.

Lin

Meng

. A practical performance model for Hadoop MapReduce. In: Proceedings of the 2012 IEEE international conference on cluster computing workshops, Beijing, China, 24–28 September 2012, pp.231–239. New York: IEEE.

11.

Song

Meng

Huet

. A Hadoop MapReduce performance prediction method. In: Proceedings of the 10th international conference on high performance computing and communications & embedded and ubiquitous computing, HPCC_EUC, Zhangjiajie, China, 13–15 November 2013, pp.820–825. New York: IEEE.

12.

Tian

Chen

. Towards optimal resource provisioning for running MapReduce programs in public clouds. In: Proceedings of the 2011 IEEE international conference on cloud computing (CLOUD), Washington DC, USA, 4–9 July 2011, pp.155–162. Washington, DC: IEEE Computer Society.

13.

Chen

Powers

Guo

. CRESP: towards optimal resource provisioning for MapReduce computing in public clouds. IEEE T Parall Distr 2014; 25(6): 1403–1412.

14.

Wang

Yang

. Comparison and improvement of Hadoop MapReduce performance prediction models in the private cloud. In: Proceedings of the international conference of Asia-Pacific services computing (APSCC) (LNCS), vol. 10065, Zhangjiajie, China, 16 November 2016, pp.77–91. Cham: Springer.

15.

Verma

Cherkasova

Campbell

. ARIA: automatic resource inference and allocation for MapReduce environments. In: Proceedings of the 8th ACM international conference on autonomic computing, Karlsruhe, 14–18 June 2011, pp.235–244. New York: ACM.

16.

Verma

Cherkasova

Campbell

RH.

Resource provisioning framework for MapReduce jobs with performance goals. Lect Notes Comput Sc 2011; 7049: 165–186.

17.

Zhang

Cherkasova

Loo

. Performance modeling of MapReduce jobs in heterogeneous cloud environments. In: Proceedings of the sixth international conference on cloud computing (CLOUD), Washington DC, USA, 28 June–3 July 2013, pp.839–846. Washington, DC: IEEE Computer Society.

18.

Zhang

Cherkasova

Loo

. Benchmarking approach for designing a MapReduce performance model. In: Proceedings of the ACM/SPEC international conference on performance engineering, Prague, 21–24 April 2013, pp.253–258. New York: ACM.

19.

Khan

Jin

. Hadoop performance modeling for job estimation and resource provisioning. IEEE T Parall Distr 2015; 27: 441–454.

20.

Engin

Ejder

Mehdi

. Big data caching for networking: moving from cloud to edge. IEEE Commun Mag 2016; 54(9): 36–42.

21.

Mathew

Kwangsung

Abhishek

. Nebula: distributed edge cloud for data intensive computing. In: Proceedings of the 2014 IEEE international conference on cloud engineering, Boston, MA, 11–14 March 2014, pp.57–66. New York: IEEE.

22.

Ola

Imad

Ayman

. Edge computing enabling the Internet of Things. In: Proceedings of the 2nd world forum on Internet of Things, WF-IoT, Milan, 14–16 December 2015, pp.603–609. New York: IEEE.

23.

Shi

Schahram

The promise of edge computing. Computer 2016; 49(5): 78–81.

24.

Apostolos

Ehsan

Cheng

. Edge-computing-aware deployment of stream processing tasks based on topology-external information: model, algorithms, and a storm-based prototype. In: Proceedings of the 2016 IEEE international congress on big data (BigData Congress), San Francisco, CA, 27 June–2 July 2016.

25.

Cheng

Apostolos

Cirillo

. GeeLytics: geo-distributed edge analytics for large scale IoT systems based on dynamic topology. In: Proceedings of the 2nd world forum on Internet of Things, WF-IoT, Milan, 14–16 December 2015, pp.565–570. New York: IEEE.

26.

Ceph, http://ceph.com/

InSTechEM: An Internet of Thing big data–oriented extended model for MapReduce performance prediction in multiple edge clouds

Abstract

Keywords

Introduction

Related work

IoT big data multiple edge clouds architecture

IoT big data edge cloud architecture

InSTechEM: an IoT big data–oriented extended MapReduce performance model in multiple edge clouds

InSTechEM model parameters selection

Apply LWLR model to enhance performance prediction

Bound-based prediction in Hadoop 2

Extended job execution prediction model

Experiment and performance evaluation

Experimental setup

Hardware and Hadoop configuration

Tested programs

Datasets

Job profile information

Job execution prediction

Accuracy of the MapReduce performance model

Cross validation

Conclusion and future work

Footnotes

Acknowledgements

Declaration of conflicting interests

Funding

References