Abstract
In a distributed environment, replication is one of the most widely investigated techniques. Replication stores multiple copies of the same data at different locations; whenever data is needed, it is fetched from the nearest accessible copy, avoiding delays and improving system performance. To manage a replica placement strategy in the cloud, three key challenges must be addressed: determining the best time to create replicas, deciding which files to replicate, and selecting the best locations to store the replicas. This survey reviews 65 articles published on data replication in the cloud. The literature review examines a series of research publications and offers a detailed analysis. The analysis begins by presenting the replication strategies used in the reviewed articles, followed by an analysis of the performance measures reported in each contribution. Moreover, this survey offers a comprehensive examination of data auditing schemes and an analytical evaluation of replication handling in the cloud. The evaluation tools used in the papers are also examined. Finally, the survey describes research issues and limitations that can help researchers shape future work on pattern mining for data replication in the cloud.
Nomenclature
Operating systems, storage, networks, hardware, databases, and even complete software applications are supplied to consumers as on-demand services through cloud computing, a network-based architecture [45,63]. Cloud computing does not rely on entirely new techniques, but it saves money and improves the scalability of IT service management. SaaS, IaaS, PaaS, and EaaS are the four main types of cloud services. A tremendous quantity of data is now a significant and critical portion of the resources shared across several scientific fields. In several disciplines, the size of data is measured in terabytes or even petabytes. Cloud data centres [61,67] are often used to store such massive amounts of data. As a result, data replication is commonly used to manage large amounts of data by producing identical copies of data, known as replicas, in geographically dispersed locations. Data replication [56,74] has the benefit of accelerating data access, lowering access latency, and enhancing data availability. To improve response time for consumers, a common method is to deploy numerous replicas spread across geographically distant clouds. The overheads of producing, maintaining, and updating replicas are substantial, and keeping multiple copies of data consistent at several places is a difficult concern in distributed setups. Cloud service providers offer a variety of geographically dispersed facilities that users can share in accordance with the Service Level Agreement, and the market for these services is expanding quickly. Using cloud computing, a powerful paradigm for addressing the requirements of individuals and companies, information may be exchanged across the Internet. On a wide scale, purely cloud-based or peer-to-peer (P2P)-based content distribution can be effectively replaced by the peer-to-peer-cloud (P2P-Cloud). The primary goal of data replication is to increase performance for data-intensive applications by addressing major concerns such as availability, reliability, security, bandwidth, and data access response time [29,57,58,70].
In the field of distributed and cloud computing, data replication [71] has been widely investigated. If one of the nodes is unavailable, the data can be retrieved from another node. The staging, placement, and transfer of data across a cloud are all part of data replication. Data staging refers to the temporary storage of data for evaluation at a later stage of execution. Data replication is frequently used by cloud service providers to meet their service-level goals for availability and response speed. Cloud computing is a rapidly evolving paradigm that gives users flexible, on-demand access to cloud services using a pay-as-you-go pricing structure. Data replication is a well-known method for increasing data availability, lowering bandwidth usage, and achieving fault tolerance [23,60,68]. If a shared resource is unavailable in a cloud computing [62] system, data is "staged-in" at the execution site; once processed, the data is "staged-out" of that storage. Data placement is the process of placing the data at various locations, and data movement is focused on how (a) data should be transferred to maintain replication levels and (b) data would be accessible from various locations. Traditionally, many strategies improve cloud performance by dividing a file into multiple blocks and distributing the pieces among data nodes for parallel data transmission. Data management and replication strategies should be designed to achieve QoS while preventing performance degradation. Data replication is mostly used to handle enormous amounts of data in a dispersed manner. Data replication [22] improves data availability, which in turn increases data access speed and lowers access latency. When considering data replication operations in cloud storage clusters, two major issues must be addressed: choosing the right number of data replicas, and placing those replicas in the system correctly so that tasks complete efficiently.
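The first of these issues, choosing the number of replicas, is often framed as meeting a target availability. The following minimal Python sketch, which assumes independent node failures and is not drawn from any of the surveyed papers, illustrates how a minimum replica count could be estimated:

```python
import math

def min_replicas(node_availability: float, target_availability: float) -> int:
    """Smallest replica count n such that 1 - (1 - p)^n >= target,
    assuming independent node failures with per-node availability p."""
    if not (0 < node_availability < 1) or not (0 < target_availability < 1):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    # Solve (1 - p)^n <= 1 - target  =>  n >= log(1 - target) / log(1 - p)
    n = math.log(1.0 - target_availability) / math.log(1.0 - node_availability)
    return max(1, math.ceil(n))

# Example: nodes that are up 95% of the time, 99.99% target availability
print(min_replicas(0.95, 0.9999))  # -> 4 replicas
```

More replicas raise availability but also raise the storage, maintenance, and consistency overheads discussed above, which is exactly the trade-off the reviewed strategies try to balance.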
Many replication levels [47] are typical in cloud-based systems that run all over the world, particularly between clusters, servers, inter-data centres, and even cloud systems. Conventional approaches for consistency across an object's replicas include pessimistic (i.e., lock-based) and optimistic (i.e., non-lock-based) protocols. Although replication enhances application performance, there are situations in which it degrades overall performance: updates must be propagated to replicas spread across the network, and a heavy update load on a data file causes performance to drop. The choice of replication protocol and the replication cost also have a significant impact on the replication process [42,70]. Despite its many benefits, data replication can increase costs and energy consumption. Data replication can also be viewed as a method of enhancing performance, availability, and dependability by maintaining multiple copies of a data file across many sites. It is frequently used in applications that collect large amounts of data from multiple locations across the world. Whenever a replica at one of the sites fails, the requested data file can still be supplied from other locations. The purpose of data replication is to fulfil requests from nearby locations that keep the appropriate replicated files. Thus, it is necessary to implement a data replication [54,55] technique that balances several trade-offs. Optimization has become a popular area in recent years for determining the best answer to complex problems, and researchers have therefore concentrated on meta-heuristic algorithms to address replication issues. Many people and businesses outsource their data to remote cloud service providers (CSPs) to cut maintenance costs and the workload associated with managing huge data storage systems. Replication of the data is essential for boosting information availability and reliability. However, maintaining numerous copies of the data across many hosts results in higher maintenance costs for providers, who then pass those costs along to clients in the form of higher prices. Moreover, providers occasionally do not preserve copies on all servers, and clients' requests for data updates are not always adequately carried out.
The following is a list of the key contributions of this work.
Conduct a comprehensive review of 65 research papers based on pattern mining of data replication in the cloud.
Analyze various replication methods in the reviewed articles. This survey provides a comprehensive analysis of data auditing schemes. Moreover, an analytical review of replication handling in the cloud is performed.
Moreover, the evaluation tools used in the papers are analysed, and the performance measures of each contribution are examined. Finally, research gaps and challenges in this topic are identified.
Previous literature on data replication in the cloud is discussed in Section 2 of this article. In Section 3, a review of cloud data replication models, their performance, and their maximum attainments is presented. Section 4 contains an evaluation of the replication handling schemes in the cloud and a chronological review. Section 5 reviews data auditing schemes and simulation tools. Section 6 lists the research gaps and challenges. The conclusion of this work is presented in Section 7.
Literature review
Related work
This study evaluates 65 papers that were published between 2015 and 2021. The papers are chosen from well-known publishers including IEEE, Springer, Elsevier, and others. The papers are arranged as follows:
Replication management approach
In 2016, Mansouri
In 2012, Sun
In 2019, Ramanan
In 2019, Edwin
In 2021, Mohammad
In 2015, Boru
In 2021, Maheshwari
In 2021, Ulabedin
In 2013, Chen
In 2019, Toosi
In 2021, Latip
In 2021, Mseddi
In 2021, Raouf
In 2021, Zhang
In 2021, Awad
In 2020, Amel
In 2019, Riad
In 2018, Liang
In 2018, Mansouri
In 2016, Navneet
In 2021, Gregory
In 2020, Behnam et al. [52] introduced a multi-objective optimized placement method based on a meta-heuristic approach and a fuzzy system that balances the trade-offs among six optimization objectives to discover the best places for replicas. Furthermore, comprehensive experiments using CloudSim demonstrate that the suggested replication algorithms improve on the most popular replication strategies in terms of hit ratio, number of replications, load variance, latency, average service time, availability, and energy usage.
In 2019, Gustavo et al. [27] introduced FT-Aurora, a high-availability IaaS cloud manager that permits access to cloud resources even if the manager fails. By enabling network programmability, FT-Aurora allows for more efficient and flexible resource management. Both the efficiency and reliability of FT-Aurora were tested, and conclusions were drawn.
In 2018, Suji et al. [20] presented an approach for dynamically adjusting the replica factor for each data item based on the data's popularity, its present replication factor, and the number of active nodes in the cloud service. The suggested technique was implemented on HDFS. The test findings demonstrate that the suggested strategy keeps an appropriate number of replicas for each data item depending on its popularity while respecting the cloud storage availability limitation.
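The sketch below is a hedged illustration of the kind of popularity-driven heuristic described above; the function name, parameters, and step policy are assumptions for illustration and do not reproduce the exact algorithm of [20]:

```python
def adjust_replica_factor(popularity: float, current_factor: int,
                          active_nodes: int, min_factor: int = 1,
                          max_fraction: float = 0.5) -> int:
    """Hypothetical heuristic in the spirit of popularity-based replication:
    scale the replica count with access popularity (0..1), but never exceed
    a fraction of the active nodes nor drop below a minimum factor."""
    cap = max(min_factor, int(active_nodes * max_fraction))
    desired = round(min_factor + popularity * (cap - min_factor))
    # Move gradually from the current factor toward the desired one
    if desired > current_factor:
        return current_factor + 1
    if desired < current_factor:
        return current_factor - 1
    return current_factor

# Example: a hot file (popularity 0.9) on a 20-node cluster, currently 3 replicas
print(adjust_replica_factor(0.9, 3, 20))  # -> 4 (grow by one step)
```

Capping the factor by a fraction of the active nodes mirrors the storage availability limitation mentioned above.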
In 2017, Kuchaki
In 2014, Sai
In 2014, Tao
In 2020, Abbes
In 2020, He
In 2021, Khelifa
In 2020, Javidi
In 2014, Kumar
In 2016, Galen
In 2015, Sreekumar
In 2016, Bui
Replication selection approach
In 2016, Sookhtsaraei
In 2019, Shen
In 2018, Bilal
In 2018, Mansouri
In 2018, Marwa
Replication placement approach
In 2021, Ulabedin
In 2020, Salem
In 2021, Bowers
In 2020, Peng
In 2021, Fan
In 2017, Israel
In 2019, Amrith
In 2017, Qiu
In 2012, Mansouri
In 2021, Younes
In 2017, Wiese
Replication creation approach
In 2021, Javidi
Replication retirement approach
In 2018, Tos
In 2021, Mokadem
In 2020, Guo
Replication decision approach
In 2020, Nannai
In 2018, Mansouri
In 2019, Ali
In 2021, Maheshwari
In 2021, Liu
In 2020, Castro
In 2017, Tziritas
In 2019, Sheng
In 2018, Liang
In 2016, Songling
In 2019, Moin
In 2016, Galen
Replication assignment approach
In 2016, Nahir
In 2015, Sreekumar
Static and dynamic replication
In 2021, Shakarami et al. [70] gave a thorough analysis and classification of state-of-the-art data replication schemes across existing cloud computing solutions, in the form of a classical taxonomy that characterizes current schemes and discusses open challenges. The three key categories in the offered classification are data management, data auditing, and data de-duplication systems. A thorough analysis of the replication schemes emphasizes their key characteristics, including the classes they use, the type of scheme, the location of implementation, the evaluation methods, and their strengths and shortcomings. Table 2 shows the reviews on the existing models.
Review of the existing model
In 2021, Séguéla et al. [68] provided a dynamic data replication technique (DE2ARS) that changes the number of copies based on the workload and tackles difficulties with energy consumption and cost. Replication is triggered by a control chart and occurs after an initial placement. To analyse the suggested technique, the authors first contrast various parameter options and then compare DE2ARS with methods found in the literature.
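As a rough illustration of how a control chart can trigger replication decisions (the exact rule and thresholds used by DE2ARS are not specified here and are assumed):

```python
from statistics import mean, stdev

def control_chart_decision(workload_history, current_load, k=3.0):
    """Illustrative control-chart trigger (assumed form, not the exact
    DE2ARS rule): replicate when the current load breaches the upper
    control limit, retire a replica when it falls below the lower limit."""
    mu, sigma = mean(workload_history), stdev(workload_history)
    upper, lower = mu + k * sigma, mu - k * sigma
    if current_load > upper:
        return "add_replica"
    if current_load < lower:
        return "remove_replica"
    return "no_change"

# Example: a load spike well above the recent average triggers replication
print(control_chart_decision([40, 42, 38, 41, 39, 43], 80))  # -> 'add_replica'
```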
In 2021, Hamrouni et al. [23] offered a thorough examination of the data replication techniques in use in cloud systems, covering both standalone and networked clouds. They also described the critical steps of data correlation-aware techniques. In addition, they examined the characteristics of the main techniques, such as how replication problems are addressed, how providers and consumers are prioritized, how service level agreements are taken into account, how cost and economic factors are handled, and which assessment tools are used. Finally, using extensive simulations of several replication algorithms designed for standalone and networked clouds, they presented a performance study.
In 2022, Mokadem et al. [60] provided a new classification of data replication strategies in cloud systems. It also considers a number of factors that are unique to cloud settings, including (i) the profit orientation, (ii) the examined objective function, (iii) the number of tenant objectives, (iv) the cloud environment's characteristics, and (v) the assessment of economic expenses. Regarding the final criterion, the authors concentrate on the provider's financial gain and take the provider's energy use into account.
Economic profit
As a result, obtaining the maximum economic benefit at the lowest possible operating cost may not always coincide with achieving satisfactory performance. A replica is only constructed if a node that could receive the new duplicate is located, even when replication is considered (per query or per set of queries). Moreover, the provider needs to make a profit from this duplication. Before choosing to replicate (before Q is executed), the provider's economic advantage (Q_Profit) is therefore calculated. For this purpose, the provider's estimated revenues (Q_Revenues) and expenses (Q_Expenses), as indicated by Formula (1), are determined.
In order to ensure profitability for the provider when executing Q, its income must exceed its expenses when several tenants are served.
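Formula (1) is referenced above but not reproduced; a plausible reconstruction from the surrounding definitions (the exact form used in [60] may differ) is:

Q_Profit = Q_Revenues − Q_Expenses  (1)

so replication of the data accessed by Q is economically worthwhile for the provider only when Q_Revenues > Q_Expenses over the tenants it serves.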
Review on data replication models in the cloud, performance, and the maximum attainments
Review on adopted data replication models in the cloud
Each work is reviewed with respect to the data replication method adopted in the cloud, and a pictorial depiction is shown in Fig. 1. Data replication, a well-known distributed approach that delivers several copies of the same service in a coherent state, is the key mechanism utilized in the cloud for lowering user waiting time, boosting data availability, and minimizing cloud system bandwidth usage. It was observed that the ADRS method was adopted in [43], and the LALW method was exploited in [49]. LALW [49] is a novel dynamic replication approach employed for Grid environments. The LALW technique's primary objective is to assign different weights to files of differing ages; as a consequence, older access records carry progressively smaller weights.
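A minimal sketch of the age-weighting idea behind LALW is shown below; the half-life style decay factor is an assumption for illustration, not the exact rule from [49]:

```python
def lalw_file_weight(access_counts_by_interval, decay=0.5):
    """Hedged illustration of the LALW idea: accesses in recent time
    intervals count more than older ones. access_counts_by_interval is
    ordered oldest -> newest; each older interval's contribution is scaled
    down by 'decay' per interval (the exact decay rule in [49] may differ)."""
    n = len(access_counts_by_interval)
    return sum(count * (decay ** (n - 1 - i))
               for i, count in enumerate(access_counts_by_interval))

# Example: 10 old accesses contribute less than 10 recent ones
print(lalw_file_weight([10, 0, 0]))   # -> 2.5  (old accesses, heavily decayed)
print(lalw_file_weight([0, 0, 10]))   # -> 10.0 (recent accesses, full weight)
```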

Architectural diagram of data replication models in the cloud.
Further, the PEPRv2 scheme was adopted in [76], the D2RS model was exploited in [73], and the SDS algorithm was determined in [65]. The suggested SDS [65] method reduces the cost of data duplication; SDS is a positive feedback system that encourages the investigation of superior solutions by allocating additional agents to them. The EIMORM model was exploited in [13], the APER scheme was adopted in [75], the D2R-IWD model was determined in [63], the fuzzy inference system was adopted in [45], and the LRM model was employed in [72], respectively. LRM [72] is the replication manager referred to in that study. LRM's primary responsibility is to receive user queries, gather information about the cluster's data nodes, and eventually choose the best host for the blocks. LRM carries out these responsibilities in collaboration with its other components and acts as the final arbiter.
Moreover, the CSO algorithm was adopted in [50], the SPDPR-Tweaks model was employed in [4], the three-tier cloud computing data centre architecture was determined in [6], the CB-DRP model was adopted in [41], the R-PCP technique was adopted in [78], the QADR algorithm was exploited in [35], the MRR scheme was determined in [36], the PMCR scheme was exploited in [37], the optimal offline algorithm was used in [53], and the VRS-BQ technique was exploited in [2], correspondingly. In a cloud scenario, the VRS-BQ [2] replica placement strategy is utilized to reduce storage usage, response time, and replication process time. In addition, the MOABC algorithm was used in [67] and the DROPS model was adopted in [3]. Furthermore, the Mirror scheme was exploited in [22], the FRAGMENT model was adopted in [10], the CRANE scheme was adopted in [61], and the LAST-HDFS system was adopted in [7]. The LAST-HDFS [7] technology guarantees location-aware file allocations and monitors file transfers in the cloud in real time to prevent any unlawful transfers.
Moreover, other data replication models were adopted in the cloud: the MTDB-MR algorithm was deployed in [66], Quadratic HPA was exploited in [77], the BDS+ scheme was employed in [83], and the MO-PSO and MO-ACO algorithms were adopted in [5], correspondingly. In addition, the DCD and OCA algorithms were deployed in [38], the RER algorithm was employed in [15], the BVS algorithm was exploited in [30], the RSPC strategy was adopted in [59], and the RRSD model was deployed in [74], respectively. RRSD [74] makes the minimum number of copies needed, to enhance load balancing and data dependability while minimizing storage consumption.
Consequently, the Poisson stochastic process was adopted in [34], the PDR strategy was deployed in [46], the DOSN system was exploited in [16], the adaptive forecasting replication framework was employed in [56], and the BaRRS algorithm was adopted in [9]. Similarly, the DCR2S was adopted in [19], the MR-PDP scheme was deployed in [81], SaRBP was employed in [82], the TRC technique was used in [33], the FSDA method was adopted in [52], and the FT-Aurora robust cloud manager was adopted in [27]. FSDA [52] was utilized to evaluate the trade-offs between the six objectives; it is used to assess fitness and more accurately characterize the solution, and the new replica is placed in the most optimal position to decrease access time while maximizing network and resource utilization. Likewise, the weighted dynamic replication strategy, DPRS, FFTF scheme, fault-tolerant workflow scheduling algorithm, MORM strategy, SSOR architecture, replication factor modelling approach, single exponential smoothing method, BCVS algorithm, DMDR, SWORD approach, EM algorithm, replication-based load balancing scheme, DHR algorithm, HS algorithm, active machine learning framework, fuzzy logic system, ILP model, and GPR model were adopted in [1,11,20,25,26,28,31,32,39,40,44,48,51,62,64,69,79,80] and [8], respectively. To forecast future file demand, the single exponential smoothing approach [26] is utilized: it creates a smoothed time series and removes irregular and random interference. The DHR technique in [44] distributes replicas at the most suitable sites, i.e. the optimum site with the most accesses for such replicas. While many locations host replicas, it also decreases access latency by picking the optimal replica. When compared with other regression models, GPR [8] is quite accurate.
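For reference, single exponential smoothing as used for demand forecasting has the standard form s_t = α·x_t + (1 − α)·s_{t−1}; the short sketch below illustrates it with an assumed smoothing constant (the value used in [26] is not given here):

```python
def exponential_smoothing(demand, alpha=0.3):
    """Standard single exponential smoothing, commonly used to forecast
    file demand: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    alpha = 0.3 is an assumed smoothing constant, not the value from [26]."""
    smoothed = [demand[0]]
    for x in demand[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Example: a noisy access trace smoothed into a gentler trend
print(exponential_smoothing([10, 30, 12, 28, 14]))
# -> [10, 16.0, 14.8, 18.76, 17.33] (approximately)
```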
Table 3 lists the performance measures obtained from the various contributions regarding data replication models in the cloud. From Table 3, it is noted that 19 papers made a performance analysis under response time, contributing about 29.23% of the reviewed works, and the computation time was examined in 9 papers, contributing about 13.84% of the entire works. Likewise, the SEU has contributed about 12.30% (8 papers). Further, energy consumption, cost computation, execution time, and storage capacity have been adopted in 9.23% (6 papers). Moreover, the AUC and FPR have contributed about 9.23% (6 papers) of the entire contribution. In addition, the network usage and load variance have contributed about 7.69% (5 papers). Furthermore, the bandwidth, delay, and number of files have contributed about 6.15% (4 papers) of the entire contribution. Likewise, the latency, storage cost, number of replications, and transmission time have contributed about 4.61% (3 papers). On the other hand, the replication frequency, number of storage nodes, replication cost, file size, storage overhead, query size, recovery time, and completion time have been adopted in 3.07% (2 papers). Accordingly, measures like provider expenditure, execution rate, block availability, mean block unavailability, efficiency, updation time, makespan, access time, TUE, runtime, replication time, R/W ratio, size of outsourced file, replica size, number of partitions, mitigation time, detection correctness rate, number of violations, network overhead reduction, accuracy, time complexity, total provider expenditure, VM processing capability, load balance, reliability, security levels, desired level of data availability, current time point, data reduction rate, number of data centers, available probability, TPV verification time, MDS memory, request arriving rate, task failure penalty, P-value, number of data nodes, VM size, memory usage, CPU utilization, number of replica factors, hit ratio, task count, checkpoint interval, average resource usage, number of requests, failure rate, threshold error rate, number of hosts, average error, arrival rate, computational capacity, NLCD ref data, RLCD ref data, RLCD-NLCD data, SBER value, time of execution query, mapping time, replication factor, and data locality have contributed about 1.54% (1 paper) each.
Review of various performance measures based on data replication in the cloud
Maximum performance attained in the reviewed works
The maximum performance attained in each reviewed paper based on data replication in the cloud is illustrated in Table 4. The response time attained in [63] reached a best value of 100 ms, and the computation time reported in [62] had a best value of 200 ms. By using the replication-based load balancing scheme [62], the computation time is lower, with better outcomes than other techniques. Moreover, SEU obtained a best value of 20%, measured in [5], and energy consumption obtained a best value of 0.3 kWh, examined in [78], respectively. The R-PCP approach [78] built a cloud platform with computing, storage, and bandwidth capabilities; to enhance resource utilization and energy consumption, R-PCP restricts the amount of provided resources. Likewise, the cost computation and execution time attained best values of 0.025$ and 500 ms, as examined in [31] and [79]. The response speed and average response time are better when using the BCVS algorithm [31]. Similarly, storage capacity, network usage, load variance, bandwidth, delay, number of files, latency, storage cost, number of replications, transmission time, replication frequency, number of storage nodes, replication cost, and file size attained best values of 100%, 0.25 ENU, 0.17, 640 Gb/s, 2500 ms, 1500 Mb, 0.6 ms, 15.3, 2, 50 sec, 0.1, 36, 1000, and 1500 Mb, as examined in [6,22,36,38,43–45,48,49,49,59,67,67,77] and [48], correspondingly. The storage overhead, query size, recovery time, completion time, provider expenditure, execution rate, block availability, mean block unavailability, efficiency, updation time, makespan, access time, Traffic Usage Efficiency (TUE), runtime, replication time, R/W ratio, size of the outsourced file, replica size, number of partitions, mitigation time, detection correctness rate, number of violations, network overhead reduction, and accuracy attained best values of 50%, 11, 0.5 sec, 200 ms, 26.94$, 92%, 0.8, 0.1428, 6.5%, 15 sec, 0.95, 0.25 sec, 1.38%, 10 sec, 22%, 0.50, 64 Mb, 15 Gb, 4, 10 minutes, 100%, 89%, 50%, and 99%, as examined in [2,3,7,11,22,32,35–37,41,50,61,61,61,62,66,72,73,73,76–78,80] and [83]. CRANE [61] can decrease replica construction and migration time by up to 60% and inter-data center network traffic by up to 50% while maintaining the minimum necessary data availability. Measures such as time complexity, total provider expenditure, VM processing capability, load balance, reliability, security levels, desired level of data availability, current time point, data reduction rate, number of data centers, available probability, TPV verification time, and MDS memory attained values of 3.7, 95$, 1500 MIPS, 10%, 8%, 0.9, 99%, 31, 99%, 1, 0.6–0.9, 18.827 ms, and 8 G, analysed in [15,16,16,19,19,30,34,56,59,74,74,81] and [82], respectively. In comparison to other existing techniques, the experiments show that RRSD [74] can achieve better load balancing and assure data dependability. Also, request arriving rate, task failure penalty, P-value, number of data nodes, VM size, memory usage, CPU utilization, number of replica factors, hit ratio, task count, checkpoint interval, average resource usage, number of requests, failure rate, and threshold error rate were exploited in [1,11,20,25,25,27,27,27,33,51,52,52,69,82] and [1], and they acquired values of 104 per unit time, 20000, 0.05, 7, 2.5 MB, 95%, 92%, 3, 72%, 100,000, 1 hour, 41%, 30, 50%, and 9%, correspondingly.
The replication factor modelling approach [1] minimized the failure rate more than other schemes. In addition, the number of hosts, average error, arrival rate, computational capacity, NLCD ref data, RLCD ref data, RLCD-NLCD data, SBER value, time of execution query, mapping time, replication factor, and data locality attained values of 3–10, 2.72%, 6.67, 90%, 41.8%, 36.9%, 36.3%, 0.7, 0.25 s, 0.5 s, 1.22, and 73.96%, measured in [8,8,40,40,40,48,62,64,64,79,80] and [8], respectively. The EM method [64] was applied to lower the average error and improve the arrival rate. When compared with the default replication method and the second-best option (ERMS), the GPR model [8] helps to minimize mapping time. The average replication factor is better than that of the default method, as can be observed. Furthermore, lowering the thresholds improves the data locality measure.
Evaluation of adopted replication handling schemes in the cloud and chronological review
Review on replication handling in cloud storage systems
A review of major studies on replication handling in cloud storage systems is presented. As will be explained, some of the evaluated studies took a replication modelling approach, whereas others took different strategies: replication management, replication selection, replication placement, replication creation, replication retirement, replication decision, and replication assignment. Figure 2 depicts replication handling in cloud storage systems.

Pictorial representation of the review of replication handling in cloud systems.
From the review, the replication management approach was employed in [1,2,5,6,11,13,19,20,26–28,30–35,39,41,43–46,48–53,59,61,63,66,69,72,73,75,76,78,79,83] and [8], respectively. Replication placement is an important topic that is closely related to replication migration. One of the most difficult problems to solve when evaluating new replication requests is choosing the most effective location to which the duplicate should be transmitted. To reduce network congestion and ensure replica availability while keeping access time efficient, the updated placement schedule should be evaluated. Another key related problem in this group is the effective size of the replica. The incremental technique, which is a dynamic way of replicating, can be a fair guideline for replica size; the size of replicas can also be affected by the size of the target storage. In terms of replica size, replication may be regarded at the granularity of replication, which relates to the quantity of data involved in replica creation. The replication selection approach was adopted in [2,3,5,9,25,28,30–32,37,39,40,44–46,48–52,56,62–64,67,72,79,81,83] and [80]. In addition, the replication placement approach was used in [2,3,5,7,16,19,20,26,30–32,35,37–39,43–46,48–53,59,61,63,66,67,72–80,82] and [8], respectively. The replication creation approach was exploited in [4,19,20,26,27,30–33,46,48,49,52,59,61,66,69,74–76,78] and [44], and the replication retirement approach was adopted in [3,75,76] and [22], correspondingly. From the review, it is observed that the replication decision approach was used in [16,19,22,31,32,36,38–41,44–46,48–53,56,59,62–64,66,73–77,83] and [8], correspondingly. Moreover, the replication assignment approach was exploited in [15,39] and [62], correspondingly.
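As an illustration of the placement trade-offs discussed above, the following hedged sketch greedily picks a host for a new replica; the cost weights and node attributes are assumptions for illustration, not the policy of any surveyed strategy:

```python
def choose_placement(candidates, replica_size):
    """Illustrative greedy placement (assumed cost model): among nodes with
    enough free storage, pick the one with the lowest weighted combination
    of access latency and current link load."""
    feasible = [c for c in candidates if c["free_storage"] >= replica_size]
    if not feasible:
        return None  # no node can host a replica of this size
    # Lower latency and lower load are both better; the weights are assumptions.
    return min(feasible,
               key=lambda c: 0.7 * c["latency_ms"] + 0.3 * c["load_pct"])

nodes = [
    {"name": "dc-east", "free_storage": 500, "latency_ms": 20, "load_pct": 80},
    {"name": "dc-west", "free_storage": 300, "latency_ms": 35, "load_pct": 20},
    {"name": "dc-eu",   "free_storage": 50,  "latency_ms": 10, "load_pct": 10},
]
print(choose_placement(nodes, replica_size=100)["name"])  # -> 'dc-west'
```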
Data auditing schemes
Data auditing schemes are reviewed from the papers [12,14,18,21,24,65] and [17], respectively. Ramanan
Jaya
Imad
Xiang
Jing
Daniel
Gan

Representation of simulation tools used in each paper.
65 papers related to data replication in the cloud are reviewed with respect to the simulation tool used in each paper. Figure 3 shows the representation of simulation tools used in each paper. Initially, 8 papers [2,13,43,45,51,63,73] and [19] used CloudSim (Java) as the simulation tool. Further, 17 papers [5,10,16,25,26,30,31,46,48–50,52,53,59,67,75,76] used the standard CloudSim simulator. Moreover, 4 papers [11,20,79] and [72] adopted Java as the simulation tool. Moreover, 6 papers [35,39,62,64,83] and [28] adopted MATLAB as the simulation tool. In addition, 4 papers [22,41,61] and [38] adopted Python as the simulation tool. Further, 2 papers [6] and [77] adopted NS-2 as the simulation tool. Furthermore, 3 papers [9,27] and [1] used Ubuntu as the operating system. Workflows were used in 2 papers [78] and [69]. 3 papers [32,36] and [37] used Amazon cloud datasets. Further, 3 papers [7,8] and [56] used real cloud datasets. Also, 13 papers [3,4,15,33,34,40,44,65,66,74,81,82] and [80] used other simulation tools.
Research gaps and challenges
The number of individuals using cloud storage has exploded recently, because cloud storage is easier to maintain and has lower storage costs than other storage options. It also offers excellent dependability and availability and is well suited to large-scale data storage. These technologies use redundancy to ensure high availability and reliability: objects are cloned numerous times in replicated networks, with each copy stored in a distinct place in the distributed system. As a consequence, data replication poses only a minor risk to the cloud storage system from the users' point of view, while providing effective data storage remains a major difficulty for providers. Data replication enables users to view data in real time from many sources, including servers, websites, and other sources, overcoming the difficulty of ensuring consistent data availability. The procedure of storing and maintaining multiple copies of critical data across several devices is referred to as data replication. To ensure a flawless replication process, providers need to invest in a variety of hardware and software components, including CPUs, storage drives, and other components, as well as a full technical setup. Setting up a replication pipeline is required to complete the arduous work of replication without any defects, errors, or other issues. Establishing a pipeline that works correctly might take days, weeks, or even months, depending on the replication requirements and the task's complexity. Furthermore, large firms might find it difficult to remain patient and keep all stakeholders on the same page during this time. A considerable volume of data moves from the data source to the target database during replication, so a significant amount of bandwidth is required to enable a smooth flow of information and prevent data loss. Even among big enterprises, maintaining bandwidth capable of sustaining and processing enormous amounts of complex data during replication can be a difficult issue. It also necessitates investing in more manpower with a strong technological background. All of these constraints make data replication difficult, especially for large enterprises. As a consequence, the different existing data replication solutions were examined and the major challenges caused by data replication were highlighted. The goal of future work in this direction is to lower the number of replications while maintaining data availability and dependability.
Conclusion
This paper offered a complete review of data replication in the cloud. The contributions of this paper are as follows:
This paper determined the reviews in 65 papers related to data replication in the cloud.
The analysis reviewed the performance measures and the maximum achievements contributed by the different data replication schemes.
The various data handling schemes in the cloud exploited in each reviewed work were also analysed and depicted diagrammatically.
In addition, the data auditing schemes were analyzed in certain papers and the simulation tools were analysed in 65 papers.
In the end, this paper presented different research issues that were helpful for researchers in further work on data replication in the cloud.
In the future, the cost analysis of the replication can be considered.
In addition, a real-time test bed can be considered.
In order to increase provider profit, it may be possible to balance tenant volume and performance in subsequent work.
In this case, we want to demonstrate that the ‘pay as you go’ model yields the most return for the provider when it is used to serve an ideal number of tenants.
This serves as justification for the suggested strategy’s design.
To determine beforehand which data should be replicated or which replicas removed, the log of previous queries may also be considered when making these decisions.
To further minimize resource consumption, RSPC could also be assessed while employing data compression or de-duplication in the environment.
