A combination replication strategy for data-intensive services in distributed geographic information system

Abstract

A distributed geographic information system is a typical service-intensive application that has to store massive data across many storage nodes and servers for a large number of users. Because network input/output is slow, replicas can be used to improve system performance. Since the relationships among data show both long-term stability and short-term burstiness, this article proposes a comprehensive method that considers static replicas and their placement strategy together with dynamic replicas and their selection strategy, which can achieve more significant improvements. First, a general dynamic correlation representation model for all data is designed and implemented. Replica selection strategies for static copies and dynamic copies are then proposed based on these relationships. A comprehensive data placement strategy for all data and all replicas is also defined to realize load balance. Finally, the performance of the proposed method is evaluated through a series of comparative experiments. The simulation results demonstrate that the proposed algorithm can meet the requirements of distributed geographic information systems across different datasets, access modes, and data scales, and can achieve an average local storage hit ratio about 11.55%–45.22% higher than the other methods.

Keywords

Data replication, replica selection, data placement, data mining, distributed geographic information system

Introduction

As a typical data-intensive service system,¹ the geographic information system (GIS) has been widely used in many fields, especially land and resource investigation, weather forecasting and disaster prediction, and urban and road traffic planning.² Because massive datasets have to be stored across the storage nodes of a GIS and a large number of users access those data over the network, distributed GIS is an important solution for providing such services. In a distributed GIS such as Google Earth, the data are split into a large number of smaller pieces (or chunks) based on the pyramid model, and each piece, called a tile,³ is stored as a single file in a single node or in several nodes as replicas. Replication strategies, a typical information service technology widely used in many information systems,⁴ can also be used by distributed GIS to ensure data availability and data security and to improve its quality of service (QoS).

On one hand, replication strategies can be used to ensure data security in most storage systems, which distribute all stored data and their replicas across different storage nodes to realize data loss prevention (DLP).⁵ Reliable checkpoint storage strategy (RCSS) proposed a replication method to realize reliability, availability, and QoS for cloud computational grids by replicating all checkpoints.⁶ RCSS considers efficiency not only for data storage but also for data recovery when certain computing nodes fail. Proactive replica checking for reliability (PRCR)⁷ and multi-objective optimized replication management (MORM)⁸ are two other methods often used by cloud storage systems. PRCR provides a cost-effective data reliability management mechanism to ensure the reliability of massive cloud data with minimum replication, and MORM considers many more factors influencing the replication strategy and optimizes the strategy using an improved artificial immune algorithm. Meanwhile, bandwidth-availability-based (BAB) replication realizes bandwidth availability in peer-to-peer (P2P) systems,⁹ and energy-effective adaptive replication strategy (E²ARS) mainly aims to save energy for cloud storage by switching among different storage nodes.

On the other hand, because of slow disk input/output (I/O), slow network I/O, or variable network connectivity, a replication strategy can also be used to shorten the response time for users and improve the QoS of an information system. In this case, the replication strategy selects part of the data and stores it either in storage nodes which can offer high-bandwidth service or simply in high-speed cache to provide quick response to users. Cluster-based replication placement (CBRP)¹⁰ designed fixed data replicas as well as temporary data replicas to improve load balance. CBRP finds the critical value for triggering the replication strategy by computing the historical access frequency of replicas and then predicts the number of access requests for the next period of time to decide and change the number of replicas. Bandwidth hierarchy replication (BHR) and a modified version of BHR (MBHR)¹¹ proposed similar methods to minimize the access cost and to utilize network and storage resources as efficiently as possible. BHR and MBHR select the node with the highest access frequency to store a replica so as to reduce data access from remote sites. Replication strategy based on correlated patterns (RSCPs) mines data access patterns by computing their correlations in data grids and then decentralizes replicas to improve response time, reduce bandwidth consumption, and maintain reliability.¹²

Furthermore, Hamrouni et al.¹³,¹⁴ give a detailed introduction to and summary of data replication and replica selection (RS) strategies in data grids. Their results show that a replication strategy not only has to select appropriate data as replicas according to their high access frequencies but also has to find the best locations to store them because network I/O speeds differ;¹⁵ the first is called the RS strategy and the second the replica placement (RP) strategy. Thus, the key to a replication strategy is to select appropriate data based on the RS strategy and then store them in appropriate storage nodes or high-speed cache based on the RP strategy so as to respond quickly to users’ accesses, as discussed in the related works.

This article is organized as follows. Section “Related works” introduces the related works about RS strategies and RP strategies. A brand-new combination replication strategy for data-intensive services based on both data’s characteristics and storage nodes’ capabilities is presented in section “Combination replication strategy.” The results of the experiments are presented and discussed in section “Simulations and experiments.” Finally, section “Conclusion and future works” shows a conclusion of this article and discusses our next works in the future.

Related works

For RS, selecting appropriate data as replicas means finding the most suitable subset of data that will be used by applications repeatedly and simultaneously. Similar to RS, a prefetching strategy is designed to find appropriate data in advance based on applications’ requirements. Prefetching is widely used in distributed GIS, such as Google,¹⁶ networked geographic information systems (NGISs),¹⁷ and NASA,¹⁸ and has greatly improved system performance.

Hilbert curve–based prefetching (HCBP) is a typical prefetching algorithm which uses several prefetch strategies, such as the Hilbert curve strategy, to predict an application’s next request, assuming that users’ access to geospatial data has spatial locality.¹⁹ Retrospective adaptive prefetch (RAP) is another prefetching algorithm, which predicts an application’s next possible requests using a heuristic method and then selects the corresponding data as a caching replica.²⁰ RAP relies on some assumptions, such as stable application behavior (the application’s behavior will not change within a short period); otherwise, RAP starts a brand-new process. Another well-known family of methods is based on the Markov chain model, which is constructed by taking the recently accessed data as the initial state and the access probabilities of all its neighboring data as the state transition matrix. Examples of such algorithms include the basic Markov model,²¹ prefetching based on previous k movements (PKM),²² and the Zipf–Markov model.²³ The basic Markov model uses several kinds of Markov chain models to make predictions, for example, taking the browser center or the browsed data as the transfer state, or even using a high-order Markov model. PKM uses a Markov chain to predict the next objects to prefetch by monitoring the previous k movements, with a graph named the “neighbor selection Markov chain” supporting its predictions. The Zipf–Markov model also uses a Markov chain model to prefetch the optimal data by mining the characteristics of the user’s navigation path based on the Zipf distribution.

However, all the methods mentioned above consider only the neighboring data based on their current status and cannot meet the requirements of the whole system (such as load balance or quick response of the whole system), so they are mainly suited to predicting single-user behavior. Thus, some studies have been proposed to find the optimum choice among all data so as to meet the global requirements. Examples of such strategies include distributed high-speed caching based on spatial and temporal locality (DCST),²⁴ ordinary least squares (OLS),²⁵ artificial neural network (ANN),²⁶ and the Zipf model.²⁷ The Zipf model quantitatively analyzes the relationship between the total hit ratio and the size of the cache buffer and obtains an approximate formula expressing this relationship in terms of the distribution parameter of the basic Zipf’s law or a Zipf-like law. DCST also uses this approximate formula to estimate the number of hot data which will be used repeatedly and then tallies the popularities of all data so as to judge which data should be selected as replicas based on the election scheme of the US Congress. OLS and ANN are another kind of global prediction method, which make predictions based on geographic features. OLS uses a linear combination of the geographic features and an OLS regression estimator to predict the user’s next behaviors, and ANN trains a neural network to obtain a prediction model and then uses the model to prefetch data. Meanwhile, spatial–temporal attribute prediction (STAP) is designed to prefetch spatiotemporal data for smart city systems by analyzing the characteristics of historical access requests.²⁸

As another important aspect of replication strategy, placing replicas (RP) into appropriate locations has also been discussed. Dynamic computation correlation data placement (DCCP)²⁹ and access pattern–based distributed storage algorithm (APSA)³⁰ are two typical methods. DCCP places stored data which have high dynamic computation correlations in the same data center, considering not only the I/O load but also the capacity of data centers. One of the main aims of DCCP is to reduce the data transfer rate among remote storage nodes and thereby improve the performance of computations. Thus, the hot data will be stored in all storage nodes based on their relationships. APSA instead distributes stored data which have high access correlations to different data centers so as to realize concurrent access, and, similar to DCCP, distributes all hot data to different nodes so as to balance the access service.

Although the above-mentioned algorithms have achieved good results in their own fields, some disadvantages still need further consideration. On one hand, as hot topics change, some hot data which currently have high popularities will probably not be requested frequently in the future, so simply selecting those hot data as replicas cannot always meet the requirements of a dynamic system. Meanwhile, fixed data distribution strategies also cannot meet the dynamic requirements when the hotspot changes, and the large amount of data migration among storage nodes needed to adjust the data placement synchronously will severely affect the performance of GIS. On the other hand, there exist some intrinsic laws³¹,³² which can be used to find the dynamic relationships among data, and these dynamic relationships can then be used to predict which data will be requested repeatedly and which data will be requested simultaneously in the next step.³³

Based on the above analyses, our model, called the dynamic RS and RP strategy (DSP), is a new comprehensive method considering both RS and RP strategies, in which the dynamic popularities of data, developed from a prefetching method,³¹ are computed and their dynamic relationships are mined from their historical access records. Some data are then selected as static replicas and placed into storage nodes based on their long-term stable relationships, and some data are selected as dynamic replicas and placed into high-speed cache buffer based on their short-term bursty relationships.

Combination replication strategy

Preliminary concepts

Although replication strategy can be used to solve the problems of data availability, data security, and system performance improvement, algorithm strategy will be completely different according to different aims. Thus, we will mainly focus our research on system performance improvement in this article and try to find all data’s intrinsic correlations so as to place data and their replicas into storage nodes or high-speed cache to improve users’ access performance.

Because mining data correlations is the key step in our proposed strategy and determines whether appropriate data and appropriate locations can be found, data correlations should be well described and should accurately track changes in data relationships. For this reason, we give the preliminary concepts used in mining data correlations and the strategy for processing the massive dataset in distributed GIS.

Hotspot data

A large amount of data is stored in distributed GIS, and using all of those data to compute an appropriate subset as replicas is computationally intensive and makes an optimal forecast hard to obtain. However, access to data in distributed GIS is extremely imbalanced and only a small part of the data is requested repeatedly,³¹,³² so the hotspot dataset is a small subset of all data.

According to the Hotmap model,³¹ the requests to geospatial data satisfy a Zipf’s law which can be stated as follows

K_{i} = \frac{θ}{i^{α}}

(1)

where $K_{i}$ is the total number of requests to the ith geospatial data, $θ$ is a constant, and $α$ is a distribution parameter of Zipf’s law. Obviously, a different geospatial dataset will be requested by all users based on different distribution parameter $α$ , and also distribution parameter $α$ will change with the transferring of hotspot regions.
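Taking logarithms of equation (1) gives $\log K_{i} = \log θ - α \log i$ , so $θ$ and $α$ can be estimated from ranked request counts with a straight-line fit in log-log space. The following is a minimal sketch of that idea, assuming an ordinary least-squares fit; the fitting procedure used in the cited works may differ, and the function name `fit_zipf` and the synthetic counts are illustrative only.

```python
import math


def fit_zipf(counts):
    """Estimate (theta, alpha) of K_i = theta / i**alpha from request counts.

    Assumption: a plain least-squares line through (log rank, log count);
    zero counts are dropped because their logarithm is undefined.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    alpha = -slope                      # slope of the log-log line is -alpha
    theta = math.exp(mean_y - slope * mean_x)
    return theta, alpha


# Synthetic counts generated from theta = 1000 and alpha = 0.8 (illustrative values).
synthetic_counts = [1000 / i ** 0.8 for i in range(1, 201)]
theta_hat, alpha_hat = fit_zipf(synthetic_counts)
```

On real access logs the counts are noisy, so the recovered parameters are only estimates and would need to be refitted as hotspot regions move.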

First, let $D = \{d_{1}, d_{2}, \dots, d_{N}\}$ be the set of all data which will be requested by all users in distributed GIS, where $N$ is the total number of data and each element in $D$ is labeled with a natural number in $[1, N]$ . Assume that all users access geospatial data synchronously and independently based on their own interests, so that the system can record the historical data sequence chronologically as each data item is requested. Let $Q = (q_{1}, q_{2}, \dots, q_{L})$ denote the whole data access sequence, where $q_{k} \in [1, N]$ denotes the label of the kth requested data item (i.e. $q_{k} = i$ indicates that the kth requested data item is $d_{i}$ , $i = 1, \dots, N$ ) and $L$ is the total number of requests to all data.

Then for $\forall d_{i} \in D$ $(i \in [1, N])$ , we can compute its $K_{i}$ and obtain current distribution parameter $α$ based on equation (1) and Zipf’s distribution fitting.³² Furthermore, Zipf model²⁷ gives an approximate formula to estimate the number of total hotspot data based on its distribution parameter which is shown in equation (2)

N_{h} = N \times h^{1 / (1 - α)}

(2)

where $N_{h}$ is the total number of hotspot data (a subset of all data) which should be considered as replicas and placed into appropriate location, and h is the steady-state replicas’ hit ratio.
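As a worked illustration of equation (2), the sketch below computes the hotspot count for given $N$ , $h$ , and $α$ . It assumes $0 < α < 1$ (so that the exponent is positive and $N_{h} \leq N$ ) and rounds up to a whole number of data items; both choices, and the function name `hotspot_count`, are assumptions of this sketch rather than statements from the cited model.

```python
import math


def hotspot_count(total_items, hit_ratio, alpha):
    """N_h = N * h ** (1 / (1 - alpha)), rounded up (rounding is an assumption)."""
    if not 0.0 < hit_ratio <= 1.0:
        raise ValueError("hit_ratio must lie in (0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch assumes 0 < alpha < 1")
    return math.ceil(total_items * hit_ratio ** (1.0 / (1.0 - alpha)))


# Illustrative values: one million tiles, target hit ratio 0.5, alpha = 0.75.
n_hot = hotspot_count(1_000_000, 0.5, 0.75)  # exponent is 4, so N_h = 62,500
```

With these illustrative values only 6.25% of the data would need to be treated as hotspot data to reach a 50% hit ratio, which is the imbalance the hotspot concept relies on.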

Finally, after computing the popularities of all data based on $K_{i}$ $(i = 1, \dots, N)$ , the top $N_{h}$ data can be selected as the hotspot dataset. Without loss of generality, denote the hotspot dataset as $H = \{h_{1}, h_{2}, \dots, h_{N_{h}}\}$ and its corresponding data access sequence as $F = (f_{1}, f_{2}, \dots, f_{L_{h}})$ , where $h_{i} \in D$ , $f_{i} \in Q$ , and $L_{h}$ is the total number of requests to all hotspot data. It is obvious that we have $L_{h} \leq L$ and $N_{h} \leq N$ .
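The selection of $H$ and $F$ from $Q$ can be sketched as follows: count the requests per label, keep the $N_{h}$ most requested labels, and filter the access sequence down to those labels. The function name `select_hotspots` and the toy sequence are illustrative assumptions.

```python
from collections import Counter


def select_hotspots(access_sequence, n_hot):
    """Return (H, F): the n_hot most requested labels and the filtered sequence."""
    counts = Counter(access_sequence)                 # K_i for every label i
    hot_labels = [label for label, _ in counts.most_common(n_hot)]
    hot_set = set(hot_labels)
    filtered = [q for q in access_sequence if q in hot_set]
    return hot_labels, filtered


# Toy access sequence Q with labels 1..4 (illustrative data).
Q = [3, 1, 3, 2, 3, 1, 4, 1, 3]
H, F = select_hotspots(Q, 2)
```

Here label 3 is requested four times and label 1 three times, so they form the hotspot set, and the filtered sequence keeps only their requests in the original order.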

Fission

Based on the analyses above, selecting appropriate data (RS) for a replication strategy means finding the most suitable subset of data that will be used by applications repeatedly and simultaneously. Thus, if a certain data item $h_{i}$ $(i = 1, \dots, N_{h})$ is requested repeatedly in a short period, then $h_{i}$ produces one fission. Obviously, a larger number of fissions indicates a higher probability of being selected as a replica.

Conflict

To realize load balance, the data which are requested simultaneously have to be stored separately in different locations, as required by the RP strategy. Similarly, if a certain data item $h_{i}$ $(i = 1, \dots, N_{h})$ and another data item $h_{j}$ $(j = 1, \dots, N_{h}, j \neq i)$ are requested simultaneously during a short period, then $h_{i}$ and $h_{j}$ produce one conflict. Obviously, a larger number of conflicts indicates a higher probability of being placed separately.

Static replicas

The static replicas are the replicas which will be stably stored in storage nodes. The static replicas will be adjusted only after hotspot data’s long-term relationships are changed.

Dynamic replicas

The dynamic replicas are the replicas which will be temporarily stored in high-speed cache buffer. The dynamic replicas will be replaced continuously due to the changes in their short-term bursty relationships.

Based on the above preliminary concepts, the aim of DSP can be restated as obtaining a stable RP strategy for the data and their static replicas based on the long-term relationships of all hotspot data and, at the same time, a dynamic RS strategy for dynamic replicas based on their short-term bursty relationships.

Dynamic correlations’ expression mode

Without loss of generality, let $S = \{s_{1}, s_{2}, \dots, s_{M}\}$ be the set of all servers in distributed GIS and $M$ be the total number of servers. Assume that all servers provide access services continuously for all users, so that the servers together can process $M$ users’ requests simultaneously during a short period of time. Denote by $a_{k} = (f_{k 1}, f_{k 2}, \dots, f_{kM})$ the vector of labels of the data which were requested chronologically by all users at a certain moment $t_{k}$ , where $f_{kl} \in [1, N_{h}]$ $(l = 1, \dots, M)$ . Then for $\forall h_{i}, h_{j} \in H$ $(i, j \in [1, N_{h}], i \neq j)$ , their correlation $R_{k} (i, j)$ over a short period of time can be stated as follows based on the accessed data vector³¹

R_{k} (i, j) = \sum_{l = 1}^{M} \sum_{x = l + 1}^{M} (| v_{k x} (j) w_{x} - v_{k l} (i) w_{l} | \times v_{k l} (i) v_{k x} (j)) = v_{k} (i, j) \cdot W

(3)

where $v_{k} (i) = (v_{k 1} (i), v_{k 2} (i), \dots, v_{kM} (i))$ is their relation vector and $W = (w_{1}, w_{2}, \dots, w_{M})$ is their weight vector. Based on the definition in the previous section, we have the following: (1) if $f_{kl} = i$ , then $v_{kl} (i) = 1$ , otherwise $v_{kl} (i) = 0$ ; (2) $v_{k} (i, j) = (v_{k 1} (i, j), v_{k 2} (i, j), \dots, v_{kM} (i, j))$ can be computed using $v_{k} (i)$ as follows: set $v_{kx} (i, j) = 0$ , if $f_{kl} = i$ and $f_{k (l + x)} = j$ , then $v_{kx} (i, j) = v_{kx} (i, j) + 1$ ; and (3) $w_{x} > w_{x + 1}$ indicates that a shorter time distance between data $h_{i}$ and $h_{j}$ when requested simultaneously denotes a higher correlation. It is clear that $v_{k l} (i) v_{k x} (j)$ can be used to judge whether the data $h_{i}$ and $h_{j}$ are requested simultaneously and $| v_{kx} (j) w_{x} - v_{kl} (i) w_{l} |$ can be used to compute the distance between data $h_{i}$ and $h_{j}$ when they are requested simultaneously.
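To make rule (2) and the right-hand form $v_{k} (i, j) \cdot W$ of equation (3) concrete, the sketch below builds the pair relation vector for one access segment and takes its dot product with a weight vector. It follows the counting rule literally (position $x$ counts occurrences of $j$ exactly $x$ steps after $i$ ); the double-sum form of equation (3) is not reproduced. The function names and the geometric example weights are assumptions of this sketch.

```python
def relation_vector(segment, i, j):
    """v_k(i, j): entry x-1 counts pairs with label i at l and label j at l + x."""
    m = len(segment)
    v = [0] * m
    for l, label in enumerate(segment):
        if label != i:
            continue
        for x in range(1, m - l):          # distances that stay inside the segment
            if segment[l + x] == j:
                v[x - 1] += 1
    return v


def segment_correlation(segment, i, j, weights):
    """R_k(i, j) = v_k(i, j) . W for one access segment."""
    v = relation_vector(segment, i, j)
    return sum(count * weight for count, weight in zip(v, weights))


# One access segment a_k with M = 5 requests (illustrative labels).
a_k = [1, 2, 1, 3, 2]
# Decreasing weights w_1 > w_2 > ... (hypothetical values chosen for exact arithmetic).
W = [1.0, 0.5, 0.25, 0.125, 0.0625]

v_12 = relation_vector(a_k, 1, 2)            # [1, 1, 0, 1, 0]
r_12 = segment_correlation(a_k, 1, 2, W)     # 1.0 + 0.5 + 0.125 = 1.625
v_11 = relation_vector(a_k, 1, 1)            # fission of label 1: [0, 1, 0, 0, 0]
```

Calling the same function with $i = j$ yields the fission count of a single data item, which is how the diagonal terms $R (A, η, i, i)$ are used later in the article.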

Thus, considering the whole investigation time $T = (t_{1}, t_{2}, \dots, t_{λ})$ , we can get all the access segment vector $A = (a_{1}, a_{2}, \dots, a_{λ})$ , and then we have

\begin{array}{l} R (A, η, i, j) = \sum_{k = 1}^{λ} R_{k} (i, j) ξ_{k} (η) \\ = \sum_{k = 1}^{λ} ξ_{k} (η) v_{k} (i, j) \cdot W = V (A, η, i, j) \cdot W, \quad 1 \leq i, j \leq N_{h} \end{array}

(4)

where $λ$ is the total number of investigation time segment, $t_{λ}$ is the newest time segment, $a_{λ}$ is the newest access segment vector, $V (A, η, i, j) = \sum_{k = 1}^{λ} v_{k} (i, j) ξ_{k} (η)$ represents the total relation factor based on the whole investigation time $T$ , and $ξ_{k} (η)$ is a time attenuation factor indicating that older access records have lower impact of correlations³⁷ which can be stated as follows

ξ_{k} (η) = e^{- η (λ - k + 1)}, \quad 1 \leq k \leq λ

(5)

where $e$ is the base of the natural logarithm and $η$ is an attenuation coefficient. Our model thus enhances the earlier model³¹ by considering both data relationships and their timeliness: a smaller $η$ means that more access records contribute to the correlations, so a long-term relationship is obtained, whereas a larger $η$ emphasizes recent records and yields a short-term relationship. Hence, different attenuation coefficients can be used to obtain different correlations.
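The effect of $η$ in equations (4) and (5) can be illustrated numerically. The sketch below weights per-segment correlations $R_{k} (i, j)$ by $ξ_{k} (η)$ and compares a pair that was correlated only in the oldest segment with a pair correlated only in the newest one. The function names, the per-segment scores, and the two values of $η$ are hypothetical.

```python
import math


def attenuation(k, lam, eta):
    """xi_k(eta) = exp(-eta * (lam - k + 1)) for segment k of lam (1-based)."""
    return math.exp(-eta * (lam - k + 1))


def total_correlation(segment_scores, eta):
    """R(A, eta, i, j) = sum_k xi_k(eta) * R_k(i, j); scores are oldest first."""
    lam = len(segment_scores)
    return sum(
        attenuation(k, lam, eta) * score
        for k, score in enumerate(segment_scores, start=1)
    )


old_pair = [4.0, 0.0, 0.0]      # strongly correlated, but only in the oldest segment
recent_pair = [0.0, 0.0, 1.0]   # weakly correlated, but in the newest segment

long_term = (total_correlation(old_pair, 0.01), total_correlation(recent_pair, 0.01))
short_term = (total_correlation(old_pair, 2.0), total_correlation(recent_pair, 2.0))
```

With the small coefficient the old pair still ranks higher, while with the large coefficient the recent pair dominates, which matches the long-term versus short-term reading of $η$ .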

Static replicas’ selection and data placement strategies

Based on the definition of fission, equations (4) and (5) can be used to compute the total fissions for all hotspot data H. Thus, we can compute and obtain all data fissions based on a long time $T$ as well as a small attenuation coefficient $η$ , so as to get a long-term stability data fission strategy to avoid transferring data among storage nodes frequently which can be stated as follows

R_{f} (A, η, H) = (R (A, η, 1, 1), R (A, η, 2, 2), \dots, R (A, η, N_{h}, N_{h}))

(6)

from which we can find the largest $N_{s}$ elements so as to select their corresponding data as the static replicas, where $N_{s}$ can be decided based on the capacity of all storages and the size of each data.
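Selecting the static replicas from the fission vector of equation (6) amounts to a top- $N_{s}$ selection. A minimal sketch, assuming the fissions are already available as a mapping from label $i$ to $R (A, η, i, i)$ ; the function name and the numeric values are illustrative.

```python
import heapq


def select_static_replicas(fissions, n_static):
    """Return the n_static labels with the largest long-term fission values."""
    return heapq.nlargest(n_static, fissions, key=fissions.get)


# Hypothetical long-term fission values R(A, eta, i, i) for four hotspot data items.
fission_values = {1: 3.2, 2: 0.4, 3: 5.1, 4: 1.0}
static_replicas = select_static_replicas(fission_values, 2)
```

How $N_{s}$ itself is derived from storage capacity and data size is left as in the text; the sketch takes it as an input.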

For simplicity, assume that each static replica has only one copy. For example, if data $h_{i}$ is one of the static replicas, then exactly one copy of $h_{i}$ will be placed in one of the storage nodes (more copies can be handled in the same way). The total number of hotspot data items and replicas which should be stored separately in different storage nodes is then $N_{a} = N_{h} + N_{s}$ , where $N_{h}$ is the total number of hotspot data and $N_{s}$ is the total number of static replicas.

Let $H_{a} = \{h_{1}, h_{2}, \dots, h_{N_{h}}, h_{1}^{s}, h_{2}^{s}, \dots, h_{N_{s}}^{s}\}$ be the set of all data which have to be placed. Then give a new label to each static replica and relabel all data in $H_{a}$ as $H_{a} = \{h_{1}, h_{2}, \dots, h_{N_{a}}\}$ . Based on the definition of conflict and equation (6), we have

R_{c} (A_{a}, η, H_{a}) = {(R (A_{a}, η, i, j))}_{N_{a} \times N_{a}}, \quad 1 \leq i, j \leq N_{a}

(7)

where $R_{c} (A_{a}, η, H_{a})$ is a conflict matrix and $A_{a}$ is the relabeled whole access segment vector which can be easily transferred from $A$ as follows: $\forall a_{k} = (f_{k 1}, f_{k 2}, \dots, f_{kM})$ , if $h_{f_{kl}}$ is a static replica and $f_{k (l + x)} = f_{kl}$ and $f_{k (l + i)} \neq f_{kl}$ $(i < x)$ , then set $f_{k (l + x)}$ as its static replica’s label.

Data placement strategy is to place all data into all storage nodes so as to realize load balance. Thus, we have to place the data which will be accessed simultaneously in different nodes to improve concurrency and also place the data which will be accessed repeatedly in different nodes to balance the system load. Let $P = \{p_{1}, p_{2}, \dots, p_{M}\}$ denote the resulting RP strategy, where $p_{i} = {(p_{mn} (i))}_{N_{a} \times N_{a}}$ gives the details about which data will be placed in storage node $s_{i}$ , and we have

p_{mn} (i) = \begin{cases} 1 & h_{m}, h_{n} \in s_{i}, \; m < n \\ 0 & \text{otherwise} \end{cases} \quad m, n \in [1, N_{a}], \; i \in [1, M]

(8)

which indicates whether data $h_{m}$ and $h_{n}$ are both placed in storage node $s_{i}$ . Obviously, each $p_{i}$ is an upper triangular matrix.

Based on the RP strategy $P$ and equation (7), the total average conflicts of each storage node can be computed as follows

\begin{array}{l} R_{c} (A_{a}, η, H_{a}, p_{i}) = \frac{\sum_{m = 1}^{N_{a}} \sum_{n = 1}^{N_{a}} p_{m n} (i) R (A_{a}, η, m, n)}{{‖ p_{i} ‖}_{0}} \\ = \frac{p_{i} \cdot R_{c} (A_{a}, η, H_{a})}{{‖ p_{i} ‖}_{0}}, \quad i \in [1, M] \end{array}

(9)

and then the total average conflicts of all storage nodes can be stated as follows

\begin{array}{l} R_{c} (A_{a}, η, H_{a}, P) = \sum_{i = 1}^{M} R_{c} (A_{a}, η, H_{a}, p_{i}) \\ = \sum_{i = 1}^{M} \frac{p_{i} \cdot R_{c} (A_{a}, η, H_{a})}{{‖ p_{i} ‖}_{0}}, \quad i \in [1, M] \end{array}

(10)
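Equations (9) and (10) can be evaluated directly for a candidate placement. In the sketch below a placement is represented as one set of data labels per storage node instead of the indicator matrices $p_{i}$ ; the co-located pairs with $m < n$ are exactly the nonzero entries of $p_{i}$ , so their number equals ${‖ p_{i} ‖}_{0}$ . Returning 0 for a node with fewer than two data items is an assumption of this sketch, since equation (9) is undefined in that case. Labels are 0-based here for list indexing, and the conflict values are hypothetical.

```python
from itertools import combinations


def node_average_conflict(items, conflict):
    """Equation (9): mean conflict over the co-located pairs (m < n) of one node."""
    pairs = list(combinations(sorted(items), 2))
    if not pairs:
        return 0.0                       # assumption: no pairs means no conflict
    return sum(conflict[m][n] for m, n in pairs) / len(pairs)


def total_average_conflict(placement, conflict):
    """Equation (10): sum of the per-node average conflicts."""
    return sum(node_average_conflict(items, conflict) for items in placement)


# Hypothetical symmetric conflict matrix R_c for four data items (labels 0..3).
R_c = [
    [0, 5, 1, 0],
    [5, 0, 0, 2],
    [1, 0, 0, 4],
    [0, 2, 4, 0],
]
co_located = [{0, 1}, {2, 3}]    # highly conflicting data share a node
separated = [{0, 3}, {1, 2}]     # highly conflicting data are split apart

conflict_co_located = total_average_conflict(co_located, R_c)
conflict_separated = total_average_conflict(separated, R_c)
```

The second placement keeps the strongly conflicting pairs (0, 1) and (2, 3) on different nodes and therefore has no within-node conflict in this toy example.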

Since the RP strategy is to find a strategy to realize load balance, the problem can be transferred to find a placement strategy minimizing their difference of average conflicts as much as possible, which can be represented as follows

\begin{matrix} P^{*} = \underset{P}{\arg\min} \left( \sum_{i = 1}^{M} | R_{c} (A_{a}, η, H_{a}, p_{i}) - R_{c} (A_{a}, η, H_{a}, P) | \right) \\ \text{s.t.} \quad {‖ p_{i} ‖}_{0} \leq C_{i} \quad \text{and} \quad \sum_{i = 1}^{M} {‖ p_{i} ‖}_{0} = N_{a} \end{matrix}

(11)

where $C_{i}$ indicates the capacity of storage node $s_{i}$ . Obviously, this is a multi-objective optimization problem, which can be solved by concentrating the small elements along the diagonal using the Reverse Cuthill–McKee (RCM) algorithm,³⁴,³⁵ after which a heuristic algorithm can be used to determine a reasonable solution using a locally approaching search method.³⁰ The pseudo code of the multi-objective optimization procedure based on the locally approaching search method is given in Table 1.

Table 1.

The pseudo code of multi-objective optimization algorithm procedure.

Algorithm: Find a reasonable solution using locally approaching search method
Input: Conflict matrix $R_{c}$ and capacity $C_{i}$ of storage node $s_{i}$
Output: The replica placement strategy P
1. Find the largest element $r_{max}$ of the conflict matrix $R_{c}$ ;
2. Convert the matrix: $R'_{c} = r_{max} - R_{c}$ ;
3. Compute the degree of each data node in $R'_{c}$ ;
4. Find the minimal degree $d_{min}$ and the maximum degree $d_{max}$ and then select the strategy dataset V, where the degree of each data v in V satisfies $d_{v} \leq (d_{min} + d_{max}) / 2$ ;
5. for each data v in V
6.     Get a new matrix started with vertex v;
7.     Compute the matrix bandwidth of this new matrix;
8.     if this bandwidth is smaller than that of the previous matrix
9.         Note down the new permutation started with v;
10.     end if
11.     Repeat (5) to traverse the strategy set V and export the corresponding matrix ${R ″}_{c}$ ;
12. end for
13. for each row $Row (i)$ of ${R ″}_{c}$
14.     if Bandwidth( $Row (i)$ ) $\leq C_{i}$ // Check the bandwidth of this row
15.         Set its corresponding data as the elements of $P_{i}$ ;
16.     else
17.         Get the upper triangular matrix $U_{i}$ from ${R ″}_{c}$ ;
18.         while the number of elements in $P_{i}$ is less than $C_{i}$
19.             Find the largest element in $U_{i}$ ;
20.             Set its corresponding data as one element of $P_{i}$ ;
21.             Delete the corresponding element from $U_{i}$ and update $U_{i}$ ;
22.         end while
23.     end if
24.     Delete $C_{i}$ rows from ${R ″}_{c}$ and update ${R ″}_{c}$ ;
25. end for
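Two building blocks of Table 1 are simple enough to sketch in isolation: the conversion in steps 1 and 2, and the matrix bandwidth used in steps 7 and 8 to compare candidate orderings. The sketch below is not the full procedure; it assumes the standard definition of bandwidth (the largest index distance between two data items joined by a nonzero off-diagonal entry under a given ordering), and the function names and example matrix are illustrative.

```python
def invert_conflicts(conflict):
    """Steps 1-2 of Table 1: R'_c = r_max - R_c, applied element by element."""
    r_max = max(max(row) for row in conflict)
    return [[r_max - value for value in row] for row in conflict]


def bandwidth(matrix, order=None):
    """Largest |pos(a) - pos(b)| over nonzero off-diagonal entries matrix[a][b]."""
    n = len(matrix)
    order = list(range(n)) if order is None else list(order)
    position = {node: pos for pos, node in enumerate(order)}
    return max(
        (
            abs(position[a] - position[b])
            for a in range(n)
            for b in range(n)
            if a != b and matrix[a][b] != 0
        ),
        default=0,
    )


# Hypothetical sparse symmetric matrix: nonzero entries link 0-2, 1-2 and 1-3.
example = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
bandwidth_before = bandwidth(example)                      # natural order 0, 1, 2, 3
bandwidth_after = bandwidth(example, order=[0, 2, 1, 3])   # a reordering
```

The reordering brings every linked pair next to each other, reducing the bandwidth from 2 to 1. In practice an existing routine such as SciPy's `scipy.sparse.csgraph.reverse_cuthill_mckee` could supply a candidate ordering instead of enumerating start vertices by hand.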

For the static replicas with largest fissions, the data and its corresponding static replica (which is relabeled as new label) will also have the largest conflicts. Therefore, based on the optimal data placement strategies, the data and its corresponding static replica will be placed separately into different storage nodes and the data which have higher conflicts will also be placed separately into different storage nodes.

Dynamic replicas’ selection strategy

Due to the slow disk I/O speed (when getting data from the local storage node) or network I/O speed (when getting data from remote storage nodes), the dynamic replica strategy is to find the data which will be accessed immediately or in the near future when a certain data item is being requested, and then select them as copies and store them in high-speed cache in advance so as to improve access performance.

Similar to static replicas’ selection strategy, we can compute and obtain all data’s conflicts based on a short time $T$ as well as a large attenuation coefficient $η$ , so as to get a short-term bursty correlation to find the most appropriate data which will be accessed immediately in the next step when a certain data $h_{i}$ are being requested, which can be stated as follows

R_{d} (A, η, i) = (R (A, η, i, 1), R (A, η, i, 2), \dots, R (A, η, i, N_{h}))

(12)

from which we can find the largest elements so as to select their corresponding data as the dynamic replicas when data $h_{i}$ are being requested.
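The selection from the row vector of equation (12) can be sketched as a ranking of candidate labels by their short-term correlation with the data item currently being requested. The sketch assumes that the current item itself and items with zero correlation are skipped, and that the number of prefetched items `k` is a tunable parameter; these choices, the function name, and the numeric values are assumptions rather than details given in the article.

```python
def select_dynamic_replicas(correlation_row, current, k=1):
    """Return up to k labels j with the largest R(A, eta, current, j)."""
    candidates = {
        j: r for j, r in correlation_row.items() if j != current and r > 0
    }
    return sorted(candidates, key=candidates.get, reverse=True)[:k]


# Hypothetical short-term correlations R_d(A, eta, 1) while data item 1 is requested.
row_for_item_1 = {1: 0.9, 2: 0.1, 3: 0.7, 4: 0.0}
prefetch = select_dynamic_replicas(row_for_item_1, current=1, k=2)
```

The returned labels are the candidates to copy into the high-speed cache before they are actually requested.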

Simulations and experiments

Simulations’ design

To illustrate the performance of the algorithm proposed in this article, a typical earth observation system called GlobeSIGht,³² similar to Google Earth and NASA World Wind, is designed. The application uses SRTM90 data (the 90-m resolution global terrain data files from the Shuttle Radar Topography Mission) for terrain fly-through, where each SRTM90 data file is about 44 kB and the cache size of each node is about 200 MB–2 GB.²⁴ Thus, each node can cache about 4000–40,000 data files, which can be used to guide the selection of cache buffer size in simulations.

The access sequence has two parts: one is used for training, that is, to compute the dynamic correlations’ expression mode for the RS and RP strategies, and the other is used to validate the model. So that enough information is available to mine data relationships accurately, the training data account for 20% of the whole access sequence and the testing data for the remaining 80%.

For simplicity, each server has one local storage node, and all of them are connected by a 100 Mbps switched Ethernet network. The distribution parameter $α$ can be computed from the training data, and the steady-state replica hit ratio $h$ can be counted dynamically during the simulation, so the hotspot dataset can be easily selected based on equation (2). Moreover, some studies show that users’ access to distributed GIS follows a Poisson distribution,³⁶ so we also assume that the request arrival rate obeys a Poisson distribution in the simulation. Since the half-Gaussian distribution is a typical attenuation distribution, the simulation uses it to assign matching weights to different access records according to their access time distance. All parameters use the above fixed values in the following comparative experiments. Furthermore, the total replication ratio is decided by the cache buffer size, and based on the above analyses, the total replication ratio can be selected as 5%–50%.
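The two distributional assumptions of the simulation can be sketched as follows: half-Gaussian weights that decrease with access distance, and request arrival times whose inter-arrival gaps are exponential, which is the standard way to generate a Poisson arrival process. The spread `sigma`, the arrival rate, the seed, and the function names are hypothetical; the article does not state the values it used.

```python
import math
import random


def half_gaussian_weights(m, sigma):
    """w_x = exp(-x^2 / (2 sigma^2)) for x = 1..m, so that w_x > w_{x+1}."""
    return [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(1, m + 1)]


def poisson_arrival_times(rate, count, seed=0):
    """Arrival times of a Poisson process: cumulative exponential gaps."""
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        now += rng.expovariate(rate)     # mean gap is 1 / rate
        times.append(now)
    return times


weights = half_gaussian_weights(5, sigma=2.0)
arrivals = poisson_arrival_times(rate=10.0, count=100, seed=1)
```

A fixed seed keeps a simulated access sequence reproducible across the compared strategies.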

All contrasting experiments are measured as the average local storage hit ratio (LSHR) which represents the average response speed of all servers, where local storage data include dynamic replicas and static replicas which can be obtained with a smaller access cost than that of obtaining data from remote storage nodes by network.

Meanwhile, the contrasting experiments will be made among pure active dynamic copy (ADC) strategies (such as DCST²⁴ and STAP²⁸ algorithms), pure passive dynamic copy (PDC) strategy (such as least recently used (LRU) which is widely used by distributed GIS¹⁶), pure static copies and data placement (SCP) strategies (such as DCCP algorithm²⁹), and DSP strategy which is proposed in this article. ADC and PDC algorithms will store all data in all storage nodes randomly and then ADC will obtain and store dynamic replicas in high-speed cache according to the behaviors of the users, but PDC will only save the currently accessed data and never prefetch data from storage nodes proactively. SCP algorithm will store related data in different storage nodes in advance and then obtain data from local storage node or remote storage nodes based on their locations. SCP algorithm will never use dynamic replicas. DSP will also store related data in different storage nodes and then synchronously predict and obtain dynamic replicas from storage node in advance when a certain data are being requested. Due to the limited cache buffer size, all ADC, PDC, and DSP methods will use LRU strategy to delete data from high-speed cache buffer so as to save cache space.
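The LRU eviction shared by ADC, PDC, and DSP, together with the LSHR measurement, can be sketched as below. The sketch assumes that every requested data item passes through the cache and that a request counts as a local hit when the item is found either in the cache or in the server's local storage set; the class and function names and the toy request stream are illustrative, not the simulator used in the experiments.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def access(self, key):
        """Return True on a cache hit; on a miss, insert the key and evict if full."""
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return True
        self._items[key] = None
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # drop the least recently used key
        return False


def local_hit_ratio(requests, local_store, cache):
    """Fraction of requests served from the cache or the local storage node."""
    hits = 0
    for key in requests:
        in_cache = cache.access(key)
        if in_cache or key in local_store:
            hits += 1
    return hits / len(requests) if requests else 0.0


# Toy request stream for one server that stores data item 3 locally.
ratio = local_hit_ratio([1, 2, 1, 3, 1, 4, 2], local_store={3}, cache=LRUCache(2))
```

In this toy stream three of the seven requests are served locally (two cache hits on item 1 and one local-storage hit on item 3).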

Experimental results and discussion

To illustrate the performance of the proposed algorithm, which considers both the static and the dynamic replication strategy as well as the RP strategy, contrasting experiments are conducted among SCP, PDC, DSP, and ADC, each scheduled according to its own strategy.

First, Figure 1 gives the contrasting performance of SCP, PDC, DSP, and ADC, measured as the average LSHR, where the number of servers is 10, the dynamic replication ratio of hotspot data is 12%, and the static replication ratio of hotspot data is 10%. In this experiment, all users' requests are distributed evenly across the servers, each server independently checks whether the requested data are stored in its local storage node or high-speed cache, and the average LSHR over all servers is then calculated for each minute.

Figure 1.

Comparative LSHRs obtained from different replication algorithms.

Also, their average LSHR and average response time can be calculated and are shown in Table 2, where the size of each SRTM90 data is about 44 kB.

Table 2.

Comparative performance of different replication algorithms.

Algorithms	SCP	PDC	DSP	ADC
LSHRs (%)	44.27	34.01	49.39	42.15
Response time (ms)	2.187	2.317	1.815	2.033

SCP: strategy based on correlated pattern; PDC: passive dynamic copy; DSP: dynamic replica selection and replica placement strategy; ADC: active dynamic copy; LSHR: local storage hit ratio.

As shown in Figure 1 and Table 2, the performance of all algorithms remains stable throughout the experiment. DSP achieves the best performance: its LSHR is about 11.55%–45.22% higher than that of the others, owing to the contributions of the RS and RP strategies. Also, its average response time is reduced by about 11.98%–27.63%.
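The LSHR range quoted above can be checked against the values in Table 2 with a short calculation. The sketch below assumes that the percentages are relative improvements of DSP over each other algorithm; the variable and function names are introduced here for illustration.

```python
# Average LSHR (%) taken from Table 2.
lshr = {"SCP": 44.27, "PDC": 34.01, "DSP": 49.39, "ADC": 42.15}


def relative_gain(new, base):
    """Relative improvement of new over base, in percent."""
    return 100.0 * (new - base) / base


# Relative LSHR gain of DSP over each of the other algorithms.
gains = {
    name: relative_gain(lshr["DSP"], value)
    for name, value in lshr.items()
    if name != "DSP"
}
smallest = min(gains, key=gains.get)  # algorithm closest to DSP
largest = max(gains, key=gains.get)   # algorithm farthest from DSP
```

Under this reading, the smallest gain is over SCP and the largest over PDC, and the two values are close to the reported bounds; small deviations are expected because Table 2 lists rounded averages.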

In this case, PDC obtains the lowest performance because it relies on a single data placement strategy and cannot cope with the short-term bursty access characteristics of distributed GIS. SCP and ADC achieve almost the same performance in this situation, with a difference of less than 5.03%. On the one hand, the static replicas and data placement strategy used by SCP can capture most of the data that will be requested repeatedly when the replication ratio is very small; on the other hand, the dynamic replication strategy used by ADC has difficulty deciding which data should be stored in high-speed cache when the cache buffer is small.

As the static or dynamic replication ratio increases, the performance difference between SCP and ADC changes and widens because of their different strategies, as described in the next set of experiments.

Obviously, different sizes of static replication ratio or dynamic replication ratio will provide different sizes of spaces to store data replicas. Thus, two contrasting experiments are constructed based on different ratios of static replication and dynamic replication to demonstrate the performance advantages of the proposed algorithm, and the contrasting experiment results are separately shown in Figures 2 and 3.

Figure 2.

Comparative LSHRs obtained from different dynamic replication ratios.

Figure 3.

Comparative LSHRs obtained from different static replication ratios.

Figure 2 gives the contrasting performance of the different algorithms, where the dynamic replication ratio varies from 6% to 42% and the static replication ratio is fixed at 5%. In this experiment, the performance of PDC, DSP, and ADC improves with a larger dynamic replication ratio, whereas the performance of SCP does not change as the dynamic replication ratio increases.

PDC, DSP, and ADC all use high-speed cache to store dynamic replicas, and a larger dynamic replication ratio means that more hotspot data can be stored in high-speed cache, which gives a higher probability of hitting the local storage node. SCP never uses high-speed cache to store dynamic replicas, so its performance remains stable throughout the experiment.

Further analysis indicates that the active dynamic replication strategies (ADC and DSP) achieve better performance than the passive dynamic replication strategy (PDC) under the same conditions, because active strategies can predict the future behaviors of users and prepare data for them in advance. Furthermore, DSP closely tracks the behaviors of users, so it achieves the best performance even when the dynamic replication ratio is very small. The performance advantage of DSP gradually shrinks when the dynamic replication ratio is large enough, because the accuracy of predictions matters less when most of the hotspot data can be stored in high-speed cache. At the same time, ADC only saves part of the hotspot data as dynamic replicas in order to save cache space, so further increasing the dynamic replication ratio brings no additional benefit for ADC once the ratio is already large enough.

Similarly, Figure 3 gives the contrasting performance of the different algorithms, where the static replication ratio varies from 1% to 10% and the dynamic replication ratio is fixed at 12%. As shown in Figure 3, the performance of SCP and DSP improves with a larger static replication ratio, whereas the performance of PDC and ADC does not change as the static replication ratio increases.

An increase in the static replication ratio leads to a noticeable performance improvement for SCP when the ratio is small, but the change for DSP is less pronounced. As mentioned above, access to data in distributed GIS is extremely imbalanced and only a small part of the data is requested repeatedly,³¹,³² so a small number of replicas can contribute substantially to the average LSHR. For DSP, however, those replicas have already been copied by its dynamic replication strategy, so a small number of static replicas cannot strongly affect its performance. Figure 3 also shows that the performance of DSP improves noticeably as the static replication ratio continues to grow.

Meanwhile, to check the performance of all the above-mentioned algorithms under different numbers of server centers, a contrasting experiment is conducted, and the result is shown in Figure 4, where the dynamic replication ratio is 12%, the static replication ratio is 5%, and the number of server centers varies from 2 to 20.

Figure 4.

Comparative LSHRs obtained from different server centers’ number.

Similar to the above analysis, DSP retains the best performance, but the performance of all algorithms decreases as the number of server centers increases. More server centers mean that less data are stored in each local storage node. Thus, more data have to be obtained from remote storage nodes over the network, and the average LSHR is inevitably reduced. Figure 4 indicates that the active dynamic replication strategies, DSP and ADC, achieve a good average LSHR and also degrade more slowly than the other methods when the number of server centers exceeds 10, and that the performance differences widen as the number of server centers increases. The simulation results on distributed GIS support the effectiveness of the proposed algorithm and show that it can be used in large-scale distributed GIS, where it gains further performance advantages.

Since distributed GIS is a typical data-intensive services system, many kinds of datasets with different scales are stored in a distributed manner across its storage nodes. The proposed method is designed for all kinds of geospatial datasets and can meet, and automatically adapt to, their different requirements. Thus, two contrasting experiments are conducted on different datasets and on different scales of dataset to demonstrate the adaptability and the performance advantages of the proposed algorithm, and the results are shown in Figure 5 and Table 3, respectively.

Figure 5.

Comparative LSHRs obtained from different datasets.

Figure 5 gives the contrasting performance of SCP, PDC, DSP, and ADC, all measured as the average LSHR, where the number of servers is 10, the dynamic replication ratio of hotspot data is 12%, and the static replication ratio of hotspot data is 7%. The dataset in the right column is NLT Landsat-7³² and the dataset in the left column is SRTM90; these are two typical geospatial terrain datasets in distributed GIS. The access mode of NLT Landsat-7 is more concentrated than that of SRTM90, with a degree of access concentration about 29.5% higher. Also, the access mode of the same dataset changes when the hotspot regions change, so the access mode of SRTM90 used in Figure 5 is also more concentrated than that used in Figure 1, with a degree of access concentration about 25% higher.

Thus, the above-mentioned contrasting experiment and the first contrasting experiment can be combined to demonstrate the adaptability of the proposed algorithm to different datasets and different access modes. As in the first experiment, DSP achieves the best performance across the different datasets and access modes. A more concentrated access mode implies less hotspot data, so a smaller replication ratio is enough to obtain sufficiently high performance. As shown in Figures 5 and 1, owing to the different access modes, the performance of DSP can be further improved by about 33.4% and 21.2%, respectively, as the access concentration increases.

Meanwhile, Table 3 gives a contrasting experiment result to check the performance of all the above-mentioned algorithms under different scales of dataset, where the dynamic replication ratio of hotspot data is 12% and the static replication ratio of hotspot data is 5%.

Table 3.

Comparative performance of different replication algorithms.

Data size (N × 50,000)	Algorithms’ LSHR (%)
Data size (N × 50,000)	SCP	PDC	DSP	ADC
N = 1	35.30	34.01	45.69	42.09
N = 2	35.75	34.49	46.10	42.41
N = 3	35.95	34.68	46.30	42.56
N = 4	36.08	34.81	46.41	42.66

LSHR: local storage hit ratio; SCP: strategy based on correlated pattern; PDC: passive dynamic copy; DSP: dynamic replica selection and replica placement strategy; ADC: active dynamic copy.

As shown in Table 3, the performance of all algorithms remains stable throughout the experiment, and the performance change rate is less than 1% when the scale of the dataset is doubled. Also, DSP retains the best performance at every scale, which shows that the proposed method is reliable and can work in large-scale distributed GIS.

Furthermore, it appears that dynamic replication selection and replacement will lead to additional disk I/O, and Table 4 gives a contrasting experiment result to check the total disk access ratio of all the above-mentioned algorithms under different dynamic replication ratios, where the static replication ratio of hotspot data is 4% and the number of servers is 10.

Table 4.

Comparative disk access ratio obtained from different replication algorithms.

Dynamic replication ratio (%)	Algorithms’ total disk access ratio (%)
Dynamic replication ratio (%)	SCP	PDC	DSP	ADC
6	62.16	71.87	57.60	56.08
18	62.16	52.10	42.65	43.27
24	62.16	45.42	35.77	38.73
36	62.16	34.67	25.55	28.27
42	62.16	30.16	22.16	28.27
48	62.16	26.03	19.11	28.26

SCP: strategy based on correlated pattern; PDC: passive dynamic copy; DSP: dynamic replica selection and replica placement strategy; ADC: active dynamic copy.

As shown in Table 4, ADC achieves a slightly better disk access ratio than the other algorithms when the dynamic replication ratio is small, but the advantage is very limited. With a small cache buffer (i.e. a low dynamic replication ratio), DSP must continually update the cached data to obtain a higher LSHR and speed up the service. Compared with Figure 2, SCP and PDC also need to read data from remote disks because of their lower LSHR. When the cache buffer is large enough, however, DSP achieves the best disk access ratio, because it obtains a very high LSHR and, owing to its precise dynamic replica selection algorithm, no longer needs additional dynamic prefetching.
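The crossover between ADC and DSP can be read directly from Table 4. The following sketch simply looks up, for each dynamic replication ratio, the algorithm with the lowest total disk access ratio; the dictionary layout and variable names are chosen here for illustration.

```python
# Total disk access ratio (%) from Table 4, keyed by dynamic replication ratio (%).
table4 = {
    6:  {"SCP": 62.16, "PDC": 71.87, "DSP": 57.60, "ADC": 56.08},
    18: {"SCP": 62.16, "PDC": 52.10, "DSP": 42.65, "ADC": 43.27},
    24: {"SCP": 62.16, "PDC": 45.42, "DSP": 35.77, "ADC": 38.73},
    36: {"SCP": 62.16, "PDC": 34.67, "DSP": 25.55, "ADC": 28.27},
    42: {"SCP": 62.16, "PDC": 30.16, "DSP": 22.16, "ADC": 28.27},
    48: {"SCP": 62.16, "PDC": 26.03, "DSP": 19.11, "ADC": 28.26},
}

# Algorithm with the lowest disk access ratio at each dynamic replication ratio.
best = {ratio: min(row, key=row.get) for ratio, row in table4.items()}

# Smallest tabulated ratio at which DSP has the lowest disk access ratio.
dsp_crossover = min(ratio for ratio, name in best.items() if name == "DSP")
```

In the tabulated data, ADC has the lowest disk access ratio only at the smallest dynamic replication ratio, and DSP has the lowest value from the next tabulated ratio onward.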

Conclusion and future works

Instead of reading data from remote storage on the fly, static replicas based on the static replication strategy store hotspot data in more storage nodes, and dynamic replicas based on the dynamic replication strategy store the data that will be used immediately in high-speed cache. Together they improve the average LSHR and thus shorten the response time of users' access.

This article proposed an enhanced combined replication algorithm based on users' access behaviors, which imply both the correlations among data and their popularities.³¹ The proposed method, called DSP, considers both the static replication strategy and the data placement strategy (RP) to store static replicas and all data across the storage nodes so as to improve load balance, and it considers both the dynamic replication strategy and the replica selection strategy (RS), based on predictions of users' future behaviors, to prepare data in advance.

Also, the performance of the proposed method has been proved through a series of comparative experiments, and the simulation results demonstrate that the proposed algorithm can meet the requirements of distributed GIS in all aspects, including different datasets, different access modes, and different data scales, and can achieve an average LSHR of about 11.55%–45.22% higher than the other methods.

Since dynamic replica selection and replacement lead to additional data access from disk, three strategies can be used to reduce the negative effect on service response time in a real system: (1) replacing dynamic replicas can be scheduled in parallel with data transfer to hide disk access latency; (2) a passive dynamic replication strategy can be used to reduce the amount of data prefetching when the cache buffer is small; and (3) the ratio between dynamic and static replication can be adjusted dynamically according to the requirements on disk access performance and user response time, since a high static replication ratio incurs little additional disk access but also yields a lower LSHR.

Moreover, as mentioned above, the proposed method needs a sufficiently long access sequence to train the correlation representation model before the model can be used for prediction. Such a sequence is not available when a distributed GIS has just been started. Thus, a composite method that uses only the current status to make predictions at the beginning, and then mines the relationships among data from the access sequence once enough information has been obtained, would be more effective and will be considered in our future work.

Footnotes

Acknowledgements

The authors thank Dr Hang Zhang for providing some experimental data.

Academic Editor: Pierre Leone

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the National Natural Science Foundation of China (grant nos 41671382, 41271398, and 61572372), LIESMARS Special Research Funding, and the Fund of SAST (project no. SAST201425). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the article.

References

1. Yang, Huang. Using spatial principles to optimize distributed computing for enabling physical science discoveries. Proc Natl Acad Sci USA 2011; 106(14): 5498–5503.

2. Goodchild. Geographical information science. Int J Geogr Inf Syst 1992; 6(1): 31–45.

3. Jarukasemratana, Murata. Web caching replacement algorithm based on web usage data. New Generat Comput 2013; 31(4): 311–329.

4. Benoit, Rehn-Sonigo, Robert. Replica placement and access policies in tree networks. IEEE T Parall Distr 2008; 12(19): 1614–1627.

5. Gong. Research of replica placement strategy in cloud storage system. Nanjing, China: Nanjing University of Posts and Telecommunications, 2015.

6. Malik S, Nazir B, Qureshi. A reliable checkpoint storage strategy for grid. Computing 2013; 95: 611–632.

7. Yang, Yuan. Ensuring cloud data reliability with minimum replication by proactive replica checking. IEEE T Comput 2016; 65(5): 1494–1506.

8. Long S-Q, Zhao Y-L, Chen. MORM: a multi-objective optimized replication management strategy for cloud storage cluster. J Syst Architect 2014; 60: 234–244.

9. Liu, Feng, Huang. Bandwidth-availability-based replication strategy for P2P VoD systems. Comput J 2013; 57(8): 1211–1229.

10. Xiong, Xiaojie, Ruchuan. A cluster based data replication strategy in cloud storage systems. Comput Eng Sci 2014; 36(12): 2296–2304.

11. Warhade, Dahiwale, Raghuwanshi. A dynamic data replication in grid system. In: Proceeding of international conference on information security & privacy (ICISP2015), Nagpur, India, 11–12 December 2015, vol. 78, pp.537–544. Elsevier B.V.

12. Hamrouni, Slimani, Charrada. A data mining correlated patterns-based periodic decentralized replication strategy for data grids. J Syst Software 2015; 110: 10–27.

13. Hamrouni, Slimani, Ben Charrada. A survey of dynamic replication and replica selection strategies based on data mining techniques in data grids. Eng Appl Artif Intel 2016; 48: 140–158.

14. Hamrouni, Hamdeni, Ben Charrada. Impact of the distribution quality of file replicas on replication strategies. J Netw Comput Appl 2015; 56: 60–76.

15. Kingsy Grace, Manimegalai. Dynamic replica placement and selection strategies in data grids—a comprehensive survey. J Parallel Distr Com 2014; 74: 2099–2108.

16. Boulos. Web GIS in practice III: creating a simple interactive map of England’s strategic Health Authorities using Google Maps API, Google Earth KML, and MSN Virtual Earth Map Control. Int J Health Geogr 2005; 4(12): 2269–2272.

17. Shi, Kindratenko, Yang. Modern accelerator technologies for geographic information science. New York: Springer, 2013.

18. Bell, Kuehnel, Maxwell. NASA World Wind: opensource GIS for mission operations. In: Proceedings of the aerospace conference, Big Sky, MT, 3–10 March 2007, pp.1–9. New York: IEEE.

19. Park D-J, Kim H-J. Prefetch policies for large objects in a web-enabled GIS application. Data Knowl Eng 2001; 37: 65–84.

20. Yeşilmurat, İşler. Retrospective adaptive prefetching for interactive web GIS applications. Geoinformatica 2012; 16: 435–466.

21. Zhong, Wang. Markov model in prefetching spatial data. Bull Surv Mapp 2010; 7: 1–4.

22. Lee, Kim. Adaptation of a neighbor selection Markov chain for prefetching tiled web GIS data. In: Proceedings of the second international conference on advances in information systems, Izmir, 23–25 October 2002, vol. 2457, pp.213–222. New York: ACM.

23. Guo. A prefetching model based on access popularity for geospatial data in a cluster-based caching system. Int J Geogr Inf Sci 2012; 26(10): 1–14.

24. Wang, Shi. A replacement strategy for a distributed caching system based on the spatiotemporal access pattern of geospatial data. Int Arch Photogram Rem Sens Spatial Inform Sci 2014; XL-4: 133–137.

25. García, de Castro, Verdú. An OLS regression model for context-aware tile prefetching in a web map cache. Int J Geogr Inf Sci 2012; 27(3): 614–632.

26. García, Verdú, Regueras. A neural network based intelligent system for tile prefetching in web map services. Expert Syst Appl 2013; 40: 4096–4105.

27. Shi, Wei. Quantitative analysis of Zipf’s law on web cache. In: Pan, Chen, Guo (eds) Parallel and distributed processing and applications ISPA 2005 (Lecture notes in computer science). Berlin, Heidelberg: Springer, 2005, pp.845–852.

28. Xiong, Wang. Prefetching scheme for massive spatiotemporal data in a smart city. Int J Distrib Sens N 2016; 2: 1–11.

29. Wang, Yao. DCCP: an effective data placement strategy for data-intensive computations in distributed cloud computing systems. J Supercomput 2016; 72: 2537–2564.

30. Pan. Distributed storage algorithm for geospatial image data based on data access patterns. PLoS ONE 2015; 10(7): e0133029.

31. Pan, Chong, Zhang. A global user-driven model for tile prefetching in web geographical information systems. PLoS ONE 2017; 12(1): e0170195.

32. Wang, Pan, Peng. Zipf-like distribution and its application analysis for image data tile request in digital earth. Geomat Inform Sci Wuhan Univ 2010; 35(3): 356–359.

33. Xia, Yang, Liu. Adopting cloud computing to optimize spatial web portals for better performance to support Digital Earth and other global geospatial initiatives. Int J Digit Earth 2015; 8(6): 451–475.

34. Chang. A dynamic data replication strategy using access-weights in data grids. J Supercomput 2008; 45(3): 277–295.

35. Gibbs, Poole, Stockmeyer. An algorithm for reducing the bandwidth and profile of a sparse matrix. SIAM J Numer Anal 1976; 13(2): 236–250.

36. Hao. Cost based load balancing for network geographic information service. Acta Geod Cartograph Sinica 2009; 38(3): 242–249.

37. Kanungo, Mount, Netanyahu. A local search approximation algorithm for k-means clustering. Comp Geom 2004; 28: 89–112.