Abstract
This paper proposes an improved adaptive density-based spatial clustering of applications with noise (DBSCAN) algorithm based on genetic algorithm and MapReduce parallel computing programming framework to improve the poor clustering effect and low efficiency of the DBSCAN algorithm, which due to experiential solving parameters. The size of Intensive Interval Threshold minPts and Scan Radius Eps would be rational planned by genetic algorithm iterative optimization, and it is secondary statute processing with the similarity and variability of the dataset and the efficient computing power of Hadoop Cluster. The data could be reasonable serialization, and the efficient adaptive parallel clustering could be achieved ultimately. Through the experimental results, it is shown that the proposed algorithm in this paper has higher clustering accuracy and execution efficiency than that of the comparison baselines. The trend will continue to grow with the increased volume of dataset. The improved algorithm provides a more accurate implementation method for the threshold of DBSCAN algorithm, and realizes the specific calculation process, which provides practice support for the realization of DBSCAN.
Keywords
Introduction
The density-based spatial clustering of applications with noise (DBSCAN) algorithm has always been the following two problems: for one thing, the threshold (
The emergence of cloud computing and Hadoop platform makes it possible for the computation to be more efficient and time saving. He et al. presented a scalable DBSCAN algorithm using MapReduce to remove three major drawbacks in the existing parallel DBSCAN algorithms.
7
Kim et al. proposed a new density-based clustering algorithm which is robust to find clusters with varying densities and suitable for parallelizing the algorithm with MapReduce.
8
Yu et al. proposed an efficient distributed density-based clustering Cludoop algorithm for big data using Hadoop.
9
However, the above researches do not take into account the adjustment problem of the two parameters (
Improvement of DBSCAN clustering algorithm
DBSCAN clustering algorithm
DBSCAN algorithm is able to find clusters of arbitrary shapes and filter the noise in the dataset. The related concepts are as follows 10 :
Core Point, Boundary Point, Noise Point: For the data object
Direct Density-reachable: Given a sample set

Density-reachable concept diagram: (a) Direct Density-reachable and (b) Density-reachable.
Density-reachable: Within a given sample set
Density connection: If there is an object

Density connection concept diagram.
Clusters: Starting from any of the Core Points, a cluster is formed from all object points whose density is reachable.
As mentioned above, the traditional algorithm is based entirely on the experience to set the values of the parameters
Design and improvement of GA-DBSCAN algorithm
Improvement GA method
In this paper, GA is used to improve the DBSCAN algorithm for determining the value of threshold
As for the group setting problem, the group size influences the final result of genetic optimization and the implementation efficiency of the GA. This paper sets the distribution range of the space occupied by the optimal solution in the whole problem space and then sets the initial population in this range. The method is as follows: establishing the dataset point distance matrix
For the scan radius solution, we can compute the frequency of the distance overlap distance in the matrix
From this we can get the zero matrix of the Core Point to be selected. The initial group of the GA is the set of Core Points to be selected.
The fitness function of GA is not subject to continuous and differentiable, which is a non-negative result that can be calculated for input. Its design directly affects the performance of GA. The number of groups is set to
Thus, the fitness of each selected Core Points in the initial population can be obtained by
The end condition of the algorithm is that when the number of algorithm iterations exceeds the maximum number of iterations
GA-DBSCAN algorithm
After several generations of evolution, if the optimal Core Point is still not found in the search process, it shows that the fathers selected in the previous generations cannot find the optimal point. The traditional GA, based on this method, conducts the continuous regeneration of the group, which leads to the joining of the individuals who are similar to their fathers. Therefore, the proportion of non-optimal individuals in the group is increased, resulting in genetic drift, and ultimately converge to the local optimal solution. Thus, the improvement of the traditional GA not only needs to take the group fitness, but also the dynamic adjustment category into consideration when selecting the fathers. The specific steps to improve the algorithm are as follows:
Define the adjustment coefficient Complete the following steps for the initial group:
Estimate individual fitness of the initial group and perform the following operations
Back up the initial group so that it can be restored when it encounters the local minimum. The probability that the individual
Use the probability 3. Initialize the regenerative calculator 4. Remove the access to the Core Points in the current group and take the maximum fitness value for
if
It means that the search process is still moving in the direction of optimization, replacing the individual in the reserved group with the individual in the current group,
if
It means that the convergence process is slow and may fall into the local optimum. At this time, the current population is replaced with the reserve group so that the population returns to the previous state of
Go to (3).
5. Determine the number of individuals in the group and treat as the approximate value of the threshold
We can determine the Core Point and the corresponding threshold
GA-DBSCAN parallelization based on MapReduce
The GA-DBSCAN algorithm is not efficient when encountering a large amount of data. Therefore, we introduce a new GA-DBSCAN algorithm based on the MapReduce programming framework. Under the premise of ensuring the data partitioning is reasonable, the meshing and data binning technology of FPRBP algorithm 11 is used to exact division for original dataset to ensure that there is no overlapping area. On this basis, we can use Map and Reduce function to complete the parallelization of the algorithm, thus solving the problem of low efficiency of algorithm implementation. 12 Song et al. proposed a task distribution algorithm to optimize the energy consumption of MapReduce system. By adjusting the size of Map task and Reduce task dynamically, the size of the task processing data ensures task parallelism to reduce the overall energy consumption of MapReduce system. 13 Wang et al. proposed a MapReduce data equalization method based on the incremental partition strategy. On one hand, this method produces more Reducer number of fine-grained partition by JobTracker according to the global partition segment information screening part of the unallocated in the Map side. On the other hand, the corresponding fine-grained partition is assigned to each Reducer, which solves the problem of equalization after the data partitioning. 14 Xun et al. introduced the data placement strategy in MapReduce cluster environment and summarized and analyzed the research and development of the data placement strategy optimization method in the current MapReduce cluster environment. 15
Map process
Each data node (including Namenode and Datanode) performs the GA-DBSCAN clustering process on the assigned tasks. The intermediate clustering results are output in the form of key-value pairs <

Map function flow.
Its formal description is as follows:
The initialized value of the variable Select a data object If Traverses all the data objects that are directly accessible from If queue If the data object If all the data objects have a category tag, output the intermediate clustering result <
Reduce process
Reduce processor receives the intermediate data <

Reduce function flow.
Its formal description is as follows:
Select the data object Do the following steps according to the value of If the data is marked as the noise and is not a boundary object, then it is marked as noise data; if neither the noise nor the data of the boundary object, then find all the data objects with the same cluster number; All the data marked as boundary objects then go to the boundary data processing function for further processing; If there is an unprocessed data object, go to step (1); According to the merging result of the boundary processing function, the cluster of the Clustering category in the whole dataset is re-numbered so as to obtain the unified cluster number, end the Reduce process, and output the final result <
Analysis of the experimental results
The simulation experiment is carried out on the cluster of a host master with three slave node-computers. In addition, the master is the DELL PRECISION T1700 workstation. The configuration is as follows: quad-core, eight threads of Core i7-4770CPU, 16 GB memory, and SATA7200 r/min, hard drive with 1 TB capacity; the slave is equipped with the same Dell workstations of 8 GB memory, and 500 GB hard drive. The operating system is Ubuntu12.04, and all nodes were connected by the Gigabit Ethernet switch. The version of the Hadoop is 2.6.0.
The data are derived from the 30 GB Baidu Encyclopedia data obtained by the Nutch crawler, and the similarity of the data is processed. We select seven datasets with different number of data points, which were 3400, 9880, 14,886, 20,350, 24,650, 51,270, and 101,640, to carry out the clustering simulation experiments. The proposed algorithm is experimented for 10 times. And the experimental results are recorded. When the parent is selected, the proposed algorithm uses the fitness function to find the better value of adjustment parameter

The relationship between the parameter and the correct rate: (a) Relationship between
We choose the original DBSCAN algorithm, GA-DBSCAN algorithm, GRIDBSCAN method,
16
and DBSCAN-MR-N method
17
as comparison baselines of the proposed GA-DBSCAN algorithm based on the MapReduce programming framework (abbreviated as GA-DBSCANMR). Those clustering methods are experimented in the selected seven datasets. The values of
Parameter adaptation value.
Accuracy of the algorithm execution (%).
DBSCAN: density-based spatial clustering of applications with noise.
The running time of each method on each data point set in the above experiments is recorded and their average running time is compared when the parameter

Contrast map with Running time/s.

GA-DBSCANMR algorithm clustering results.
Conclusion
In this paper, a new DBSCAN algorithm is proposed, and the algorithm is improved using the genetic algorithm and the MapReduce programming framework. The original algorithm and the improved algorithm are compared experimentally. The following conclusions are drawn from the research process and the experimental results: (1) Based on the combination of GA, the approximation value of
Through the improvement of the classical DBSCAN algorithm, the efficiency of the algorithm is greatly improved, and the accuracy of the algorithm is improved as well. However, it is still time consuming to improve the algorithm for high-dimensional data. Future work will focus on the high-dimensional datasets, so that the algorithm is able to achieve a wide range of applications.
Footnotes
Acknowledgment
The authors would like to thank the editor and all the referees for their useful comments and contributions for the improvement of this paper.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
Acknowledgements
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of Jilin Province (20150101054JC), Postdoctoral Research Fund of Jilin Province (40301919) and the Key Program for Science and Technology Development of Jilin Province (20150204036GX).
