Parallel genetic algorithms on the graphics processing units using island model and simulated annealing

Abstract

To solve a non-deterministic polynomial-hard problem, we can adopt an approximate algorithm for finding the near-optimal solution to reduce the execution time. Although this approach can come up with solutions much faster than brute-force methods, the downside of it is that only approximate solutions are found in most situations. The genetic algorithm is a global search heuristic and optimization method. Initially, genetic algorithms have many shortcomings, such as premature convergence and the tendency to converge toward local optimal solutions; hence, many parallel genetic algorithms are proposed to solve these problems. Currently, there exist many literatures on parallel genetic algorithms. Also, a variety of parallel genetic algorithms have been derived. This study mainly uses the advantages of graphics processing units, which has a large number of cores, and identifies optimized algorithms suitable for computation in single instruction, multiple data architecture of graphics processing units. Furthermore, the parallel simulated annealing method and spheroidizing annealing are also used to enhance performance of the parallel genetic algorithm.

Keywords

Parallel computing genetic algorithm traveling salesman problem graphics processing units computing island model simulated annealing

Introduction

As the most popular type of evolutionary algorithms, the genetic algorithm (GA) is a search technique used in computational mathematics to obtain solutions to optimization problems. In this study, we not only accelerate evolution speed of the GA by the parallel computing capability of the graphics processing units (GPU), but also enable GAs converge to optimal solutions faster with a modified simulated annealing method.

Many models of parallel GAs have been proposed in the literature; the famous ones, for example, include the master–slave model as well as the island model.^1,2 The parallel GAs can be used to solve the issues of the past GAs that are susceptible to converge to local optimal solutions. Although computations of the parallel GAs have been realized in using Central Processing Unit (CPU), but due to the difference between architectures of the GPU and CPU, the parallel GAs cannot transport directly from CPU to execute on the GPU. Therefore, the first focus of this study is how to implement parallel GAs with the GPU hardware for operation and optimization of GAs computations.

Although the use of the above parallel GAs has been shown to be helpful in solving the issues of local optimal solutions, in this study we further enhance and optimize the parallel GAs using the island model by combining the mutation operations in GAs with simulated annealing method^3,4 and spheroidizing annealing.⁵ So the second focus of this article is how to combine parallel GAs with spheroidizing annealing such that GAs can more effectively jump out of local optimal solutions in the face of more complex computation problems.

In this study, we will use the traveling salesman problem (TSP)⁶ as the benchmark, although the TSP can be more efficiently solved by many other algorithms such as geometric algorithms. We aim to compare performances of GAs with different parameters settings (the population size and node number), study the differences between CPU (single-thread and multi-thread) and GPU computing, and investigate how much improvement can be gained in convergence after combining GAs with the simulation annealing method and spheroidizing annealing.

According to our experimental results, performance has been improved by two orders of magnitude: the acceleration rate of the proposed GPU computing over the single-thread CPU is about 300, and the acceleration rate of the proposed GPU computing over the multi-thread CPU is about 100. Also, compared to the parallel GA using the master–slave model on the multi-thread CPU, the proposed modified parallel GA using the island model combined with the simulation annealing method on the GPU has better convergence rates. Furthermore, spheroidizing annealing has been successfully combined into the simulation annealing method to have better convergence rates for the modified parallel GA on the GPU.

Background

GPU parallel computing

The GPU was created primarily for the purpose of handling image rendering and display. In recent years, the GPU is gradually used for scientific computation in addition to graphic display. Formerly called as general-purpose computing on graphics processing units (GPGPU),⁷ the emergence of GPU let GPU possess the ability of scientific computing. CUDA^7,8 and OpenCL⁹ are both GPU programming frameworks that allow the use of GPUs for general-purpose tasks. CUDA is proprietary to NVIDIA and hence limited. The GPU hardware architecture is different from that of traditional CPUs. The GPU owns a large number of cores for computing. However, the GPU adopts single instruction, multiple data (SIMD) architecture.¹⁰ Unlike the multitasking CPU, the GPU runs on one task at a time regardless of the large number of cores. So the GPU is most suitable for algorithms with simple and large amount of computations. The CPU can be used for computing while the GPU is operating. Our algorithm uses not only parallelized SIMD architecture, but also heterogeneous system architecture (HSA)¹⁰ composed with the CPU and GPU for further optimization.

TSP

We use the TSP^6,11 as a benchmark for quantitative comparison of GA. The TSP is probably one of the most intensively studied problems in optimization, in which for a given set of n cities, along with the cost of traveling between each pair of cities, one needs to find a path of minimum cost for the salesman to visit all the cities once and return to the origin city.

TSP is a combinatorial optimization problem. For example, we define a problem space in equation (1) and the objective function in equation (2)

X = {nod e_{0}, nod e_{1}, nod e_{2}, nod e_{3}}

(1)

Minimize (f (x) = dist (nod e_{0} x [0]) + \sum_{i = 0}^{2} dist (x [i], x [i + 1]) + dist (x [3], nod e_{0}))

(2)

In fact, we can see that one cannot easily work out exact solution for such a problem, and a heuristic method, such as the GA, is a more suitable approach.

Parallel GA

At present, several parallel GA computing models have been proposed, such as the island model and master–slave model.^{1–3,12–14} These models use the advantage of parallel computing by dividing population of a large size into smaller partitions and performing simultaneous random searches among them. These models greatly improve issues of premature convergence or convergence to local optimal solutions in past GA methods, where a single large population is used. To take advantage multitasking features of multi-core CPUs,¹⁵ these models implement parallel computing on CPU by allocating unused cores with the blocks not yet been operated. The GPU varies greatly from the CPU in parallel computing. Due to lack of multitasking capabilities of GPU computing cores, and the need to use Peripheral Component Interconnect Express (PCI-E) for communication between the GPU and main memory, much computing time will be spent on memory access for algorithms that frequently require GPUs to access memory. So we propose a method to bring all steps of GAs into the GPU for computation and combine it with parallel optimization models described in the following section.

Methodology

Parallel SIMD-based algorithm

First, the GA consists of five processes: (1) initialization of the population; (2) fitness evaluation; (3) selection and reproduction; (4) crossover; and (5) mutation of chromosomes. Except for fitness evaluation, random numbers are applied in all the other GA processes. A method to generate random numbers in parallel is to simply shift the seed of a random number. Computers usually use long sequences of numbers and a seed for table lookup to generate random numbers. We use NVIDIA’s Curand.kernel^14,15 and treat each thread ID as the seed of this thread, so that each thread can have its own independent random field. A global evaluation is performed after each thread performs evaluation of its own chromosomes and saves the evaluation results into global memory.

Parallel selection with SIMD

Selection of GA^16–18 can be carried out in many ways, of which the roulette-wheel selection method is most famous. However, the roulette-wheel selection method itself has a major flaw for parallel computing with SIMD architecture. First, we can see that in equation (3), which is the probability of being selected in the roulette-wheel selection method, each $x_{i}$ , that is, the fitness of individual $i$ in the summation $\sum_{k = 1}^{n} x_{k}$ , needs to be computed once

probability (x_{i}) = \frac{x_{i}}{\sum_{k = 1}^{n} x_{k}} \times 100 %

(3)

As discussed in the previous section, our aim is to bring all steps of GAs into the GPU for computation, thus, avoiding frequent data transfer between the GPU and main memory (or CPU).¹⁵ Hence, the summation operation in equation (3) will decrease computation speed and performance, since each thread of the GPU with SIMD architecture needs to be executed one time.

To improve performance, equation (4) is used to calculate the weight of chromosome i

ω_{i} = Random (threa d_{id}) mod α \times \frac{1}{x_{i}}

(4)

where $α$ represents precision. Also, integer operations are used for the mod operation. During the reproduction process, chromosomes with highest weight are selected for reproduction in each memory block. This reproduction process of chromosomes is continued until it reaches the preset population size. Figure 1 shows the process flow for parallel selection with SIMD architecture.

Figure 1.

Process flow for parallel selection with SIMD architecture.

Equation (4) is used to compute the weight of chromosome in each thread. Then the computed weight is compared with the stored weight in the memory location associated with the thread ID. If the computed weight is larger than the stored weight, the ID of this thread is entered into ID reproduction pool of the associated memory block. Upon completion of above step, the offset of associated memory location is incremented by one. The mod operation is performed when the sum of the offset and thread ID is more than the population size for finding the next memory location to be accessed.^15,16 The kernel repeats the operation of equation (2) until the value of offset reaches the population size, as shown in Figure 2. Such a procedure can ensure that each cell of the reproduction pool has completed its roulette-wheel selection process and avoid data race due to conflicts of parallel computing threads accessing shared memory. According to equation (4), even chromosomes with lower fitness value also have some chance to be entered the reproduction pool to ensure genetic diversity, and thus, premature convergence can be prevented.

Figure 2.

Memory access method for parallel selection.

Considering the memory access limit of the GPU, adjacent chromosomes are used for mating in the parallel crossover process of chromosomes. Figure 3 shows the flow of the parallel crossover^16–18 process of chromosomes in SIMD architecture.

Figure 3.

Algorithm of parallel crossover processes of chromosomes using SIMD.

We halve the amount of threads to the half of the population size, since every thread is responsive for the crossover process of two chromosomes. Such a framework enables the GPU perform memory access with a half-warp speed.

Parallel mutation and simulated annealing method with SIMD

Parallel mutations are simply achieved by each thread that uses its own random field to perform single-point or multi-point mutations on its chromosomes. We also combine the mutation method with the simulated annealing method.³ The simulated annealing method by definition is referred to annealing in metallurgy. The principle is that when a crystalline solid is heated to a high temperature, its atoms can leave their original positions and move randomly; when the temperature is cooled down, the atoms can slowly converge and the crystalline solid can achieve its most regular possible crystal lattice configuration, that is, its minimum lattice energy state. The simulated annealing method imitates this optimization random searching algorithm to converge to optimal solutions faster. The pseudo-code implementing the simulated annealing method is listed in Table 1.

Table 1.

Pseudo-code of simulated annealing.

S : = s0;
e : = E(s);
while k < kmax && e > emax
sn : = neighbor (s)
de : = E(sn)
if random() < P(e, de, temp(k/kmax)) then
s: = sn
e: = de
k : = k+1
return s

It starts from a state s0 and executes either a maximum of kmax steps or until a state with energy of emax or less is found. The variable e represents the energy of the atom. The call neighbor (s) generates a randomly chosen neighbor of a given state s; the call random() picks and returns a random number uniformly distributed in the range [0, 1). We use equation (5) to compute the probability for changing the energy of the atom to $e'$ and compare this computed probability with the value produced from the function call of random(). The state and energy variables of the atom are updated if the computed probability is greater than the random number

P (e, e', temp (\frac{k}{k \max})) = - \frac{e' - e}{Temp} \times 100 %

(5)

The variable Temp decreases along with time; when it reaches 0, variables of the atom stop updating.

The simulated annealing method is applied to parallel GA by replacing its mutation process. The modified parallel GA is shown in Table 2.

Table 2.

Parallel GA with simulated annealing method.

E: = evaluation(x);
d_E: = evaluation(Mutation(x));
deata: = d_E – E;
if deata > 0 and exp(-deata/Temp) then
E: = d_E;

We substitute atoms with chromosomes and energy with the fitness evaluation value. By combining with the simulated annealing method, the parallel GA can converge to optimal solutions faster. However, two extra variables, that is, the initial temperature and the rate of descent, need to be attuned. When both parameters are adjusted properly, the optimization level of the modified parallel GA with simulated annealing will be better than that of the original one.

Island model–based parallel GA

As described above, all the operations of the parallel GA have been implemented in a SIMD parallel computing architecture. However, we only increase their iterative speed of GAs, without addressing some of the shortcomings inherent in GAs, such as premature convergence to local optimal solutions. The island model, as depicted in Figure 4, has been shown to have better performance than the serial single population model.¹⁷ In the following, we try to adopt the island model¹² as a base to further improve performance of GAs.

Figure 4.

Schematic of a GA using island migration.

Based with the island model, the population of chromosomes is divided into several blocks, and each individual block (or island) individual runs GAs and generates its random search domains. Each island has a certain probability for chromosome migration between islands; chromosomes with higher fitness have higher chances of migration. In this study, after each island completing i-time iterations of GA, a global selection is executed to exchange genes among all islands. Of course, chromosomes with a better fitness value are easy to migrate among islands. In this way, blocks converging into non-optimal solutions before can use chromosomes with a better fitness value to obtain better solutions from the whole random search domain. Thus, the modified parallel GA based on the island model can be implemented by simply inserting division of blocks stages into selection as well as crossover operations. Figure 5 depicts a modified parallel GA based on the island model with SIMD architecture.

Figure 5.

Island model–based GA with parallel blocks using the GPU with SIMD architecture.

We divide the total population of a size $N_{totoal}$ into M blocks; thus, the population of each block: $N_{island} = N_{total} / M$ . As shown in Figure 5, independent random searches are performed i iterations inside each block; then the global selection and crossover are performed to exchange chromosomes among blocks. Then, the independent random searches are again performed i iterations inside each block and so on.

Effect of temperature decreasing rate on the simulated annealing

Although the simulated annealing method can be used to improve random search, it is difficult to decide a proper value for the temperature decreasing rate. If temperature deceases too fast, the GA might be stuck at local optimal solutions, whereas, if too slow, the GA might converge too slowly. Several temperature deceasing methods have been tested. The first method uses a fixed rate of decrease as shown in equation (6)

Temp : = Temp \times ROD

(6)

where ROD represents the rate of decrease.

The experimental results show that when the length of chromosomes increases, the GA has a high probability to be stuck at local optimal solutions with a fixed decreasing rate. The second method refers to spheroidizing annealing⁵ in metallurgy. Based on cyclic spheroidizing annealing shown in Figure 6, we formulate a new temperature decreasing mathematic model in equation (7)

Temp = T_{0} \times RO D_{0} \times | \cos T_{1} \times RO D_{1} | + T_{2} \times RO D_{2}

(7)

where $T_{0}$ represents the amplitude, $T_{1}$ represents the period, and $T_{2}$ represents the fundamental rate of decrease of the temperature.

Figure 6.

Cyclic spheroidizing annealing.

Note that $T_{2}$ has same function as Temp, and $RO D_{2}$ as ROD in the fixed decreasing rate model. Figure 7 depicts the mathematic model of cyclic spheroidizing annealing.

Figure 7.

Mathematic model of cyclic spheroidizing annealing.

This cyclic temperature algorithm solves the issue that the GA is stuck at local optimal solutions while the temperature is too low to provide enough energy to enable the algorithm to escape from local optima. The parameters, that is, $T_{i}$ and $RO D_{i} = 1, 2, 3$ , are manually set, which is relatively simple compared to parameters setting in polynomial models.

In the future, we plan to use two-leveled GA (GA2) to decide the temperature deceasing rate, as shown in Figure 8. Each chromosome represents a parameter of the rate of decrease. Since the fitness evaluation is obtained by multiple iterations using GA2, the population size cannot be set too large. When the proper parameters are decided for a chromosome with certain length, the whole parameters setting of rate of decrease can be directly substitute into GAs to deal with chromosomes with similar length.

Figure 8.

Modified two-leveled GA with simulated annealing.

In the following, we study the convergent efficiency of GA with cyclic spheroidizing annealing method compared to that of the fixed decreasing rate method. In the simulation, the number of nodes for the TSP is set as 256, and the simulation time is 5 min.

Figure 9 shows the relative errors (equation (8)), that is, deviation from optima, of GA with the cyclic spheroidizing annealing model, in which $T_{2}$ (or Temp) is 300; $RO D_{2}$ (or ROD) is 0.999; and varying $T_{0}$ and $T_{1}$

x_{0} = GA fitness output, x = Optimal fitness

relative error = \frac{| Δ x |}{x} \times 100 % = \frac{| x_{0} - x |}{x} \times 100 %

(8)

Figure 9.

Relative errors of GA using the cyclic spheroidizing annealing model ( $T_{2} = 300$ ; $RO D_{2} = 0.999$ ; and varying $T_{0}$ and $T_{1}$ ).

The simulated results are also listed in Table 3. We also found that when ROD₂ = 0.999, the relative error of GA with fixed deceasing rate annealing model is 407.8804%.

Table 3.

Relative errors of GA using the cyclic spheroidizing annealing model ( $T_{2} = 300$ ; $RO D_{2} = 0.999$ ; and varying $T_{0}$ and $T_{1}$ ).

Initialize the value of $T_{0} and T_{1}$
	$T_{1}$
	$T_{1} = 50$	$T_{1} = 100$	$T_{1} = 150$	$T_{1} = 200$
$T_{0}$	Relative errors of GA using the cyclic spheroidizing annealing model (%)
$T_{0} = 50$	385.0921	415.9965	321.2917	367.6405
$T_{1} = 100$	294.9293	250.6498	386.5825	389.6105
$T_{1} = 150$	365.4384	349.1226	349.9237	352.4873
$T_{1} = 200$	357.4735	319.7697	330.7815	368.6747

Figure 10 shows the relative errors, that is, deviation from optima, of GA with the cyclic spheroidizing annealing model, in which $T_{2}$ (or Temp) is 300; $RO D_{2}$ (or ROD) is 0.9999; and varying $T_{0} and T_{1}$ . The simulated results are also listed in Table 4. We also found that when ROD₂ = 0.9999, the relative error of GA with fixed deceasing rate annealing model is 346.7588%

Figure 10.

Relative errors of GA using the cyclic spheroidizing annealing model ( $T_{2} = 300$ ; $RO D_{2} = 0.9999$ ; and varying $T_{0}$ and $T_{1}$ ).

Table 4.

Relative errors of GA using the cyclic spheroidizing annealing model ( $T_{2} = 300$ ; $RO D_{2} = 0.9999$ ; and varying $T_{0}$ and $T_{1}$ .

Initialize the value of $T_{0} and T_{1}$
	$T_{1}$
	$T_{1} = 50$	$T_{1} = 100$	$T_{1} = 150$	$T_{1} = 200$
$T_{0}$	Relative errors of GA using the cyclic spheroidizing annealing model (%)
$T_{0} = 50$	386.2167	271.5436	344.2957	336.2516
$T_{1} = 100$	311.4254	323.2916	405.2371	357.8908
$T_{1} = 150$	394.6275	292.1579	384.4916	396.9173
$T_{1} = 200$	298.4466	268.2143	409.0585	351.9743

From the experimental results, we observe that the proposed cyclic spheroidizing annealing model can effectively offset the decrease of temperature, and thus enable the GA to escape local optimal solutions.

Experimental result and discussion

The experimental environment is listed as follows:

CPU: Intel^® Core™ i7-4770K 3.5 GHz processor with 16 GB memory; GPU: NVIDIA Titan Black.

Operating system: Linux Ubuntu 14.04 (64 bit).

Benchmark: TSP.

To search solutions for the TSP on the CPU, the single-thread program written in MATLAB uses the GA on the website in Vedenev,¹⁸ whereas the multi-thread program written in C# uses a master–slave model.^2,19 In the experiments, we combined the proposed parallel GA on the GPU SIMD architecture with the island model and improved simulated annealing method. We compared performance of the proposed parallel GA on the GPU with those of the single-thread and multi-thread program on the CPU. This experiment only uses GAs to compute optimal solutions for the TSP, which is mainly used as a benchmark and not particularly optimized. Since the purpose of this study is to develop a good computing model for parallel GAs.

The population size and iteration speed

In this experiment, the population size is varied from 512 to 8192, the number of nodes is varied from 16 to 256, and 2000 iterations are performed at each run. Figure 11 shows plots of the execution times of parallel GAs on the CPU using the master–slave model with various population sizes and node numbers. Table 5 lists the experimental results.

Figure 11.

Execution times for 2000 iterations (in seconds) of master–slave parallel GAs on the CPU with various population sizes and node numbers.

Table 5.

Execution times for 2000 iterations (in seconds) of master–slave parallel gas on the CPU with various population sizes and node numbers.

Number of population and nodes
	Nodes
	$16$	$32$	$64$	$128$	$256$
Population	Execution times (seconds)
512	1.23	3.17	6.98	17.9	60.001
$1024$	2.01	4.45	10.73	32.5	88.39
$2048$	5.43	12.72	19.75	43.26	108.5
$4096$	20.56	42.8	56.56	89.04	204.03
$8192$	106.21	216.88	215.45	294.48	578.82

Figure 12 shows plots of the execution times of GAs using the single-core CPU with various population sizes and node numbers. Table 6 lists the experimental results.

Figure 12.

Execution times for 2000 iterations (in seconds) of GAs on the single-core CPU with various population sizes and node numbers.

Table 6.

Execution times for 2000 iterations (in seconds) of gas on the single-core CPU with various population sizes and node numbers.

Number of population and nodes
	Nodes
	$16$	$32$	$64$	$128$	$256$
Population	Execution times (seconds)
$512$	20	25.95	63.25	120.15	251.32
$1024$	28.01	46.43	115.39	232.43	481.84
$2048$	47.03	101.46	224.33	454.25	1000
$4096$	109.62	233.87	438.87	892	1969.5
$8192$	270.75	465.64	885.5	1807.75	3964

As shown in Figures 11 and 12, the execution time of GAs on the CPU is relatively long (>60 s for 256 nodes) and increases sharply when the population size or the node number increases. Figure 13 shows the execution times of modified parallel GAs on the GPU with various population sizes and node numbers. Table 7 lists the experimental results.

Figure 13.

Execution times for 2000 iterations (in seconds) of modified parallel GAs on the GPU with various population sizes and node numbers.

Table 7.

Execution times for 2000 iterations (in seconds) of modified parallel gas on the GPU with various population sizes and node numbers.

Number of population and nodes
	Nodes
	$16$	$32$	$64$	$128$	$256$
Population	Execution times (seconds)
$512$	0.445	0.758	1.245	2.171	5.202
$1024$	0.465	0.789	1.299	2.265	5.355
$2048$	0.502	0.857	1.412	2.574	6.488
$4096$	0.557	1.017	1.731	3.401	8.395
$8192$	0.734	1.326	2.694	7.058	16.399

From the experimental results, we find that the modified parallel GA on the GPU has greatly reduced the execution time. For illustration, we divide the execution time of GAs on the CPU by that of parallel GAs on the GPU and show the acceleration ratios with varied population sizes and node numbers in Figures 14 and 15. Tables 8 and 9 list the acceleration ratios. We observe that as the amount of data (the population or node number) increases, the acceleration ratio increases, since the CPU reaches its limit while the GPU still does not.

Figure 14.

Acceleration ratios of GAs on the GPU over that on the parallel CPU.

Figure 15.

Acceleration ratios of GAs for GPU over single-core CPU.

Table 8.

Acceleration ratios of GAs on the GPU over that on the parallel CPU.

Number of population and nodes
	Nodes
	$16$	$32$	$64$	$128$	$256$
Population	Acceleration ratios
$512$	2.7640	4.1820	5.6064	8.2450	11.534
$1024$	4.3225	5.6400	8.2602	14.348	16.506
$2048$	10.816	14.842	13.987	16.806	16.723
$4096$	36.912	42.084	32.674	26.180	24.303
$8192$	144.70	163.55	79.974	41.722	35.296

Table 9.

Acceleration ratios of gas for GPU over single-core CPU.

Number of population and nodes
	Nodes
	$16$	$32$	$64$	$128$	$256$
Population	Acceleration ratios
$512$	44.94	34.23	50.8	55.34	48.31
$1024$	60.23	58.84	88.82	102.61	89.97
$2048$	93.68	118.38	158.87	176.47	154.13
$4096$	196.8	229.96	253.53	262.27	234.6
$8192$	368.86	351.16	328.69	256.12	241.72

Converging speed over various node numbers

In this experiment, we mainly verify that the modified parallel GA (using the GPU) combined with the island model and simulated annealing shows better converging speed than the original GA (using the CPU). First, we run both GAs for 300 s. Figures 16 and 17 depict the relative errors of GAs using the CPU with various node numbers (16, 32, 64, 128, and 256) for single-core and multi-thread processes, respectively. From both figures, we observe that when the number of nodes is over 64, the GA using the CPU cannot converge into optimal solutions; also the converging speed is exponentially inverse to the number of nodes.

Figure 16.

Relative errors of GAs using the CPU with various node numbers over 300 s.

Figure 17.

Relative errors of GAs using the CPU parallel computing with various node numbers over 300 s.

Figure 18 depicts the relative errors of the modified parallel GAs using the GPU with various node numbers (16, 32, 64, 128, and 256). From the plot, we observe that when the number node is less than 256, the modified parallel GA converges well (within 10 s); also, the overall converging speed of the modified parallel GA using the GPU is obviously faster than that of the original GA using the CPU.

Figure 18.

Relative errors of the modified parallel GAs using the GPU with various node numbers over 300 s.

To verify that the acceleration of converging speed has been improved by the optimized parallel GA besides the high-speed computing power of the GPU, the following experiment we used the same maximum iteration (5000) as termination settings for GAs running on the CPU and GPU.

Figures 19 and 20 depict relative errors of GAs using the CPU with various node numbers for 5000 iterations for single-core and multi-thread processes, respectively. From both figures, we observe that the GA converges relatively slowly on the CPU.

Figure 19.

Relative errors of single-thread GAs using the CPU with various node numbers for 5000 iterations.

Figure 20.

Relative errors of multi-threaded GAs using the CPU with various node numbers for 5000 iterations.

Figure 21 depicts relative errors of modified parallel GAs using the GPU with various node numbers for 5000 iterations. From comparison of plots in Figures 19 –21, we observe that for the same iteration, the modified parallel GA using the GPU converges much faster than the original GA using the CPU.

Figure 21.

Relative errors of modified GAs using the GPU with various node numbers for 5000 iterations.

Conclusion

We experimentally prove that the GAs can be inputted into SIMD architecture for highly parallel computing. We have shown how to modify the GA process and memory access method to adjust for SIMD architecture. Also, to further enhance the converging speed on the GPU with SIMD architecture, the parallel GA has been combined with the island model and simulated annealing, which have been used in traditional CPU for parallelization. The modified parallel GA takes advantage of the blocks feature of NVIDIA CUDA and shows high performance. We replace mutation of GAs with the simulated annealing, which we plan to implement with a two-level GA to determine parameters for the rate of decrease in our future work.

Footnotes

Academic Editor: Artde Lin

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly supported by Ministry of Science and Technology, Taiwan, under grant number MOST 105-2221-E-029-012-MY2.

References

Whitley

Rana

Heckendorn

. The island model genetic algorithm: on separability, population size and convergence. J Comput Inf Technol 1999; 7: 33–48.

Gumpu, https://github.com/gumpu/TSP_Animation (accessed 19 March 2017).

Aarts

Korst

. Simulated annealing and Boltzmann machines. Hoboken, NJ: John Wiley & Sons, 1988.

Kirkpatrick

Gelatt

Vecchi

. Optimization by simulated annealing. Science 1983; 220: 671–680.

Yang

Zhu

. Effect of cyclic spheroidizing annealing on microstructure and mechanical properties of GCr15 bearing steel. Heat Treat Met 2009; 14: 33–36.

Lawler

Lenstra

Rinnooy Kan

AHG

. The traveling salesman problem: a guided tour of combinatorial optimization (vol. 3). New York: John Wiley & Sons, 1985.

Han

Abdelrahman

. hiCUDA: high-level GPGPU programming. IEEE T Parall Distr 2011; 22: 78–90.

Harris

. High performance computing with CUDA. In: International supercomputing conference, Dresden, Germany, 2008.

Stone

Gohara

Shi

. OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 2010; 12: 66–73.

10.

Lindholm

Nickolls

Oberman

. NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro 2008; 28: 39–55.

11.

Hoffman

Padberg

Rinaldi

. Traveling salesman problem. In: Gass

(eds) Encyclopedia of operations research and management science. New York: Springer, 2013, pp. 1573–1578.

12.

Pit

. Parallel genetic algorithms (MS Computer Science), 1995, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.6836&rep=rep1&type=pdf

13.

Muhlenbein

. Evolution in time and space-the parallel genetic algorithm. In: Rawlins

GJE

(ed.) Foundations of genetic algorithms. Burlington, MA: Morgan Kaufmann Publishers, 1991, pp. 316–338.

14.

Toolkit

CUDA

. 4.2: CURAND guide. Santa Clara, CA: NVIDIA Corporation, 2012.

15.

Siegel

Ributzka

. CUDA memory optimizations for large data-structures in the gravity simulator. J Algorithm Comput Technol 2011; 5: 341–362.

16.

Negrut

Serban

. Unified memory in CUDA 6.0: a brief overview of related data access and transfer issues. Technical report TR-2014-09, September 2014. Madison, WI: SBEL.

17.

Melab

Talbi

. GPU-based island model for evolutionary algorithms. In: Proceedings of the 12th annual conference on genetic and evolutionary computation, Portland, OR, 7–11 July 2010, pp.1089–1096. New York: ACM Press.

18.

Vedenev

. http://simulations.narod.ru (accessed 19 March 2017).

19.

Cantú-Paz

. A survey of parallel genetic algorithms. Calculate Parall 1998.