Performance analysis of the high-performance conjugate gradient benchmark on GPUs

Abstract

Graphics processing unit (GPU) accelerated supercomputers have proved very effective, especially in terms of power efficiency, at accelerating compute-intensive applications such as the high-performance Linpack (HPL) benchmark used for the TOP500 list. This paper presents the details of a CUDA implementation of the high-performance conjugate gradient (HPCG) benchmark, a newly proposed benchmark that better represents modern application workloads, which rely more heavily on memory system and network performance than HPL does. Results obtained at full scale on the largest GPU supercomputers in the world, Titan (the Cray XK7 at ORNL) and Piz-Daint (the Cray XC30 at CSCS), indicate that GPU-accelerated supercomputers are also very effective for this type of workload. A comparison with other architectures is also presented, showing that GPUs, with their high memory bandwidth, are the highest performing devices for this new benchmark.

Keywords

GPU computing, CUDA, HPC, parallel computing, performance analysis
