Abstract

Hybrid-core systems speed up applications by offloading certain compute operations that run faster on hardware accelerators. However, such systems require significant programming and porting effort before the accelerators yield a performance benefit. It is therefore prudent, prior to porting, to investigate the predicted performance benefit of accelerators for a given workload. To address this problem, we present a performance-modeling framework that rapidly and accurately predicts application performance on hybrid-core systems. We present predictions for two full-scale HPC applications, HYCOM and Milc. Our results for two accelerators (GPU and FPGA) show that gather/scatter and stream operations can speed up by as much as a factor of 15, and that the overall compute times of Milc and HYCOM improve by 3.4% and 20%, respectively. We also show that, in order to benefit from the accelerators, 70% of the latency of data transfer time between the CPU and the accelerators needs to be overcome.
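
The interplay the abstract describes, between a large speedup on offloaded operations, a modest overall gain, and the cost of CPU-accelerator data transfer, can be illustrated with a simple Amdahl-style estimate. The sketch below is not the framework presented in the article; the function names, the additive cost model, and the input numbers are assumptions chosen purely for illustration.

```python
import math


def predicted_time(cpu_time, offload_fraction, kernel_speedup,
                   transfer_time, hidden_fraction=0.0):
    """Amdahl-style run-time estimate after offloading part of the work.

    Hypothetical model (not taken from the article):
      cpu_time          baseline compute time on the CPU alone
      offload_fraction  share of cpu_time spent in offloadable operations
      kernel_speedup    speedup of those operations on the accelerator
      transfer_time     total CPU <-> accelerator data transfer time
      hidden_fraction   share of transfer_time that is overlapped or avoided
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must lie in [0, 1]")
    if not 0.0 <= hidden_fraction <= 1.0:
        raise ValueError("hidden_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    host_part = cpu_time * (1.0 - offload_fraction)
    accel_part = cpu_time * offload_fraction / kernel_speedup
    exposed_transfer = transfer_time * (1.0 - hidden_fraction)
    return host_part + accel_part + exposed_transfer


def break_even_hidden_fraction(cpu_time, offload_fraction, kernel_speedup,
                               transfer_time):
    """Smallest share of transfer time that must be hidden to match the CPU."""
    if transfer_time <= 0.0:
        return 0.0
    # Time saved by running the offloaded operations on the accelerator.
    saved = cpu_time * offload_fraction * (1.0 - 1.0 / kernel_speedup)
    return min(1.0, max(0.0, 1.0 - saved / transfer_time))


if __name__ == "__main__":
    # Illustrative inputs only; none of these values come from the article.
    base, frac, speedup, transfer = 100.0, 0.2, 15.0, 50.0
    hidden = break_even_hidden_fraction(base, frac, speedup, transfer)
    print("transfer share to hide at break-even:", round(hidden, 3))
    print("predicted time at break-even:",
          round(predicted_time(base, frac, speedup, transfer, hidden), 3))
```

Under this simplified model, a factor-of-15 speedup on a small share of the run time yields only a modest overall gain, and the gain disappears entirely unless a sufficient share of the transfer time is hidden.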

Keywords

accelerators, benchmarking, FPGA, GPU, HPC, idioms, performance modeling, performance prediction

Modeling and predicting performance of high performance computing applications on hardware accelerators
