Abstract

Hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, are now key to solving computationally costly problems that require high-performance computing. However, programming efficient solutions for this kind of device is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to study in depth which data must be computed at each moment, across different computing platforms, while also taking architectural details into account.

We introduce the controller concept as an abstract entity that lets the programmer manage communications and kernel-launch details on hardware accelerators transparently. The model also makes it possible to define and launch central processing unit kernels on multi-core processors with the same abstraction and methodology used for the accelerators. Internally, it combines different native programming models and technologies to exploit the potential of each kind of device. In addition, the model simplifies the selection of values for several configuration parameters that can be set when a kernel is launched, based on a qualitative characterization of the kernel code to be executed.

Finally, we present an implementation of the controller model in a prototype library, together with its application in several case studies. Its use reduces development and porting costs, with very low overhead in execution time compared to manually programmed and optimized solutions that use CUDA and OpenMP directly.

Keywords

Parallel Programming, GPUs, CUDA, Heterogeneous Programming

Controllers: An abstraction to ease the use of hardware accelerators
