Software pipelining for graphic processing unit acceleration: Partition, scheduling and granularity

Abstract

The graphic processing unit (GPU) is becoming increasingly popular as a performance accelerator in applications that require high-performance parallel computing. In a hybrid system that combines a central processing unit (CPU) with a GPU, software pipelining is central to delivering accelerated performance, and the key challenge is hiding CPU–GPU communication overheads by splitting a large task into small units. In this paper, we carry out a systematic investigation into task partitioning in order to achieve maximum performance gain. We first validate the advantage of an even partition strategy, and then propose an optimal schedule, with a detailed study of how to choose the optimal unit size (data granularity) within an analytical framework. Experiments on AMD and NVIDIA GPU platforms demonstrate that our approaches achieve a performance improvement of around 31–59% using software pipelining.
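
The trade-off behind unit size can be illustrated with a toy cost model. The sketch below is not the analytical framework of this paper. It assumes a three-stage pipeline (host-to-device copy, kernel, device-to-host copy) whose stages overlap fully, an even partition into `n_chunks` units, and a fixed per-stage `overhead` for every unit. The function names and the timing values are hypothetical.

```python
"""Toy cost model for a three-stage CPU-GPU software pipeline (illustrative assumptions only)."""


def pipelined_time(n_chunks, t_in, t_kernel, t_out, overhead):
    """Estimated makespan when the task is split evenly into n_chunks units.

    Assumptions (not from the paper): host-to-device copy, kernel, and
    device-to-host copy each run on their own engine and overlap fully;
    every stage of every chunk pays the same fixed `overhead`.
    """
    # Per-chunk duration of each stage under an even partition.
    stages = [overhead + t / n_chunks for t in (t_in, t_kernel, t_out)]
    # The first chunk passes through all stages; each later chunk adds
    # one period of the slowest stage.
    return sum(stages) + (n_chunks - 1) * max(stages)


def best_chunk_count(t_in, t_kernel, t_out, overhead, max_chunks=1024):
    """Brute-force search for the chunk count with the smallest estimate."""
    return min(
        range(1, max_chunks + 1),
        key=lambda n: pipelined_time(n, t_in, t_kernel, t_out, overhead),
    )


if __name__ == "__main__":
    # Hypothetical timings in milliseconds, chosen only for illustration.
    t_in, t_kernel, t_out, overhead = 4.0, 6.0, 2.0, 0.1
    n = best_chunk_count(t_in, t_kernel, t_out, overhead)
    serial = pipelined_time(1, t_in, t_kernel, t_out, overhead)
    piped = pipelined_time(n, t_in, t_kernel, t_out, overhead)
    print(f"chunks={n} serial={serial:.2f} ms pipelined={piped:.2f} ms")
```

In this toy model, too few units leave most of the transfer time exposed, while too many units accumulate fixed overhead, so the estimate has an interior minimum. Under the stated assumptions the minimum lies near the square root of (total time minus slowest-stage time) divided by the per-stage overhead.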

Keywords

Parallel computing, high-performance computing, GPU programming, software pipelining, optimal scheduling
