SWIRL: High-performance many-core CPU code generation for deep neural networks

Abstract

Deep neural networks (DNNs) have demonstrated effectiveness in many domains, including object recognition, speech recognition, natural language processing, and health care. The computations involved in DNN training and inference are typically time-consuming and require efficient implementations. Existing frameworks such as TensorFlow, Theano, Torch, Cognitive Toolkit (CNTK), and Caffe treat Graphics Processing Units (GPUs) as the status quo devices for DNN execution, leaving Central Processing Units (CPUs) behind. Moreover, existing frameworks forgo or limit cross-layer optimization opportunities that have the potential to improve performance by significantly reducing data movement through the memory hierarchy. In this article, we describe an alternative approach called SWIRL, a compiler that provides high-performance CPU implementations for DNNs. SWIRL is built on top of LATTE, an existing domain-specific language (DSL) for DNNs. SWIRL separates a DNN's specification from its schedule using predefined transformation recipes for the tensors and layers commonly found in DNNs. These recipes work together with DSL constructs to generate high-quality fused, vectorized, and parallelized code for CPUs. On an Intel Xeon Platinum 8180M CPU, SWIRL achieves performance comparable with TensorFlow integrated with MKL-DNN: on average 1.00× of TensorFlow inference performance and 0.99× of TensorFlow training performance. It also outperforms the original LATTE compiler on average by 1.22× and 1.30× on inference and training, respectively.
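As a rough illustration of why cross-layer fusion reduces data movement, the sketch below compares an unfused and a fused version of two simple element-wise layers (a bias add followed by a ReLU). It is a minimal pure-Python sketch under stated assumptions: the function names `unfused` and `fused` are hypothetical, and the code does not use or represent the SWIRL or LATTE APIs or the code they generate.

```python
def unfused(x, bias):
    """Run two layers separately, materializing an intermediate buffer."""
    # Layer 1 (bias add): writes a full intermediate list to memory.
    tmp = [v + b for v, b in zip(x, bias)]
    # Layer 2 (ReLU): reads the intermediate back and writes the output.
    return [max(t, 0.0) for t in tmp]


def fused(x, bias):
    """Apply both layers per element in a single pass."""
    # No intermediate buffer: each element is loaded once and stored once.
    return [max(v + b, 0.0) for v, b in zip(x, bias)]


# Hypothetical example data, chosen only for illustration.
x = [1.0, -2.0, 3.0]
bias = [0.5, 0.5, -4.0]

# Both variants compute the same result; only the memory traffic differs.
assert unfused(x, bias) == fused(x, bias)
```

The fused variant makes one pass over the data instead of two, which is the kind of saving in traffic through the memory hierarchy that the abstract attributes to cross-layer optimization.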

Keywords

Compilers, code generation, optimization, deep neural networks, code transformations

