Enhanced-precision global sums are key to reproducibility in exascale applications. We examine two classic summation algorithms and show that vectorized implementations are fast, accurate, and reproducible at exascale. Both the 256-bit and 512-bit implementations speed up the operation by almost a factor of four over the serial version, improving the performance of global summations while retaining the numerical reproducibility of these methods.
Ahrens P, Nguyen HD and Demmel J (2015) Efficient reproducible floating point summation and BLAS. Tech. Rep. UCB/EECS-2015-229, EECS Department, University of California, Berkeley.
Bailey DH (2005) High-precision floating-point arithmetic in scientific computation. Computing in Science & Engineering 7(3): 54–61.
Chapp D, Johnston T and Taufer M (2015) On the need for reproducible numerical accuracy through intelligent runtime selection of reduction algorithms at the extreme scale. In: 2015 IEEE International Conference on Cluster Computing. IEEE, pp. 166–175.
Collange S, Defour D, Graillat S, et al. (2015) Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49: 83–97. Available at: http://www.sciencedirect.com/science/article/pii/S0167819115001155 (accessed 21 August 2019).
Estérie P, Falcou J, Gaunard M, et al. (2014) Boost.SIMD: generic programming for portable SIMDization. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, pp. 1–8.
Gopalakrishnan G, Hovland P, Iancu C, et al. (2017) Report of the HPC Correctness Summit, January 25–26, 2017. Washington, DC.
He Y and Ding CH (2001) Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications. The Journal of Supercomputing 18(3): 259–277.
Higham NJ (1993) The accuracy of floating point summation. SIAM Journal on Scientific Computing 14(4): 783–799.
Higham NJ (2002) Accuracy and Stability of Numerical Algorithms, vol. 80. Philadelphia: SIAM.
Kahan W (1965) Further remarks on reducing truncation errors. Communications of the ACM 8(1): 40.
Klein A (2006) A generalized Kahan–Babuška summation algorithm. Computing 76(3–4): 279–293.
Knuth DE (1969) The Art of Computer Programming, vol. 2, chap. 4. Boston: Addison-Wesley.
McCracken DD and Dorn WS (1964) Numerical Methods and Fortran Programming: With Applications in Engineering and Science. Hoboken: Wiley.
McCalpin JD (2016) Memory bandwidth and system balance in HPC systems. Invited talk, Supercomputing.
Neumaier A (1974) Rundungsfehleranalyse einiger Verfahren zur Summation endlicher Summen [Rounding error analysis of some methods for summing finite sums]. ZAMM – Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik 54(1): 39–51.
Pouchard L, Baldwin S, Elsethagen T, et al. (2019) Computational reproducibility of scientific workflows at extreme scales. The International Journal of High Performance Computing Applications 33(5): 1–14.
Robey R and Zamora Y (n.d.) Parallel and High Performance Computing. Shelter Island: Manning Publications. Available online under an early access program.
Robey RW (2015) Computational reproducibility in production physics applications. Tech. Rep., Los Alamos National Laboratory (LANL), Los Alamos, NM.
Robey RW, Robey JM and Aulwes R (2011) In search of numerical consistency in parallel programming. Parallel Computing 37(4–5): 217–229.
Taufer M, Padron O, Saponaro P, et al. (2010) Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs. In: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, pp. 1–9.
Wilkinson J (1994) Rounding Errors in Algebraic Processes. Mineola: Dover Publications.
Yamada S, Ina T, Sasa N, et al. (2017) Quadruple-precision BLAS using Bailey's arithmetic with FMA instruction: its performance and applications. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2017, pp. 1418–1425.