Enhanced-precision global sums are key to reproducibility in exascale applications. We examine two classic summation algorithms and show that vectorized implementations are fast, accurate, and reproducible at exascale. Both the 256-bit and 512-bit implementations speed up the operation by almost a factor of four over the serial version, improving the performance of global summations while retaining the numerical reproducibility of these methods.
Ahrens P, Nguyen HD and Demmel J (2015) Efficient reproducible floating point summation and BLAS. Tech. Rep. UCB/EECS-2015-229, EECS Department, University of California, Berkeley.
Bailey DH (2005) High-precision floating-point arithmetic in scientific computation. Computing in Science & Engineering 7(3): 54–61.
Chapp D, Johnston T and Taufer M (2015) On the need for reproducible numerical accuracy through intelligent runtime selection of reduction algorithms at the extreme scale. In: 2015 IEEE International Conference on Cluster Computing. IEEE, pp. 166–175.
Collange S, Defour D, Graillat S, et al. (2015) Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Computing 49: 83–97. Available at: http://www.sciencedirect.com/science/article/pii/S0167819115001155 (accessed 21 August 2019).
Estérie P, Falcou J, Gaunard M, et al. (2014) Boost.SIMD: generic programming for portable SIMDization. In: Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing. ACM, pp. 1–8.
Gopalakrishnan G, Hovland P, Iancu C, et al. (2017) Report of the HPC Correctness Summit, January 25–26, 2017. Washington, DC.
He Y and Ding CH (2001) Using accurate arithmetics to improve numerical reproducibility and stability in parallel applications. The Journal of Supercomputing 18(3): 259–277.
Higham NJ (1993) The accuracy of floating point summation. SIAM Journal on Scientific Computing 14(4): 783–799.
Higham NJ (2002) Accuracy and Stability of Numerical Algorithms, vol. 80. Philadelphia: SIAM.
Kahan W (1965) Further remarks on reducing truncation errors. Communications of the ACM 8(1): 40.
Klein A (2006) A generalized Kahan–Babuška summation algorithm. Computing 76(3–4): 279–293.
Knuth DE (1969) The Art of Computer Programming, vol. 2, chap. 4. Boston: Addison-Wesley.
McCracken DD and Dorn WS (1964) Numerical Methods and Fortran Programming: With Applications in Engineering and Science. Hoboken: Wiley.
McCalpin JD (2016) Memory bandwidth and system balance in HPC systems. Invited talk, Supercomputing.
Neumaier A (1974) Rundungsfehleranalyse einiger Verfahren zur Summation endlicher Summen [Rounding error analysis of some methods for summing finite sums]. ZAMM – Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik 54(1): 39–51.
Pouchard L, Baldwin S, Elsethagen T, et al. (2019) Computational reproducibility of scientific workflows at extreme scales. The International Journal of High Performance Computing Applications 33(5): 1–14.
Robey R and Zamora Y (n.d.) Parallel and High Performance Computing. Shelter Island: Manning Publications. Available online under an early access program.
Robey RW (2015) Computational reproducibility in production physics applications. Tech. Rep., Los Alamos National Laboratory (LANL), Los Alamos, NM.
Robey RW, Robey JM and Aulwes R (2011) In search of numerical consistency in parallel programming. Parallel Computing 37(4–5): 217–229.
Taufer M, Padron O, Saponaro P, et al. (2010) Improving numerical reproducibility and stability in large-scale numerical simulations on GPUs. In: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, pp. 1–9.
Wilkinson J (1994) Rounding Errors in Algebraic Processes. Mineola: Dover Publications.
Yamada S, Ina T, Sasa N, et al. (2017) Quadruple-precision BLAS using Bailey's arithmetic with FMA instruction: its performance and applications. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2017, pp. 1418–1425.