Abstract
We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.
Keywords
Get full access to this article
View all access options for this article.
