Abstract
We introduce a novel algorithm-based fault-tolerance scheme to detect and repair soft transient faults (silent data corruption, bitflips) in multigrid solvers: by applying the full approximation scheme (FAS) variant of multigrid to linear systems, we prove invariants that enable fault detection and correction, and ultimately lead to a black-box protection of the smoothing stage. A statistical analysis for a wide range of prototypical problems demonstrates the efficiency of our approach, especially compared with full checksum protection. In particular, the overhead of our new method is negligible in the fault-free case, since we only employ readily available quantities.
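To illustrate the setting, the following is a minimal sketch of a two-grid FAS cycle applied to a linear system, assuming a 1D Poisson model problem, weighted Jacobi smoothing, full-weighting restriction and an exact coarse solve. None of these choices are taken from the article, and all function names are hypothetical. The sketch does not reproduce the invariants proved in the paper. It only shows the property that makes FAS attractive here: the coarse-grid unknown `u_c` is a full approximation of the solution rather than a bare correction, so it can be compared directly against the restricted fine-grid iterate `u_hat`, which the cycle computes anyway.

```python
import math

def apply_A(u, h):
    """Apply the 1D Poisson operator (-u'') with zero Dirichlet boundaries."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append((2.0 * u[i] - left - right) / (h * h))
    return out

def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi smoother; the diagonal of A is 2/h^2."""
    for _ in range(sweeps):
        Au = apply_A(u, h)
        u = [ui + omega * (h * h / 2.0) * (fi - ai)
             for ui, fi, ai in zip(u, f, Au)]
    return u

def restrict(v):
    """Full weighting from n fine points to (n - 1) // 2 coarse points."""
    return [0.25 * v[2 * j] + 0.5 * v[2 * j + 1] + 0.25 * v[2 * j + 2]
            for j in range((len(v) - 1) // 2)]

def prolong(vc):
    """Linear interpolation from the coarse grid to the fine grid."""
    nc = len(vc)
    v = [0.0] * (2 * nc + 1)
    for j in range(nc):
        v[2 * j + 1] = vc[j]
    for j in range(nc + 1):
        left = vc[j - 1] if j > 0 else 0.0
        right = vc[j] if j < nc else 0.0
        v[2 * j] = 0.5 * (left + right)
    return v

def solve_coarse(rhs, H):
    """Exact tridiagonal (Thomas) solve of the coarse Poisson system."""
    n = len(rhs)
    d, o = 2.0 / (H * H), -1.0 / (H * H)
    c, g = [0.0] * n, [0.0] * n
    c[0], g[0] = o / d, rhs[0] / d
    for i in range(1, n):
        m = d - o * c[i - 1]
        c[i] = o / m
        g[i] = (rhs[i] - o * g[i - 1]) / m
    x = [0.0] * n
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):
        x[i] = g[i] - c[i] * x[i + 1]
    return x

def fas_two_grid(u, f, h):
    """One FAS two-grid cycle; also returns u_c and u_hat for inspection."""
    u = jacobi(u, f, h, 2)
    r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
    # Injected fine iterate: the coarse representation of the current guess.
    u_hat = [u[2 * j + 1] for j in range((len(u) - 1) // 2)]
    # FAS right-hand side: restricted residual plus A_c applied to u_hat.
    rhs = [ri + ai for ri, ai in zip(restrict(r), apply_A(u_hat, 2 * h))]
    u_c = solve_coarse(rhs, 2 * h)  # full approximation, not a correction
    e = prolong([a - b for a, b in zip(u_c, u_hat)])
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, 2), u_c, u_hat

def cs_two_grid(u, f, h):
    """Standard correction-scheme two-grid cycle, for comparison."""
    u = jacobi(u, f, h, 2)
    r = [fi - ai for fi, ai in zip(f, apply_A(u, h))]
    e = prolong(solve_coarse(restrict(r), 2 * h))
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, 2)

def demo_problem(n):
    """Hypothetical test problem: -u'' = pi^2 sin(pi x) on n interior points."""
    h = 1.0 / (n + 1)
    f = [math.pi ** 2 * math.sin(math.pi * (i + 1) * h) for i in range(n)]
    return f, h
```

For a linear operator, the FAS cycle and the correction-scheme cycle produce the same fine-grid iterates up to rounding. In this sketch the only extra cost of FAS is one coarse operator application per cycle, and in return `u_c` becomes available as a quantity that should stay close to `u_hat` once the iteration settles. A large, unexplained gap between the two could serve as a hypothetical fault indicator, although the detection criteria actually used in the paper may differ.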
