Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor
Abstract
Hypercube multiprocessors have recently offered a cost-effective and feasible approach to supercomputing through parallelism at the processor level: a large number of low-cost processors, each with local memory, are directly connected and communicate by message passing instead of shared variables. This paper discusses the design of a fault-tolerant hypercube multiprocessor architecture. Most recently proposed fault-tolerance schemes for parallel architectures mainly address reconfiguration once a faulty processor has been identified, and they assume the existence of an off-line diagnosis strategy that locates the faulty processor. We propose instead to detect and locate faulty processors concurrently with the execution of parallel applications on the hypercube, using a novel scheme of algorithm-based error detection. We have implemented system-level error detection mechanisms for three parallel applications on a 16-processor Intel iPSC hypercube multiprocessor: 1) matrix multiplication, 2) Gaussian elimination, and 3) fast Fourier transform. Schemes for other applications are under development. We have performed extensive studies of the error coverage of our system-level error detection schemes in the presence of finite-precision arithmetic, which affects our system-level encodings. Finally, the paper proposes two reconfiguration schemes that allow faulty processors to be isolated and replaced with spare processors. These reconfiguration schemes are integrated with the error detection schemes to form a truly fault-tolerant hypercube multiprocessor. © 1990 IEEE
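As an illustration of what algorithm-based error detection can look like for the first application, matrix multiplication, the sketch below uses the classical row/column checksum encoding. The abstract does not spell out the encoding the authors used, so the encoding itself, the function names, and the tolerance `tol` are all assumptions made for this example. The sketch runs on a single machine and only shows the checksum invariant; in a distributed setting, an inconsistent row and column would presumably be traced back to the processor that computed that part of the result under whatever data mapping is in use.

```python
# Illustrative sketch only: row/column checksum encoding for C = A * B.
# The encoding, names, and tolerance are assumptions, not taken from the paper.

def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]

def locate_errors(cf, tol=1e-9):
    """Return (bad_rows, bad_cols) for a full-checksum product matrix cf.

    A row or column is flagged when the sum of its data entries differs
    from its checksum entry by more than a relative tolerance. The
    tolerance is needed because floating-point rounding makes exact
    equality unreliable (hypothetical default value).
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(cf[i][:m]) - cf[i][m]) > tol * max(1.0, abs(cf[i][m]))]
    bad_cols = [j for j in range(m)
                if abs(sum(cf[i][j] for i in range(n)) - cf[n][j])
                > tol * max(1.0, abs(cf[n][j]))]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Multiplying the encoded operands yields a product that carries its own
# row and column checksums.
cf = matmul(add_column_checksum(a), add_row_checksum(b))
clean = locate_errors(cf)    # no inconsistencies expected

cf[0][1] += 0.5              # simulate a faulty result element
faulty = locate_errors(cf)   # row 0 and column 1 are flagged
```

The choice of `tol` reflects the trade-off the abstract alludes to: too tight a tolerance reports rounding noise as faults, while too loose a tolerance lets small errors go undetected.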