Erich P. Stuntebeck, John S. Davis II, et al.
HotMobile 2008
In parallel computer systems, an interconnection network is used to share memory among processors, to exchange information between them, or both. As a result, much of the system's data and control information travels across this network. To avoid severe performance degradation, the network must therefore be resilient to soft errors (transient and intermittent errors). In this paper we propose mechanisms for recovery from soft errors in multistage interconnection networks (MINs). To reduce the work done by these mechanisms, we propose localized concurrent error detection and recovery.
Raymond Wu, Jie Lu
ITA Conference 2007
Pradip Bose
VTS 1998
Ehud Altman, Kenneth R. Brown, et al.
PRX Quantum