Understanding large system failures - A fault injection experiment
Abstract
Fault injection is used to characterize large system failures, overcoming the limitations imposed by incomplete information in field failure data. The experiment is conducted on a commercial transaction processing system. The authors (1) introduce the idea of failure acceleration to conduct such experiments; (2) estimate that total loss of the primary service occurs for only 16% of the faults; (3) reveal errors, termed potential hazards, that do not affect short-term availability but cause a catastrophic failure following a change in operating state; and (4) identify at least 41% of errors as potential candidates for repair before total failure. The results enhance the understanding of large system failures and provide a foundation for design enhancements and availability modeling.