Reliability modeling of RAID storage systems with latent errors
Abstract
The reliability of disk storage systems is adversely affected by the presence of latent sector errors. Disk scrubbing and intradisk redundancy are two schemes proposed to cope with unrecoverable or latent media errors and enhance the reliability of RAID storage systems. Two recent studies have investigated the effectiveness of these schemes, but they have reached opposing conclusions. These studies were conducted using two different modeling approaches. We present a detailed investigation which reveals that this discrepancy originates from the difference in the approach adopted and in the level of detail incorporated by the two models. We show that, as a consequence, these models provide reliability results that may differ by orders of magnitude, thereby leading to contradictory conclusions. We develop a common analytical framework within which we investigate the details, merits, weaknesses, and applicability of each model. We resolve this discrepancy by deriving enhanced models that incorporate inherent characteristics of the latent-error process and provide realistic reliability results that are in good agreement. We subsequently reassess the reliability results and conclusions presented in previous studies regarding the disk scrubbing and intradisk redundancy schemes.