Probabilistic QoS guarantees for supercomputing systems

A.J. Oliner; L. Rudolph; R.K. Sahoo; J.E. Moreira; M. Gupta

doi:10.1109/DSN.2005.80

DSN 2005

Conference paper

09 Nov 2005

Abstract

Supercomputing systems must be able to reliably and efficiently complete their assigned workloads, even in the presence of failures. This paper proposes an approach in which the system and its users negotiate a mutually desirable risk strategy. To accomplish this, the system makes probabilistic guarantees on quality of service (QoS) of the form "Job j can be completed by deadline d with probability p." To make such guarantees, the system uses event prediction (forecasting) in conjunction with fault-aware job scheduling and cooperative checkpointing strategies. Using job logs and failure traces from actual high-performance computing systems, we employ trace-based simulations to assess the effects of prediction accuracy (a) and user risk strategy (U) on a variety of performance metrics. Compared to a system that does not use event prediction, high forecasting accuracy resulted in QoS and utilization improvements of as much as 6%, along with an 89% reduction in the amount of lost work. Our results therefore show that a system that makes probabilistic QoS guarantees using a market-based scheduling approach can increase both system performance and reliability. © 2005 IEEE.
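
As a rough illustration of the negotiation the abstract describes, the sketch below models guarantees of the form "Job j can be completed by deadline d with probability p" as a list of offers, and models the user risk strategy as a simple probability threshold. This is an assumption made for illustration only: the names `Offer` and `choose_offer`, the threshold interpretation of the risk strategy, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Offer:
    """Hypothetical guarantee: finish by `deadline` with probability `p_success`."""
    deadline: float   # completion deadline, in hours from submission (assumed unit)
    p_success: float  # promised probability of meeting the deadline


def choose_offer(offers: List[Offer], risk_threshold: float) -> Optional[Offer]:
    """Pick the earliest deadline whose promised probability meets the threshold.

    `risk_threshold` stands in for the user's risk strategy (an assumption):
    the user rejects any guarantee with a lower promised success probability.
    Returns None when no offer is acceptable.
    """
    acceptable = [o for o in offers if o.p_success >= risk_threshold]
    if not acceptable:
        return None
    return min(acceptable, key=lambda o: o.deadline)


# Made-up offers: later deadlines come with higher promised probabilities.
offers = [Offer(10.0, 0.60), Offer(14.0, 0.90), Offer(20.0, 0.99)]

cautious = choose_offer(offers, 0.85)    # accepts the 14-hour offer
risk_taker = choose_offer(offers, 0.50)  # accepts the 10-hour offer
```

Under this simplified model, a risk-averse user trades a later deadline for a higher promised probability, while a risk-tolerant user takes the earliest slot despite a greater chance of failure.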