Evaluating cooperative checkpointing for supercomputing systems

Adam Oliner; Ramendra Sahoo

doi:10.1109/IPDPS.2006.1639693

IPDPS 2006

Conference paper

25 Apr 2006

Abstract

Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work, and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the face of large checkpoint overheads. © 2006 IEEE.
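
The two policies named in the abstract can be illustrated with a minimal sketch. The abstract does not give the decision rules, so the thresholds below are assumptions chosen for illustration rather than the paper's exact heuristics: the work-based rule is assumed to grant a checkpoint once the unsaved work is at least as large as the checkpoint overhead, and the risk-based rule is assumed to weight the unsaved work by a predicted failure probability. The names `CheckpointRequest`, `work_based_should_checkpoint`, and `risk_based_should_checkpoint` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckpointRequest:
    """One application-requested checkpoint, as seen by the system (hypothetical type)."""
    unsaved_work: float         # seconds of computation since the last completed checkpoint
    checkpoint_overhead: float  # seconds needed to perform this checkpoint
    failure_probability: float  # assumed predictor output: chance of failure before the next request


def work_based_should_checkpoint(req: CheckpointRequest) -> bool:
    # Assumed heuristic: perform the checkpoint only when the work at risk
    # is at least as large as the cost of saving it; otherwise skip it.
    return req.unsaved_work >= req.checkpoint_overhead


def risk_based_should_checkpoint(req: CheckpointRequest) -> bool:
    # Assumed heuristic: the expected loss from skipping is the predicted
    # failure probability times the unsaved work; perform the checkpoint
    # only when that expected loss is at least the checkpoint overhead.
    expected_loss = req.failure_probability * req.unsaved_work
    return expected_loss >= req.checkpoint_overhead


if __name__ == "__main__":
    request = CheckpointRequest(unsaved_work=7200.0,
                                checkpoint_overhead=720.0,
                                failure_probability=0.05)
    print("work-based grants checkpoint:", work_based_should_checkpoint(request))
    print("risk-based grants checkpoint:", risk_based_should_checkpoint(request))
```

Under these assumed rules, the same request can be granted by one policy and skipped by the other: with a low predicted failure probability, the risk-based rule skips a checkpoint that the work-based rule would perform.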
