Abstract
Cloud-management stacks have become an increasingly important element in cloud computing, serving as the resource manager of cloud platforms. While the functionality of this emerging layer has been constantly expanding, its fault resilience remains under-studied. This paper presents a systematic study of the fault resilience of OpenStack - a popular open source cloud-management stack. We have built a prototype fault-injection framework targeting service communications during the processing of external requests, both among OpenStack services and between OpenStack and external services, and have thus far uncovered 23 bugs in two versions of OpenStack. Our findings shed light on defects in the design and implementation of state-of-the-art cloud-management stacks from a fault-resilience perspective. Copyright 2013 ACM.