Study of throughput degradation following single node failure in a data sharing system
Abstract
The data sharing approach to building distributed database systems is gaining popularity because of its potentially higher processing power and greater flexibility compared with the data partitioning approach. However, because of the large amount of hardware and complex software involved, the likelihood of a single node failure in the system increases. Following a single node failure, processing must be done to determine the set of locks held by transactions that were executing at the failed node; these locks cannot be released until database recovery has completed on the failed node. This phenomenon can degrade throughput even when the processing power of the surviving nodes is adequate to handle all incoming transactions. This paper studies throughput degradation following a single node failure in a data sharing system through simulation and analytic modeling. The analytic model reveals several important factors affecting post-failure behavior and is shown to match the simulations closely. The effect of hot locks (locks that are accessed frequently) on post-failure behavior is examined through both simulation and analytic modeling. Further simulations observe system behavior after the set of locks held by transactions on the failed node has been determined, and show that if the delay in obtaining this information is too large, the system is prone to thrashing.
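The central phenomenon described above can be illustrated with a toy discrete-time sketch (not the paper's simulation or analytic model; all parameter names and values here are illustrative assumptions): after a node fails, a fraction of incoming transactions request locks retained by the failed node's in-flight transactions and must block until database recovery completes, depressing throughput even though surviving capacity is sufficient.

```python
import random

def simulate_throughput(recovery_time=50, hot_lock_fraction=0.2,
                        horizon=200, arrivals_per_tick=40, seed=1):
    """Toy model: one node fails at t=0. Until t=recovery_time, each
    arriving transaction independently requests (with probability
    hot_lock_fraction) a lock retained by the failed node and blocks;
    all other transactions commit in the same tick. Returns the number
    of transactions completed per tick."""
    random.seed(seed)
    completed_per_tick = []
    blocked = 0
    for t in range(horizon):
        done = 0
        for _ in range(arrivals_per_tick):
            if t < recovery_time and random.random() < hot_lock_fraction:
                blocked += 1   # waits on a lock held by the failed node
            else:
                done += 1      # lock available, transaction completes
        if t == recovery_time:
            done += blocked    # recovery releases the retained locks
            blocked = 0
        completed_per_tick.append(done)
    return completed_per_tick

throughput = simulate_throughput()
```

Even in this crude sketch, throughput sits below the arrival rate for the whole recovery interval and only returns to normal (after a backlog spike) once recovery releases the retained locks, mirroring the degradation behavior the paper studies.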