Balancing Stragglers Against Staleness in Distributed Deep Learning
Abstract
Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even when a large number of workers (CPUs or GPUs) is available for training, heterogeneity of compute nodes and unreliability of the interconnecting network frequently bottleneck the training speed. Since the workers have to wait for each other at every model update step, even a single straggler (slow worker) can degrade the overall training performance. In this paper, we propose a novel approach to mitigate the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups, where each group synchronously updates a model replica stored on its own parameter server. The parameter servers of the different groups update the model on a central parameter server asynchronously. A few stragglers in the same group (or even in separate groups) have little effect on the computational performance. The staleness of the asynchronous updates can be controlled by limiting the number of groups. Our method, in essence, provides a mechanism to move seamlessly between a purely synchronous and a purely asynchronous setting, thereby balancing the computational overhead of synchronous SGD against the accuracy degradation of purely asynchronous SGD. We empirically show that even with large delays on straggler nodes (more than 300% delay on a node), progressive grouping of the available workers still finishes training within 20% of the no-delay training time, with the number of groups limited by the permissible accuracy degradation (≤ 2.5% compared to the no-delay case).
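
To make the grouping idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) that simulates groups of synchronous workers pushing asynchronous updates to a central parameter server. All specifics here are assumptions for demonstration only: the toy quadratic objective, the group sizes, the straggler delay model, and the staleness bookkeeping.

# Illustrative sketch only: grouped synchronous steps, asynchronous cross-group pushes.
import threading
import time
import random

import numpy as np

NUM_GROUPS = 4          # number of group parameter servers (assumed)
WORKERS_PER_GROUP = 8   # workers that synchronize within one group (assumed)
STEPS_PER_GROUP = 25    # update steps performed by each group
LR = 0.1                # learning rate for the toy objective

# Central parameter server state for a toy objective f(w) = 0.5 * ||w||^2,
# whose gradient at w is simply w.
central_w = np.ones(10)
central_version = 0
lock = threading.Lock()
staleness_log = []

def worker_compute_time():
    # Per-worker compute time; occasionally simulate a ~3x-delayed straggler.
    base = random.uniform(0.001, 0.002)
    if random.random() < 0.05:
        base *= 3.0
    return base

def run_group():
    global central_w, central_version
    for _ in range(STEPS_PER_GROUP):
        # Synchronous phase inside the group: the group parameter server pulls
        # the current model, then waits for its slowest worker (barrier).
        with lock:
            w_read = central_w.copy()
            version_read = central_version
        time.sleep(max(worker_compute_time() for _ in range(WORKERS_PER_GROUP)))
        grad = w_read  # gradient of 0.5 * ||w||^2 evaluated at the pulled model
        # Asynchronous phase: push the group's update to the central parameter
        # server without waiting for the other groups.
        with lock:
            staleness_log.append(central_version - version_read)
            central_w = central_w - LR * grad
            central_version += 1

threads = [threading.Thread(target=run_group) for _ in range(NUM_GROUPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final ||w||:", round(float(np.linalg.norm(central_w)), 4))
print("mean staleness of applied updates:", round(float(np.mean(staleness_log)), 2))

The mean staleness reported at the end grows with the number of groups, which is the quantity the paper proposes to limit; setting NUM_GROUPS = 1 recovers a fully synchronous setting, while one group per worker recovers a fully asynchronous one.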