Slide 14 - Distributed Deep Learning
Trong-Hop Do
Distributed deep learning
What is distributed ML?
Why distributed ML?
• Large amounts of data to train
• Explosion in types, speed, and scale of data
• Types: image, time-series, structured, sparse
• Speed: sensors, feed, financial
• Scale: amount of data generated growing exponentially
• Public datasets: the processed splice genomic dataset is 250 GB, and data subsampling is unhelpful
• Private datasets: Google, Baidu perform learning over TBs of data
• Model sizes can be huge
• Models with billions of parameters do not fit in a single machine
• Other benefits:
• Faster IO
• Hardware failure tolerance
• Economical to use commodity hardware
Naïve MapReduce is not good for ML
Distributed deep learning
• Data parallelism is preferred
• Easy implementation
• Fault tolerance
• Benefits of asynchronous updates:
• First, we can potentially gain higher throughput in our distributed system: workers can spend more time performing useful computations, instead of waiting around for the parameter averaging step to be completed.
• Second, workers can potentially incorporate information (parameter updates) from other workers sooner than with synchronous updating (the synchronous parameter-averaging baseline is sketched below)
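As a reference point for the two benefits above, here is a minimal sketch of the synchronous parameter-averaging baseline they are compared against: each worker runs a few local SGD steps on its own data shard, then the driver averages the resulting parameters. The linear model, gradient function, and shard layout are illustrative assumptions, not material from the slides.

```python
# Minimal sketch of data-parallel training with synchronous parameter
# averaging (NumPy only). Model, gradient, and data split are assumptions
# made for illustration.
import numpy as np

def gradient(w, X, y):
    # Gradient of mean squared error for a linear model (illustrative).
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parameter_averaging_round(w, shards, lr=0.05, local_steps=5):
    """Each worker copies w, runs a few local SGD steps on its own shard,
    then the driver averages the resulting parameters (synchronous step)."""
    local_params = []
    for X, y in shards:                       # one loop iteration per worker
        w_local = w.copy()
        for _ in range(local_steps):
            w_local -= lr * gradient(w_local, X, y)
        local_params.append(w_local)
    return np.mean(local_params, axis=0)      # the averaging barrier

# Toy usage: 4 workers, each holding its own shard of a regression task.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
shards = []
for _ in range(4):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + 0.1 * rng.normal(size=100)
    shards.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = parameter_averaging_round(w, shards)
print(w)   # close to true_w
```

Asynchronous SGD removes the averaging barrier at the end of each round, which is where both benefits above come from.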
Asynchronous Stochastic Gradient Descent
• Problem: gradient staleness
Asynchronous Stochastic Gradient Descent
• Some approaches to dealing with stale gradients include:
• Scaling the value $\lambda$ separately for each update $\Delta W_{i,j}$ based on the staleness of the gradients (a sketch follows this list)
• Use synchronization to bound staleness. For example, delay faster workers when necessary,
to ensure that the maximum staleness is below some threshold
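One way to picture both approaches is a parameter server that tags its parameters with a version number, shrinks the learning rate as incoming gradients become stale, and rejects updates whose staleness exceeds a bound. The class below is an assumed illustration; the names and the 1/(staleness + 1) scaling rule are not from the slides.

```python
# Illustrative staleness-aware asynchronous SGD (assumed sketch).
# Each gradient arrives tagged with the parameter version it was computed
# against; its step size is scaled down by the staleness, and updates that
# are too stale are rejected (bounded staleness).
import numpy as np

class StalenessAwareServer:
    def __init__(self, w, base_lr=0.1, max_staleness=10):
        self.w = w
        self.version = 0
        self.base_lr = base_lr
        self.max_staleness = max_staleness

    def apply_update(self, grad, worker_version):
        staleness = self.version - worker_version
        if staleness > self.max_staleness:
            return False                          # force the worker to resync
        lr = self.base_lr / (staleness + 1)       # scale lambda by staleness
        self.w -= lr * grad
        self.version += 1
        return True

# Toy usage: gradients of f(w) = ||w||^2 / 2, with one simulated worker
# lagging three versions behind the other.
server = StalenessAwareServer(np.array([4.0, -4.0]))
history = [server.w.copy()]                       # parameters by version
for step in range(200):
    lag = 0 if step % 2 == 0 else 3
    worker_version = max(server.version - lag, 0)
    grad = history[worker_version]                # gradient at the stale w
    if server.apply_update(grad, worker_version):
        history.append(server.w.copy())
print(server.w)   # converges toward zero despite the stale updates
```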
Decentralized Asynchronous Stochastic Gradient Descent
• Sparse: only some of the gradients are communicated in each vector (the remainder are
assumed to be 0) - sparse entries are encoded using an integer index
• Quantized to a single bit: each element of the sparse update vector takes the value $+\tau$ or $-\tau$.
• Integer indexes (used to identify the entries in the sparse array) are optionally compressed using entropy coding to further reduce update sizes (a sketch of this sparse, 1-bit format follows)
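A compact way to see this update format is the pair of functions below: encode keeps only entries whose magnitude reaches $\tau$, sending an integer index and a sign for each, and decode expands them back to a dense vector of $+\tau$ / $-\tau$ values. The residual accumulator is an extra assumption on top of the slides, and the entropy-coding step is left out.

```python
# Assumed sketch of the sparse, 1-bit-quantized update format: only large
# entries are communicated, each as (integer index, sign), and decoded as
# +tau or -tau; everything else is treated as zero.
import numpy as np

def encode_update(grad, residual, tau):
    """Return (indices, signs, new_residual) for entries whose accumulated
    value reaches tau in magnitude; the leftover stays in the residual."""
    acc = grad + residual
    mask = np.abs(acc) >= tau
    indices = np.nonzero(mask)[0].astype(np.int32)   # integer indexes
    signs = np.sign(acc[indices]).astype(np.int8)    # one bit per sent entry
    new_residual = acc.copy()
    new_residual[indices] -= signs * tau             # carry the remainder
    return indices, signs, new_residual

def decode_update(indices, signs, tau, dim):
    """Rebuild the dense update: +tau / -tau at the sent indices, 0 elsewhere."""
    update = np.zeros(dim)
    update[indices] = signs * tau
    return update

# Toy usage: only a few percent of a 1000-dimensional gradient gets sent.
rng = np.random.default_rng(0)
g = rng.normal(scale=0.01, size=1000)
residual = np.zeros_like(g)
idx, sgn, residual = encode_update(g, residual, tau=0.02)
dense = decode_update(idx, sgn, tau=0.02, dim=g.size)
print(len(idx), "of", g.size, "entries communicated")
```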
Decentralized Asynchronous Stochastic Gradient Descent
• Compression and quantization are not free: these processes result in extra computation time per minibatch, and a small amount of memory overhead per executor
• The process introduces two additional hyperparameters to consider: the value for $\tau$ and whether to use entropy coding for the updates or not (though notably both parameter averaging and async SGD also introduce additional hyperparameters)
Distributed Neural Network Training: Which Approach is Best?
• Synchronous parameter averaging (and synchronous update approaches in general) gives the best accuracy per epoch, and the overall attainable accuracy, especially with small averaging periods ($N = 1$ averaging period most closely approximates single-machine training)
• The greatest issue with parameter averaging (and synchronous approaches in general) is the so-called ‘last executor’ effect: synchronous systems have to wait on the slowest executor before completing each iteration (a quick numerical illustration follows this list).
• Consequently, synchronous systems are less viable as the total number of workers increases.
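The ‘last executor’ effect is easy to see numerically: a synchronous step takes as long as the slowest worker, so the expected iteration time grows with the number of workers even when every worker's own time distribution stays the same. The timing distribution below is an assumption chosen purely for illustration.

```python
# Assumed simulation of the 'last executor' effect: each synchronous
# iteration waits for the slowest of n workers, so adding workers raises
# the expected step time even though per-worker times are unchanged.
import numpy as np

rng = np.random.default_rng(0)
for n_workers in (1, 4, 16, 64, 256):
    # Per-worker step time: 1.0 s baseline plus an exponential straggler tail.
    times = 1.0 + rng.exponential(scale=0.2, size=(10_000, n_workers))
    sync_step = times.max(axis=1).mean()     # wait for the slowest worker
    print(f"{n_workers:4d} workers -> mean synchronous step {sync_step:.2f} s")
```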
Distributed Neural Network Training: Which Approach is Best?
• Asynchronous stochastic gradient descent is a good option for training and has been shown to
work well in practice, as long as gradient staleness is appropriately handled
• Async SGD implementations with a centralized parameter server may introduce a communication bottleneck. Utilizing $N$ parameter servers, each handling an equal fraction of the total parameters, is a conceptually straightforward solution to this problem (a sharding sketch follows this list).
• Yahoo TensorFlowOnSpark
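To make the sharding idea concrete, the sketch below (an assumed, single-process stand-in, not the actual architecture of any system named here) splits the parameter vector into $N$ nearly equal contiguous slices, with each "server" applying only its own slice of every incoming gradient.

```python
# Assumed sketch of sharding parameters across N parameter servers so that
# no single server handles all of the update traffic: each shard owns a
# contiguous slice of the parameter vector.
import numpy as np

class ShardedParameterServers:
    def __init__(self, w, n_servers, lr=0.01):
        self.lr = lr
        # N nearly equal, contiguous shards; each copy stands in for one server.
        self.shards = [s.copy() for s in np.array_split(w, n_servers)]

    def push_gradient(self, grad):
        # In a real deployment each slice would be sent to a different machine;
        # the loop here just plays the role of those independent servers.
        offset = 0
        for shard in self.shards:
            shard -= self.lr * grad[offset:offset + len(shard)]
            offset += len(shard)

    def pull_parameters(self):
        return np.concatenate(self.shards)

# Toy usage: 3 "servers" jointly hold a 10-dimensional parameter vector.
ps = ShardedParameterServers(np.ones(10), n_servers=3)
ps.push_gradient(np.full(10, 0.5))
print(ps.pull_parameters())   # every entry decreased by lr * 0.5
```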