Deep Learning With Multiple GPUs
In this guide, you will learn:

• Multi GPU Deep Learning Strategies
• How Using Multiple GPUs Works in Deep Learning Frameworks
• Multi GPU Deployment Models, including the GPU Server and GPU Cluster
Multi GPU Deep Learning Strategies
Model Parallelism

In model parallelism, a single model is split into segments, and each segment runs on a different GPU or machine. Each device computes its part of the model and passes its outputs along to the device holding the next part.

(Figure: a single model partitioned across Machine 1, Machine 2, Machine 3 and Machine 4.)
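As a simple sketch of the idea (not taken from the original guide), here is what splitting a model across two GPUs can look like in PyTorch; the layer sizes and the two-way split are illustrative:

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on one GPU,
        # second half on another.
        self.part1 = nn.Sequential(nn.Linear(784, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(512, 10)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # The intermediate activations (not full model replicas)
        # move between devices.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 784))
# Note: the loss must be computed on the device holding the
# final output, here "cuda:1".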
How Does Using Multiple GPUs Work in Deep Learning Frameworks?

When working with deep learning models, there are several frameworks you may use, including Keras, PyTorch and TensorFlow. Depending on the framework you choose, there are different ways to implement multi GPU systems.

TensorFlow

TensorFlow is an open source framework, created by Google, that you can use to perform machine learning operations. The library includes a variety of machine learning and deep learning algorithms and models that you can use as a base for your training. It also includes built-in methods for distributed training using GPUs.

Through the API, you can use the tf.distribute.Strategy class to distribute your operations across GPUs, TPUs or machines. This class enables you to create and support multiple user segments and to switch between distributed strategies easily.

Two strategies that extend tf.distribute.Strategy are MirroredStrategy and TPUStrategy. Both of these enable you to distribute your workloads, the former across multiple GPUs and the latter across multiple Tensor Processing Units (TPUs). TPUs are units available through Google Cloud Platform that are specifically optimized for training with TensorFlow.

Both of these strategies use roughly the same data-parallel process, summarized as follows:

• Replicas of your model are created and assigned to a GPU. Then, a subset of the dataset is assigned to that replica.
• The subset for each GPU is processed and gradients are produced.
• The gradients from all model replicas are averaged and the result is used to update the original model.
• The process repeats until your model is fully trained.
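For illustration, here is a minimal sketch of that process using MirroredStrategy with the Keras API. The model architecture, batch size, and randomly generated stand-in data are placeholders, not part of the original guide:

import tensorflow as tf

# Replicate training across all GPUs visible to this machine.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (the model and optimizer) must be created inside the
# strategy's scope so each GPU receives its own replica.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Stand-in dataset; replace with your real training data.
x_train = tf.random.normal((1024, 784))
y_train = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

# Keras splits each batch across the replicas, averages the gradients,
# and applies a single combined update, repeating until training ends.
model.fit(x_train, y_train, batch_size=256, epochs=5)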
PyTorch

PyTorch is an open source scientific computing framework based on Python. You can use it to train machine learning models using tensor computations and GPUs. This framework supports distributed training through the torch.distributed backend.

With PyTorch, there are three parallelism (or distribution) classes you can use with GPUs:

• DataParallel: enables you to distribute model replicas across multiple GPUs in a single machine. You can then use these models to process different subsets of your data set.

• DistributedDataParallel: extends the DataParallel class to enable you to distribute model replicas across machines in addition to GPUs. You can also use this class in combination with model parallelism to perform both model and data parallelism (a minimal usage sketch follows this list).

• Model parallelism: enables you to split large models across multiple GPUs, with partial training happening on each. Since the operations are performed sequentially, this requires syncing the intermediate outputs between the GPUs.
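For illustration, here is a minimal DistributedDataParallel sketch, assuming the script is launched with torchrun --nproc_per_node=<num_gpus> so that one process drives each GPU; the model and the dummy batch are placeholders:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Illustrative two-layer model; each process holds a full replica.
    model = nn.Sequential(
        nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)
    ).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Dummy batch standing in for this process's shard of the dataset.
    inputs = torch.randn(64, 784, device=local_rank)
    labels = torch.randint(0, 10, (64,), device=local_rank)

    # One training step; in practice this runs in a loop over the data.
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(inputs), labels)
    loss.backward()   # gradients are averaged across processes here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()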
Multi GPU Deployment Models
GPU Cluster

GPU clusters are computing clusters with nodes that contain one or more GPUs. These clusters can be formed from duplicates of the same GPU (homogeneous) or from different GPUs (heterogeneous). Each node in a cluster is connected via an interconnect to enable the transmission of data.

Kubernetes is an open source platform you can use to orchestrate and automate container deployments. This platform offers support for the use of GPUs in clusters to enable workload acceleration, including for deep learning.

When using GPUs with Kubernetes, you can deploy heterogeneous clusters and specify your resources, such as memory requirements. You can also monitor these clusters to ensure reliable performance and optimize GPU utilization.
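As an illustration, the sketch below uses the official Kubernetes Python client to submit a pod that requests one GPU and a memory limit. The pod name, container image, and namespace are placeholders, and the cluster is assumed to run the NVIDIA device plugin, which exposes GPUs as the nvidia.com/gpu resource:

from kubernetes import client, config

# Load credentials from the local kubeconfig.
config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:latest-gpu",
                resources=client.V1ResourceRequirements(
                    # Request one GPU and cap memory; the scheduler places
                    # the pod on a node that can satisfy both limits.
                    limits={"nvidia.com/gpu": "1", "memory": "8Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)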
Here are some of the capabilities you gain when using Run:ai:

• Advanced Visibility: create an efficient pipeline of resource sharing by pooling GPU compute resources.

• No More Bottlenecks: you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.

• A Higher Level of Control: Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.

Learn more about the Run:ai GPU virtualization platform.