
AIM301 Deep Learning With TensorFlow PyTorch and MXNet on AWS


AMERICAS

AIM301

Deep learning with TensorFlow, PyTorch, and MXNet on AWS
Shashank Prasanna
Sr. Developer Advocate, AI/ML
AWS

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
Machine learning and deep learning

Popular deep learning frameworks: TensorFlow, PyTorch, and MXNet

Getting the most out of deep learning frameworks with Amazon SageMaker

Summary

Resources

Q&A
Deep learning: ML with deep neural networks
Recommendations, Forecasting, Image recognition, …

Data → Machine learning → Results

Machine learning
• K-nearest neighbors
• XGBoost
• Random forest
• K-means
• Factorization machines
• Linear learner
• PCA
• Support vector machines
• Deep neural networks

Results
• Recommendations
• Forecasts
• Predictions
• Trends and patterns
Challenges with deep learning
Many model architectures – difficult to get started
VGG, ResNet, ResNeXt, DenseNet, SqueezeNet, R-CNN, Faster R-CNN,
SSD, YOLO, Seq2Seq, Transformers, and custom model architecture

Computationally intensive to train and deploy
• Needs high-performance CPUs and GPUs
• Needs fast access to GBs and TBs of data for training
• Training on hundreds of CPUs and GPUs requires infrastructure management

Difficult to host and manage models in production
• Difficult to deliver high-performance and low-latency predictions
• Scaling to thousands and millions of users requires infrastructure management
Deep learning frameworks
Building blocks for designing, training, and validating deep neural networks

• High-level programming APIs with Keras and Gluon
• Low-level functions for research and development
• Performance optimizations to take advantage of GPUs
• Ability to run training at scale (but you will have to manage infrastructure)
Deep learning needs more than just frameworks
SageMaker Studio (IDE)
ML services: Built-in algorithms, SageMaker notebooks, SageMaker Experiments, Model tuning, SageMaker Debugger, SageMaker Autopilot, Model hosting, SageMaker Model Monitor
Frameworks
Compute, networking, storage
Deep learning on AWS
Amazon SageMaker + deep learning frameworks + infrastructure services =
record-setting performance at low cost

• 27 minutes is the record-setting time to train Mask R-CNN with TensorFlow using 24 P3dn.24xlarge instances with 192 total GPUs
• 62 minutes is the record-setting time to train BERT with TensorFlow using 256 P3dn.24xlarge instances with 2,048 GPUs
• Low cost: 40% lower cost per inference for Inf1 instances compared to G4 instances – the lowest cost per inference in the cloud
Amazon SageMaker framework optimizations
Full-stack optimizations: compute + networking + storage + frameworks

High-performance, cost-effective, every framework
• Amazon EC2 p3dn, Amazon EC2 G4 instances
• Amazon EC2 Inf1 instances
• Deep learning framework containers
• Amazon S3, Amazon FSx for Lustre
• Amazon Elastic Inference
• AWS Neuron SDK
Getting the most out of deep learning frameworks for training with Amazon SageMaker
Training scripts → SageMaker SDK → Fully managed and optimized Amazon SageMaker cluster

Two ways to scale deep learning with Amazon SageMaker
1. Bring your own training script (script mode): Training scripts
2. Bring your own Docker container (BYOC): Code files
1 Bring your own training script
• Code files
• AWS Deep Learning Containers in Amazon ECR (container registry)
• Amazon SageMaker SDK
• Amazon S3
• Fully managed SageMaker cluster
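
As a minimal sketch of what a script-mode training script can look like, the example below reads hyperparameters from command-line arguments and locations from the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables that the SageMaker framework containers set. The hyperparameter names, the `training` channel name, the local fallback paths, and the placeholder "model" file are assumptions for illustration; the actual framework training code is left out.

```python
import argparse
import os


def parse_args(argv=None):
    """Collect hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters (hypothetical names) arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    # Locations come from environment variables inside the container;
    # the fallbacks let the same script run on a local machine.
    parser.add_argument("--sm-model-dir",
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    # Ignore any extra arguments the platform may pass.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv=None):
    args = parse_args(argv)
    os.makedirs(args.sm_model_dir, exist_ok=True)
    # Placeholder for real TensorFlow, PyTorch, or MXNet training code.
    # Files written to the model directory are what the job keeps.
    with open(os.path.join(args.sm_model_dir, "model.txt"), "w") as f:
        f.write(f"epochs={args.epochs} lr={args.learning_rate}\n")
    return args
```

In a real script, finish with the usual `if __name__ == "__main__": main()` guard so the container can run the file directly.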
2 Bring your own Docker container
• Code files → Docker build → Custom container in Amazon ECR (container registry)
• Amazon SageMaker SDK
• Amazon S3
• Fully managed SageMaker cluster
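
For the bring-your-own-container route, the program inside the image has to follow the directory layout SageMaker uses under `/opt/ml`. The sketch below is one possible shape for that program, assuming a single input channel named `training` and a hypothetical `epochs` hyperparameter; the "model" it writes is a placeholder. The `prefix` parameter exists only so the function can be exercised outside a container.

```python
import json
import os
import traceback


def train(prefix="/opt/ml"):
    """Run one training job against the container directory layout."""
    config_path = os.path.join(prefix, "input", "config", "hyperparameters.json")
    data_dir = os.path.join(prefix, "input", "data", "training")  # assumed channel name
    model_dir = os.path.join(prefix, "model")
    failure_path = os.path.join(prefix, "output", "failure")
    try:
        with open(config_path) as f:
            hyperparameters = json.load(f)  # values arrive as strings
        epochs = int(hyperparameters.get("epochs", "1"))  # hypothetical name
        files = sorted(os.listdir(data_dir))
        # Placeholder for real training; anything saved in model_dir
        # is packaged as the job's model artifact.
        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, "model.json"), "w") as f:
            json.dump({"epochs": epochs, "num_files": len(files)}, f)
        return 0
    except Exception:
        # Writing the traceback to the failure file surfaces the
        # reason in the job description.
        os.makedirs(os.path.dirname(failure_path), exist_ok=True)
        with open(failure_path, "w") as f:
            f.write(traceback.format_exc())
        return 1
```

A non-zero return value would typically be passed to `sys.exit` by the container's `train` entry point so the job is marked as failed.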
Large training datasets: What are my options?
TensorFlow, PyTorch, and MXNet on a fully managed and optimized Amazon SageMaker cluster

1. Amazon S3: moderate and large datasets
• File mode: Copy entire dataset to local volume
• Pipe mode: Stream dataset from Amazon S3

2. Amazon EFS: scalable shared file system
• No downloading or streaming
• Share file system with other services

3. FSx for Lustre: high-performance file system
• Optimized for high-performance computing
• Natively integrated with Amazon S3
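
To make the File mode versus Pipe mode distinction concrete, here is a small sketch of where a training script would look for its data in each case. It assumes the standard container layout, in which File mode exposes a directory per channel and Pipe mode exposes one named pipe per channel and epoch; the helper names and the channel name are hypothetical.

```python
INPUT_ROOT = "/opt/ml/input/data"


def channel_path(channel, input_mode="File", epoch=0, root=INPUT_ROOT):
    """Return where a channel's data appears inside the container."""
    if input_mode == "File":
        # The whole dataset has been copied to a local directory.
        return f"{root}/{channel}"
    if input_mode == "Pipe":
        # Data is streamed through a FIFO, one per epoch, from epoch 0.
        return f"{root}/{channel}_{epoch}"
    raise ValueError(f"unknown input mode: {input_mode}")


def read_chunks(path, chunk_size=1024 * 1024):
    """Yield raw bytes from a regular file or a FIFO until it is exhausted."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

In Pipe mode a script would call `read_chunks(channel_path("training", "Pipe", epoch))` once per epoch and parse records from the byte stream, instead of listing files in a directory.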
How do I choose the right instance for training?
             P3.2xlarge    P3.8xlarge    P3.16xlarge   P3dn.24xlarge
GPUs         1 x V100      4 x V100      8 x V100      8 x V100
GPU memory   16 GB / GPU   16 GB / GPU   16 GB / GPU   32 GB / GPU
vCPUs        8             32            64            96
Mem (GiB)    61            244           488           768

P3dn.24xlarge: Highest performance, optimized for distributed training
• 32 GB memory
• 100 Gbps bandwidth
• Record-setting performance on Mask R-CNN and BERT

Distributing training and large-scale experiments

Distributing training and multiple experiments

Local mode training and prototyping P3
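
As a quick arithmetic check, the per-instance GPU counts in the table can be multiplied out to the cluster sizes quoted for the Mask R-CNN and BERT records. The helper below is only an illustration; the numbers are copied from the slides.

```python
# GPUs and GPU memory per instance type, as listed in the table.
GPUS_PER_INSTANCE = {
    "P3.2xlarge": 1,
    "P3.8xlarge": 4,
    "P3.16xlarge": 8,
    "P3dn.24xlarge": 8,
}
GPU_MEMORY_GB = {
    "P3.2xlarge": 16,
    "P3.8xlarge": 16,
    "P3.16xlarge": 16,
    "P3dn.24xlarge": 32,
}


def cluster_totals(instance_type, count):
    """Return (total GPUs, total GPU memory in GB) for a cluster."""
    gpus = GPUS_PER_INSTANCE[instance_type] * count
    return gpus, gpus * GPU_MEMORY_GB[instance_type]


print(cluster_totals("P3dn.24xlarge", 24))   # (192, 6144)   Mask R-CNN run
print(cluster_totals("P3dn.24xlarge", 256))  # (2048, 65536) BERT run
```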


Choosing the right instance for inference deployments
• What is your target latency SLA for your application?
• Real-time inference or batch predictions?
• Popular deep learning framework model or custom code?

• CPU instances (C5, M5): Small models, low throughput
• Elastic Inference (network-attached inference accelerator, eia1.medium): Mid-sized models, low-latency budget with tolerance limits
• GPU instances (P3, G4): Large models, high throughput, and low-latency access to CUDA
• Custom chip: High throughput, best cost and performance in the cloud

Start small and size up if you need more capacity


TensorFlow on AWS
https://aws.amazon.com/tensorflow/

AWS optimizations for TensorFlow available on Amazon EC2 and Amazon SageMaker
• AWS Deep Learning Containers for training and inference
• AWS Deep Learning AMIs (DLAMI)

Amazon SageMaker benefits for TensorFlow
• Built-in support for TensorBoard, Debugger, local mode, hyperparameter tuning, Managed Spot Training, Pipe mode, and Amazon Elastic Inference
• Distributed training – parameter server and Horovod
• Performance optimizations – GPUs, CPUs, and storage
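
The two distributed training options are selected through the `distribution` argument of the TensorFlow estimator in the SageMaker Python SDK. The helper below is a sketch that builds that dictionary; the helper name is hypothetical, and the key names reflect the SDK documentation as commonly described, so they should be checked against the SDK version in use.

```python
def distribution_config(strategy, processes_per_host=1):
    """Build a `distribution` dictionary for a TensorFlow estimator."""
    if strategy == "parameter_server":
        return {"parameter_server": {"enabled": True}}
    if strategy == "horovod":
        # Horovod runs over MPI; one process per GPU is the usual choice.
        return {"mpi": {"enabled": True,
                        "processes_per_host": processes_per_host}}
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical usage with the SageMaker Python SDK (not executed here):
# estimator = TensorFlow(entry_point="train.py", ...,
#                        distribution=distribution_config("horovod", 8))
```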
Demo: TensorFlow + Amazon SageMaker
• Develop and test using local mode
• Large-scale hyperparameter optimization
• Large-scale distributed training
• Model hosting
PyTorch on AWS
https://aws.amazon.com/pytorch/

AWS optimizations for PyTorch available on Amazon EC2 and Amazon SageMaker
• AWS Deep Learning Containers for training and inference
• AWS Deep Learning AMIs (DLAMI)

PyTorch on Amazon SageMaker
• Debugger, local mode, hyperparameter tuning, Managed Spot Training, Pipe mode, and Amazon Elastic Inference
• Serving framework – TorchServe
• Distributed training – TorchElastic
• Performance optimizations – GPUs, CPUs, and storage
TorchServe
An open-source model serving library for PyTorch, built and maintained by AWS
in collaboration with Facebook

aws.amazon.com/blogs/machine-learning/deploying-pytorch-models-for-inference-at-scale-using-torchserve/
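
Once a model is registered with TorchServe, predictions are requested over HTTP. The sketch below builds such a request with the standard library, assuming TorchServe's default inference address (`http://localhost:8080`) and its `/predictions/<model_name>` route; the model name and payload are hypothetical, and the right content type depends on the model's handler.

```python
import urllib.request


def build_prediction_request(payload, model_name,
                             host="http://localhost:8080",
                             content_type="application/octet-stream"):
    """Create a POST request for TorchServe's predictions route."""
    url = f"{host}/predictions/{model_name}"
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


def predict(payload, model_name, timeout=10.0, **kwargs):
    """Send the request and return the raw response body."""
    request = build_prediction_request(payload, model_name, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `predict(open("kitten.jpg", "rb").read(), "my_model")` would send an image to a model registered under the hypothetical name `my_model`, provided a TorchServe instance is running locally.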
Apache MXNet on AWS
https://aws.amazon.com/mxnet/

AWS-optimized Apache MXNet
• AWS Deep Learning Containers for training and inference
• AWS Deep Learning AMIs (DLAMI)

Apache MXNet on Amazon SageMaker
• Debugger, local mode, hyperparameter tuning, Managed Spot Training, Pipe mode, Amazon Elastic Inference, and distributed training
• C++, JavaScript, Python, R, Julia, Scala, Clojure, and Perl
• Performance optimizations – GPUs, CPUs, EFA, and storage
Gluon domain-specific tools and libraries
• GluonCV: Computer vision
• GluonNLP: Natural language processing
• GluonTS: Probabilistic time series modeling
  Deep factor, DeepAR, DeepState, Gaussian Processes Forecaster, Non-Parametric Time Series Forecaster, Feedforward (MLP), Transformer model, Wavenet, Seq-2-seq, Prophet, R Forecast
AutoGluon: Open-source AutoML
github.com/awslabs/autogluon

• Tabular prediction
• Image classification
• Text classification
• Object detection

https://aws.amazon.com/blogs/opensource/machine-learning-with-autogluon-an-open-source-automl-library/
Recap: Challenges and solutions
Many model architectures
• TensorFlow, PyTorch, and MXNet offer pretrained models
• Gluon and Keras make it easy to develop custom networks
• Gluon libraries include over 200 pretrained models in CV and NLP

Computationally intensive to train and deploy
• Amazon SageMaker lets you leverage full-stack optimizations: compute + networking + storage + frameworks for state-of-the-art performance

Difficult to host and manage models in production
• Deploy high-performance, low-latency inference endpoints with SageMaker using TensorFlow Serving, TorchServe, and Multi Model Server
• Reduce cost with Amazon Elastic Inference and Inf1 instances
Resources: Amazon SageMaker

https://github.com/awslabs/amazon-sagemaker-examples
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
https://sagemaker.readthedocs.io/en/stable/overview.html
Resources: Deep learning frameworks
aws.amazon.com/tensorflow aws.amazon.com/pytorch aws.amazon.com/mxnet

Deep Learning Containers images


docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
Resources: Blog posts and videos

https://towardsdatascience.com/how-to-debug-machine-learning-models-to-catch-issues-early-and-often-5663f2b4383b
https://towardsdatascience.com/a-quick-guide-to-distributed-training-with-tensorflow-and-horovod-on-amazon-sagemaker-dae18371ef6e
https://www.youtube.com/watch?v=02Ft-rCssRs
Learn machine learning with AWS Training and Certification
Resources created by the experts at AWS to help you build and validate machine learning skills

• Explore tailored machine learning (ML) paths for business decision makers, data platform engineers, data scientists, and developers
• Learn at your convenience with 65+ free digital courses, or register for a live instructor-led class featuring hands-on labs and opportunities for practical application
• Take the AWS Certified Machine Learning – Specialty exam to validate expertise in building, training, tuning, and deploying ML models

Visit the ML learning paths at https://aws.training/ML


Thank you!
Shashank Prasanna
@shshnkp

linkedin.com/in/shashankprasanna

medium.com/@shashankprasanna
