TensorFlow PDF
Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g
§ TensorFlow
§ Caffe/Caffe2
§ Torch
§ SparkNet
§ TensorFrame
§ DeepLearning4J
§ BigDL
§ CNTK
§ mmlspark
§ Many others…
§ Google TensorFlow
§ Microsoft CNTK
§ Facebook Caffe2 and PyTorch
[Figure: convergence of Big Data (Hadoop, Spark, HBase, Memcached, etc.), HPC (MPI, RDMA, Lustre, etc.), and Deep Learning (Caffe, TensorFlow, BigDL, etc.)]
§ Key Features:
• Widely used for Deep Learning
• Open source software library for numerical
computation using data flow graphs
• Graph edges represent the multidimensional data
arrays
• Nodes in the graph represent mathematical
operations
• Flexible architecture allows deploying computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
• Used by Google, Airbnb, DropBox, Snapchat,
Twitter
• Communication and computation intensive
Architecture of TensorFlow
Source: https://www.tensorflow.org/
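The dataflow-graph idea above (nodes are mathematical operations, edges carry multidimensional arrays) can be illustrated with a toy pure-Python sketch; this is a conceptual illustration only, not the TensorFlow API:

```python
# Toy dataflow graph: nodes are operations, edges carry array values.
# Conceptual sketch only -- not the TensorFlow API.
class Node:
    def __init__(self, op, *inputs):
        self.op = op          # the operation this node performs
        self.inputs = inputs  # upstream nodes whose outputs feed this node

    def run(self):
        # Evaluate upstream nodes first, then apply this node's operation.
        return self.op(*(n.run() for n in self.inputs))

# Two constant nodes feeding an element-wise add node.
a = Node(lambda: [1.0, 2.0])
b = Node(lambda: [3.0, 4.0])
add = Node(lambda x, y: [u + v for u, v in zip(x, y)], a, b)

print(add.run())  # [4.0, 6.0]
```

A real TensorFlow graph works on the same principle, with the runtime additionally placing nodes on CPUs or GPUs and moving the tensors along the edges between devices.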
§ Key Features:
• Simple service definition
• Works across languages and platforms
• C++, Java, Python, Android Java, etc.
• Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated
authentication
• Used by Google (several of Google's cloud products and Google externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication!
Large-scale distributed systems composed of microservices
Source: http://www.grpc.io/
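The "simple service definition" point refers to gRPC's Protocol Buffers IDL. A minimal, hypothetical service for streaming tensors might look like the following (all names are illustrative, not from the paper):

```proto
// Hypothetical gRPC service definition; names are illustrative only.
syntax = "proto3";

service TensorTransfer {
  // Bi-directional streaming, as listed in the features above.
  rpc SendTensors (stream TensorChunk) returns (stream Ack);
}

message TensorChunk { bytes payload = 1; }
message Ack { bool ok = 1; }
```

From one such definition, gRPC generates client and server stubs for each supported language, which is how it works across C++, Java, Python, and the other platforms listed.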
[Figure: distributed TensorFlow architecture — a Client talks to a Master over gRPC; the Master dispatches work to a parameter-server task (/job:PS/task:0) and a worker task (/job:Worker/task:0), each running a gRPC server/client and holding CPU and GPU devices]
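The /job:.../task:... names in the figure come from TensorFlow's cluster specification. A minimal sketch, assuming the TensorFlow 1.x distributed runtime (host names and ports here are hypothetical):

```python
# Cluster definition behind the /job:.../task:... names in the figure.
# Assumes the TensorFlow 1.x distributed runtime; hosts/ports hypothetical.
cluster = {
    "ps":     ["node0.example.com:2222"],  # parameter server: /job:ps/task:0
    "worker": ["node1.example.com:2222"],  # worker:           /job:worker/task:0
}

# With TensorFlow 1.x, each process would start its gRPC server from this
# dict, e.g.:
#   spec = tf.train.ClusterSpec(cluster)
#   server = tf.train.Server(spec, job_name="worker", task_index=0)

def device_name(job, task):
    """Device-placement string TensorFlow derives from a job name and task index."""
    return "/job:%s/task:%d" % (job, task)

print(device_name("ps", 0))      # /job:ps/task:0
print(device_name("worker", 0))  # /job:worker/task:0
```

Each task then exchanges tensors with the others over the gRPC endpoints declared in the cluster spec, which is exactly the communication path the RDMA designs below target.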
§ Can similar designs be done for gRPC and TensorFlow to achieve significant
performance benefits by taking advantage of native RDMA support?
§ How do we benchmark gRPC and TensorFlow for both deep learning and system
researchers?
§ What kind of performance benefits can we get through native RDMA-based designs in gRPC and TensorFlow?
§ Rendezvous protocol
• TensorFlow worker (tensor receiving process) actively
requests for tensors to the parameter server (tensor
sending process)
§ Worker issues a Tensor RPC request to the Parameter Server (PS)
§ PS finds the requested tensor, and
responds to the worker
§ gRPC core uses recvmsg and sendmsg
primitives for receiving and sending
payloads
§ Tensor Transmission uses iovec
structures
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
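The recvmsg/sendmsg-with-iovec mechanism described above is scatter-gather I/O: a payload split across non-contiguous buffers is sent in one system call, with no extra copy into a contiguous buffer. A minimal Python sketch of the same idea (Python's socket.sendmsg takes a list of buffers, analogous to an iovec array; the "tensor" chunks are hypothetical):

```python
import socket

# Scatter-gather I/O sketch: a payload split across several buffers goes
# out in a single call, mirroring gRPC core's iovec-based sendmsg.
left, right = socket.socketpair()

# Three non-contiguous chunks of a hypothetical serialized tensor.
chunks = [b"tensor-", b"payload-", b"chunk"]
left.sendmsg(chunks)  # gathers all chunks into one message

# recvmsg returns (data, ancillary data, flags, address).
data, ancdata, flags, addr = right.recvmsg(64)
print(data)  # b'tensor-payload-chunk'

left.close()
right.close()
```

The benchmark paper cited above observes how gRPC distributes tensor payloads across such iovec buffers, which is why the buffer-size distribution matters for RDMA designs.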
§ gRPC + Verbs
• Dedicated verbs channel for tensor communication
• gRPC channel for administrative task communication
§ gRPC + MPI
• Dedicated MPI channel for tensor communication
• gRPC channel for administrative task communication
§ Uber Horovod
• Uber’s approach of MPI based distributed TensorFlow
§ Baidu Tensorflow-Allreduce
• Baidu’s approach of MPI based distributed TensorFlow
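Both Horovod and Baidu's tensorflow-allreduce replace the parameter-server exchange with an MPI-style ring-allreduce of gradients. A toy, single-process simulation of that pattern (a sketch of the algorithm's structure, not either library's implementation):

```python
# Toy single-process simulation of ring-allreduce: each worker's gradient
# vector is split into n chunks; a reduce-scatter phase sums chunks around
# the ring, then an allgather phase circulates the summed chunks so every
# worker ends with the full element-wise sum.
def ring_allreduce(grads):
    n = len(grads)                   # number of workers in the ring
    size = len(grads[0])
    assert size % n == 0, "assume even chunking for clarity"
    c = size // n                    # chunk length
    buf = [list(g) for g in grads]   # each worker's local buffer

    # Reduce-scatter: n-1 steps; each worker forwards one chunk to its
    # right neighbour, which accumulates it.
    for s in range(n - 1):
        outgoing = [(i, (i - s) % n) for i in range(n)]
        snap = {(i, k): buf[i][k * c:(k + 1) * c] for i, k in outgoing}
        for i, k in outgoing:
            dst = (i + 1) % n
            for j in range(c):
                buf[dst][k * c + j] += snap[(i, k)][j]

    # Allgather: n-1 steps; fully-reduced chunks circulate and overwrite.
    for s in range(n - 1):
        outgoing = [(i, (i + 1 - s) % n) for i in range(n)]
        snap = {(i, k): buf[i][k * c:(k + 1) * c] for i, k in outgoing}
        for i, k in outgoing:
            dst = (i + 1) % n
            buf[dst][k * c:(k + 1) * c] = snap[(i, k)]

    return buf

# Three workers with six-element gradients: every worker ends with the sum.
result = ring_allreduce([[1] * 6, [2] * 6, [3] * 6])
print(result[0])  # [6, 6, 6, 6, 6, 6]
```

Each worker sends and receives only 2(n-1)/n of the data per allreduce regardless of ring size, which is why this pattern scales well for the large gradient tensors of deep networks.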
[Figure: iovec buffer distribution observed for TensorFlow training over gRPC]
[Figure: AR-gRPC architecture — the AR-gRPC core serializes a tensor (gRPC byte buffer) and hands it to an Adaptive RDMA communication engine; buffer chunks (B0, B1, B2) are staged in a global buffer pool and transferred with RDMA Write/Read over InfiniBand/RoCE via RDMA-Endpoint-Write/Read, with RDMA polling on the receiving side before deserialization at the gRPC server]
OpenFabrics Alliance Workshop 2019
[Figure: point-to-point latency of Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps) across payload sizes (2 B to 8 KB) and payload generation schemes (Uniform, Random, Skew) on OSU-RI2-IB-EDR and SDSC-Comet-IB-FDR]
• OSU-RI2-IB-EDR: AR-gRPC (RDMA) reduces latency by 59% and 56% compared to Default gRPC over 40G Ethernet and IPoIB, respectively
• SDSC-Comet-IB-FDR: AR-gRPC (RDMA) reduces latency by 78% compared to 10G Ethernet (Default gRPC) and by 69% compared to IPoIB (Default gRPC)
[Figure: RPC throughput (RPCs/second) of Default gRPC vs. AR-gRPC for Uniform, Random, and Skew payload generation schemes on OSU-RI2-IB-EDR and SDSC-Comet-IB-FDR]
• OSU-RI2-IB-EDR: AR-gRPC (RDMA) achieves a 3.4x speedup compared to Default gRPC over IPoIB for the uniform scheme
• SDSC-Comet-IB-FDR: AR-gRPC (RDMA) achieves 3.6x bandwidth compared to Default gRPC over IPoIB for the uniform scheme
[Figure: latency (ms) and calls/second of Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps) for large payloads of 2 MB, 4 MB, and 8 MB]
[Figure: TensorFlow training throughput (images/second) with gRPC vs. AR-gRPC for GoogleNet and AlexNet at batch sizes 8, 16, and 32 per GPU, on 8 and 12 nodes]
GoogleNet & AlexNet Evaluation on OSU-RI2-IB-EDR (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs
• GoogleNet has only 5 Million parameters, whereas AlexNet has about 60 Million parameters
• AR-gRPC scales better as we go from 4 nodes to 8 nodes
• For large batch size (32/GPU, total 224) the GoogleNet improvement is about 15% (597 vs 517)
• GoogleNet results in less network intensive gradient updates
• However, AR-gRPC shows an 89% (124 vs. 65) performance improvement for AlexNet compared to default gRPC
EVALUATION OF TENSORFLOW: INCEPTION-V4
[Figure: TensorFlow training throughput (images/second) with gRPC, gRPC + Verbs, gRPC + MPI, and AR-gRPC at batch sizes 8, 16, and 32 per GPU, on 4, 8, and 12 nodes]
Resnet152 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NumOfGPUs
• AR-gRPC accelerates TensorFlow by 62% (batch size 8/GPU) compared to default gRPC on 4 nodes
• AR-gRPC improves Resnet152 performance by 32% (batch size 32/GPU) to 147% on 8 nodes
• AR-gRPC achieves a maximum speedup of 3x (55 vs. 18 images/second) compared to default gRPC on 12 nodes
• Even for a higher batch size of 32/GPU (total 352), AR-gRPC improves TensorFlow performance by 82% on 12 nodes
• AR-gRPC processes a maximum of 40%, 35%, and 30% more images, on 4, 8, and 12 nodes, respectively, than Verbs
• AR-gRPC achieves a maximum speedup of 1.61x, 3.3x and 4.5x compared to MPI channel on 4, 8, and 12 nodes, respectively
AR-GRPC SPEEDUP COMPARED TO DEFAULT GRPC
[Figure: AR-gRPC speedup over default gRPC for AlexNet, GoogleNet, VGG16, Resnet50, Resnet152, and Inception4]
• http://hidl.cse.ohio-state.edu
THANK YOU
Xiaoyi Lu, Dhabaleswar K. (DK) Panda
The Ohio State University