Benchmarking Contemporary Deep Learning Hardware and Frameworks: A Survey of Qualitative Metrics
Authorized licensed use limited to: PUC GO - Universidade Católica de Goiás. Downloaded on May 18,2022 at 18:44:36 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. Milestones of deep learning on the Gartner hype cycle. We inserted some deep learning historical milestones into a modified version of Gartner's figure [1].
II. DEEP LEARNING HARDWARE
AI algorithms often benefit from many-core hardware and high bandwidth memory, in comparison to many non-AI algorithms that are often encountered. Thus computational power is not just a one-dimensional concept. The type of computations the hardware design is best suited for must be considered, since a hardware platform can have more or less computational power depending on the type of computation on which it is measured. GPUs (graphics processing units) do well on the kind of parallelism often beneficial to AI algorithms, in comparison to CPUs (central processing units), and thus tend to be well suited to AI applications. FPGAs (field programmable gate arrays), being configurable, can be configured to perform well on AI algorithms, although currently they lack the rich software layer needed to fully achieve their potential in the AI domain. ASICs (application specific integrated circuits) are similar to FPGAs in this regard, since in principle a specially configured FPGA is a kind of ASIC. Thus GPUs, FPGAs, and ASICs have the potential to expedite machine learning algorithms, in part because of their capabilities for parallel computing and high-speed internal memory.
Nevertheless, while earlier generation CPUs had performance bottlenecks when training or running deep learning algorithms, cutting-edge CPUs can provide better performance and thus better support for deep learning algorithms. In 2017, Intel released the Intel Xeon Scalable processors, which include the Intel Advanced Vector Extensions 512 (Intel AVX-512) instruction set, together with the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) [10]. Intel AVX-512 and MKL-DNN accelerate deep learning algorithms on lower precision tasks. Compared with the mainstream 32-bit floating-point precision (fp32) on GPUs, the 16-bit and 8-bit floating-point precisions (fp16/fp8) are lower in precision, but can be sufficient for deep learning inference. In addition, lower precision can improve the usage of cache and memory and make better use of memory bandwidth. Let us look specifically at GPUs, FPGAs, and ASICs next.
A. GPU Devices
GPUs are specialized processors dedicated to accelerating real-time three-dimensional (3D) graphics. GPUs offer an internal cache, high memory bandwidth, and fast parallel performance. The GPU cache accelerates matrix multiplication routines because these routines can reuse cached data instead of accessing global memory.
GPUs are widely used hardware devices for deep learning. After testing neural networks including ones with 200 hidden layers on MNIST handwritten data sets, GPU performance was found to be better than that of CPUs [11]. The test results show that the NVIDIA GeForce 6800 Ultra has a 3.3X speedup compared to the Intel 3 GHz P4; the ATI Radeon X800 has a 2.4-3.4X speedup. NVIDIA GPUs have steadily increased FLOPS (floating point operations per second) performance. In [12], a single NVIDIA GeForce 8800 GTX, released in November 2006, had 128 CUDA cores (575 MHz core clock) with 345.6 gigaflops, and its memory bandwidth was 86.4 GB/s; by September 2018, an NVIDIA GeForce RTX 2080 Ti [13] had 4,352 CUDA cores with 13.4 teraflops, and its memory bandwidth had increased to 616 GB/s.
B. FPGA Devices
FPGAs have reconfigurable hardware, which hardware engineers program using a hardware description language (HDL) such as VHDL or Verilog [14][15]. Some use cases will always involve energy-sensitive scenarios, and FPGA devices offer better performance per watt than GPUs. According to [16], when comparing gigaflops per watt, FPGA devices often have a 3-4X advantage over GPUs. After comparing the performance of FPGAs and GPUs on ImageNet 1K data sets, Ovtcharov et al. [17] confirmed that the Arria 10 GX1150 FPGA devices handled about 233 images/sec. at a device power of 25 watts. In comparison, NVIDIA K40 GPUs handled 500-824 images/sec. with a device power of 235 watts. Briefly, [17] demonstrates that FPGAs can process 9.3 images/joule, but these GPUs can only process 2.1-3.4 images/joule.
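The images/joule figures quoted from [17] follow from dividing throughput (images per second) by device power (watts, i.e., joules per second). The following is a minimal sketch of that arithmetic; the function and variable names are hypothetical and chosen only for illustration, and the inputs are the throughput and power figures stated in the text.

```python
def images_per_joule(images_per_second: float, watts: float) -> float:
    """Energy efficiency: (images/s) / (J/s) = images/J.

    Hypothetical helper for illustration; not taken from [17].
    """
    if watts <= 0:
        raise ValueError("device power must be positive")
    return images_per_second / watts


# Throughput and power figures as quoted in the text.
fpga = images_per_joule(233, 25)       # Arria 10 GX1150
gpu_low = images_per_joule(500, 235)   # NVIDIA K40, lower throughput bound
gpu_high = images_per_joule(824, 235)  # NVIDIA K40, upper throughput bound

print(f"Arria 10 GX1150: {fpga:.1f} images/joule")
print(f"NVIDIA K40: {gpu_low:.1f} to {gpu_high:.1f} images/joule")
```

With these inputs the FPGA value rounds to 9.3 and the GPU lower bound to 2.1, matching the quoted figures. The GPU upper bound rounds to about 3.5 rather than the quoted 3.4, which may reflect rounding or slightly different inputs in [17].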
C. ASIC Devices
Usually, ASIC devices have high throughput and low energy consumption because ASICs are fabricated chips designed for special applications instead of generic tasks. While running AlexNet [18], a convolutional neural network, Eyeriss consumed 278 mW. Furthermore, Eyeriss achieved 125.9 images/joule (with batch size N=4) [19]. In [12], Google researchers confirm that the TPU 1.0, based on ASIC technologies, has about a 15-30X speed-up compared to GPUs or CPUs of the same period, with TOPS/watt about 30-80X better.
D. Enhance Hardware Performance
Even though multiple cores, CPUs, and hyper-threading are mainstream technologies, these technologies still show weaknesses in the big data era. For example, deep learning models usually involve matrix products and matrix transpositions [11], so these algorithms require intensive computing resources. GPUs, FPGAs, and ASICs have better computing performance with lower latency than conventional CPUs because these specialized chipsets consist of many cores and on-chip memory. The memory hierarchy on these hardware devices is usually separated into two layers: 1) off-chip memory, named global memory or main memory; and 2) on-chip memory, termed local memory or shared memory. After copying data from global memory, deep learning algorithms can use high-speed shared memory to expedite computing performance. Specific program libraries provide dedicated application programming interfaces (APIs) for hardware devices, abstract complex parallel programming, and increase execution performance. For instance, the cuDNN library, released by NVIDIA, can improve the performance of Apache MXNet and Caffe on NVIDIA GPUs [20][17].
Traditionally, multiple cores, improved I/O bandwidth, and increased core clock speed can improve hardware speeds [21]. In Figure 3, arithmetic logic unit (ALU), single instruction, multiple data (SIMD), and single instruction, multiple thread (SIMT) systems concurrently execute multiply-accumulate (MAC) tasks based on shared memory and configuration files.
Fig. 3. Parallel chipsets and memory diagrams (after [21]).
However, there are new algorithms to improve computing performance. GPUs are low-latency temporal architectures, so the Toeplitz matrix, fast Fourier transform (FFT), and Winograd and Strassen algorithms can be used for improving GPU performance [21]. Data movement consumes energy. FPGAs and ASICs are spatial architectures. These devices contain low-energy on-chip memory, so that reusable dataflow algorithms provide solutions for reducing data movements. Weight stationary dataflow, output stationary dataflow, no local reuse dataflow, and row stationary dataflow were developed for decreasing the energy consumption of FPGAs and ASICs [21].
In addition, co-design of deep learning algorithms and hardware devices is another approach. According to [21], there are two solutions. 1) Decrease precision: There are several techniques to decrease the precision of operations and operands of DNNs, such as 8-bit fixed point, binary weight sharing, and log domain quantization. 2) Reduce number of operations and model size: Notable techniques include exploiting activation statistics, network pruning, and knowledge distillation.
E. Qualitative Benchmarking Metrics on Machine Learning Hardware
GPUs, FPGAs, and ASICs can be used in different domains besides deep learning, including cloud servers and edge devices. We distinguish 11 qualitative benchmarking metrics for machine learning devices, as follows. The results of the benchmarks are shown in Table I.
TABLE I. QUALITATIVE BENCHMARKING HARDWARE FOR MACHINE LEARNING ([10]-[20])
#    Attributes              ASICs   FPGAs      GPUs
1    Computing Performance   High    Low        Moderate
2    Low Latency             High    Moderate   Low
3    Energy efficiency       High    Moderate   Good
4    Compatibility           Low     Moderate   High
5    Research Costs          High    Moderate   Low
6    Research Risks          High    Low        Moderate
7    Upgradability           Low     Moderate   High
8    Scalability             High    Low        Moderate
9    Chip Price              Low     High       Moderate
10   Ubicomp                 Low     High       High
11   Time-to-Market          Low     High       High
1) Computing Performance can be measured by FLOPS. For ASICs and GPUs, modern chipsets are measured in petaflops (a quadrillion, or thousand trillion, FLOPS). In May 2017, Google announced Tensor Processing Unit 2.0 (TPU 2.0), which provides 11.5 petaflops per pod [22]. TPU 3.0, released in May 2018, offers 23.0 petaflops [23]. However, the NVIDIA GeForce RTX 2080 Ti has only 13.4 teraflops [13]. According to [24] and [25], ASICs have the most FLOPS, and GPUs are better than FPGAs.
2) Low latency describes an important chipset capability [26], and is distinguished from throughput [12]. In [12][24], ASICs have the lowest latency, while FPGAs have lower latency than GPUs.
3) Energy efficiency in computing is particularly important for edge nodes because mobile devices generally have limited power. In [12][24], ASICs have the highest energy efficiency, and FPGAs and GPUs come in second and third, respectively.
4) Compatibility means devices can be supported by multiple deep learning frameworks and popular programming languages. FPGAs need specially developed libraries, so FPGAs are not that good with respect to compatibility. GPUs have the best compatibilities [24]. ASICs currently are
second. For example, TPUs support TensorFlow, Caffe, etc.
5) Research costs refer to the total costs for developing devices, incurred from designing architectures, developing algorithms, and deploying chip sets on hardware devices. GPUs are affordable devices [24]. ASICs are expensive, and FPGAs are between GPUs and ASICs.
6) Research risks are determined by hardware architectures, development risks, and deployed chip sets. ASICs have the highest risks before market scaling. FPGAs are very flexible, so their risks are limited. GPUs are in the middle.
7) Upgradability is a challenge for most hardware devices. In [24], GPUs are the most flexible after deployment, and are better than FPGAs. ASICs are the most difficult to update after delivery.
8) Scalability means hardware devices can scale up quickly with low costs. Scalability is vital for clouds and data centres. ASICs have excellent scalability. GPUs have good scalability, but not as good as ASICs. FPGAs are the lowest on this dimension.
9) Chip Price means the price of each unit chip after industrial-scale production. In [27], FPGAs have the highest chip cost after production scale-up. ASICs have the lowest cost, and GPUs are in the middle.
10) Ubicomp (also named ubiquitous computing) indicates hardware devices used extensively for varied use cases, including, e.g., large scale clouds and low energy mobile devices. FPGAs are very flexible, so the devices can be used in different industries and scientific fields. ASICs usually are dedicated to specific industry needs. GPUs, like FPGAs, can be developed for many research fields and industry domains.
11) Time-to-market means the length of time from design to sale of products. According to [15], [24], and [27], FPGAs and GPUs have lower development time than ASICs.
III. MAINSTREAM DEEP LEARNING FRAMEWORKS
Open source deep learning frameworks allow engineers and scientists to define activation functions, develop special algorithms, train on big data, and deploy neural networks on different hardware platforms, from x86 servers to mobile devices.
Based on the wide variety of usages, support teams, and development interfaces, we split 18 frameworks into three sets: mature frameworks, developing frameworks, and inactive frameworks. The 10 mature frameworks can be used currently to enhance training speed, improve scalable performance, and reduce development risks. The developing frameworks are not yet broadly used in industries or research projects, but some developing frameworks could be used in specific fields. Retired frameworks are largely inactive.
A. Mature Frameworks
1) Caffe and Facebook Caffe2: Caffe [28] was developed at the University of California, Berkeley in C++. According to [29], Caffe can be used on FPGA platforms. Caffe2 [30] is an updated framework supported by Facebook.
2) Chainer Framework: Chainer [31], written in Python, can be extended to multiple nodes and GPU platforms through the CuPy and MPI for Python (mpi4py) libraries [32][33].
3) DyNet Framework: DyNet [34] was written in C++. The framework can readily define dynamic computation graphs, so DyNet can help improve development speed. Currently, DyNet only supports single nodes and not multiple node platforms.
Fig. 4. Popular deep learning frameworks. From the right column to the left: hardware, frameworks, core codes, license types, and API codes.
4) MXNet: Apache MXNet [35][36] is a well known deep learning framework. This framework was built in C++, and MXNet supports NVIDIA GPUs through the NVIDIA cuDNN library. In [37], Gluon is a development interface for MXNet.
5) Microsoft CNTK: The Microsoft Cognitive Toolkit (Microsoft CNTK) [38][39], funded by Microsoft and written in C++, supports distributed platforms.
6) Google TensorFlow: In 2011, Google released DistBelief [40], but the framework was not an open source project. In 2016, the project was merged with TensorFlow [41][42], an open source deep learning framework.
7) Keras [43][44] is a Python library for TensorFlow, Theano, and Microsoft CNTK. Keras has a reasonable development interface that can help developers to quickly develop demo systems and reduce development costs and risks.
8) Neon and PlaidML are partially supported by Intel: Neon [45], supported by Nervana Systems and Intel, may improve performance for deep learning on diverse platforms. PlaidML [46] was released by Vertex.AI in 2017; Intel will soon fund PlaidML.
9) PyTorch Framework: PyTorch [47][48], written in Python, can be integrated with Jupyter Notebook. FastAI [49] is another development interface for PyTorch.
10) Theano Framework: The core language of Theano [50][51] is Python with a BSD license. Lasagne [52][53] is an additional development library for Theano.
B. Developing Frameworks
In addition, some deep learning frameworks are less frequently mentioned by academic papers because of their limited functions. For example,
1. Apache SINGA [54] was developed in C++. The framework is supported by the Apache group [55][56].
2. BigDL [57][58], built with Scala code, is a deep learning framework that can run on Apache Spark and Apache Hadoop.
3. In [59], the authors mentioned DeepLearning4J (DL4J), which can be accelerated by cuDNN.
4. The PaddlePaddle deep learning framework was developed by Baidu using Python [60].
C. Inactive Frameworks
We mention two of these. (1) Torch [61] was written in Lua. It is inactive. (2) Purine [62][63] is open source and has not been updated since 2014.
D. Qualitative Benchmarking Metrics for Deep Learning Frameworks
Benchmarking metrics for deep learning frameworks include six qualitative metrics described next.
1) License Type: Open source software licenses impose a variety of restrictions. In [64], degree of openness is used as a metric for ranking open source licenses. Apache license 2.0 has relatively few restrictions. The MIT license imposes the most limitations. BSD is in the middle. So, in comparing degree of openness, Apache 2.0 > BSD > MIT.
2) Interface Codes (also called the API): The more functionality the API offers, the better it tends to support development. A good API can increase development productivity, reduce development cost, and enhance the functionality of the framework.
3) Compatible Hardware: Computing hardware devices including CPUs and GPUs constitute the underlying support for deep learning frameworks. The more different hardware devices a deep learning framework can run on, the better it is on this dimension.
4) Reliability: No single point of failure (NSPOF) is a risk minimizing design strategy. This approach ensures that one fault in a framework will not break an entire system. To avoid single points of failure, a mature framework might run on multi-server platforms rather than a single node.
5) Tested Deep Learning Networks: Evaluating software can discover potential problems, measure performance metrics, and highlight strengths and weaknesses. If a framework can be officially verified by a variety of deep learning networks, then the framework is correspondingly more suitable as a mainstream production framework.
6) Tested Datasets: Image datasets, voice datasets, and text datasets are among those used for training and testing deep learning networks. If a framework was verified by diverse datasets, we are able to know its performance, strengths, and weaknesses.
Consistent with these six metrics, there are 16 mainstream deep learning frameworks as shown in Figure 4 and Table II (shown after the references).
IV. A MACHINE LEARNING BENCHMARK ORGANIZATION
MLPerf is a machine learning benchmark organization that offers useful benchmarks that evaluate training and inference on deep learning hardware devices. MLPerf and its members are associated with advanced chip hardware companies and leading research universities. Hardware companies include Google, Nvidia, and Intel. Research universities include Stanford University, Harvard University, and University of Texas at Austin.
MLPerf members share their benchmarking results. Benchmark results, source codes, deep learning algorithms (also called deep learning models), and configuration files are submitted to a website on github.com. Currently MLPerf members already have submitted the MLPerf Training Results v0.5 and MLPerf Training Results v0.6, and the deep learning reference results v0.5 will be released soon.
MLPerf benchmarks involve benchmark metrics, datasets, deep learning algorithms, and deep learning frameworks. MLPerf members execute deep learning algorithms on hardware devices, then record execution time,
deep learning algorithms, deep learning frameworks, and tested open datasets. Time is a critical metric for measuring MLPerf training or inference benchmarks [65]. Short run time is associated with high performance of deep learning devices. Benchmark datasets consist of image datasets, translation datasets, and recommendation datasets. ImageNet and COCO [66] are among the image datasets. WMT English-German [67] and MovieLens-20M [68] are translation and recommendation datasets, respectively. MLPerf benchmark frameworks are TensorFlow, PyTorch, MXNet, Intel Caffe, and Sinian. MLPerf deep learning algorithms benchmarked [69] include ResNet50-v1.5, MobileNet-v1, SSD-MobileNet, and SSD-ResNet34.
V. CONCLUSIONS
Deep learning has increased in popularity dramatically in recent years. This technology can be used in image classification, speech recognition, and language translation. In addition, deep learning technology is continually developing. Many innovative chipsets, useful frameworks, creative models, and big data sets are emerging, resulting in extending the markets and uses for deep learning.
While deep learning technology is expanding, it is useful to understand the dimensions and methods for measuring deep learning hardware and software. Benchmarking principles include representativeness, relevance, equity, repeatability, affordable cost, scalability, and transparency. Major deep learning hardware platform types include CPUs, GPUs, FPGAs, and ASICs. We discussed machine learning platforms, and mentioned approaches that enhance performance of these platforms. In addition, we listed 11 qualitative benchmarking features for comparing deep learning hardware.
AI algorithms often benefit from many-core hardware and high bandwidth memory, in comparison to many non-AI algorithms that are often encountered in practice [70]. Thus it is not just the computational power of hardware as a one-dimensional concept that makes it more (or less) suited to AI applications, but also the type of computations the hardware excels in. A hardware platform can have more or less computational power depending on the type of computation on which it is measured. GPUs (graphics processing units) often do comparatively well on the kind of parallelism often beneficial to AI algorithms, and thus tend to be well suited to AI applications. FPGAs, being configurable, can be configured to perform well on AI algorithms, although currently they lack the rich software layer needed to be as useful for this as they could become. ASICs are similar to FPGAs in this regard, since in principle a specially configured FPGA is a kind of ASIC.
Software frameworks for deep learning are diverse. We split 18 deep learning frameworks into three categories: mature deep learning frameworks, developing frameworks, and retired frameworks. Mature deep learning frameworks and developing frameworks impact academia and industry. So, we carefully compared the 16 frameworks through license types, compatible hardware devices, and tested deep learning algorithms.
Deep learning benchmarks can help link industry and academia. MLPerf is a new and preeminent deep learning benchmark organization. The organization offers benchmarking metrics, dataset evaluation, test codes, and result sharing.
VI. FUTURE WORK
Deep learning technology, including supporting hardware devices and software frameworks, is increasing in importance, so scientists and engineers are developing new hardware and creative frameworks. We are planning a website named Benchmarking Performance Suite (http://www.animpala.com/research.html) for collecting and updating results of benchmarking hardware and frameworks. Users will be able to access the website for sharing deep learning knowledge.
ACKNOWLEDGMENT
We are grateful to Google for partial support of this project in 2019.
REFERENCES
[1] J. Hare and P. Krensky, “Hype Cycle for Data Science and Machine Learning, 2018,” Gartner Company, 2018. [Online]. Available: https://www.gartner.com/doc/3883664/hype-cycle-data-science-machine.
[2] W. Dai, K. Yoshigoe, and W. Parsley, “Improving data quality through deep learning and statistical models,” in Advances in Intelligent Systems and Computing, 2018.
[3] W. Dai and N. Wu, “Profiling essential professional skills of chief data officers through topic modeling algorithms,” in AMCIS 2017 - America’s Conference on Information Systems: A Tradition of Innovation, 2017, vol. 2017-Augus.
[4] R. Keefer and N. Bourbakis, “A Survey on Document Image Processing Methods Useful for Assistive Technology for the Blind,” Int. J. Image Graph., 2015.
[5] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 2008.
[6] LeCun Yann, Cortes Corinna, and Burges Christopher, “THE MNIST DATABASE of handwritten digits,” Courant Inst. Math. Sci., 1998.
[7] Jia Deng, Wei Dong, R. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (VOC) challenge,” Int. J. Comput. Vis., 2010.
[9] O. E. D. Online, “Oxford English Dictionary Online,” Oxford English Dict., 2010.
[10] A. Rodriguez, E. Segal, E. Meiri, E. Fomenko, Y. J. Kim, and H. Shen, “Lower Numerical Precision Deep Learning Inference and Training,” Intel White Pap., 2018.
[11] D. Steinkraus, I. Buck, and P. Y. Simard, “Using GPUs for machine learning algorithms,” in Eighth International Conference on Document Analysis and Recognition (ICDAR’05), 2005.
[12] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, “GPU computing,” Proc. IEEE, vol. 96, no. 5, pp. 879–899, 2008.
[13] “Graphics Reinvented: NVIDIA GeForce RTX 2080 Ti Graphics Card,” NVIDIA. NVIDIA COMPANY.
[14] D. Galloway, “The Transmogrifier C hardware description language and compiler for FPGAs,” Proc. IEEE Symp. FPGAs Cust. Comput. Mach., 1995.
[15] G. Lacey, G. W. Taylor, and S. Areibi, “Deep Learning on FPGAs: Past, Present, and Future,” arXiv Prepr. arXiv1602.04283, 2016.
[16] E. Nurvitadhi et al., “Can FPGAs beat GPUs in accelerating next-generation deep neural networks?,” in FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017.
[17] K. Ovtcharov, O. Ruwase, J. Kim, J. Fowers, K. Strauss, and E. S.
Chung, “Accelerating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft Res. Whitepaper, 2015.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. Neural Inf. Process. Syst., 2012.
[19] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE J. Solid-State Circuits, 2017.
[20] Mxn. Developers, “Apache MXNet(incubating) - A Flexible and Efficient Library for Deep Learning,” Apache, 2018. [Online]. Available: https://mxnet.apache.org/.
[21] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE. 2017.
[22] “Google reveals more details about its second-gen TPU AI chips,” techcrunch. [Online]. Available: https://www.theinquirer.net/inquirer/news/3023202/google-reveals-more-details-about-its-second-gen-tpu-ai-chips.
[23] “Google announces a new generation for its TPU machine learning hardware,” techcrunch. [Online]. Available: https://techcrunch.com/2018/05/08/google-announces-a-new-generation-for-its-tpu-machine-learning-hardware/.
[24] BERTEN DSP, “GPU vs FPGA Performance Comparison,” 2016. [Online]. Available: http://www.bertendsp.com/pdf/whitepaper/BWP001_GPU_vs_FPGA_Performance_Comparison_v1.0.pdf.
[25] M. Parker, “Understanding Peak Floating-Point Performance Claims,” Intel FPGA White Paper, 2016.
[26] D. A. Patterson, “LATENCY LAGS BANDWITH.,” Commun. ACM, 2004.
[27] E. Vansteenkiste, “New FPGA Design Tools and Architectures,” Ghent University. Faculty of Engineering and Architecture, 2016.
[28] Y. Jia et al., “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 675–678.
[29] J. Xu, Z. Liu, J. Jiang, Y. Dou, and S. Li, “CaFPGA: An automatic generation model for CNN accelerator,” Microprocess. Microsyst., 2018.
[30] “Caffe2, GitHub Repository,” 2018. [Online]. Available: https://caffe2.ai/.
[31] C. Developers, “Chainer Repository,” GitHub repository, 2018. [Online]. Available: https://github.com/chainer.
[32] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a Next-Generation Open Source Framework for Deep Learning,” in Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015.
[33] T. Akiba, K. Fukuda, and S. Suzuki, “ChainerMN: Scalable Distributed Deep Learning Framework,” in Proceedings of Workshop on ML Systems in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.
[34] G. Neubig et al., “Dynet: The dynamic neural network toolkit,” arXiv Prepr. arXiv1701.03980, 2017.
[35] T. Chen et al., “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems,” arXiv Prepr. arXiv1512.01274, 2015.
[36] Mxn. J. Developers, “MXNetJS Deep Learning in Browser,” GitHub repository, 2018. [Online]. Available: https://github.com/dmlc/mxnet.js/.
[37] “GLUON,” 2018. [Online]. Available: https://gluon.mxnet.io/index.html.
[38] “Microsoft2018CNTK,” 2018. [Online]. Available: https://www.microsoft.com/en-us/cognitive-toolkit/.
[39] F. Seide and A. Agarwal, “CNTK: Microsoft’s open-source deep-learning toolkit,” in 22nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[40] J. Dean et al., “Large Scale Distributed Deep Networks,” Adv. Neural Inf. Process. Syst., 2012.
[42] M. Abadi et al., “TensorFlow : A System for Large-Scale Machine Learning,” Proc 12th USENIX Conf. Oper. Syst. Des. Implement., 2016.
[43] C. François, “Keras,” https://github.com/fchollet/keras, 2015. [Online]. Available: https://keras.io/.
[44] Chollet François, “Keras: The Python Deep Learning library,” keras.io. 2015.
[45] “Neon, GitHub Repository,” 2018. [Online]. Available: https://github.com/NervanaSystems/neon.
[46] C. Ng, “Announcing PlaidML: Open Source Deep Learning for Every Platform,” 2017.
[47] “PyTorch: An open source deep learning platform,” 2018. [Online]. Available: https://pytorch.org/.
[48] A. Paszke et al., “Automatic differentiation in PyTorch,” 31st Conf. Neural Inf. Process. Syst., 2017.
[49] “FastAI, GitHub Repository,” 2018. [Online]. Available: https://www.fast.ai/.
[50] “Theano,” 2018. [Online]. Available: http://deeplearning.net/software/theano/.
[51] R. Al-Rfou et al., “Theano: A Python framework for fast computation of mathematical expressions,” arXiv Prepr., 2016.
[52] “Lasagne, GitHub Repository,” 2018. [Online]. Available: https://github.com/Lasagne/Lasagne.
[53] B. Van Merriënboer et al., “Blocks and fuel: Frameworks for deep learning,” arXiv Prepr. arXiv1506.00619, 2015.
[54] B. C. Ooi et al., “SINGA: A Distributed Deep Learning Platform,” Proc. 23rd ACM Int. Conf. Multimed. - MM ’15, 2015.
[55] W. Wang et al., “SINGA : Putting Deep Learning in the Hands of Multimedia Users,” Multimedia, 2015.
[56] A. G. T. N. Daniel Dai Ted Dunning, “Apache SINGA,” Apache, 2018. [Online]. Available: https://singa.incubator.apache.org/.
[57] “BigDL,” Apache, 2018. [Online]. Available: https://github.com/intel-analytics/BigDL.
[58] Yiheng Wang et al., “BigDL: A Distributed Deep Learning Framework for Big Data,” arXiv Prepr. arXiv1804.05839, 2018.
[59] “Deeplearning4j: Open-source distributed deep learning for the jvm,” Apache Softw. Found. Licens., 2018.
[60] B. Company, “PaddlePaddle-based AI.” [Online]. Available: http://en.paddlepaddle.org/.
[61] “Torch, GitHub repository,” 2018. [Online]. Available: https://github.com/torch/torch7.
[62] “Purine, GitHub Repository,” 2018. [Online]. Available: https://github.com/purine/purine2.
[63] M. Lin, S. Li, X. Luo, and S. Yan, “Purine: A bi-graph based deep learning framework,” arXiv Prepr. arXiv1412.6249, 2014.
[64] Y. H. Lin, T. M. Ko, T. R. Chuang, and K. J. Lin, “Open source licenses and the creative commons framework: License selection and comparison,” J. Inf. Sci. Eng., 2006.
[65] C. Coleman et al., “Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark,” arXiv Prepr. arXiv1806.01427, 2018.
[66] T. Y. Lin et al., “Microsoft COCO: Common objects in context,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014.
[67] W.-2016 and 2017, “Third Conference on Machine Translation.” [Online]. Available: http://www.statmt.org/wmt18/WMT-2018.pdf.
[68] “GroupLens: Movielens-20m data sets.” [Online]. Available: https://grouplens.org/datasets/movielens/20m/.
[69] P. Mattson, “MLPerf Training Algorithms.” [Online]. Available: https://github.com/mlperf/training.
[70] G. Anadiotis, “AI chips for big data and machine learning: GPUs, FPGAs, and hard choices in the cloud and on-premise,” ZDNet, 2019. [Online]. Available: https://www.zdnet.com/article/ai-chips-for-big-data-and-machine-learning-gpus-fpgas-and-hard-choices-in-the-cloud-and-on-premise/.
[41] “Tensorflow: An open source library,” 2018. [Online]. Available:
https://www.tensorflow.org/.
TABLE II. COMPARING POPULAR DEEP LEARNING FRAMEWORKS

| # | Frameworks (a) | License Type (b) | Core Codes | API Codes | Hardware Devices | Reliability | Tested Networks | Related Datasets |
|---|---|---|---|---|---|---|---|---|
| 1 | BigDL | Apache 2.0 | C/C++ | Scala | CPU/GPU | Multi-Server | VGG, Inception, ResNet, GoogleNet | ImageNet, CIFAR-10 |
| 2 | Caffe/Caffe2 | BSD License | C/C++ | Python, C++, MATLAB | CPU/GPU/FPGA/Mobile | Multi-Server | LeNet, RNN | CIFAR-10, MNIST, ImageNet |
| 3 | Chainer | MIT License | C/C++ | Python | CPU/GPU | Multi-Server | RNN | CIFAR-10, ImageNet |
| 4 | DeepLearning4j | Apache 2.0 | Java | Java, Scala, Clojure, Python, Kotlin | CPU/GPU | Multi-Server | AlexNet, LeNet, Inception, ResNet, RNN, LSTM, VGG, Xception | ImageNet |
| 5 | DyNet | Apache 2.0 | C/C++ | C++, Python | CPU/GPU | Single Node | RNN, LSTM | ImageNet |
| 6 | FastAI | Apache 2.0 | Python | Python | CPU/GPU | Multi-Server | ResNet | CIFAR-10, ImageNet |
| 7 | Keras | MIT License | Python | Python, R | CPU/GPU | Multi-Server | CNN, RNN | CIFAR-10, MNIST |
| 8 | Microsoft CNTK | MIT License | C/C++ | C++, C#, Python, Java | CPU/GPU | Multi-Server | CNN, RNN, LSTM | CIFAR-10, MNIST, ImageNet, P-VOC |
| 9 | MXNet | Apache 2.0 | C/C++ | C++, Python, Clojure, Julia, Perl, R, Scala, Java, JavaScript, Matlab | CPU/GPU/Mobile | Multi-Server | CNN, RNN, Inception | CIFAR-10, MNIST, ImageNet, P-VOC |
| 10 | Neon | Apache 2.0 | Python | Python | CPU/GPU | Multi-Server | AlexNet, ResNet, LSTM | CIFAR-10, MNIST, ImageNet |
| 11 | PaddlePaddle | Apache 2.0 | C/C++ | Python | CPU/GPU/Mobile | Multi-Server | AlexNet, GoogleNet, LSTM | CIFAR-10, ImageNet |
| 12 | PlaidML | Apache 2.0 | C/C++ | Python, C++ | CPU/GPU | Multi-Server | Inception, ResNet, VGG, Xception, MobileNet, DenseNet, ShuffleNet, LSTM | CIFAR-10, ImageNet |
| 13 | PyTorch | BSD License | Python | Python | CPU/GPU | Multi-Server | AlexNet, Inception, ResNet, VGG, DenseNet, SqueezeNet | CIFAR-10, ImageNet |
| 14 | SINGA | Apache 2.0 | C/C++ | Python | CPU/GPU | Multi-Server | RNN, AlexNet, DenseNet, GoogleNet, Inception, ResidualNet, VGG | MNIST, ImageNet |
| 15 | TensorFlow | Apache 2.0 | C/C++ | Python, C++, Java, Go, JavaScript, Scala, Julia, Swift | CPU/GPU/TPU/Mobile | Multi-Server | AlexNet, Inception, ResNet, VGG, LeNet, MobileNet | CIFAR-10, MNIST, ImageNet |
| 16 | Theano | BSD License | Python | Python (Keras) | CPU/GPU | Multi-Server | AlexNet, VGG, GoogleNet | CIFAR-10, ImageNet |

(a) Frameworks are listed in alphabetical order.
(b) In the License Type column, Apache 2.0 means the Apache 2.0 license.