Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability
Jie-Fang Zhang, Member, IEEE, and Zhengya Zhang, Senior Member, IEEE
Deep neural network (DNN)-based machine learning (ML) methods have become the dominant way to solve problems in the fields of computer vision (CV), natural language processing (NLP), autonomous driving, and robotics [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. The effectiveness of DNN-based methods has led to a proliferation of DNN models, from AlexNet [12] in 2012 for object detection and image classification to GPT-3 [7] in 2020 for natural language processing. In the quest towards higher accuracy and expanded capabilities, newer models often grow in size and require more memory and computation. Fig. 1 shows the top-1 accuracy of modern DNN models along with their model size and complexity in terms of the number of parameters and operation counts. The evolution of these models is shown in Fig. 2.

The widespread use of DNNs has made DNN computation a workload class of its own. General-purpose graphics processing units (GPUs) and central processing units (CPUs) equipped with large compute parallelism and memory bandwidth are popular hardware platforms for accelerating DNN workloads in servers and clouds, but GPUs and CPUs are not the most suitable for edge use cases due to their high cost and energy consumption. To fill the void, designing domain-specific accelerators
Figure 1. Top-1 accuracy, size, and complexity of modern DNN models. Adapted from [9] ©2018 IEEE.
Jie-Fang Zhang was with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA. He is now
with NVIDIA Corporation, Santa Clara, CA 95051 USA (e-mail: [email protected]).
Zhengya Zhang is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA (e-mail:
[email protected]).
II. Background
In general, we can broadly categorize DNN models into four types based on their network structure and computation: 1) multi-layer perceptron (MLP), 2) convolutional neural network (CNN), 3) recurrent neural network (RNN), and 4) transformer. Here, we present the core computations of these model types, illustrated in Fig. 3.

Figure 2. Evolution of model size in the fields of (a) CV and (b) NLP. Adapted from [11].
A. SIMD Architecture
In general, a single instruction multiple data (SIMD) ar-
chitecture consists of an array of parallel processing
elements (PEs) or functional units (FUs) and performs
vector operations across an array of data. Only one in-
struction is decoded and issued to trigger the computa-
tion on multiple data across the array of PEs. Fig. 4(a)
illustrates the SIMD architecture for vector processing.
A SIMD array can be used to compute the dot-product
between two data vectors. Each PE receives a pair of
data from the memory or register file for multiplication,
then the result from each PE is written back to the memory for the next summation instruction. Alternatively, the results may be directly summed using an adder tree.

Figure 3. Core computations of DNNs: (a) vector-matrix multiplication in MLP and RNN, (b) 2D convolution in CNN, and (c) multi-head attention in transformers.
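To make the mapping concrete, here is a minimal Python sketch of the SIMD dot-product flow described above; the function name, the 8-wide PE array, and the tiling loop are our own illustrative choices, not a model of any particular processor.

```python
import numpy as np

def simd_dot_product(a, b, num_pes=8):
    """Model a dot product on a SIMD array of `num_pes` PEs.

    Each "instruction" applies the same operation to every PE:
    1) a vector multiply produces one partial product per PE,
    2) an adder tree (modeled as pairwise sums) reduces them.
    """
    assert len(a) == len(b)
    acc = 0.0
    # Vectors longer than the PE array are processed in tiles.
    for start in range(0, len(a), num_pes):
        # One SIMD multiply instruction: all PEs operate in lockstep.
        products = np.asarray(a[start:start + num_pes]) * np.asarray(b[start:start + num_pes])
        # Adder-tree reduction: log2(num_pes) levels of pairwise adds.
        while len(products) > 1:
            if len(products) % 2:            # pad an odd-length level with 0
                products = np.append(products, 0.0)
            products = products[0::2] + products[1::2]
        acc += products[0]
    return acc

# Example: a 16-element dot product on an 8-wide SIMD array.
x = np.arange(16, dtype=np.float64)
w = np.ones(16)
print(simd_dot_product(x, w, num_pes=8))     # 120.0
```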
Figure 4. Illustration of the (a) SIMD array architecture, (b) matrix-matrix multiplication (MMM) operation, and (c) convolution (CONV) layer operations on SIMD array.
Figure 5. Illustration of an example of SIMD architecture in Nvidia A100 GPU. Adapted from [19].

Figure 6. Illustration of the (a) systolic array architecture and (b) PE architecture in the systolic array.
Figure 7. Illustration of the operations on systolic array: (a) input and weight matrices, (b) weight data configuration, (c) input streaming (early-stage), (d) input streaming (general), and (e) output collection.
Table 1. Processing architecture summary.

                        SIMD Array                         Systolic Array
  Architecture          1D/2D PE array with                2D PE array with
                        shared instructions                neighboring connectivity
  Operations            VMM, MMM                           MMM
  Data movement         More memory access                 Mostly local data movement
  Compute density       Lower                              Higher
  Flexibility           Higher                             Lower
  Hardware utilization  Higher                             Lower
D. Single-Operand Sparsity
Some of the earliest sparse ar-
chitectures leveraged sparsity
from either IA, e.g., Cnvlutin [37],
or W, e.g., Cambricon-X [38], but
not both. By limiting the support
to single-operand sparsity, these
designs could adopt an existing
dense DNN accelerator architec-
ture and dataflow [39], and add
a frontend to discover IA-W pairs
for computation. Fig. 13 shows the
frontend designs for Cnvlutin [37]
and Cambricon-X [38]. Both used
indirect access to fetch dense data (W in Cnvlutin, IA in Cambricon-X) using the indices of nonzero data (IA in Cnvlutin, W in Cambricon-X) decoded from the compressed format.

Cnvlutin supports IA sparsity, where the IA data are compressed in the COO format, as illustrated in Fig. 13(a). For each nonzero IA data, an IA offset is stored to represent the original location of the IA data in the uncompressed format. To discover IA-W pairs, the IA offset is used as the index to fetch W data from the W data array.

Cambricon-X supports W sparsity, where the W data are compressed in the RLC format. For each W data, a W step index stores the number of zeros preceding it, i.e., the run length, as shown in Fig. 13(b). To discover IA-W pairs, the run lengths are accumulated to recover the original W locations, which are then used as indices to fetch the matching IA data.

Figure 13. Sparse architectures for single-operand sparsity: (a) Cnvlutin adapted from [37] and (b) Cambricon-X adapted from [38].
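In software terms, the two frontends can be sketched as follows; this is a simplified illustration based on the COO-offset and run-length descriptions above, with variable names and list-based formats of our own choosing rather than the designs' actual data structures.

```python
import numpy as np

# Cnvlutin-style frontend: IA sparsity, COO-like offsets. Each nonzero input
# activation (IA) carries its original position, which indexes the dense W array.
def ia_sparse_dot(ia_values, ia_offsets, w_dense):
    acc = 0.0
    for v, off in zip(ia_values, ia_offsets):
        acc += v * w_dense[off]              # indirect fetch of the dense W operand
    return acc

# Cambricon-X-style frontend: W sparsity, run-length coding. Each nonzero weight
# stores the number of zeros preceding it; accumulating run lengths recovers the
# original W positions, which index the dense IA array.
def w_sparse_dot(w_values, w_steps, ia_dense):
    acc, pos = 0.0, -1
    for v, step in zip(w_values, w_steps):
        pos += step + 1                      # accumulate run lengths -> W index
        acc += v * ia_dense[pos]             # indirect fetch of the dense IA operand
    return acc

# Example: the same dot product computed both ways.
ia = np.array([0.0, 2.0, 0.0, 0.0, 1.0, 0.0, 3.0, 0.0])
w  = np.array([5.0, 0.0, 0.0, 4.0, 1.0, 0.0, 0.0, 2.0])
dense = float(ia @ w)

# IA compressed as (values, offsets): nonzeros at positions 1, 4, 6.
print(ia_sparse_dot([2.0, 1.0, 3.0], [1, 4, 6], w) == dense)           # True
# W compressed as (values, run lengths): nonzeros at positions 0, 3, 4, 7.
print(w_sparse_dot([5.0, 4.0, 1.0, 2.0], [0, 2, 0, 2], ia) == dense)   # True
```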
Figure 19. Processing mechanism of Nvidia A100 GPU for fine-grained structured sparse model weights. Adapted from [19].
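For context on Fig. 19, the sketch below mimics a 2:4 fine-grained structured-sparsity scheme: each group of four weights keeps at most two nonzeros, stored as compressed values plus small metadata indices that select the matching activations. This is our own illustration of the format, not NVIDIA's implementation.

```python
import numpy as np

def compress_2to4(w_row):
    """Compress a weight row with 2:4 structured sparsity: in every group of
    four, keep the two largest-magnitude values and record their positions
    as 2-bit metadata."""
    values, meta = [], []
    for g in range(0, len(w_row), 4):
        group = w_row[g:g + 4]
        keep = sorted(np.argsort(np.abs(group))[-2:])   # indices of the 2 kept weights
        values.extend(group[i] for i in keep)
        meta.extend(keep)                               # one small index per kept weight
    return np.array(values), np.array(meta)

def sparse_dot_2to4(values, meta, activations):
    """Multiply compressed weights with the activations selected by metadata."""
    acc = 0.0
    for j in range(0, len(values), 2):
        g = (j // 2) * 4                                # start of this group of four
        acc += values[j] * activations[g + meta[j]]
        acc += values[j + 1] * activations[g + meta[j + 1]]
    return acc

# Example: a weight row that already satisfies the 2:4 pattern.
w = np.array([0.5, 0.0, -1.2, 0.0,  0.0, 0.3, 0.0, 2.0])
a = np.arange(8, dtype=float)
vals, meta = compress_2to4(w)
print(np.isclose(sparse_dot_2to4(vals, meta, a), w @ a))   # True
```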
Figure 23. Illustration of the concept of integrating an FPGA with the PETRA chiplet. Figure reused from [55] ©2021 IEEE.
Acknowledgment
This work was supported in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) Program sponsored by DARPA.

References
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] K. He et al., “Deep residual learning for image recognition,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[3] C. Szegedy et al., “Rethinking the inception architecture for computer vision,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[4] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[5] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 6000–6010.
[6] J. Devlin et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Language Technol., vol. 1, Jun. 2019, pp. 4171–4186.
[7] T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2020, pp. 1877–1901.
[8] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” 2017, arXiv:1605.07678.

Jie-Fang Zhang (Member, IEEE) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2015, and the M.S. degree in computer science and engineering and the Ph.D. degree in electrical and computer engineering from the University of Michigan, Ann Arbor, MI, USA, in 2018 and 2022, respectively. He joined NVIDIA in 2022 as a Deep Learning