
Feature


Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability
Jie-Fang Zhang, Member, IEEE, and Zhengya Zhang, Senior Member, IEEE

Abstract—The widespread use of deep neural networks (DNNs) and DNN-based machine learning (ML) methods justifies DNN computation as a workload class itself. Beginning with a brief review of DNN workloads and computation, we provide an overview of single instruction multiple data (SIMD) and systolic array architectures. These two basic architectures support the kernel operations for DNN computation, and they form the core of many flexible DNN accelerators. To enable a higher performance and efficiency, sparse DNN hardware can be designed to gain from data sparsity. We present common approaches, from compressed storage to processing sparse data, to reduce memory and bandwidth usage and improve energy efficiency and performance. To accommodate the fast evolution of new models of larger size and higher complexity, modular chiplet integration can be a promising path to meet the growing needs. We show recent work on homogeneous tiling and heterogeneous integration to scale up and scale out hardware to support larger models of more complex functions.

Index Terms—ML hardware, DNN accelerator, sparse DNN architecture, DNN chiplet, heterogeneous integration.

Digital Object Identifier 10.1109/MCAS.2023.3302390
Date of current version: 11 October 2023

I. Introduction

Deep neural network (DNN)-based machine learning (ML) methods have become the dominant way to solve problems in the fields of computer vision (CV), natural language processing (NLP), autonomous driving, and robotics [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. The effectiveness of DNN-based methods has led to a proliferation of DNN models, from AlexNet [12] in 2012 for object detection and image classification to GPT-3 [7] in 2020 for natural language processing. In the quest towards higher accuracy and expanded capabilities, newer models often grow in size and require more memory and computation. Fig. 1 presents the accuracy of modern network models along with their model size and complexity in terms of number of parameters and operation counts. The evolution of these models is shown in Fig. 2.

The widespread use of DNNs has made DNN computation a workload class of its own. General-purpose graphics processing units (GPUs) and central processing units (CPUs) equipped with large compute parallelism and memory bandwidth are popular hardware platforms for accelerating DNN workloads in servers and clouds, but GPUs and CPUs are not the most suitable for edge use cases due to their high cost and energy consumption. To fill the void, designing domain-specific accelerators for DNN workloads is important to answer new application needs. A prime example of a domain-specific accelerator for DNNs is Google's TPU [13].

Figure 1. Top-1 accuracy, size, and complexity of modern DNN models. Adapted from [9] ©2018 IEEE.

Jie-Fang Zhang was with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA. He is now
with NVIDIA Corporation, Santa Clara, CA 95051 USA (e-mail: [email protected]).
Zhengya Zhang is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA (e-mail:
[email protected]).

Designing DNN-based ML accelerators has been a rapidly growing field. In general, we identify three major challenges that need to be addressed in designing these ML accelerators.
■ Flexibility: The design needs to be flexible enough to support a variety of computation types and models in the DNN workload class, not only for the current generation of DNN models but also for future generations, as the models evolve more quickly than hardware upgrades.
■ Efficiency: The design needs to optimize both the processing and the memory access to provide a competitive advantage over GPUs and CPUs and to answer new application needs.
■ Scalability: The design needs to provide a way to support larger models with higher memory and computation requirements, as well as new variations of current models, to remain relevant.
In this review article, we discuss three important directions to address the computation challenges in supporting modern ML models and workloads. First, we describe the common processing architectures and the data reuse opportunities for ML computation. Then, we present the benefit of exploiting data-level sparsity to improve computation efficiency. Lastly, we provide an overview of scale-up and scale-out approaches to answer the scalability challenge.
This article is organized as follows. In Section II, we present the primary types of computation used in ML and DNN workloads. We then describe two common processing architectures for DNN computation and common stationary dataflows to exploit data reuse in Section III. To gain better performance and efficiency, an effective approach is to exploit data sparsity, which is explained in Section IV along with examples of sparse compression formats and sparse architectures. To scale up designs and scale out their functionalities, a chiplet-based approach can be effectively employed. We review examples of homogeneous tiling and heterogeneous integration of chiplets in Section V to demonstrate promising recent results. Finally, we conclude this article in Section VI.

Figure 2. Evolution of model size in the fields of (a) CV and (b) NLP. Adapted from [11].

II. Background
In general, we can broadly categorize DNN models into four types based on their network structure and computation: 1) multi-layer perceptron (MLP), 2) convolutional neural network (CNN), 3) recurrent neural network (RNN), and 4) transformer. Here, we present the high-level structure of each model and explain its core computation.

A. Multi-Layer Perceptron (MLP)
An MLP consists of multiple feedforward fully-connected (FC) layers cascaded one after another. The computation of an FC layer can be formulated as a vector-matrix multiplication (VMM) between the input vector x ∈ ℝ^C and the weight matrix W ∈ ℝ^(K×C) to obtain the output vector y ∈ ℝ^K, as described in Fig. 3(a).
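As a concrete illustration, the short NumPy sketch below (with hypothetical layer sizes) expresses one FC layer as a VMM; the array shapes mirror the notation above.

```python
import numpy as np

C, K = 256, 64                 # input and output feature sizes (illustrative)
x = np.random.randn(C)         # input vector, x in R^C
W = np.random.randn(K, C)      # weight matrix, W in R^(K x C)

y = W @ x                      # FC layer as a vector-matrix multiplication, y in R^K
y = np.maximum(y, 0.0)         # optional ReLU nonlinearity between layers
```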

B. Convolutional Neural Networks (CNNs)
CNNs are mostly specialized for 2D image processing in vision applications, e.g., image classification, object detection, and semantic segmentation [2], [3], [4], [12], [14], [15], [16], [17], [18]. A CNN uses convolution (CONV) layers for spatial feature extraction and FC layers for feature classification. The input and output of a layer are often referred to as the input activation (IA) and output activation (OA). A CONV layer has a weight (W) of size R × S × C × K, which can be understood as K 3D kernels of size R × S × C. A CONV layer takes an IA of size H × W × C and performs 2D convolutions between the IA and the K 3D kernels to obtain an OA of size H × W × K, as shown in Fig. 3(b). The model hyperparameters C and K are the input and output channel sizes, respectively. The output channel size is also known as the (weight) kernel number.
The 2D convolutions can be viewed as drawing an R × S × C cube inside the IA's H × W × C volume and sliding it across the IA's volume to obtain cubes of IA values. For each cube, the dot-product between the cube and each of the K R × S × C kernels is performed to obtain K OA values. Therefore, each 2D convolution can be formulated as a VMM with the R × S × C IA cube reshaped into a vector, and the K R × S × C kernels reshaped into a matrix of K vectors.
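A minimal sketch of this sliding-window view, assuming unit stride and no padding (so the output is slightly smaller than H × W): each R × S × C window of the IA is flattened into a vector and multiplied against the K flattened kernels.

```python
import numpy as np

H, W_, C, R, S, K = 8, 8, 3, 3, 3, 4      # illustrative layer shapes
ia = np.random.randn(H, W_, C)            # input activation (IA)
w = np.random.randn(R, S, C, K)           # K kernels of size R x S x C

Ho, Wo = H - R + 1, W_ - S + 1            # output size for unit stride, no padding
oa = np.zeros((Ho, Wo, K))                # output activation (OA)

w_mat = w.reshape(R * S * C, K)           # kernels reshaped into a matrix of K columns
for y in range(Ho):
    for x in range(Wo):
        cube = ia[y:y+R, x:x+S, :].reshape(-1)   # one R x S x C IA cube as a vector
        oa[y, x, :] = cube @ w_mat               # one VMM produces K OA values
```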
C. Recurrent Neural Networks (RNNs)
An RNN and its more popular variants, the gated recurrent unit (GRU) and long short-term memory (LSTM), are used for sequence processing in speech recognition, keyword detection, and natural language processing. An RNN uses recurrent connections to process the input sequence of the current timestep t and the output sequence from the previous timestep t − 1. An LSTM uses input, output, and forget gates, and a cell, i.e., i, f, o, c, to keep track of features that are relevant in the long term, and it improves accuracy over traditional recurrent units. The computation of an LSTM can also be formulated as a VMM (Fig. 3(a)), where the input vectors are the input sequence xt and the hidden sequence ht−1, the matrix is the concatenation of the i, f, o, c matrices with respect to the input or hidden sequences, and the output vector is the hidden sequence ht.

D. Transformers
Recently, transformer architectures that use self-attention and multi-head attention mechanisms are achieving increasingly better performance compared to traditional LSTMs in sequence and language applications [5], [6], [7] and CNNs in vision applications [4]. The multi-head attention computation is described in Fig. 3(c). First, a feed-forward operation, or matrix multiplication, is applied to the input sequence to obtain the key (K), query (Q), and value (V) matrices using its weights WK, WQ, and WV, respectively. The K, Q, V matrices are split into smaller matrices for multi-head attention. Each attention block of the multi-head attention performs self-attention on its K, Q, V matrices. The outputs from each attention block are concatenated, then another feed-forward operation is applied to obtain the final output sequence. The whole computation can also be mapped into a series of matrix-matrix multiplications (MMM).
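The NumPy sketch below (single batch, illustrative sizes) walks through the same steps as a series of MMMs. The scaling by sqrt(d) and the softmax follow the standard formulation in [5], which the text above summarizes but does not spell out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, D, heads = 16, 64, 4                    # sequence length, model width, attention heads
d = D // heads                             # per-head width
X = np.random.randn(T, D)                  # input sequence
WQ, WK, WV = (np.random.randn(D, D) for _ in range(3))
WO = np.random.randn(D, D)                 # output projection

Q, K, V = X @ WQ, X @ WK, X @ WV           # feed-forward (MMM) to obtain Q, K, V
outs = []
for h in range(heads):                     # split into smaller per-head matrices
    q, k, v = (M[:, h*d:(h+1)*d] for M in (Q, K, V))
    attn = softmax(q @ k.T / np.sqrt(d))   # self-attention scores: one MMM + softmax
    outs.append(attn @ v)                  # weighted sum of values: another MMM
Y = np.concatenate(outs, axis=1) @ WO      # concatenate heads, final feed-forward
```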

III. Classic DNN Processing Architectures and Dataflows
Single instruction multiple data (SIMD) and systolic arrays are the basic architectures for computing VMM and MMM. These two architectures and their variants form the core of most DNN accelerators. In the following, we review the two basic architectures and the common dataflows for performing the computation.

A. SIMD Architecture
In general, a single instruction multiple data (SIMD) architecture consists of an array of parallel processing elements (PEs) or functional units (FUs) and performs vector operations across an array of data. Only one instruction is decoded and issued to trigger the computation on multiple data across the array of PEs. Fig. 4(a) illustrates the SIMD architecture for vector processing. A SIMD array can be used to compute the dot-product between two data vectors. Each PE receives a pair of data from the memory or register file for multiplication; then the result from each PE is written back to the memory for the next summation instruction. Alternatively, the results may be directly summed using an adder tree.

Figure 3. Core computations of DNNs: (a) vector-matrix multiplication in MLP and RNN, (b) 2D convolution in CNN, and (c) multi-head attention in transformers.

The SIMD architecture is flexible and can be programmed to support diverse computation. In particular, we illustrate how the VMM, MMM, and CONV operations can be mapped onto a SIMD array for DNN processing.
VMM and MMM Operations: Fig. 4(b) shows the MMM operation between an input matrix and a weight matrix on a SIMD array. For an MMM operation, the vectors of the input and weight matrices are loaded into the SIMD array's memory to prepare for processing. During processing, corresponding input and weight data pairs are accessed from the memory and dispatched to the PEs for processing. The input and weight vectors in the memory can be cached to exploit temporal locality throughout the entire computation. Similarly, for the VMM operation, the input vector is loaded and cached, and is paired across all weight vectors for processing.
CONV Operation: Fig. 4(c) shows the CONV operation between an H × W × C input activation (IA) and an R × S × C convolution weight (W) on a SIMD array. Due to the sliding-window nature of the CONV operation, the IA and W are first converted into 2D matrices: 1) the IA is converted to a (X × Y) × (C × R × S) matrix, where each row matches the size of an R × S × C weight kernel, and the IA values are partially replicated from one row to the next to correspond to the sliding window moving from one step to the next; and 2) the W is converted to a (C × R × S) × K matrix, where each column corresponds to a weight kernel. Note that the process of converting IAs and Ws in CONV into inputs and weights in MMMs, respectively, is often referred to as the Im2Col operation. An MMM operation can then be performed between the input and the weight matrix to produce the output matrix of size (X × Y) × K.

Figure 4. Illustration of the (a) SIMD array architecture, (b) matrix-matrix multiplication (MMM) operation, and (c) convolution (CONV) layer operations on a SIMD array.

The advantage of a SIMD architecture is its flexibility and programmability to support diverse workloads. In general, for computation with straightforward data parallelism, e.g., VMM and MMM, a high compute utilization can be achieved. The SIMD architecture also provides opportunities to reuse weights or inputs by keeping them stationary in the PEs' registers to reduce memory access. However, the memory size and bandwidth have to scale with the number of PEs in order to support the full vector processing.
In general, a 1D SIMD array can be tiled into a 2D SIMD array, where input and weight loading from the external memory can be shared between 1D tiles to reduce the bandwidth requirement while scaling up the total number of PEs for processing.
The SIMD architecture is used extensively in CPUs and GPUs [19], [20]. Fig. 5 shows a sub-partition of a streaming multiprocessor (SM) in Nvidia's A100 GPU [19]. The SM contains functional units (or CUDA cores) for arithmetic computation. In each cycle, an instruction is issued to a set of CUDA cores for parallel execution.

Figure 5. Illustration of an example of the SIMD architecture in the Nvidia A100 GPU. Adapted from [19].
B. Systolic Array Architecture
A systolic array consists of a regular 2D array of PEs where each PE is connected to its immediate neighbors. Fig. 6(a) presents the architecture of the systolic array. The inputs are sent to the PEs through one border of the PE array, e.g., the leftmost column, and the intermediate results are propagated across the PEs, e.g., horizontally to the right and vertically to the bottom. Finally, the outputs are sent out through another end of the PE array, e.g., the bottom row.

Figure 6. Illustration of the (a) systolic array architecture and (b) PE architecture in the systolic array.

A systolic array's PE microarchitecture and dataflow are illustrated in Fig. 6(b). A PE is commonly designed with a multiplier to compute the product of an incoming input and a cached weight value, and an adder to sum the computed product and an incoming partial sum (psum). The updated psum is sent vertically to the next PE down, and the input is propagated horizontally to the next PE on the right.
To prepare for an MMM operation on a systolic array, the weights are first loaded into the array. The weight data are split into column vectors as shown in Fig. 7(a). Each vector is streamed to and stored in the corresponding column of the PE array, as shown in Fig. 7(b).
The steps of an MMM operation are illustrated in Fig. 7(c)–(e). The input matrix is split into row vectors that are streamed sequentially to the PE array, as shown in Fig. 7(c). The inputs propagate from left to right, passing through one PE per clock cycle. When an input enters a PE, the PE computes the product between the input and the cached weight, and sums the product with the psum that enters from the top. Following the computation, the PE passes the input to the next PE on the right and the updated psum to the next PE down. Note that the inputs to the rows of PEs must be arranged with a one-cycle delay from one row to the next to ensure the correct psum accumulation. Data move through the systolic array in waves, and the wavefront propagates diagonally across the systolic array. The outputs are collected from the bottom row of PEs as shown in Fig. 7(e). The computation latency of an H × W systolic array is H + W − 1 cycles.
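To make the wavefront behavior concrete, the sketch below simulates a small weight-stationary systolic array cycle by cycle. It is a simplified behavioral model under the assumptions stated above: inputs are skewed by one cycle per row, inputs move right and psums move down by one PE per cycle, and the bottom row emits finished outputs.

```python
import numpy as np

N, M, T = 3, 3, 4                       # array rows/cols, number of input vectors
WGT = np.random.randn(N, M)             # weights cached in the PEs (weight stationary)
IN = np.random.randn(T, N)              # input vectors streamed row by row

in_reg = np.zeros((N, M))               # input value held in each PE
psum = np.zeros((N, M))                 # partial sum leaving each PE
OUT = np.zeros((T, M))

total_cycles = T + N + M - 2            # enough cycles to drain the array
for cycle in range(total_cycles):
    new_in = np.zeros((N, M))
    new_psum = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            # Input arriving at PE(i, j): from the left neighbor, or, for the
            # leftmost column, from the row-skewed input stream (one cycle per row).
            if j == 0:
                t = cycle - i
                x = IN[t, i] if 0 <= t < T else 0.0
            else:
                x = in_reg[i, j - 1]
            p_in = psum[i - 1, j] if i > 0 else 0.0   # psum from the PE above
            new_in[i, j] = x
            new_psum[i, j] = p_in + x * WGT[i, j]     # MAC inside the PE
    in_reg, psum = new_in, new_psum
    # The bottom row emits one finished output element per column (when valid).
    for j in range(M):
        t = cycle - (N - 1) - j
        if 0 <= t < T:
            OUT[t, j] = psum[N - 1, j]

assert np.allclose(OUT, IN @ WGT)       # matches the MMM result
```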
A systolic array allows efficient weight reuse. In a systolic array, the transfers of psums and inputs are restricted to efficient movements between neighboring PEs. Due to weight reuse and efficient data movements within the array, a systolic array requires a lower data bandwidth. A systolic array is a data-driven architecture and has a low control overhead. These factors contribute to its high compute density.

Compared to a SIMD array, a systolic array is pre-wired for a defined dataflow and is thus less flexible. A large systolic array provides a higher computation capacity, but it may also suffer from low utilization when the operations do not utilize the entire array. Long latency is another drawback of a large systolic array.
The TPU [13] is an example of a systolic array. The TPU is designed with a matrix multiply unit (MMU) that consists of a systolic array of 256 × 256 PEs. Fig. 8(a) and (b) show the TPU's system architecture and the dataflow in the MMU, respectively. The weight data are loaded from the weight FIFO into the MMU, and a systolic data setup module organizes the input data to ensure proper accumulation of the psums in the MMU. The MMU operates similarly to the description above, where inputs are streamed in from left to right and the psum accumulation happens vertically across columns.

Figure 7. Illustration of the operations on a systolic array: (a) input and weight matrices, (b) weight data configuration, (c) input streaming (early stage), (d) input streaming (general), and (e) output collection.

Figure 8. Illustration of the TPU (a) system architecture and (b) matrix-multiply engine architecture. Adapted from [13] ©2017 ACM.

Table 1. Processing architecture summary.

                         SIMD Array                          Systolic Array
Architecture             1D/2D PE array with                 2D PE array with
                         shared instructions                 neighboring connectivity
Operations               VMM, MMM                            MMM
Data movement            More memory access                  Mostly local data movement
Compute density          Lower                               Higher
Flexibility              Higher                              Lower
Hardware utilization     Higher                              Lower

Table 1 compares the SIMD architecture to the systolic array architecture. A SIMD array can be in the form of a 1D PE array to support VMM operations, and it can also be scaled up to a 2D PE array to support MMM operations. A systolic array is commonly designed as a 2D PE array to support MMM operations. In terms of data movement, a SIMD array needs to access memory to feed all of its PEs, whereas a systolic array can rely on neighboring PE connections to reduce the bandwidth requirement and the number of memory accesses. A systolic array has a lower control overhead and can be easily scaled up. As such, a systolic array provides a higher compute density than a SIMD array. On the other hand, a systolic array is often designed for a fixed computation, whereas a SIMD array is more flexible to support a wide range of operations. The higher flexibility of a SIMD array over a systolic array leads to a higher hardware utilization, e.g., in flexibly handling inputs of various shapes.

Figure 9. Illustration of (a) weight-stationary dataflow and (b) output-stationary dataflow. Adapted from [21] ©2017 IEEE.

C. Stationary Dataflows
DNN computation can be mapped onto the processing architectures in two common ways: weight stationary (WS) and output stationary (OS). These mapping methods provide data reuse opportunities and dictate the computation dataflows.
Weight Stationary (WS) Dataflow: In the WS dataflow, a PE stores a weight locally and reuses it for MAC computation with as many inputs as possible. The WS dataflow can effectively reduce the number of memory accesses required to fetch weights from memory, leading to a lower memory bandwidth and a lower power consumption. An example WS dataflow on a systolic array architecture is illustrated in Fig. 9(a). In the example, the weights (W0, W1, W2, and W3) are cached locally in the PEs. Inputs are accessed from memory and sent to the corresponding PEs for computation. The computed psums are passed along the PE array for accumulation. Lastly, the output data are written back to memory. The systolic array adopts the WS dataflow to reuse cached weights across all inputs. Jouppi et al. [13] is an example of an ML accelerator that adopts the WS dataflow.

Output Stationary (OS) Dataflow: In the OS dataflow, a PE stores and accumulates a psum locally. The OS dataflow can effectively reduce the amount of reading and writing of psums from and to the memory. An example OS dataflow mapped on a PE array is illustrated in Fig. 9(b). In this example, each PE accumulates a psum, P0, P1, P2, or P3, locally. In each cycle, new weights are fetched and sent to the PEs, and one input is broadcast across all PEs. Each PE computes a MAC and updates its local psum. Upon completion, the output data are written back from the PEs to memory. Du et al. [22] and Deng et al. [23] are examples of ML accelerators that adopt the OS dataflow.
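The difference between the two dataflows can be seen in the loop order of a simple VMM. The sketch below is illustrative only, with scalar loops standing in for PE operations: the WS ordering keeps a weight fixed in the inner loop, while the OS ordering keeps a psum fixed.

```python
import numpy as np

K, C, N = 4, 8, 5                     # output size, input size, number of input vectors
W = np.random.randn(K, C)
X = np.random.randn(N, C)

# Weight-stationary: each weight W[k][c] is fetched once and reused across all N inputs.
Y_ws = np.zeros((N, K))
for k in range(K):
    for c in range(C):
        w = W[k, c]                   # weight stays resident in the PE
        for n in range(N):
            Y_ws[n, k] += X[n, c] * w # psums travel to/from memory (or neighboring PEs)

# Output-stationary: each psum Y[n][k] stays local until it is complete.
Y_os = np.zeros((N, K))
for n in range(N):
    for k in range(K):
        acc = 0.0                     # psum stays resident in the PE
        for c in range(C):
            acc += X[n, c] * W[k, c]  # weights and inputs are streamed in each cycle
        Y_os[n, k] = acc

assert np.allclose(Y_ws, X @ W.T) and np.allclose(Y_os, X @ W.T)
```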
IV. Sparse Architecture
The continued growth of model size and complexity has motivated research efforts in leveraging data sparsity to reduce the compute and storage requirements. In this section, we present an overview of network sparsity and how to exploit it for more efficient processing.

A. Sparsity in Neural Networks
The sparsity in a network comes from both the model's weights (Ws) and input activations (IAs). For the model weights, network pruning and other sparsification techniques can be used to zero out a large number of weights in a model with only a small inference accuracy drop [24], [25], [26], [27], [28], [29]. For the input activations, some commonly used operators like the rectified linear unit (ReLU) clamp all negative activations to zero, resulting in sparsity in the output activations (OA), which become the input activations (IA) of the next layer.

Figure 10. Convolution computation between unstructured sparse IA and W in a sparse DNN. The colored cells indicate nonzero entries, and the white cells indicate zero entries. Adopted from [31] ©2021 IEEE.

A CONV computation with IA and W sparsity is illustrated in Fig. 10. With network pruning [24], the typical W density (nonzero data over all data) ranges from 40% to 50% and the IA density ranges between 30% and 55% for well-known models, e.g., AlexNet, VGG-16, and ResNet-50 [30]. IA and W densities of up to 38% and 4%, respectively, are achieved by [24] on the FC layers of VGG-16. The CONV layers can be pruned down to 19% and 22% density for IA and W, respectively. Zhang et al. [26] reported a 95% W sparsity on AlexNet using ADMM. For an IA and a W with 50% density each, because the nonzero W and IA are nearly randomly distributed, the amount of effectual computation, i.e., computation that does not involve a zero, is only about 25%.
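The ~25% figure follows from treating the two densities as independent: the probability that a product has two nonzero operands is roughly the product of the densities. A quick numerical check under that simplifying assumption (random elementwise products standing in for a convolution):

```python
import numpy as np

rng = np.random.default_rng(0)
d_ia, d_w = 0.5, 0.5                              # 50% density each
ia = rng.random((64, 64)) * (rng.random((64, 64)) < d_ia)
w = rng.random((64, 64)) * (rng.random((64, 64)) < d_w)

effectual = np.count_nonzero(ia * w)              # products with both operands nonzero
total = ia.size
print(effectual / total)                          # ~0.25, i.e., only ~25% effectual MACs
```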
There are multiple benefits to exploiting sparsity in designing DNN compute. First, data sparsity can be exploited to save power. Accelerators such as Eyeriss [32] gate the computation, e.g., by turning off the clock, whenever a zero in the IA is detected during processing. This technique can effectively reduce the power consumption during DNN processing and can be conveniently incorporated into existing dense DNN accelerators. However, the throughput remains the same since PEs become idle during ineffectual computation.
Second, data sparsity can be used to reduce off-chip memory storage and bandwidth usage. The sparse W and IA can be stored in a compressed format with only nonzero elements. They are loaded and decompressed for computation. The compressed storage reduces the storage size and memory bandwidth. However, the decompression can be difficult to parallelize and costly in power and area, leading to a bottleneck and additional overhead for DNN processing.
Lastly, data sparsity can be used to reduce latency by skipping ineffectual computation. During processing, IA-W pairs are identified by searching through the sparse IA and W data and sent to the compute units. The search step avoids wasting time on unnecessary computation, resulting in significant latency savings. State-of-the-art sparse DNN accelerators [31], [33], [34], [35], [36] process data directly in the compressed form, offering both low memory bandwidth and a high degree of acceleration. However, supporting sparse processing can cost a high design complexity.

B. Sparse Compression Format
Sparse compression formats are used to store sparse data in compact ways to save storage space. A compressed format contains only the nonzero data values and metadata that holds the information for locating the positions of the nonzero values in the uncompressed vectors and matrices. During processing, the metadata is decoded to obtain the input address for fetching the nonzero data in the compressed format and to calculate the output address for writing back the computed result. Different sparse compression formats have different requirements in terms of total storage size (including nonzero data and metadata) and decoding complexity. Here, we present and discuss some common sparse compression formats used for sparse neural network processing.
Coordinate List (COO): In the COO format, nonzero data are stored along with their absolute indices in the original uncompressed vector or matrix. Fig. 11(a) shows an example of a matrix with zero and nonzero data. The COO format of the matrix stores all nonzero data in a 1D value array and records the (row, column) indices of each nonzero data, as shown in Fig. 11(b). The advantage of COO is its low decoding complexity, since the row and column indices can be directly used to locate the positions of the nonzero data in the uncompressed vector or matrix. However, the row and column indices may require a significant amount of storage overhead, which makes COO less efficient for data of medium or low sparsity.
Compressed Sparse Row (CSR): In the CSR format, nonzero data are stored first by row, then by column in a 1D value array. Different from COO, the metadata consists of a pointer (Ptr) array and a column index array. The Ptr array stores the row-by-row count of the total number of nonzero data: the first entry Ptr[0] is always 0; the second entry Ptr[1] stores the count of nonzero data in the first row; Ptr[2] stores the count of nonzero data in the first two rows; and so on. The column index array stores the column index of each nonzero data. Fig. 11(c) shows the CSR format of our matrix example. The CSR format requires a two-step decoding process. For instance, to access the data in Row 1, the two steps are: 1) obtain the positions of the nonzero data of Row 1 stored in the value array: Ptr[1], Ptr[1] + 1, and so on; and 2) obtain the column indices of the nonzero data in Row 1: Index[Ptr[1]], Index[Ptr[1]+1], and so on.
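A small sketch of CSR encoding and the two-step row access described above (plain Python; the array names Value, Index, and Ptr follow the text, and the example matrix is hypothetical rather than the one in Fig. 11).

```python
def csr_encode(mat):
    """Encode a dense matrix (list of rows) into CSR (Value, Index, Ptr) arrays."""
    value, index, ptr = [], [], [0]
    for row in mat:
        for col, v in enumerate(row):
            if v != 0:
                value.append(v)
                index.append(col)        # column index of each nonzero
        ptr.append(len(value))           # running (row-by-row) count of nonzeros
    return value, index, ptr

def csr_row(value, index, ptr, r):
    """Two-step decode: 1) positions of row r's nonzeros, 2) their column indices."""
    start, end = ptr[r], ptr[r + 1]
    return [(index[p], value[p]) for p in range(start, end)]

mat = [[0, 5, 0, 0],
       [3, 0, 0, 7],
       [0, 0, 0, 0],
       [0, 2, 4, 0]]
value, index, ptr = csr_encode(mat)
print(csr_row(value, index, ptr, 1))     # [(0, 3), (3, 7)] -> the nonzeros of Row 1
```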

Compressed Sparse Column (CSC): The CSC format is similar to the CSR format, but nonzero data are stored first by column, then by row in a 1D value array. Fig. 11(d) shows the CSC format for our matrix example, where the Ptr array stores the column-by-column count of the total number of nonzero data and the row index array stores the row index of each nonzero data. The CSC format shares the same advantages and disadvantages as the CSR format.
Run-Length Coding (RLC): In the RLC format, nonzero data are stored in a 1D value array in either row-major or column-major order, and a run array keeps track of the number of zeros before each nonzero data (known as the "run length"). Fig. 11(e) shows the RLC format of our matrix example using 2-bit run lengths. In this example, the nonzero values a, b, c, and d are stored in the 1D value array, and they have 1, 3, 0, and 3 preceding zeros, or run lengths, respectively, that are recorded in the run array. Note that the nonzero data e has four preceding zeros, which exceeds the two bits allocated to a run length. Therefore, an additional padding zero is inserted before e with a run length of 3. The RLC format can be decoded in one step: the original (uncompressed) position of the i-th nonzero data can be recovered by accumulating all preceding run lengths in the run array together with the count of preceding stored values.

Figure 11. Examples of sparse compression formats: (a) sparse uncompressed tensor, (b) COO format, (c) CSR format, (d) CSC format, and (e) RLC format with 2-bit runs.
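The padding behavior for limited run-length bits can be captured in a few lines. The sketch below uses a hypothetical sequence with the same run structure as the example above (runs of 1, 3, 0, 3 for a–d and four zeros before e), and inserts an explicit padding zero whenever a run exceeds the maximum encodable length.

```python
def rlc_encode(values, run_bits=2):
    """Run-length encode a 1D sequence into (run, value) pairs with bounded run lengths."""
    max_run = (1 << run_bits) - 1          # e.g., 2-bit runs can encode 0..3 zeros
    runs, vals, run = [], [], 0
    for v in values:
        if v == 0:
            run += 1
            if run > max_run:              # run overflows: emit a padding zero
                runs.append(max_run)
                vals.append(0)
                run = 0
        else:
            runs.append(run)
            vals.append(v)
            run = 0
    return runs, vals

runs, vals = rlc_encode([0, 'a', 0, 0, 0, 'b', 'c', 0, 0, 0, 'd', 0, 0, 0, 0, 'e'])
print(list(zip(runs, vals)))
# [(1, 'a'), (3, 'b'), (0, 'c'), (3, 'd'), (3, 0), (0, 'e')]  <- padding zero before e
```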
C. Sparse Computation Pipeline
The high-level computation pipeline of sparse DNN processing in the compressed format is illustrated in Fig. 12. Following the computation pipeline, the nonzero data and metadata arrays of the Ws and IAs are first fetched on-chip for processing. The compressed W and IA pairs are then searched, paired, and dispatched to a multiplier array for computation in the so-called frontend part of the pipeline. Finally, the computed psums are accumulated and written back to their respective OAs in output buffers in the so-called backend part of the pipeline.

Figure 12. Processing pipeline of a sparse DNN processor. Adopted from [31] ©2021 IEEE.

The challenges of processing sparse data are twofold: 1) at the frontend, a sufficient number of IA-W pairs must be discovered and sent to the compute stage in order to maintain a high compute utilization; and 2) at the backend, the irregular psum traffic out of the compute stage must be reduced and written back to the output buffer without costing excessive bandwidth.
We provide a high-level overview of the hardware design techniques and explain how they leverage sparsity in the following subsections.

D. Single-Operand Sparsity
Some of the earliest sparse architectures leveraged sparsity from either the IA, e.g., Cnvlutin [37], or the W, e.g., Cambricon-X [38], but not both. By limiting the support to single-operand sparsity, these designs could adopt an existing dense DNN accelerator architecture and dataflow [39], and add a frontend to discover IA-W pairs for computation. Fig. 13 shows the frontend designs of Cnvlutin [37] and Cambricon-X [38]. Both use indirect access to fetch dense data (W in Cnvlutin, IA in Cambricon-X) using the indices of nonzero data (IA in Cnvlutin, W in Cambricon-X) decoded from the compressed format.

Figure 13. Sparse architectures for single-operand sparsity: (a) Cnvlutin adapted from [37] and (b) Cambricon-X adapted from [38].

Cnvlutin supports IA sparsity, where the IA data are compressed in the COO format, as illustrated in Fig. 13(a). For each nonzero IA data, an IA offset is stored to represent the original location of the IA data in the uncompressed format. To discover IA-W pairs, the IA offset is used as the index to fetch W data from the W data array.
Cambricon-X supports W sparsity, where the W data are compressed in the RLC format. For each W data, a W step index stores the number of zeros preceding it, i.e., the run length, as shown in Fig. 13(b). To discover IA-W pairs, the run lengths are accumulated to recover the indices of the W data in the uncompressed format. The recovered indices are used to select the corresponding IA data to form IA-W pairs for computation.
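In both designs the frontend is essentially an index-driven gather from the dense operand. A minimal sketch of the Cnvlutin-style case (IA compressed, W dense), using illustrative COO-style offsets rather than the actual hardware datapath:

```python
import numpy as np

C = 16
w_dense = np.random.randn(C)                  # dense W data array (one kernel slice)
ia_dense = np.random.randn(C) * (np.random.rand(C) < 0.4)   # sparse IA, ~40% density

# Compress the IA: keep nonzero values and their offsets (original positions).
ia_offsets = np.nonzero(ia_dense)[0]
ia_values = ia_dense[ia_offsets]

# Frontend: use each nonzero IA's offset to fetch the matching dense W value,
# so only effectual IA-W pairs reach the multipliers.
pairs = zip(ia_values, w_dense[ia_offsets])
acc = sum(a * b for a, b in pairs)

assert np.isclose(acc, ia_dense @ w_dense)    # same dot-product, fewer multiplies
```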

E. Full Sparsity—Channel-Last Processing
There are generally two ways to handle full sparsity, i.e., sparsity in both W and IA: channel-last processing or channel-first processing. In this subsection, we cover channel-last processing, an example of which is SCNN [34].
The channel-last dataflow is illustrated in Fig. 14. In channel-last processing, the nonzero W and IA data are ordered in the (R, S)/(H, W) dimensions first and the C dimension last for compressed storage and processing. Subsequently, as compressed W and IA data are fetched for processing, their channel indices are easily aligned. As long as a nonzero W's and a nonzero IA's channel indices are matched, they can be paired for multiplication.
As shown in Fig. 14(a) and (b), the compressed W and IA data of the same channel index can be cross paired and multiplied together using a 2D multiplier array. The advantage of channel-last processing is the simple frontend, but the drawback is the complicated writeback, because the OA addresses of the psums depend on the (R, S)/(H, W) indices of the IA/W data, which are irregular for sparse data. There is little opportunity to reduce the psums before writeback, resulting in a writeback traffic jam. It requires complex hardware or wiring, e.g., a crossbar switch, to resolve the contention, and it may cause pipeline stalls.
This backend challenge is illustrated in Fig. 14(c). The psums need to be distributed by a switch to the corresponding buffer bank. The red lines indicate the psum writebacks that lead to buffer contentions. To avoid contentions, conflicting psums need to be held. In the example, one output requires the accumulation of three psums, resulting in a three-cycle writeback and stalling the multiplier array for two cycles.

Figure 14. Illustration of channel-last dataflow for sparse DNN processing. (a) IA and W data in dense format, (b) front-end dataflow, and (c) back-end dataflow of channel-last processing. Adopted from [31] ©2021 IEEE.

Figure 15. Architecture of SCNN, adopted from [34] ©2017 ACM.

Fig. 15 shows the architecture of SCNN, which adopts channel-last processing for sparse DNN processing.

F. Full Sparsity—Channel-First Processing
In channel-first processing, the nonzero W and IA data are ordered in the C dimension first and the (R, S)/(H, W) dimensions last. As compressed W and IA data are fetched, their channel indices are first matched to produce pairs of W and IA data to be multiplied together. Strings of resulting psums will share the same OA address, so they can be reduced before writeback. Compared to channel-last processing, channel-first processing incurs an overhead in the frontend due to the channel index matching, but it produces immediately reducible psums that cut the writeback traffic, leading to more gain from simplifying the backend and a potential net improvement in the overall power and performance.
The channel-first dataflow is illustrated in Fig. 16. The W channel index is matched with the IA channel index to generate valid W-IA pairs. Valid W-IA pairs are fetched and multiplied to produce psums. The psums are accumulated to the OA address following the IA indices (h, w) and the W indices (r, s, k). Due to the channel-first input ordering, the (h, w) and (r, s) addresses will increment less frequently than the input channel index over the course of processing, allowing the OA address to stay constant for the majority of the time; the psums can therefore be immediately accumulated before writeback.

Figure 16. Illustration of channel-first dataflow for sparse DNN processing. Adapted from [31] ©2021 IEEE.

An example of channel-first processing is SNAP [31]. SNAP utilizes associative index matching (AIM) units in the frontend to extract IA-W pairs for multiplication, as shown in Fig. 17. The AIM consists of a comparator array, and each row is connected to a priority encoder. During operation, an AIM receives the W and IA channel index arrays and compares each W channel index to every IA channel index, as shown in Fig. 17. A priority encoder encodes the match result in each row into a valid bit to indicate a match and the matched position in the IA channel index array. Upon completion, an AIM returns a list of valid-position pairs for processing.

Figure 17. The associative index matching (AIM) unit in SNAP. Adopted from [31] ©2021 IEEE.
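A software analogue of the channel-index matching in the frontend (a simplified sketch, not the AIM circuit itself): each nonzero W's channel index is looked up against the nonzero IA channel indices, and only matching pairs are multiplied. The resulting psums all target one OA address, as described above.

```python
import numpy as np

C = 32
ia = np.random.randn(C) * (np.random.rand(C) < 0.5)   # sparse IA along the channel dim
w = np.random.randn(C) * (np.random.rand(C) < 0.5)    # sparse W along the channel dim

# Channel-first compression: nonzero values plus their channel indices.
ia_idx, w_idx = np.nonzero(ia)[0], np.nonzero(w)[0]
ia_val, w_val = ia[ia_idx], w[w_idx]

# Frontend matching: for each W channel index, find a matching IA channel index.
ia_pos = {int(c): p for p, c in enumerate(ia_idx)}    # channel -> position in IA value array
psum = 0.0
for p_w, c in enumerate(w_idx):
    p_ia = ia_pos.get(int(c))
    if p_ia is not None:                              # valid IA-W pair found
        psum += w_val[p_w] * ia_val[p_ia]             # psums share one OA address

assert np.isclose(psum, ia @ w)                       # matches the dense dot-product
```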

G. Structured Sparsity
Making use of the full available sparsity can cost substantial hardware overhead. As a compromise, we can use a limited form of sparsity, such as coarse-grained or structured sparsity, that can provide a good enough gain in performance and efficiency without excessive hardware overhead [27], [40].
Different forms of sparsity are compared in Fig. 18. If pruning [24] is done without any constraints, it results in the unstructured sparsity shown in Fig. 18(a). Pruning without any constraints generally produces a higher sparsity, but processing unstructured sparsity requires more fine-grained control and can cost excessive hardware overhead. Pruning can be done by block with a density upper bound; this approach produces density-bounded block sparsity [19], [41]. For example, Fig. 18(b) shows the result of density-bounded block pruning with each 1 × 3 block containing at most one nonzero value. Pruning can be done by block [42], [43], e.g., by 2 × 2 blocks as shown in Fig. 18(c). Pruning can even be done by input and output channel [40], [44], [45], as shown in Fig. 18(d). More coarse-grained pruning produces more hardware-friendly structured sparsity, but it may sacrifice the model accuracy to some degree.

Figure 18. Common sparsity types: (a) fine-grained, (b) density structured, (c) block structured, and (d) filter structured.

One well-known example that leverages density-bounded block sparsity is the Nvidia A100 GPU [19]. As illustrated in Fig. 19, fine-grained structured pruning is applied to the trained model weights to create the so-called 2:4 sparsity, i.e., a 50% density bound for each block of 1 × 4 data. The sparse weights are compressed with COO indices that are used to access the dense inputs in processing, similar to the illustration in Fig. 13(b).

Figure 19. Processing mechanism of the Nvidia A100 GPU for fine-grained structured sparse model weights. Adopted from [19].
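A sketch of how a 2:4 (density-bounded block) pattern can be imposed on a trained weight matrix: within every block of four consecutive weights, only the two largest magnitudes are kept. This mirrors the 50% density bound described above; the exact selection and retraining procedure used in practice may differ.

```python
import numpy as np

def prune_2_of_4(w):
    """Keep the 2 largest-magnitude values in each 1x4 block; zero out the rest."""
    w = w.copy()
    flat = w.reshape(-1, 4)                      # assumes the last dim is a multiple of 4
    for block in flat:
        drop = np.argsort(np.abs(block))[:2]     # the two smallest magnitudes
        block[drop] = 0.0
    return w

w = np.random.randn(4, 8)
w_sparse = prune_2_of_4(w)
print((w_sparse != 0).reshape(-1, 4).sum(axis=1))   # every 1x4 block keeps 2 nonzeros
```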
H. Bit-Level Sparsity
Besides sparsity at the data level, bit-level sparsity can also be leveraged by bit-serial multipliers. One example that adopts this approach is bit-pragmatic [46], where the zero bits in one of the operands can be skipped in bit-serial multiplication. The bit-pragmatic processing is illustrated in Fig. 20 [46]. The IA is processed in a bit-serial fashion, and each nonzero bit is encoded by its position in the bit sequence, similar to the COO format. In computation, the nonzero bit position of each IA data is used to set a configurable left shifter that shifts the W data value, effectively acting as a bit-wise multiplier. Exploiting sparsity at the bit level reduces the number of computation cycles, and can increase both efficiency and throughput.

Figure 20. Frontend mechanism and processing example of bit-pragmatic [46]: (a) IA and W data for processing and (b) bit-serial processing using the IA's nonzero bit positions to control the shifter.
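The shift-and-add view of this scheme can be written in a few lines (a simplified sketch for unsigned integers): only the nonzero bit positions of the IA contribute, so the number of cycles equals the IA's nonzero bit count rather than its full bit width.

```python
def nonzero_bit_positions(x):
    """COO-like encoding of an unsigned integer: positions of its set bits."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def bit_serial_multiply(ia, w):
    """Multiply w by ia one nonzero IA bit at a time (shift-and-add)."""
    acc = 0
    for pos in nonzero_bit_positions(ia):   # zero bits are skipped entirely
        acc += w << pos                     # configurable left shift acts as the multiplier
    return acc

ia, w = 0b01010001, 23                      # IA has 3 nonzero bits -> 3 cycles, not 8
assert bit_serial_multiply(ia, w) == ia * w
```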

I. Sparse Architectures for RNNs and Transformers
Compared to the sparse architectures for CNNs, the sparse architectures for RNNs are focused on improving the performance of sparse matrix-vector multiplication (SpMV) and the element-wise operations associated with the type of RNN.

For instance, an LSTM requires element-wise multiplication, sigmoid, and tanh operations to compute the outputs. ESE is an example of a sparse architecture for LSTM [47]. It proposes a load-balancing pruning technique to reduce the workload imbalance in the sparse inputs and weights during pruning. Similar to EIE [33], ESE adopts the CSC format to store and compute the sparse data.
Different from the pruning techniques that eliminate the unimportant weights and inputs in CNNs and RNNs, the sparse architectures for transformers propose to prune the unimportant connections (tokens or heads) in the self-attention matrix [48], [49]. Fig. 21 presents an example of the attention matrix. Several tokens have small contributions to the final result and thus can be pruned away without performance degradation. SpAtten [48] proposed cascade head and token pruning techniques to eliminate the tokens and heads in the attention matrix. It uses a shifting mechanism to avoid irregular memory access from the sparse computation and a reconfigurable adder-tree to leverage the sparsity for speedup. DOTA [49] trains a decoder side by side with the transformer to detect the weak connections in the attention matrix. To process the sparse attention matrix, DOTA adopts an out-of-order processing scheme to leverage temporal locality and avoid unnecessary memory accesses.

Figure 21. Illustration of the attention matrix with unimportant tokens. Adopted from [48].

V. Scale Up and Scale Out
The DNN model complexity grows at 1.5 times annually [8], [9], [29], [50], but it is unlikely that new custom chips will be built to respond to the rapid evolution of DNN models at the same rate. This lag is attributed to the high cost and effort to design new chips, especially ones that utilize large silicon area and advanced technology nodes needed to support the processing of more complex DNN models. Other important factors include the diverse use cases of DNNs that diminish the space for custom chips, and the rapid evolution of DNN models that shortens the useful life of such custom chips.
Domain-specific accelerators for DNNs, such as NVDLA [51] and the TPU [13], represent a path forward by providing some degree of flexibility to support not only current models but also future models. However, without growing the raw compute and memory capacity, the performance of such accelerators will not be able to meet the demands of newer and more complex models.

Therefore, these domain-specific accelerators also need to be continuously upgraded, e.g., from the 28 nm TPUv1 in 2016 [13] to the 7 nm TPUv4 in 2021. Similar challenges in high cost and limited lifespan still remain.
We identify a growing trend to emphasize hardware reuse and leverage advanced packaging to enhance the capability of hardware systems. Using this approach, a chip is designed to be a modular building block, called a chiplet, and a system is constructed by reusing chiplets. To meet the requirements of different DNN models and use cases, systems can be constructed using the suitable number and types of chiplets. In other words, this approach takes advantage of chiplet reuse to construct systems in package (SiP). The premise of this approach is that designing, fabricating, and assembling packages require lower cost and effort than designing and fabricating large monolithic chips.
For the SiP approach to succeed, we identify three basic requirements: 1) availability of reusable chiplets that are equipped with high-bandwidth and efficient I/O interfaces; 2) an accessible advanced packaging and assembly process; and 3) a methodology to map workloads to chiplet-based systems. Among the three requirements, a high-bandwidth and efficient I/O interface is necessary to ensure that the chiplets that constitute an SiP can be seamlessly integrated to match the performance of a monolithic chip; an accessible advanced packaging and assembly process ensures that high-density integration and high-bandwidth routing are feasible to construct an SiP at a reasonable cost; and a mapping methodology is needed to divide the workload and assign it appropriately to the chiplets to achieve high utilization and efficiency.
In the following, we use two recent designs as examples to outline the primary ways of constructing SiPs for DNN compute acceleration. We categorize them into two classes: homogeneous integration and heterogeneous integration. In homogeneous integration, the same chiplets are tiled to scale up the system to support models of larger size. In heterogeneous integration, different types of chiplets are put together to extend the functionality to cover new types of workloads.

A. Homogeneous Integration
The best example of homogeneous integration is Nvidia's DNN multi-chip package (MCP) shown in Fig. 22, where up to 36 DNN chiplets can be integrated in one MCP to scale up the system as needed [52]. The DNN chiplet measures 6 mm² in a TSMC 16 nm technology. It integrates tiles of SIMD-based PEs to provide up to 1,024 MACs/cycle (INT8) or 4 TOPS (INT8) [52].
Nvidia's DNN MCP is built on a 12-layer organic substrate. An organic substrate is generally of lower cost than the substrates used for advanced packaging, such as silicon interposers, but the routing density is generally lower too. Nvidia's DNN MCP adopts a serial link approach to achieve a high inter-chiplet bandwidth using fewer wires at very high speed, suitable for an organic substrate. In particular, the Nvidia design used a 200 mV low-swing, short-reach serial link called ground-referenced signaling (GRS) to achieve up to 25 Gbps/lane at 0.82–1.75 pJ/b for a short reach of 3–7 mm [53]. A chiplet is equipped with four transmit lanes and four receive lanes for up to 100 Gbps of input and 100 Gbps of output bandwidth [52].
The compute and I/O specifications above shed light on key design considerations for a chiplet-based DNN accelerator: 1) the compute capacity of the DNN chiplet (4 TOPS in INT8) significantly exceeds the I/O bandwidth (100 Gbps, transmit or receive); and 2) the compute energy per operation of the DNN chiplet (0.11 pJ/OP in INT8) is substantially lower than the I/O energy per bit (0.82 pJ/b). The DNN chiplet must reuse the input data (input activations and weights) and reduce the output data (output activations) to minimize the I/O usage, or I/O can easily overtake compute to become the performance and energy bottleneck, rendering the chiplet approach impractical.

Figure 22. Nvidia DNN MCP approach. Figure reused from [52] ©2020 IEEE.
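The imbalance can be quantified with a back-of-envelope calculation using the figures above (4 TOPS of INT8 compute versus 100 Gbps of input bandwidth). The numbers below are the published specifications; the reuse model (two ops per useful input byte fetched once) is a deliberate simplification.

```python
compute_ops_per_s = 4e12          # 4 TOPS (INT8) per chiplet [52]
input_bw_bits_per_s = 100e9       # 100 Gbps of input bandwidth [52]

input_bytes_per_s = input_bw_bits_per_s / 8
# Assume roughly 2 ops (multiply + add) per useful input byte fetched once.
ops_supported_by_io = 2 * input_bytes_per_s

required_reuse = compute_ops_per_s / ops_supported_by_io
print(f"each fetched byte must feed ~{required_reuse:.0f} ops to keep the PEs busy")
# -> on the order of 160x reuse, hence the emphasis on caching weights and activations

# Energy side: 0.82 pJ/b of I/O vs 0.11 pJ/OP of compute
io_energy_per_byte = 0.82 * 8     # pJ per byte moved
print(io_energy_per_byte / 0.11)  # ~60 ops' worth of energy per byte moved off-chiplet
```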

The contrast between compute and I/O also has an implication on the choice of chiplet size. If a chiplet's x- and y-dimensions are each scaled up by a factor of S, the compute capacity scales up by a factor of approximately S², but the chiplet I/O shoreline only scales up by a factor of S, allowing the I/O bandwidth to scale up by approximately S. This back-of-envelope calculation suggests that the chiplet size may have to be kept small, or the disparity between compute and I/O can become even larger.
A mapping strategy was developed for Nvidia's MCP to divide the weights into parts and allocate them to different chiplets [54]: 1) allocate output channels (different kernels) across columns of chiplets; and 2) divide the input channels into parts and allocate them across rows of chiplets. To carry out the computation, the channels of the inputs are divided into parts and distributed to the appropriate rows of chiplets. This mapping strategy provides data reuse and reduction: 1) weights are cached and reused within a chiplet; 2) input activations are reused between multiple kernels within a chiplet; and 3) output psums are reduced in the channel dimension before going out of the chiplet. Such a mapping strategy is essential for reducing the I/O usage and removing the I/O bottleneck in the DNN MCP.
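A sketch of this partitioning for a single layer, assuming a hypothetical rows x cols chiplet grid and viewing the weights as a C x K matrix for simplicity: output channels (kernels) are split across chiplet columns, input channels across chiplet rows, and partial sums are reduced along the input-channel (row) dimension before leaving the chiplets.

```python
import numpy as np

rows, cols = 6, 6                       # chiplet grid (e.g., up to 36 chiplets per MCP)
C, K = 192, 288                         # layer input/output channels (illustrative)
W = np.random.randn(C, K)               # weights, viewed as C x K for simplicity
x = np.random.randn(C)                  # one input pixel's channel vector

c_part = np.array_split(np.arange(C), rows)   # input channels split across chiplet rows
k_part = np.array_split(np.arange(K), cols)   # output channels split across chiplet columns

y = np.zeros(K)
for r in range(rows):                   # each chiplet row sees only its input channels
    x_r = x[c_part[r]]
    for c in range(cols):               # each chiplet column owns a subset of kernels
        W_rc = W[np.ix_(c_part[r], k_part[c])]    # weights cached in chiplet (r, c)
        y[k_part[c]] += x_r @ W_rc      # psums reduced along the row (channel) dimension

assert np.allclose(y, x @ W)            # same result as the unpartitioned layer
```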
B. Heterogeneous Integration
While homogeneous tiling of DNN chiplets solves the problem of scaling up DNN hardware to support larger DNN models, it does not address the problem of scaling out DNNs, i.e., extending DNNs to novel uses, e.g., DNNs used as a building block to support new applications. Besides scaling out DNNs, new operators can be added to DNNs in the future to enhance their capability, making it difficult to design a truly future-proof DNN chiplet.
We argue for the importance of factoring computation into types, e.g., common operations and special operations, in considering chiplet-based system partitioning. As examples, CONV and FC layers are common and compute-heavy operations, while batch normalization and activation functions are special operations that are relatively lightweight compared to CONV and FC layers. The control loops and data organization outside of NN processing to support different tasks are also special operations. This factoring exercise naturally leads to heterogeneous chiplets, e.g., an accelerator chiplet that supports common and compute-heavy operations, and a processor or FPGA chiplet that can be programmed to support special operations. Using this approach, accelerator chiplets can be made to target common kernels that are unlikely to change over time, allowing us to extend the useful lifetime of these chiplets. Processor and FPGA chiplets can be used to complement the accelerator chiplets to complete system implementations.
An example of heterogeneous integration is the MCP consisting of an FPGA with the PETRA systolic array chiplet [55], as illustrated in Fig. 23. The PETRA chiplet measures 3 mm² in an Intel 22 nm technology. It integrates tiles of systolic arrays to provide up to 1,024 MACs/cycle (FP16) or 1.43 TFLOPS (FP16) [55].
The PETRA MCP is built on Intel's embedded multi-die interconnect bridge (EMIB) [56], [57], a silicon bridge that connects an FPGA chiplet and an external chiplet. The silicon bridge provides a high routing density, enabling the use of parallel links of moderate speed. The I/O design for moderate-speed links can be made much simpler than high-speed serial I/Os, and it can even be made entirely digital [58]. A digital link is more reliable and can be ported to different technologies with ease. In the MCP design, a digital advanced interface bus (AIB) link [57], [58] was adopted with full swing, supporting a short reach of 3 mm at 2 Gbps/pin. Thanks to the short reach and simple design, an AIB I/O consumes less than 1 pJ/b [58]. An AIB channel assembles 40 pins for an aggregate bandwidth of 80 Gbps. Using a dense bump pitch of 55 μm, an AIB channel occupies approximately 300 μm of die edge. The PETRA chiplet utilizes 8 AIB channels, or a total bandwidth of 640 Gbps, to communicate with the FPGA chiplet [55].

Figure 23. Illustration of the concept of integrating an FPGA with the PETRA chiplet. Figure reused from [55] ©2021 IEEE.

channels, or a total bandwidth of 640 Gbps, to communi- Architect focusing on GPU performance analysis, model-
cate with the FPGA chiplet [55]. ing, and optimization for deep learning models. His re-
With heterogeneous integration, the FPGA can serve search interests include energy-efficient hardware ar-
as the flexible platform that can be configured to serve chitecture and accelerator design for machine learning,
as the host and handle control and data management, computer vision, and robotics applications.
and the PETRA systolic array chiplet can perform the
VMM and MMM that form the core part of DNN compu- Zhengya Zhang (Senior Member,
tation [55]. Operations that are not supported by the PE- IEEE) received the B.A.Sc. degree in
TRA chiplet can always be covered by the FPGA chiplet. computer engineering from the Univer-
The heterogeneous platform can be further extended, sity of Waterloo in 2003, and the M.S.
e.g., by adding a front-end chiplet to make a complete and Ph.D. degrees in electrical engi-
sensor platform, and by adding another function accel- neering from the University of Califor-
erator chiplet to expand the capability of the system. nia at Berkeley (UC Berkeley), Berkeley, CA, USA, in 2005
VI. Conclusion
DNN hardware design is a fast-evolving field. In this article, we provide a survey and a tutorial on the basics of DNN workloads, the essential processing architectures, and the promising directions in sparse DNN processing and multi-chip integration. First, we explain the two basic architectures for DNN processing, SIMD and systolic array, along with the common WS and OS dataflows, to show the tradeoffs between flexibility and energy efficiency, and between utilization and compute density. Next, we present designs that exploit data sparsity to improve both performance and energy efficiency through compressed storage and sparse processing. From partial sparsity to full sparsity, architectures can be designed with a range of overheads to gain an array of benefits, including lower energy, smaller memory footprint, lower memory bandwidth, and higher performance. Lastly, we show a path to scaling up and scaling out DNN hardware using multi-chiplet integration, either by tiling modular DNN chiplets to construct larger-scale systems or by heterogeneously pairing DNN chiplets with a CPU or an FPGA to build a versatile platform.
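As a concrete illustration of the compressed storage summarized above, the short sketch below encodes a sparse vector as its nonzero values plus a one-bit-per-element occupancy bitmap and computes a dot product that touches only the nonzero entries. It is a simplified, generic example (the compress and sparse_dot helpers are hypothetical), not the encoding or datapath of any specific accelerator cited in this article.

```python
# Generic sketch of bitmap-based compressed storage for a sparse vector,
# assuming a simple value-list-plus-bitmap encoding; helper names are illustrative.
import numpy as np


def compress(vec):
    """Keep only the nonzero values plus a 1-bit-per-element occupancy bitmap."""
    bitmap = vec != 0
    return bitmap, vec[bitmap]


def sparse_dot(bitmap_a, vals_a, dense_b):
    """Dot product that touches only the nonzero entries of the sparse operand."""
    return float(np.dot(vals_a, dense_b[bitmap_a]))


a = np.array([0, 0, 3.0, 0, -1.5, 0, 0, 2.0], dtype=np.float32)
b = np.arange(8, dtype=np.float32)
bitmap, vals = compress(a)

dense_bytes = a.nbytes                             # 8 values x 4 B = 32 B
compressed_bytes = vals.nbytes + len(bitmap) // 8  # 12 B of values + 1 B of bitmap
assert np.isclose(sparse_dot(bitmap, vals, b), np.dot(a, b))
```

In this toy case with 75% sparsity and 32-bit values, the encoded footprint drops from 32 B to 13 B; this is the kind of memory and bandwidth saving that sparse accelerators trade against indexing and control overhead.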

Acknowledgment
This work was supported in part by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) Program sponsored by DARPA.

Jie-Fang Zhang (Member, IEEE) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2015, and the M.S. degree in computer science and engineering and the Ph.D. degree in electrical and computer engineering from the University of Michigan, Ann Arbor, MI, USA, in 2018 and 2022, respectively. He joined NVIDIA in 2022 as a Deep Learning Architect focusing on GPU performance analysis, modeling, and optimization for deep learning models. His research interests include energy-efficient hardware architecture and accelerator design for machine learning, computer vision, and robotics applications.

Zhengya Zhang (Senior Member, IEEE) received the B.A.Sc. degree in computer engineering from the University of Waterloo in 2003, and the M.S. and Ph.D. degrees in electrical engineering from the University of California at Berkeley (UC Berkeley), Berkeley, CA, USA, in 2005 and 2009, respectively. He has been a Faculty Member with the University of Michigan, Ann Arbor, MI, USA, since 2009, where he is currently a Professor with the Department of Electrical Engineering and Computer Science. His research interests include low-power and high-performance VLSI circuits and systems for computing, communications, and signal processing. He was a recipient of the University of Michigan College of Engineering Neil Van Eenam Memorial Award in 2019, the Intel Early Career Faculty Award in 2013, the National Science Foundation CAREER Award in 2011, and the David J. Sakrison Memorial Prize from UC Berkeley in 2009. He has been an Associate Editor of the IEEE Transactions on Very Large Scale Integration (VLSI) Systems since 2015 and has served on the Technical Program Committee of the IEEE Custom Integrated Circuits Conference (CICC) since 2018. He was an Associate Editor of the IEEE Transactions on Circuits and Systems—Part I: Regular Papers from 2013 to 2015 and the IEEE Transactions on Circuits and Systems—Part II: Express Briefs from 2014 to 2015, and served on the Technical Program Committee of the IEEE VLSI Symposium on Technology and Circuits from 2018 to 2022. He is an IEEE Solid-State Circuits Society Distinguished Lecturer.

References
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] K. He et al., “Deep residual learning for image recognition,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[3] C. Szegedy et al., “Rethinking the inception architecture for computer vision,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[4] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020, arXiv:2010.11929.
[5] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 6000–6010.
[6] J. Devlin et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Language Technol., vol. 1, Jun. 2019, pp. 4171–4186.
[7] T. Brown et al., “Language models are few-shot learners,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2020, pp. 1877–1901.
[8] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” 2017, arXiv:1605.07678.

[9] S. Bianco et al., “Benchmark analysis of representative deep neural network architectures,” IEEE Access, vol. 6, pp. 64270–64277, 2018.
[10] Y. Guo et al., “Deep learning for 3D point clouds: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 12, pp. 4338–4364, Dec. 2021.
[11] G. Menghani, “Efficient deep learning: A survey on making deep learning models smaller, faster, and better,” 2021, arXiv:2106.08962.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[13] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. Int. Symp. Comput. Archit. (ISCA), 2017, pp. 1–12.
[14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.
[15] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861.
[16] R. Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 580–587.
[17] J. Redmon et al., “You only look once: Unified, real-time object detection,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[18] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[19] R. Krashinsky et al. Nvidia Ampere Architecture In-Depth. Accessed: Dec. 10, 2022. [Online]. Available: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
[20] Get Outstanding Computational Performance Without a Specialized Accelerator. Accessed: Dec. 10, 2022. [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-solution-brief.html
[21] V. Sze et al., “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[22] Z. Du et al., “ShiDianNao: Shifting vision processing closer to the sensor,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2015, pp. 92–104.
[23] C. Deng et al., “TIE: Energy-efficient tensor train-based inference engine for deep neural network,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2019, pp. 264–277.
[24] S. Han et al., “Learning both weights and connections for efficient neural network,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2015, pp. 1135–1143.
[25] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[26] T. Zhang et al., “A systematic DNN weight pruning framework using alternating direction method of multipliers,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 184–199.
[27] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” ACM J. Emerg. Technol. Comput. Syst., vol. 13, no. 3, pp. 1–18, 2017.
[28] S. Dave et al., “Hardware acceleration of sparse and irregular tensor computations of ML models: A survey and insights,” Proc. IEEE, vol. 109, no. 10, pp. 1706–1752, Oct. 2021.
[29] L. Deng et al., “Model compression and hardware acceleration for neural networks: A comprehensive survey,” Proc. IEEE, vol. 108, no. 4, pp. 485–532, Apr. 2020.
[30] J.-F. Zhang et al., “SNAP: A 1.67–21.55 TOPS/W sparse neural acceleration processor for unstructured sparse deep neural network inference in 16 nm CMOS,” in Proc. Symp. VLSI Circuits (VLSI), Jun. 2019, pp. 306–307.
[31] J.-F. Zhang et al., “SNAP: An efficient sparse neural acceleration processor for unstructured sparse deep neural network inference,” IEEE J. Solid-State Circuits, vol. 56, no. 2, pp. 636–647, Feb. 2021.
[32] Y.-H. Chen et al., “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[33] S. Han et al., “EIE: Efficient inference engine on compressed deep neural network,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.
[34] A. Parashar et al., “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2017, pp. 27–40.
[35] Z. Yuan et al., “STICKER: An energy-efficient multi-sparsity compatible accelerator for convolutional neural networks in 65-nm CMOS,” IEEE J. Solid-State Circuits, vol. 55, no. 2, pp. 465–477, Feb. 2020.
[36] Y.-H. Chen et al., “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[37] J. Albericio et al., “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 1–13.
[38] S. Zhang et al., “Cambricon-X: An accelerator for sparse neural networks,” in Proc. Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[39] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in Proc. Int. Symp. Microarchitecture (MICRO), Dec. 2014, pp. 609–622.
[40] W. Wen et al., “Learning structured sparsity in deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 1–9.
[41] Z.-G. Liu, P. N. Whatmough, and M. Mattina, “Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference,” IEEE Comput. Archit. Lett., vol. 19, no. 1, pp. 34–37, Jan./Jun. 2020.
[42] J. Yu et al., “Scalpel: Customizing DNN pruning to the underlying hardware parallelism,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2017, pp. 548–560.
[43] S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” 2017, arXiv:1711.02782.
[44] Z. Liu et al., “Learning efficient convolutional networks through network slimming,” in Proc. Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2755–2763.
[45] H. Li et al., “Pruning filters for efficient ConvNets,” 2016, arXiv:1608.08710.
[46] J. Albericio et al., “Bit-pragmatic deep neural network computing,” in Proc. Int. Symp. Microarchitecture (MICRO), Oct. 2017, pp. 382–394.
[47] S. Han et al., “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proc. Int. Symp. Field Program. Gate Arrays (FPGA), Feb. 2017, pp. 75–84.
[48] H. Wang, Z. Zhang, and S. Han, “SpAtten: Efficient sparse attention architecture with cascade token and head pruning,” in Proc. Int. Symp. High-Perform. Comput. Archit. (HPCA), Feb./Mar. 2021, pp. 97–110.
[49] Z. Qu et al., “DOTA: Detect and omit weak attentions for scalable Transformer acceleration,” in Proc. Int. Conf. Architectural Support Program. Lang. Operating Systems (ASPLOS), Feb. 2022, pp. 14–26.
[50] N. P. Jouppi et al., “Ten lessons from three generations shaped Google's TPUv4i: Industrial product,” in Proc. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 1–14.
[51] Nvidia Deep Learning Accelerator (NVDLA). Accessed: Dec. 10, 2022. [Online]. Available: http://nvdla.org/
[52] B. Zimmer et al., “A 0.32–128 TOPS, scalable multi-chip-module-based deep neural network inference accelerator with ground-referenced signaling in 16 nm,” IEEE J. Solid-State Circuits, vol. 55, no. 4, pp. 920–932, Apr. 2020.
[53] J. W. Poulton et al., “A 1.17-pJ/b, 25-Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator,” IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 43–54, Jan. 2019.
[54] R. Venkatesan et al., “A 0.11 pJ/OP, 0.32–128 TOPS, scalable multi-chip-module-based deep neural network accelerator designed with a high-productivity VLSI methodology,” in Proc. IEEE Hot Chips Symp. (HCS), Aug. 2019, pp. 1–24.
[55] S.-G. Cho et al., “PETRA: A 22 nm 6.97 TFLOPS/W AIB-enabled configurable matrix and convolution accelerator integrated with an Intel Stratix 10 FPGA,” in Proc. Symp. VLSI Circuits (VLSI), Jun. 2021, pp. 1–2.
[56] R. Mahajan et al., “Embedded multi-die interconnect bridge (EMIB)—A high density, high bandwidth packaging interconnect,” in Proc. IEEE Electron. Compon. Technol. Conf. (ECTC), May/Jun. 2016, pp. 557–565.
[57] D. Greenhill et al., “3.3 A 14 nm 1 GHz FPGA with 2.5D transceiver integration,” in Proc. Int. Solid-State Circuits Conf. (ISSCC), Feb. 2017, pp. 54–55.
[58] C. Liu, J. Botimer, and Z. Zhang, “A 256 Gb/s/mm-shoreline AIB-compatible 16 nm FinFET CMOS chiplet for 2.5D integration with Stratix 10 FPGA on EMIB and tiling on silicon interposer,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), Apr. 2021, pp. 1–2.
