Optimization of Microarchitecture and Dataflow for Sparse Tensor CNN Acceleration
ABSTRACT The inherent sparsity present in convolutional neural networks (CNNs) offers a valuable opportunity to significantly decrease the computational workload during inference. Nevertheless, leveraging unstructured sparsity typically comes with the trade-off of increased complexity or substantial hardware overheads for accelerators. To address these challenges, this research introduces an innovative inner join aimed at effectively reducing the size and power consumption of the sparsity-handling circuit. Additionally, a novel dataflow named Channel Stacking of Sparse Tensors (CSSpa) is presented, focusing on maximizing data reuse to minimize memory accesses, an aspect that significantly contributes to overall power consumption. Through comprehensive simulations, CSSpa demonstrates a 1.6× speedup and a 5.6× reduction in SRAM accesses when executing inference on the ResNet50 model, compared to the existing Sparten architecture. Furthermore, the implementation results reveal a notable 2.32× enhancement in hardware resource efficiency and a 3.3× improvement in energy efficiency compared to Sparten.

INDEX TERMS AI accelerator, convolutional neural networks (CNNs), data compression, dataflow, network on a chip (NoC).
There exist various techniques to reduce the computational demands of CNNs while maintaining their accuracy. Among these methods, leveraging the sparsity present in model inputs proves to be a highly efficient approach. Given that the primary operation in CNNs involves multiplication and accumulation (MAC), input features or weights containing zeros can lead to ineffectual outputs. Consequently, skipping these zero inputs, or, in other words, exploiting the sparsity of the models, is advisable. Exploiting model sparsity offers several advantages, such as reducing the computational workload, thereby enhancing power efficiency and throughput. Additionally, compressing non-zero inputs allows for reduced storage requirements and minimizes data traffic.

In the context of CNNs, it has been observed that zeros frequently appear in the features. This phenomenon is primarily due to the use of activation functions with a nonlinear nature in CNNs, with the Rectified Linear Unit (ReLU) being one of the most employed activation functions [9]. The ReLU function maps negative outputs to zeros while retaining positive outputs, which then serve as inputs for the subsequent layers. Numerous research works have been dedicated to leveraging this feature sparsity [10], [11], [12], [13], [14], [15], [16], [17]. Moreover, similar to the synapses connecting biological neurons, the weights in CNNs can also exhibit considerable sparsity without significantly affecting the model's accuracy. Sparse weights can be obtained either by discarding insignificant weights that fall below a certain threshold [20], [21], or by employing optimization algorithms to identify and retain only the important weights [22], [23]. Subsequently, retraining can be performed with the remaining weights. This process can be iterated multiple times until a satisfactory level of accuracy is attained. Studies have demonstrated that sparsity of up to 80% can be achieved while the drop in accuracy remains negligible [33]. The technique of exploiting sparsity by focusing solely on the zeros in features or weights is commonly referred to as one-sided sparsity exploitation.

Two-sided sparsity has been leveraged to enhance performance further [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. By incorporating sparsity in both features and weights, significant reductions in computations can be achieved. FIGURE 1 presents the weight and feature density of layers in AlexNet and VGG16. To obtain these densities, we randomly selected images from the CIFAR10 dataset [44] and ran inferences on the pretrained models. The weight density was directly obtained from the pruned pretrained models, while the feature density was calculated by counting non-zero features layer by layer from the inference results. The data shows that weight density varies across layers and networks, with some layers in AlexNet reaching as low as 20% and in VGG16 reaching 30%. Feature density also varies, typically being higher in the early layers but ranging as low as 30% as well. The triangles in the figures represent the fraction of valid multiplications achievable if only valid (non-zero and matching) input pairs are considered. It is evident that in certain layers of both models, the actual number of valid multiplications can be reduced to 10%. These properties emphasize the importance of efficient sparsity-exploiting techniques that leverage these benefits to mitigate the memory- and computation-intensive challenges of CNNs.

While exploiting one-sided sparsity is relatively straightforward, leveraging sparsity on both sides presents challenges. Firstly, the irregular distribution of zeros in the inputs can lead to an uneven workload distribution among processing elements (PEs). Consequently, some PEs with lower workloads may finish their tasks early and remain idle, while others continue to work, resulting in suboptimal resource utilization. The second challenge in dealing with two-sided sparsity is the irregular data-access pattern. Since zero values are randomly distributed in both features and weights, it becomes necessary to design additional hardware to identify non-zero matching positions for these inputs. Moreover, input compression introduces complexity as it requires decoding the actual input coordinates.

Methods for convolving compressed sparse inputs can be broadly categorized into two groups. The first group is based on an inner-product (or output-stationary dataflow) approach, where valid input pairs are located before being sent to MACs. Representative examples of this approach include Sparten [31], ExTensor [36], and SIGMA [37]. The second group is based on the outer-product (or input-stationary dataflow) approach, as seen in SCNN [27], SpArch [28], and OuterSPACE [29]. In this method, PEs multiply each incoming pair of inputs, and the resulting output products are delivered to their respective accumulators. Addresses for accumulations are computed using separate coordinate decoding circuitry. However, these methods have certain microarchitecture drawbacks. The inner-product approach faces challenges in achieving a balance between performance and hardware cost. Ensuring full PE speed often requires significant hardware additions. For example, in Sparten [31], bulky hardware components are necessary to find valid input pairs, occupying up to 62.7% of the area and consuming 46% of the total power. Conversely, the outer-product approach encounters complexities in the output scatter circuit and may exhibit suboptimal reuse of output partial sums (psums). For instance, SCNN [27] experiences network congestion when output psums are scattered across many PEs.

Accelerator dataflows play a pivotal role in both performance and power consumption. As with dense CNN counterparts, achieving maximal reuse of locally buffered data is essential for reducing memory access. This is crucial since memory accesses typically contribute significantly to power consumption and have a considerable impact on overall performance. Specifically, reading 32 bits from SRAM consumes approximately 5 pJ of energy, which is 50 times higher than the energy cost of a 32-bit integer ADD operation [41]. Furthermore, it is essential for the accelerator to handle a diverse range of layer sizes, strides, and types, including
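The fraction of valid multiplications plotted in FIGURE 1 can be measured directly from the non-zero patterns of a layer's features and weights. The sketch below is our own illustration (not code from the paper) of one way to compute feature density, weight density, and the valid-pair fraction for a convolution layer with PyTorch [45]; the tensor shapes and the 70% weight-pruning rate are assumptions chosen only for the example.

```python
import torch
import torch.nn.functional as F

def density(t: torch.Tensor) -> float:
    """Fraction of non-zero elements (feature or weight density)."""
    return (t != 0).float().mean().item()

def valid_mac_fraction(features: torch.Tensor, weights: torch.Tensor,
                       stride: int = 1, padding: int = 0) -> float:
    """Fraction of MACs whose feature and weight operands are both non-zero."""
    k = weights.shape[-1]                                         # square kernel assumed
    # im2col: each column holds one C*k*k receptive field.
    cols = F.unfold(features, k, stride=stride, padding=padding)  # (N, C*k*k, L)
    f_nz = (cols != 0).float()
    w_nz = (weights.reshape(weights.shape[0], -1) != 0).float()   # (M, C*k*k)
    valid = torch.einsum('mk,nkl->', w_nz, f_nz).item()           # non-zero matching pairs
    total = w_nz.shape[0] * f_nz.shape[0] * f_nz.shape[1] * f_nz.shape[2]
    return valid / total

# Hypothetical 3x3 layer: ReLU features (~50% zeros) and ~70% pruned weights.
feat = torch.relu(torch.randn(1, 32, 28, 28))
wts = torch.randn(64, 32, 3, 3) * (torch.rand(64, 32, 3, 3) > 0.7)
print(density(feat), density(wts), valid_mac_fraction(feat, wts, padding=1))
```

With two-sided sparsity of this magnitude, the valid-pair fraction falls well below either one-sided density, which is the effect the triangles in FIGURE 1 summarize.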
occurs in the output-scatter network of SCNN since output products of the same output feature can arise at multiple places. Secondly, the Cartesian product dataflow requires a high-bandwidth crossbar between multipliers and accumulators to route the output products. This high-bandwidth requirement adds complexity and may present challenges in terms of area and power consumption. Thirdly, the Cartesian product strategy assumes that each filter weight is multiplied by every feature. However, this assumption is only true for unit-stride convolutions. In the case of k-stride convolutions, a weight is multiplied by every k-th input feature. Consequently, the Cartesian product approach restricts the applicability of SCNN to CNNs with unit-stride convolutions.

In contrast to SCNN, Sparten [31] adopts an inner-product approach by using intersection to search for non-zero matching pairs of inputs before executing multiplications. This architecture offers the advantage of easy controllability. Additionally, Sparten's use of output-stationary dataflow at the MACs contributes to improved power efficiency because the output size is typically larger than the input size due to accumulation. However, Sparten does have some critical drawbacks that need to be considered. Firstly, the prefix sums, a crucial component in the architecture, consume significant power and hardware resources. In the breakdown report, the area and power consumption of the prefix sums were found to be 54.6% and 40.6% of the total, respectively. The second major limitation of Sparten is related to the poor reuse of input features, resulting in a high number of memory accesses. This inefficiency in input feature reuse leads to increased data movement and can impact the overall performance and memory bandwidth utilization of the system.

GoSPA [32] employs a global search approach to consider all possible computations for each feature. Subsequently, it transfers only the relevant features to specific PEs that have locally stored weights for valid computations. This dataflow strategy is advantageous as it avoids unnecessary feature transfers that may occur in Sparten, leading to more efficient data movement. However, the performance of GoSPA is hindered by a bottleneck in the feature staging step. This bottleneck arises from the limitation of broadcasting only one feature to PEs at each clock cycle, while skipping broadcasting to PEs where that feature is not required. Consequently, this results in poor PE utilization in GoSPA.

Sparse-PE [33] utilizes a multi-threaded, generic PE core for performing sparse matrix multiplication. The PE core employs a look-ahead mechanism to anticipate computations beforehand and only schedules valid computations, thereby avoiding ineffectual computations. This look-ahead strategy is beneficial in optimizing the PE core's efficiency. However, similar to the issue faced by SIGMA [37], Sparse-PE encounters challenges when the look-ahead window is not large enough to handle high sparsity level cases. In such situations, the resource utilization of the PE core may be suboptimal. On the other hand, enlarging the look-ahead window can address the resource utilization problem but comes at the cost of introducing bulky hardware overhead and increased complexity.

S2 Engine [34] utilizes an output-stationary systolic dataflow approach with multiple asynchronous PEs. Each PE selects aligned data pairs by leveraging a faster clock speed, which is set at four times the computation speed. However, this design choice results in a limitation on the operation speed of the overall accelerator. Due to the asynchronous nature and the faster clock speed of the PEs, the maximum operation speed of the S2 Engine is always four times slower than the allowable processing speed. This discrepancy creates a bottleneck in the accelerator's performance, as it cannot fully exploit the potential processing speed available.

ISOSceles [43] proposes an inter-layer pipeline approach to reduce data movement in CNN accelerators. The paper claims that processing CNN models layer by layer can lead to a large number of intermediate outputs, which may overflow the accelerator's internal SRAM. Consequently, the intermediate outputs have to be stored externally, resulting in increased data movement costs. However, the presence of sparsity in the CNN models helps significantly reduce the size of intermediate outputs. As a result, the experiments conducted on four CNN models (AlexNet, VGG16, GoogLeNet, and ResNet50) demonstrate that a 1MB internal SRAM is sufficient to store the intermediate results when using 8-bit precision. Despite the benefits of reducing data movement, processing CNNs in an inter-layer fashion requires a substantial amount of locally stored weights for multiple consecutive layers. This local weight storage is necessary to facilitate the inter-layer pipeline, but it also introduces additional hardware complexity.

III. INNER JOIN LOGIC
In Section I, we explain that output-stationary dataflow offers advantages over weight-stationary or input-stationary approaches because it minimizes the movement of psums. Psums tend to have larger sizes compared to weights or inputs since they accumulate over time during the convolution operation. By applying output-stationary dataflow, a convolution operation can be seen as a combination of multiple inner products between input feature vectors and weight vectors. However, the irregularity of sparse inputs poses a challenge in efficiently finding non-zero matching input pairs between the two inputs (i.e., the inner join), which is crucial for utilizing the multiplier and improving system throughput. In this section, we present the proposed inner join logic, designed to address this challenge while maintaining low power consumption.

FIGURE 2 illustrates the functional diagram of the proposed inner join logic. The input feature and filter tensors are divided into multiple n-sized vectors, referred to as "chunks" (e.g., n = 128). The primary purpose of the inner join logic is to identify non-zero matching pairs of these input chunks, as they represent valid operand pairs for the convolution operations. To optimize the operation of the inner join logic,
and at the same time, to reduce the data transfer costs, these
chunks are compressed using the Zero-Value Compression
(ZVC) method. As a result of this compression, each chunk
becomes a two-tuple consisting of a bitmask and a set of
non-zero values. The bitmask in each chunk represents the
positions of non-zero values within the chunk. It is a binary
representation, where a ‘‘1’’ in the bitmask indicates the
presence of a non-zero value at the corresponding position
in the chunk, and a ‘‘0’’ represents a zero value.
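As a concrete illustration of this two-tuple format, the following sketch (our own, not code from the paper) packs a chunk into a bitmask plus its list of non-zero values and restores it; the function names are invented for the example, and an 8-element chunk is used for readability even though the text above uses n = 128.

```python
def zvc_compress(chunk):
    """Zero-Value Compression: a chunk becomes (bitmask, non-zero values).
    Bit i of the mask is 1 iff chunk[i] != 0."""
    bitmask, values = 0, []
    for i, v in enumerate(chunk):
        if v != 0:
            bitmask |= 1 << i
            values.append(v)
    return bitmask, values

def zvc_decompress(bitmask, values, n=128):
    """Inverse of zvc_compress: scatter the non-zero values back."""
    chunk, it = [0] * n, iter(values)
    for i in range(n):
        if (bitmask >> i) & 1:
            chunk[i] = next(it)
    return chunk

# Toy 8-element chunk (real chunks hold n = 128 values).
chunk = [0, 5, 0, 0, 7, 0, 1, 0]
mask, vals = zvc_compress(chunk)
assert zvc_decompress(mask, vals, n=8) == chunk
print(f"{mask:08b}", vals)   # 01010010 (bit 0 is the rightmost position), [5, 7, 1]
```

Keeping the bitmask separate from the value array is what allows the inner join logic described next to operate on the masks alone before the values are ever read.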
To reduce the hardware size, the inner join logic takes a pair of k-bit sub-chunks (e.g., k = 8) from the bitmasks as inputs at a time; these are referred to as search windows.
The goal of the inner join logic is to locate non-zero matching
pairs of values from both the feature and weight sub-chunks
within a search window. These valid pairs are then sent to
the MAC unit, one pair at every clock cycle. Once all valid
inputs in the current sub-chunks are processed, the window
is shifted to the next sub-chunks. Locating the matching
input pairs involves two key steps: (1) finding the valid input
pairs and (2) accessing the corresponding non-zero values
in the compressed arrays. In the first step, the positions of
the non-zero matching pairs within a search window are identified by performing an AND operation between the two
bitmask sub-chunks. A k-bit priority encoder is utilized to
encode the AND result, generating one matching position
at every clock cycle with priority decreasing from right to left, as shown in FIGURE 2. To determine the next matching position, the current-matching bit is set to 0.

In FIGURE 2, an example with an 8-bit search window is illustrated, which is represented by a black dash-dotted rectangle. The current matching position is highlighted in the third bit from the right of the search window, marked by the red dashed box. The output of the priority encoder is 3, indicating the current matching position. To find the address of the non-zero matching value in the compressed array for each input chunk, the computation is divided into two steps. The first step involves counting the total number

FIGURE 3. Power and hardware resource utilization of (a) different-sized prefix sums and (b) different-sized priority encoders.
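A behavioral sketch of this search-window walk is given below. It is our own illustration rather than the paper's RTL: the AND of the two bitmask sub-chunks drives a right-to-left priority encoder, and the address of each matching value in the compressed array is recovered by counting the set mask bits below the matching position. The extracted text cuts off mid-description here, so this popcount-based second step is our assumption about how the address computation completes.

```python
def inner_join(feat_mask, feat_vals, wt_mask, wt_vals, k=8):
    """Yield (feature, weight) pairs for every non-zero matching position,
    walking the bitmasks one k-bit search window at a time."""
    n = max(feat_mask.bit_length(), wt_mask.bit_length())
    for base in range(0, n, k):                          # shift the search window
        window = ((feat_mask >> base) & (wt_mask >> base)) & ((1 << k) - 1)
        while window:                                     # one pair per 'clock cycle'
            pos = (window & -window).bit_length() - 1     # priority encoder, rightmost first
            window &= window - 1                          # clear the current-matching bit
            idx = base + pos
            # Step 2 (assumed): address = number of set mask bits below idx.
            f_addr = bin(feat_mask & ((1 << idx) - 1)).count("1")
            w_addr = bin(wt_mask & ((1 << idx) - 1)).count("1")
            yield feat_vals[f_addr], wt_vals[w_addr]

# Toy example reusing the ZVC format from the previous sketch.
f_mask, f_vals = 0b01010010, [5, 7, 1]   # non-zeros at positions 1, 4, 6
w_mask, w_vals = 0b01000110, [2, 3, 4]   # non-zeros at positions 1, 2, 6
print(sum(a * b for a, b in inner_join(f_mask, f_vals, w_mask, w_vals)))
# positions 1 and 6 match: 5*2 + 1*4 = 14
```

Only the bitmasks are consulted while searching; the compressed value arrays are read once per valid pair, and FIGURE 3 compares the cost of the k-bit priority encoder used here against prefix sums of different sizes.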
PEs to process data from one buffer while the other buffer
is simultaneously filled with new input data. At every clock
cycle, all PEs within a cluster send requests to their shared
feature buffer, and the buffer responds to each PE individ-
ually. To ensure efficient and simultaneous data access for
all PEs within the cluster, the number of reading ports in
the shared feature buffer equals the number of PEs in that
cluster. Whenever a PE processes all the buffered features in
one half of the shared feature buffer, it can immediately swap
to the other half, provided that the other half is filled with
newly arrived data. The PE does not need to wait for other
PEs in the same cluster to finish their tasks before swapping
its input buffer. This control technique desynchronizes the
operations of PEs in a cluster, virtually enlarging their input
buffer size. As a result, the workload imbalance among the
PEs is minimized, leading to improved overall efficiency and
performance of the accelerator.
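The desynchronized swap described above can be captured in a few lines of behavioral code. The sketch below is our own simplified model (the class and method names are invented for illustration): each PE tracks which half of the shared double buffer it is reading and swaps as soon as it drains its half and the other half holds fresh data, without waiting for the rest of the cluster; a half becomes refillable only once its last reader has left it.

```python
class SharedFeatureBuffer:
    """Ping-pong feature buffer shared by all PEs of one cluster (behavioral model)."""
    def __init__(self, half_size=128):
        self.halves = [[0] * half_size, [0] * half_size]
        self.fresh = [False, False]          # half holds newly arrived data
        self.readers = [set(), set()]        # which PEs currently read each half

    def refillable(self, half):
        return not self.fresh[half] and not self.readers[half]

    def refill(self, half, data):
        assert self.refillable(half)
        self.halves[half] = list(data)
        self.fresh[half] = True

class PE:
    def __init__(self, pe_id, buf):
        self.id, self.buf, self.half, self.pos = pe_id, buf, 0, 0
        buf.readers[0].add(pe_id)            # every PE starts on half 0

    def next_feature(self):
        buf = self.buf
        if self.pos == len(buf.halves[self.half]):      # this PE drained its half
            other = 1 - self.half
            if not buf.fresh[other]:
                return None                              # stall until the refill arrives
            buf.readers[self.half].discard(self.id)
            if not buf.readers[self.half]:
                buf.fresh[self.half] = False             # last reader frees it for refill
            buf.readers[other].add(self.id)              # swap immediately
            self.half, self.pos = other, 0
        value = buf.halves[self.half][self.pos]
        self.pos += 1
        return value

# Usage: preload half 0, then let the PEs of one cluster drain it at their own pace;
# the controller refills a half whenever buf.refillable(half) becomes true.
buf = SharedFeatureBuffer()
buf.refill(0, range(128))
pes = [PE(i, buf) for i in range(4)]
```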
B. PE ARCHITECTURE
FIGURE 5 illustrates the microarchitecture of a single PE. The primary function of a PE is to compute the final sums for an output feature section on an output channel. PEs are equipped with the necessary components to efficiently process the input data. The key components of the PE include:
• Input selector: This block is equipped with the inner join logic explained in Section III to efficiently identify non-zero matching pairs of inputs.
• 8-bit multiplier: The PE includes an 8-bit multiplier that performs the multiplication of feature and weight values.
• 24-bit accumulator: The PE utilizes a 24-bit accumulator to accumulate the results of the multiplier into psums stored in the output buffer.
• Output buffer: The PE has a 24-bit output buffer with a size of 14 × 14 elements. This buffer size is selected based on experimental results from various CNN models to ensure an even division of the output channel dimension among multiple PEs.
• Feature and weight buffers: The PE employs an 8-bit shared input feature buffer and an 8-bit weight buffer. As explained in Section IV-A, both buffers are double buffered to minimize latency caused by data transfer. Each PE possesses a local doubled weight buffer with a size of 2 × 72 elements. However, PEs in a cluster share a doubled input feature buffer to reduce hardware overhead. The shared input feature buffer can accommodate a maximum of 2 × 128 elements.
• Finite state machine (FSM) controller: Each PE is equipped with a separate state machine controller. This controller automatically orchestrates the operation of the PE once it receives input data and configuration information, such as the number of assigned output features and stride. Its primary function is to reuse the stored data in the input buffers to produce all possible outputs.

FIGURE 6. CSSpa overall architecture.

C. OVERALL ARCHITECTURE OF THE ACCELERATOR
FIGURE 6 illustrates the top-level architecture and memory hierarchy of the CSSpa system; the controller is omitted for simplicity of illustration. The core processor includes a 16 × 16 PE array, as introduced in Section IV-A. The weights are stored in an off-chip DRAM, while the features are stored in a 1MB on-chip SRAM. The final sums from the PE array undergo ReLU activation and pooling before being compressed by the compressor. The compressed data is then stored locally in the on-chip SRAM to be processed in the subsequent layers. The size of the on-chip SRAM is selected to ensure that the outputs of intermediate layers in the CNN are sufficiently stored, thereby reducing the expensive power consumption caused by accessing the off-chip DRAM. This size is determined based on the simulation results of common CNN models such as AlexNet, VGG16, GoogLeNet, and ResNet50. In case of data overflow, the excess data is stored in the DRAM. Both the on-chip SRAM and off-chip DRAM have a bandwidth of 64 bits. When PEs complete their executions, they send alert signals to the buffers to refill the data. Meanwhile, the PEs continue their operations with the other half of the ping-pong buffer without any latency.

D. DATA COMPRESSION METHOD
In order to reduce storage requirements, the proposed architecture employs output feature compression before storing the features in the SRAM. FIGURE 7 depicts the compression format utilized in this approach. The output feature tensor is divided into separate subsections, with each subsection comprising Ct consecutive channels, Wt columns, and Ht rows. The dimensions of these subsections are determined based on the results of the mapping algorithm, which considers the
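Putting the PE components listed above together, the following sketch (our own behavioral model, reusing the inner_join generator from the Section III example) shows how one PE produces psums in an output-stationary fashion: matched 8-bit feature/weight pairs are multiplied and accumulated into 24-bit entries of the 14 × 14 output buffer. The work-item structure is invented for illustration and is not the paper's control format.

```python
OUT_H, OUT_W = 14, 14   # per-PE output buffer: 14 x 14 psums, 24 bits each

def pe_compute(work_items):
    """Behavioral model of one PE (output-stationary): every work item pairs a
    ZVC-compressed weight chunk with a feature chunk and names the output-buffer
    entry that their products accumulate into."""
    output_buffer = [[0] * OUT_W for _ in range(OUT_H)]
    for row, col, (w_mask, w_vals), (f_mask, f_vals) in work_items:
        for f, w in inner_join(f_mask, f_vals, w_mask, w_vals):
            output_buffer[row][col] += f * w   # 8b x 8b multiply, 24-bit accumulate
    return output_buffer

# One toy work item targeting psum (0, 0), reusing the masks from the inner-join example.
psums = pe_compute([(0, 0, (0b01000110, [2, 3, 4]), (0b01010010, [5, 7, 1]))])
print(psums[0][0])   # 14
```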
TABLE 2. Benchmarks.
FIGURE 11. Comparisons on normalized speedup on convolution layers of (a) AlexNet, (b) VGG16, (c) GoogLeNet, (d) ResNet50, and on fully connected
layers of (e) AlexNet and (f) VGG16 (baseline: Dense architecture).
FIGURE 13. Comparisons on normalized SRAM accesses on convolution layers of (a) AlexNet, (b) VGG16, (c) GoogLeNet, (d) ResNet50, and on fully
connected layers of (e) AlexNet and (f) VGG16 (baseline: Dense architecture).
in CSSpa were substantially reduced from 59.5% and 2% in Sparten to 10.4% and 0.5% in CSSpa, respectively. It is important to note that FIGURE 12 focuses solely on the comparison of the energy consumption of one PE cluster. The overall energy consumption of the accelerator is also influenced by other functional blocks, with data movement between the PE clusters and global memories playing a significant role in the overall energy of the system.

Since implementing hardware architectures for other accelerators is time-consuming, conducting comprehensive energy consumption comparisons can be challenging. However, energy consumption information can still be inferred from the energy consumption of the cores and the number of memory accesses, as memory accesses are significant contributors to the overall energy consumption. In FIGURE 13, we present a comparison of SRAM accesses when running inference on various accelerators using actual CNN models. It is evident that Sparten incurred a very large number of SRAM accesses due to its locally buffered data not being fully reused. On the other hand, GoSPA and CSSpa resulted in similar numbers of SRAM accesses since both accelerators flush locally buffered data only after it has participated in all possible computations. The average numbers of SRAM accesses incurred by Sparten on AlexNet, VGG16, GoogLeNet, and ResNet50 were 9.4×, 7.3×, 2×, and 5.6×, respectively, compared to those caused by CSSpa. This substantial reduction in SRAM accesses for
AlexNet can be attributed to its relatively wide filters, which allow for more effective data reuse in CSSpa. In contrast, GoogLeNet, which contains numerous 1 × 1 filters, experienced the least reduction in SRAM accesses. Combining this information with the reported energy consumption comparisons of the cores, the SRAM access data can provide a more accurate perspective on the overall energy efficiency of the accelerators.

VII. CONCLUSION
This paper presents an innovative inner join logic aimed at reducing hardware costs in sparsity-exploiting accelerators. The proposed inner join logic streamlines the input matching circuits of the inner-product approach by efficiently processing continuous search windows containing ZVC-encoded inputs. This approach optimizes hardware size, minimizes power consumption, and maintains peak performance. Additionally, we introduce a novel dataflow, Channel Stacking of Sparse Tensors (CSSpa), which maximizes the reuse of input features buffered in a limited buffer, leading to a substantial reduction in the number of SRAM accesses and an overall improvement in the power efficiency of the baseline design. Furthermore, the CSSpa dataflow ensures adaptability to accommodate various types and strides of CNN layers. We also present a technique for sharing ping-pong input feature buffers among PEs in the same cluster, thereby virtually enlarging the local buffer size. This mitigates workload imbalances and contributes to enhanced overall speed. The implementation and simulation results demonstrate significant performance enhancements when compared with state-of-the-art architectures. To thoroughly evaluate the impact of the proposed inner join and dataflow, we plan to conduct an ASIC implementation of the accelerator in future work.

REFERENCES
[1] K. Fukushima, "Neocognitron," Scholarpedia, vol. 2, no. 1, p. 1717, 2007, doi: 10.4249/scholarpedia.1717.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015, doi: 10.1007/s11263-015-0816-y.
[3] S. Kang, G. Park, S. Kim, S. Kim, D. Han, and H.-J. Yoo, "An overview of sparsity exploitation in CNNs for on-device intelligence with software-hardware cross-layer optimizations," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 11, no. 4, pp. 634–648, Dec. 2021.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105.
[5] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[8] A. Brock, S. De, S. L. Smith, and K. Simonyan, "High-performance large-scale image recognition without normalization," 2021, arXiv:2102.06171.
[9] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[10] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 1–13.
[11] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[12] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644–656, Mar. 2019.
[13] M. Kim and J.-S. Seo, "Convolutional neural network accelerator featuring conditional computing and low external memory access," IEEE J. Solid-State Circuits, vol. 56, no. 3, pp. 803–813, Mar. 2021.
[14] J.-S. Park, J.-W. Jang, H. Lee, D. Lee, S. Lee, H. Jung, S. Lee, S. Kwon, K. Jeong, J.-H. Song, S. Lim, and I. Kang, "A 6K-MAC feature-map-sparsity-aware neural processing unit in 5 nm flagship mobile SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 152–154.
[15] J.-W. Jang et al., "Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 15–28.
[16] X. Zhu, K. Guo, H. Fang, L. Chen, S. Ren, and B. Hu, "Cross view capture for stereo image super-resolution," IEEE Trans. Multimedia, vol. 24, pp. 3074–3086, 2022, doi: 10.1109/TMM.2021.3092571.
[17] X. Zhu, K. Guo, S. Ren, B. Hu, M. Hu, and H. Fang, "Lightweight image super-resolution with expectation-maximization attention mechanism," IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 3, pp. 1273–1284, Mar. 2022.
[18] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Cambricon-X: An accelerator for sparse neural networks," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[19] N. Srivastava, H. Jin, S. Smith, H. Rong, D. Albonesi, and Z. Zhang, "Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), San Diego, CA, USA, Feb. 2020, pp. 689–702, doi: 10.1109/HPCA47549.2020.00062.
[20] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," 2015, arXiv:1506.02626.
[21] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," in Proc. ICLR, 2016, pp. 1–13.
[22] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1398–1406.
[23] W. Niu, X. Ma, S. Lin, S. Wang, X. Qian, X. Lin, Y. Wang, and B. Ren, "PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., New York, NY, USA, Mar. 2020, pp. 907–922, doi: 10.1145/3373376.3378534.
[24] D. Kim, J. Ahn, and S. Yoo, "A novel zero weight/activation-aware hardware architecture of convolutional neural network," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 1462–1467.
[25] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss V2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[26] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.
[27] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2017, pp. 27–40.
[28] Z. Zhang, H. Wang, S. Han, and W. J. Dally, "SpArch: Efficient architecture for sparse matrix multiplication," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 261–274.
[29] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, "OuterSPACE: An outer product based sparse matrix multiplication accelerator," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2018, pp. 724–736.
[30] A. D. Lascorz, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, K. Siu, and A. Moshovos, "Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., Apr. 2019, pp. 749–763.
[31] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, "SparTen: A sparse tensor accelerator for convolutional neural networks," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Oct. 2019, pp. 151–165.
[32] C. Deng, Y. Sui, S. Liao, X. Qian, and B. Yuan, "GoSPA: An energy-efficient high-performance globally optimized SParse convolutional neural network accelerator," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 1110–1123.
[33] M. A. Qureshi and A. Munir, "Sparse-PE: A performance-efficient processing engine core for sparse convolutional neural networks," IEEE Access, vol. 9, pp. 151458–151475, 2021, doi: 10.1109/ACCESS.2021.3126708.
[34] J. Yang, W. Fu, X. Cheng, X. Ye, P. Dai, and W. Zhao, "S2 engine: A novel systolic architecture for sparse convolutional neural networks," IEEE Trans. Comput., vol. 71, no. 6, pp. 1440–1452, Jun. 2022, doi: 10.1109/TC.2021.3087946.
[35] Z.-G. Liu, P. N. Whatmough, Y. Zhu, and M. Mattina, "S2TA: Exploiting structured sparsity for energy-efficient mobile CNN acceleration," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Seoul, South Korea, Apr. 2022, pp. 573–586, doi: 10.1109/HPCA53966.2022.00049.
[36] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, "ExTensor: An accelerator for sparse tensor algebra," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, New York, NY, USA, Oct. 2019, pp. 319–333, doi: 10.1145/3352460.3358275.
[37] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, "SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 58–70.
[38] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, "MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product," in Proc. 53rd Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Athens, Greece, Oct. 2020, pp. 766–780, doi: 10.1109/MICRO50266.2020.00068.
[39] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler, and Z. Zhang, "SNAP: A 1.67–21.55 TOPS/W sparse neural acceleration processor for unstructured sparse deep neural network inference in 16 nm CMOS," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C306–C307.
[40] Y.-C. Lin and C.-Y. Su, "Faster optimal parallel prefix circuits: New algorithmic construction," J. Parallel Distrib. Comput., vol. 65, no. 12, pp. 1585–1595, Dec. 2005.
[41] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[42] Xilinx, Inc. (2021). Vivado. [Online]. Available: https://www.xilinx.com/support/download.html
[43] Y. Yang, J. S. Emer, and D. Sanchez, "ISOSceles: Accelerating sparse CNNs through inter-layer pipelining," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Montreal, QC, Canada, Feb. 2023, pp. 598–610, doi: 10.1109/HPCA56546.2023.10071080.
[44] A. Krizhevsky, V. Nair, and G. Hinton. (2014). The CIFAR-10 Dataset. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[45] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[46] M. R. Guthaus, J. E. Stine, S. Ataei, B. Chen, B. Wu, and M. Sarwar, "OpenRAM: An open-source memory compiler," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Austin, TX, USA, Nov. 2016, pp. 1–6, doi: 10.1145/2966986.2980098.

NGOC-SON PHAM (Member, IEEE) received the B.S. degree from Vietnam National University, Hanoi, Vietnam, in 2011, and the M.S. and Ph.D. degrees in electrical and electronics engineering from Chung-Ang University, Seoul, South Korea, in 2019. He is currently a Research Professor with the Department of Computer Science and Engineering, Korea University. Prior to joining Korea University, he was an ASIC Design Engineer with Zaram Technology, South Korea. His research interests include mixed-signal integrated circuit design, embedded and real-time systems, and machine learning accelerators.

TAEWEON SUH (Member, IEEE) received the B.S. degree in electrical engineering from Korea University, Seoul, South Korea, in 1993, the M.S. degree in electronics engineering from Seoul National University, in 1995, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2006. He is currently a Professor with the Department of Computer Science and Engineering, Korea University.