Received 2 September 2023, accepted 11 September 2023, date of publication 27 September 2023, date of current version 9 October 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3319727

Optimization of Microarchitecture and Dataflow for Sparse Tensor CNN Acceleration

NGOC-SON PHAM, (Member, IEEE), AND TAEWEON SUH, (Member, IEEE)
Department of Computer Science and Engineering, Korea University, Seoul 02841, South Korea
Corresponding author: Taeweon Suh ([email protected])
This work was supported in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP)
Grant funded by the Korean Government through [Ministry of Science and ICT (MSIT)], Convergence Security Core Talent Training
Business, Korea University, under Grant 2022-0-01198; in part by the National Research Foundation of Korea (NRF) Grant by the
Korean Government through MSIT under Grant NRF-2022R1A2C1011469; and in part by Samsung Electronics Company Ltd.,
under Grant IO210204-08384-01.

ABSTRACT The inherent sparsity present in convolutional neural networks (CNNs) offers a valuable
opportunity to significantly decrease the computational workload during inference. Nevertheless, leveraging
unstructured sparsity typically comes with the trade-off of increased complexity or substantial hardware over-
heads for accelerators. To address these challenges, this research introduces an innovative inner join aimed at
effectively reducing the size and power consumption of the sparsity-handling circuit. Additionally, a novel
dataflow named Channel Stacking of Sparse Tensors (CSSpa) is presented, focusing on maximizing data
reuse to minimize memory accesses − an aspect that significantly contributes to overall power consumption.
Through comprehensive simulations, CSSpa demonstrates a 1.6× speedup and a 5.6× reduction in SRAM
accesses when executing inference on the ResNet50 model, compared to the existing Sparten architecture.
Furthermore, the implementation results reveal a notable 2.32× enhancement in hardware resource efficiency
and a 3.3× improvement in energy efficiency compared to Sparten.

INDEX TERMS AI accelerator, convolutional neural networks (CNNs), data compression, dataflow,
network on a chip (NoC).

I. INTRODUCTION

TABLE 1. Size of some common CNNs.


Deep neural networks (DNNs) have garnered significant
attention in recent decades due to their remarkable success
across various domains, such as computer vision, speech
recognition, and autonomous vehicles. Among these, con-
volutional neural networks (CNNs) have emerged as indis-
pensable tools in computer vision. Modeled after biological
brains, CNNs extract meaningful features from intricate input
images by scanning them with multiple filters [1]. Recent
CNN models have demonstrated accuracy that surpasses
human performance [2]. However, CNN's exceptional accuracy comes at a considerable computational cost. TABLE 1 outlines the total number of parameters and computations for common CNN models. Compounding the challenge, CNN models have grown in both size and depth [3]. When compared to AlexNet [4] introduced in 2012, the recent state-of-the-art model NFNet-F6 [8] demands roughly 7× more parameters and 520× more computations. Simultaneously, there is a mounting demand to optimize CNNs for edge devices, aiming to alleviate server workload, reduce traffic costs, and safeguard users' privacy. Given the preceding discussion, it becomes imperative to curtail CNN computations while upholding accuracy.

The associate editor coordinating the review of this manuscript and approving it for publication was Mario Donato Marino.

There exist various techniques to reduce the computational demands of CNNs while maintaining their accuracy. Among these methods, leveraging the sparsity present in model inputs proves to be a highly efficient approach. Given that the primary operation in CNNs involves multiplication and accumulation (MAC), input features or weights containing zeros can lead to ineffectual outputs. Consequently, skipping these zero inputs, or, in other words, exploiting the sparsity of the models, is advisable. Exploiting model sparsity offers several advantages, such as reducing the computational workload, thereby enhancing power efficiency and throughput. Additionally, compressing non-zero inputs allows for reduced storage requirements and minimizes data traffic.

In the context of CNNs, it has been observed that zeros frequently appear in the features. This phenomenon is primarily due to the use of activation functions with a nonlinear nature in CNNs, with the Rectified Linear Unit (ReLU) being one of the most employed activation functions [9]. The ReLU function maps negative outputs to zeros while retaining positive outputs, which then serve as inputs for the subsequent layers. Numerous research works have been dedicated to leveraging this feature sparsity [10], [11], [12], [13], [14], [15], [16], [17]. Moreover, similar to the synapses connecting biological neurons, the weights in CNNs can also exhibit considerable sparsity without significantly affecting the model's accuracy. Sparse weights can be obtained by either discarding insignificant weights that fall below a certain threshold [20], [21], or by employing optimization algorithms to identify and retain only the important weights [22], [23]. Subsequently, retraining can be performed with the remaining weights. This process can be iterated multiple times until a satisfactory level of accuracy is attained. Studies have demonstrated that sparsity can be achieved up to 80% while the drop in accuracy remains negligible [33]. The technique of exploiting sparsity by focusing solely on the zeros in features or weights is commonly referred to as one-sided sparsity exploitation.

Two-sided sparsity has been leveraged to enhance performance further [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39]. By incorporating sparsity in both features and weights, significant reductions in computations can be achieved. FIGURE 1 presents the weight and feature density of layers in AlexNet and VGG16. To obtain these densities, we randomly selected images from the CIFAR10 dataset [44] and ran inferences on the pretrained models. The weight density was directly obtained from the pruned pretrained models, while the feature density was calculated by counting non-zero features layer by layer from the inference results. The data shows that weight density varies across layers and networks, with some layers in AlexNet reaching as low as 20% and in VGG16 reaching 30%. Feature density also varies, typically being higher in the early layers but ranging as low as 30% as well. The triangles in the figures represent the fraction of valid multiplications achievable if only valid (non-zero and matching) input pairs are considered. It is evident that in certain layers of both models, the actual number of valid multiplications can be reduced to 10%. These properties emphasize the importance of efficient sparsity-exploiting techniques to leverage these benefits to mitigate the memory and computation-intensive challenges of CNNs.

FIGURE 1. Feature, weight density and the reduction in the numbers of MACs achievable by exploiting sparsity in (a) AlexNet, and (b) VGG16.

While exploiting one-sided sparsity is relatively straightforward, leveraging sparsity on both sides presents challenges. Firstly, the irregular distribution of zeros in the inputs can lead to an uneven workload distribution among processing elements (PEs). Consequently, some PEs with lower workloads may finish their tasks early and remain idle, while others continue to work, resulting in suboptimal resource utilization. The second challenge in dealing with two-sided sparsity is the irregular data-access pattern. Since zero values are randomly distributed in both features and weights, it becomes necessary to design additional hardware to identify non-zero matching positions for these inputs. Moreover, input compression introduces complexity as it requires decoding the actual input coordinates.

Methods for convoluting compressed sparse inputs can be broadly categorized into two groups. The first group is based on an inner-product (or output-stationary dataflow) approach, where valid input pairs are located before being sent to MACs. Representative examples of this approach include Sparten [31], ExTensor [36], and SIGMA [37]. The second group is based on the outer-product (or input-stationary dataflow) approach, as seen in SCNN [27], SpArch [28], and OuterSPACE [29]. In this method, PEs multiply each incoming pair of inputs, and the resulting output products are delivered to their respective accumulators. Addresses for accumulations are computed using separate coordinate decoding circuitry. However, these methods have certain microarchitecture drawbacks. The inner-product approach faces challenges in achieving a balance between performance and hardware cost. Ensuring full PE speed often requires significant hardware additions. For example, in Sparten [31], bulky hardware components are necessary to find valid input pairs, occupying up to 62.7% of the area and consuming 46% of the total power. Conversely, the outer-product approach encounters complexities in the output scatter circuit and may exhibit suboptimal reuse of output partial sums (psums). For instance, SCNN [27] experiences network congestion when output psums are scattered across many PEs.
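To make the distinction concrete, the following Python sketch (ours, with made-up values; it does not model any specific accelerator) contrasts an inner-product style intersection of two compressed sparse vectors with an outer-product style expansion that scatters partial products to per-coordinate accumulators.

feats = [(0, 3), (2, 5), (5, 1)]         # non-zero features as (index, value) pairs
wgts  = [(2, 4), (3, 7), (5, 2)]         # non-zero weights as (index, value) pairs

def inner_product(feats, wgts):
    """Output-stationary flavor: intersect coordinates first, then MAC into one sum."""
    wmap = dict(wgts)
    acc = 0
    for i, f in feats:
        if i in wmap:                     # only matching non-zero pairs reach the multiplier
            acc += f * wmap[i]
    return acc

def outer_product(feat_col, wgt_row):
    """Input-stationary flavor: multiply every incoming pair, scatter psums by coordinate."""
    psums = {}
    for r, f in feat_col:
        for c, w in wgt_row:
            psums[(r, c)] = psums.get((r, c), 0) + f * w
    return psums

print(inner_product(feats, wgts))         # 22 = 5*4 + 1*2
print(len(outer_product(feats, wgts)))    # 9 partial products routed to 9 accumulators

The inner join proposed in this work targets the cost of exactly this intersection step in the first style.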
Accelerator dataflows play a pivotal role in both performance and power consumption. Similar to dense CNN counterparts, achieving maximal reuse of locally buffered data is essential for reducing memory access. This is crucial since memory accesses typically contribute significantly to power consumption and have a considerable impact on overall performance. Specifically, reading 32 bits from SRAM consumes approximately 5 pJ of energy, which is 50 times higher than the energy cost of a 32-bit integer ADD operation [41]. Furthermore, it is essential for the accelerator to handle a diverse range of layer sizes, strides, and types, including fully connected layers, to ensure versatility and efficiency in processing various neural network architectures.

In this study, we have introduced two cost-effective solutions to address the weaknesses associated with the inner-product dataflow:
• We propose an efficient inner join that effectively reduces the burdens related to input matching logic of the inner-product approach while maintaining high performance.
• Additionally, we present a new dataflow called Channel Stacking for Sparse Tensors (CSSpa). CSSpa facilitates data reuse at the PE level while making efficient use of the limited buffers available in the PEs.
Through comprehensive simulations using cycle-accurate simulators and actual hardware implementations, our proposed solutions have demonstrated substantial improvements over state-of-the-art works in terms of speedup, hardware size, and power consumption. The remainder of this paper is organized as follows. Section II discusses some recent notable CNN accelerators. Section III introduces the proposed inner join logic. Section IV presents the overall architecture. Section V presents the proposed CSSpa dataflow. The implementation and simulation results are provided in Section VI. Finally, Section VII concludes the paper.

II. RELATED WORK
Sparse architectures offer a significant reduction in compute and memory access requirements by leveraging the presence of zeros in either features (one-sided) or both features and weights (two-sided). Several notable examples of one-sided sparsity utilization include Cnvlutin [10], Eyeriss [11], and NullHop [12]. Cnvlutin [10], for instance, capitalizes on sparsity by disregarding zeros in features, but it does not entirely eliminate the transfer of zeros. On the other hand, Cambricon-X [18] exclusively avoids zeros in weights while still considering zeros in features. Tensaurus [19] accelerates sparse and dense tensor factorizations through the introduction of a novel sparse storage format called Compressed Interleaved Sparse Slice (CISS). This format effectively utilizes high memory bandwidth and optimizes performance. However, it should be noted that Tensaurus only supports weight sparsity.

Recent advancements in generalized sparse matrix-sparse matrix multiplication (SpGEMM) accelerators have been documented. SIGMA [37] adopts a fixed-sized search window (e.g., 4 elements) to identify valid input pairs and utilizes flexible interconnects to stream dynamic inputs to stationary ones. Nevertheless, this accelerator encounters underutilization issues when dealing with high sparsity levels. Additionally, widening the search window can increase hardware overhead and complexity. In contrast, ExTensor [36] employs an intersection approach between two dynamic input streams and uses the skipto() method to skip inputs where no valid pairs can be formed, thereby accelerating the intersection process. However, ExTensor still faces challenges in achieving high PE utilization as the intersection may not produce valid input pairs at every clock cycle. Two other approaches, SpArch [28] and OuterSPACE [29], implement an outer-product (or input-stationary) dataflow to bypass the inefficiencies associated with the input matching process in the inner-product dataflow. While this improves efficiency, it leads to reduced output reuse and places a burden on scatter networks to route output products to corresponding accumulators. Lastly, MatRaptor [38] introduces a modified version of the Compressed Sparse Row (CSR) format known as Channel Cyclic Sparse Row (C2SR) to enhance data reuse and memory efficiency. However, this format requires complex encoding for output matrices, which can add to the design complexity.

EIE [26] focuses on exploiting two-sided sparsity specifically in FC layers and does not extend its optimization to convolution layers. EIE essentially discards zeros in weights, leaving them as idle cycles, resulting in a waste of compute resources. On the other hand, Eyeriss v2 [25] addresses two-sided sparsity by utilizing the Compressed Sparse Column (CSC) format for both weights and features. However, its pipeline-staged PE design introduces complexities in buffering logic to reduce bubbles caused by the failure of finding valid input pairs. This design decision leads to a significant increase in area, with Eyeriss v2 occupying approximately 93% more space compared to the original Eyeriss [11].

Sparse CNN (SCNN) [27] targets two-sided sparsity in CNNs by employing the Cartesian product dataflow. This dataflow allows SCNN to avoid the need for expensive input-pair searching circuits. However, SCNN faces several significant challenges. Firstly, congestion frequently
occurs in the output-scatter network of SCNN since output products of the same output feature can arise at multiple places. Secondly, the Cartesian product dataflow requires a high-bandwidth crossbar between multipliers and accumulators to route the output products. This high-bandwidth requirement adds complexity and may present challenges in terms of area and power consumption. Thirdly, the Cartesian product strategy assumes that each filter weight is multiplied by every feature. However, this assumption is only true for unit-stride convolutions. In the case of k-stride convolutions, a weight is multiplied by every k-th input feature. Consequently, the Cartesian product approach restricts the applicability of SCNN to CNNs with unit-stride convolutions.

In contrast to SCNN, Sparten [31] adopts an inner-product approach by using intersection to search for non-zero matching pairs of inputs before executing multiplications. This architecture offers the advantage of easy controllability. Additionally, Sparten's use of output-stationary dataflow at the MACs contributes to improved power efficiency because the output size is typically larger than the input size due to accumulation. However, Sparten does have some critical drawbacks that need to be considered. Firstly, the prefix sums, a crucial component in the architecture, consume significant power and hardware resources. In the breakdown report, the area and power consumption of the prefix sums were found to be 54.6% and 40.6% of the total, respectively. The second major limitation of Sparten is related to the poor reuse of input features, resulting in a high number of memory accesses. This inefficiency in input feature reuse leads to increased data movement and can impact the overall performance and memory bandwidth utilization of the system.

GoSPA [32] employs a global search approach to consider all possible computations for each feature. Subsequently, it transfers only the relevant features to specific PEs that have locally stored weights for valid computations. This dataflow strategy is advantageous as it avoids unnecessary feature transfers that may occur in Sparten, leading to more efficient data movement. However, the performance of GoSPA is hindered by a bottleneck in the feature staging step. This bottleneck arises from the limitation of broadcasting only one feature to PEs at each clock cycle, and it skips broadcasting to PEs where the features are not required. Consequently, this results in poor PE utilization in GoSPA.

Sparse-PE [33] utilizes a multi-threaded, generic PE core for performing sparse matrix multiplication. The PE core employs a look-ahead mechanism to anticipate computations beforehand and only schedules valid computations, thereby avoiding ineffectual computations. This look-ahead strategy is beneficial in optimizing the PE core's efficiency. However, similar to the issue faced by SIGMA [37], Sparse-PE encounters challenges when the look-ahead window is not large enough to handle high sparsity level cases. In such situations, the resource utilization of the PE core may be suboptimal. On the other hand, enlarging the look-ahead window can address the resource utilization problem but comes at the cost of introducing bulky hardware overhead and increased complexity.

S2 Engine [34] utilizes an output-stationary systolic dataflow approach with multiple asynchronous PEs. Each PE selects aligned data pairs by leveraging a faster clock speed, which is set at four times the computation speed. However, this design choice results in a limitation on the operation speed of the overall accelerator. Due to the asynchronous nature and the faster clock speed of the PEs, the maximum operation speed of the S2 Engine is always four times slower than the processing allowable speed. This discrepancy creates a bottleneck in the accelerator's performance, as it cannot fully exploit the potential processing speed available.

ISOSceles [43] proposes an inter-layer pipeline approach to reduce data movement in CNN accelerators. Processing CNN models layer-by-layer can, as the paper claims, lead to a large number of intermediate outputs, which may overflow the accelerator's internal SRAM. Consequently, the intermediate outputs have to be stored externally, resulting in increased data movement costs. However, the presence of sparsity in the CNN models helps significantly reduce the size of intermediate outputs. As a result, the experiments conducted on four CNN models (AlexNet, VGG16, GoogLeNet, and ResNet50) demonstrate that a 1MB internal SRAM is sufficient to store the intermediate results when using 8-bit precision. Despite the benefits of reducing data movement, processing CNNs in an inter-layer fashion requires a substantial amount of locally stored weights for multiple consecutive layers. This local weight storage is necessary to facilitate the inter-layer pipeline, but it also introduces additional hardware complexity.

III. INNER JOIN LOGIC
In Section I, we explain that output-stationary dataflow offers advantages over weight-stationary or input-stationary approaches because it minimizes the movement of psums. Psums tend to have larger sizes compared to weights or inputs since they accumulate over time during the convolution operation. By applying output-stationary dataflow, a convolution operation can be seen as a combination of multiple inner products between input feature vectors and weight vectors. However, the irregularity of sparse inputs poses a challenge in efficiently finding non-zero matching input pairs between the two inputs (i.e., inner join), which is crucial for utilizing the multiplier and improving system throughput. In this section, the paper presents the proposed inner join logic, designed to address this challenge while maintaining low power consumption.

FIGURE 2. Proposed inner join logic.

FIGURE 2 illustrates the functional diagram of the proposed inner join logic. The input feature and filter tensors are divided into multiple n-sized vectors, referred to as ''chunks'' (e.g., n = 128). The primary purpose of the inner join logic is to identify non-zero matching pairs of these input chunks, as they represent valid operand pairs for the convolution operations. To optimize the operation of the inner join logic,
and at the same time, to reduce the data transfer costs, these
chunks are compressed using the Zero-Value Compression
(ZVC) method. As a result of this compression, each chunk
becomes a two-tuple consisting of a bitmask and a set of
non-zero values. The bitmask in each chunk represents the
positions of non-zero values within the chunk. It is a binary
representation, where a ‘‘1’’ in the bitmask indicates the
presence of a non-zero value at the corresponding position
in the chunk, and a ‘‘0’’ represents a zero value.
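As a small illustration of the ZVC format just described, the Python sketch below (ours, with arbitrary example values) compresses an 8-element chunk into the bitmask/value two-tuple and expands it back.

def zvc_compress(chunk):
    """Zero-Value Compression: a presence bitmask plus only the non-zero values."""
    bitmask = [1 if v != 0 else 0 for v in chunk]
    values = [v for v in chunk if v != 0]
    return bitmask, values

def zvc_decompress(bitmask, values):
    """Expand a (bitmask, values) two-tuple back into the dense chunk."""
    out, it = [], iter(values)
    for bit in bitmask:
        out.append(next(it) if bit else 0)
    return out

chunk = [0, 7, 0, 0, 3, 0, 9, 1]
mask, vals = zvc_compress(chunk)          # mask = [0,1,0,0,1,0,1,1], vals = [7, 3, 9, 1]
assert zvc_decompress(mask, vals) == chunk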
To reduce the hardware size, the inner join logic utilizes a pair of k-bit sub-chunks (e.g., k = 8) from the bitmasks as inputs at a time, which are referred to as search windows. The goal of the inner join logic is to locate non-zero matching pairs of values from both the feature and weight sub-chunks within a search window. These valid pairs are then sent to the MAC unit, one pair at every clock cycle. Once all valid inputs in the current sub-chunks are processed, the window is shifted to the next sub-chunks. Locating the matching input pairs involves two key steps: (1) finding the valid input pairs and (2) accessing the corresponding non-zero values in the compressed arrays. In the first step, the positions of the non-zero matching pairs within a search window are identified by performing an AND operation between the two bitmask sub-chunks. A k-bit priority encoder is utilized to encode the AND result, generating one matching position at every clock cycle with priority decreasing from right to left, as shown in FIGURE 2. To determine the next matching position, the current-matching bit is set to 0.

In FIGURE 2, an example with an 8-bit search window is illustrated, which is represented by a black dash-dotted rectangle. The current matching position is highlighted in the third bit from the right of the search window, marked by the red dashed box. The output of the priority encoder is 3, indicating the current matching position. To find the address of the non-zero matching value in the compressed array for each input chunk, the computation is divided into two steps. The first step involves counting the total number of 1s in the previous search windows, which is stored in an accumulator. The second step is to count the number
of 1s within the current search window up to the matching
position indicated by the output of the priority encoder. In this
example, the number of 1s in the previous search window
(before the current matching position) for the feature chunk
is 3, as indicated by the accumulator output. This number
is then added to the output of the 8-bit prefix sum, which
counts all 1s from the beginning of the current search window
to the current matching position. In this case, the number of
1s in the current search window up to the matching position
is 2. Therefore, the address of the non-zero matching input
is computed to be 5. After processing the current search
window, the total number of 1s in the window (which is 4)
is accumulated by the accumulator, and the search window
is shifted left by 8 bits to continue the process for the next
sub-chunks of data.
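The following Python sketch (ours; it abstracts the RTL of FIGURE 2) walks one 8-bit search window in the same way: AND the two bitmask sub-chunks, emit one matching position per step with right-to-left priority, and form each operand's address as the accumulated count of 1s from earlier windows plus a prefix count inside the current window. The example values are chosen so that the feature-side address evaluates to 3 + 2 = 5, mirroring the example above.

K = 8  # search-window width in bits

def matches_in_window(feat_bits, wgt_bits):
    """Yield matching bit positions of one window, right-most (lowest index) first."""
    anded = [f & w for f, w in zip(feat_bits, wgt_bits)]
    while any(anded):
        pos = anded.index(1)          # priority encoder: right-most pending match
        anded[pos] = 0                # clear the current-matching bit
        yield pos

def operand_address(acc_ones, window_bits, pos):
    """Address into the compressed value array for the operand matched at 'pos'."""
    prefix = sum(window_bits[:pos + 1])   # 8-bit prefix sum up to the matching position
    return acc_ones + prefix              # plus the 1s accumulated from earlier windows

# Toy window, index 0 = right-most bit. The feature chunk already has 3 non-zeros in
# earlier windows (accumulator = 3); the prefix count at the first match is 2, so the
# computed address is 3 + 2 = 5, as in the worked example.
feat_win = [0, 1, 0, 1, 0, 0, 1, 0]
wgt_win  = [1, 0, 0, 1, 1, 0, 1, 0]
for pos in matches_in_window(feat_win, wgt_win):
    print(pos, operand_address(3, feat_win, pos))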
The prefix sums in the proposed inner join logic are
designed as carry lookahead-like logarithmic delays, as rep-
resented in [40]. This design choice significantly reduces the
hardware size compared to ripple carry-like linear delays.
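For reference, a logarithmic-depth prefix sum can be sketched in software as a Hillis-Steele-style scan; this is only an analogue of the carry lookahead-like network of [40], not the authors' RTL.

def prefix_popcount(bits):
    """Inclusive prefix sums of a bitmask computed in log2(n) combining steps,
    a software stand-in for the logarithmic-delay hardware prefix network."""
    sums = list(bits)
    step = 1
    while step < len(sums):
        sums = [s + (sums[i - step] if i >= step else 0) for i, s in enumerate(sums)]
        step *= 2
    return sums

print(prefix_popcount([0, 1, 0, 1, 0, 0, 1, 0]))   # [0, 1, 1, 2, 2, 2, 3, 3]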
FIGURE 4. Workload assignment.
The application of the proposed inner join leads to much
smaller and more efficient prefix sums and priority encoders
compared to Sparten’s 128-bit architectures. In FIGURE 3(a),
the benefits of using the proposed technique are evident when
comparing an 8-bit prefix sum as a baseline to a 128-bit
prefix sum. The power consumption and hardware resource
utilization, including the number of registers and lookup
tables (LUTs), of a 128-bit prefix sum are greater than 200×
and 55×, respectively, compared to the baseline 8-bit prefix
sum. Similarly, in FIGURE 3(b), the power consumption and
hardware resource utilization of the priority encoder increase as its input size increases. Despite the requirement of two additional adders and two accumulators in the proposed inner join logic, the overall size and power consumption of the inner join logic are significantly smaller than those of the 128-bit architecture in Sparten.

FIGURE 3. Power and hardware resource utilization of (a) different-sized prefix sums and (b) different-sized priority encoders.

IV. MICROARCHITECTURE
A. WORKLOAD ALLOCATION
In FIGURE 4, the workload allocation for PEs is illustrated for the designed architecture, which utilizes a 16 × 16 PE array. Generally, PEs on the same row are responsible for producing output features of consecutive output channels, while PEs on the same column produce output features for the same output channel. This implies that PEs on the same row share the same input features, and PEs on the same column share the same filter weights. PEs on the same row are grouped together to form a PE cluster. However, to enhance the total PE utilization and accommodate variations in the shape of CNN models' layers, the PE array can be divided into smaller groups, and different tasks can be assigned to these groups. For instance, when the width and height of the output feature map are small but the number of channels is significant, PEs in the same column can be broken into smaller groups, and PEs in different groups can be assigned to produce output for different output channels. Due to the diverse shapes of CNN layers, a mapping algorithm is designed to allocate the output feature map to the PEs in a way that ensures each PE receives a comparable amount of workload, thereby minimizing workload imbalance. In the proposed architecture, each PE is responsible for producing a separate output feature section, resulting in overlapping input features allocated to neighboring PEs at the section border. To minimize the number of input data reads, the output feature sections assigned to the PEs are squared.
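The mapping algorithm itself is not listed in the paper; the sketch below (our simplification) only illustrates the idea of tiling the output volume into 14 × 14 sections and dealing them across a 16 × 16 PE array so that per-PE tile counts stay balanced.

def assign_output_sections(P, Q, K, rows=16, cols=16, sect=14):
    """Illustrative tiler (not the authors' mapping algorithm): split a P x Q x K
    output volume into sect x sect sections and deal them round-robin over a
    rows x cols PE array."""
    tiles = [(p, q, k) for k in range(K)
                       for p in range(0, P, sect)
                       for q in range(0, Q, sect)]
    assignment = {}
    for i, tile in enumerate(tiles):
        pe = (i % rows, (i // rows) % cols)      # spread tiles over all 256 PEs
        assignment.setdefault(pe, []).append(tile)
    return assignment

a = assign_output_sections(P=56, Q=56, K=64)     # 4 x 4 sections x 64 channels = 1024 tiles
print(min(len(v) for v in a.values()), max(len(v) for v in a.values()))   # 4 4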
After receiving all the necessary input data, including input features, weights, and configuration data, the PEs operate independently from one another. Their communication is limited to global buffers and the control block, which comes into play when they finish processing all the buffered input data or complete the final sums. To address the latency caused by data transfer time and optimize data access, a ping-pong double-buffering technique is implemented for all PEs. Each PE is equipped with a ping-pong double-buffering weight buffer, while PEs within the same cluster share a ping-pong double-buffering input feature buffer. This approach allows
PEs to process data from one buffer while the other buffer
is simultaneously filled with new input data. At every clock
cycle, all PEs within a cluster send requests to their shared
feature buffer, and the buffer responds to each PE individ-
ually. To ensure efficient and simultaneous data access for
all PEs within the cluster, the number of reading ports in
the shared feature buffer equals the number of PEs in that
cluster. Whenever a PE processes all the buffered features in
one half of the shared feature buffer, it can immediately swap
to the other half, provided that the other half is filled with
newly arrived data. The PE does not need to wait for other
PEs in the same cluster to finish their tasks before swapping
its input buffer. This control technique desynchronizes the
operations of PEs in a cluster, virtually enlarging their input
buffer size. As a result, the workload imbalance among the
PEs is minimized, leading to improved overall efficiency and
performance of the accelerator.
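A minimal software model of this shared ping-pong buffer (ours; the real control is an FSM, not Python) shows the key property that each PE swaps halves independently rather than at a cluster-wide barrier.

class SharedFeatureBuffer:
    """Sketch of the cluster's shared ping-pong feature buffer (illustrative only):
    each PE keeps its own active-half pointer, so a PE that drains one half swaps
    to the other immediately instead of waiting for the rest of the cluster."""
    def __init__(self, num_pes):
        self.halves = [[], []]          # two halves, refilled alternately by control logic
        self.active = [0] * num_pes     # per-PE active half (deliberately not synchronized)

    def refill(self, half, data):
        self.halves[half] = list(data)

    def drain(self, pe):
        data = self.halves[self.active[pe]]
        self.active[pe] ^= 1            # independent swap: no cluster-wide barrier
        return data

buf = SharedFeatureBuffer(num_pes=2)
buf.refill(0, [3, 0, 7])
buf.refill(1, [5, 1, 0])
print(buf.drain(0), buf.drain(0))       # PE 0 reads half 0, then immediately half 1
print(buf.drain(1))                     # PE 1 is still on half 0 at this point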

B. PE ARCHITECTURE

FIGURE 5. PE architecture.

FIGURE 5 illustrates the microarchitecture of a single PE.
The primary function of the PE is to compute the final sums for an output feature section on an output channel. PEs are equipped with the necessary components to efficiently process the input data. The key components of the PE include:
• Input selector: This block is equipped with the inner join logic explained in Section III to efficiently identify non-zero matching pairs of inputs.
• 8-bit multiplier: The PE includes an 8-bit multiplier that performs the multiplication of feature and weight values.
• 24-bit accumulator: The PE utilizes a 24-bit accumulator to accumulate the results of the multiplier into psums stored in the output buffer.
• Output buffer: The PE has a 24-bit output buffer, which has a size of 14 × 14 elements. This buffer size is selected based on experimental results from various CNN models to ensure an even division of the output channel dimension among multiple PEs.
• Feature and Weight buffers: The PE employs an 8-bit shared input feature buffer and an 8-bit weight buffer. As explained in Section IV-A, both buffers are double buffered to minimize latency caused by data transfer. Each PE possesses a local doubled weight buffer with a size of 2 × 72 elements. However, PEs in a cluster share a doubled input feature buffer to reduce hardware overhead. The shared input feature buffer can accommodate a maximum of 2 × 128 elements.
• Finite state machine (FSM) controller: Each PE is equipped with a separate state machine controller. This controller automatically orchestrates the operation of the PE once it receives input data and configuration information, such as the number of assigned output features and stride. Its primary function is to reuse the stored data in the input buffers to produce all possible outputs. The resulting per-PE storage is summarized in the sketch after this list.
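For reference, a few lines of Python (ours) total up the storage implied by the buffer sizes listed above; the byte figures are plain arithmetic, not values reported in the paper.

# Per-PE storage implied by the listed buffer sizes (8-bit operands, 24-bit psums).
output_buf_bits  = 14 * 14 * 24          # 14 x 14 output section, 24-bit psums
weight_buf_bits  = 2 * 72 * 8            # double-buffered local weight buffer
feature_buf_bits = 2 * 128 * 8           # shared per cluster, not per PE

print(output_buf_bits / 8, "bytes of psum storage per PE")      # 588.0
print(weight_buf_bits / 8, "bytes of weight storage per PE")    # 144.0
print(feature_buf_bits / 8, "bytes of shared feature storage")  # 256.0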
C. OVERALL ARCHITECTURE OF THE ACCELERATOR

FIGURE 6. CSSpa overall architecture.

FIGURE 6 illustrates the top-level architecture and memory hierarchy of the CSSpa system. The controller is omitted for simple illustration. The core processor includes a 16 × 16 PE array, as introduced in Section IV-A. The weights are stored in an off-chip DRAM, while the features are stored in a 1MB on-chip SRAM. The final sums from the PE array undergo ReLU activation and pooling before being compressed by the compressor. The compressed data is then stored locally in the on-chip SRAM to be processed in the subsequent layers. The size of the on-chip SRAM is selected to ensure that the outputs of intermediate layers in the CNN are sufficiently stored, thereby reducing expensive power consumption caused by accessing the off-chip DRAM. This size is determined based on the simulation results of common CNN models such as AlexNet, VGG16, GoogLeNet, and ResNet50. In case of data overflow, the excess data is stored in the DRAM. Both the on-chip SRAM and off-chip DRAM have a bandwidth of 64 bits. When PEs complete their executions, they send alert signals to the buffers to refill the data. Meanwhile, the PEs continue their operations with the other half of the ping-pong buffer without any latency.

D. DATA COMPRESSION METHOD
In order to reduce storage requirements, the proposed architecture employs output feature compression before storing them in the SRAM. FIGURE 7 depicts the compression format utilized in this approach. The output feature tensor is divided into separate subsections, with each subsection comprising Ct consecutive channels, Wt columns, and Ht rows. The dimensions of these subsections are determined based on the results of the mapping algorithm, which considers the
parameters of the subsequent layer to efficiently partition the output features of the current layer. To facilitate the operation of the inner join logic, the data is stored in the ZVC method. The compression process involves converting a 3-D sparse data tensor into a 2-D array containing non-zero values and a corresponding 2-D bitmask, which can be effectively stored in the memory hardware. The priority of compression is in descending order, starting from the channel dimension, followed by rows, and finally columns. In addition to the compressed ZVC data, metadata is also stored, which aids in reading the compressed data during the execution of the subsequent layer. While the compression of the output features is performed on-the-fly, the compression of the weights can be accomplished offline using a similar compression method and priority order.

FIGURE 7. Channel stacking data compression method.
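A minimal sketch of this channel-stacking compression, assuming a NumPy tensor laid out as (Ct, Ht, Wt) and our own simplified handling of the stored layout (the on-chip metadata is more involved), is shown below.

import numpy as np

def compress_subsection(t):
    """Channel-stacking ZVC of one (Ct, Ht, Wt) output subsection: flatten so that
    channels vary fastest, then rows, then columns, and keep a bitmask plus the
    non-zero values (a simplified view of the stored format)."""
    flat = np.transpose(t, (2, 1, 0)).reshape(-1)   # columns outermost, channels fastest
    bitmask = (flat != 0).astype(np.uint8)
    values = flat[flat != 0]
    return bitmask, values

sub = np.zeros((4, 2, 2), dtype=np.int8)            # Ct = 4, Ht = Wt = 2
sub[0, 0, 0], sub[2, 0, 0], sub[1, 1, 1] = 3, -5, 7
mask, vals = compress_subsection(sub)
print(mask.reshape(-1, 4))   # one row per spatial position, four stacked channels each
print(vals)                  # [ 3 -5  7]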

V. CSSPA DATAFLOW
In Section IV-A, we provide a detailed explanation of the workload allocation to the PEs and the spatial data reuse strategy adopted to minimize memory accesses. This section introduces the local dataflow of the PEs and the temporal data reuse strategy. The primary objective of an efficient dataflow is to transfer data from large sources to constrained buffers at the lowest possible transfer cost. To achieve these goals, we propose the Channel Stacking for Sparse Tensors dataflow (CSSpa), which addresses several key targets. Firstly, CSSpa ensures that locally buffered input data is fully reused, effectively reducing the number of memory accesses. Secondly, the dataflow is designed to avoid expensive movement of the output psums, thereby further optimizing data transfer and reducing unnecessary overhead. Lastly, CSSpa is designed to easily accommodate different convolutional strides, providing flexibility and adaptability to various configurations.

Algorithm 1 CSSpa Dataflow
1: f = Array(Wt, Ht, C)         # Input feature workload
2: w = Array(S, R, C)           # Weight workload
3: o = Array(Qt, Pt)            # Output workload
4: f_buf = Array(Wt, Ct)        # Input feature buffer
5: w_buf = Array(S, R, Ct)      # Weight buffer
6: f_op = Array(S, Ct)          # Input feature operand
7: w_op = Array(S, Ct)          # Weight operand
8:
9: for ct in [0, C, Ct):
10:     w_buf = w[S, R, ct]                      # Store weight buffer
11:     for h in [0, Ht, 1):
12:         f_buf = f[Wt, h, ct]                 # Store input feature buffer
13:         for r in [0, R, stride) if h - Pt < r <= h:
14:             w_op = w_buf[S, r, ct]           # Select weight operands
15:             for w in [0, Wt, stride) if w < Qt:
16:                 f_op = f_buf[w : w+S-1, ct]  # Select input feature operands
17:                 p = h - r
18:                 q = w
19:                 o[p, q] += w_op * f_op       # Perform dot product

A. NON-UNIT WIDTH/HEIGHT FILTER CASE

FIGURE 8. CSSpa dataflow for convolution layers with non-unit width and height filters.

In FIGURE 8, the CSSpa dataflow is presented, specifically designed to handle convolution layers with non-unit width and height filters. As previously discussed, each PE is allocated the final sums of an output section with dimensions Pt × Qt, which are stored in the PE's output buffer. However, due to the limited size of the input buffer, the required input data is divided and processed through a series of steps.
• Step 1: Initially, a weight subsection of size S × R × Ct and an input feature subsection of size Wt × 1 × Ct are loaded into the PE's buffers. These subsections are represented by the black boxes in FIGURE 8.

• Step 2: Within the buffered input features and weights, a weight subsection and an input feature subsection of size S × 1 × Ct are selected to participate in a dot product operation, resulting in one output psum. This data size is referred to as the selection window. In FIGURE 8, the selection windows are represented by the red boxes.
• Step 3: After completing the dot product in Step 2, the selection window on the input features moves right with the number of steps equal to the stride, and Step 2 is repeated. This operation is denoted as ❶ in the figure.
• Step 4: Once all locally buffered input features are selected, the selection window on the weight tensor moves down with stride steps to select another portion of weights, and Steps 2−3 are repeated. This action is depicted as ❷ in the figure.
• Step 5: Upon finishing Step 4, the data in the input feature buffer is flushed and replaced by another row of input features with the size of Wt × 1 × Ct. This action is illustrated as ❸. Steps 2−4 are then repeated.
• Step 6: When the last row of input features is totally processed, the weight buffer is flushed and replaced by the next Ct filter channels. Additionally, the input features are moved channel-wise with Ct channels. Steps 1−5 are repeated, as shown as ❹ in the figure.

The proposed CSSpa dataflow fulfills the established requirements: 1) Locally buffered input features and weights are fully reused and replaced only when necessary (Steps 5 and 6). 2) The dataflow applies an output-stationary approach in Step 2 to avoid moving the output psums. 3) Variable strides are easily handled, as demonstrated in Steps 3 and 4. This efficient dataflow design maximizes data reuse, minimizes memory accesses, and significantly improves the overall performance and power efficiency of the accelerator. The CSSpa dataflow can also be described in Algorithm 1.
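To show the loop nest end-to-end, the runnable Python sketch below (ours; unit stride, a single filter, and no explicit ping-pong buffering) mirrors Algorithm 1 and checks its result against a direct convolution; the tile size Ct and the toy tensor shapes are arbitrary choices.

import numpy as np

def csspa_conv(f, w, Ct=2):
    """Minimal rendering of Algorithm 1 for one filter and unit stride (our sketch;
    buffering is modeled with array slices rather than real SRAM halves)."""
    Wt, Ht, C = f.shape               # input feature section: columns x rows x channels
    S, R, _ = w.shape                 # filter: width x height x channels
    Qt, Pt = Wt - S + 1, Ht - R + 1   # output section produced by this PE
    o = np.zeros((Qt, Pt), dtype=np.int32)
    for c0 in range(0, C, Ct):                       # Step 6: next Ct filter channels
        w_buf = w[:, :, c0:c0 + Ct]                  # Step 1: load weight subsection
        for h in range(Ht):                          # Step 5: next input feature row
            f_buf = f[:, h, c0:c0 + Ct]              # Wt x Ct row of buffered features
            for r in range(R):                       # Step 4: move selection window down
                p = h - r
                if not (0 <= p < Pt):
                    continue
                w_op = w_buf[:, r, :]                # S x Ct weight operands
                for q in range(Qt):                  # Step 3: slide the window right
                    f_op = f_buf[q:q + S, :]         # S x Ct feature operands
                    o[q, p] += int(np.sum(w_op * f_op))   # Step 2: dot product into psum
    return o

rng = np.random.default_rng(0)
f = rng.integers(-3, 4, size=(8, 8, 4))     # toy feature section
w = rng.integers(-2, 3, size=(3, 3, 4))     # toy 3 x 3 filter
ref = np.zeros((6, 6), dtype=np.int32)      # direct convolution for comparison
for q in range(6):
    for p in range(6):
        ref[q, p] = int(np.sum(f[q:q + 3, p:p + 3, :] * w))
assert np.array_equal(csspa_conv(f, w), ref)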
B. 1 × 1 FILTER CASE
The popularity of 1 × 1 filters in modern CNN models has increased due to their advantages, such as reducing the number of computations and introducing more non-linearity to enhance accuracy [6]. The 1 × 1 filters can account for a significant 39.2% of the parameters in GoogLeNet. However, since the width and height dimensions of a 1 × 1 filter are both one, each input feature can only be reused once by a filter. Consequently, globally optimized sparse dataflow [32] or Cartesian product [27] approaches are not efficient for these convolution operations. Instead, the dot-product between the channel-wise weight vector and the feature vector, as applied in Sparten [31], proves to be more efficient.

FIGURE 9. Dataflow for convolution layers with 1 × 1 filters.

FIGURE 9 illustrates the dataflow for layers with 1 × 1 filters. The dataflow has been modified to operate similarly to Sparten. The workload allocation for the PEs remains the same, where each PE is responsible for computing the final sums of Qt × Pt output features. However, the difference lies in the PE dataflow. Instead of storing a 2-D input feature subsection, the input feature data space is divided in the width and height dimensions while keeping the channel dimension intact, creating multiple 1-D input feature vectors called chunks. These input feature chunks are broadcast one-by-one to all the PEs, where each PE stores a 1 × 1 filter. The broadcasted input feature chunks participate in dot-products with the weight vectors stored in the PEs to produce all the assigned outputs. This dataflow enables spatial reuse of the input feature over multiple filters and allows temporal reuse of the filter over multiple input feature vectors, substantially reducing the number of memory accesses required.

C. DATAFLOW FOR FULLY CONNECTED LAYER
A fully connected layer can be considered as a special case of a convolutional layer where the filters have the same size as the input feature maps. Unlike traditional convolutional layers that exhibit weight sharing and utilize filters multiple times across different regions of the input, in a fully connected layer, each filter is only used once to participate in a dot-product with the entire input feature map to produce a single output feature. Due to this unique characteristic, the dataflow for fully connected layers can be simplified. It can be efficiently achieved by applying an output-stationary dataflow without any input reuse scheme. This means that each PE will compute a single dot-product between a filter and the entire input feature map, producing one output feature. Since each filter is only used once and not shared, there is no need for complex input reuse schemes like those used in convolutional layers with weight sharing.

VI. IMPLEMENTATION AND RESULTS
We evaluated the CSSpa architecture and reported the speed under different configurations, speedup on actual models, hardware area, and power efficiency. Both synthetic models and actual CNN models were utilized for various measurements. We devised a cycle-accurate simulator using the Python programming language, which can precisely determine the clock cycle requirements of the operations. To evaluate the speed under different configurations, synthetic models were processed by the simulator to report the total number of clock cycles required. Actual models were adopted to assess the realistic impacts of different design aspects on inference speed. Power consumption and hardware area were measured by developing an RTL version of the CSSpa core. The
108826 VOLUME 11, 2023


N.-S. Pham, T. Suh: et al.: Optimization of Microarchitecture and Dataflow for Sparse Tensor CNN Acceleration

RTL implementation was synthesized on a Zynq-Ultrascale+ FPGA to report the hardware area and then executed with synthetic models to report power consumption. Since the number of global memory accesses is also important to provide a comprehensive picture of the overall power consumption, we also provide these numbers by running inferences on various CNN models.

A. DESIGN SPACE EXPLORATION
The input size of the prefix sum can be viewed as a search window for valid input pairs for the MAC. A narrow search window can cause the MAC to be idle due to input shortages, especially when the inputs are highly sparse. To evaluate the effect of the prefix sum input size on MAC utilization, only one PE was used for the simulation. This was to avoid workload imbalances caused by concurrently simulating multiple PEs. We used synthetic features and weights with sparsity levels ranging from 30% to 80% as inputs to the simulators and recorded the MAC utilization during inference.

FIGURE 10. MAC utilization according to input sizes of the prefix sum.

FIGURE 10 depicts the MAC utilization ratio for different input sizes of the prefix sum across various input sparsity levels. Notably, at a sparsity level of 30%, the MAC achieved full utilization for all configured input sizes of the prefix sum. However, with an increase in sparsity level to 50%, there was a slight reduction in MAC utilization when using an 8-bit prefix sum input. As the sparsity level continued to rise and the prefix sum input size decreased, MAC utilization gradually declined. Specifically, with either an 8-bit or 16-bit prefix sum input, MAC utilization experienced a significant drop at high sparsity levels (70% − 80%). On the other hand, when the prefix sum input was 32-bit, the MAC was nearly fully utilized, even at the common sparsity levels found in CNNs. Given the favorable MAC utilization with a 32-bit input for the prefix sum at typical sparsity levels, this configuration was selected for designing the prefix sum and priority encoder. Furthermore, it was employed in subsequent measurements, representing an optimal choice for the system's efficiency and performance.
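The trend can also be seen with a back-of-envelope model (ours, not the cycle-accurate simulator): a k-bit window yields on average k·df·dw matching pairs, so utilization collapses once that product falls below one pair per cycle.

def expected_mac_utilization(window_bits, feat_density, wgt_density):
    """Rough analytical estimate: expected matches per k-bit window, capped at 1,
    ignoring window-advance overhead and other pipeline effects."""
    return min(1.0, window_bits * feat_density * wgt_density)

for k in (8, 16, 32, 128):
    # 80% sparsity on both operands, i.e., 0.2 density each
    print(k, round(expected_mac_utilization(k, 0.2, 0.2), 2))
# prints 8 0.32, 16 0.64, 32 1.0, 128 1.0 -- consistent with the trend in FIGURE 10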
B. SPEEDUP ON ACTUAL CNN MODELS

TABLE 2. Benchmarks.

The performance evaluation of CSSpa was conducted on four commonly used CNN models: AlexNet, VGG16, GoogLeNet, and ResNet50. These models were pruned and trained on the CIFAR10 dataset [44] to achieve similar accuracy as their dense counterparts. TABLE 2 provides the average sparsity information for each of these models. To comprehensively assess the speedup achieved by CSSpa, we compared its performance against a dense architecture and recent state-of-the-art CNN accelerators, including Sparten [31] and GoSPA [32]. Cycle-accurate simulators of these architectures were designed with consistent configurations, featuring 256 multipliers, 1MB on-chip SRAM for storing output features, and DRAM for storing the models' pretrained weights. The simulators were developed using Python, allowing convenient retrieval of model parameters through the PyTorch library [45]. The dense simulator used for comparison shared architectural similarities with CSSpa but lacked the sparsity-exploiting logic. Since Sparten and GoSPA do not support fully connected layers, performance comparisons with these architectures were confined to convolutional layers. In contrast, the performance of fully connected layers was compared with that of the dense architecture.

FIGURE 11 presents a comparison of convolution layer speedup among different CNN accelerators on various CNN models. The performance of CSSpa consistently outperformed other architectures across all convolution layers. Specifically, CSSpa achieved speedups of 5.6×, 1.4×, and 2.9× compared to Dense, Sparten, and GoSPA on AlexNet, respectively. Similarly, for VGG16, the speedups were 2.7×, 1.15×, and 1.4×, while for GoogLeNet, the speedups were 5.8×, 1.18×, and 2.96×, and for ResNet50, the speedups were 6.5×, 1.6×, and 2.18×, respectively. Furthermore, CSSpa's performance on fully connected layers of AlexNet and VGG16 was notably better than that of the dense architecture due to its effective exploitation of sparsity in these layers' inputs.

FIGURE 11. Comparisons on normalized speedup on convolution layers of (a) AlexNet, (b) VGG16, (c) GoogLeNet, (d) ResNet50, and on fully connected layers of (e) AlexNet and (f) VGG16 (baseline: Dense architecture).

It is worth noting that the advantage claimed in the GoSPA paper over Sparten in terms of timing is not applicable. Our own implementation and verification of Sparten demonstrated that there is no need for Sparten to stall its PEs' operations and wait for the prefix sum circuits to complete their executions. As a result, the timing advantage of GoSPA over Sparten is rendered invalid. Moreover, GoSPA's architecture suffers from PE underutilization due to its inherent
data dependency, as explained in Section II. Consequently, in our experiments, the performance of Sparten outperformed GoSPA. The superior performance of CSSpa compared to Sparten can be attributed to our efficient desynchronization of PE operations within one cluster. By allowing each PE to independently read features from the shared ping-pong input feature buffer, there is no need for PEs to wait for others within the same cluster before swapping to the other half of the buffer. This approach virtually increases the input buffer size and effectively minimizes the workload imbalance among the PEs. As a result, there is no need to apply the greedy-balancing technique of Sparten, which significantly reduces hardware costs.

C. AREA EFFICIENCY
To evaluate the area efficiency of CSSpa, we implemented an RTL version of a single PE cluster using SystemVerilog, consisting of 32 PEs arranged in a row of the PE array. Each PE was equipped with an 8-bit multiplier and a 24-bit accumulator. For comparison purposes, we also developed an RTL version of the Sparten compute cluster with the same number of multipliers. The configuration parameters of both architectures are provided in TABLE 3. The main differences between the two architectures are as follows: (1) The prefix sum's input size in Sparten is 128 bits, while in CSSpa it is reduced to 32 bits. (2) In CSSpa's PE cluster, PEs share a ping-pong input feature buffer, whereas in Sparten, each
PE possesses its own double-buffered input feature buffer. (3) Sparten's PE includes collocated weight and collocated output to facilitate its greedy-balancing technique, which is not required in CSSpa as explained in Section VI-B. To assess resource utilization, we synthesized the RTL of both architectures on a Zynq-Ultrascale+ FPGA. Despite Sparten being implemented on an ASIC, FPGA-implemented versions can still provide valuable insights and serve as reliable references. TABLE 4 presents the utilized numbers of FPGA resources and their breakdown information. Thanks to the utilization of a smaller input size for the prefix sums, the number of LUTs was reduced by 5.25×, decreasing from 100,160 in Sparten to 19,072 in CSSpa. Similarly, the number of LUTs in priority encoders was reduced by 75.55×, decreasing from 96,704 in Sparten to only 1,280 in CSSpa. While the reduction in the input size of the prefix sums significantly contributes to the decrease in the number of LUTs, the reduction in the number of Flip-Flops (FF) in CSSpa is attributed to the decrease in the PE's buffer sizes compared to Sparten, as shown in TABLE 3.

TABLE 3. PE configurations.

TABLE 4. Numbers of FPGA primitives in synthesized PE clusters.

The FPGA-synthesized results encompass only LUTs and FFs, which is insufficient to offer a comprehensive overview of hardware area. To address this, we have also conducted ASIC synthesis and subsequently provided results for the synthesis of an individual PE cluster in both Sparten and CSSpa, employing the FreePDK45nm process technology. Notably, the generation of input and output buffers has been realized through the utilization of OpenRAM [46]. TABLE 5 presents comparisons regarding the sizes of Sparten and CSSpa. It is essential to acknowledge that a divergence existed between our implementation and that which was originally presented in the Sparten paper. This discrepancy arises from the inherent challenges in designing an exact circuit replication in the absence of RTL source code. However, it is discernible that the prefix sum's footprint in our designed Sparten remains substantial, occupying 35.12% of the chip's area. Remarkably, through the application of our proposed approach, this occupied area is significantly reduced by a factor of 4.69. Furthermore, the dimensions of the priority encoder have been condensed by a factor of 3.67.

TABLE 5. ASIC area of Sparten and CSSpa.

D. ENERGY EFFICIENCY
FIGURE 12 presents the normalized energy efficiency and energy breakdown of the CSSpa and Sparten PE clusters. The FPGA-synthesized circuits were designed to execute at a 100 MHz frequency, and Switching Activity Interchange Format (SAIF) files were collected to report the energy consumption. Synthetic feature and weight tensors with different sizes and sparsity levels were utilized as inputs for the FPGA-synthesized RTLs. The energy consumption was determined based on the average energy required to process these input tensors. Remarkably, the reduction in the input size of the prefix sums and priority encoders in CSSpa led to a significant decrease in energy consumption, achieving a 3.3× reduction compared to Sparten, as depicted in FIGURE 12(a).

FIGURE 12. Energy consumption on Zynq-Ultrascale+ FPGA: (a) Normalized energy efficiency and (b) Energy breakdown.

FIGURE 12(b) provides a clear illustration of the effect of the prefix sums and priority encoders on the total energy consumption of one PE cluster. Specifically, the energy consumptions of the prefix sum and priority encoder components
in CSSpa were substantially reduced from 59.5% and 2% in Sparten to 10.4% and 0.5% in CSSpa, respectively. It is important to note that FIGURE 12 focuses solely on the comparison of the energy consumption of one PE cluster. The overall energy consumption of the accelerator is also influenced by other functional blocks, with data movement between the PE clusters and global memories playing a significant role in the overall energy of the system.

FIGURE 13. Comparisons on normalized SRAM accesses on convolution layers of (a) AlexNet, (b) VGG16, (c) GoogLeNet, (d) ResNet50, and on fully connected layers of (e) AlexNet and (f) VGG16 (baseline: Dense architecture).

Since implementing hardware architectures for other accelerators is time-consuming, conducting comprehensive energy consumption comparisons can be challenging. However, energy consumption information can still be inferred from the energy consumption of the cores and the number of memory accesses, as memory accesses are significant contributors to the overall energy consumption. In FIGURE 13, we present a comparison of SRAM accesses when running inference on various accelerators using actual CNN models. It is evident that Sparten incurred a very large number of SRAM accesses due to its locally buffered data not being fully reused. On the other hand, GoSPA and CSSpa resulted in similar numbers of SRAM accesses since both accelerators read and flush the data only when they participate in all possible computations. The average numbers of SRAM accesses incurred by Sparten on AlexNet, VGG16, GoogLeNet, and ResNet50 were 9.4×, 7.3×, 2×, and 5.6×, respectively, compared to those caused by CSSpa. This substantial reduction in SRAM accesses for

AlexNet can be attributed to its relatively wide filters, which [9] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, ‘‘Efficient processing of
allow for more effective data reuse in CSSpa. In contrast, deep neural networks: A tutorial and survey,’’ Proc. IEEE, vol. 105, no. 12,
pp. 2295–2329, Dec. 2017.
GoogLeNet, which contains numerous 1 × 1 filters, experi- [10] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
enced the least reduction in SRAM accesses. Combining this A. Moshovos, ‘‘Cnvlutin: Ineffectual-neuron-free deep neural network
information with the reported energy consumption compar- computing,’’ in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit.
(ISCA), Jun. 2016, pp. 1–13.
isons of the cores, the SRAM access data can provide a more [11] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, ‘‘Eyeriss: An
accurate perspective on the overall energy efficiency of the energy-efficient recognizable accelerator for deep convolutional neural
accelerators. net-works,’’ IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
Jan. 2017.
[12] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro,
VII. CONCLUSION
This paper presents an innovative inner join logic aimed at reducing hardware costs in sparsity-exploiting accelerators. The proposed inner join logic streamlines the input-matching circuits of the inner-product approach by efficiently processing continuous search windows containing RVC-encoded inputs. This approach optimizes hardware size, minimizes power consumption, and maintains peak performance. Additionally, we introduce a novel dataflow, Channel Stacking of Sparse Tensors (CSSpa), which maximizes the reuse of input features buffered in a limited buffer, leading to a substantial reduction in the number of SRAM accesses and an overall improvement in the power efficiency of the baseline design. Furthermore, the CSSpa dataflow ensures adaptability to accommodate various types and strides of CNN layers. We also present a technique for sharing ping-pong input feature buffers among PEs in the same cluster, which virtually enlarges the effective local buffer size. This mitigates workload imbalances and contributes to enhanced overall speed. The implementation and simulation results demonstrate significant performance enhancements when compared with state-of-the-art architectures. To thoroughly evaluate the impact of the proposed inner join and dataflow, we plan to conduct an ASIC implementation of the accelerator in future work.

REFERENCES
[1] K. Fukushima, ‘‘Neocognitron,’’ Scholarpedia, vol. 2, no. 1, p. 1717, 2007, doi: 10.4249/scholarpedia.1717.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, ‘‘ImageNet large scale visual recognition challenge,’’ Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015, doi: 10.1007/s11263-015-0816-y.
[3] S. Kang, G. Park, S. Kim, S. Kim, D. Han, and H.-J. Yoo, ‘‘An overview of sparsity exploitation in CNNs for on-device intelligence with software-hardware cross-layer optimizations,’’ IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 11, no. 4, pp. 634–648, Dec. 2021.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ‘‘ImageNet classification with deep convolutional neural networks,’’ in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, Dec. 2012, pp. 1097–1105.
[5] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, ‘‘Going deeper with convolutions,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[7] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[8] A. Brock, S. De, S. L. Smith, and K. Simonyan, ‘‘High-performance large-scale image recognition without normalization,’’ 2021, arXiv:2102.06171.
[9] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, ‘‘Efficient processing of deep neural networks: A tutorial and survey,’’ Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[10] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, ‘‘Cnvlutin: Ineffectual-neuron-free deep neural network computing,’’ in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 1–13.
[11] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, ‘‘Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,’’ IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[12] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu, and T. Delbruck, ‘‘NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644–656, Mar. 2019.
[13] M. Kim and J.-S. Seo, ‘‘Convolutional neural network accelerator featuring conditional computing and low external memory access,’’ IEEE J. Solid-State Circuits, vol. 56, no. 3, pp. 803–813, Mar. 2021.
[14] J.-S. Park, J.-W. Jang, H. Lee, D. Lee, S. Lee, H. Jung, S. Lee, S. Kwon, K. Jeong, J.-H. Song, S. Lim, and I. Kang, ‘‘A 6K-MAC feature-map-sparsity-aware neural processing unit in 5 nm flagship mobile SoC,’’ in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 152–154.
[15] J.-W. Jang et al., ‘‘Sparsity-aware and re-configurable NPU architecture for Samsung flagship mobile SoC,’’ in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 15–28.
[16] X. Zhu, K. Guo, H. Fang, L. Chen, S. Ren, and B. Hu, ‘‘Cross view capture for stereo image super-resolution,’’ IEEE Trans. Multimedia, vol. 24, pp. 3074–3086, 2022, doi: 10.1109/TMM.2021.3092571.
[17] X. Zhu, K. Guo, S. Ren, B. Hu, M. Hu, and H. Fang, ‘‘Lightweight image super-resolution with expectation-maximization attention mechanism,’’ IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 3, pp. 1273–1284, Mar. 2022.
[18] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, ‘‘Cambricon-X: An accelerator for sparse neural networks,’’ in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Oct. 2016, pp. 1–12.
[19] N. Srivastava, H. Jin, S. Smith, H. Rong, D. Albonesi, and Z. Zhang, ‘‘Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations,’’ in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), San Diego, CA, USA, Feb. 2020, pp. 689–702, doi: 10.1109/HPCA47549.2020.00062.
[20] S. Han, J. Pool, J. Tran, and W. J. Dally, ‘‘Learning both weights and connections for efficient neural networks,’’ 2015, arXiv:1506.02626.
[21] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, ‘‘Pruning filters for efficient ConvNets,’’ in Proc. ICLR, 2016, pp. 1–13.
[22] Y. He, X. Zhang, and J. Sun, ‘‘Channel pruning for accelerating very deep neural networks,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1398–1406.
[23] W. Niu, X. Ma, S. Lin, S. Wang, X. Qian, X. Lin, Y. Wang, and B. Ren, ‘‘PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning,’’ in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., New York, NY, USA, Mar. 2020, pp. 907–922, doi: 10.1145/3373376.3378534.
[24] D. Kim, J. Ahn, and S. Yoo, ‘‘A novel zero weight/activation-aware hardware architecture of convolutional neural network,’’ in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 1462–1467.
[25] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, ‘‘Eyeriss V2: A flexible accelerator for emerging deep neural networks on mobile devices,’’ IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[26] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, ‘‘EIE: Efficient inference engine on compressed deep neural network,’’ in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.
[27] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, ‘‘SCNN: An accelerator for compressed-sparse convolutional neural networks,’’ in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2017, pp. 27–40.
[28] Z. Zhang, H. Wang, S. Han, and W. J. Dally, ‘‘SpArch: Efficient architecture for sparse matrix multiplication,’’ in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 261–274.


[29] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, ‘‘OuterSPACE: An outer product based sparse matrix multiplication accelerator,’’ in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2018, pp. 724–736.
[30] A. D. Lascorz, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, K. Siu, and A. Moshovos, ‘‘Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks,’’ in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., Apr. 2019, pp. 749–763.
[31] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, ‘‘SparTen: A sparse tensor accelerator for convolutional neural networks,’’ in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Oct. 2019, pp. 151–165.
[32] C. Deng, Y. Sui, S. Liao, X. Qian, and B. Yuan, ‘‘GoSPA: An energy-efficient high-performance globally optimized SParse convolutional neural network accelerator,’’ in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 1110–1123.
[33] M. A. Qureshi and A. Munir, ‘‘Sparse-PE: A performance-efficient processing engine core for sparse convolutional neural networks,’’ IEEE Access, vol. 9, pp. 151458–151475, 2021, doi: 10.1109/ACCESS.2021.3126708.
[34] J. Yang, W. Fu, X. Cheng, X. Ye, P. Dai, and W. Zhao, ‘‘S2 engine: A novel systolic architecture for sparse convolutional neural networks,’’ IEEE Trans. Comput., vol. 71, no. 6, pp. 1440–1452, Jun. 2022, doi: 10.1109/TC.2021.3087946.
[35] Z.-G. Liu, P. N. Whatmough, Y. Zhu, and M. Mattina, ‘‘S2TA: Exploiting structured sparsity for energy-efficient mobile CNN acceleration,’’ in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Seoul, South Korea, Apr. 2022, pp. 573–586, doi: 10.1109/HPCA53966.2022.00049.
[36] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, ‘‘ExTensor: An accelerator for sparse tensor algebra,’’ in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, New York, NY, USA, Oct. 2019, pp. 319–333, doi: 10.1145/3352460.3358275.
[37] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, ‘‘SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,’’ in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 58–70.
[38] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, ‘‘MatRaptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,’’ in Proc. 53rd Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), Athens, Greece, Oct. 2020, pp. 766–780, doi: 10.1109/MICRO50266.2020.00068.
[39] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler, and Z. Zhang, ‘‘SNAP: A 1.67–21.55 TOPS/W sparse neural acceleration processor for unstructured sparse deep neural network inference in 16 nm CMOS,’’ in Proc. Symp. VLSI Circuits, Jun. 2019, pp. C306–C307.
[40] Y.-C. Lin and C.-Y. Su, ‘‘Faster optimal parallel prefix circuits: New algorithmic construction,’’ J. Parallel Distrib. Comput., vol. 65, no. 12, pp. 1585–1595, Dec. 2005.
[41] M. Horowitz, ‘‘1.1 Computing's energy problem (and what we can do about it),’’ in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–14.
[42] Xilinx, Inc. (2021). Vivado. [Online]. Available: https://www.xilinx.com/support/download.html
[43] Y. Yang, J. S. Emer, and D. Sanchez, ‘‘ISOSceles: Accelerating sparse CNNs through inter-layer pipelining,’’ in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Montreal, QC, Canada, Feb. 2023, pp. 598–610, doi: 10.1109/HPCA56546.2023.10071080.
[44] A. Krizhevsky, V. Nair, and G. Hinton. (2014). The CIFAR-10 Dataset. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
[45] A. Paszke et al., ‘‘PyTorch: An imperative style, high-performance deep learning library,’’ in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[46] M. R. Guthaus, J. E. Stine, S. Ataei, B. Chen, B. Wu, and M. Sarwar, ‘‘OpenRAM: An open-source memory compiler,’’ in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Austin, TX, USA, Nov. 2016, pp. 1–6, doi: 10.1145/2966986.2980098.

NGOC-SON PHAM (Member, IEEE) received the B.S. degree from Vietnam National University, Hanoi, Vietnam, in 2011, and the M.S. and Ph.D. degrees in electrical and electronics engineering from Chung-Ang University, Seoul, South Korea, in 2019. He is currently a Research Professor with the Department of Computer Science and Engineering, Korea University. Prior to joining Korea University, he was an ASIC Design Engineer with Zaram Technology, South Korea. His research interests include mixed-signal integrated circuit design, embedded and real-time systems, and machine learning accelerators.

TAEWEON SUH (Member, IEEE) received the B.S. degree in electrical engineering from Korea University, Seoul, South Korea, in 1993, the M.S. degree in electronics engineering from Seoul National University, in 1995, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2006. He is currently a Professor with the Department of Computer Science and Engineering, Korea University.
