User Manual of 3D_NeuroSim_V1.0
Index
1. Introduction
2. New Feature Highlights in 3D+NeuroSim V1.0
3. System Requirements (Linux)
4. Installation and Usage (Linux)
5. Chip Level Architectures
   5.1 Interconnect: H-Tree
   5.2 Floorplan of Neural Networks
   5.3 Weight Mapping Methods
   5.4 Pipeline System
6. Circuit Level: Synaptic Array Architectures
   6.1 Parallel Synaptic Array Architectures
   6.2 Array Peripheral Circuits
7. Algorithm Level: PyTorch Wrapper
8. Algorithm Level: Inference Accuracy Estimation
9. How to run DNN+NeuroSim
10. Reference
1. Introduction
DNN+NeuroSim is an integrated framework, developed in C++ and wrapped by PyTorch, to emulate the inference performance (in V1.0-V1.3) or on-chip training performance (in V2.0-V2.2) of deep neural networks (DNNs) on hardware accelerators based on near-memory computing or in-memory computing architectures. This released 3D+NeuroSim extends DNN+NeuroSim to support electrical-thermal co-simulation of 3D integrated hardware accelerators. Various device technologies are supported, including SRAM, emerging non-volatile memory (eNVM) based on resistance switching (e.g. RRAM, PCM, STT-MRAM), and ferroelectric FET (FeFET). SRAM is by nature 1-bit per cell, while eNVMs and FeFET in this simulator can support either 1-bit or multi-bit per cell. NeuroSim [1] is a circuit-level macro model for benchmarking neuro-inspired architectures (including memory array, peripheral logic, and interconnect routing) in terms of circuit-level performance metrics, such as chip area, latency, dynamic energy and leakage power. With the PyTorch wrapper, the DNN+NeuroSim framework supports hierarchical organization from the device level (transistors from 130 nm down to 7 nm, eNVM and FeFET device properties) to the circuit level (periphery circuit modules such as analog-to-digital converters, ADCs), to the chip level (tiles of processing elements built up by multiple sub-arrays, with global interconnect and buffer) and then to the algorithm level (different convolutional neural network topologies), enabling instruction-accurate evaluation of the inference accuracy as well as the circuit-level performance metrics at inference run-time.
The target users of this simulator are circuit/architecture designers who wish to quickly estimate system-level performance with different network and hardware configurations (e.g. device technology choices, sequential read-out or parallel read-out, etc.). Different from our earlier released simulator (MLP+NeuroSim [2]), where the network was fixed to a 2-layer MLP and executed purely in C++ (with long run-time), this DNN+NeuroSim framework is an integrated simulator with a PyTorch wrapper (i.e. C++ wrapped by Python). With the wrapper, users can define various network structures and the precisions of synaptic weights and neural activations, which guarantees efficient inference on popular machine learning platforms. Meanwhile, the wrapper automatically saves the real traces (synaptic weights and neural activations) during inference and sends them to NeuroSim for real-time and real-traced hardware estimation. In this released version, three networks (VGG-8 for the CIFAR-10 dataset, DenseNet-40 for the CIFAR-10 dataset, and ResNet-18 for the ImageNet dataset) are provided as default models in the wrapper, with 8-bit synaptic weights and neural activations, while users can modify the precisions and neural network topologies. The hardware parameters (such as technology node, memory cell properties, operation modes, and so on) are defined in NeuroSim in Param.cpp.
b) Heterogeneous 3D Integration
Fig. 2. (a) Die stack model of heterogeneous 3D multi-tier CIM accelerator; (b) floorplan of heterogeneous 3D
multi-tier CIM accelerator (layer-by-layer operation).
In this version, we validate and calibrate the hardware performance prediction (area, critical path delay and energy consumption) against post-layout simulations of a 40nm RRAM-based CIM macro [5]. Some adjustment factors are introduced to account for transistor sizing and wiring area in the layout, gate switching activity, post-layout performance drop, etc. Specifically, α = 1.44 for the wire area in the level shifter; β = 1.4 for the sensing cycle as the critical path; γ = 50% and δ = 15% for the dynamic energy of DFFs and adders, respectively, in shift-adds or accumulators; ϵ = 5% for the dynamic energy of control circuits; and ζ = 1.23 for the post-layout energy increase. After these calibrations, the macro-level simulation from NeuroSim is accurate to within 1% error. After considering these realistic factors, the predicted performance decreases to some degree compared to previous versions. Users can enable/disable this option or change these factor values in Param.cpp.
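For intuition, the calibration can be sketched as scaling the raw pre-layout estimates by these factors; the Python below is only an illustration with hypothetical names, not the actual Param.cpp logic:

    # Calibration factors described above (see Param.cpp for the real definitions).
    alpha, beta, zeta = 1.44, 1.4, 1.23       # area, critical path, post-layout energy
    gamma, delta, epsilon = 0.50, 0.15, 0.05  # DFF, adder, control switching activity

    def calibrate(ls_area, sensing_cycle, dff_e, adder_e, ctrl_e):
        area = ls_area * alpha                     # wire area in the level shifter
        critical_path = sensing_cycle * beta       # sensing cycle as the critical path
        energy = (dff_e * gamma + adder_e * delta  # DFFs/adders in shift-adds/accumulators
                  + ctrl_e * epsilon) * zeta       # control circuits, post-layout increase
        return area, critical_path, energy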
3) Add synchronous and asynchronous mode (extended from DNN+NeuroSim V1.3)
In previous versions, the latency of the whole chip was accumulated from the critical path delay of each module, i.e. clockless, asynchronous operation. Considering practical circuit design, we added a synchronous mode, where latency is measured in clock cycles. The clock period is set by the compute-sensing cycle, which is the critical path from applying the input to the memory array until the ADC generates the digital partial sum; since this is an analog process, no digital buffer can be inserted in between. The latency of the other digital modules is measured as the number of cycles needed for processing, because their timing can be adjusted by adding digital buffers. The predicted performance, especially the throughput, of the synchronous mode is lower than that of the asynchronous mode. Users can change this option in Param.cpp.
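A minimal Python sketch of the difference between the two modes, assuming the per-module critical path delays are known (names and structure are illustrative, not the NeuroSim code):

    import math

    def chip_latency(module_delays, sensing_cycle, synchronous=True):
        # module_delays: critical path delay of each digital module (s).
        # sensing_cycle: array-input-to-ADC-output delay, used as the clock period.
        if not synchronous:
            return sensing_cycle + sum(module_delays)   # accumulate critical paths
        cycles = 1 + sum(math.ceil(d / sensing_cycle) for d in module_delays)
        return cycles * sensing_cycle                   # whole clock cycles only

Rounding each module up to whole clock cycles is why the synchronous throughput estimate is lower.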
4) Update technology file for FinFET (extended from DNN+NeuroSim V1.3)
The default transistor models in NeuroSim were calibrated with the predictive technology model (PTM) [6], which is publicly available and covers a wide range of technology nodes from 130nm to 7nm. However, as the PTM models (of 14nm, 10nm and 7nm) were proposed far earlier than the industry adoption of FinFET, their predictions of Fin size deviate from the actual values. We corrected the Fin height, width and pitch in Technology.cpp following recent trends in leading foundries, made corresponding changes to the standard cell height/width and interconnect wire pitch, and switched to the assumption of using the maximum electrical width (or fin number) in the standard cell for digital circuit design.
5) Add level shifter for eNVM (extended from DNN+NeuroSim V1.3)
A level shifter module is added for eNVMs (e.g. RRAM/PCM/FeFET) with high write voltage (>1.5V).
※ The tool may not run correctly (stuck forever) if compiled with gcc 4.5 or below, because some C++11
features are not well supported.
Command       Description
make          Compile the NeuroSim codes and build the “main” program
make clean    Clean up the directory by removing the object files and the “main” executable
※ The simulation uses OpenMP for multithreading, and it will use up all the CPU cores by default.
※ The wrapper is built under Python 3.4 + PyTorch 1.1.0 (GPU), with CUDA 10.0 + cuDNN v7.5.0.
5. Chip Level Architectures
In this framework, we assume the on-chip memory is sufficient to store the synaptic weights of the entire neural network, so the only off-chip memory access is to fetch the input data. Fig. 3 shows the modeled chip hierarchy, where the top level of the chip consists of multiple tiles, a global buffer, accumulation units, activation units (sigmoid or ReLU), and pooling units. Fig. 3 (b) shows the structure of a tile, which contains several processing elements (PEs), a tile buffer to load in neural activations, accumulation modules to add up partial sums from PEs, and an output buffer. Similarly, as Fig. 3 (c) shows, a PE is built up from a group of synaptic sub-arrays, PE buffers, accumulation modules and an output buffer. Fig. 3 (d) shows an example of a synaptic sub-array, which is based on the one-transistor-one-resistor (1T1R) architecture for eNVMs. At the sub-array level, the array architecture differs for SRAM or FeFET (not shown in this figure).
Fig. 3. The diagram of (a) top level of chip architecture, which contains multiple tiles, global buffer, accumulation
units, activation units (sigmoid or ReLU) and pooling units; (b) a tile with multiple processing elements (PEs), tile
buffer to load in activations, accumulation modules to add up partial sums from PEs and output buffer; (c) a PE
contains a group of synaptic arrays, PE buffer and control units, accumulation modules and output buffer; (d) an
example of synaptic array based on one-transistor-one-resistor (1T1R) architecture.
As Fig. 4 shows, a wire can be modeled as a group of wire segments separated by repeaters. To find the optimal wire segment length between repeaters, which leads to the minimum delay, a VLSI design function [7] is introduced as EQ (4.1) shows, where $R$ is the resistance of a minimum-sized repeater, $C$ is its gate capacitance, $C_p$ is its diffusion capacitance, and $R_w$ and $C_w$ are the unit resistance and capacitance of the wire, respectively; the corresponding optimal repeater width is given by EQ (4.2).

Fig. 4. The diagram of wire with repeaters.

$$L = \sqrt{\frac{2R(C + C_p)}{R_w C_w}} \qquad (4.1)$$

$$W = \sqrt{\frac{R C_w}{R_w C}} \qquad (4.2)$$
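For illustration, plugging representative (made-up) values into EQ (4.1) and (4.2) in Python:

    import math

    # Illustrative device/wire values, not NeuroSim defaults.
    R = 10e3       # resistance of a minimum-sized repeater (ohm)
    C = 0.1e-15    # repeater gate capacitance (F)
    Cp = 0.05e-15  # repeater diffusion capacitance (F)
    Rw = 2.0       # wire resistance per um (ohm/um)
    Cw = 0.2e-15   # wire capacitance per um (F/um)

    L = math.sqrt(2 * R * (C + Cp) / (Rw * Cw))  # optimal segment length (um), EQ (4.1)
    W = math.sqrt(R * Cw / (Rw * C))             # optimal repeater width (x minimum), EQ (4.2)
    print(L, W)  # ~86.6 um between repeaters, ~100x minimum-sized repeaters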
However, in practice, to limit the energy consumption of the interconnect, we may prefer a semi-optimal design point that trades off wire latency against energy. In this framework, we introduce two parameters, “globalBusDelayTolerance” and “localBusDelayTolerance” (for the global bus and the tile/PE local bus, respectively), to find the semi-optimal floorplan of the bus with such a delay sacrifice; they are defined in Param.cpp.
Fig. 5 shows an example of the H-tree structure for 4 × 4 computation units (either tiles or PEs), where the bus width connected to each unit is assumed to be the same. The H-tree is built up from multiple stages (horizontal and vertical), from the widest (main bus) to the narrowest ones (connected to the computation units). The wire length decreases by 2× at each stage from the wide to the narrow ones, while the sum of the bus widths at each stage is fixed and equals the width of the main bus.
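A minimal Python sketch of this geometry (the main-bus width and wire length below are illustrative values, not NeuroSim defaults):

    def htree_stages(num_stages, main_bus_width, main_wire_length):
        # Returns (wire length, per-branch bus width) per stage, widest to narrowest.
        # Wire length halves at each stage; the summed bus width per stage stays
        # equal to the main bus width.
        stages = []
        for s in range(num_stages):
            branches = 2 ** s                         # bus count doubles each stage
            stages.append((main_wire_length / 2 ** s,
                           main_bus_width / branches))
        return stages

    # e.g. 4 stages reaching a 4 x 4 array of tiles/PEs:
    for length, width in htree_stages(4, main_bus_width=512, main_wire_length=1000.0):
        print(length, width)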
method (as Fig. 5 shows). Similarly, the input data, which should be assigned to the various spatial locations in each kernel, will be sent to the corresponding sub-matrices, respectively. Partial sums from the sub-matrices can be obtained in parallel. An adder tree is then used to sum up the partial sums.
Hence, such a group of sub-arrays, together with the necessary input and output buffers and accumulation modules, can be defined as a processing element (PE). The kernels are split across several PEs according to their spatial locations, and the input data are assigned to the corresponding PEs; this makes it possible to reuse the input data among these PEs, i.e., to transfer input data directly between PEs without revisiting upper-level buffers.
collected along columns simultaneously at one time with high-precision Flash-ADCs based on a multilevel S/A with varying references. In both modes, adders and shift registers are used to shift and accumulate partial sums over multiple cycles of input vectors (which represent the MSB to LSB of the analog neural activations).
Fig. 10. (a) Sequential-read-out and (b) parallel-read-out analog eNVM pseudo-1T1R synaptic arrays; (c) sequential-read-out and (d) parallel-read-out analog FeFET synaptic arrays.
Fig. 11. Transformation from (a) conventional 1T1R array to (b) pseudo-crossbar array by 90° rotation of the BL to enable the weighted sum operation.
architecture, as shown in Fig. 11 (b). In the weighted sum operation, all the transistors are transparent when all WLs are turned on. Thus, the input vector voltages are provided to the BLs, and the weighted sum currents are read out through the SLs in parallel. The weighted sum currents are then digitized by a current-mode sense amplifier (S/A) and a Flash-ADC with a multilevel S/A using varying references.
3) Analog eNVM crossbar array
The crossbar array structure has the most compact and simplest array structure for analog eNVM devices
to form a weight matrix, where each eNVM device is located at the cross point of a word line (WL) and a
bit line (BL), as shown in Fig. 10 (c). The crossbar array structure can achieve a high integration density of
4F²/cell (F is the lithography feature size). If the input vector is encoded by read voltage signals, the
weighted sum operation (matrix-vector multiplication) can be performed in a parallel fashion with the
crossbar array. Here, the crossbar array assumes there is an ideal two-terminal selector device connected to
each eNVM, which is desired for suppressing the sneak path currents during the row-by-row weight update.
It should be noted that such an ideal selector device is still under research and development.
4) Analog FeFET array
As shown in Fig. 10 (c) and (d), the analog FeFET array is in the pseudo-crossbar fashion proposed in [11], similar to the analog eNVM pseudo-crossbar array. It also has an access transistor in each cell to prevent programming of unselected rows during the row-by-row weight update. As the FeFET is a three-terminal device, it needs two separate input signals: one to activate the WLs and one to apply read voltages to the RS (read select), where the RS is used to feed in the input vectors, as Fig. 12 shows below.
Fig. 12. Operations of (a) write and (b) read in FeFET cell.
6.2 Array Peripheral Circuits
The periphery circuit modules used in the synaptic arrays in Fig. 7 and Fig. 8 are described below:
1) Level shifter
A level shifter is normally required for RRAM (or PCM/FeFET) arrays to support the higher write voltage (e.g. >1.5V, which is higher than the logic VDD). In the simulator, we adopt a conventional level shifter, as shown in Fig. 13. If the validation mode is selected, a wiring area factor α = 1.44 is imposed on this module for calibration.
Fig. 13. Circuit diagram of the conventional level shifter.
2) Switch matrix
Switch matrices are used for fully parallel voltage input to the array rows or columns. Fig. 14 (a) shows the BL switch matrix as an example. It consists of transmission gates that are connected to all the BLs, with the control signals (B1 to Bn) of the transmission gates stored in registers (not shown here). In the weighted sum operation, the input vector signal is loaded to B1 to Bn, which decides whether each BL is connected to the read voltage or to ground. In this way, the read voltage applied at the input of the transmission gates can pass to the BLs, and the weighted sums are read out through the SLs in parallel. If the input vector has more than 1 bit, it should be encoded using multiple clock cycles, as shown in Fig. 14 (b). The reason we do not use an analog voltage to represent the input vector precision is the I-V nonlinearity of the eNVM cell, which would cause weighted sum distortion or inaccuracy, as discussed above. In the simulator, all the switch
Fig. 14 (a) Transmission gates of the BL switch matrix in the weighted sum operation. A vector of control signals (B1 to Bn) from the registers (not shown here) decides whether each BL is connected to a voltage source or to ground. (b) Control signals in a bit stream to represent the precision of the input vector.
matrices (slSwitchMatrix, blSwitchMatrix and wlSwitchMatrix) are instantiated from the SwitchMatrix class in SwitchMatrix.cpp; this module is used in parallel-read-out synaptic arrays.
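As a minimal sketch of the bit-serial input scheme described above (illustrative Python, not the simulator code), a k-bit input vector is applied over k clock cycles from MSB to LSB, and the per-cycle column sums are shifted and accumulated:

    import numpy as np

    def bit_serial_weighted_sum(inputs, G, k):
        # inputs: integer activations (0 .. 2^k - 1), one per row.
        # G: rows x cols conductance (weight) matrix. Returns column-wise sums.
        acc = np.zeros(G.shape[1])
        for cycle in range(k):                      # MSB first
            bits = (inputs >> (k - 1 - cycle)) & 1  # control signals B1..Bn this cycle
            acc = acc * 2 + bits @ G                # shift old sum, add new partial sum
        return acc

    x = np.array([5, 3, 0, 7])                      # 3-bit inputs
    G = np.random.rand(4, 2)
    assert np.allclose(bit_serial_weighted_sum(x, G, 3), x @ G)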
3) Crossbar WL decoder
The crossbar WL decoder is modified from the traditional WL decoder. It has an additional feature to activate all the WLs, making all the transistors transparent for the weighted sum. The crossbar WL decoder is constructed by attaching a follower circuit to every output row of the traditional decoder, as shown in Fig. 15. If ALLOPEN=1, the crossbar WL decoder activates all the WLs no matter what input address is given; otherwise it functions as a traditional WL decoder. In the simulator, the crossbar WL decoder contains a traditional WL decoder (wlDecoder) instantiated from the RowDecoder class in RowDecoder.cpp and a collection of follower circuits (wlDecoderOutput) instantiated from the WLDecoderOutput class in WLDecoderOutput.cpp; this module is used in sequential-read-out synaptic arrays.
Fig. 15 Circuit diagram of the crossbar WL decoder. Follower circuit is attached to every row of the decoder to
enable activation of all WLs when ALLOPEN=1.
4) Decoder driver
The decoder driver helps provide the voltage bias scheme for the write operation when its decoder selects the cells to be programmed. As the digital eNVM crossbar array has a write voltage bias scheme for both WLs and BLs, it needs the WL decoder driver (wlDecoderDriver) and the column decoder driver (colDecoderDriver). These decoder drivers are instantiated from the DecoderDriver class in DecoderDriver.cpp; this module is used in sequential-read-out synaptic arrays.
5) New Decoder Driver and Switch Matrix
One should notice that, for eNVM pseudo-crossbar and FeFET synaptic arrays, the WLs and BLs/RSs can be controlled by the same input signals but with different voltage values; thus, the area of an otherwise-needed BL/RS switch matrix can be saved. To achieve this, several extra control gates are added into the WL decoder driver circuits and into the WL switch matrix. Fig. 16 shows the circuit diagram of the new decoder driver and switch matrix for the eNVM pseudo-1T1R synaptic array, which can control both the WL and the BL (or RS) at the same time. In Fig. 16 (a), with the input and the decoder output, both the WL and the BL are controlled: the WLs are either activated or not, and the BLs are connected to either the read voltage or ground. Similarly, in Fig. 16 (b), each single WL switch matrix has two extra transmission gates that send two separate voltages into the corresponding WL and BL. In FeFET synaptic arrays, the signals connected to the BLs in this example are connected to the RSs instead. In the simulator, the WLNewDecoderDriver (decoder driver) is instantiated from the WLNewDecoderDriver class in NewDecoderDriver.cpp and the WLNewSwitchMatrix (WL switch matrix) is instantiated from the WLNewSwitchMatrix class in NewSwitchMatrix.cpp; these new decoder follower and switch matrix are used in eNVM pseudo-1T1R and FeFET synaptic arrays.

Fig. 16. Circuit diagram of (a) decoder follower and (b) WL switch matrix, which are used to control both WLs and BLs simultaneously, for pseudo-1T1R synaptic arrays.
6) Multiplexer (Mux) and Mux decoder
The Multiplexer (Mux) is used for sharing the read periphery circuits among synaptic array columns,
because the array cell size is much smaller than the size of read periphery circuits and it will not be area-
efficient to put all the read periphery circuits underneath the array. However, sharing the read periphery
circuits among synaptic array columns inevitably increases the latency of weighted sum as time
multiplexing is needed, which is controlled by the Mux decoder. In the simulator, the Mux (mux) is
instantiated from Mux class in Mux.cpp and the Mux decoder (muxDecoder) is instantiated from
RowDecoder class in RowDecoder.cpp.
7) Analog-to-digital converter (ADC)
To read out the partial sums and further process them in the subsequent logic modules (such as activation and pooling), ADCs are used at the end of the SLs to generate digital outputs. In the simulator, different types of ADC are supported, such as a Flash ADC using multilevel voltage-mode sense amplifiers (VSA) or current-mode sense amplifiers (CSA), and a successive-approximation-register (SAR) ADC, as shown in Fig. 17. They
Fig. 17 Schematic of (a) voltage sense amplifier (VSA); (b) current sense amplifier (CSA); (c) successive-
approximation-register (SAR) ADC.
have trade-offs in area/power and latency. Balancing energy consumption against latency, the Flash-ADC performs better at lower resolution (3-bit or below) while the SAR-ADC performs better at higher resolution (4-bit or above), especially with high RON (e.g. >100 kΩ); the break-even point can shift to 8-bit with low RON (e.g. <5 kΩ).
To precisely estimate the latency and energy of the S/A, we ran Cadence simulations across technology nodes from 130nm to 7nm. For each technology node, we chose a reasonable BL current range (considering the practical device resistance range), selected multiple specific IBL points in that range, and tracked the latency and power trends of each specific IBL while sweeping Iref (from 0.001×IBL to 1000×IBL). These Cadence experiments show that, for a fixed IBL, both latency and energy vary significantly with Iref/IBL: as Iref/IBL approaches 1, the latency and energy reach their maximum (it is extremely hard for the S/A to sense the difference). However, if we fix Iref/IBL to the value that leads to the maximum latency and energy and sweep IBL, the changes are quite smooth and not significant.
Then we sweep the technology nodes; at each node we sweep IBL, and for each IBL we sweep Iref. We collect all the simulated data from the Cadence simulations, then fit the data and build functions of latency and energy versus IBL and Iref for each technology node. In this way, NeuroSim can estimate the latency and energy based on real traces (which give a specific IBL, while Iref is automatically defined by NeuroSim according to Ron, Roff, the synaptic array size and the ADC precision). Fig. 18 shows an example of latency estimation based on the fitting functions, where the blue dots are estimated results and the red dots are simulated results from Cadence; the fitting functions yield reasonable mismatch with much faster simulation than Cadence.
Fig. 18. An example of latency estimation based on fitting functions compared with Cadence results.
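The sketch below illustrates this fitting methodology on purely synthetic data (the functional forms actually fitted in NeuroSim are not reproduced here): fit latency against log10(Iref/IBL) for one node, then evaluate the fitted function at run time:

    import numpy as np

    # Synthetic stand-in for Cadence data: latency peaks as Iref/IBL approaches 1.
    ratio = np.logspace(-3, 3, 61)                     # Iref/IBL from 0.001x to 1000x
    latency = 1e-9 / (np.abs(np.log10(ratio)) + 0.05)  # made-up peaked trend

    coeff = np.polyfit(np.log10(ratio), np.log10(latency), deg=6)

    def estimate_latency(i_ref, i_bl):
        # Evaluate the fitted function for a trace-specific IBL and chosen Iref.
        return 10 ** np.polyval(coeff, np.log10(i_ref / i_bl))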
Reading out the partial sums in parallel mode requires an ADC with high enough precision. For example, with a synaptic array size of 128×128 and each cell representing a 1-bit synapse, the partial sums along one column would be 7-bit, which is impractical as an ADC precision; thus we have to truncate the ADC precision (for partial sums) to minimize the area and energy overhead.
As Fig. 19 shows, we perform 8-bit inference of the VGG-8 network on the CIFAR-10 dataset to investigate the effect of truncating the ADC precision on the classification accuracy. We set the sub-array size to 128×128 and investigate three schemes with 1-bit, 2-bit and 4-bit cells. To minimize the ADC truncation effects on the partial sums, we utilize nonlinear quantization with various quantization edges (corresponding to different ADC precisions), where the edges are determined according to the distribution of the partial sums, as proposed in [12]. Compared to the baseline accuracy (no ADC truncation), the results suggest that at least a 4-bit ADC is required to prevent significant accuracy degradation. Compared to prior work on binary neural networks, where a 3-bit ADC was reportedly sufficient [12], the results in Fig. 19 suggest that higher weight precision generally requires higher ADC precision. With larger synaptic array size or higher cell precision, higher ADC precision is demanded.
Fig. 19. Classification accuracy of CIFAR-10 for an 8-bit CNN (1-bit, 2-bit and 4-bit cells) as a function of the ADC precision for partial sums; the baseline accuracy without ADC truncation is 90.18%.
As Fig. 22 shows, to represent the weights from the algorithm (floating-point) on the CIM architecture, given the limited precision of synaptic devices, one practical way is to normalize the weights to decimal integers and then digitize the integers into conductance levels. For example, as shown in Fig. 22, if we define the synaptic weight precision to be 4-bit (decimal integers 0 to 15), represented by 2-bit (conductance levels 0 to 3) synaptic devices, the floating-point weight “+0.8906” from the algorithm will be normalized to 15 and thus mapped to two synaptic devices, one as LSB and one as MSB, each at conductance level 3 (i.e. 15/4 = 3, 15%4 = 3).

Fig. 22. Mapping weight from algorithm to synaptic device conductance in CIM architecture.
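A minimal Python sketch of this integer-to-conductance-level mapping (map_weight is a hypothetical helper for illustration, not a function in the released code):

    def map_weight(w_int, cell_bits, num_cells):
        # Split a non-negative integer weight into per-device conductance levels,
        # LSB device first, using base-2^cell_bits digits.
        base = 2 ** cell_bits
        levels = []
        for _ in range(num_cells):
            levels.append(w_int % base)   # this device's conductance level
            w_int //= base
        return levels

    assert map_weight(15, cell_bits=2, num_cells=2) == [3, 3]   # the example above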
1) Conductance On/Off Ratio
Ideally, the conductance levels of synaptic devices range from 0 to 2^N, where N is the precision of the synaptic devices. However, the minimum conductance can be regarded as 0 only if the conductance on/off ratio (= maximum conductance / minimum conductance) of the synaptic devices were infinite, which is not feasible with current technology.
Thus, in reality, the minimum conductance level is not an ideal “0”. For example, consider a normalized synaptic device conductance range of 0~1 (i.e. 0~2^N scaled by 2^N), where “1” is represented by the maximum conductance and “0” by the minimum conductance. From the algorithm's perspective, the conductance level “1” represents an ideal “1”, while the conductance level “0” actually represents the non-zero value 1/(on/off ratio). In this case, a small on/off ratio introduces such non-ideal zeros into the calculation and significantly distorts the inference accuracy.
One approach to remedy this is to eliminate the effect of the OFF-state current in every weight element with the aid of a dummy column. In this framework, as Fig. 23 shows, we map the algorithm weights (range [-1, +1]) to synaptic devices (conductance range [Gmin, Gmax]) in the synaptic arrays, and place a group of dummy columns beside each synaptic array, whose devices are set to the middle conductance (Gmin+Gmax)/2. By subtracting the dummy outputs from the real outputs, the truncated conductance range becomes [-(Gmax-Gmin)/2, +(Gmax-Gmin)/2], which is zero-centered like [-1, +1], and the off-state current effects are perfectly removed.
The conductance on/off ratio is defined as the argument “args.onoffratio” in the “inference.py” file.
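A minimal Python sketch of the dummy-column correction (illustrative, not the wrapper code); subtracting the dummy output yields a zero-centered weighted sum independent of Gmin:

    import numpy as np

    def weighted_sum_with_dummy(x, W, on_off_ratio):
        # x: input vector; W: algorithm weights in [-1, +1].
        g_min, g_max = 1.0 / on_off_ratio, 1.0      # normalized conductance range
        G = g_min + (W + 1) / 2 * (g_max - g_min)   # map [-1, +1] -> [Gmin, Gmax]
        dummy = (g_min + g_max) / 2                 # dummy devices at mid conductance
        return x @ G - np.sum(x) * dummy            # off-state current cancels out

    x = np.random.rand(8)
    W = np.random.uniform(-1, 1, (8, 4))
    out = weighted_sum_with_dummy(x, W, on_off_ratio=10)
    assert np.allclose(out, (x @ W) * (1.0 - 1.0 / 10) / 2)  # scaled x @ W, Gmin-free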
Fig. 23. Introduce dummy column to cancel out the off-state current effects.
2) Conductance Variation
It is well known that synaptic devices involving drift and diffusion of ions/vacancies show considerable variation from device to device, and even from pulse to pulse within one device. Thus, in an inference chip, although the weight-update operation is not required, conductance variation is still a concern during the initialization or programming of the synaptic arrays.
In this framework, the conductance variation is introduced as a percentage of the desired conductance: for example, if the desired conductance is 0.5, a +0.1 conductance variation gives an actual conductance of 0.55; similarly, a -0.2 conductance variation gives 0.4.
Fig. 24. Different scenarios of conductance drift.
For a chip, the conductance variation can differ from array to array and from device to device, so we model such variation with a random generator that produces the conductance variation of different cells; the standard deviation of this random generator is the argument “args.vari” in the “inference.py” file.
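A minimal Python sketch of this variation model (illustrative; the released implementation lives in the PyTorch wrapper):

    import numpy as np

    def apply_variation(conductance, sigma, rng=np.random.default_rng()):
        # Variation as a fraction of the desired conductance (sigma plays the role
        # of args.vari): desired 0.5 with a +0.1 sample -> 0.55, with -0.2 -> 0.4.
        return conductance * (1.0 + rng.normal(0.0, sigma, size=np.shape(conductance)))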
3) Retention
Retention refers to the ability of a memory device to retain its programmed state over a long period of time. The typical retention specification for NVM in memory applications is more than 10 years at 85°C. Many binary eNVM devices have been able to meet this requirement. However, there are no reported data for analog eNVM showing such retention, which can be attributed to the instability of the intermediate conductance states.
To be general, we consider four scenarios of conductance drift for the retention analysis, as shown in Fig. 24, where the conductance can drift toward its maximum, minimum or an intermediate state, or just drift randomly. The conductance drift behavior is assumed to follow the formula

$$G = G_0 \left( \frac{t}{t_0} \right)^{v}$$

where G0 is the initial conductance, t is the retention time, v is the drift coefficient and t0 is the time constant, which is assumed to be 1 second in this framework.
To estimate the retention effect on the inference accuracy, we define a function called “Retention” in the file “wage_quantizer.py”, where the retention time and drift coefficient are defined as “args.t” and “args.v” respectively, while “args.detect” defines the drift scenario: if “args.detect” is 1, the conductance drifts to a fixed value (otherwise the drift is random), and the target value is then defined by “args.target”, whose range is 0 to 1. These arguments can be set in the file “inference.py”.
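A minimal Python sketch of such a retention model (illustrative; the released function is “Retention” in wage_quantizer.py, and the random-sign handling of the drift scenarios below is only one possible reading):

    import numpy as np

    def retention(G0, t, v, detect=0, target=0.5, t0=1.0, rng=np.random.default_rng()):
        # Conductance drift G = G0 * (t / t0)^v, with t0 = 1 s in this framework.
        G0 = np.asarray(G0, dtype=float)
        if detect == 1:
            sign = np.sign(target - G0)                    # drift toward args.target
        else:
            sign = rng.choice([-1.0, 1.0], size=G0.shape)  # random drift direction
        return G0 * (t / t0) ** (sign * v)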
4) ADC Quantization Effects
For the CIM architecture, there are mainly two read-out schemes. A sequential processing method for the matrix-vector multiplication reads out the dot-products in a row-by-row manner, which incurs extra energy and latency for accumulation along the rows. A more efficient method is parallel processing, where multiple rows are activated simultaneously by a switch matrix and the current summation is read out by an ADC. The row-by-row accumulation periphery of the sequential scheme is thereby eliminated.
Fig. 25. Linear and non-linear ADC quantization.
However, since it is impractical to use a very high-precision ADC at the edge of the eNVM sub-arrays, we have to truncate the ADC precision (for partial sums) to minimize the area and energy overhead.
To minimize the ADC precision while guaranteeing the inference accuracy, it is necessary to simulate the ADC quantization before hardware design. In this framework, we support two quantization methods: linear and non-linear quantization. As Fig. 25 shows, in linear quantization, the ADC references are distributed linearly across the possible partial-sum value range of the synaptic array; in non-linear quantization, the ADC references are distributed non-linearly according to the distribution of the partial sums, with more references in high-probability regions and fewer in low-probability ones.
Normally, non-linear quantization can save ~1 bit of ADC precision compared with linear quantization; however, the choice of non-linear references and quantized outputs is quite sensitive, so detecting the partial-sum distribution is necessary. In this framework, we define two functions called “NonLinearQuantizeOut” and “LinearQuantizeOut” in the file “wage_quantizer.py”, while users can define “args.ADCprecision” in the file “inference.py”.
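Minimal Python sketches of the two schemes (illustrative; the released versions are “LinearQuantizeOut” and “NonLinearQuantizeOut” in wage_quantizer.py):

    import numpy as np

    def linear_quantize(psum, n_bits, lo, hi):
        # References spread evenly across the possible partial-sum range [lo, hi].
        levels = 2 ** n_bits - 1
        code = np.clip(np.round((psum - lo) / (hi - lo) * levels), 0, levels)
        return lo + code / levels * (hi - lo)

    def nonlinear_quantize(psum, n_bits, samples):
        # References follow the observed partial-sum distribution: quantiles of
        # 'samples' serve as edges, so high-probability regions get more references.
        edges = np.quantile(samples, np.linspace(0, 1, 2 ** n_bits + 1))
        centers = (edges[:-1] + edges[1:]) / 2
        idx = np.clip(np.digitize(psum, edges[1:-1]), 0, 2 ** n_bits - 1)
        return centers[idx]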
Table II. Structure of the default VGG-8 network (as defined in NetWork.csv)

          IFM     IFM     IFM Channel   Kernel   Kernel   Kernel   Followed by
          Length  Width   Depth         Length   Width    Depth    pooling or not?
Layer 1   32      32      3             3        3        128      0
Layer 2   32      32      128           3        3        128      1
Layer 3   16      16      128           3        3        256      0
Layer 4   16      16      256           3        3        256      1
Layer 5   8       8       256           3        3        512      0
Layer 6   8       8       512           3        3        512      1
Layer 7   1       1       8192          1        1        1024     0
Layer 8   1       1       1024          1        1        10       0
In the default VGG-8 network, layers 1 to 6 are convolutional layers, and layers 7 and 8 are fully-connected layers. In Table II, the dimensions of each layer are defined in separate rows, from layer 1 to layer 8 (row 1 to row 8), while the first three columns (column 1 to column 3) define the dimensions of the input feature map (IFM) of each layer. For example, the input image size of layer 1 is 32×32×3; thus, in the first row, the first three cells are filled with 32, 32 and 3 respectively, indicating the length, width and depth of the IFM. The next three columns (column 4 to column 6) define the dimensions of the kernels. For example, the kernel size of layer 3 is 3×3×128×256 (i.e. each single 3D kernel is 3×3×128, and the kernel depth is 256); since the third dimension of a kernel is determined by the IFM channel depth, it is not necessary to define it again, so in row 3, the fourth, fifth and sixth cells are filled with 3, 3 and 256, representing the kernel length, width and depth (the first, second and fourth dimensions of the kernel) respectively. Note that a fully-connected layer can be represented in the same way by treating it as a special convolutional layer with unit length and width for the IFM and kernels. The last column defines whether the current layer is followed by pooling; it is read by NeuroSim to properly estimate the hardware performance of the pooling function. In this framework, the activation function is considered to be integrated in every layer.
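For example, following the row format of Table II, the default VGG-8 definition in NetWork.csv would contain one comma-separated row per layer (a sketch; consult the released file for the exact format):

    32,32,3,3,3,128,0
    32,32,128,3,3,128,1
    16,16,128,3,3,256,0
    16,16,256,3,3,256,1
    8,8,256,3,3,512,0
    8,8,512,3,3,512,1
    1,1,8192,1,1,1024,0
    1,1,1024,1,1,10,0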
2) Modify the hardware parameters in Param.cpp
After setting up the network structure, users need to define the hardware parameters in Param.cpp. In this file, users can define parameters such as the technology node (technode), device type (memcelltype: SRAM, eNVM or FeFET), operation mode (operationmode: parallel or sequential analog), synaptic sub-array size (numRowSubArray, numColSubArray), synaptic device precision (cellBit), mapping method (conventional or novel), activation type (sigmoid or ReLU), cell height/width in feature size (F), clock frequency and so on. Some recommended device parameters are listed below:
Table III Scaling trend of SRAM cell area with technology nodes (assuming F is the same as the
technology node)
After modifying the NetWork.csv and Param.cpp files, or whenever any change is made to these files, the code has to be recompiled using the make command, as stated in the Installation and Usage (Linux) section. If the compilation is successful, a screenshot like Fig. 26 can be expected.
4) Run the program with the PyTorch wrapper
After compiling NeuroSim, go back to the PyTorch wrapper. The wrapper provides three networks (VGG-8 for the CIFAR-10 dataset, DenseNet-40 for the CIFAR-10 dataset, and ResNet-18 for the ImageNet dataset) as defaults; users can modify the network structures and run the simulator accordingly.
Instructions to run the wrapper:
PyTorch (https://pytorch.org/)
o The bitwidth can be set using an optional parameter
o Train
  python train.py
  The model will be saved in hierarchical folders based on the option values.
o Inference
  python inference.py
  Set model_path to the saved model *.pth file
[5]. W. Li, S. Huang, X. Sun, H. Jiang, S. Yu, "Secure-RRAM: A 40nm 16kb compute-in-memory macro with reconfigurability, sparsity control, and embedded security," IEEE Custom Integrated Circuits Conference (CICC), 2021.
[6]. Predictive Technology Model (PTM). Available at http://ptm.asu.edu/
[7]. N. E. Weste and D. Harris, “CMOS VLSI Design – A Circuit and Systems Perspective, 4th edition,”
2007.
[8]. X. Peng, R. Liu and S. Yu, "Optimizing weight mapping and data flow for convolutional neural
networks on RRAM based processing-in-memory architecture," IEEE International Symposium on
Circuits and Systems (ISCAS), 2019.
[9]. P.-Y. Chen, et al., "Technology-design co-optimization of resistive cross-point array for accelerating
learning algorithms on chip," ACM/IEEE Design, Automation & Test in Europe Conference &
Exhibition (DATE), 2015.
[10]. W. Khwa et al., "A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with
2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors," IEEE
International Solid State Circuits Conference (ISSCC), 2018.
[11]. M. Jerry, et al., "Ferroelectric FET analog synapse for acceleration of deep neural network
training," IEEE International Electron Devices Meeting (IEDM), 2017.
[12]. X. Sun, S. Yin, X. Peng, R. Liu, J.-S. Seo, S. Yu, "XNOR-RRAM: A scalable and parallel resistive
synaptic architecture for binary neural networks," ACM/IEEE Design, Automation & Test in Europe
Conference (DATE), 2018.
[13]. S. Wu, et al. "Training and inference with integers in deep neural networks," arXiv: 1802.04680,
2018.
[14]. github.com/boluoweifenda/WAGE
[15]. github.com/stevenygd/WAGE.pytorch
[16]. github.com/aaron-xichen/pytorch-playground
[17]. P. Jain, U. Arslan, M. Sekhar, B.C. Lin, L. Wei, T. Sahu, J. Alzate-Vinasco, A. Vangapaty, M. Meterelliyoz, N. Strutt, A.B. Chen, "A 3.6 Mb 10.1 Mb/mm² Embedded Non-Volatile ReRAM Macro in 22nm FinFET Technology with Adaptive Forming/Set/Reset Schemes Yielding Down to 0.5 V with Sensing Time of 5ns at 0.7 V," IEEE International Solid-State Circuits Conference (ISSCC), 2019.
[18]. W. He, S. Yin, Y. Kim, X. Sun, J.J. Kim, S. Yu and J.S. Seo, "2-Bit-per-Cell RRAM based In-
Memory Computing for Area-/Energy-Efficient Deep Learning," IEEE Solid-State Circuits Letters, vol.
3, pp. 194-197, 2020.
[19]. W. Wu, H. Wu, B. Gao, P. Yao, X. Zhang, X. Peng, S. Yu, H. Qian, "A methodology to improve
linearity of analog RRAM for neuromorphic computing," IEEE Symposium on VLSI Technology
(VLSI), 2018.
[20]. W. Kim, R.L. Bruce, T. Masuda, G.W. Fraczak, N. Gong, P. Adusumilli, S. Ambrogio, H. Tsai, J.
Bruley, J.P. Han, M. Longstreet, "Confined PCM-based analog synaptic devices offering low
resistance-drift and 1000 programmable states for deep learning," IEEE Symposium on VLSI
Technology, 2019.
[21]. K. Ni, B. Grisafe, W. Chakraborty, A.K. Saha, S. Dutta, M. Jerry, J.A. Smith, S. Gupta, S. Datta,
"In-memory computing primitive for sensor data fusion in 28 nm HKMG FeFET technology," IEEE
International Electron Devices Meeting (IEDM), 2018.
[22]. L. Wei, J.G. Alzate, U. Arslan, et al., "A 7Mb STT-MRAM in 22FFL FinFET technology with 4ns read sensing time at 0.9 V using write-verify-write scheme and offset-cancellation sensing technique," IEEE International Solid-State Circuits Conference (ISSCC), 2019.