Received 8 March 2023; accepted 3 April 2023. Date of publication 17 April 2023; date of current version 3 November 2023.

The review of this article was arranged by Editor K. Ota.


Digital Object Identifier 10.1109/JEDS.2023.3265875

In-Memory Computing for Machine Learning and Deep Learning
N. LEPRI (Graduate Student Member, IEEE), A. GLUKHOV (Graduate Student Member, IEEE),
L. CATTANEO (Graduate Student Member, IEEE), M. FARRONATO (Graduate Student Member, IEEE),
P. MANNOCCI (Graduate Student Member, IEEE), AND D. IELMINI (Fellow, IEEE)
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano and IU.NET, 20133 Milan, Italy

CORRESPONDING AUTHOR: D. IELMINI (e-mail: [email protected])

This work was supported by the EU's Horizon Europe Research and Innovation Programme under Grant 101070679.

© 2023 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

ABSTRACT In-memory computing (IMC) aims at executing numerical operations via physical processes, such as current summation and charge collection, thus accelerating common computing tasks including the matrix-vector multiplication. While extremely promising for memory-intensive processing such as machine learning and deep learning, the IMC design and realization must face significant challenges due to device and circuit nonidealities. This work provides an overview of the research trends and options for IMC-based implementations of deep learning accelerators with emerging memory technologies. The device technologies, the computing primitives, and the digital/analog/mixed design approaches are presented. Finally, the major device issues and metrics for IMC are discussed and benchmarked.

INDEX TERMS In-memory computing, deep learning, deep neural network, emerging memory technologies, matrix-vector multiplication.

I. INTRODUCTION
Today, artificial intelligence and its enabling technology, deep neural networks (DNNs), have become widely popular in various applications such as image recognition, autonomous vehicles, speech recognition, and natural language processing. In the last five years, state-of-the-art deep neural network models have increased their number of parameters by about four orders of magnitude, leading to a significant increase in computational and memory requirements for both the training and the inference operations [1], [2], [3], [4], [5], [6]. Traditional computing systems (Fig. 1a) typically store massive information in a memory unit that is physically connected to the computational unit by a data bus. The continuous data movement between the processing and the memory units represents the main bottleneck due to the limited bandwidth, long latency, sequential data processing, and high energy consumption [7], [8].

To minimize the latency and energy overhead of conventional von Neumann computers, in-memory computing (IMC) aims at performing the computation in close proximity to the memory, or even in situ within the memory itself [9], [10]. The range of operations that can be executed within memory devices includes stateful logic [11], [12], pulse integration [13], [14], associative memory [15], [16], and stochastic computing [17]. The most popular and enabling IMC operation is, however, matrix-vector multiplication (MVM) via Ohm's and Kirchhoff's laws in a memory array [18], [19]. IMC has thus been largely targeted for hardware accelerators of DNNs, where MVM is by far the most intensive workload. The ability to execute an MVM in a single operation by activating all rows and all columns in parallel represents a key benefit of IMC that is unrivaled by other technologies. Despite the simplicity of the MVM concept and the potential advantages of IMC, the design options and the interaction between circuit operation and device nonidealities still represent a key open challenge.

This work provides an overview of IMC for DNN acceleration from the perspectives of device technology, circuit design, device-circuit interaction, and its impact on computing accuracy. Section II illustrates the emerging nonvolatile memory technologies that are currently considered for IMC. Section III presents an overview of various IMC circuit topologies for performing matrix-vector multiplication and their possible applications. Among these applications, the most promising one is the IMC acceleration of DNN inference, discussed in Section IV. Hence, Section V illustrates the most critical device nonidealities affecting the accuracy of IMC circuits. Section VI provides an overview of the open challenges for the research field, while Section VII concludes the work.

II. COMPUTATIONAL MEMORY TECHNOLOGIES
The main benefit of IMC is the improved energy efficiency thanks to the reduction or suppression of data movement. A first option to mitigate data movement is to bring the main memory core directly on the chip via high-density embedded DRAM [20] or embedded nonvolatile memory (NVM). This approach, called near-memory computing and depicted in Fig. 1b, allows the storage of even megabytes of model parameters, such as synaptic weights and activations, in close proximity to the processing unit. A second option [21], [22], [23] is true IMC, where computation is executed directly within the SRAM array, as shown in Fig. 1c. A key limitation of this option is the volatile nature of SRAM and its relatively low density compared to DRAM and emerging NVM: each SRAM cell consists of at least six transistors, and the bit value remains stored only until the power supply is switched off. To overcome these limitations, the third option embraces emerging NVM devices for both nonvolatile storage of computational parameters and in situ MVM acceleration (Fig. 1d).

FIGURE 1. Several examples of CPU-memory integration. (a) Von Neumann architecture, in which CPU and memory are separated and connected through a high-bandwidth bus. (b) Near-memory computing, which features the embedding of a nonvolatile memory on the same silicon as the CPU, for increased bandwidth and reduced data transfers. (c) SRAM-based in-memory computing, in which the computation is performed directly in the SRAM memory array. (d) eNVM-based in-memory computing, which features the integration of a high-density memory allowing both parameter storage and calculation.

Here, we will focus on emerging NVM technologies that are suitable for the IMC concept of Fig. 1d. In general, these devices have three major advantages, namely (i) nonvolatile storage, which allows for the persistence of synaptic weights even when the supply is disconnected; (ii) integration in the back-end of line (BEOL), which allows compatibility of the NVM process irrespective of the details of the front-end technology; and (iii) high density compared to SRAM. The major NVM technologies for IMC applications are sketched in Fig. 2.

The resistive-switching random access memory (RRAM in Fig. 2a) consists of a metal-insulator-metal (MIM) stack, where the insulator serves as the active switching material [24]. The memory operation relies on the activation and deactivation of a conductive filament across the switching layer [25]. RRAM generally displays binary states, referred to as the low resistance state (LRS) and the high resistance state (HRS) [26]. However, RRAM can also display multilevel operation [27], where the conductance can be tuned in the analog domain [28]. RRAM devices can be easily integrated into crosspoint arrays [25] and scaled down to 22nm CMOS technology [29].

The phase change memory (PCM in Fig. 2b) relies on the ability to electrically change the crystalline/amorphous phase of an active chalcogenide material, where the resistance correspondingly changes by at least two orders of magnitude [68]. The most typical material is Ge₂Sb₂Te₅ (GST) [69], although Ge-rich alloys are adopted for high-temperature retention in embedded solutions [70]. The phase change is induced by Joule heating via the application of voltage pulses. If the local temperature exceeds the melting temperature, the resulting phase is amorphous, corresponding to the HRS. If instead the local temperature stays below the melting temperature for a sufficient time, the structure stabilizes to crystalline, corresponding to the LRS [71]. Thanks to the relatively mature technology, these devices have been extensively used for IMC demonstrators [72].

The ferroelectric random access memory (FeRAM in Fig. 2c) consists of a metal-ferroelectric-metal (MFM) structure, where the ferroelectric layer exhibits a permanent and switchable electrical polarization [73]. FeRAM has received renewed interest after the discovery of ferroelectric hafnium oxide HfO₂ with orthorhombic structure [74]. A key issue with FeRAM is its destructive readout operation, due to reading being performed above the coercive field. This limitation is overcome by the ferroelectric tunnel junction (FTJ), where different polarization states show different resistances even at low voltages [75].

The spin-transfer torque magnetic random access memory (STT-MRAM in Fig. 2d) consists of a MIM stack where the top and bottom metals are ferromagnetic (FM) metals, such as Fe, Co, Ni, and their alloys. The MIM displays a magneto-tunnel junction (MTJ) effect, where different orientations of the magnetic polarization in the two FM layers, namely a parallel (P) or antiparallel (AP) state, result in an LRS or HRS, respectively [76]. STT-MRAMs feature fast switching and good cycling endurance [77], despite suffering from a relatively small resistance window and difficult multilevel operation, which limits the use of STT-MRAM to binarized neural networks.
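The multilevel tuning mentioned above for RRAM (and exploited by PCM as well) is typically reached with iterative program-and-verify loops. The following Python sketch is only a toy model of such a loop, with an invented pulse-response and tolerance, meant to illustrate the idea of pulsing until the conductance verifies within a target window:

```python
import numpy as np

rng = np.random.default_rng(0)

def program_verify(g_target, g_start=1e-6, tol=0.02, max_pulses=100):
    """Pulse a simulated cell until its conductance verifies within
    `tol` (relative) of g_target. The proportional update with 20%
    cycle-to-cycle spread is a toy stand-in for real device kinetics."""
    g = g_start
    for pulse in range(max_pulses):
        if abs(g - g_target) / g_target < tol:
            return g, pulse                                # verify passed
        g += 0.3 * (g_target - g) * rng.normal(1.0, 0.2)   # one SET/RESET pulse
    return g, max_pulses

# tune a cell to each of four conductance levels (2 bits per cell)
for target in (25e-6, 50e-6, 75e-6, 100e-6):               # siemens, illustrative
    g, n = program_verify(target)
    print(f"target {target * 1e6:5.1f} uS -> {g * 1e6:5.1f} uS after {n} pulses")
```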

FIGURE 2. Graphic representation of the main emerging memory devices. (a) Resistive random access memory (RRAM). (b) Phase change memory (PCM).
(c) Ferroelectric random access memory (FeRAM). (d) Spin-transfer torque magnetic random access memory (STT-MRAM). (e) Ferroelectric field-effect
transistor (FeFET). (f) Spin-orbit torque magnetic random access memory (SOT-MRAM). (g) Electrochemical random access memory (ECRAM).
(h) Memtransistor device.

Devices in Fig. 2a-d have a two-terminal structure, which makes them suitable for high-density crosspoint architectures [10]. In many cases, two-terminal devices are connected to an access transistor, resulting in a one-transistor/one-resistor (1T1R) structure with improved control of the device current during programming and readout. Alternatively, three-terminal devices have been proposed. The ferroelectric field-effect transistor (FeFET in Fig. 2e) consists of a field-effect transistor in which the gate stack contains a ferroelectric layer [78]. The ferroelectric polarization is reflected by the threshold voltage VT of the device, resulting in a memory effect similar to floating-gate devices. FeFET arrays with ferroelectric HfO₂ have been recently demonstrated [35], [79].

The spin-orbit torque magnetic random access memory (SOT-MRAM in Fig. 2f) consists of a magnetic tunnel junction (MTJ) structure deposited on top of a line of heavy metal, such as Pt or W [80]. The MTJ is programmed in a P/AP state by a current flowing across the heavy-metal line via spin-orbit coupling. The cell is read by sensing the MTJ resistance, as in the STT-MRAM. The three-terminal structure allows the separation of the programming and the reading paths, improving the cycling endurance and the write speed [81].

The electrochemical random access memory (ECRAM in Fig. 2g) consists of a transistor device where the conductivity of the channel is modified in a nonvolatile way, and can be reversed, by injecting ionized dopants across an electrolyte layer [82]. ECRAM generally shows high endurance and extremely low power consumption thanks to the low-mobility channel, for instance, WO₃ [83]. ECRAM also exhibits a controllable, linear weight update that is suitable for training accelerators [82], [84].

The memtransistor (Fig. 2h) consists of a transistor device with a 2D semiconductor material for the channel layer [85], [86], [87]. The memory behavior can be obtained by migration of dislocations in polycrystalline MoS₂ [88], lateral migration of Ag across the source/drain electrodes [85], or charge trapping [89]. In some cases, MoS₂ memtransistors display gradual weight-update characteristics that are useful for reservoir computing [89] and training accelerators [90].

A. COMPARISON OF NVM TECHNOLOGIES
In order to summarize and provide some quantitative information, Table 1 shows a comparison between the main emerging memories and the charge-based CMOS memories [91]. Fig. 3a shows a correlation plot of speed, evaluated as the inverse of the read time, and density, evaluated as the inverse of the cell area. Data from the literature are compared to the typical ranges for CMOS-based conventional memory technologies, such as SRAM, DRAM, and NAND Flash. The performance/cost of emerging NVM is usually intermediate between CMOS memories, where speed approaches DRAM whereas density is still generally between SRAM and DRAM.

Fig. 3b shows the array size as a function of the technology node for various NVM demonstrators. The capacity spans the whole range from embedded memory (1-100 MB) to standalone memory (1-100 GB). Note that smaller technology nodes do not necessarily lead to higher array capacity, which is due to the different maturity levels of the technologies. Fig. 3c shows the memory capacity of some NVM demonstrators as a function of the year, highlighting the continuous development of various memory technologies.

III. IN-MEMORY MATRIX-VECTOR MULTIPLICATION
Most IMC implementations aim at accelerating matrix-vector multiplication (MVM), which is by far the most essential computing primitive in deep learning and machine learning [92].

TABLE 1. Indicative performances and characteristics of different semiconductor memory technologies.

FIGURE 3. Performances and characteristics of various emerging memory demonstrators. (a) Memory speed (expressed as the inverse of the read time)
as a function of the device miniaturization (expressed as the inverse of the cell size) [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41],
[42], [43], [44], [45], [46], [47], [48], [49], [50]. (b) Memory capacity as a function of the technology node [29], [30], [31], [32], [33], [34], [35], [36], [37], [38],
[39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67].
(c) Memory array capacity over the years [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50],
[51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67].

Fig. 4 shows a sketch of the MVM concept implemented in a crosspoint memory array. The applied voltage signals generate currents across each resistive element that are given by the voltage-conductance multiplication according to Ohm's law. Currents reaching a grounded common row are the summation of the individual cell currents according to Kirchhoff's current law (KCL). The output current Ii at the ith row is thus given by:

Ii = Σj=1,…,N Gi,j · Vj ,    (1)

where Gi,j is the conductance of the memory device at position i, j, Vj is the voltage applied at the jth column, and N is the number of columns and rows [10], [93].

FIGURE 4. Crosspoint memory array based on resistive memories can perform matrix-vector multiplication directly in situ, by means of Ohm's law and Kirchhoff's current law. By applying a voltage vector at the columns, the analog conductive elements produce a current that is collected at the rows, conveniently biased at 0 V. The resulting output current vector is the multiplication of the conductance matrix G with the voltage vector V.

MVM can thus be carried out by physical laws, in situ, without modifying or moving the stored parameters [10]. Most importantly, thanks to the inherent parallelism of the array, the MVM computation is virtually performed in one step independently of the size of the matrix, thus achieving an outstanding time complexity of O(1). Note that the memory array is typically compatible with the BEOL process, allowing for 3D stacking and a memory density scalable down to 4F²/N, where N is the number of stacked layers and F is the feature size of the lithographic process.

Depending on the required specifications and the memory devices, various IMC implementations of MVM accelerators are possible. Fig. 5a shows the resistive crosspoint array, similar to Fig. 4, where device conductances can be programmed in the binary [94], [95] or multilevel domain [96], [97]. Steady-state currents collected at the grounded rows are generally acquired by a readout chain consisting of a transimpedance amplifier (TIA) and an analog-to-digital converter (ADC) [98]. A major limitation of this architecture is the programming operation, where voltages and currents might be difficult to control [99]. In particular, when applying various programming schemes [100], [101], a certain number of half-selected cells experience a non-negligible leakage current.
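Numerically, the physical MVM of Eq. (1) is a plain matrix-vector product between the conductance matrix and the applied voltage vector. The following NumPy sketch (arbitrary sizes and units) makes the correspondence explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64                                   # rows = columns, for simplicity
G = rng.uniform(10e-6, 100e-6, (N, N))   # cell conductances G_ij (siemens)
V = rng.uniform(0.0, 0.2, N)             # read voltages V_j on the columns

# Eq. (1): each row current is the KCL sum of Ohm's-law products
I = G @ V                                # I_i = sum_j G_ij * V_j

# element-wise check against the explicit double loop
I_loop = np.array([sum(G[i, j] * V[j] for j in range(N)) for i in range(N)])
assert np.allclose(I, I_loop)
print(I[:4])                             # first row currents, in amperes
```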

FIGURE 5. Various implementations of IMC crosspoint accelerators of MVM. (a) Resistive crosspoint array (1R). (b) Array with one-transistor/one-resistor (1T1R) configuration. The transistor prevents sneak-path currents during the programming phase and allows finer current control. (c) Array with one-selector/one-resistor (1S1R) configuration. The highly non-linear selector prevents sneak-path currents while maintaining the cell footprint. (d) Capacitive crosspoint array, composed of memory elements whose small-signal capacitance can be programmed. (e) Temporal encoding of the input vector through gate voltage pulses whose widths represent the input signals. Integration is required to collect the transient currents. (f) MVM through resistance summation. An XNOR-multiply is performed by the 2T2R cell, which activates the path corresponding to the multiplication result. The series of the resistive paths inherently performs the accumulation.

To address these programming issues, an access device is normally added in series to the resistive element. Fig. 5b shows the 1T1R configuration, which allows finer control of the program/read current, at the cost of a larger cell footprint and of an additional line for the transistor gate terminal [102], [103]. Fig. 5c shows the one-selector/one-resistor (1S1R) configuration [104], [105]. A selector is a non-linear element capable of suppressing the leakage (also called sneak-path) currents of half-selected cells during the programming phase, while maintaining a small cell footprint and a compact two-terminal configuration [106], [107].

Fig. 5d illustrates a crosspoint array based on capacitive memory elements, whose small-signal capacitance can be programmed. In this configuration, the MVM computation is typically carried out in two distinct phases. First, the capacitors are pre-charged by applying a voltage proportional to the input vector. Then, the capacitors are discharged by switches placed at the end of columns and rows, while the accumulated charges are collected by analog integrators [108]. In this case, multiplication is carried out by the characteristic law of the capacitance, namely Qi,j = Ci,j · Vj, where Ci,j serves as the weight and Vj is the applied input/activation.

The input signals can generally be encoded either in the voltage amplitude, through amplitude encoding, or in the pulse width, through temporal encoding. The latter approach, shown in Fig. 5e, is typically implemented in 1T1R arrays, where the memory elements are subject to a fixed voltage VREAD while the input signals are applied to the transistor gates. By integrating the transient currents on a capacitance or through the adoption of analog integrators, the resulting voltage output will be proportional to the MVM result [72].

Kirchhoff's voltage law (KVL) can be used instead of KCL for accumulation [109]. This is shown in Fig. 5f, where the adoption of a 2T2R cell configuration enables a binary XNOR multiplication between the input voltage and the conductance. The multiplication activates only one of the two paths, showing an LRS or an HRS depending on the result of the multiplication. By sensing the series resistance summation at each column, it is possible to collect the results of the MVM.

A. APPLICATIONS OF IMC MVM ACCELERATORS
Since MVM is ubiquitous in a variety of algorithms and workloads, IMC circuits that accelerate MVM have been demonstrated in several data-intensive computing tasks, as schematically depicted in Fig. 6.

Applications include image processing and image compression (Fig. 6a) via the discrete cosine transform (DCT). Here, image processing/compression can be achieved by applying the concept of MVM between a fixed DCT matrix and the pixel-intensity input vector, preserving only frequencies within a desired frequency band based on the compression ratio [19], [111].

In closed-loop IMC (CL-IMC), the MVM array core is connected in the feedback loop of an array of operational amplifiers (OAs), as shown in Fig. 6b [112]. This class of circuits allows the acceleration of a broad range of linear algebra operations, such as matrix inversion [113], eigenvector extraction [114], linear regression [115], and ridge regression [116], with a significant reduction in time complexity.

Combinatorial optimization (Fig. 6c) relies on the intrinsic noise of the memory elements and the peripheral circuit as an on-chip source of entropy to carry out a physical simulated annealing that escapes from local minima during the iterative search [117]. In these applications, MVM accelerators are typically used in recurrent architectures to map restricted Boltzmann machines (RBM) [13], [118], [119] or Hopfield neural networks [120], [121], [122]. Similarly, Bayesian neural networks (Fig. 6d) rely on the intrinsic variations of the programmed conductance to model the probability distributions of a Bayesian network [123].

The most popular applications for MVM remain DNN inference (Fig. 6e) and training (Fig. 6f). A key difference between these applications is that synaptic weights are obtained from ex situ software-based training in the case of inference accelerators, while they are trained in situ via iterative gradient-descent algorithms in the case of DNN training accelerators. Typically, a training accelerator is capable of performing inference via forward propagation, while also featuring an in situ weight-update scheme, generally via a vector-vector outer product within the crosspoint array [124]. The weight update requires linearity and symmetry of the conductance update under the application of a sequence of identical pulses, in line with the backpropagation algorithm. The best candidate materials to yield a linear weight update are ECRAM devices [125] and MoS₂-based charge-trap memory [89], [90].
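Numerically, the in situ weight update is a rank-1 outer-product accumulation. The sketch below applies it with a saturating, asymmetric conductance-update model (all coefficients are invented for illustration) to suggest why linearity and symmetry of the device update matter:

```python
import numpy as np

rng = np.random.default_rng(7)
n_out, n_in = 4, 8
G = rng.uniform(0.2, 0.8, (n_out, n_in))   # normalized conductances as weights

x = rng.uniform(0.0, 1.0, n_in)            # forward activations
err = rng.uniform(-0.5, 0.5, n_out)        # backpropagated error at this layer
eta = 0.1

dW = eta * np.outer(err, x)                # ideal rank-1 (outer product) update

# device update: potentiation and depression respond with different gains
# and saturate near the conductance bounds, distorting the ideal dW
alpha_p, alpha_d = 1.0, 0.7                # asymmetric update gains (toy values)
step = np.where(dW > 0, alpha_p * dW * (1 - G), alpha_d * dW * G)
G = np.clip(G + step, 0.0, 1.0)

print("ideal |dW| mean:", np.abs(dW).mean(), "applied |step| mean:", np.abs(step).mean())
```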

FIGURE 6. Examples of applications that benefit from IMC matrix-vector multiplication. Depending on the update-frequency requirements and the noise sensitivity of the application, each hardware solution should combine memory devices with specific physical properties with adequate peripheral circuits. For instance, applications that rely on one-time programming of weight values after an ex situ software-based training (e.g., DNN inference, CL-IMC, and DCT) can trade off the need for accurate tuning algorithms with less stringent requirements on the cycling endurance of the device itself. On the other hand, applications that demand frequent and continuous updates of the conductance matrix (e.g., DNN training) require efficient gradual programming and endurance capabilities of the adopted memory device. Image "Pillars of Creation" from James Webb Space Telescope gallery [110].
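As a software reference for the DCT-based compression of Fig. 6a, the sketch below builds the orthonormal DCT-II matrix, applies it as an MVM, and keeps only a low-frequency band; the 8-point size and the band split are arbitrary:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k, column j."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

n = 8
D = dct_matrix(n)                                  # fixed matrix stored in the array
x = np.random.default_rng(3).uniform(0, 255, n)    # one row of pixel intensities

y = D @ x                                          # MVM: pixels -> frequency domain
y[4:] = 0.0                                        # keep only the low-frequency band
x_rec = D.T @ y                                    # inverse transform (D is orthonormal)

print(np.round(x).astype(int))
print(np.round(x_rec).astype(int))                 # band-limited reconstruction
```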

FIGURE 7. The DNN inference workload mainly consists of MVM, which is basically a multiply-and-accumulate operation. Crosspoint accelerators of DNN inference can be classified depending on the way these two operations are performed. A fully digital approach relies on memory logic gates implementing an XNOR-multiply and on a counter for the accumulation. A mixed digital-analog approach performs an analog accumulation via Kirchhoff's current law (KCL). A fully analog approach relies on resistive elements that allow the encoding of multilevel weights and activations. Going from digital to analog, the parallelism and the information density of the accelerator increase, at the expense of more severe parasitic effects and more complex peripheral circuits. Further exploration of the fully analog approach is needed to unleash the potential of IMC for DNN inference acceleration.
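Several of the cells discussed in the following sections (e.g., the differential 2T2R structures) encode one signed weight as the difference between two conductances. A minimal NumPy sketch of this mapping, with an arbitrary full-scale conductance value, is given below:

```python
import numpy as np

G_MAX = 100e-6  # full-scale conductance (illustrative value, siemens)

def to_differential(w):
    """Map weights in [-1, 1] onto a (G_plus, G_minus) pair so that
    w is proportional to G_plus - G_minus; positive weights are placed
    on the '+' device, negative weights on the '-' device."""
    w = np.clip(w, -1.0, 1.0)
    g_plus = np.where(w > 0, w, 0.0) * G_MAX
    g_minus = np.where(w < 0, -w, 0.0) * G_MAX
    return g_plus, g_minus

w = np.array([0.5, -0.25, 1.0, -1.0])
gp, gm = to_differential(w)
assert np.allclose((gp - gm) / G_MAX, w)  # the difference recovers the weight
```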

IV. IN-MEMORY ACCELERATION OF DNN INFERENCE
The computational workload of a DNN mostly consists of MVM with variable input vectors and stationary weight matrices, which can be directly accelerated by a memory array. Depending on whether the multiply and accumulate operations are performed in the analog or in the digital domain, three different options can be identified for MVM accelerators, as depicted in Fig. 7.

A. FULLY DIGITAL CIRCUITS
The fully digital approach relies on memory logic gates to perform the multiplication, and on counters to perform the sequential accumulation.

To encode the binary alphabet of a binary neural network (BNN) [126], where activations and weights can be −1 or 1, the logic gates usually implement an XNOR operation, which allows mapping a (−1, 1) multiplication onto the classical (0, 1) binary domain [127], as schematically shown in Fig. 8.

FIGURE 8. Binary DNNs usually adopt a (−1, 1) alphabet for activations and weights. A multiplication in the (−1, 1) alphabet can be implemented in the classical (0, 1) binary domain through an XNOR logic gate, encoding −1 as 0.

Digital accelerators have been proposed with various nonvolatile emerging memories, such as RRAM [128], [129], [130], STT-MRAM [131], and FeFET [132]. The memory logic gate is generally based on a single 1T1R or a differential 2T2R cell.

Fig. 9 shows a building block based on the differential 2T2R, also displaying the XNOR gate and the sense amplifier (from bottom to top). The binary weight is stored as a resistive pair (HRS, LRS) or (LRS, HRS) in the 2T2R cell. For instance, to map a weight equal to 1, the memory element corresponding to B is programmed to LRS while its complement B̄ is programmed to HRS. The activation (input) signal A and its complement Ā connect the 2T2R cell to the sense amplifier in a straight path, for input A = 1, or a crossed path, for input A = 0. When the clock signal closes the conductive path to ground, the cross-coupled latch of the sense amplifier compares the resistive states of the memory elements and raises the voltage on one of the two output nodes while decreasing the other. For instance, assuming A = 1 and B = 1, the XNOR node potential increases while its complement decreases. The XNOR output is then digitally counted by a popcount operation. Thanks to the binary comparison of the two device resistances in the 2T2R structure, the memory cell is resilient to drift, noise, device variability, and temperature variations [128], [130].

FIGURE 9. Error-resilient implementation of an XNOR in a fully digital accelerator, based on a differential 2T2R RRAM cell. The activation signal A enables the connection of the cell to the sense amplifier in a straight or crossed path, allowing the sense amplifier to perform the comparison between the two resistive states. Adapted with permission from [128].

SRAM-based digital accelerators have also been demonstrated with various memory cells, ranging from six-transistor (6T) cells to twelve-transistor (12T) cells [21], [22], [23], [133]. While providing only volatile storage of weights, SRAM offers the advantage of fully-CMOS integration, which can be manufactured even in extremely scaled technology nodes, such as 5nm [133].

In general, the fully digital approach is exceptionally robust to various nonidealities, such as device variability, drift, noise, or IR drop, and it can have higher reconfigurability [134], [135], [136]. However, because of the accumulation through counting, the parallelism of the computation is limited to just one row at a time, thus limiting the available throughput.

B. MIXED DIGITAL-ANALOG CIRCUITS
In a mixed digital-analog circuit for DNN acceleration, accumulation is performed in the analog domain by KCL, thus avoiding the sequential counting of pulses, while multiplication remains implemented in the digital domain by an XNOR gate.

The mixed approach has been demonstrated either with emerging NVM, such as RRAM [137], [138], [139], [140], [141] or FeFET [142], [143], or with various SRAM cells, generally from 6T to 12T [23], [144], [145], [146], [147]. As in the fully digital approach, the XNOR gate can adopt the 1T1R cell [137], [140] or various differential techniques with NVM [139], [142], [148] or SRAM, whose output result is typically stored in a capacitor as a binary charge quantity [145], [149], [150].

Fig. 10 shows the computing core of a mixed digital-analog accelerator based on SRAM. The XNOR is implemented in an eight-transistor/one-capacitance (8T1C) cell, where the weight a and its complement ā are stored in the SRAM, while the activation x and its complement x̄ are applied at the PMOS transistors connected to the cell capacitance [145]. Assuming a = 1 and x = 0, the complementary node ā is shorted to the capacitance, setting the output voltage to 0 V. Accumulation is then performed through charge sharing of all cell capacitors on the shared bitline [145], [151], [152]. Alternatively, charge accumulation has been proposed by charge redistribution on weighted capacitances [150], [153], [154].

When the multiplication results are produced in the form of steady-state currents instead of charge, it is sufficient to collect them through a common node, exploiting KCL, and acquire the output current sums through a readout circuit [137], [138], [139], [140].
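The XNOR-and-popcount principle used by the digital approaches above can be checked numerically: encoding −1 as 0 turns the (−1, 1) product into an XNOR, and the dot product is recovered from the count of matching bits. A short sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 256
w_pm1 = rng.choice([-1, 1], n)        # binary weights in the (-1, 1) alphabet
x_pm1 = rng.choice([-1, 1], n)        # binary activations

# encode -1 as 0: the (-1, 1) product becomes an XNOR in the (0, 1) domain
w = (w_pm1 + 1) // 2
x = (x_pm1 + 1) // 2
xnor = 1 - (w ^ x)                    # 1 where bits agree (product = +1)

popcount = int(xnor.sum())            # digital accumulation by counting
dot = 2 * popcount - n                # rescale the count to the (-1, 1) dot product

assert dot == int(w_pm1 @ x_pm1)
```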
Depending on the BNN, the resulting current sum can also be directly compared to a reference current by means of a sense amplifier [138], thus implementing a threshold-type activation function. When adopting differential NVM cells, another proposed method to perform the accumulation consists in implementing a voltage divider composed of pull-up or pull-down resistances according to the XNOR results [141], [147], [148], and then acquiring the common-node voltages, which are proportional to the result of the MVM.

FIGURE 10. Implementation of an XNOR with an 8T1C SRAM cell in a mixed digital-analog accelerator. Analog accumulation is performed by charge sharing on a shared bitline. Reprinted with permission from [145].

Overall, a mixed digital-analog approach takes advantage of the inherent parallelism of IMC, virtually reaching a computational complexity of O(1). However, the analog accumulation requires more complex peripheral circuitry, often involving a bulky and energy-hungry readout chain, and is more sensitive to parasitic effects, such as IR drop and noise. Furthermore, when the multiplication relies on a single NVM, without conductance comparisons or error-resilient circuits, device variability and drift can also affect the computation.

C. FULLY ANALOG CIRCUITS
Fully analog circuits perform the accumulation by KCL and the multiplication by resistive memory elements via Ohm's law. The adoption of resistive memory elements limits possible implementations to NVM technologies only, since SRAM cells cannot provide ohmic behavior or work with analog voltages. NVM-based analog accelerators have been implemented with RRAM [94], [156], [157], [158], PCM [72], [159], [160], [161], STT-MRAM [103], [162], and FeFET devices [163].

Thanks to the multilevel operation, resistive memories are suitable for implementing non-binary weights in the same circuit footprint, thus enabling a higher area efficiency, defined as the number of performed operations per unit area. Indeed, memory elements can be programmed in binary [103], [163] or multilevel mode [102], [160]. Alternatively, multilevel weights are obtained through bit-slicing techniques [156], [157], [162], differential implementations, or more complex cell structures, allowing several conductive levels to be obtained [159]. Also, a hybrid binary-multilevel accelerator has been proposed to achieve the best trade-off between accuracy and area efficiency [161]. Alongside the increase in the number of conductive levels, memory cells can contain a variable number of elements, for instance, a 1T1R cell [95], [102], [164], a differential 2T2R cell [158], or higher-complexity cells such as 8T4R [159]. In addition to multilevel weights, analog accelerators typically feature multilevel or analog activation signals that can be modulated through amplitude [102], [159] or temporal encoding [72], [156], [165].

FIGURE 11. Possible implementations of the computing core in a fully analog approach based on the 1T1R configuration, relying on current and charge accumulation, respectively. Reprinted with permission from [155].

Fig. 11 shows two possible implementations of fully analog circuits, relying on either current or charge accumulation. Current-mode sensing requires applying a clamped voltage to the source lines, thus generating current contributions in each 1T1R cell that are collected and converted to a voltage by the current ADC. On the other hand, voltage-mode sensing consists of two separate phases. First, the multiplication results are stored in the source-line capacitances; then, they are accumulated into a sample capacitance by charge sharing. The voltage across the sample capacitance is finally collected by the ADC [155].

Fully analog accelerators can harness the full potential of IMC, thanks to the massive parallelism and the extremely high information density of multilevel weights and activations. On the other hand, accurate readout and conversion circuits are essential to fully benefit from these features, resulting in a significant overhead of area, power, and cost. Furthermore, analog computing is critically affected by parasitic effects at the device and circuit levels.
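The bit-slicing idea mentioned above can be sketched in a few lines: a multibit weight matrix is split into binary planes, each plane performs its own (here idealized) MVM, and the partial sums are recombined with powers of two. The sizes and precision below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
BITS = 4
W = rng.integers(0, 2**BITS, (16, 16))      # 4-bit unsigned weights
V = rng.uniform(0.0, 1.0, 16)               # input vector

# slice the weight matrix into BITS binary planes, one per array
slices = [(W >> b) & 1 for b in range(BITS)]

# each binary array performs its own analog MVM; the digital
# periphery recombines the partial sums with binary weights 2^b
y = sum((s @ V) * 2**b for b, s in enumerate(slices))

assert np.allclose(y, W @ V)                # matches the full-precision MVM
```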

V. MEMORY NONIDEALITY AND METRIC
Memory devices and circuits rely on physical, materials-based storage concepts that are never ideal. Fig. 12 summarizes the main nonideality features, namely IR drop (a), conductance variation (b), drift (c), and fluctuations (d).

FIGURE 12. Examples of memory nonidealities. (a) Parasitic resistance along wire connections responsible for the IR drop, (b) programming variability in multilevel programming, (c) conductance drift that affects the cells, and (d) Gaussian noise that is measured during the readout of a RRAM cell. Reprinted with permission from [166], [167], [168].

IR drop refers to the current-induced voltage drop across parasitic wire resistances along the rows and the columns of the memory arrays (Fig. 12a). Wire resistance is non-negligible in very scaled arrays because of the small cross-section of the metal lines. Furthermore, analog accumulation in parallel IMC requires several cells to be read at the same time, thus increasing the wire current and hence the IR drop. IR drop causes a modification of the effective cell voltage compared to the externally applied signal, thus resulting in a current error that is proportional to the average device conductance, to the wire resistance, and to the square of the array size [99], [169]. In practice, the error induced by IR drop is the main limitation to array-size up-scaling, thus preventing reaching the ideal computational complexity of O(1). Generally, IR drop is reduced by adopting low-conductance devices, differential cells [158], or small computing-tile architectures [169]. More elaborate techniques have been proposed at the architectural level [170], [171], the algorithmic level [169], [172], [173], and the training level [174].

Multilevel operation allows the improvement of area efficiency [28], [175], [176]. However, NVMs have limited precision in programming the conductance, for instance, due to size variations of the conductive filament in RRAM or of the crystalline grain size in PCM. The limited precision arises as a device-to-device (D2D) variability or a cycle-to-cycle (C2C) variability within the same device [177], [178]. D2D variability is shown in Fig. 12b, reporting a multilevel RRAM device with a non-negligible spread of the conductive states. Differently from the digital domain, where binary levels can be discriminated despite a possible spread, computing in the analog domain can be critically affected even by a small variation.

Drift is generally observed in PCM, where the structural relaxation of the amorphous phase causes an increase in resistance with time [179]. Drift can also affect the polycrystalline phase in multilevel PCM devices, as a result of residual amorphous regions [167], [180]. Fig. 12c shows the temporal decay of the conductance of multilevel analog states, described by their slope ν on the bilogarithmic plot. Drift is also observed in other devices, such as RRAM and FeFET, although the physical mechanism is different from PCM. Drift can be mitigated by adopting reference PCM cells [140], [181], [182], [183] or differential 2T2R structures [128].

Finally, various sources of noise and fluctuations may affect NVM devices. For instance, Fig. 12d shows the 1/f current noise of RRAM devices, causing an increasing relative spread of the measured current [168]. In addition to 1/f noise, thermal and random telegraph noise (RTN) can contribute to time-dependent variations of the weights, thus affecting the accuracy of the analog MVM. Noise might be mitigated by adopting analog integration of the readout current, although at the cost of a reduced speed of computation.

FIGURE 13. (a) Plot of the correlation between the conductance value G and its standard deviation, for a certain technology. (b) The simulated relative current error of the MVM product as a function of the matrix size. Device parameters were extracted from [79], [166], [184], [185].

To properly benchmark various NVM technologies for use in mixed or fully analog DNN accelerators, it is important to set a common metric. To this purpose, Fig. 13a shows a correlation plot between the average conductance value G and the standard deviation σG. Data were obtained for various NVM devices, including FeFET [79], PCM [185], RRAM [166], and STT-MRAM [184]. The conductance G should be minimized to reduce the readout currents, hence the energy consumption and IR-drop effects. Similarly, σG should be minimized to improve the computing accuracy in analog/mixed circuits. The observed trend in the figure is that σG and G approximately correlate as σG/G ≈ 0.15, irrespective of the NVM technology and the programmed state. Fig. 13b illustrates the relative current error for an MVM operation in the presence of variations and IR drop, as a function of the array size, for the NVM devices in Fig. 13a. For relatively small array sizes, the error decreases as a result of variability averaging among NVM devices. As the array size increases, IR drop causes the error to steeply increase. The optimum array size, identified in correspondence with the minimum error, is dictated by σG and G, which control variability and IR drop, respectively.
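The qualitative trend of Fig. 13b can be reproduced with a toy Monte-Carlo model: each conductance receives the σG/G ≈ 0.15 programming spread of Fig. 13a, while the IR drop is lumped into an error term growing with the square of the array size, as discussed above. Both the IR-drop coefficient and the conductance ranges are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
SIGMA_REL = 0.15      # sigma_G / G programming spread (Fig. 13a)
K_IR = 2e-7           # toy IR-drop coefficient (error ~ K_IR * n^2)

def mean_mvm_error(n, trials=40):
    errs = []
    for _ in range(trials):
        g = rng.uniform(10e-6, 100e-6, (n, n))            # target conductances
        v = rng.uniform(0.05, 0.2, n)                     # read voltages
        g_real = g * rng.normal(1.0, SIGMA_REL, g.shape)  # programming spread
        i_ideal = g @ v
        i_real = (g_real @ v) * (1.0 - K_IR * n**2)       # lumped IR-drop loss
        errs.append(np.mean(np.abs(i_real - i_ideal) / i_ideal))
    return float(np.mean(errs))

for n in (16, 64, 256, 1024):   # error first averages down, then IR drop dominates
    print(f"{n:4d} x {n:<4d} array -> mean relative error {mean_mvm_error(n):6.2%}")
```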

VI. OUTLOOK
IMC circuits are dense, fast, energy-efficient, and scalable. Several solutions and applications have already been identified and explored for both machine learning and deep learning. However, various technological and design challenges have also been identified. Further development and industrialization of IMC require addressing these challenges in two major directions.

The first direction concerns the study of device technology and materials. The IMC paradigm would greatly benefit from the adoption of precise, stable, and low-current memory devices that could be easily integrated in the BEOL of extremely scaled lithographic processes, while also being programmable in multiple conductive levels. Investigation of materials and device physics can enlighten the phenomena underlying nonidealities such as fluctuations and drift, with the aim of developing new memory devices that are immune from parasitic effects. Besides device developments, the engineering of the memory cell configuration, such as the 1S1R or 1T1R structure, could drastically reduce the operating current, with strong advantages in terms of lower energy consumption, lower IR drop, and higher area efficiency of the IMC system. In summary, developments at the device level would boost IMC performance in terms of increased information density, throughput, area, and energy efficiency.

The second direction to be explored is the study of computing architectures and their interplay with the workload. To maximize the system performance, computing parallelism should be maximized to prevent multiplexing of the readout chain. This approach is usually challenging since peripheral circuits consume the largest portion of the energy and area budget. However, these limitations could be relaxed by an accurate co-design of the hardware and the neural network. On the one hand, IMC circuits must be designed specifically for an application, thus avoiding unnecessary features or excessive precision, for instance by reducing the ADC quantization or implementing simplified activation functions. On the other hand, given a target application, the neural network can be customized to adopt the features that are suitable for IMC acceleration, such as low-level quantization or hardware-aware training procedures. Finally, an electronic design automation (EDA) toolchain is needed in order to bridge the gap between the end-user and the hardware system, ranging from application-specific, high-level-of-abstraction design tools [186] to dedicated compilers [187], [188], [189] performing low-level core optimization in real-world implementations, similarly to existing CPU- and GPU-based computing systems.

VII. CONCLUSION
This work provides an overview of memory devices and circuit topologies for the IMC-based acceleration of machine learning and deep learning. Among the various applications, a particular focus is given to the IMC acceleration of DNN inference, for which various approaches are presented and discussed, considering the circuit overheads and the parasitic effects affecting the final accuracy. IMC is a potentially disruptive paradigm shift, both in terms of architectural change and of raw computing performance. Further research on memory device engineering and understanding, as well as on the hardware-network synergy, could eventually unleash the full potential of IMC.

REFERENCES
[1] A. Gholami, Z. Yao, S. Kim, M. W. Mahoney, and K. Keutzer, AI and Memory Wall, RiseLab Medium Post, Berkeley, CA, USA, 2021.
[2] J. Kaplan et al. "Scaling Laws for Neural Language Models." Jan. 2020. [Online]. Available: http://arxiv.org/abs/2001.08361
[3] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. "Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning." 2016. [Online]. Available: https://arxiv.org/abs/1602.07261
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." 2018. [Online]. Available: https://arxiv.org/abs/1810.04805
[5] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." 2019. [Online]. Available: https://arxiv.org/abs/1909.08053
[6] T. B. Brown et al. "Language Models Are Few-Shot Learners." 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
[7] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), 2014, pp. 10–14.
[8] T. M. Conte, E. Track, and E. DeBenedictis, "Rebooting computing: New strategies for technology scaling," Computer, vol. 48, no. 12, pp. 10–13, 2015.
[9] M. A. Zidan, J. P. Strachan, and W. D. Lu, "The future of electronics based on memristive systems," Nat. Electron., vol. 1, no. 1, pp. 22–29, Jan. 2018. [Online]. Available: https://www.nature.com/articles/s41928-017-0006-8
[10] D. Ielmini and H.-S. P. Wong, "In-memory computing with resistive switching devices," Nat. Electron., vol. 1, no. 6, pp. 333–343, Jun. 2018. [Online]. Available: https://www.nature.com/articles/s41928-018-0092-2
[11] J. Borghetti et al., "A hybrid nanomemristor/transistor logic circuit capable of self-programming," Proc. Nat. Acad. Sci. USA, vol. 106, no. 6, pp. 1699–1703, Feb. 2009. [Online]. Available: https://www.pnas.org/doi/abs/10.1073/pnas.0806642106
[12] M. Cassinerio, N. Ciocchini, and D. Ielmini, "Logic computation in phase change materials by threshold and memory switching," Adv. Mater., vol. 25, no. 41, pp. 5975–5980, 2013. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/adma.201301940
[13] C. D. Wright, P. Hosseini, and J. A. V. Diosdado, "Beyond von-Neumann computing with nanoscale phase-change memory devices," Adv. Funct. Mater., vol. 23, no. 18, pp. 2248–2254, 2013.
[14] T. Tuma, A. Pantazi, M. Le Gallo, A. Sebastian, and E. Eleftheriou, "Stochastic phase-change neurons," Nat. Nanotechnol., vol. 11, no. 8, pp. 693–699, Aug. 2016. [Online]. Available: https://www.nature.com/articles/nnano.2016.70
[15] L. Zheng, S. Shin, S. Lloyd, M. Gokhale, K. Kim, and S.-M. Kang, "RRAM-based TCAMs for pattern search," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2016, pp. 1382–1385.
[16] C. Li et al., "Analog content-addressable memories with memristors," Nat. Commun., vol. 11, no. 1, p. 1638, Apr. 2020. [Online]. Available: https://www.nature.com/articles/s41467-020-15254-4

[17] S. Gaba, P. Knag, Z. Zhang, and W. Lu, "Memristive devices for stochastic computing," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 2592–2595.
[18] S. N. Truong and K.-S. Min, "New memristor-based crossbar array architecture with 50-% area reduction and 48-% power saving for matrix-vector multiplication of analog neuromorphic computing," J. Semicond. Technol. Sci., vol. 14, no. 3, pp. 356–363, 2014. [Online]. Available: https://koreascience.kr/article/JAKO201420249945718.page
[19] C. Li et al., "Analogue signal and image processing with large memristor crossbars," Nat. Electron., vol. 1, no. 1, pp. 52–59, 2018.
[20] D. Keitel-Schulz and N. Wehn, "Embedded DRAM development: Technology, physical design, and application issues," IEEE Des. Test Comput., vol. 18, no. 3, pp. 7–15, May/Jun. 2001.
[21] B. Yan et al., "A 1.041-Mb/mm2 27.38-TOPS/W signed-INT8 dynamic-logic-based ADC-less SRAM compute-in-memory macro in 28nm with reconfigurable bitwise operation for AI and embedded applications," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 65, Feb. 2022, pp. 188–190.
[22] Y.-D. Chih et al., "16.4 an 89TOPS/W and 16.3TOPS/mm2 all-digital SRAM-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 64, Feb. 2021, pp. 252–254.
[23] A. Agrawal et al., "Xcel-RAM: Accelerating binary neural networks in high-throughput SRAM compute arrays," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 8, pp. 3064–3076, Aug. 2019.
[24] R. Waser and M. Aono, "Nanoionics-based resistive switching memories," in Nanoscience and Technology. London, U.K.: Macmillan, Aug. 2009, pp. 158–165. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/9789814287005_0016
[25] D. Ielmini, "Resistive switching memories based on metal oxides: Mechanisms, reliability and scaling," Semicond. Sci. Technol., vol. 31, no. 6, Jun. 2016, Art. no. 063002. [Online]. Available: https://iopscience.iop.org/article/10.1088/0268-1242/31/6/063002
[26] H.-S. P. Wong et al., "Metal–oxide RRAM," Proc. IEEE, vol. 100, no. 6, pp. 1951–1970, Jun. 2012. [Online]. Available: http://ieeexplore.ieee.org/document/6193402/
[27] S. Balatti, S. Larentis, D. C. Gilmer, and D. Ielmini, "Multiple memory states in resistive switching devices through controlled size and orientation of the conductive filament," Adv. Mater., vol. 25, no. 10, pp. 1474–1478, Mar. 2013. [Online]. Available: https://onlinelibrary.wiley.com/doi/10.1002/adma.201204097
[28] S. Yu, Y. Wu, and H.-S. P. Wong, "Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory," Appl. Phys. Lett., vol. 98, no. 10, 2011, Art. no. 103514. [Online]. Available: https://doi.org/10.1063/1.3564883
[29] C.-C. Chou et al., "A 22nm 96KX144 RRAM macro with a self-tracking reference and a low ripple charge pump to achieve a configurable read window and a wide operating voltage range," in Proc. IEEE Symp. VLSI Circuits, Jun. 2020, pp. 1–2.
[30] H. Chung et al., "A 58nm 1.8V 1Gb PRAM with 6.4MB/s program BW," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2011, pp. 500–502.
[31] Y. Choi et al., "A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2012, pp. 46–48.
[32] T.-Y. Liu et al., "A 130.7 mm2 2-layer 32Gb ReRAM memory device in 24nm technology," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2013, pp. 210–211.
[33] M.-F. Chang et al., "19.4 embedded 1Mb ReRAM in 28nm CMOS with 0.27-to-1V read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2014, pp. 332–333.
[34] J. Zahurak et al., "Process integration of a 27nm, 16Gb Cu ReRAM," in Proc. IEEE Int. Electron Devices Meeting, Dec. 2014, pp. 6.2.1–6.2.4.
[35] S. Dünkel et al., "A FeFET based super-low-power ultra-fast embedded NVM technology for 22nm FDSOI and beyond," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2017, pp. 19.7.1–19.7.4.
[36] Y. J. Song et al., "Demonstration of highly manufacturable STT-MRAM embedded in 28nm logic," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2018, pp. 18.2.1–18.2.4.
[37] C.-C. Chou et al., "An N40 256K×44 embedded RRAM macro with SL-precharge SA and low-voltage current limiter to improve read and write performance," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2018, pp. 478–480.
[38] T. Kim et al., "High-performance, cost-effective 2z nm two-deck cross-point memory integrated by self-align scheme for 128 Gb SCM," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2018, pp. 37.1.1–37.1.4.
[39] F. Arnaud et al., "Truly innovative 28nm FDSOI technology for automotive micro-controller applications embedding 16MB phase change memory," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2018, pp. 18.4.1–18.4.4.
[40] Y.-C. Shih et al., "Logic process compatible 40-nm 16-Mb, embedded perpendicular-MRAM with hybrid-resistance reference, sub-μA sensing resolution, and 17.5-ns read access time," IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1029–1038, Apr. 2019.
[41] L. Wei et al., "13.3 a 7Mb STT-MRAM in 22FFL FinFET technology with 4ns read sensing time at 0.9V using write-verify-write scheme and offset-cancellation sensing technique," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2019, pp. 214–216.
[42] P. Jain et al., "13.2 a 3.6Mb 10.1Mb/mm2 embedded non-volatile ReRAM macro in 22nm FinFET technology with adaptive forming/set/reset schemes yielding down to 0.5V with sensing time of 5ns at 0.7V," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2019, pp. 212–214.
[43] Y.-D. Chih et al., "13.3 a 22nm 32Mb embedded STT-MRAM with 10ns read speed, 1M cycle write endurance, 10 years retention at 150°C and high immunity to magnetic field interference," in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), Feb. 2020, pp. 222–224.
[44] Y.-C. Shih et al., "A reflow-capable, embedded 8Mb STT-MRAM macro with 9ns read access time in 16nm FinFET logic CMOS process," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2020, pp. 11.4.1–11.4.4.
[45] D. Edelstein et al., "A 14 nm embedded STT-MRAM CMOS technology," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2020, pp. 11.5.1–11.5.4.
[46] V. B. Naik et al., "JEDEC-qualified highly reliable 22nm FD-SOI embedded MRAM for low-power industrial-grade, and extended performance towards automotive-grade-1 applications," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2020, pp. 11.3.1–11.3.4.
[47] A. Fazio, "Advanced technology and systems of cross point memory," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2020, pp. 24.1.1–24.1.4.
[48] J. J. Sun et al., "Commercialization of 1Gb standalone spin-transfer torque MRAM," in Proc. IEEE Int. Memory Workshop (IMW), May 2021, pp. 1–4.
[49] T. Shimoi et al., "A 22nm 32Mb embedded STT-MRAM macro achieving 5.9ns random read access and 5.8MB/s write throughput at up to Tj of 150°C," in Proc. IEEE Symp. VLSI Technol. Circuits (VLSI Technol. Circuits), Jun. 2022, pp. 134–135.
[50] S. M. Seo et al., "First demonstration of full integration and characterization of 4F² 1S1M cells with 45 nm of pitch and 20 nm of MTJ size," in Proc. Int. Electron Devices Meeting (IEDM), Dec. 2022, pp. 10.1.1–10.1.4.
[51] G. Servalli, "A 45nm generation phase change memory technology," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2009, pp. 1–4.
[52] C. Gopalan et al., "Demonstration of conductive bridging random access memory (CBRAM) in logic CMOS process," in Proc. IEEE Int. Memory Workshop, May 2010, pp. 1–4.
[53] S. H. Lee et al., "Highly productive PCRAM technology platform and full chip operation: Based on 4F² (84nm pitch) cell scheme for 1 Gb and beyond," in Proc. Int. Electron Devices Meeting, Dec. 2011, pp. 3.3.1–3.3.4.
[54] A. Kawahara et al., "Filament scaling forming technique and level-verify-write scheme with endurance over 10⁷ cycles in ReRAM," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2013, pp. 220–221.
[55] M. Ueki et al., "Low-power embedded ReRAM technology for IoT applications," in Proc. Symp. VLSI Technol. (VLSI Technol.), Jun. 2015, pp. T108–T109.

[56] C. Park et al., "Systematic optimization of 1 Gbit perpendicular magnetic tunnel junction arrays for 28 nm embedded STT-MRAM and beyond," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2015, pp. 26.2.1–26.2.4.
[57] S.-W. Chung et al., "4Gbit density STT-MRAM using perpendicular MTJ realized with compact cell structure," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2016, pp. 27.1.1–27.1.4.
[58] Y. J. Song et al., "Highly functional and reliable 8Mb STT-MRAM embedded in 28nm logic," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2016, pp. 27.2.1–27.2.4.
[59] D. Shum et al., "CMOS-embedded STT-MRAM arrays in 2x nm nodes for GP-MCU applications," in Proc. Symp. VLSI Technol., Jun. 2017, pp. T208–T209.
[60] J. Y. Wu et al., "A 40nm low-power logic compatible phase change memory technology," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2018, pp. 27.6.1–27.6.4.
[61] F. Arnaud et al., "High density embedded PCM cell in 28nm FDSOI technology for automotive micro-controller applications," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2020, pp. 24.2.1–24.2.4.
[62] C.-F. Yang et al., "Industrially applicable read disturb model and performance on mega-bit 28nm embedded RRAM," in Proc. IEEE Symp. VLSI Technol., Jun. 2020, pp. 1–2.
[63] S. H. Han et al., "28-nm 0.08 mm²/Mb embedded MRAM for frame buffer memory," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2020, pp. 11.2.1–11.2.4.
[64] D. Min et al., "18nm FDSOI technology platform embedding PCM & innovative continuous-active construct enhancing performance for leading-edge MCU applications," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2021, pp. 13.1.1–13.1.4.
[76] C. Chappert, A. Fert, and F. N. Van Dau, "The emergence of spin electronics in data storage," Nat. Mater., vol. 6, no. 11, pp. 813–823, Nov. 2007. [Online]. Available: https://www.nature.com/articles/nmat2024
[77] R. Carboni et al., "Modeling of breakdown-limited endurance in spin-transfer torque magnetic memory under pulsed cycling regime," IEEE Trans. Electron Devices, vol. 65, no. 6, pp. 2470–2478, Jun. 2018. [Online]. Available: https://ieeexplore.ieee.org/document/8338113/
[78] A. I. Khan, A. Keshavarzi, and S. Datta, "The future of ferroelectric field-effect transistor technology," Nat. Electron., vol. 3, no. 10, pp. 588–597, Oct. 2020. [Online]. Available: https://www.nature.com/articles/s41928-020-00492-7
[79] M. Trentzsch et al., "A 28nm HKMG super low power embedded NVM technology based on ferroelectric FETs," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2016, pp. 11.5.1–11.5.4.
[80] I. M. Miron et al., "Perpendicular switching of a single ferromagnetic layer induced by in-plane current injection," Nature, vol. 476, no. 7359, pp. 189–193, Aug. 2011. [Online]. Available: http://www.nature.com/articles/nature10309
[81] K. Garello et al., "Ultrafast magnetization switching by spin-orbit torques," Appl. Phys. Lett., vol. 105, no. 21, Nov. 2014, Art. no. 212402. [Online]. Available: https://aip.scitation.org/doi/full/10.1063/1.4902443
[82] J. Tang et al., "ECRAM as scalable synaptic cell for high-speed, low-power neuromorphic computing," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 2018, pp. 13.1.1–13.1.4. [Online]. Available: https://ieeexplore.ieee.org/document/8614551/
[83] J. Lee, R. D. Nikam, D. Kim, and H. Hwang, "Highly scalable (30 nm) and ultra-low-energy (∼5fJ/pulse) vertical sensing ECRAM with ideal synaptic characteristics using ion-permeable graphene electrodes," in Proc. Int. Electron Devices Meeting (IEDM),
[65] K. Lee et al., “28nm CIS-compatible embedded STT-MRAM for
Dec. 2022, pp. 2.2.1–2.2.4. [Online]. Available: https://ieeexplore.
frame buffer memory,” in Proc. IEEE Int. Electron Devices Meeting
ieee.org/document/10019326/
(IEDM), Dec. 2021, pp. 2.1.1–2.1.4.
[84] S. Kim et al., “Metal-oxide based, CMOS-compatible ECRAM
[66] T. Ito et al., “A 20Mb embedded STT-MRAM array achieving 72% for deep learning accelerator,” in Proc. IEEE Int. Electron Devices
write energy reduction with self-termination write schemes in 16nm Meeting (IEDM), Dec. 2019, pp. 35.7.1–35.7.4. [Online]. Available:
FinFET logic process,” in Proc. IEEE Int. Electron Devices Meeting https://ieeexplore.ieee.org/document/8993463/
(IEDM), Dec. 2021, pp. 2.2.1–2.2.4.
[85] M. Farronato, M. Melegari, S. Ricci, S. Hashemkhani, A. Bricalli, and
[67] C. Peters, F. Adler, K. Hofmann, and J. Otterstedt, “Reliability D. Ielmini, “Memtransistor devices based on MoS2 multilayers with
of 28nm embedded RRAM for consumer and industrial prod- volatile switching due to AG cation migration,” Adv. Electron. Mater.,
ucts,” in Proc. IEEE Int. Memory Workshop (IMW), May 2022, vol. 8, no. 8, Jan. 2022, Art. no. 2101161. [Online]. Available:
pp. 1–3. https://onlinelibrary.wiley.com/doi/10.1002/aelm.202101161
[68] S. Raoux, W. Wełnic, and D. Ielmini, “Phase change materials and [86] H. Lee et al., “Dual-gated MoS2 memtransistor crossbar
their application to nonvolatile memories,” Chem. Rev., vol. 110, array,” Adv. Funct. Mater., vol. 30, no. 45, Nov. 2020,
no. 1, pp. 240–267, Jan. 2010. [Online]. Available: https://pubs.acs. Art. no. 2003683. [Online]. Available: https://onlinelibrary.wiley.
org/doi/10.1021/cr900040x com/doi/10.1002/adfm.202003683
[69] M. Wuttig and N. Yamada, “Phase-change materials for rewriteable [87] R. A. John et al., “Ultralow power dual-gated subthreshold
data storage,” Nat. Mater., vol. 6, no. 11, pp. 824–832, Nov. 2007. oxide neuristors: An enabler for higher order neuronal tempo-
[Online]. Available: https://www.nature.com/articles/nmat2009 ral correlations,” ACS Nano, vol. 12, no. 11, pp. 11263–11273,
[70] P. Zuliani et al., “Overcoming temperature limitations in phase Nov. 2018. [Online]. Available: https://pubs.acs.org/doi/10.1021/
change memories with optimized Gex Sby Tez ,” IEEE Trans. Electron acsnano.8b05903
Devices, vol. 60, no. 12, pp. 4020–4026, Dec. 2013. [88] V. K. Sangwan et al., “Multi-terminal memtransistors from polycrys-
[71] D. Ielmini, A. Lacaita, A. Pirovano, F. Pellizzer, and R. Bez, talline monolayer molybdenum disulfide,” Nature, vol. 554, no. 7693,
“Analysis of phase distribution in phase-change nonvolatile mem- pp. 500–504, Feb. 2018. [Online]. Available: https://www.nature.
ories,” IEEE Electron Device Lett., vol. 25, no. 7, pp. 507–509, com/articles/nature25747
Jul. 2004. [Online]. Available: http://ieeexplore.ieee.org/document/ [89] M. Farronato, P. Mannocci, M. Melegari, S. Ricci,
1308435/ C. M. Compagnoni, and D. Ielmini, “Reservoir computing with
[72] P. Narayanan et al., “Fully on-chip MAC at 14nm enabled by accu- charge-trap memory based on a MoS2 channel for neuromorphic
rate row-wise programming of PCM-based weights and parallel engineering,” Adv. Mater., Oct. 2022, Art. no. 2205381.
vector-transport in duration-format,” in Proc. Symp. VLSI Technol., [90] M. Farronato, M. Melegari, S. Ricci, S. Hashemkani,
Jun. 2021, pp. 1–2. C. M. Compagnoni, and D. Ielmini, “Low-current, highly lin-
[73] T. Mikolajick et al., “FeRAM technology for high density ear synaptic memory device based on MoS2 transistors for online
applications,” Microelectron. Rel., vol. 41, no. 7, pp. 947–950, training and inference,” in Proc. IEEE 4th Int. Conf. Artif. Intell.
Jul. 2001. [Online]. Available: https://linkinghub.elsevier.com/ Circuits Syst. (AICAS), 2022, pp. 1–4.
retrieve/pii/S002627140100049X [91] D. Ielmini and S. Ambrogio, “Emerging neuromorphic devices,”
[74] T. S. Böscke, J. Müller, D. Bräuhaus, U. Schröder, and U. Böttger, Nanotechnology, vol. 31, no. 9, Feb. 2020, Art. no. 092001.
“Ferroelectricity in hafnium oxide thin films,” Appl. Phys. Lett., [Online]. Available: https://iopscience.iop.org/article/10.1088/1361-
vol. 99, no. 10, Sep. 2011, Art. no. 102903. [Online]. Available: 6528/ab554b
http://aip.scitation.org/doi/10.1063/1.3634052 [92] S. Shukla et al., “A scalable multi-TeraOPS core for AI train-
[75] S. Majumdar, “Back’ end CMOS compatible and flexible ferroelec- ing and inference,” IEEE Solid-State Circuits Lett., vol. 1, no. 12,
tric memories for neuromorphic computing and adaptive sensing,” pp. 217–220, Dec. 2018.
Adv. Intell. Syst., vol. 4, no. 4, Apr. 2022, Art. no. 2100175. [93] A. Chen, “A comprehensive crossbar array model with solutions
[Online]. Available: https://onlinelibrary.wiley.com/doi/10.1002/aisy. for line resistance and nonlinear device characteristics,” IEEE Trans.
202100175 Electron Devices, vol. 60, no. 4, pp. 1318–1326, Apr. 2013.

598 VOLUME 11, 2023


LEPRI et al.: IMC FOR MACHINE LEARNING AND DEEP LEARNING

[94] W.-H. Chen et al., “A 65nm 1Mb nonvolatile computing-in-memory [112] Z. Sun, G. Pedretti, E. Ambrosi, A. Bricalli, W. Wang, and D. Ielmini,
ReRAM macro with sub-16ns multiply-and-accumulate for binary “Solving matrix equations in one step with cross-point resistive
DNN AI edge processors,” in Proc. IEEE Int. Solid-State Circuits arrays,” Proc. Nat. Acad. Sci. USA, vol. 116, no. 10, pp. 4123–4128,
Conf. (ISSCC), Feb. 2018, pp. 494–496. 2019.
[95] S. D. Spetalnick et al., “A 40nm 64kb 26.56TOPS/W 2.37Mb/mm2 [113] P. Mannocci, G. Pedretti, E. Giannone, E. Melacarne, Z. Sun, and
RRAM binary/compute-in-memory macro with 4.23x improvement D. Ielmini, “A universal, analog, in-memory computing primitive for
in density and >75% use of sensing dynamic range,” in Proc. linear algebra using memristors,” IEEE Trans. Circuits Syst. I, Reg.
IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 65, Feb. 2022, Papers, vol. 68, no. 12, pp. 4889–4899, Dec. 2021.
pp. 1–3. [114] Z. Sun, E. Ambrosi, G. Pedretti, A. Bricalli, and D. Ielmini,
[96] T.-H. Kim, J. Lee, S. Kim, J. Park, B.-G. Park, and H. Kim, “In-memory PageRank accelerator with a cross-point array of resis-
“3-bit multilevel operation with accurate programming scheme tive memories,” IEEE Trans. Electron Devices, vol. 67, no. 4,
in TiOx /Al2 O3 memristor crossbar array for quantized neuro- pp. 1466–1470, Apr. 2020.
morphic system,” Nanotechnology, vol. 32, no. 29, Apr. 2021, [115] Z. Sun, G. Pedretti, A. Bricalli, and D. Ielmini, “One-step regression
Art. no. 295201. doi: 10.1088/1361-6528/abf0cc. and classification with cross-point resistive memory arrays,” Sci. Adv.,
[97] V. Milo et al., “Multilevel HfO2 -based RRAM devices for low- vol. 6, no. 5, 2020, Art. no. eaay2378.
power neuromorphic networks,” APL Mater., vol. 7, no. 8, Aug. 2019, [116] P. Mannocci, E. Melacarne, and D. Ielmini, “An analogue in-memory
Art. no. 081120. [Online]. Available: https://aip.scitation.org/doi/full/ ridge regression circuit with application to massive MIMO acceler-
10.1063/1.5108650 ation,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 12, no. 4,
[98] I. Yeo, M. Chu, S.-G. Gi, H. Hwang, and B.-G. Lee, “Stuck- pp. 952–962, Dec. 2022.
at-fault tolerant schemes for memristor crossbar array-based neu- [117] M. Mahmoodi et al., “An analog neuro-optimizer with adaptable
ral networks,” IEEE Trans. Electron Devices, vol. 66, no. 7, annealing based on 64 × 64 0T1R crossbar circuit,” in Proc. IEEE
pp. 2937–2945, Jul. 2019. Int. Electron Devices Meeting (IEDM), 2019, pp. 14–7.
[99] D. Ielmini and G. Pedretti, “Device and circuit architectures for [118] M. N. Bojnordi and E. Ipek, “Memristive Boltzmann machine:
in-memory computing,” Adv. Intell. Syst., vol. 2, no. 7, 2020, A hardware accelerator for combinatorial optimization and deep
Art. no. 2000040. [Online]. Available: https://onlinelibrary.wiley. learning,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit.
com/doi/abs/10.1002/aisy.202000040 (HPCA), 2016, pp. 1–13.
[100] Y.-C. Chen et al., “An access-transistor-free (0T/1R) non-volatile [119] Y. Kiat, Y. Vortman, and N. Sapir, “Feather moult and bird appearance
resistance random access memory (RRAM) using a novel threshold are correlated with global warming over the last 200 years,” Nat.
switching, self-rectifying chalcogenide device,” in Proc. IEEE Int. Commun., vol. 10, no. 1, p. 2540, 2019.
Electron Devices Meeting, Dec. 2003, pp. 37.4.1–37.4.4. [120] J. J. Hopfield, “Neural networks and physical systems with emergent
[101] D. Ielmini and Y. Zhang, “Physics-based analytical model of collective computational abilities,” Proc. Nat. Acad. Sci. USA, vol. 79,
chalcogenide-based memories for array simulation,” in Proc. Int. no. 8, pp. 2554–2558, 1982.
Electron Devices Meeting, Dec. 2006, pp. 1–4.
[121] G. Pedretti et al., “A spiking recurrent neural network with phase-
[102] M. Hu et al., “Memristor-based analog computation and neural
change memory neurons and synapses for the accelerated solution
network classification with a dot product engine,” Adv. Mater.,
of constraint satisfaction problems,” IEEE J. Explor. Solid-State
vol. 30, no. 9, 2018, Art. no. 1705914. [Online]. Available: https://
Computat. Devices Circuits, vol. 6, no. 1, pp. 89–97, Jun. 2020.
onlinelibrary.wiley.com/doi/abs/10.1002/adma.201705914
[122] F. Cai et al., “Power-efficient combinatorial optimization using intrin-
[103] H. Cai et al., “Proposal of analog in-memory computing with mag-
sic noise in memristor Hopfield neural networks,” Nat. Electron.,
nified tunnel magnetoresistance ratio and universal STT-MRAM
vol. 3, no. 7, pp. 409–418, 2020.
cell,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 4,
pp. 1519–1531, Apr. 2022. [123] T. Dalgaty, E. Esmanhotto, N. Castellani, D. Querlioz, and
[104] J. M. Lopez et al., “1S1R optimization for high-frequency infer- E. Vianello, “Ex situ transfer of Bayesian neural networks to resistive
ence on binarized spiking neural networks,” Adv. Electron. Mater., memory-based inference hardware,” Adv. Intell. Syst., vol. 3, no. 8,
vol. 8, no. 8, 2022, Art. no. 2200323. [Online]. Available: https:// 2021, Art. no. 2000103.
onlinelibrary.wiley.com/doi/abs/10.1002/aelm.202200323 [124] S. Agarwal et al., “Resistive memory device requirements for a
[105] J. M. Lopez et al., “1S1R sub-threshold operation in crossbar arrays neural algorithm accelerator,” in Proc. Int. Joint Conf. Neural Netw.
for low power BNN inference computing,” in Proc. IEEE Int. Memory (IJCNN), 2016, pp. 929–938.
Workshop (IMW), May 2022, pp. 1–4. [125] X. Xu et al., “40× retention improvement by eliminating resis-
[106] G. W. Burr et al., “Access devices for 3D crosspoint memory,” tance relaxation with high temperature forming in 28 nm RRAM
J. Vacuum Sci. Technol. B, vol. 32, no. 4, Jul. 2014, chip,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2018,
Art. no. 040802. [Online]. Available: https://avs.scitation.org/doi/ pp. 20.1.1–20.1.4.
full/10.1116/1.4889999 [126] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio.
[107] S. A. Chekol, J. Song, J. Park, J. Yoo, S. Lim, and H. Hwang, “Binarized Neural Networks: Training Deep Neural Networks With
“Chapter 5—Selector devices for emerging memories,” in Weights and Activations Constrained to +1 or −1.” Mar. 2016.
Memristive Devices for Brain-Inspired Computing (Woodhead [Online]. Available: http://arxiv.org/abs/1602.02830
Publishing Series in Electronic and Optical Materials), S. Spiga, [127] T. Simons and D.-J. Lee, “A review of binarized neural networks,”
A. Sebastian, D. Querlioz, and B. Rajendran, Eds. London, Electronics, vol. 8, no. 6, p. 661, Jun. 2019. [Online]. Available:
U.K.: Woodhead, Jan. 2020, pp. 135–164. [Online]. Available: https://www.mdpi.com/2079-9292/8/6/661
https://www.sciencedirect.com/science/article/pii/B97800810278200 [128] M. Bocquet et al., “In-memory and error-immune differential RRAM
00058 implementation of binarized deep neural networks,” in Proc. IEEE
[108] Y.-C. Luo, A. Lu, J. Hur, S. Li, and S. Yu, “Design of non-volatile Int. Electron Devices Meeting (IEDM), Dec. 2018, pp. 20.6.1–20.6.4.
capacitive crossbar array for in-memory computing,” in Proc. IEEE [129] H. Kim, Y. Kim, and J.-J. Kim, “In-memory batch-normalization
Int. Memory Workshop (IMW), May 2021, pp. 1–4. for resistive memory based binary neural network hardware,” in
[109] S. Jung et al., “A crossbar array of magnetoresistive memory Proc. 24th Asia South Pac. Design Autom. Conf. (ASPDAC), 2019,
devices for in-memory computing,” Nature, vol. 601, no. 7892, pp. 645–650. [Online]. Available: https://doi.org/10.1145/3287624.
pp. 211–216, Jan. 2022. [Online]. Available: https://www.nature. 3287718
com/articles/s41586-021-04196-6 [130] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and
[110] “Pillars of Creation (NIRCam and MIRI Composite Image).” P.-E. Gaillardon, “A robust digital RRAM-based convolutional
Accessed: Mar. 7, 2023. [Online]. Available: https://webbtelescope. block for low-power image processing and learning applica-
org/contents/media/images tions,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 2,
[111] S. N. Truong, S. Shin, S.-D. Byeon, J. Song, and K.-S. Min, pp. 643–654, Feb. 2019.
“New twin crossbar architecture of binary memristors for low-power [131] S. Angizi, Z. He, A. Awad, and D. Fan, “MRIMA: An MRAM-based
image recognition with discrete cosine transform,” IEEE Trans. in-memory accelerator,” IEEE Trans. Comput.-Aided Design Integr.
Nanotechnol., vol. 14, no. 6, pp. 1104–1111, Nov. 2015. Circuits Syst., vol. 39, no. 5, pp. 1123–1136, May 2020.

VOLUME 11, 2023 599


LEPRI et al.: IMC FOR MACHINE LEARNING AND DEEP LEARNING

[132] Y. Long et al., “A ferroelectric FET-based processing-in-memory [149] Z. Jiang, S. Yin, J.-S. Seo, and M. Seok, “C3SRAM: In-memory-
architecture for DNN acceleration,” IEEE J. Explor. Solid-State computing SRAM macro based on capacitive-coupling comput-
Comput. Devices Circuits, vol. 5, no. 2, pp. 113–122, Dec. 2019. ing,” IEEE Solid-State Circuits Lett., vol. 2, no. 9, pp. 131–134,
[133] H. Fujiwara et al., “A 5-nm 254-TOPS/W 221-TOPS/mm2 Sep. 2019.
fully-digital computing-in-memory macro supporting wide-range [150] H. Wang et al., “A 32.2 TOPS/W SRAM compute-in-memory macro
dynamic-voltage-frequency scaling and simultaneous MAC and write employing a linear 8-bit C-2C ladder for charge domain computation
operations,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), in 22nm for edge inference,” in Proc. IEEE Symp. VLSI Technol.
vol. 65, Feb. 2022, pp. 1–3. Circuits (VLSI Technol. Circuits), Jun. 2022, pp. 36–37.
[134] F. Tu et al., “A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 [151] H. Jia et al., “15.1 a programmable neural-network inference
reconfigurable digital CIM processor with unified FP/INT pipeline accelerator based on scalable in-memory computing,” in Proc.
and bitwise in-memory booth multiplication for cloud deep learning IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 64, Feb. 2021,
acceleration,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), pp. 236–238.
vol. 65, Feb. 2022, pp. 1–3. [152] H. Valavi, P. J. Ramadge, E. Nestler, and N. Verma, “A mixed-signal
[135] H. Kim, Q. Chen, T. Yoo, T. T.-H. Kim, and B. Kim, “A 1-16b binarized convolutional-neural-network accelerator integrating dense
precision reconfigurable digital in-memory computing macro featur- weight storage and multiplication for reduced data movement,” in
ing column-MAC architecture and bit-serial computation,” in Proc. Proc. IEEE Symp. VLSI Circuits, Jun. 2018, pp. 141–142.
IEEE 45th Eur. Solid-State Circuits Conf. (ESSCIRC), Sep. 2019, [153] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann,
pp. 345–348. “An always-on 3.8 &micro J/86% CIFAR-10 mixed-signal binary
[136] C.-F. Lee et al., “A 12nm 121-TOPS/W 41.6-TOPS/mm2 all digital CNN processor with all memory on chip in 28-nm CMOS,” IEEE
full precision SRAM-based compute-in-memory with configurable J. Solid-State Circuits, vol. 54, no. 1, pp. 158–172, Jan. 2019.
bit-width for AI edge applications,” in Proc. IEEE Symp. VLSI [154] M. E. Sinangil et al., “A 7-nm compute-in-memory SRAM macro
Technol. Circuits (VLSI Technol. Circuits), Jun. 2022, pp. 24–25. supporting multi-bit input, weight and output and achieving 351
[137] H. Oh, H. Kim, N. Kang, Y. Kim, J. Park, and J.-J. Kim, “Single TOPS/W and 372.4 GOPS,” IEEE J. Solid-State Circuits, vol. 56,
RRAM cell-based in-memory accelerator architecture for binary neu- no. 1, pp. 188–198, Jan. 2021.
ral networks,” in Proc. IEEE 3rd Int. Conf. Artif. Intell. Circuits Syst. [155] W. Wan et al., “A compute-in-memory chip based on resistive
(AICAS), Jun. 2021, pp. 1–4. random-access memory,” Nature, vol. 608, no. 7923, pp. 504–512,
Aug. 2022. [Online]. Available: https://www.nature.com/articles/
[138] X. Sun, S. Yin, X. Peng, R. Liu, J.-S. Seo, and S. Yu, “XNOR-
s41586-022-04992-8
RRAM: A scalable and parallel resistive synaptic architecture for
binary neural networks,” in Proc. Design Autom. Test Europe Conf. [156] H. Jiang, W. Li, S. Huang, and S. Yu, “A 40nm analog-input ADC-
Exhibit. (DATE), Mar. 2018, pp. 1423–1428. free compute-in-memory RRAM macro with pulse-width modulation
between sub-arrays,” in Proc. IEEE Symp. VLSI Technol. Circuits
[139] S. Yin, X. Sun, S. Yu, and J.-S. Seo, “High-throughput in-memory
(VLSI Technol. Circuits), Jun. 2022, pp. 266–267.
computing for binary deep neural networks with monolithically inte-
grated RRAM and 90-nm CMOS,” IEEE Trans. Electron Devices, [157] C.-X. Xue et al., “15.4 a 22nm 2Mb ReRAM compute-in-memory
vol. 67, no. 10, pp. 4185–4192, Oct. 2020. macro with 121-28TOPS/W for Multibit MAC computing for tiny AI
edge devices,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC),
[140] Y.-F. Qin, R. Kuang, X.-D. Huang, Y. Li, J. Chen, and X.-S. Miao, Feb. 2020, pp. 244–246.
“Design of high robustness BNN inference accelerator based on
[158] Q. Liu et al., “33.2 a fully integrated analog ReRAM based
binary memristors,” IEEE Trans. Electron Devices, vol. 67, no. 8,
78.4TOPS/W compute-in-memory chip with fully parallel MAC
pp. 3435–3441, Aug. 2020.
computing,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC),
[141] A. P. Chowdhury, P. Kulkarni, and M. N. Bojnordi, “MB-CNN: Feb. 2020, pp. 500–502.
Memristive binary convolutional neural networks for embedded
[159] R. Khaddam-Aljameh et al., “HERMES-core—A 1.59-TOPS/mm2
mobile devices,” J. Low Power Electron. Appl., vol. 8, no. 4, p. 38,
PCM on 14-nm CMOS in-memory compute core using 300-ps/LSB
Dec. 2018. [Online]. Available: https://www.mdpi.com/2079-9268/8/
linearized CCO-based ADCs,” IEEE J. Solid-State Circuits, vol. 57,
4/38
no. 4, pp. 1027–1038, Apr. 2022.
[142] D. Saito et al., “Analog in-memory computing in FeFET-based [160] V. Joshi et al., “Accurate deep neural network inference using com-
1T1R array for edge AI applications,” in Proc. Symp. VLSI Circuits, putational phase-change memory,” Nat. Commun., vol. 11, no. 1,
Jun. 2021, pp. 1–2. p. 2473, May 2020. [Online]. Available: https://www.nature.com/
[143] C. Matsui, K. Toprasertpong, S. Takagi, and K. Takeuchi, “Energy- articles/s41467-020-16108-9
efficient reliable HZO FeFET computation-in-memory with local [161] W.-S. Khwa et al., “A 40-nm, 2M-cell, 8b-precision, hybrid SLC-
multiply & global accumulate array for source-follower & charge- MLC PCM computing-in-memory macro with 20.5–65.0TOPS/W for
sharing voltage sensing,” in Proc. Symp. VLSI Circuits, Jun. 2021, tiny-Al edge devices,” in Proc. IEEE Int. Solid-State Circuits Conf.
pp. 1–2. (ISSCC), vol. 65, Feb. 2022, pp. 1–3.
[144] J.-W. Su et al., “16.3 a 28nm 384kb 6T-SRAM computation-in- [162] P. Deaville, B. Zhang, and N. Verma, “A 22nm 128-kb MRAM
memory macro with 8b precision for AI edge chips,” in Proc. row/column-parallel in-memory computing macro with memory-
IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 64, Feb. 2021, resistance boosting and multi-column ADC readout,” in Proc. IEEE
pp. 250–252. Symp. VLSI Technol. Circuits (VLSI Technol. Circuits), Jun. 2022,
[145] H. Jia, H. Valavi, Y. Tang, J. Zhang, and N. Verma, “A programmable pp. 268–269.
heterogeneous microprocessor based on bit-scalable in-memory com- [163] T. Soliman et al., “Ultra-low power flexible precision FeFET based
puting,” IEEE J. Solid-State Circuits, vol. 55, no. 9, pp. 2609–2621, analog in-memory computing,” in Proc. IEEE Int. Electron Devices
Sep. 2020. Meeting (IEDM), Dec. 2020, pp. 29.2.1–29.2.4.
[146] Q. Dong et al., “15.3 a 351TOPS/W and 372.4GOPS compute-in- [164] C.-X. Xue et al., “16.1 a 22nm 4Mb 8b-precision ReRAM
memory SRAM macro in 7nm FinFET CMOS for machine-learning computing-in-memory macro with 11.91 to 195.7TOPS/W for tiny AI
applications,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), edge devices,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC),
Feb. 2020, pp. 242–244. vol. 64, Feb. 2021, pp. 245–247.
[147] S. Yin, Z. Jiang, J.-S. Seo, and M. Seok, “XNOR-SRAM: In-memory [165] J.-M. Hung et al., “An 8-Mb DC-current-free binary-to-8b precision
computing SRAM macro for binary/ternary deep neural networks,” ReRAM nonvolatile computing-in-memory macro using time-space-
IEEE J. Solid-State Circuits, vol. 55, no. 6, pp. 1733–1743, readout with 1286.4-21.6TOPS/W for edge-AI devices,” in Proc.
Jun. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/ IEEE Int. Solid-State Circuits Conf. (ISSCC), vol. 65, Feb. 2022,
8959407/ pp. 1–3.
[148] P.-F. Chiu, W. H. Choi, W. Ma, M. Qin, and M. Lueker-Boden, [166] A. Glukhov et al., “Statistical model of program/verify algorithms in
“A binarized neural network accelerator with differential crosspoint resistive-switching memories for in-memory neural network accel-
memristor array for energy-efficient MAC operations,” in Proc. IEEE erators,” in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), Mar. 2022,
Int. Symp. Circuits Syst. (ISCAS), May 2019, pp. 1–5. pp. 1–7.

600 VOLUME 11, 2023


LEPRI et al.: IMC FOR MACHINE LEARNING AND DEEP LEARNING

[167] S. Ambrogio et al., “Reducing the impact of phase-change memory [179] D. Ielmini, S. Lavizzari, D. Sharma, and A. L. Lacaita, “Physical
conductance drift on the inference of large-scale hardware neural interpretation, modeling and impact on phase change memory
networks,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), (PCM) reliability of resistance drift due to chalcogenide structural
Dec. 2019, pp. 6.1.1–6.1.4. relaxation,” in Proc. IEEE Int. Electron Devices Meeting, 2007,
[168] S. Ambrogio, S. Balatti, V. McCaffrey, D. C. Wang, and pp. 939–942.
D. Ielmini, “Noise-induced resistance broadening in resistive switch- [180] N. Ciocchini, E. Palumbo, M. Borghi, P. Zuliani, R. Annunziata,
ing memory—Part I: Intrinsic cell behavior,” IEEE Trans. Electron and D. Ielmini, “Modeling resistance instabilities of set and
Devices, vol. 62, no. 11, pp. 3805–3811, Nov. 2015. reset states in phase change memory with ge-rich GeSbTe,”
[169] N. Lepri, M. Baldo, P. Mannocci, A. Glukhov, V. Milo, and IEEE Trans. Electron Devices, vol. 61, no. 6, pp. 2136–2144,
D. Ielmini, “Modeling and compensation of IR drop in crosspoint Jun. 2014.
accelerators of neural networks,” IEEE Trans. Electron Devices, [181] Y.-H. Lin et al., “Performance impacts of analog ReRAM non-
vol. 69, no. 3, pp. 1575–1581, Mar. 2022. ideality on neuromorphic computing,” IEEE Trans. Electron Devices,
[170] N. Lepri, A. Glukhov, and D. Ielmini, “Mitigating read-program vol. 66, no. 3, pp. 1289–1295, Mar. 2019.
variation and IR drop by circuit architecture in RRAM-based neural [182] I. Muñoz-Martín, S. Bianchi, O. Melnic, A. G. Bonfanti,
network accelerators,” in Proc. IEEE Int. Rel. Phys. Symp. (IRPS), and D. Ielmini, “A drift-resilient hardware implementation of
Mar. 2022, pp. 1–6. neural accelerators based on phase change memory devices,”
[171] F. L. Aguirre, N. M. Gomez, S. M. Pazos, F. Palumbo, J. Suñé, IEEE Trans. Electron Devices, vol. 68, no. 12, pp. 6076–6081,
and E. Miranda, “Minimization of the line resistance impact on Dec. 2021.
memdiode-based simulations of multilayer perceptron arrays applied [183] M. Bertuletti, I. Munoz-Martín, S. Bianchi, A. G. Bonfanti, and
to pattern recognition,” J. Low Power Electron. Appl., vol. 11, no. 1, D. Ielmini, “A multilayer neural accelerator with binary activations
p. 9, Mar. 2021. [Online]. Available: https://www.mdpi.com/2079- based on phase-change memory,” IEEE Trans. Electron Devices,
9268/11/1/9 vol. 70, no. 3, pp. 986–992, Mar. 2023.
[172] C. Mackin et al., “Optimised weight programming for analogue [184] C.-C. Chang et al., “NV-BNN: An accurate deep convolu-
memory-based deep neural networks,” Nat. Commun., vol. 13, no. 1, tional neural network based on binary STT-MRAM for adap-
p. 3765, Jun. 2022. [Online]. Available: https://www.nature.com/ tive AI edge,” in Proc. 56th Annu. Design Autom. Conf. (DAC),
articles/s41467-022-31405-1 2019, pp. 1–6. [Online]. Available: https://doi.org/10.1145/3316781.
[173] F. Zhang and M. Hu, “Mitigate parasitic resistance in resis- 3317872
tive crossbar-based convolutional neural networks,” ACM J. Emerg. [185] M. Le Gallo, A. Sebastian, G. Cherubini, H. Giefers, and
Technol. Comput. Syst., vol. 16, no. 3, pp. 1–25, 2020. [Online]. E. Eleftheriou, “Compressed sensing with approximate message pass-
Available: https://doi.org/10.1145/3371277 ing using in-memory computing,” IEEE Trans. Electron Devices,
[174] D. Joksas et al., “Nonideality-aware training for accurate and robust vol. 65, no. 10, pp. 4304–4312, Oct. 2018.
low-power memristive neural networks,” Adv. Sci., vol. 9, no. 17, [186] P.-Y. Chen, X. Peng, and S. Yu, “NeuroSim+: An integrated device-
2022, Art. no. 2105784. [Online]. Available: https://onlinelibrary. to-algorithm framework for benchmarking synaptic devices and array
wiley.com/doi/abs/10.1002/advs.202105784 architectures,” in Proc. IEEE Int. Electron Devices Meeting (IEDM),
[175] V. Milo et al., “Accurate program/verify schemes of resistive switch- Dec. 2017, pp. 6.1.1–6.1.4. [Online]. Available: http://ieeexplore.ieee.
ing memory (RRAM) for in-memory neural network circuits,” IEEE org/document/8268337/
Trans. Electron Devices, vol. 68, no. 8, pp. 3832–3837, Aug. 2021. [187] S. Achour, R. Sarpeshkar, and M. C. Rinard, “Configuration syn-
[176] A. Athmanathan, M. Stanisavljevic, N. Papandreou, H. Pozidis, and thesis for programmable analog devices with Arco,” ACM SIGPLAN
E. Eleftheriou, “Multilevel-cell phase-change memory: A viable tech- Notices, vol. 51, no. 6, pp. 177–193, Aug. 2016. [Online]. Available:
nology,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 6, no. 1, https://dl.acm.org/doi/10.1145/2980983.2908116
pp. 87–100, Mar. 2016. [188] S. Achour and M. Rinard, “Noise-aware dynamical system com-
[177] S. Ambrogio, S. Balatti, A. Cubeta, A. Calderoni, N. Ramaswamy, pilation for analog devices with Legno,” in Proc. 25th Int.
and D. Ielmini, “Statistical fluctuations in HfOx resistive-switching Conf. Archit. Support Program. Lang. Oper. Syst., Mar. 2020,
memory: Part I—Set/reset variability,” IEEE Trans. Electron Devices, pp. 149–166. [Online]. Available: https://dl.acm.org/doi/10.1145/
vol. 61, no. 8, pp. 2912–2919, Aug. 2014. 3373376.3378449
[178] E. Pérez et al., “Analysis of the statistics of device-to- [189] S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard,
device and cycle-to-cycle variability in TiN/Ti/Al:HfO2 /TiN “Chisel: Reliability- and accuracy-aware optimization of approximate
RRAMs,” Microelectron. Eng., vol. 214, pp. 104–109, Jun. 2019. computational kernels,” ACM SIGPLAN Notices, vol. 49, no. 10,
[Online]. Available: https://www.sciencedirect.com/science/article/ pp. 309–328, Dec. 2014. [Online]. Available: https://dl.acm.org/doi/
pii/S0167931719301303 10.1145/2714064.2660231

VOLUME 11, 2023 601

You might also like