Cim Iscas 2021

The document proposes a computing-in-memory (CIM) architecture using a single-ended disturb-free 7T 1Kb SRAM. The CIM performs basic boolean operations, 4-bit addition, and 4-bit signed number multiplication, as well as normal and retention modes for built-in self-testing. It aims to replace the SRAM in edge systems to reduce the calculation workload of the arithmetic logic unit. The CIM features a ripple carry adder and multiplier unit built using full swing-gate diffusion input technology for good voltage swing, low power, and small area. It implements the most number of operations and functions compared to related works.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2015

A 40-nm CMOS Multifunctional Computing-in-Memory (CIM) Using Single-Ended Disturb-Free 7T 1Kb SRAM

Chua-Chin Wang, Senior Member, IEEE, Lean Karlo S. Tolentino, Student Member, IEEE, Chia-Yi Huang, and Chia-Hung Yeh, Senior Member, IEEE

Abstract—This investigation proposes a computing-in-memory (CIM) design to circumvent the von Neumann bottleneck, which causes poor computation throughput. The proposed CIM performs multiple operations, such as single-instruction basic Boolean operations, addition, and signed number multiplication, and multiple functions, such as normal mode and retention mode for the built-in self test (BIST). Its 2T-Switch requires only two transistors for the SRAM array; thus, the arithmetic unit can be chosen easily and area is minimized. Its Ripple Carry Adder and Multiplier (RCAM) unit, which is based on single-ended disturb-free 7T 1Kb SRAM, was developed using Full Swing-Gate Diffusion Input (FS-GDI) technology, which offers full voltage swing, low power consumption, and small chip area. Its Auto-Switching Write Back Circuit automatically writes the results of addition and multiplication back to the assigned memory address. The CIM is implemented using the TSMC 40-nm CMOS process at a clock frequency of 100 MHz. Its core area is 432.81×510.265 µm². Among the related works, the proposed CIM performs the largest number of operations and functions.

Index Terms—computing-in-memory (CIM), disturb-free, full swing-gate diffusion input (FS-GDI), SRAM, von Neumann bottleneck.

I. INTRODUCTION
Artificial Intelligence (AI) and Neural Networks have contributed much to the development of Industry 4.0. These applications use the von Neumann architecture, which consists of memory as the storage device and an arithmetic logic unit (ALU) as the calculation device. However, the von Neumann bottleneck [1] is still a serious problem: a large amount of data flows between the memory and the ALU, leading to overhead, throughput, and energy efficiency limitations. Several studies have been conducted to resolve these limitations by implementing a Computing-in-Memory (CIM) architecture [2]–[4].

CIM circumvents the von Neumann bottleneck by implementing calculations directly in the memory array, so data need not be transferred from the memory arrays to the processor [2]–[4]. It performs operations by reading data and writing the results back to the memory. To achieve a structure that can perform CIM, a small additional circuit for operations must be allotted. This circuit must be able to switch the source and purpose of the calculation flexibly.

SRAMs are more commonly used as CIM devices than DRAMs [3], [4] because they perform bit-wise logical operations and read out data faster and are more reliable, both of which are needed for AI applications. However, they have higher power and area consumption. For this reason, a 4T loadless SRAM was developed [5], but disturbances on the bit line during the reading and writing of data degrade the static noise margin (SNM) [6]. As a solution, a disturb-free property for SRAMs is recommended [7], which was implemented using a write-assist loop based on multi-Vth transistors.

In this work, we propose a CIM architecture based on disturb-free 7T 1Kb SRAM that performs single-instruction Boolean operations, 4-bit addition, and 4-bit signed number multiplication, and multiple functions such as normal mode and retention mode for the built-in self test (BIST). The CIM architecture aims to replace the SRAM in edge systems so that the ALU has less calculation work. Its Ripple Carry Adder and Multiplier (RCAM) unit was constructed using Full Swing-Gate Diffusion Input (FS-GDI) technology [8], [9], which has the advantages of good voltage swing, low power consumption, and small chip area.

Chua-Chin Wang is with Department of Electrical Engineering, National Sun Yat-Sen University (NSYSU), Kaohsiung 80424, Taiwan and also with Institute of Undersea Technology (IUT), National Sun Yat-Sen University (NSYSU), Kaohsiung 80424, Taiwan (e-mail: [email protected]).
Lean Karlo S. Tolentino and Chia-Yi Huang are with Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan (e-mail: [email protected], [email protected]).
Chia-Hung Yeh is with Department of Electrical Engineering, National Taiwan Normal University, Taipei 10610, Taiwan and Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan (e-mail: [email protected]).
This study was partially funded by Ministry of Science and Technology, Taiwan, under grant MOST 108-2218-E-110-002 and 109-2218-E-110-007. The authors would like to thank Taiwan Semiconductor Research Institute (TSRI) for the fabrication and measurements of the chip.

II. CIM BASED ON SINGLE-ENDED DISTURB-FREE 7T SRAM SYSTEM ARCHITECTURE
The CIM architecture based on single-ended disturb-free 7T 1Kb SRAM is shown in Fig. 1, while its schematic is shown in Fig. 2. It is composed of several featured blocks, which are discussed in the following subsections.
Fig. 1. Block diagram of the 1Kb CIM architecture.

Fig. 2. CIM schematic (partial).

A. 7T SRAM and Its Control Circuit
As shown in Fig. 3, the 7T SRAM in Fig. 1 uses word lines (WLx), memory cell control lines (WA and WAB), and pre-discharge lines (PreD) to read and write data. WLx can select only one row at a time. In addition, MN204 and MN205 provide a discharge path which allows the values of Qxx and Qbxx to be stably maintained.

Fig. 3. 7T SRAM and Its Control Circuit.

When writing 0, PreD is kept high to pull the reverse bit line (BLBx) to ground, and at the same time, the cell is selected to be written. When WLx and WAB are high, WA is turned off and the value of Qxx is pulled to ground. On the contrary, when writing 1, WA is turned on and Qbxx is pulled to ground. In this way, VDD pulls Qxx high through MP202. When reading, WA and WLx are turned on to transmit Qbxx to BLBx and generate the value of BLx through an inverter. The SRAM Control Circuit controls WA and WAB based on the inputs, namely, Data_inx, WLx, and PreD, as shown in Table I.

TABLE I
SRAM CONTROL CIRCUIT FUNCTION TABLE

Mode      Data_inx  PreD  WLx  WA  WAB
Standby   X         X     0    0   0
Read      X         0     1    1   0
Write     0         1     1    0   1
Write     1         1     1    1   0

B. 2T Switch
Fig. 4 shows the 2T Switch in Fig. 1. It increases the energy efficiency of the CIM. Referring to Fig. 5, its function can be described step-by-step as follows: initialization, start-up, computation phase I, and computation phase II. During initialization, the PreC signal is low in each write cycle. After CBx, CNORx, and CANDx are precharged to high, PreC is pulled high to operate in start-up mode. Two sets of control signals, Sx and Cx, control the 2T Switch to couple Qxx or Qbxx to CBx, CNORx, and CANDx. During computation phase I, only one of the signals S0 to S3 can be logic high at a time, to select the SRAM cell where the Carry bit is placed for addition. When Qxx is logic 1, CBx is logic 0, as shown on the left side of Fig. 5. During computation phase II, when one of the Qxx signals is logic 1, CNORx is pulled to logic 0 through the 2T Switch. C0 to C3 turn on any two signals at the same time for the NOR operation. To perform the AND operation on Qbxx, two Cx signals are turned on. When the two Qbxx signals are both logic 0, the CANDx signal remains logic 1. The layout of the 7T SRAM together with the 2T Switch is shown in Fig. 6.

C. Ripple Carry Adder and Multiplier (RCAM) Unit
The Ripple Carry Adder and Multiplier (RCAM) unit in Fig. 1 is shown in Fig. 7. It is composed of combinational circuits and reduced FS-GDI. As shown in Fig. 7, the right side is a degenerated FS-GDI logic where all the transistors, as seen from the left side of the figure, whose sources are connected to VDD or GND are removed. With this, the number of transistors and the power consumption are reduced and the computational complexity is increased.

TABLE II
LOGICAL FUNCTIONS IN AN RCAM UNIT

$CB = \overline{CI_x}$            $XOR = A \oplus B$
$CAND = A \cdot B$                $Sum = (A \oplus B) \oplus CI_x$
$CNAND = \overline{A \cdot B}$    $CO = (A \oplus B) \cdot CI_x + A \cdot B$
$CNOR = \overline{A + B}$         $Sign = XOR = C \oplus D$
                                  $Product = CAND = E \cdot F$
A, B = input addition bit; CI = carry in; CO = carry out;
C, D = input sign bit; E, F = input multiplication bit
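The logical functions of Table II can be sketched in a few lines of Python. This is only a behavioral illustration, not the FS-GDI circuit: the function names `rcam_bit` and `rcam_sign_product` are hypothetical, and the complemented outputs (CB, CNAND, CNOR) are assumed from the signal names and from the 2T Switch description in Section II-B.

```python
def rcam_bit(a: int, b: int, ci: int) -> dict:
    """Evaluate the Table II addition signals for one bit position (inputs are 0 or 1)."""
    xor = a ^ b
    cand = a & b
    return {
        "CB": 1 - ci,          # assumed: complement of the carry-in bit
        "CAND": cand,
        "CNAND": 1 - cand,     # assumed: complement of CAND
        "CNOR": 1 - (a | b),   # assumed: NOR of the two input bits
        "XOR": xor,
        "Sum": xor ^ ci,
        "CO": (xor & ci) | cand,
    }


def rcam_sign_product(c: int, d: int, e: int, f: int) -> dict:
    """Sign = C xor D and Product = E and F, as listed in Table II."""
    return {"Sign": c ^ d, "Product": e & f}


if __name__ == "__main__":
    # Print the full-adder truth table: A, B, CI, Sum, CO.
    for a in (0, 1):
        for b in (0, 1):
            for ci in (0, 1):
                out = rcam_bit(a, b, ci)
                print(a, b, ci, out["Sum"], out["CO"])
```

For every input combination, Sum + 2·CO equals A + B + CI, which is the defining property of a full adder and a quick way to check the Sum and CO expressions.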
Fig. 4. 2T Switch and its Control Circuit.

Fig. 5. 2T Switch.

Fig. 6. Layout of the 7T SRAM and 2T Switch.

Fig. 7. Ripple Carry Adder and Multiplier unit.

D. CIM Control Circuit
The CIM Control Circuit in Fig. 1 is shown in Fig. 8. It consists of a CIM Timing Control Circuit, an Auto-switching Pre-charge Control Circuit, an Address Selecting Control Circuit, and a CIM Control Circuit unit. When either the CIM or the MUL signal is pulled high, OP is also pulled high and the CIM Control Circuit starts to generate the corresponding calculation control signals to initiate the needed calculations. If the CIM signal is pulled high, the memory performs addition on the specified operation address, starting from BL0 in Fig. 2 and proceeding one bit line at a time until the CIM signal is turned off. On the other hand, when the MUL signal is pulled high, the memory performs multiplication on the specified operation address, starting from BL0, until the MUL signal turns off.

Referring to Fig. 8 and Fig. 9, when the CIM signal is pulled high, the Prec signal is pre-charged, and the S signal selected by the Carry_addr address starts to perform calculations. Dffwr is the waveform after wr_en in Fig. 1 is sampled by CLK, and represents the read and write interval. Write is performed at high potential, and read at low potential.

Fig. 8. CIM Control Circuit.

E. Auto-Switching Write Back Circuit
As shown in Fig. 10, when OP is pulled to a high potential, the Auto-Switching Write Back Circuit in Fig. 1 writes back the sum or product to an assigned address, starting from the LSB (BL0) and ending at the MSB. Its Data Switching Circuit supervises the data selection. Its output CIM_Data carries Carry and Sum for addition, and Product and Sign Product for multiplication.

Fig. 10. Auto-switching write back circuit.
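The selection behavior described for OP and CIM_Data can be illustrated with a small Python sketch. The function names `cim_data` and `write_back_order` are hypothetical, and the handling of CIM and MUL being high at the same time is an assumption, since the text does not define that case.

```python
from typing import Dict, List, Optional


def cim_data(cim: int, mul: int, carry: int, sum_bit: int,
             product: int, sign_product: int) -> Optional[Dict[str, int]]:
    """Return the values routed to CIM_Data for one bit line, or None when idle."""
    op = cim | mul  # OP goes high when either CIM or MUL is high
    if not op:
        return None
    if cim and mul:
        # Assumption: the text does not define this case, so the sketch rejects it.
        raise ValueError("CIM and MUL asserted together")
    if cim:
        return {"Carry": carry, "Sum": sum_bit}
    return {"Product": product, "Sign Product": sign_product}


def write_back_order(n_bits: int = 4) -> List[str]:
    """Bit lines are visited from the LSB (BL0) to the MSB."""
    return [f"BL{i}" for i in range(n_bits)]
```

For a 4-bit operand, `write_back_order()` lists BL0 through BL3, matching the LSB-to-MSB write-back order stated in the text.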

Fig. 9. Auto-switching write back operation.

F. Built-in Self-Test Circuit
The Built-In Self Test (BIST) circuit in Fig. 1 is presented in Fig. 11. When the self-test control line (BIST_EN) is high, the BIST circuit is activated. It has two modes, normal and retention, in which data and read or write signals are sent through the BIST Controller (Fig. 12). The Pattern Generator (PG) shown in Fig. 13 generates specific, sequential self-test addresses and sends them to the 1Kb SRAM array for reading and writing tests. When the Retention signal is high, the retention time test mode is activated. This mode latches BIST_Data and sets BIST_WR low so that the memory is read for a long time. The remaining actions are shown in the timing diagram of the self-test mode in Fig. 14. Finally, the data (BIST_Data) and the read/write signal (BIST_WR) are sent to the Output Response Analysis (ORA) circuit shown in Fig. 15, and the data after read/write (Data_out_b) are compared. When the read/write mode is low (BIST_WR = 0), the data is read from the memory cell and BIST_Data and Data_out are compared. If these two signals are in the same state, BIST_Pass is set high.
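The normal-mode self-test flow described above (sequential addresses, write, read back, compare in the ORA) can be sketched behaviorally as follows. This is an illustration under assumptions rather than the chip's logic: `SramModel`, `pattern_addresses`, `ora`, and `run_bist` are hypothetical names, the address counter is assumed to count up from 0 over the 5-bit BIST_Addr range, and the stuck-at fault is injected only to show a failing comparison.

```python
from typing import Dict, List, Optional


class SramModel:
    """Toy 32-word, 1-bit memory; `stuck_at` injects hypothetical stuck cells."""

    def __init__(self, size: int = 32, stuck_at: Optional[Dict[int, int]] = None):
        self.cells: List[int] = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr: int, value: int) -> None:
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr: int) -> int:
        return self.cells[addr]


def pattern_addresses(addr_bits: int = 5) -> List[int]:
    """Sequential self-test addresses (assumed to count up from 0)."""
    return list(range(2 ** addr_bits))


def ora(bist_wr: int, bist_data: int, data_out: int) -> Optional[int]:
    """BIST_Pass for a read cycle (BIST_WR = 0); None during a write cycle."""
    if bist_wr != 0:
        return None
    return 1 if bist_data == data_out else 0


def run_bist(mem: SramModel, bist_data: int = 1) -> bool:
    """Write bist_data to every address, then read back and compare."""
    for addr in pattern_addresses():
        mem.write(addr, bist_data)  # write phase, BIST_WR = 1
    # Read phase, BIST_WR = 0: every comparison must pass.
    return all(ora(0, bist_data, mem.read(a)) == 1 for a in pattern_addresses())
```

Calling `run_bist(SramModel())` passes on the fault-free model, while `run_bist(SramModel(stuck_at={7: 0}))` fails because address 7 cannot hold a 1.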
Fig. 11. Built-in Self-Test (BIST) circuit.

Fig. 12. BIST Controller circuit.

Fig. 13. Pattern Generator (PG) circuit.

Fig. 14. BIST Circuit Timing Diagram (normal testing mode and retention testing mode).

Fig. 15. Output Response Analysis (ORA) circuit.

G. 2T Switch Current Compensation Circuit
As mentioned earlier, calculations are started when signals S2 and PreC are pulled high. However, due to the high leakage current of the advanced process and the charge-sharing architecture of the 2T Switch, the value of CBx should be maintained at high (i.e., Q2x is at high), but the loss of charge still causes the voltage to drop, so the Current Compensation Circuit in the dashed box in Fig. 16 needs to be added.

III. IMPLEMENTATION AND MEASUREMENT
The CIM is realized using the TSMC 40-nm CMOS process. Fig. 17 shows the layout and die micrograph. The core area is 432.81×510.265 µm². The whole chip size is 840.9×867.31 µm².

As shown in Fig. 18, an Agilent E3631A and an Agilent 81250 were used as the 0.9-V power supply and the pattern generator, respectively, while a Keysight MXR254 was used to measure the CIM.

The CIM's power consumption is 3.911 mW. Its instructional performance or throughput is $\frac{8}{1.43 \times 10^{-9}} = 5.594$ GOPS $= 0.005594$ TOPS, where $1.43 \times 10^{-9}$ s is the precharge value of a normal processing element (PE) cell for the addition and multiplication operation; since there are 4 input bits in 1 set and there are two operations in the RCAM, a total of 8 sets are calculated in parallel. Its energy efficiency (throughput divided by power consumption) is 1.43 TOPS/W. Its area efficiency (throughput/core area) is 0.027 TOPS/mm². The density of on-chip SRAM is 22.513%.

500 sets of Monte Carlo simulations were performed for this architecture. Figs. 19, 20, and 21 show the simulation results of data transition, reading data 0, and reading data 1, respectively. During data transition, the drift range on the time axis is about 375.08 ns to 375.19 ns. Meanwhile, when data 0 is read, the voltage drift range is about 2.4 µV to 10.1 µV. Finally, when reading data 1, the voltage drift range is about 899.78 mV to 900 mV. The Monte Carlo simulation results show that this framework can effectively limit fluctuations in the resulting circuit characteristics, thereby improving the wafer yield.
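The throughput and energy-efficiency figures reported in this section can be re-derived with a short calculation. The sketch below only restates the arithmetic from the quoted numbers (8 parallel operations, a 1.43 ns precharge value, 3.911 mW); the variable names are illustrative, and the precharge value is assumed to be in seconds.

```python
PARALLEL_OPS = 8        # 4 input bits x 2 operations in the RCAM
PRECHARGE_S = 1.43e-9   # precharge value of a PE cell, assumed to be in seconds
POWER_W = 3.911e-3      # measured power consumption

throughput_ops = PARALLEL_OPS / PRECHARGE_S    # operations per second
throughput_gops = throughput_ops / 1e9
throughput_tops = throughput_ops / 1e12
energy_eff_tops_per_w = throughput_tops / POWER_W

print(f"throughput        = {throughput_gops:.3f} GOPS")
print(f"energy efficiency = {energy_eff_tops_per_w:.2f} TOPS/W")
```

Note that the energy efficiency is the throughput divided by the power, which is what yields a value in TOPS/W.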

Fig. 16. 2T-Switch Current Compensation Circuit.

Fig. 17. (a) Layout and (b) die micrograph of the CIM.

Fig. 18. Measurement setup for the proposed CIM.

Fig. 19. Monte Carlo simulation results of data transition.

Fig. 20. Monte Carlo simulation results of reading data 0.

Fig. 21. Monte Carlo simulation results of reading data 1.

Fig. 22. Monte Carlo Bell Chart of Reading Data 1.

Fig. 23. Monte Carlo Bell Chart of Reading Data 0.

Fig. 24. Plot of the CIM's SNM.

Fig. 25. Plot of the CIM's DNM.

Fig. 22 shows the Monte Carlo bell chart of data 1 (high potential) versus the number of occurrences when data 1 is read by this CIM. From the bell-shaped statistical graph, its average value (µ) is 0.9 V, which is in line with the designed operating voltage, and its standard deviation (σ) is 3.66 µV. Since the upper limit of the operating voltage is 0.9 V, there is no µ + 3σ data when using bell statistics. Based on Fig. 22, most of the data are within the range of positive two standard deviations and negative three standard deviations of the mean. As shown in the figure, read data 1 corresponds to the normal distribution and conforms to the three-sigma rule of thumb.

Fig. 23 shows the Monte Carlo bell chart of data 0 (low potential) versus the number of occurrences when the CIM reads data 0. From the bell-shaped statistical graph, its µ is 2.83 µV, which is close to the low potential of data 0, and its σ is 857 nV. Because the lower limit of data 0 is 0 V, there is no data for µ − 2σ and µ − 3σ when bell-shaped statistics are used. However, based on the same figure, most of the data are within the range of positive three standard deviations and negative one standard deviation of the mean. This shows that read data 0 corresponds to the normal distribution and conforms to the three-sigma rule of thumb.

Fig. 24 and Fig. 25 show the static noise margin (SNM) and dynamic noise margin (DNM) of the proposed CIM, respectively. The SNM has a good value of 840 mV. For the DNM, any noise can be resisted under a pulse width of 90 ps at a pulse voltage of 0.3 V when the supply voltage (VDD) is 0.9 V and the clock frequency is 100 MHz.

To clarify the addition operation of the CIM, we let augend X (1101) + addend Y (0011) = sum (0000), as shown in Fig. 26. With reference to Fig. 26, cell 20 (Carry bit) is initially low. Next, the values of the first 3 cells along the bit line (BL0), namely, 00, 10, and 20, are added; the sum is written in the 4th cell of the same bit line (cell 30) and the carry is written in the 3rd cell of the next bit line (BL1). The process is repeated until the full sum is obtained. The output waveform for the CIM's adder is displayed in Fig. 27. The sequence of the resulting bits can be seen at the arrow flow from cell 20 to cell 33.

Fig. 26. Operation example of the 4-bit addition.

Meanwhile, to explain the multiplication operation of the CIM, we implement the multiplication on a sample word for demonstration purposes to show that our proposed chip can execute 4-bit multiplication. The sign bits for positive and negative are represented as 0 and 1 (as reflected in Sign X and Sign Y), respectively. We represent multiplicand bits +0, -1, +0, and -1 (i.e. X = 0101) and multiplier bits +1, +0, -0, and -1 (i.e. Y = 1001), which generate the results +0, -0, -0, and +1 (i.e. Product = 0001), as shown in Fig. 28. With reference to Fig. 28, the product of the bits on WL0 and WL1 is reflected on WL3 while the sign of the Product is reflected on WL2. The sequence of the resulting bits can be seen at the arrow flow from cell 30 to cell 23. The output waveform for the CIM's multiplier is presented in Fig. 29. Figs. 30 and 31 show the actual waveforms for the BIST's normal and retention modes, respectively.
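The two worked examples (4-bit addition and bit-wise signed multiplication) can be traced with a small Python model of the memory rows. This is a behavioral sketch under assumptions: the function names are hypothetical, bits are listed LSB first (BL0 to BL3) as in the cell layout, and the row assignment (X on WL0, Y on WL1, Carry or Sign Product on WL2, Sum or Product on WL3) follows the description of the operation examples.

```python
from typing import List, Tuple


def cim_add(x_bits: List[int], y_bits: List[int]) -> Tuple[List[int], List[int], int]:
    """Ripple-carry addition over bit lines BL0..BLn-1 (LSB first).

    Returns (sum_row, carry_row, carry_out); carry_row[i] is the carry-in
    stored on WL2 for bit line i, so carry_row[0] starts low.
    """
    n = len(x_bits)
    sum_row = [0] * n
    carry_row = [0] * n
    carry = 0
    for bl in range(n):
        carry_row[bl] = carry                      # carry cell read for this bit line
        a, b = x_bits[bl], y_bits[bl]
        sum_row[bl] = a ^ b ^ carry                # written to WL3 of the same bit line
        carry = (a & b) | ((a ^ b) & carry)        # written to WL2 of the next bit line
    return sum_row, carry_row, carry


def cim_signed_multiply(x_bits: List[int], x_sign: List[int],
                        y_bits: List[int], y_sign: List[int]) -> Tuple[List[int], List[int]]:
    """Bit-wise product (AND) and bit-wise sign (XOR), LSB first."""
    product = [e & f for e, f in zip(x_bits, y_bits)]
    sign = [c ^ d for c, d in zip(x_sign, y_sign)]
    return product, sign


if __name__ == "__main__":
    # X = 1101 and Y = 0011 (MSB first), written LSB first below.
    print(cim_add([1, 0, 1, 1], [1, 1, 0, 0]))
    # X = 0101 with signs +,-,+,- and Y = 1001 with signs +,+,-,- (MSB first).
    print(cim_signed_multiply([1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0]))
```

For the addition example, the model gives a sum row of all zeros with a final carry-out of 1, since 13 + 3 = 16 overflows four bits. For the multiplication example, only the least significant product bit is 1 and its sign is positive.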
TABLE III
PERFORMANCE COMPARISON OF CIM ARCHITECTURES

                                TVLSI [10]    TCAS-I [4]  TCAS-I [4]  TCAS-I [4]  JSSC [11]    CICC [12]    This work
Year                            2017          2018        2018        2018        2017         2015         2021
Process                         65 nm         PTM 45 nm   PTM 45 nm   PTM 45 nm   40 nm        28 nm        TSMC 40 nm
Verification                    Meas.         Simul.      Simul.      Simul.      Simul.       Meas.        Meas.
Supply voltage (V)              1.2           1.1         1.1         1.1         0.6          0.7          0.9
Cell type                       6T            8T          8T          8+T         5T           8T           7T (single-ended)
Operation                       SRAM          NAND        NAND        NOR         SRAM for     SRAM for     NAND, NOR, XOR
                                              NOR         IMP         XOR         face         image        SRAM for AI applications
                                              XOR         XOR         RCS         recognition  recognition  Addition
                                              RCS         RCS                                               Multiplication
                                                                                                            Normal mode
                                                                                                            Retention mode
Array size                      32×32 (1 Kb)  N.A.        N.A.        N.A.        4 Mb         64 Kb        1 Kb
Write frequency (MHz)           100           N.A.        N.A.        N.A.        N.A.         18.2         100
Norm. write energy^1 (fJ/bit)   39.5          N.A.        N.A.        N.A.        N.A.         23.7         168
Read frequency (MHz)            166           N.A.        N.A.        N.A.        100          18.2         100
Norm. read energy^2 (fJ/bit)    4.1           N.A.        N.A.        N.A.        64           51.8         224
Norm. avg. energy^3 (fJ/bit)    21.8          8.5         5.5         14.6        N.A.         37.7         196

^1 Norm. write energy = $\frac{\text{Write energy}}{\text{Process}^2} \times 10^3$
^2 Norm. read energy = $\frac{\text{Read energy}}{\text{Process}^2} \times 10^3$
^3 Norm. avg. energy = $\frac{\text{Avg. energy}}{\text{Process}^2} \times 10^3$

Fig. 27. Waveform for the 4-bit addition.

Fig. 28. Operation of the multiplication demonstrated on 1 word sample.

Fig. 29. Waveform for the 4-bit multiplication.

Fig. 30. Waveform for the BIST normal mode.

Table III shows the performance comparison of the proposed CIM with the other CIM architectures. The measurement results of the proposed CIM chip show that the normalized average energy is higher, since our design (40-nm process) attains the highest clock rate. Compared with [12], which also reported measurement results, our Read/Write energy/MHz is $\frac{168+224}{100} = 3.92$, which is smaller than that of [12] (28-nm process), which is $\frac{23.7+51.8}{18.2} = 4.15$. Besides, our CIM has the largest number of performed operations (NAND, NOR, XOR, addition, and multiplication) and functions (normal and retention modes for BIST) among the prior CIMs.
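The energy-per-MHz comparison with [12] can be checked directly from the normalized write and read energies in Table III. The helper name `energy_per_mhz` is illustrative; the sketch only reproduces the arithmetic quoted in the text.

```python
def energy_per_mhz(write_fj_per_bit: float, read_fj_per_bit: float,
                   freq_mhz: float) -> float:
    """Sum of normalized write and read energy divided by the clock frequency."""
    return (write_fj_per_bit + read_fj_per_bit) / freq_mhz


this_work = energy_per_mhz(168, 224, 100)     # proposed CIM, 40-nm process
ref_12 = energy_per_mhz(23.7, 51.8, 18.2)     # CICC [12], 28-nm process

print(f"this work: {this_work:.2f}")
print(f"[12]:      {ref_12:.2f}")
```

The same write and read values also give the normalized average energies in the last row of the table, for example (168 + 224) / 2 = 196 for this work.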
Fig. 31. Waveform for the BIST retention mode.

IV. CONCLUSION
This work presents a 40-nm CMOS-based multifunctional CIM architecture using single-ended disturb-free 7T 1Kb SRAM. The proposed CIM addresses the von Neumann bottleneck, the accumulation issues in the 5T SRAM, and the high power consumption and large chip area through FS-GDI. Finally, it performs the largest number of operations and functions among the prior CIMs.

REFERENCES
[1] J. Backus, “Can programming be liberated from the von Neumann style?:
A functional style and its algebra of programs,” Commun. ACM, vol. 21,
no. 8, pp. 613-641, Aug. 1978.
[2] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory
with spin-transfer torque magnetic RAM,” IEEE Trans. on Very Large
Scale Integration Systems (TVLSI), vol. 26, no. 3, pp. 470-483, Mar. 2018.
[3] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S.
Miyoshi, M. Yasuda, D. Blaauw, and D. Sylvester, “A 4 + 2T SRAM for
searching and in-memory computing with 0.3-V VDDmin ,” IEEE Journal
of Solid-State Circuits (JSSC), vol. 53, no. 4, pp. 1006-1015, Apr. 2018.
[4] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, “X-SRAM: Enabling in-
memory boolean computations in CMOS static random access memories,”
IEEE Trans. Circuits Syst. I, Reg. Papers (TCAS-I), pp. 1-14, Jul 2018.
[5] C.-C. Wang, Y.-L. Tseng, H.-Y. Leo, and R. Hu, “A 4-kB 500-MHz 4-T CMOS SRAM using low-V_THN bitline drivers and high-V_THP latches,” IEEE Trans. on Very Large Scale Integration Systems (TVLSI), vol. 12, no. 9, pp. 901-909, Sep. 2004.
[6] C.-C. Wang, C.-L. Lee, and W.-J. Lin, “A 4-Kb low power SRAM
design with negative word-line scheme,” IEEE Trans. Circuits Syst. I,
Reg. Papers (TCAS-I), vol. 54, no. 5, pp. 1069-1076, May 2007.
[7] C.-C. Wang, and C.-L. Hsieh, “Disturb-free 5T loadless SRAM cell
design with multi-vth transistors using 28 nm CMOS process,” in Proc.
IEEE Inter. SoC Design Conf. (ISOCC), pp. 103-104, Oct. 2016.
[8] A. Morgenshtein, A. Fish, and I. A. Wagner “Gate-diffusion input (GDI):
a power-efficient method for digital combinatorial circuits,” IEEE Trans.
on Very Large Scale Integration Systems (TVLSI), vol. 10, no. 5, pp.
566-581, Oct. 2002.
[9] M. A. Ahmed, and M. A. Abdelghany, “Low power 4-Bit arithmetic logic
unit using full-swing GDI technique,” in Proc. Inter. Conf. on Innovative
Trends in Computer Engineering (ITCE), pp. 193-196, Feb. 2018.
[10] J. Lee, D. Shin, Y. Kim, and H. J. Yoo, “A 17.5-fJ/bit energy-efficient
analog SRAM for mixed-signal processing,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2714-2723,
Oct. 2017.
[11] D. Jeon, Q. Dong, Y. Kim, X. Wang, S. Chen, H. Yu, D. Blaauw, and D. Sylvester, “A 23-mW face recognition processor with mostly-read 5T memory in 40-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52, no. 6, pp. 1628-1642, Jun. 2017.
[12] H. Mori, T. Nakagawa, Y. Kitahara, Y. Kawamoto, K. Takagi, S.
Yoshimoto, S. Izumi, K. Nii, H. Kawaguchi, and M. Yoshimoto, “A 298-
fJ/writecycle 650-fJ/readcycle 8T three-port SRAM in 28-nm FD-SOI
process technology for image processor,” in Proc. 2015 IEEE Custom
Integrated Circuits Conference (CICC), pp. 1-4, Sept. 2015.
