Cim Iscas 2021
[Figure: block diagram of the CIM. Blocks: BIST, SRAM Control Circuit, Row Decoder, Column Decoder, Column Selector, 1K-bit single-ended SRAM array with Current Compensation Circuit, 2T-Switch, Auto-Switching Write Back Circuit, FS-GDI Ripple Carry Adder, CIM Control Circuit.]

[Figure: 4 × 4 portion of the SRAM array (cells 00 to 33 with nodes Qxx and Qbxx), word lines WL0 to WL3, switch controls S0 to S3 and C0 to C3, and one RCA unit per column connected to CBx, CNORx and CANDx.]

... initialization, the PreC signal is low in each write cycle. After CBx, CNORx and CANDx are precharged to high, PreC is pulled high to operate in start-up mode. Two sets of control signals, Sx and Cx, control the 2T Switch to couple Qxx or Qbxx to CBx, CNORx and CANDx. During computation phase I, only one of the signals S0 to S3 can be logic high at a time, to select the SRAM cell where the Carry bit is placed ...

... number of transistors and the power consumption are reduced and the computational complexity is increased.

TABLE II
LOGICAL FUNCTIONS IN AN RCAM UNIT

CB    = CIx         XOR     = A ⊕ B
CAND  = A · B       Sum     = (A ⊕ B) ⊕ CIx
CNAND = ¬(A · B)    CO      = (A ⊕ B) · CIx + A · B
CNOR  = ¬(A + B)    Sign    = XOR = C ⊕ D
                    Product = CAND = E · F

A, B = input addition bits; CI = carry in; CO = carry out;
C, D = input sign bits; E, F = input multiplication bits

Fig. 3. 7T SRAM and Its Control Circuit. [Devices: MP201, MP202, MN201 to MN206; signals: PreD, WA, WAB, WLx, Qxx, Qbxx, Vleak.]

When writing 0, PreD is held high to pull the reverse bit line (BLBx) to ground while the cell is selected for writing. With WLx and WAB high, WA is turned off and Qxx is pulled to ground. Conversely, when writing 1, WA is turned on and Qbxx is pulled to ground, so that VDD pulls Qxx high through MP202. When reading, WA and WLx are turned on to transmit Qbxx to BLBx, and the value of BLx is generated through an inverter. The SRAM Control Circuit controls WA and WAB based on the inputs Data_inx, WLx and PreD, as shown in Table I.

TABLE I
SRAM CONTROL CIRCUIT FUNCTION TABLE

Mode      Data_inx  PreD  WLx  WA  WAB
Standby   X         X     0    0   0
Read      X         0     1    1   0
Write 0   0         1     1    0   1
Write 1   1         1     1    1   0
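
The write and read conditions described above, together with Table I, amount to a small decode from (Data_inx, PreD, WLx) to (WA, WAB). A minimal Python sketch of that decode follows. The function name `sram_control` and the use of `None` for a don't-care input are illustrative assumptions, not part of the design.

```python
from typing import Optional, Tuple


def sram_control(data_in: Optional[int], pred: Optional[int], wlx: int) -> Tuple[int, int]:
    """Return (WA, WAB) for one cell, following the four rows of Table I.

    None stands for a don't-care (X) input. This is a behavioural sketch,
    not a description of the transistor-level control circuit.
    """
    if wlx == 0:
        # Standby: word line not selected, both access controls stay off.
        return (0, 0)
    if pred == 0:
        # Read: WA on, WAB off.
        return (1, 0)
    # Write: PreD = 1 and WLx = 1, so the data bit decides which side turns on.
    if data_in not in (0, 1):
        raise ValueError("a write needs data_in of 0 or 1")
    return (1, 0) if data_in == 1 else (0, 1)


if __name__ == "__main__":
    for mode, args in [("Standby", (None, None, 0)), ("Read", (None, 0, 1)),
                       ("Write 0", (0, 1, 1)), ("Write 1", (1, 1, 1))]:
        print(mode, sram_control(*args))
```

Note that Read and Write 1 drive the same (WA, WAB) pair in Table I; in this sketch they are told apart only by PreD.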
WANG et al.: A 40-NM CMOS MULTIFUNCTIONAL COMPUTING-IN-MEMORY (CIM) USING SINGLE-ENDED DISTURB-FREE 7T 1KB SRAM
[Figure: one SRAM array column on BL0 (cells 00 to 30, word lines WL0 to WL3, switch controls S0 to S3 and C0 to C3) with pre-charge circuits, 2T Switch and a SUM MUX selected by In_sel0 to In_sel2; signals include Write_in, Data_in, CI, Sign, Product0 and PreC.]

Fig. 6. Layout of the 7T SRAM and 2T Switch.

[Figure: RCA unit schematic with nodes CANDx, CNANDx, CNORx, XORx, XNORx, CIx, CBx, COx and SUMx; devices MP3, MP6, MP7, MN8, MN9 and MN10.]
... selection. Its output, CIM_Data, provides Carry and Sum for addition, and Product and Sign Product for multiplication.
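
For reference, the single-bit functions of Table II that produce these outputs can be written out and checked exhaustively. The sketch below is in Python. The function names are hypothetical, and CNAND and CNOR are taken to be the complements of AND and OR, as their names suggest.

```python
from itertools import product


def rca_unit(a: int, b: int, ci: int) -> dict:
    """Evaluate the Table II addition functions for one bit position (inputs 0 or 1)."""
    cand = a & b          # CAND = A . B
    cnand = cand ^ 1      # assumed: complement of CAND
    cnor = (a | b) ^ 1    # assumed: complement of A + B
    xor = a ^ b           # XOR = A xor B
    return {
        "CAND": cand,
        "CNAND": cnand,
        "CNOR": cnor,
        "XOR": xor,
        "Sum": xor ^ ci,            # Sum = (A xor B) xor CI
        "CO": (xor & ci) | cand,    # CO = (A xor B) . CI + A . B
    }


def sign_bit(c: int, d: int) -> int:
    """Sign = C xor D."""
    return c ^ d


def product_bit(e: int, f: int) -> int:
    """Product = E . F."""
    return e & f


# Sum and CO together must encode the arithmetic sum of the three input bits.
for a, b, ci in product((0, 1), repeat=3):
    out = rca_unit(a, b, ci)
    assert out["Sum"] + 2 * out["CO"] == a + b + ci
```

The exhaustive loop confirms that the Sum and CO expressions form a standard full adder.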
[Figure: CIM control circuit, comprising the CIM Timing Control Circuit, the Auto-switching Pre-charge Control Circuit, the BL auto-switching circuit, the WL auto-switching circuit, the Data switching circuit and the Address selecting Control Circuit.]
Fig. 8. CIM Control Circuit.

[Figure: BIST Controller with Output Response Analyzer (ORA); signals: BIST_EN, BIST_WR, BIST_Data, Retention, Retention_test, Data_out, BIST_Pass.]
Fig. 12. BIST Controller circuit.
Finally, when reading data 1, the voltage drift range is about 899.78 mV to 900 mV. The Monte Carlo simulation results show that this framework can effectively suppress fluctuations in the resulting circuit characteristics, thereby improving the wafer yield.

Fig. 13. Pattern Generator (PG) circuit. [Signals: BIST_ADDR[0] to BIST_ADDR[4].]

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2015

Fig. 17. (a) Layout and (b) die micrograph of the CIM.
Fig. 21. Monte Carlo simulation results of reading data 1.
[Figure: addition demonstrated on one word. Cell values, left to right:
WL0 (X, SignX), cells 00 to 07: 1 0 1 1 1 1 0 0
WL1 (Y, SignY), cells 10 to 17: 1 1 0 0 1 1 0 0
WL2 (Carry), cells 20 to 23: 0 1 1 1
WL3, cells 30 to 33: 0 0 0 0]

... low. Next, the values of the first three cells along the bit line BL0, namely cells 00, 10 and 20, are added; the sum is written into the fourth cell of the same bit line (cell 30), and the carry is written into the third cell of the next bit line (BL1). The process is repeated until the complete sum is obtained. The output waveform for the CIM's adder is displayed in Fig. 27. The sequence of the resulting bits follows the arrow flow from cell 20 to cell 33. Meanwhile, to explain the multiplication operation of the
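
The column-by-column addition described above can be sketched in a few lines of Python. The row layout (operands in rows 0 and 1, carry in row 2, sum in row 3) follows the text. The function name `cim_add` is hypothetical, and the operand bits are the values read from the accompanying figure, interpreted here as least-significant bit first (an assumption that is consistent with the carry row shown there).

```python
from typing import List, Tuple


def cim_add(x_bits: List[int], y_bits: List[int], carry_in: int = 0) -> Tuple[List[int], List[int], int]:
    """Bit-serial addition across bit lines, one column per step.

    Column j adds cells 0j (X), 1j (Y) and 2j (carry); the sum goes to
    cell 3j and the carry to cell 2(j+1) on the next bit line.
    Returns (sum_row, carry_row, carry_out).
    """
    n = len(x_bits)
    carry_row = [carry_in] + [0] * (n - 1)   # cells 20, 21, ...
    sum_row = [0] * n                        # cells 30, 31, ...
    carry = carry_in
    for j in range(n):
        total = x_bits[j] + y_bits[j] + carry_row[j]
        sum_row[j] = total & 1               # written to cell 3j
        carry = total >> 1
        if j + 1 < n:
            carry_row[j + 1] = carry         # written to cell 2(j+1)
    return sum_row, carry_row, carry


if __name__ == "__main__":
    x = [1, 0, 1, 1]   # cells 00 to 03 as read from the figure
    y = [1, 1, 0, 0]   # cells 10 to 13 as read from the figure
    print(cim_add(x, y))
```

With these operands every column produces a sum bit of 0 and a carry of 1, so the carry ripples through all four bit lines; that makes this word a convenient test of the carry path.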
TABLE III
PERFORMANCE COMPARISON OF CIM ARCHITECTURES

Parameter                       TVLSI [10]      TCAS-I [4]                 JSSC [11]                  CICC [12]                   This work
Year                            2017            2018                       2017                       2015                        2021
Process                         65 nm           PTM 45 nm                  40 nm                      28 nm                       TSMC 40 nm
Verification                    Meas.           Simul. / Simul. / Simul.   Simul.                     Meas.                       Meas.
Supply voltage (V)              1.2             1.1                        0.6                        0.7                         0.9
Cell type                       6T              8T / 8T / 8+T              5T                         8T                          7T (single-ended)
Operation                       SRAM            NAND, NOR, IMP, XOR, RCS   SRAM for face recognition  SRAM for image recognition  NAND, NOR, XOR; SRAM for AI applications; addition; multiplication; normal mode; retention mode
Array size                      32×32 (1 Kb)    N.A.                       4 Mb                       64 Kb                       1 Kb
Write frequency (MHz)           N.A.            N.A.                       100                        18.2                        100
Write norm. energy¹ (fJ/bit)    N.A.            N.A.                       39.5                       23.7                        168
Read frequency (MHz)            166             N.A.                       100                        18.2                        100
Read norm. energy² (fJ/bit)     4.1             N.A.                       64                         51.8                        224
Norm. avg. energy³ (fJ/bit)     21.8            8.5 / 5.5 / 14.6           N.A.                       37.7                        196

Values separated by slashes in the TCAS-I [4] column correspond to its three cell variants (8T, 8T, 8+T).

¹ $\text{Norm. write energy} = \dfrac{\text{Write energy}}{\text{Process}^2} \times 10^3$
² $\text{Norm. read energy} = \dfrac{\text{Read energy}}{\text{Process}^2} \times 10^3$
³ $\text{Norm. avg. energy} = \dfrac{\text{Avg. energy}}{\text{Process}^2} \times 10^3$
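
The three footnotes share one normalization: the energy is divided by the square of the process node and scaled by 10^3. A short Python sketch of that formula and its inverse is given below. It assumes the process is expressed in nm and the energy in fJ/bit; the function names are hypothetical.

```python
def normalized_energy(energy_fj_per_bit: float, process_nm: float) -> float:
    """Footnote formula: energy / process^2 * 10^3 (process assumed to be in nm)."""
    return energy_fj_per_bit / process_nm ** 2 * 1e3


def raw_energy(norm_energy: float, process_nm: float) -> float:
    """Inverse of the footnote formula, recovering the un-normalized energy."""
    return norm_energy * process_nm ** 2 / 1e3


if __name__ == "__main__":
    # "This work" column of Table III: 40 nm, normalized write 168, read 224.
    write_norm, read_norm, process = 168.0, 224.0, 40.0
    print("implied raw write energy:", raw_energy(write_norm, process))
    print("implied raw read energy:", raw_energy(read_norm, process))
    print("mean of write and read:", (write_norm + read_norm) / 2)
```

For the "This work" column, the listed normalized average (196) equals the arithmetic mean of the write (168) and read (224) figures. This is an observation from the table values, not a definition stated in the footnotes.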
Fig. 27. Waveform for the 4-bit addition.

Fig. 28. Operation of the multiplication demonstrated on 1 word sample.
[Cell values, left to right:
WL0 (X, SignX), cells 00 to 07: 1 0 1 0 1 0 1 0
WL1 (Y, SignY), cells 10 to 17: 1 0 0 1 1 1 0 0
WL2 (SignProduct, Product): 0 1 1 0]

Fig. 29. Waveform for the 4-bit multiplication.

[Waveform signals in Figs. 27 and 29: pass, Data_out, clk, wr_en, Data_in, Word_addr0, Bit_addr0.]
IV. CONCLUSION
This work presents a 40-nm CMOS multifunctional CIM architecture
using a single-ended disturb-free 7T 1-Kb SRAM. The proposed CIM
addresses the von Neumann bottleneck, the accumulation issues in the
5T SRAM, and, through FS-GDI, the high power consumption and large
chip area. Finally, it supports the largest number of operations and
functions among the prior CIMs compared.
REFERENCES
[1] J. Backus, “Can programming be liberated from the von Neumann style?:
A functional style and its algebra of programs,” Commun. ACM, vol. 21,
no. 8, pp. 613-641, Aug. 1978.
[2] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory
with spin-transfer torque magnetic RAM,” IEEE Trans. on Very Large
Scale Integration Systems (TVLSI), vol. 26, no. 3, pp. 470-483, Mar. 2018.
[3] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S.
Miyoshi, M. Yasuda, D. Blaauw, and D. Sylvester, “A 4 + 2T SRAM for
searching and in-memory computing with 0.3-V VDDmin,” IEEE Journal
of Solid-State Circuits (JSSC), vol. 53, no. 4, pp. 1006-1015, Apr. 2018.
[4] A. Agrawal, A. Jaiswal, C. Lee, and K. Roy, “X-SRAM: Enabling in-
memory boolean computations in CMOS static random access memories,”
IEEE Trans. Circuits Syst. I, Reg. Papers (TCAS-I), pp. 1-14, Jul. 2018.
[5] C.-C. Wang, Y.-L. Tseng, H.-Y. Leo, and R. Hu, “A 4-kB 500-MHz 4-T
CMOS SRAM using low-V_THN bitline drivers and high-V_THP
latches,” IEEE Trans. on Very Large Scale Integration Systems
(TVLSI), vol. 12, no. 9, pp. 901-909, Sep. 2004.
[6] C.-C. Wang, C.-L. Lee, and W.-J. Lin, “A 4-Kb low power SRAM
design with negative word-line scheme,” IEEE Trans. Circuits Syst. I,
Reg. Papers (TCAS-I), vol. 54, no. 5, pp. 1069-1076, May 2007.
[7] C.-C. Wang, and C.-L. Hsieh, “Disturb-free 5T loadless SRAM cell
design with multi-vth transistors using 28 nm CMOS process,” in Proc.
IEEE Inter. SoC Design Conf. (ISOCC), pp. 103-104, Oct. 2016.
[8] A. Morgenshtein, A. Fish, and I. A. Wagner, “Gate-diffusion input (GDI):
a power-efficient method for digital combinatorial circuits,” IEEE Trans.
on Very Large Scale Integration Systems (TVLSI), vol. 10, no. 5, pp.
566-581, Oct. 2002.
[9] M. A. Ahmed, and M. A. Abdelghany, “Low power 4-Bit arithmetic logic
unit using full-swing GDI technique,” in Proc. Inter. Conf. on Innovative
Trends in Computer Engineering (ITCE), pp. 193-196, Feb. 2018.
[10] J. Lee, D. Shin, Y. Kim, and H. J. Yoo, “A 17.5-fJ/bit energy-efficient
analog SRAM for mixed-signal processing,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2714-2723,
Oct. 2017.
[11] D. Jeon, Q. Dong, Y. Kim, X. Wang, S. Chen, H. Yu, D. Blaauw, and
D. Sylvester, “A 23-mW face recognition processor with mostly-read 5T
memory in 40-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 52,
no. 6, pp. 1628-1642, Jun. 2017.
[12] H. Mori, T. Nakagawa, Y. Kitahara, Y. Kawamoto, K. Takagi, S.
Yoshimoto, S. Izumi, K. Nii, H. Kawaguchi, and M. Yoshimoto, “A 298-
fJ/writecycle 650-fJ/readcycle 8T three-port SRAM in 28-nm FD-SOI
process technology for image processor,” in Proc. 2015 IEEE Custom
Integrated Circuits Conference (CICC), pp. 1-4, Sep. 2015.