2024 IEEE International Solid-State Circuits Conference
ISSCC 2024 / SESSION 34 / COMPUTE-IN-MEMORY / 34.2
34.2 A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-
Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W
and 33.2-91.2TFLOPS/W for AI-Edge Devices
Win-San Khwa*¹, Ping-Chun Wu*², Jui-Jen Wu¹, Jian-Wei Su²,³, Ho-Yu Chen²,
Zhao-En Ke², Ting-Chien Chiu², Jun-Ming Hsu², Chiao-Yen Cheng²,
Yu-Chen Chen², Chung-Chuan Lo², Ren-Shuo Liu², Chih-Cheng Hsieh²,
Kea-Tiong Tang², Meng-Fan Chang¹,²
¹TSMC Corporate Research, Hsinchu, Taiwan
²National Tsing Hua University, Hsinchu, Taiwan
³Industrial Technology Research Institute, Hsinchu, Taiwan
*Equally Credited Authors (ECAs)
Advanced AI-edge chips require computational flexibility and high energy efficiency (EEF) with sufficient inference accuracy for a variety of applications. Floating-point (FP) numerical representation can be used for complex neural networks (NNs) requiring high inference accuracy; however, such an approach requires more energy and more parameter storage than does a fixed-point integer (INT) representation. Many compute-in-memory (CIM) designs achieve a good EEF for INT multiply-and-accumulate (MAC) operations; however, few support FP-MAC operations [1-3]. Implementing INT/FP dual-mode (DM) MAC operations presents challenges (Fig. 34.2.1), including (1) low area efficiency, since FP-MAC circuits sit idle during INT-MAC operations; (2) high system-level latency, due to NN data-update interruptions on small-capacity SRAM-CIM without concurrent write-and-compute functionality; and (3) high energy consumption, due to repeated system-to-CIM data transfers during computation. This work presents an INT/FP DM macro featuring (1) a DM zone-based input (IN) processing scheme (ZB-IPS) that eliminates subtraction in exponent (EXP) computation while reusing the alignment circuit in INT mode to improve EEF and area efficiency (AEF); (2) a DM local-computing cell (DM-LCC), which reuses the EXP adder as an adder-tree stage for INT-MAC to improve AEF in INT mode; and (3) a stationary-based two-port gain-cell (GC) array (SB-TP-GCA) to support concurrent data update and computation while reducing system-to-CIM and internal data accesses to improve EEF and latency (T_MAC).
A 16nm 96Kb INT/FP DM GC-CIM macro with 4T GCs is fabricated to support FP-MAC with 64 accumulations (N_ACCU) for BF16-IN, BF16-W, and FP32-OUT, as well as INT-MAC with N_ACCU = 128 for 8b-IN, 8b-W, and 23b-OUT. This CIM macro achieves a 163.3TOPS/W INT-MAC and a 91.2TFLOPS/W FP-MAC EEF.
Figure 34.2.2 illustrates the CIM structure and dataflow. The conventional FP-CIM structure has low area efficiency in INT mode, as its EXP adders and alignment circuits sit idle. The DM CIM structure uses DM adders (DM-ADDs) as an adder tree for 2× N_ACCU, and the alignment circuit as an IN-sparsity-aware circuit (INAC), to improve EEF and AEF in INT mode. The macro consists of 24 banks, each with an output channel. Each DM CIM bank includes a DM zone-based IN processing unit (ZB-IPU), a DM GC computing array (DM-GCCA), a digital shift-and-adder (DSaA), and a timing controller (CTRL). The DM-GCCA consists of 64 GC computing blocks (GC-CB), each containing an SB-TP-GCA for 64b of storage data and 16b of stationary data, and a DM-LCC comprising a DM-ADD and DM multiplexers (DM-MUX). In BF16 mode, each SB-TP-GCA stores weight (W) parameters with a 1b sign (S), a 7b W-mantissa (MAN) (W_M), and an 8b W-EXP (W_E). In phase-1, the DM-ADD sums the 8b IN-EXP (IN_E) and 8b W_E to derive the product-EXP (PD_E). In phase-2, the ZB-IPU finds the maximum PD_E (PD_E-MAX) and aligns each IN-MAN (IN_M) accordingly to an aligned IN_M (IN_MA). In phase-3, the DM-MUX computes IN_MA × W_M and generates the product-MAN (PD_M). In phase-4, the DSaA combines the 64 PD_M and PD_E-MAX with place-values to output a full-precision FP32 MACV. In INT8 mode, each SB-TP-GCA stores two 8b INT-W (i.e., W_0[7:0] and W_1[7:0]). In phase-1, the DM-ADD sums 8b W_0 and 8b W_1 to derive a pre-computed sum (pSUM = W_0 + W_1), which can be reused across multiple computations by exploiting W data reuse. In phase-2, the ZB-IPU detects IN sparsity to reduce MAC energy consumption in the DM-GCCA and DSaA, and decodes two bitwise INs (i.e., IN_0[k] and IN_1[k]) as the select signals of the DM-MUX. In phase-3, the DM-MUX performs a partial MAC (pMAC) for IN_0 and IN_1 and generates the pMAC value (pMACV = IN_0 × W_0 + IN_1 × W_1). In phase-4, the DSaA accumulates 64 pMACVs (N_ACCU = 128) to output a full-precision 23b MACV.
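The two dataflows above can be sketched functionally in Python. This is an illustrative model only: the field widths (8b EXP, 8b MAN with hidden 1 for BF16, a 16b alignment range) follow the description above, but the function names and the software decomposition via math.frexp are our own assumptions, not the macro's implementation.

```python
import math

def bf16_fields(x):
    """Decompose a float into BF16-like fields (sign, 8b biased EXP,
    8b mantissa with hidden 1) -- a software stand-in for the stored W/IN."""
    if x == 0.0:
        return 0, 0, 0
    s = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))            # m in [0.5, 1), x = m * 2**e
    return s, e - 1 + 127, int(m * 256) & 0xFF

def dm_fp_mac(ins, ws, n_align=16):
    """Phase-wise BF16 MAC: EXP add -> max-EXP alignment -> MAN multiply
    -> signed accumulate, combined with PD_E-MAX at the end."""
    f = [(bf16_fields(i), bf16_fields(w)) for i, w in zip(ins, ws)]
    pd_e = [ie + we for (_, ie, _), (_, we, _) in f]        # phase-1
    pd_e_max = max(pd_e)                                    # phase-2
    acc = 0
    for ((si, _, mi), (sw, _, mw)), pe in zip(f, pd_e):
        n_sh = pd_e_max - pe
        if n_sh >= n_align:              # beyond the alignment range: skip
            continue
        pd_m = (mi >> n_sh) * mw         # phase-2 align, phase-3 multiply
        acc += -pd_m if si ^ sw else pd_m            # phase-4 accumulate
    # Undo the two 1.7 fixed-point scalings (2**-7 each) and the doubled
    # exponent bias (2 * 127), then apply PD_E-MAX.
    return acc * 2.0 ** (pd_e_max - 2 * 127 - 14)

def dm_int8_pmac(in0, in1, w0, w1, bits=8):
    """Bit-serial partial MAC for INT8 mode: per input bit pair, the
    DM-MUX selects 0, W0, W1, or the pre-computed pSUM, so no multiplier
    is needed and pSUM is reused (unsigned INs assumed for brevity)."""
    psum = w0 + w1                                          # phase-1
    lut = {(0, 0): 0, (1, 0): w0, (0, 1): w1, (1, 1): psum}
    acc = 0
    for k in range(bits):                # phase-2: bitwise IN decode
        sel = ((in0 >> k) & 1, (in1 >> k) & 1)
        acc += lut[sel] << k             # phase-3/4: MUX, shift-and-add
    return acc                           # equals IN0*W0 + IN1*W1
```

For example, dm_fp_mac([1.5, 2.0, -0.75], [2.0, 0.5, 4.0]) returns 1.0, and dm_int8_pmac(5, 3, 7, 2) returns 41 (5·7 + 3·2).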
Figure 34.2.3 illustrates the ZB-IPU scheme. A typical FP-MAC flow [1] uses the full PD_E-MAX bit width to compute the number of shifting bits (N_SH = PD_E-MAX − PD_E) and requires extended MAN alignment bits (exMANb) to suppress truncation data loss. The use of exMANb increases area overhead and results in low area utilization in INT mode. The ZB-IPU adopts a 2-phase alignment with large exMANb to increase inference accuracy in FP mode with a small area overhead. Each ZB-IPU comprises 64 DM IN processing blocks (DM-IPB), a partial-PD_E-MAX finder (pEMAXF), a zone bias unit (ZBU), and a zone detector (ZD) for the zone-detect-based alignment (ZDBA) scheme. The ZDBA scheme includes two stages: (St1) the pEMAXF finds the MSB-6b (PD_E-MAX[8:3]) of PD_E-MAX. The ZBU then generates 3 zone references (PD_E-REF1~3) according to PD_E-MAX[8:3], namely PD_E-REF1[8:0] = PD_E-MAX[8:3] + 111, PD_E-REF2 = PD_E-REF1 − 8, and PD_E-REF3 = PD_E-REF1 − 16. (St2) Each PD_E(N) is classified into one of the three zones based on its zone flag (ZFG). The DM-IPB aligns the IN_M according to the zone-shift number (N_SHZ) obtained by inverting PD_E[2:0] (LSB-3b), which is the difference between PD_E and its PD_E-REF. The IN_M alignments for PD_E with ZFG = 1 and 2 are executed in Ph1 and Ph2, respectively. An IN_M alignment for PD_E with ZFG = 3 triggers the INAC to reduce compute energy. For example, if PD_E(0) = 011111101 (253) is the PD_E-MAX and PD_E-REF1 = 011111111 (255), then PD_E(0) is in zone-1 (ZFG = 1) and proceeds to IN_M alignment in Ph1 with N_SH(0) = 2 (the inversion of PD_E(0)[2:0]). For PD_E(63) = 236 (ZFG = 3), the IN_M alignment is skipped without data loss (INAC activated) and the MAN is zero. Eliminating extra physical bit width in the alignment circuits and using only 3 small inverters instead of a 9b subtractor to find N_SH significantly reduces the energy and area overhead of the ZDBA. Pipelining the adder accumulation of Ph1 and the MAN multiplication of Ph2 shortens T_MAC. For INT8-mode MAC operations, PD_E-REF1 is set to 0 and the ZD serves as the INAC to improve EEF.
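A minimal sketch of the ZDBA classification, assuming the 9b PD_E, the MSB-6b reference construction, and the 3b-inversion shift described above (the function and variable names are ours):

```python
def zdba_shifts(pd_e_list):
    """Classify each 9b PD_E into a zone and derive its alignment shift.
    Zone 1 aligns in Ph1, zone 2 in Ph2 (extra 8b shift), zone 3 is
    skipped (INAC: the mantissa contribution is treated as zero)."""
    ref1 = (max(pd_e_list) >> 3 << 3) | 0b111   # St1: {PD_E-MAX[8:3], 111}
    out = []
    for pe in pd_e_list:
        n_shz = (~pe) & 0b111    # St2: 3 inverters replace a 9b subtractor
        if pe > ref1 - 8:        # zone 1 (ZFG = 1)
            out.append((1, n_shz))
        elif pe > ref1 - 16:     # zone 2 (ZFG = 2)
            out.append((2, n_shz + 8))
        else:                    # zone 3 (ZFG = 3): INAC, alignment skipped
            out.append((3, None))
    return out
```

With the paper's example values, zdba_shifts([253, 245, 236]) classifies 253 (the PD_E-MAX) as zone 1 with a shift of 2 and 236 as zone 3 (skipped); within zones 1 and 2 the returned shift equals PD_E-REF1 − PD_E, which is why no subtractor is needed.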
Figure 34.2.4 illustrates the SB-TP-GCA, comprising 16 stationary-based two-port GC columns (SB-TP-GCC) operable in three modes: stationary-update, storage-update, and self-refresh. Each SB-TP-GCC is equipped with four 4T GCs, a 4T self-refresh unit (SRU), and a 7T stationary unit (STU). In stationary-update mode, the data stored in the accessed GC (RWL = 1) is transmitted to the read BL and the SRU. By activating the write-assist mode of the STU (PGATE = 1, LLAT = 1, RLAT = 1), the full-swing differential signal on the read BL (RBL) and write BL (WBL) drives the accessed data into the STU. Stationary data can be reused by the DM-LCC over multiple MAC computations, exploiting the advantage of weight-data reuse. After the STU data is updated, the STU is decoupled from the WBL and RBL by setting PGATE = LLAT = RLAT = 0, while the RBL is pre-charged to V_DD via the MP transistor (PRE = 0). In storage-update mode, write data is passed from the global BL (GBL) to the SRU via the N0 transistor (HWL = 1). The inverter (RP and RN) of the SRU then drives the WBL to write the data into the selected GC with the write WL (WWL) activated. In self-refresh mode, the SRU reads data from the RBL and then drives the WBL (as in storage-update) to refresh the selected GC. This CIM supports simultaneous MAC computation and W updating or refreshing to shorten the system-level T_MAC. Moreover, each SB-TP-GCC uses only 27 transistors for 4 memory cells (6.75T/cell), which is lower than previous two-port (8-12T) CIMs [4-5].
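The 6.75T/cell figure follows directly from the unit counts above, assuming only the GCs, SRU, and STU are counted against the column:

```python
# Transistors per SB-TP-GCC column, from the unit sizes quoted above.
gcs = 4 * 4   # four 4T gain cells
sru = 4       # 4T self-refresh unit
stu = 7       # 7T stationary unit
total = gcs + sru + stu
print(total, total / 4)   # 27 transistors, 6.75 per memory cell
```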
Figure 34.2.5 summarizes the performance of the proposed schemes. The high area efficiency in INT8 and BF16 modes, enabled by the DM-LCC and ZB-IPS, results in a combined AEF FoM1 that is 2.1-6.9× that of previous CIMs [1-3]. An energy-area-accuracy FoM2 that is 1.5-2.2× higher than previous CIMs [1,2], when applying ResNet20 to the CIFAR-100 dataset, results from high-accuracy computation with exMANb, a compact area, and the low energy consumption of the DM structure and SB-TP-GCA.
Figure 34.2.6 shows the measured results from the 16nm 96Kb GC-CIM macro for FP-MAC (BF16-IN and BF16-W with N_ACCU = 64 and FP32-OUT) and INT-MAC (8b-IN and 8b-W with N_ACCU = 128 and 23b-OUT) operations. Shmoo plots confirm that this CIM macro achieves T_MAC = 4.0ns for FP-MAC and 1.9ns for INT-MAC at V_DD = 0.8V. The measured EEFs in FP-MAC and INT-MAC are 45.4TFLOPS/W and 98.5TOPS/W, respectively. This CIM macro achieves an FoM (OUT-ratio × normalized EEF × normalized AEF) that is >5.3× higher than previous FP-CIMs. In BF16 mode, the system-level inference accuracy is only 0.01% lower than software (FP32) for ResNet20 with the CIFAR-100 dataset, and only 0.02% lower for ResNet18 with the ImageNet dataset. Figure 34.2.7 presents the die photograph.
Acknowledgement:
The authors thank Philip Wong, Kerem Akarvardar, and their TSMC colleagues for guidance, and NSTC and the TSMC-NTHU Major League for financial support.
References:
[1] P.-C. Wu et al., “A 22nm 832Kb Hybrid-Domain Floating-Point SRAM In-Memory-
Compute Macro with 16.2-70.2TFLOPS/W for High-Accuracy AI-Edge Devices,” ISSCC,
pp. 126-127, 2023.
[2] A. Guo et al., “A 28nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-
Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for
Floating-Point CNNs,” ISSCC, pp. 128-129, 2023.
[3] F. Tu et al., “A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable
Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth
Multiplication for Cloud Deep Learning Acceleration,” ISSCC, pp. 254-255, 2022.
[4] H. Mori et al., “A 4nm 6163-TOPS/W/b 4790-TOPS/mm²/b SRAM Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update,” ISSCC, pp. 132-133, 2023.
[5] H. Fujiwara et al., “A 5-nm 254-TOPS/W 221-TOPS/mm² Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations,” ISSCC, pp. 186-187, 2022.
[6] Y. He et al., “A 28nm 38-to-102-TOPS/W 8b Multiply-Less Approximate Digital SRAM
Compute-In-Memory Macro for Neural-Network Inference,” ISSCC, pp. 130-131, 2023.
979-8-3503-0620-0/24/$31.00 ©2024 IEEE