2024 IEEE International Solid-State Circuits Conference
ISSCC 2024 / SESSION 34 / COMPUTE-IN-MEMORY / 34.2
34.2 A 16nm 96Kb Integer/Floating-Point Dual-Mode-Gain-Cell-
Computing-in-Memory Macro Achieving 73.3-163.3TOPS/W
and 33.2-91.2TFLOPS/W for AI-Edge Devices
Win-San Khwa*¹, Ping-Chun Wu*², Jui-Jen Wu¹, Jian-Wei Su²,³, Ho-Yu Chen²,
Zhao-En Ke², Ting-Chien Chiu², Jun-Ming Hsu², Chiao-Yen Cheng²,
Yu-Chen Chen², Chung-Chuan Lo², Ren-Shuo Liu², Chih-Cheng Hsieh²,
Kea-Tiong Tang², Meng-Fan Chang¹,²
¹TSMC Corporate Research, Hsinchu, Taiwan
²National Tsing Hua University, Hsinchu, Taiwan
³Industrial Technology Research Institute, Hsinchu, Taiwan
*Equally Credited Authors (ECAs)
Advanced AI-edge chips require computational flexibility and high energy efficiency (EEF) with sufficient inference accuracy for a variety of applications. Floating-point (FP) numerical representation can be used for complex neural networks (NNs) requiring high inference accuracy; however, such an approach requires more energy and more parameter storage than does a fixed-point integer (INT) representation. Many compute-in-memory (CIM) designs achieve a good EEF for INT multiply-and-accumulate (MAC) operations; however, few support FP-MAC operations [1-3]. Implementing INT/FP dual-mode (DM) MAC operations presents challenges (Fig. 34.2.1), including (1) low area efficiency, since FP-MAC circuits sit idle during INT-MAC operations; (2) high system-level latency, due to NN data-update interruptions on small-capacity SRAM-CIM without concurrent write-and-compute functionality; and (3) high energy consumption, due to repeated system-to-CIM data transfers during computation. This work presents an INT/FP DM macro featuring (1) a DM zone-based input (IN) processing scheme (ZB-IPS) that eliminates subtraction in exponent (EXP) computation while reusing the alignment circuit in INT mode to improve EEF and area efficiency (AEF); (2) a DM local-computing cell (DM-LCC), which reuses the EXP adder as an adder-tree stage for INT-MAC to improve AEF in INT mode; and (3) a stationary-based two-port gain-cell (GC) array (SB-TP-GCA) to support concurrent data update and computation while reducing system-to-CIM and internal data accesses to improve EEF and latency (T_MAC).
A 16nm 96Kb INT/FP DM GC-CIM macro with 4T GCs is fabricated to support FP-MAC with 64 accumulations (N_ACCU) for BF16-IN, BF16-W, and FP32-OUT, as well as INT-MAC with N_ACCU = 128 for 8b-IN, 8b-W, and 23b-OUT. This CIM macro achieves a 163.3TOPS/W INT-MAC and a 91.2TFLOPS/W FP-MAC EEF.
Figure 34.2.2 illustrates the CIM structure and dataflow. The conventional FP-CIM structure has low area efficiency in INT mode, as its EXP adders and alignment circuits sit idle. The DM CIM structure uses DM adders (DM-ADDs) as an adder tree for 2× N_ACCU, and the alignment circuit as an IN-sparsity-aware circuit (INAC), to improve EEF and AEF in INT mode. The macro consists of 24 banks, each with an output channel. Each DM CIM bank includes a DM zone-based IN processing unit (ZB-IPU), a DM GC computing array (DM-GCCA), a digital shift-and-adder (DSaA), and a timing controller (CTRL). The DM-GCCA consists of 64 GC computing blocks (GC-CB), each containing an SB-TP-GCA for 64b of storage data and 16b of stationary data, and a DM-LCC comprising a DM-ADD and DM multiplexers (DM-MUX). In BF16 mode, each SB-TP-GCA stores weight (W) parameters with a 1b sign (S), a 7b W-mantissa (MAN) (W_M), and an 8b W-EXP (W_E). In phase-1, the DM-ADD sums the 8b IN-EXP (IN_E) and 8b W_E to derive the product-EXP (PD_E). In phase-2, the ZB-IPU finds the maximum PD_E (PD_E-MAX) and aligns each IN-MAN (IN_M) accordingly to an aligned IN_M (IN_MA). In phase-3, the DM-MUX computes IN_MA × W_M and generates the product-MAN (PD_M). In phase-4, the DSaA combines the 64 PD_M and PD_E-MAX with place-values to output a full-precision FP32 MACV. In INT8 mode, each SB-TP-GCA stores two 8b INT-W (i.e., W_0[7:0] and W_1[7:0]). In phase-1, the DM-ADD sums 8b W_0 and 8b W_1 to derive a pre-computed sum (pSUM = W_0 + W_1), which can be reused across multiple computations by exploiting W data reuse. In phase-2, the ZB-IPU detects IN sparsity to reduce MAC energy consumption in the DM-GCCA and DSaA, and decodes two bitwise INs (i.e., IN_0[k] and IN_1[k]) as the select signals of the DM-MUX. In phase-3, the DM-MUX performs a partial MAC (pMAC) for IN_0 and IN_1 and generates the pMAC value (pMACV = IN_0 × W_0 + IN_1 × W_1). In phase-4, the DSaA accumulates 64 pMACVs (N_ACCU = 128) to output a full-precision 23b MACV.
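The two dataflows above can be sketched functionally in Python. This is an illustrative model only: the field widths (8b EXP, 8b MAN with hidden 1 for BF16, a 16b alignment range) follow the description above, but the function names and the software decomposition via math.frexp are our own assumptions, not the macro's implementation.

```python
import math

def bf16_fields(x):
    """Decompose a float into BF16-like fields (sign, 8b biased EXP,
    8b mantissa with hidden 1) -- a software stand-in for the stored W/IN."""
    if x == 0.0:
        return 0, 0, 0
    s = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))            # m in [0.5, 1), x = m * 2**e
    return s, e - 1 + 127, int(m * 256) & 0xFF

def dm_fp_mac(ins, ws, n_align=16):
    """Phase-wise BF16 MAC: EXP add -> max-EXP alignment -> MAN multiply
    -> signed accumulate, combined with PD_E-MAX at the end."""
    f = [(bf16_fields(i), bf16_fields(w)) for i, w in zip(ins, ws)]
    pd_e = [ie + we for (_, ie, _), (_, we, _) in f]        # phase-1
    pd_e_max = max(pd_e)                                    # phase-2
    acc = 0
    for ((si, _, mi), (sw, _, mw)), pe in zip(f, pd_e):
        n_sh = pd_e_max - pe
        if n_sh >= n_align:              # beyond the alignment range: skip
            continue
        pd_m = (mi >> n_sh) * mw         # phase-2 align, phase-3 multiply
        acc += -pd_m if si ^ sw else pd_m            # phase-4 accumulate
    # Undo the two 1.7 fixed-point scalings (2**-7 each) and the doubled
    # exponent bias (2 * 127), then apply PD_E-MAX.
    return acc * 2.0 ** (pd_e_max - 2 * 127 - 14)

def dm_int8_pmac(in0, in1, w0, w1, bits=8):
    """Bit-serial partial MAC for INT8 mode: per input bit pair, the
    DM-MUX selects 0, W0, W1, or the pre-computed pSUM, so no multiplier
    is needed and pSUM is reused (unsigned INs assumed for brevity)."""
    psum = w0 + w1                                          # phase-1
    lut = {(0, 0): 0, (1, 0): w0, (0, 1): w1, (1, 1): psum}
    acc = 0
    for k in range(bits):                # phase-2: bitwise IN decode
        sel = ((in0 >> k) & 1, (in1 >> k) & 1)
        acc += lut[sel] << k             # phase-3/4: MUX, shift-and-add
    return acc                           # equals IN0*W0 + IN1*W1
```

For example, dm_fp_mac([1.5, 2.0, -0.75], [2.0, 0.5, 4.0]) returns 1.0, and dm_int8_pmac(5, 3, 7, 2) returns 41 (5·7 + 3·2).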
Figure 34.2.3 illustrates the ZB-IPU scheme. A typical FP-MAC flow [1] uses the full PD_E-MAX bit width to compute the number of shifting bits (N_SH = PD_E-MAX − PD_E) and requires extended MAN alignment bits (exMANb) to suppress truncation data loss. The use of exMANb increases area overhead and results in low area utilization in INT mode. The ZB-IPU adopts a 2-phase alignment with large exMANb to increase inference accuracy in FP mode with a small area overhead. Each ZB-IPU comprises 64 DM IN processing blocks (DM-IPB), a partial-PD_E-MAX finder (pEMAXF), a zone bias unit (ZBU), and a zone detector (ZD) for the zone-detect-based alignment (ZDBA) scheme. The ZDBA scheme includes two stages: (St1) the pEMAXF finds the MSB-6b (PD_E-MAX[8:3]) of PD_E-MAX. The ZBU then generates 3 zone references (PD_E-REF1~3) according to PD_E-MAX[8:3], namely PD_E-REF1[8:0] = PD_E-MAX[8:3] + 111, PD_E-REF2 = PD_E-REF1 − 8, and PD_E-REF3 = PD_E-REF1 − 16. (St2) Each PD_E(N) is classified into one of the three zones based on its zone flag (ZFG). The DM-IPB aligns the IN_M according to the zone-shift number (N_SHZ) obtained by inverting PD_E[2:0] (LSB-3b), which is the difference between PD_E and its PD_E-REF. The IN_M alignments for PD_E with ZFG = 1 and 2 are executed in Ph1 and Ph2, respectively. An IN_M alignment for PD_E with ZFG = 3 triggers the INAC to reduce compute energy. For example, if PD_E(0) = 011111101 (253) is the PD_E-MAX and PD_E-REF1 = 011111111 (255), then PD_E(0) is in zone-1 (ZFG = 1) and proceeds to IN_M alignment in Ph1 with N_SH(0) = 2 (the inversion of PD_E(0)[2:0]). For PD_E(63) = 236 (ZFG = 3), the IN_M alignment is skipped without data loss (INAC activated) and the MAN is zero. Eliminating extra physical bit width in the alignment circuits and using only 3 small inverters instead of a 9b subtractor to find N_SH significantly reduces the energy and area overhead of the ZDBA. Pipelining the adder accumulation of Ph1 and the MAN multiplication of Ph2 shortens T_MAC. For INT8-mode MAC operations, PD_E-REF1 is set to 0 and the ZD serves as the INAC to improve EEF.
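A minimal sketch of the ZDBA classification, assuming the 9b PD_E, the MSB-6b reference construction, and the 3b-inversion shift described above (the function and variable names are ours):

```python
def zdba_shifts(pd_e_list):
    """Classify each 9b PD_E into a zone and derive its alignment shift.
    Zone 1 aligns in Ph1, zone 2 in Ph2 (extra 8b shift), zone 3 is
    skipped (INAC: the mantissa contribution is treated as zero)."""
    ref1 = (max(pd_e_list) >> 3 << 3) | 0b111   # St1: {PD_E-MAX[8:3], 111}
    out = []
    for pe in pd_e_list:
        n_shz = (~pe) & 0b111    # St2: 3 inverters replace a 9b subtractor
        if pe > ref1 - 8:        # zone 1 (ZFG = 1)
            out.append((1, n_shz))
        elif pe > ref1 - 16:     # zone 2 (ZFG = 2)
            out.append((2, n_shz + 8))
        else:                    # zone 3 (ZFG = 3): INAC, alignment skipped
            out.append((3, None))
    return out
```

With the paper's example values, zdba_shifts([253, 245, 236]) classifies 253 (the PD_E-MAX) as zone 1 with a shift of 2 and 236 as zone 3 (skipped); within zones 1 and 2 the returned shift equals PD_E-REF1 − PD_E, which is why no subtractor is needed.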
Figure 34.2.4 illustrates the SB-TP-GCA, comprising 16 stationary-based two-port GC columns (SB-TP-GCC) operable in three modes: stationary-update, storage-update, and self-refresh. Each SB-TP-GCC is equipped with four 4T GCs, a 4T self-refresh unit (SRU), and a 7T stationary unit (STU). In stationary-update mode, the data stored in the accessed GC (RWL = 1) is transmitted to the read BL and the SRU. By activating the write-assist mode of the STU (PGATE = 1, LLAT = 1, RLAT = 1), the full-swing differential signal on the read BL (RBL) and write BL (WBL) drives the accessed data into the STU. Stationary data can be reused by the DM-LCC over multiple MAC computations, exploiting the advantage of weight-data reuse. After the STU data is updated, the STU is decoupled from the WBL and RBL by setting PGATE = LLAT = RLAT = 0, while the RBL is pre-charged to V_DD via the MP transistor (PRE = 0). In storage-update mode, write data is passed from the global BL (GBL) to the SRU via the N0 transistor (HWL = 1). The inverter (RP and RN) of the SRU then drives the WBL to write the data into the selected GC with the write WL (WWL) activated. In self-refresh mode, the SRU reads data from the RBL and then drives the WBL (as in storage-update) to refresh the selected GC. This CIM supports simultaneous MAC computation and W updating or refreshing to shorten the system-level T_MAC. Moreover, each SB-TP-GCC uses only 27 transistors for 4 memory cells (6.75T/cell), which is lower than previous two-port (8-12T) CIMs [4-5].
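The 6.75T/cell figure follows directly from the unit counts above, assuming only the GCs, SRU, and STU are counted against the column:

```python
# Transistors per SB-TP-GCC column, from the unit sizes quoted above.
gcs = 4 * 4   # four 4T gain cells
sru = 4       # 4T self-refresh unit
stu = 7       # 7T stationary unit
total = gcs + sru + stu
print(total, total / 4)   # 27 transistors, 6.75 per memory cell
```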
Figure 34.2.5 summarizes the performance of the proposed schemes. The high area efficiency in INT8 and BF16 modes, enabled by the DM-LCC and ZB-IPS, results in a combined AEF FoM1 that is 2.1-6.9× that of previous CIMs [1-3]. An energy-area-accuracy FoM2 that is 1.5-2.2× higher than previous CIMs [1,2], when applying ResNet20 to the CIFAR-100 dataset, results from high-accuracy computation with exMANb, a compact area, and the low energy consumption of the DM structure and SB-TP-GCA.
Figure 34.2.6 shows the measured results from the 16nm 96Kb GC-CIM macro for FP-MAC (BF16-IN and BF16-W with N_ACCU = 64 and FP32-OUT) and INT-MAC (8b-IN and 8b-W with N_ACCU = 128 and 23b-OUT) operations. Shmoo plots confirm that this CIM macro achieves T_MAC = 4.0ns for FP-MAC and 1.9ns for INT-MAC at V_DD = 0.8V. The measured EEFs in FP-MAC and INT-MAC are 45.4TFLOPS/W and 98.5TOPS/W, respectively. This CIM macro achieves an FoM (OUT-ratio × normalized EEF × normalized AEF) that is >5.3× higher than previous FP-CIMs. In BF16 mode, the system-level inference accuracy is only 0.01% lower than software (FP32) for ResNet20 with the CIFAR-100 dataset, and only 0.02% lower for ResNet18 with the ImageNet dataset. Figure 34.2.7 presents the die photograph.
Acknowledgement:
The authors thank Philip Wong, Kerem Akarvardar, and their TSMC colleagues for guidance, and NSTC and the TSMC-NTHU Major League for financial support.
References:
[1] P.-C. Wu et al., “A 22nm 832Kb Hybrid-Domain Floating-Point SRAM In-Memory-
Compute Macro with 16.2-70.2TFLOPS/W for High-Accuracy AI-Edge Devices,” ISSCC,
pp. 126-127, 2023.
[2] A. Guo et al., “A 28nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-
Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for
Floating-Point CNNs,” ISSCC, pp. 128-129, 2023.
[3] F. Tu et al., “A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable
Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth
Multiplication for Cloud Deep Learning Acceleration,” ISSCC, pp. 254-255, 2022.
[4] H. Mori et al., “A 4nm 6163-TOPS/W/b 4790-TOPS/mm²/b SRAM Based Digital-Computing-in-Memory Macro Supporting Bit-Width Flexibility and Simultaneous MAC and Weight Update,” ISSCC, pp. 132-133, 2023.
[5] H. Fujiwara et al., “A 5-nm 254-TOPS/W 221-TOPS/mm² Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations,” ISSCC, pp. 186-187, 2022.
[6] Y. He et al., “A 28nm 38-to-102-TOPS/W 8b Multiply-Less Approximate Digital SRAM
Compute-In-Memory Macro for Neural-Network Inference,” ISSCC, pp. 130-131, 2023.
979-8-3503-0620-0/24/$31.00 ©2024 IEEE