22.2 A 28nm 2D/3D Unified Sparse Convolution Accelerator with
Block-Wise Neighbor Searcher for Large-Scaled Voxel-Based
Point Cloud Network
Wenyu Sun¹, Xiaoyu Feng¹, Chen Tang¹, Shupei Fan¹, Yixiong Yang¹, Jinshan Yue², Huazhong Yang¹, Yongpan Liu¹

¹Tsinghua University, Beijing, China
²Chinese Academy of Sciences, Beijing, China
3D processing plays an important role in many emerging applications such as
autonomous driving, visual navigation and virtual reality. Recent research shows that
adopting 3D voxel-based sparse convolution (SCONV) as a backbone can achieve better
performance than a point-based network in large-scale outdoor scenarios [1]. Moreover,
2D SCONVs are still necessary for Bird’s-Eye-View (BEV) neck layers or fusion with image
processing. Hardware acceleration is needed for multiple key operations, including 3D
submanifold SCONV (S-SCONV), 3D non-submanifold SCONV (N-SCONV) and 2D
SCONV. Recently, several processors have been developed for point-based networks [2]
or SCONV [3-5]. However, for large-scale voxel-based sparse networks, three key
challenges have not been fully addressed thereby limiting practical application, as shown
in Fig. 22.2.1. 1) Frequent and random external memory accesses (EMA) arise from
gather and scatter operations for sparse-neighbor search (SNS), causing high search
overhead and irregular EMA with low efficiency. 2) Low throughput caused by serialized SNS and CNN computation, which degrades overall performance (e.g., SNS consumes >50% of the runtime of the 3D SECOND [6] network on a GPU). 3) Unbalanced workloads and limited data reuse across multiple cores, which result in low core utilization and duplicated data loads without specific optimization.
This work presents a 2D/3D unified sparse convolution accelerator for large-scale voxel-
based networks, as shown in Fig. 22.2.2. The overall architecture mainly includes three
parts: a block-wise memory controller for exchanging coordinate (COO) and feature map
(FM) data, 4 SCONV cores with parallel neighbor search, and an asynchronous and
synchronous hybrid scheduler with dynamic memory router for multiple cores. The
accelerator has three key features: 1) Block-wise sparse data management supporting out-of-order memory allocation, combined with a specifically designed on-chip direct memory access (DMA) controller, which eliminates the overhead of off-chip gather and scatter and enables continuous block-level data transfer. 2) A high-throughput, reconfigurable SCONV core providing unified support for 2D/3D S-SCONV/N-SCONV with parallel SNS and SCONV computing, which reduces processing latency by 86% on average. 3) An asynchronous/synchronous hybrid scheduler for multiple SCONV cores with a dynamic on-chip memory router that maximizes data reuse and core utilization, reducing EMA overhead by 52%.
The workflow of block-wise memory management is shown in Fig. 22.2.3. Points are
voxelized and then block-partitioned according to the COO. To facilitate CNN window
computing, boundary voxels between neighboring blocks are duplicated. After block partitioning, the COO and FM are stored separately in different banks. Multiple memory spaces are allocated according to the block size. Only non-zero blocks are allocated, and COO data are stored out of order both within and across blocks. FM data are likewise saved contiguously in external memory in sampling order, without COO sorting. The proposed pre-processing method is embedded into voxel stream loading with little latency and area overhead. Compared with a traditional gather-and-scatter method, the out-of-order block-wise memory management reduces the pre-processing overhead by 97%,
as evaluated using the KITTI dataset. After external memory is allocated, the on-chip
DMA begins to burst-load data efficiently. The COO fetcher first reads in COO blocks and
generates a bitmap for subsequent SNS. Then, the on-chip IFM fetcher and OFM saver stream FM data in and out in order. The block size determines both the on-chip memory size and the R/W overhead for boundary voxels. A 10×10×6 block size is adopted in this work, keeping the on-chip memory below 200KB with limited boundary R/W overhead.
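To make the data management concrete, the following Python sketch models the block partition with boundary-voxel duplication and the out-of-order allocation of non-zero blocks. The 10×10×6 block size matches the paper; the halo width, data layout, and all function names are illustrative assumptions rather than the chip's exact format.

```python
# Minimal software model of block-wise, out-of-order sparse data
# management (layout and names are assumptions, not the chip's format).
from collections import defaultdict

BLOCK = (10, 10, 6)  # block size adopted in this work
HALO = 1             # assumed halo width for a 3x3x3 CNN window

def partition_blocks(voxel_coos):
    """Map each voxel COO (x, y, z) to its home block, duplicating
    boundary voxels into every neighboring block whose halo covers them."""
    blocks = defaultdict(list)  # block id -> list of voxel COOs
    for x, y, z in voxel_coos:
        bx, by, bz = x // BLOCK[0], y // BLOCK[1], z // BLOCK[2]
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = (bx + dx, by + dy, bz + dz)
                    x0, y0, z0 = (nb[0] * BLOCK[0], nb[1] * BLOCK[1],
                                  nb[2] * BLOCK[2])
                    if (x0 - HALO <= x < x0 + BLOCK[0] + HALO and
                            y0 - HALO <= y < y0 + BLOCK[1] + HALO and
                            z0 - HALO <= z < z0 + BLOCK[2] + HALO):
                        blocks[nb].append((x, y, z))
    return blocks

def allocate_out_of_order(voxel_coos):
    """Assign external-memory slots to non-zero blocks in first-touch
    (sampling) order: no global COO sort, and FM data of each block stay
    contiguous for burst loading by the on-chip DMA."""
    slot_of_block, next_slot = {}, 0
    for coo in voxel_coos:  # streaming order, as in voxel stream loading
        bid = tuple(c // s for c, s in zip(coo, BLOCK))
        if bid not in slot_of_block:  # only non-zero blocks get space
            slot_of_block[bid] = next_slot
            next_slot += 1
    return slot_of_block
```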
Figure 22.2.4 shows the reconfigurable SCONV core supporting multiple functions
(2D/3D S-SCONV/N-SCONV), including a multi-functional SCONV engine (SCONVE) and
a neighbor search engine (NSE). For the SCONVE, both S-SCONV and N-SCONV are
supported with dataflow reconfiguration by the dual-stationary PE controller. S-SCONV
follows an output stationary (OS) dataflow with NSE to skip zero computations for both
input and output, while N-SCONV adopts an input stationary (IS) dataflow with the NSE
power gated for better energy efficiency. For NSE, 2D/3D bitmaps are processed
uniformly as a 1D global bitmap to share neighbor-searching circuits. The 1D global
bitmap is first encoded into a local neighbor bitmap according to the CNN window size
through an N-K multiplexer. Then, the neighbor encoder locates each non-zero position of the neighbor bitmap in one cycle via a priority coder and generates the corresponding FM/WT indices to determine the sparse computing order. Computation-mapping FIFOs for FM and WT connect the NSE and SCONVE to realize parallel acceleration. The NSE must write FM/WT index pairs every cycle without stalling; otherwise, the PE array would be partially idle, lowering the throughput of the SCONV core. Thus, a priority-code-based SNS circuit is introduced. Unlike a traditional exhaustive search, the priority coder transforms the search into finding the index of the right-most non-zero bit, after which the local bitmap is right-shifted. Priority coding and right shifting repeat in turn until all non-zero locations are decoded. By
adopting priority-code-based NSE, the average latency of processing one block in the
SECOND network can be reduced by 76%. Moreover, the parallel optimization improves
the throughput and further reduces latency by 44%. The NSE is area and power efficient,
comprising 22% of the area and 18% of the power of the SCONV core.
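As a behavioral model of the priority-code-based SNS, the sketch below decodes a window-local neighbor bitmap with the same find-the-right-most-bit-then-shift loop; each iteration corresponds to one hardware priority-coding cycle, and the FM-address lookup is an assumption of this model rather than the chip's exact addressing.

```python
def sns_priority_code(neighbor_bitmap, fm_addr_of_pos):
    """Decode all non-zero positions of a window-local neighbor bitmap,
    right-most bit first, emitting (FM address, WT index) pairs for the
    computation-mapping FIFOs. fm_addr_of_pos maps a window position to
    a feature-buffer address (an assumption of this model)."""
    pairs, offset = [], 0
    while neighbor_bitmap:
        # Priority coder: index of the right-most non-zero bit.
        lsb = (neighbor_bitmap & -neighbor_bitmap).bit_length() - 1
        pos = offset + lsb  # non-zero position inside the CNN window
        pairs.append((fm_addr_of_pos[pos], pos))
        # Right-shift past the decoded bit and repeat.
        neighbor_bitmap >>= lsb + 1
        offset += lsb + 1
    return pairs
```

For example, the bitmap 0b1010010 decodes to window positions 1, 4 and 6 in three iterations, so the search cost scales with the number of non-zero neighbors rather than with the window size.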
For block-wise processing, the workload can be unbalanced between different sparse
blocks, which calls for an efficient scheduling strategy to improve the utilization of
computing and memory resources for the multi-core system. This work solves this
problem by a layer-specific asynchronous and synchronous hybrid scheduler, as shown
in Fig. 22.2.5. For layers with few input/output channels (e.g., S-SCONV1 of SECOND), all WT/FM data within one block can be stored on chip, so weights are broadcast to the on-chip memories and the asynchronous scheduler independently assigns different blocks to different cores. When data are ready for a core, the controller invokes that core and waits for it to finish. Once a core finishes and its memory is freed, a request is generated to process the next block. For layers with excessive input/output channels, on-chip memory can hold only part of the data; with the asynchronous method alone, considerable duplicated accesses to external memory occur, resulting in high EMA latency. To improve on-chip data reuse, synchronous scheduling is introduced: one block is divided along the channel dimension and mapped to different cores (e.g., S-SCONV7). This assignment ensures a balanced workload across the cores computing different output-channel data of the same block. Moreover, once the first pass of partial results is finished, the WT/FM data can be reused by rerouting the on-chip memory in the next pass to compute the full results, thereby minimizing EMA overhead. For the medium-sized S-SCONV4, asynchronous and synchronous scheduling are combined inter-block and intra-block, respectively. Fig.
22.2.5 gives the cycle utilization of three typical layers in the SECOND network, showing
that synchronous scheduling effectively improves the core utilization especially for layers
with excessive channel counts. In total, the hybrid scheduler reduces the EMA of COO/FM/WT data by 52% for the full SECOND network, and DMA transfers and SCONV computing are pipelined to hide IO time when processing multiple sparse blocks in most layers.
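A minimal software model of the layer-level policy is sketched below; the capacity test, the data structures, and the way medium-sized layers combine the two modes are illustrative assumptions rather than the chip's control logic.

```python
from collections import deque

def schedule_async(blocks, num_cores):
    """Asynchronous mode (e.g., S-SCONV1): a core that finishes
    immediately requests the next sparse block, so unevenly sized
    blocks self-balance across cores."""
    queue, busy = deque(blocks), [0] * num_cores
    plan = [[] for _ in range(num_cores)]
    while queue:
        core = busy.index(min(busy))  # first core to become free
        blk = queue.popleft()
        plan[core].append(blk["id"])
        busy[core] += blk["cycles"]
    return plan

def schedule_sync(blocks, num_cores):
    """Synchronous mode (e.g., S-SCONV7): each block is split along the
    output-channel dimension across cores so WT/FM stay on chip and are
    reused (the memory rerouting between passes is not modeled)."""
    plan = [[] for _ in range(num_cores)]
    for blk in blocks:
        for core in range(num_cores):
            plan[core].append((blk["id"], f"oc_slice_{core}"))
    return plan

def schedule_layer(blocks, num_cores, block_fits_on_chip, channels_oversized):
    """Pick the per-layer policy: async for small-channel layers, sync
    for oversized-channel layers, and a combination for medium layers
    (e.g., S-SCONV4: async across block groups, sync within each block)."""
    if block_fits_on_chip:
        return schedule_async(blocks, num_cores)
    if channels_oversized:
        return schedule_sync(blocks, num_cores)
    half = num_cores // 2
    return (schedule_sync(blocks[0::2], half) +
            schedule_sync(blocks[1::2], num_cores - half))
```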
This SCONV chip is fabricated in 28nm CMOS technology and the measurement results
are shown in Fig. 22.2.6. It can work at 40-450MHz with 0.48V-0.92V supply, while the
peak energy efficiency is 4.68TOPS/W at 60MHz/0.48V without considering sparsity. For
point clouds, the SECOND network, composed of a 3D backbone and a 2D BEV neck, is evaluated on two large-scale outdoor datasets (KITTI and nuScenes), and the chip achieves 3.3-16.9fps (including IO time) running at 60-400MHz. Compared with a state-of-the-art 3D accelerator for voxel-based SCONV [4], this work supports a unified 2D/3D
network with 1.9-5mJ/frame energy on the outdoor large-scale KITTI dataset. For 2D
images, it achieves 2.8-11.99TOPS/W energy efficiency for ResNet-18 on ImageNet with
24%-45% input/output sparsity. Fig. 22.2.7 shows the die photo and chip summary,
which occupies 3.24mm².
Acknowledgement:
This work was supported in part by National Key R&D Program 2019YFB2204204; NSF
of Beijing Municipality under Grant L211005; Beijing National Research Center for
Information Science and Technology; and Beijing Innovation Center for Future Chips.
Corresponding author: Yongpan Liu.
References:
[1] Waymo 3D detection, open challenge: https://waymo.com/open/challenges/2020/3d-
detection/
[2] S. Shi et al., “PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection,” IEEE CVPR, pp. 10526-10535, 2020.
[3] S. Kim et al., “PNNPU: A 11.9 TOPS/W High-speed 3D Point Cloud-based Neural
Network Processor with Block-based Point Processing for Regular DRAM Access,” IEEE
Symp. VLSI Circuits, 2021.
[4] Q. Cao et al., “A Sparse Convolution Neural Network Accelerator for 3D/4D Point-
Cloud Image Recognition on Low Power Mobile Device with Hopping-Index Rule Book
for Efficient Coordinate Management,” IEEE Symp. VLSI Circuits, pp. 106-107, 2022.
[5] C.-H. Lin et al., “A 3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-Learning
Accelerator for Versatile AI Applications in 7nm 5G Smartphone SoC,” ISSCC, pp. 134-
135, 2020.
[6] J.-F. Zhang et al., “SNAP: An Efficient Sparse Neural Acceleration Processor for
Unstructured Sparse Deep Neural Network Inference,” IEEE JSSC, vol. 56, no. 2, pp.
636-647, 2021.