22.2 A 28nm 2D/3D Unified Sparse Convolution Accelerator with
Block-Wise Neighbor Searcher for Large-Scaled Voxel-Based
Point Cloud Network
Wenyu Sun¹, Xiaoyu Feng¹, Chen Tang¹, Shupei Fan¹, Yixiong Yang¹, Jinshan Yue², Huazhong Yang¹, Yongpan Liu¹

¹Tsinghua University, Beijing, China
²Chinese Academy of Sciences, Beijing, China
3D processing plays an important role in many emerging applications such as
autonomous driving, visual navigation and virtual reality. Recent research shows that
adopting 3D voxel-based sparse convolution (SCONV) as a backbone can achieve better
performance than a point-based network in large-scale outdoor scenarios [1]. Moreover,
2D SCONVs are still necessary for Bird’s-Eye-View (BEV) neck layers or fusion with image
processing. Hardware acceleration is needed for multiple key operations, including 3D
submanifold SCONV (S-SCONV), 3D non-submanifold SCONV (N-SCONV) and 2D
SCONV. Recently, several processors have been developed for point-based networks [2]
or SCONV [3-5]. However, for large-scale voxel-based sparse networks, three key
challenges have not been fully addressed thereby limiting practical application, as shown
in Fig. 22.2.1. 1) Frequent and random external memory accesses (EMA) arise from
gather and scatter operations for sparse-neighbor search (SNS), causing high search
overhead and irregular EMA with low efficiency. 2) Low throughput caused by serialized SNS and CNN computation, which degrades overall performance (e.g., SNS consumes >50% of the runtime of the 3D SECOND [6] network on a GPU). 3) Unbalanced workloads and limited data reuse across multiple cores, which result in low core utilization and duplicated data loads without specific optimization.
This work presents a 2D/3D unified sparse convolution accelerator for large-scale voxel-
based networks, as shown in Fig. 22.2.2. The overall architecture mainly includes three
parts: a block-wise memory controller for exchanging coordinate (COO) and feature map
(FM) data, 4 SCONV cores with parallel neighbor search, and an asynchronous and
synchronous hybrid scheduler with dynamic memory router for multiple cores. The
accelerator has three key features: 1) Block-wise sparse data management supporting out-of-order memory allocation, combined with a specifically designed on-chip direct memory access (DMA) controller, which eliminates the overhead of off-chip gather and scatter and enables continuous block-level data transfer. 2) A high-throughput, reconfigurable SCONV core providing unified support for 2D/3D S-SCONV/N-SCONV with parallel SNS and SCONV computing, which reduces processing latency by 86% on average. 3) An asynchronous/synchronous hybrid scheduler for multiple SCONV cores with a dynamic on-chip memory router that maximizes data reuse and core utilization, reducing EMA overhead by 52%.
The workflow of block-wise memory management is shown in Fig. 22.2.3. Points are
voxelized and then block-partitioned according to the COO. To facilitate CNN window
computing, boundary voxels between neighboring blocks are duplicated. After block partitioning, the COO and FM are stored separately in different banks. Multiple memory spaces are allocated according to the block size. Only non-zero blocks are allocated, and COO data are stored out of order both within and across blocks. FM data are likewise saved contiguously in external memory in sampling order, without COO sorting. The proposed pre-processing method is embedded into voxel stream loading with little latency and area overhead. Compared with a traditional gather-and-scatter method, the out-of-order block-wise memory management reduces the pre-processing overhead by 97%,
as evaluated using the KITTI dataset. After external memory is allocated, the on-chip
DMA begins to burst-load data efficiently. The COO fetcher first reads in COO blocks and
generates a bitmap for subsequent SNS. Then, the on-chip IFM fetcher and OFM saver stream FM data in and out in order. The block size determines both the on-chip memory size and the R/W overhead for boundary voxels. A 10×10×6 block size is adopted in this work, keeping the on-chip memory below 200KB with limited boundary R/W overhead.
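To make the data management concrete, the following Python sketch models the block partition with boundary-voxel duplication and the out-of-order allocation of non-zero blocks. The 10×10×6 block size matches the paper; the halo width, data layout, and all function names are illustrative assumptions rather than the chip's exact format.

```python
# Minimal software model of block-wise, out-of-order sparse data
# management (layout and names are assumptions, not the chip's format).
from collections import defaultdict

BLOCK = (10, 10, 6)  # block size adopted in this work
HALO = 1             # assumed halo width for a 3x3x3 CNN window

def partition_blocks(voxel_coos):
    """Map each voxel COO (x, y, z) to its home block, duplicating
    boundary voxels into every neighboring block whose halo covers them."""
    blocks = defaultdict(list)  # block id -> list of voxel COOs
    for x, y, z in voxel_coos:
        bx, by, bz = x // BLOCK[0], y // BLOCK[1], z // BLOCK[2]
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = (bx + dx, by + dy, bz + dz)
                    x0, y0, z0 = (nb[0] * BLOCK[0], nb[1] * BLOCK[1],
                                  nb[2] * BLOCK[2])
                    if (x0 - HALO <= x < x0 + BLOCK[0] + HALO and
                            y0 - HALO <= y < y0 + BLOCK[1] + HALO and
                            z0 - HALO <= z < z0 + BLOCK[2] + HALO):
                        blocks[nb].append((x, y, z))
    return blocks

def allocate_out_of_order(voxel_coos):
    """Assign external-memory slots to non-zero blocks in first-touch
    (sampling) order: no global COO sort, and FM data of each block stay
    contiguous for burst loading by the on-chip DMA."""
    slot_of_block, next_slot = {}, 0
    for coo in voxel_coos:  # streaming order, as in voxel stream loading
        bid = tuple(c // s for c, s in zip(coo, BLOCK))
        if bid not in slot_of_block:  # only non-zero blocks get space
            slot_of_block[bid] = next_slot
            next_slot += 1
    return slot_of_block
```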
Figure 22.2.4 shows the reconfigurable SCONV core supporting multiple functions
(2D/3D S-SCONV/N-SCONV), including a multi-functional SCONV engine (SCONVE) and
a neighbor search engine (NSE). For the SCONVE, both S-SCONV and N-SCONV are
supported with dataflow reconfiguration by the dual-stationary PE controller. S-SCONV
follows an output stationary (OS) dataflow with NSE to skip zero computations for both
input and output, while N-SCONV adopts an input stationary (IS) dataflow with the NSE
power gated for better energy efficiency. For NSE, 2D/3D bitmaps are processed
uniformly as a 1D global bitmap to share neighbor-searching circuits. The 1D global
bitmap is first encoded into a local neighbor bitmap according to the CNN window size
through an N-K multiplexer. Then, the neighbor encoder locates each non-zero position of the neighbor bitmap in one cycle via a priority coder and generates the corresponding FM/WT indices to determine the sparse computing order. Computation-mapping FIFOs for FM and WT connect the NSE and SCONVE to realize parallel acceleration. The NSE must write FM/WT index pairs every cycle without stalling; otherwise, the PE array would be partially idle, lowering the throughput of the SCONV core. Thus, a priority-code-based SNS circuit is introduced. Unlike a traditional exhaustive search, the priority coder transforms the search into finding the index of the right-most non-zero bit, after which the local bitmap is right-shifted. Priority coding and right shifting repeat in turn until all non-zero locations are decoded. By
adopting priority-code-based NSE, the average latency of processing one block in the
SECOND network can be reduced by 76%. Moreover, the parallel optimization improves
the throughput and further reduces latency by 44%. The NSE is area and power efficient,
comprising 22% of the area and 18% of the power of the SCONV core.
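As a behavioral model of the priority-code-based SNS, the sketch below decodes a window-local neighbor bitmap with the same find-the-right-most-bit-then-shift loop; each iteration corresponds to one hardware priority-coding cycle, and the FM-address lookup is an assumption of this model rather than the chip's exact addressing.

```python
def sns_priority_code(neighbor_bitmap, fm_addr_of_pos):
    """Decode all non-zero positions of a window-local neighbor bitmap,
    right-most bit first, emitting (FM address, WT index) pairs for the
    computation-mapping FIFOs. fm_addr_of_pos maps a window position to
    a feature-buffer address (an assumption of this model)."""
    pairs, offset = [], 0
    while neighbor_bitmap:
        # Priority coder: index of the right-most non-zero bit.
        lsb = (neighbor_bitmap & -neighbor_bitmap).bit_length() - 1
        pos = offset + lsb  # non-zero position inside the CNN window
        pairs.append((fm_addr_of_pos[pos], pos))
        # Right-shift past the decoded bit and repeat.
        neighbor_bitmap >>= lsb + 1
        offset += lsb + 1
    return pairs
```

For example, the bitmap 0b1010010 decodes to window positions 1, 4 and 6 in three iterations, so the search cost scales with the number of non-zero neighbors rather than with the window size.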
For block-wise processing, the workload can be unbalanced between different sparse
blocks, which calls for an efficient scheduling strategy to improve the utilization of
computing and memory resources for the multi-core system. This work solves this
problem by a layer-specific asynchronous and synchronous hybrid scheduler, as shown
in Fig. 22.2.5. For layers with few input/output channels (e.g., S-SCONV1 of SECOND), all WT/FM data within one block can be stored on chip, so weights are broadcast to the on-chip memories and the asynchronous scheduler independently assigns different blocks to different cores. When data are ready for a core, the controller invokes that core and waits for it to finish. Once a core finishes and its memory is freed, a request is generated to process the next block. For layers with excessive input/output channels, on-chip memory can hold only part of the data; with the asynchronous method alone, considerable duplicated accesses to external memory occur, resulting in high EMA latency. To improve on-chip data reuse, synchronous scheduling is introduced: one block is divided along the channel dimension and mapped to different cores (e.g., S-SCONV7). This assignment ensures a balanced workload across the cores computing different output-channel data of the same block. Moreover, once the first pass of partial results is finished, the WT/FM data can be reused by rerouting the on-chip memory in the next pass to compute the full results, thereby minimizing EMA overhead. For the medium-sized S-SCONV4, asynchronous and synchronous scheduling are combined inter-block and intra-block, respectively. Fig.
22.2.5 gives the cycle utilization of three typical layers in the SECOND network, showing
that synchronous scheduling effectively improves the core utilization especially for layers
with excessive channel counts. In total, the hybrid scheduler reduces the EMA of COO/FM/WT data by 52% for the full SECOND network, and DMA transfers and SCONV computing are pipelined to hide IO time when processing multiple sparse blocks in most layers.
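A minimal software model of the layer-level policy is sketched below; the capacity test, the data structures, and the way medium-sized layers combine the two modes are illustrative assumptions rather than the chip's control logic.

```python
from collections import deque

def schedule_async(blocks, num_cores):
    """Asynchronous mode (e.g., S-SCONV1): a core that finishes
    immediately requests the next sparse block, so unevenly sized
    blocks self-balance across cores."""
    queue, busy = deque(blocks), [0] * num_cores
    plan = [[] for _ in range(num_cores)]
    while queue:
        core = busy.index(min(busy))  # first core to become free
        blk = queue.popleft()
        plan[core].append(blk["id"])
        busy[core] += blk["cycles"]
    return plan

def schedule_sync(blocks, num_cores):
    """Synchronous mode (e.g., S-SCONV7): each block is split along the
    output-channel dimension across cores so WT/FM stay on chip and are
    reused (the memory rerouting between passes is not modeled)."""
    plan = [[] for _ in range(num_cores)]
    for blk in blocks:
        for core in range(num_cores):
            plan[core].append((blk["id"], f"oc_slice_{core}"))
    return plan

def schedule_layer(blocks, num_cores, block_fits_on_chip, channels_oversized):
    """Pick the per-layer policy: async for small-channel layers, sync
    for oversized-channel layers, and a combination for medium layers
    (e.g., S-SCONV4: async across block groups, sync within each block)."""
    if block_fits_on_chip:
        return schedule_async(blocks, num_cores)
    if channels_oversized:
        return schedule_sync(blocks, num_cores)
    half = num_cores // 2
    return (schedule_sync(blocks[0::2], half) +
            schedule_sync(blocks[1::2], num_cores - half))
```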
This SCONV chip is fabricated in 28nm CMOS technology and the measurement results
are shown in Fig. 22.2.6. It can work at 40-450MHz with 0.48V-0.92V supply, while the
peak energy efficiency is 4.68TOPS/W at 60MHz/0.48V without considering sparsity. For
point clouds, the SECOND network, composed of a 3D backbone and a 2D BEV neck, is evaluated on two large-scale outdoor datasets (KITTI and nuScenes), and the chip achieves 3.3-16.9fps (including IO time) running at 60-400MHz. Compared with a state-of-the-art 3D accelerator for voxel-based SCONV [4], this work supports a unified 2D/3D
network with 1.9-5mJ/frame energy on the outdoor large-scale KITTI dataset. For 2D
images, it achieves 2.8-11.99TOPS/W energy efficiency for ResNet-18 on ImageNet with
24%-45% input/output sparsity. Fig. 22.2.7 shows the die photo and chip summary,
which occupies 3.24mm².
Acknowledgement:
This work was supported in part by National Key R&D Program 2019YFB2204204; NSF
of Beijing Municipality under Grant L211005; Beijing National Research Center for
Information Science and Technology; and Beijing Innovation Center for Future Chips.
Corresponding author: Yongpan Liu.
References:
[1] Waymo 3D detection, open challenge: https://waymo.com/open/challenges/2020/3d-
detection/
[2] S. Shi et al., “PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection,” IEEE CVPR, pp. 10526-10535, 2020.
[3] S. Kim et al., “PNNPU: A 11.9 TOPS/W High-speed 3D Point Cloud-based Neural
Network Processor with Block-based Point Processing for Regular DRAM Access,” IEEE
Symp. VLSI Circuits, 2021.
[4] Q. Cao et al., “A Sparse Convolution Neural Network Accelerator for 3D/4D Point-
Cloud Image Recognition on Low Power Mobile Device with Hopping-Index Rule Book
for Efficient Coordinate Management,” IEEE Symp. VLSI Circuits, pp. 106-107, 2022.
[5] C.-H. Lin et al., “A 3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-Learning
Accelerator for Versatile AI Applications in 7nm 5G Smartphone SoC,” ISSCC, pp. 134-
135, 2020.
[6] J.-F. Zhang et al., “SNAP: An Efficient Sparse Neural Acceleration Processor for
Unstructured Sparse Deep Neural Network Inference,” IEEE JSSC, vol. 56, no. 2, pp.
636-647, 2021.