An Online Video Framework for LiDAR-based 3D Object Detection
LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

Junbo Yin1,2, Jianbing Shen1,4∗, Chenye Guan2,3, Dingfu Zhou2,3, Ruigang Yang2,3,5
1 Beijing Lab of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, China
2 Baidu Research
3 National Engineering Laboratory of Deep Learning Technology and Application, China
4 Inception Institute of Artificial Intelligence, UAE
5 University of Kentucky, Kentucky, USA
{yinjunbocn, shenjianbingcg}@gmail.com
https://github.com/yinjunbo/3DVID
∗ Corresponding author: Jianbing Shen.

Abstract

Existing LiDAR-based 3D object detectors usually focus on single-frame detection, while ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which can emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.

1. Introduction

LiDAR-based 3D object detection plays a critical role in a wide range of applications, such as autonomous driving, robot navigation and virtual/augmented reality [11, 46]. The majority of current 3D object detection approaches [42, 58, 6, 62, 24] follow the single-frame detection paradigm, while few of them perform detection in the point cloud video. A point cloud video is defined as a temporal sequence of point cloud frames. For instance, in the nuScenes dataset [4], 20 point cloud frames can be captured per second with a modern 32-beam LiDAR sensor. Detection in a single frame may suffer from several limitations due to the sparse nature of the point cloud. In particular, occlusions, long distance and non-uniform sampling inevitably occur in a certain frame, where a single-frame object detector is incapable of handling these situations, leading to deteriorated performance, as shown in Fig. 1. However, a point cloud video contains rich spatiotemporal information about the foreground objects, which can be explored to improve the detection performance. The major concern in constructing a 3D video object detector is how to model the spatial and temporal feature representation for the consecutive point cloud frames.

Figure 1: Occlusion situation in autonomous driving scenarios. A typical single-frame 3D object detector, e.g. [24], often leads to false-negative (FN) results (top row). In contrast, our online 3D video object detector can handle this (bottom row). The grey and red boxes denote the predictions and ground-truths, respectively.
In this work, we propose to integrate a graph-based spatial feature encoding component with an attention-aware spatiotemporal feature aggregation component, to capture the video coherence in consecutive point cloud frames, which yields an end-to-end online solution for LiDAR-based 3D video object detection.

Popular single-frame 3D object detectors tend to first discretize the point cloud into voxel or pillar grids [62, 56, 24], and then extract the point cloud features using stacks of convolutional neural networks (CNNs). Such approaches incorporate the success of existing 2D or 3D CNNs and usually gain better computational efficiency compared with the point-based methods [42, 37]. Therefore, in our spatial feature encoding component, we also follow this paradigm to extract features for each input frame. However, a potential problem with these approaches lies in that they only focus on a locally aggregated feature, i.e., employing a PointNet [39] to extract features for separate voxels or pillars as in [62] and [24]. To further enlarge the receptive fields, they have to apply the stride or pooling operations repeatedly, which causes a loss of spatial information. To alleviate this issue, we propose a novel graph-based network, named Pillar Message Passing Network (PMPNet), which treats a non-empty pillar as a graph node and adaptively enlarges the receptive field for a node by aggregating messages from its neighbors. PMPNet can mine the rich geometric relations among different pillar grids in a discretized point cloud frame by iteratively reasoning on a k-NN graph. This effectively encourages information exchange among different spatial regions within a frame.

After obtaining the spatial features of each input frame, we assemble these features in our spatiotemporal feature aggregation component. Since ConvGRU [1] has shown promising performance in the 2D video understanding field, we suggest an Attentive Spatiotemporal Transformer GRU (AST-GRU) that extends ConvGRU to the 3D field by capturing the dependencies of consecutive point cloud frames with an attentive memory gating mechanism. Specifically, there exist two potential limitations when considering LiDAR-based 3D video object detection in autonomous driving scenarios. First, in the bird's eye view, most foreground objects (e.g., cars and pedestrians) occupy small regions, and background noise is inevitably accumulated when computing the new memory in a recurrent unit. Thus, we propose a Spatial Transformer Attention (STA) module, an intra-attention derived from [48, 53], to suppress the background noise and emphasize the foreground objects by attending each pixel with the context information. Second, when updating the memory in the recurrent unit, the spatial features of the two inputs (i.e., the old memory and the new input) are not well aligned. In particular, though we can accurately align the static objects across frames using the ego-pose information, the dynamic objects with large motion are not aligned, which will impair the quality of the new memory. To address this, we propose a Temporal Transformer Attention (TTA) module that adaptively captures the object motions in consecutive frames with a temporal inter-attention mechanism, making better use of modified deformable convolutional layers [65, 64]. Our AST-GRU can thus better handle the spatiotemporal features and produce a more reliable new memory, compared with the vanilla ConvGRU.
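To make the memory-gating idea concrete, below is a minimal PyTorch sketch of a standard ConvGRU cell of the kind AST-GRU builds on. The STA and TTA modules are passed in as callables and default to identity stubs; their internals, as well as all layer sizes and names, are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Vanilla ConvGRU cell on bird's-eye-view feature maps.

    AST-GRU augments this update: the new input X_t is first refined by the
    STA module and the old memory H_{t-1} is aligned to X_t by the TTA
    module. Both are stubbed out as identity functions below.
    """

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h, sta=lambda x: x, tta=lambda h, x: h):
        x = sta(x)        # STA: emphasize foreground, suppress background (stub)
        h = tta(h, x)     # TTA: align dynamic objects in the old memory (stub)
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_cand = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_cand   # gated new memory H_t
```

Replacing the two identity stubs with attention modules is what distinguishes the attentive memory gating of AST-GRU from this vanilla recurrence.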
To summarize, we propose a new LiDAR-based online 3D video object detector that leverages the previous long-term information to improve the detection performance. In our model, a novel PMPNet is introduced to adaptively enlarge the receptive field of the pillar nodes in a discretized point cloud frame by iterative graph-based message passing. The output sequential features are then aggregated in the proposed AST-GRU to mine the rich coherence in the point cloud video by using an attentive memory gating mechanism. Extensive evaluations demonstrate that our 3D video object detector achieves better performance against the single-frame detectors on the large-scale nuScenes benchmark.

2. Related Work

LiDAR-based 3D Object Detection. Existing 3D object detection methods can be roughly divided into three categories, i.e., LiDAR-based methods [42, 58, 62, 24, 61, 56], image-based methods [22, 54, 26, 34, 25] and multi-sensor fusion based methods [5, 29, 30, 21, 38]. Here, we focus on the LiDAR-based methods, since they are less sensitive to different illumination and weather conditions. Among them, one category [62, 57, 24] typically discretizes the point cloud into regular grids (e.g., voxels or pillars) and then employs 2D or 3D CNNs for feature extraction. Another category [42, 58, 6] learns 3D representations directly from the raw point cloud, using point-wise feature extractors such as PointNet++ [39]. Directly applying the point-based detectors in scenarios with large-scale point clouds is usually impractical, because they tend to extract features for each individual point. For instance, a keyframe in the nuScenes dataset contains 300,000 points, densified from 10 non-keyframe LiDAR sweeps within 0.5 s. Operating on point clouds at such a scale leads to non-trivial computational cost and memory requirements. In contrast, the voxel-based methods can overcome this difficulty, since they are less sensitive to the number of points. Zhou et al. [62] first applied an end-to-end CNN to voxel-based 3D object detection. They proposed to describe each voxel with voxel feature encoding (VFE) layers and to extract deep features with cascaded 3D and 2D CNNs. A region proposal network (RPN) is then used to obtain the final detection results. After that, Lang et al. [24] further extended [62] by projecting the point cloud to the bird's eye view and encoding each discretized grid (referred to as a pillar) with a pillar feature network (PFN). The VFE layers and the PFN only consider the individual voxels or pillars when producing the grid-level representation, ignoring the information exchange among larger spatial regions. In contrast, our PMPNet encodes the pillar features from a global perspective through graph-based message passing, thus promoting representations with a non-local property. Moreover, all these single-frame 3D object detectors can only process the point cloud data frame by frame, lacking the exploration of temporal information. Though [33] applies a temporal 3D ConvNet on point cloud sequences, it encounters the feature collapse problem when downsampling the features in the temporal domain. Besides, it cannot handle long-term sequences with multi-frame labels. Instead, our AST-GRU captures the long-term temporal information with an attentive memory gating mechanism, which can fully mine the spatiotemporal coherence in the point cloud video.

Graph Neural Networks. Graph neural networks (GNNs) were first introduced by Gori et al. [13] to model the intrinsic relationships of graph-structured data. Scarselli et al. [41] then extended them to different types of graphs. After that, GNNs have been explored with different message passing strategies. The first group [28, 19, 60, 36, 40] uses the gating mechanism to enable information to propagate across the graph. For instance, Li et al. [28] leverage recurrent neural networks to describe the state of each graph node. Gilmer et al. [12] then generalize a framework that formulates graph reasoning as a parameterized message passing network. Another group [3, 15, 9, 17, 27] integrates convolutional networks into the graph domain, named Graph Convolutional Neural Networks (GCNNs), which update node features via stacks of graph convolutional layers. GNNs have achieved promising results in many areas [9, 10, 51, 2, 52] due to the great expressive power of graphs. Our PMPNet belongs to the first group, capturing the pillar features with a gated message passing strategy, which is used to construct the spatial representation for each point cloud frame.

Figure 2: Our online 3D video object detection framework consists of a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel PMPNet (§3.1) is proposed to extract the spatial features of each point cloud frame. Then, the features of consecutive frames are fed into the AST-GRU (§3.2) in the latter component, to aggregate the spatiotemporal information with an attentive memory gating mechanism.

3. Model Architecture

In this section, we elaborate on our online 3D video object detection framework. As shown in Fig. 2, it consists of a spatial feature encoding component and a spatiotemporal feature aggregation component. Given the input sequence $\{I_t\}_{t=1}^{T}$ with $T$ frames, we first convert the point cloud coordinates of the previous frames $\{I_t\}_{t=1}^{T-1}$ to the current frame $I_T$ using the GPS data, so as to eliminate the influence of the ego-motion and align the static objects across frames. Then, in the spatial feature encoding component, we extract features for each frame with the Pillar Message Passing Network (PMPNet) (§3.1) and a 2D backbone, producing the sequential features $\{X_t\}_{t=1}^{T}$. After that, these features are fed into the Attentive Spatiotemporal Transformer Gated Recurrent Unit (AST-GRU) (§3.2) in the spatiotemporal feature aggregation component, to generate the new memory features $\{H_t\}_{t=1}^{T}$. Finally, an RPN head is applied on $\{H_t\}_{t=1}^{T}$ to give the final detection results $\{Y_t\}_{t=1}^{T}$. Some network architecture details are provided in §3.3.
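As a concrete view of this pipeline, the following PyTorch-style sketch traces the data flow of Fig. 2. The sub-modules `pmpnet`, `backbone2d`, `ast_gru` and `rpn_head` are assumed to be defined elsewhere (implementing §3.1, the 2D CNN, §3.2 and the detection head), and their call signatures here are illustrative, not the paper's interface.

```python
import torch.nn as nn

class Online3DVideoDetector(nn.Module):
    """Sketch of the two-component pipeline in Fig. 2 (module internals omitted)."""

    def __init__(self, pmpnet, backbone2d, ast_gru, rpn_head):
        super().__init__()
        self.pmpnet, self.backbone2d = pmpnet, backbone2d
        self.ast_gru, self.rpn_head = ast_gru, rpn_head

    def forward(self, frames):
        # frames: list of T point cloud frames {I_t}, already transformed into
        # the coordinate frame of I_T using the ego-pose (GPS) data.
        xs = [self.backbone2d(self.pmpnet(f)) for f in frames]   # spatial features {X_t}
        h, detections = None, []
        for x in xs:                     # recurrent spatiotemporal aggregation
            h = self.ast_gru(x, h)       # new memory H_t (h is None for the first frame)
            detections.append(self.rpn_head(h))   # detection results Y_t
        return detections
```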
3.1. Pillar Message Passing Network

Previous point cloud encoding layers (e.g., the VFE layers in [62] and the PFN in [24]) for voxel-based 3D object detection typically encode each voxel or pillar separately, which limits the expressive power of the grid-level representation due to the small receptive field of each local grid region. Our PMPNet instead seeks to explore the rich spatial relations among different grid regions by treating the non-empty pillar grids as graph nodes. Such a design effectively preserves the non-Euclidean geometric characteristics of the original point clouds and enhances the output pillar features with a non-locality property.

Given an input point cloud frame $I_t$, we first uniformly discretize it into a set of pillars $P$, with each pillar uniquely associated with a spatial coordinate in the x-y plane, as in [24]. Then, PMPNet maps the resultant pillars to a directed graph $G = (V, E)$, where node $v_i \in V$ represents a non-empty pillar $P_i \in P$ and edge $e_{i,j} \in E$ indicates the message passed from node $v_i$ to $v_j$. To reduce the computational overhead, we define $G$ as a k-nearest neighbor (k-NN) graph, which is built in the geometric space by comparing the centroid distances among different pillars.

Figure 3: Illustration of one iteration step of message propagation, where $h_i$ is the state of node $v_i$. In step $s$, the neighbors of $h_1$ are $\{h_2, h_3, h_4\}$ (within the gray dashed line), representing the pillars in the top car. After aggregating messages from the neighbors, the receptive field of $h_1$ is enlarged in step $s+1$, indicating that the relations with nodes from the bottom car are modeled.

To explicitly mine the rich relations among different pillar nodes, PMPNet performs iterative message passing on $G$ and updates the node states at each iteration step. Concretely, given a node $v_i$, we first utilize a pillar feature network (PFN) [24] to describe its initial state $h_i^0$ at iteration step $s = 0$:

$h_i^0 = F_{\mathrm{PFN}}(P_i) \in \mathbb{R}^{L}$,  (1)

where $h_i^0$ is an $L$-dimensional vector and $P_i \in \mathbb{R}^{N \times D}$ represents a pillar containing $N$ LiDAR points, with each point parameterized by a $D$-dimensional representation (e.g., the XYZ coordinates and the received reflectance). The PFN is realized by applying fully connected layers on each point within the pillar, and then summarizing the features of all points through a channel-wise maximum operation. The initial node state $h_i^0$ is therefore a locally aggregated feature, only including the information of the points within a certain pillar grid.
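For illustration, here is a minimal PyTorch sketch of Eq. (1) and of the k-NN graph construction described above. The feature dimensions, the single linear layer with batch normalization, and the helper names are assumptions for the sketch; the paper's PFN configuration may differ.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Minimal sketch of Eq. (1): h_i^0 = F_PFN(P_i)."""

    def __init__(self, point_dim=9, feat_dim=64):
        super().__init__()
        self.fc = nn.Linear(point_dim, feat_dim)   # shared FC over every point
        self.bn = nn.BatchNorm1d(feat_dim)

    def forward(self, pillars):
        # pillars: (P, N, D) tensor of P non-empty pillars with N points each
        x = self.fc(pillars)                                      # (P, N, L)
        x = torch.relu(self.bn(x.transpose(1, 2))).transpose(1, 2)
        return x.max(dim=1).values      # channel-wise max over points -> (P, L)

def knn_graph(centroids, k=8):
    """Build the k-NN graph over pillar centroids (P, 2) by x-y distance."""
    dists = torch.cdist(centroids, centroids)        # (P, P) pairwise distances
    # k+1 because the nearest "neighbor" of every pillar is itself.
    return dists.topk(k + 1, largest=False).indices[:, 1:]   # (P, k) neighbor ids
```

The returned neighbor indices are what the message-passing step described next iterates over.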
Next, we elaborate on the message passing process. One iteration step of message propagation is illustrated in Fig. 3. At step $s$, a node $v_i$ aggregates information from all the neighbor nodes $v_j \in \Omega_{v_i}$ in the k-NN graph. We define the incoming edge feature from node $v_j$ as $e^s_{j,i}$, indicating the relation between node $v_i$ and $v_j$. Inspired by [55], the incoming edge feature $e^s_{j,i}$ is given by:

$e^s_{j,i} = h^s_j - h^s_i \in \mathbb{R}^{L}$,  (2)

which is an asymmetric function encoding the local neighbor information. Accordingly, the message passed from $v_j$ to $v_i$ is denoted as:

$m^{s+1}_{j,i} = \phi_{\theta}([h^s_i, e^s_{j,i}]) \in \mathbb{R}^{L'}$,  (3)

where $\phi_{\theta}$ is parameterized by a fully connected layer, which takes as input the concatenation of $h^s_i$ and $e^s_{j,i}$, and yields an $L'$-dimensional feature. After computing all the pairwise relations between $v_i$ and its neighbors $v_j \in \Omega_{v_i}$, we summarize the received $k$ messages with a maximum operation:

$m^{s+1}_{i} = \max_{j \in \Omega_i} \big( m^{s+1}_{j,i} \big) \in \mathbb{R}^{L'}$.  (4)

Then, we update the node state $h^s_i$ to $h^{s+1}_i$ for node $v_i$. The update process should consider both the newly collected message $m^{s+1}_i$ and the previous state $h^s_i$. Recurrent neural networks and their variants [16, 47] can adaptively capture dependencies across different time steps. Hence, we utilize the Gated Recurrent Unit (GRU) [7] as the update function for its better convergence characteristics. The update process is formulated as:

$h^{s+1}_i = \mathrm{GRU}(h^s_i, m^{s+1}_i) \in \mathbb{R}^{L}$.  (5)

In this way, the new node state $h^{s+1}_i$ contains the information from all the neighbor nodes of $v_i$. Moreover, a neighbor node $v_j$ also collects information from its own neighbors $\Omega_{v_j}$.
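The following PyTorch sketch implements one such message-passing iteration, following Eqs. (2)-(5) directly. The feature sizes, the single-layer $\phi_{\theta}$, and the class and argument names are illustrative assumptions; the neighbor indices are assumed to come from the k-NN graph built earlier.

```python
import torch
import torch.nn as nn

class PillarMessagePassing(nn.Module):
    """One message-passing step over the pillar k-NN graph, Eqs. (2)-(5)."""

    def __init__(self, node_dim=64, msg_dim=64):
        super().__init__()
        self.phi = nn.Linear(2 * node_dim, msg_dim)   # phi_theta in Eq. (3)
        self.gru = nn.GRUCell(msg_dim, node_dim)      # update function in Eq. (5)

    def forward(self, h, neighbors):
        # h: (P, L) node states; neighbors: (P, k) indices from the k-NN graph
        h_j = h[neighbors]                             # (P, k, L) neighbor states
        e = h_j - h.unsqueeze(1)                       # Eq. (2): edge features
        h_i = h.unsqueeze(1).expand_as(h_j)
        m = self.phi(torch.cat([h_i, e], dim=-1))      # Eq. (3): pairwise messages
        m = m.max(dim=1).values                        # Eq. (4): max over k neighbors
        return self.gru(m, h)                          # Eq. (5): updated node states
```

Applying this step repeatedly corresponds to running the iterative message passing for multiple steps, each time enlarging the receptive field of every pillar node as illustrated in Fig. 3.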