LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

Junbo Yin1,2, Jianbing Shen1,4∗, Chenye Guan2,3, Dingfu Zhou2,3, Ruigang Yang2,3,5
1 Beijing Lab of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, China
2 Baidu Research
3 National Engineering Laboratory of Deep Learning Technology and Application, China
4 Inception Institute of Artificial Intelligence, UAE
5 University of Kentucky, Kentucky, USA
{yinjunbocn, shenjianbingcg}@gmail.com
https://github.com/yinjunbo/3DVID
∗Corresponding author: Jianbing Shen.

Abstract

Existing LiDAR-based 3D object detectors usually focus on single-frame detection, while ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which can emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.

1. Introduction

Figure 1: Occlusion situation in autonomous driving scenarios. A typical single-frame 3D object detector, e.g. [24], often produces false-negative (FN) results (top row). In contrast, our online 3D video object detector can handle this (bottom row). The grey and red boxes denote the predictions and ground-truths, respectively.

LiDAR-based 3D object detection plays a critical role in a wide range of applications, such as autonomous driving, robot navigation and virtual/augmented reality [11, 46]. The majority of current 3D object detection approaches [42, 58, 6, 62, 24] follow the single-frame detection paradigm, while few perform detection on point cloud videos. A point cloud video is defined as a temporal sequence of point cloud frames. For instance, in the nuScenes dataset [4], 20 point cloud frames can be captured per second with a modern 32-beam LiDAR sensor. Detection in a single frame may suffer from several limitations due to the sparse nature of point clouds. In particular, occlusions, long distances and non-uniform sampling inevitably occur in some frames, and a single-frame object detector cannot handle these situations, leading to deteriorated performance, as shown in Fig. 1. However, a point cloud video contains rich spatiotemporal information about the foreground objects, which can be exploited to improve the detection performance. The major concern in constructing a 3D video object detector is how to model the spatial and temporal feature representation for the consecutive point cloud frames.
In this work, we propose to integrate a graph-based spatial feature encoding component with an attention-aware spatiotemporal feature aggregation component, to capture the video coherence in consecutive point cloud frames, which yields an end-to-end online solution for LiDAR-based 3D video object detection.

Popular single-frame 3D object detectors tend to first discretize the point cloud into voxel or pillar grids [62, 56, 24], and then extract the point cloud features using stacks of convolutional neural networks (CNNs). Such approaches build on the success of existing 2D or 3D CNNs and usually achieve better computational efficiency than point-based methods [42, 37]. Therefore, in our spatial feature encoding component, we also follow this paradigm to extract features for each input frame. However, a potential problem with these approaches is that they only focus on a locally aggregated feature, i.e., they employ a PointNet [39] to extract features for separate voxels or pillars as in [62] and [24]. To further enlarge the receptive fields, they have to apply stride or pooling operations repeatedly, which causes a loss of spatial information. To alleviate this issue, we propose a novel graph-based network, named Pillar Message Passing Network (PMPNet), which treats a non-empty pillar as a graph node and adaptively enlarges the receptive field of a node by aggregating messages from its neighbors. PMPNet can mine the rich geometric relations among different pillar grids in a discretized point cloud frame by iteratively reasoning on a k-NN graph. This effectively encourages information exchange among different spatial regions within a frame.

After obtaining the spatial features of each input frame, we assemble these features in our spatiotemporal feature aggregation component.
Since ConvGRU [1] has shown promising performance in the 2D video understanding field, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) that extends ConvGRU to the 3D field by capturing dependencies between consecutive point cloud frames with an attentive memory gating mechanism. Specifically, there are two potential limitations when considering LiDAR-based 3D video object detection in autonomous driving scenarios. First, in the bird's eye view, most foreground objects (e.g., cars and pedestrians) occupy small regions, and background noise is inevitably accumulated when computing the new memory in a recurrent unit. Thus, we propose to exploit the Spatial Transformer Attention (STA) module, an intra-attention derived from [48, 53], to suppress the background noise and emphasize the foreground objects by attending each pixel with the context information. Second, when updating the memory in the recurrent unit, the spatial features of the two inputs (i.e., the old memory and the new input) are not well aligned. In particular, though we can accurately align the static objects across frames using the ego-pose information, the dynamic objects with large motion are not aligned, which impairs the quality of the new memory. To address this, we propose a Temporal Transformer Attention (TTA) module that adaptively captures the object motions in consecutive frames with a temporal inter-attention mechanism. The TTA module builds on modified deformable convolutional layers [65, 64]. Compared with the vanilla ConvGRU, our AST-GRU can better handle the spatiotemporal features and produce a more reliable new memory.

To summarize, we propose a new LiDAR-based online 3D video object detector that leverages the previous long-term information to improve the detection performance. In our model, a novel PMPNet is introduced to adaptively enlarge the receptive field of the pillar nodes in a discretized point cloud frame by iterative graph-based message passing.
The output sequential features are then aggregated in the proposed AST-GRU to mine the rich coherence in the point cloud video by using an attentive memory gating mechanism.

For instance, Li et al. [28] leverage recurrent neural networks to describe the state of each graph node. Then, Gilmer et al. [12] generalize a framework that formulates graph reasoning as a parameterized message passing network. Another group [3, 15, 9, 17, 27] integrates convolutional networks into the graph domain, known as Graph Convolutional Neural Networks (GCNNs), which update node features via stacks of graph convolutional layers. GNNs have achieved promising results in many areas [9, 10, 51, 2, 52] due to the great expressive power of graphs. Our PMPNet belongs to the first group, capturing the pillar features with a gated message passing strategy, which is used to construct the spatial representation for each point cloud frame.

3. Model Architecture

In this section, we elaborate on our online 3D video object detection framework. As shown in Fig. 2, it consists of a spatial feature encoding component and a spatiotemporal feature aggregation component. Given the input sequence $\{I_t\}_{t=1}^{T}$ with $T$ frames, we first convert the point cloud coordinates from the previous frames $\{I_t\}_{t=1}^{T-1}$ to the current frame $I_T$ using the GPS data, so as to eliminate the influence of the ego-motion and align the static objects across frames. Then, in the spatial feature encoding component, we extract features for each frame with the Pillar Message Passing Network (PMPNet) (§3.1) and a 2D backbone, producing sequential features $\{X_t\}_{t=1}^{T}$. After that, these features are fed into the Attentive Spatiotemporal Transformer Gated Recurrent Unit (AST-GRU) (§3.2) in the spatiotemporal feature aggregation component, to generate the new memory features $\{H_t\}_{t=1}^{T}$.
Finally, an RPN head is applied on $\{H_t\}_{t=1}^{T}$ to give the final detection results $\{Y_t\}_{t=1}^{T}$. Some network architecture details are provided in §

3.1. Pillar Message Passing Network

Previous point cloud encoding layers (e.g., the VFE layers in [62] and the PFN in [24]) for voxel-based 3D object detection typically encode each voxel or pillar separately, which limits the expressive power of the grid-level representation due to the small receptive field of each local grid region. Our PMPNet instead seeks to explore the rich spatial relations among different grid regions by treating the non-empty pillar grids as graph nodes. Such a design effectively preserves the non-Euclidean geometric characteristics of the original point clouds and enhances the output pillar features with a non-locality property.

Figure 3: In step $s$, the neighbors for $h_1$ are $\{h_2, h_3, h_4\}$ (within the gray dashed line), representing the pillars in the top car. After aggregating messages from the neighbors, the receptive field of $h_1$ is enlarged in step $s+1$, indicating that the relations with nodes from the bottom car are modeled.

Given an input point cloud frame $I_t$, we first uniformly discretize it into a set of pillars $\mathcal{P}$, with each pillar uniquely associated with a spatial coordinate in the x-y plane as in [24]. Then, PMPNet maps the resultant pillars to a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where node $v_i \in \mathcal{V}$ represents a non-empty pillar $P_i \in \mathcal{P}$ and edge $e_{i,j} \in \mathcal{E}$ indicates the message passed from node $v_i$ to $v_j$. To reduce the computational overhead, we define $\mathcal{G}$ as a k-nearest neighbor (k-NN) graph, which is built in the geometric space by comparing the centroid distances among different pillars.

To explicitly mine the rich relations among different pillar nodes, PMPNet performs iterative message passing on $\mathcal{G}$ and updates the node states at each iteration step.
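The k-NN graph over pillar centroids on which this message passing runs can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `build_knn_graph`, the brute-force distance computation, and the exclusion of self-loops are all assumptions made for the example.

```python
import numpy as np


def build_knn_graph(centroids: np.ndarray, k: int) -> np.ndarray:
    """Return an (M, k) index array; row i lists the k pillars nearest to pillar i.

    Hypothetical helper: row i plays the role of the neighbor set of node v_i,
    i.e. the nodes v_j that send messages to v_i.
    """
    m = centroids.shape[0]
    if not 0 < k < m:
        raise ValueError("k must satisfy 0 < k < number of pillars")
    # Pairwise squared Euclidean distances between centroids, shape (M, M).
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Assumption: a pillar is not its own neighbor, so mask the diagonal.
    np.fill_diagonal(dist2, np.inf)
    # Indices of the k smallest distances in each row.
    return np.argsort(dist2, axis=1)[:, :k]


# Toy example: five pillar centroids forming two well-separated groups.
centroids = np.array(
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [10.0, 10.0], [10.5, 10.0]]
)
neighbors = build_knn_graph(centroids, k=2)
```

The brute-force distance matrix costs O(M^2) memory, which is acceptable for a toy example; a spatial index would be the natural replacement for a full LiDAR frame.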
Concretely, given a node $v_i$, we first utilize a pillar feature network (PFN) [24] to describe its initial state $h_i^0$ at iteration step $s = 0$:

$$h_i^0 = F_{\mathrm{PFN}}(P_i) \in \mathbb{R}^{L}, \tag{1}$$

where $h_i^0$ is an $L$-dim vector and $P_i \in \mathbb{R}^{N \times D}$ represents a pillar containing $N$ LiDAR points, with each point parameterized by a $D$-dimensional representation (e.g., the XYZ coordinates and the received reflectance). The PFN is realized by applying fully connected layers to each point within the pillar, then summarizing the features of all points through a channel-wise maximum operation. The initial node state $h_i^0$ is a locally aggregated feature that only includes information from the points within a single pillar grid.

Next, we elaborate on the message passing process. One iteration step of message propagation is illustrated in Fig. 3. At step $s$, a node $v_i$ aggregates information from all the neighbor nodes $v_j \in \Omega_{v_i}$ in the k-NN graph. We define the incoming edge feature from node $v_j$ as $e_{j,i}^{s}$, indicating the relation between nodes $v_i$ and $v_j$. Inspired by [55], the incoming edge feature $e_{j,i}^{s}$ is given by:

$$e_{j,i}^{s} = h_j^{s} - h_i^{s} \in \mathbb{R}^{L}, \tag{2}$$

which is an asymmetric function encoding the local neighbor information. Accordingly, the message passed from $v_j$ to $v_i$ is denoted as:

$$m_{j,i}^{s+1} = \phi_\theta([h_i^{s}, e_{j,i}^{s}]) \in \mathbb{R}^{L'}, \tag{3}$$

where $\phi_\theta$ is parameterized by a fully connected layer, which takes as input the concatenation of $h_i^{s}$ and $e_{j,i}^{s}$, and yields an $L'$-dim feature.

After computing all the pair-wise relations between $v_i$ and its neighbors $v_j \in \Omega_{v_i}$, we summarize the received $k$ messages with a maximum operation:

$$m_i^{s+1} = \max_{j \in \Omega_{v_i}} (m_{j,i}^{s+1}) \in \mathbb{R}^{L'}. \tag{4}$$

Then, we update the state of node $v_i$ from $h_i^{s}$ to $h_i^{s+1}$. The update process should consider both the newly collected message $m_i^{s+1}$ and the previous state $h_i^{s}$. Recurrent neural networks and their variants [16, 47] can adaptively capture dependencies across different time steps. Hence, we utilize the Gated Recurrent Unit (GRU) [7] as the update function for its better convergence characteristic.
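Equations (2)-(4) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `aggregate_messages` is hypothetical, the fully connected layer $\phi_\theta$ is modeled as a single affine map with no activation, and the random node states merely stand in for the PFN output of Eq. (1).

```python
import numpy as np


def aggregate_messages(h, neighbors, weight, bias):
    """One round of message computation and aggregation (hypothetical helper).

    h:         (M, L)   node states h_i^s
    neighbors: (M, k)   neighbors[i] holds the indices j in the neighbor set of v_i
    weight:    (2L, L') weights of the fully connected layer phi_theta (assumed affine)
    bias:      (L',)    bias of phi_theta
    Returns:   (M, L')  aggregated messages m_i^{s+1}
    """
    h_i = h[:, None, :]                                # (M, 1, L)
    h_j = h[neighbors]                                 # (M, k, L)
    edge = h_j - h_i                                   # Eq. (2): e_{j,i}^s = h_j^s - h_i^s
    h_i_rep = np.broadcast_to(h_i, h_j.shape)          # repeat h_i^s for each neighbor
    pair = np.concatenate([h_i_rep, edge], axis=-1)    # [h_i^s, e_{j,i}^s], shape (M, k, 2L)
    msg = pair @ weight + bias                         # Eq. (3): m_{j,i}^{s+1}, shape (M, k, L')
    return msg.max(axis=1)                             # Eq. (4): channel-wise max over neighbors


# Toy setup: 5 nodes, L = 4, L' = 6, k = 2 (all sizes chosen for illustration).
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))                  # stand-in for PFN initial states
neighbors = np.array([[1, 2], [0, 2], [0, 1], [4, 0], [3, 0]])
weight = rng.normal(size=(8, 6))
bias = np.zeros(6)
m = aggregate_messages(h, neighbors, weight, bias)   # shape (5, 6)
```

Because the maximum is taken channel-wise, the aggregated message does not depend on the order in which the neighbors are listed.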
The update process is then formulated as follows:

$$h_i^{s+1} = \mathrm{GRU}(h_i^{s}, m_i^{s+1}) \in \mathbb{R}^{L}. \tag{5}$$

In this way, the new node state $h_i^{s+1}$ contains the information from all the neighbor nodes of $v_i$. Moreover, a neighbor node $v_j$ also collects information from its own neighbors $\Omega_{v_j}$. Consequently, after
