2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Xu Yan1†, Jiantao Gao2†, Chaoda Zheng1†, Chao Zheng3, Ruimao Zhang1, Shuguang Cui1, Zhen Li1⋆

1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data, 2 Shanghai University, 3 Tencent Map, T Lab

⋆ Corresponding author: Zhen Li. † Equal first authorship.

arXiv:2207.04397v2 [cs.CV], 16 Sep 2022

Abstract. As camera and LiDAR sensors capture complementary information in autonomous driving, great efforts have been made to conduct semantic segmentation through multi-modality data fusion. However, fusion-based approaches require paired data, i.e., LiDAR point clouds and camera images with strict point-to-pixel mappings, as the inputs in both the training and inference stages. This seriously hinders their application in practical scenarios. Thus, in this work, we propose the 2D Priors Assisted Semantic Segmentation (2DPASS) method, a general training scheme, to boost representation learning on point clouds. The proposed 2DPASS method takes full advantage of 2D images with rich appearance during training, and then conducts semantic segmentation without strict paired-data constraints. In practice, by leveraging an auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD), 2DPASS acquires richer semantic and structural information from the multi-modal data, which is then distilled to the pure 3D network. As a result, our baseline model shows significant improvement with only point cloud inputs once equipped with 2DPASS. Specifically, it achieves state-of-the-art results on two large-scale recognized benchmarks (i.e., SemanticKITTI and NuScenes), ranking top-1 in both the single and multiple scan(s) competitions of SemanticKITTI. Code will be made available at https://github.com/yanx27/2DPASS.

Keywords: Semantic Segmentation, Multi-Modal, Knowledge Distillation, LiDAR Point Clouds

1 Introduction

Semantic segmentation plays a crucial role in large-scale outdoor scene understanding, which has broad applications in autonomous driving and robotics [1–3]. In the past few years, the research community has devoted significant effort to understanding natural scenes using either camera images [4–7] or LiDAR point clouds [2, 8–12] as the input. However, these single-modal methods inevitably face challenges in complex environments due to the inherent limitations of the input sensors.

Fig. 1. Limitation of fusion-based methods. When the self-driving car has only front-facing cameras with a limited perspective, as in the SemanticKITTI [16] dataset, while the 360-degree LiDAR has a much larger sensing range, fusion-based methods that require strict alignment between camera and LiDAR can only identify a small proportion of the point cloud (see the red region). (Figure panels: front-camera image and perspective projection; 360° LiDAR point cloud; point cloud in the camera perspective.)

Concretely, cameras provide dense color information and fine-grained texture, but they are ambiguous in depth sensing and unreliable in low-light conditions. In contrast, LiDARs robustly offer accurate and wide-ranging depth information regardless of lighting variance, but only capture sparse and textureless data. Since cameras and LiDARs complement each other, it is preferable to perceive the surroundings with both sensors.

Recently, many commercial cars have been equipped with both cameras and LiDARs.
This excites the research community to improve semantic segmentation by fusing the information from the two complementary sensors [13–15]. These approaches first establish a mapping between 3D points and 2D pixels by projecting the point clouds onto the image planes using the sensor calibrations. Based on this point-to-pixel mapping, the models fuse the corresponding image features into the point features, which are further processed to obtain the final semantic scores. Despite the improvements, fusion-based methods have the following unavoidable limitations: 1) Due to the difference in FOVs (fields of view) between cameras and LiDARs, the point-to-pixel mapping cannot be established for points that fall outside the image planes. Typically, the FOVs of the LiDAR and the cameras overlap only in a small portion (see Fig. 1), which significantly limits the applicability of fusion-based methods. 2) Fusion-based methods consume more computational resources, since they process both images and point clouds (in a multitask or cascaded manner) at runtime, which introduces a great burden on real-time applications.
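To make the point-to-pixel mapping and the FOV limitation just described concrete, the following is a minimal NumPy sketch. It assumes a KITTI-style calibration with a 4×4 LiDAR-to-camera extrinsic matrix and a 3×4 camera projection matrix; all names are illustrative and not taken from the 2DPASS codebase.

```python
import numpy as np

def point_to_pixel_mapping(points, T_lidar_to_cam, P_cam, img_h, img_w):
    """Project LiDAR points onto an image plane and keep only points inside the camera FOV.

    points:          (N, 3) LiDAR coordinates
    T_lidar_to_cam:  (4, 4) extrinsic transform from LiDAR to camera frame
    P_cam:           (3, 4) camera projection matrix (intrinsics x rectification)
    Returns the indices of points that project into the image and their (u, v) pixels.
    """
    # Homogeneous LiDAR coordinates -> camera frame.
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 4)
    pts_cam = (T_lidar_to_cam @ pts_h.T).T                                   # (N, 4)

    # Keep points in front of the camera (positive depth).
    in_front = pts_cam[:, 2] > 0.1

    # Perspective projection to pixel coordinates.
    uvw = (P_cam @ pts_cam.T).T                                              # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]

    # FOV check: the mapping only exists for pixels inside the image plane.
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    valid = np.where(in_front & inside)[0]
    return valid, uv[valid]
```

For a front-camera setup such as SemanticKITTI, `valid` typically covers only a small fraction of the 360° scan, which is exactly the limitation illustrated in Fig. 1.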
To address the above two issues, in this work we focus on improving semantic segmentation by leveraging both images and point clouds through an effective design. Considering that the sensors are moving through the scene, the part of the 360-degree LiDAR point cloud that does not overlap with the image at the same timestamp (see the gray region in the right part of Fig. 1) can be covered by images from other timestamps. Besides, the dense and structural information of images provides useful regularization for both seen and unseen point cloud regions. Based on these observations, we propose a "model-independent" training scheme, namely 2D Priors Assisted Semantic Segmentation (2DPASS), to enhance the representation learning of any 3D semantic segmentation network with minor structural modification. In practice, on the one hand, for the above-mentioned non-overlapping regions, 2DPASS takes pure point clouds as the inputs to train the segmentation model. On the other hand, for subregions with well-aligned point-to-pixel mappings, 2DPASS adopts an auxiliary multi-modal fusion to aggregate image and point features at each scale, and then aligns the 3D predictions with the fusion predictions. Unlike previous cross-modal alignment [17], which is apt to contaminate the modal-specific information, we design a multi-scale fusion-to-single knowledge distillation (MSFSKD) strategy to transfer extra knowledge to the 3D model while retaining its modal-specific ability. Compared with fusion-based methods, our solution has the following preferable properties: 1) Generality: it can be easily integrated with any 3D segmentation model with minor structural modification. 2) Flexibility: the fusion module is only used during training to enhance the 3D network; after training, the enhanced 3D model can be deployed without image inputs. 3) Effectiveness: even with only a small portion of overlapped multi-modality data, our method can significantly boost the performance. As a result, we evaluate 2DPASS with a simple yet strong baseline implemented with sparse convolutions [3]. The experiments show that 2DPASS brings noticeable improvements even over this strong baseline. Equipped with 2DPASS using multi-modal data, our model achieves the top-1 results on the single- and multiple-scan leaderboards of SemanticKITTI [16]. The state-of-the-art results on the NuScenes [18] dataset further confirm the generality of our method.

In general, the main contributions are summarized as follows.

– We propose 2D Priors Assisted Semantic Segmentation (2DPASS), which assists 3D LiDAR semantic segmentation with 2D priors from cameras. To the best of our knowledge, 2DPASS is the first method that distills multi-modal knowledge to the single point cloud modality for semantic segmentation.

– Equipped with the proposed multi-scale fusion-to-single knowledge distillation (MSFSKD) strategy, 2DPASS achieves significant performance gains on the SemanticKITTI and NuScenes benchmarks, ranking 1st on the single- and multiple-scan tracks of SemanticKITTI.

2 Related Work

2.1 Single-Sensor Methods

Camera-Based Methods. Camera-based semantic segmentation aims to predict pixel-wise labels for input 2D images. FCN [19] is the pioneer in semantic segmentation, proposing an end-to-end fully convolutional architecture based on image classification networks. Recent works have achieved significant improvements by exploring multi-scale feature learning [4,20,21], dilated convolution [5,22], and attention mechanisms [7,23]. However, camera-only methods are ambiguous in depth sensing and not robust in low-light conditions.

LiDAR-Based Methods. LiDAR data is generally represented as point clouds, and there are several mainstream ways to process point clouds with different representations. 1) Point-based methods approximate a permutation-invariant set function using a per-point Multi-Layer Perceptron (MLP). PointNet [24] is the pioneer in this field. Later on, many studies design point-wise MLP [25,26], adaptive-weight [27,28], and pseudo-grid [29,30] based methods to extract local features of point clouds, or exploit non-local operators [31–33] to learn long-distance dependency. However, point-based methods are not efficient in the LiDAR scenario, since their sampling and grouping algorithms are generally time-consuming. 2) Projection-based methods are very efficient approaches for LiDAR point clouds. They project point clouds onto 2D pixels so that traditional CNNs can play their normal role. Previous works project all points scanned by the rotating LiDAR onto 2D images by plane projection [34–36], spherical projection [37,38], or both [39]. However, the projection inevitably causes information loss, and projection-based methods currently face a bottleneck in segmentation accuracy. 3) Most recent works adopt voxel-based frameworks, since they balance efficiency and effectiveness, where sparse convolution (SparseConv) [3] is most commonly utilized. Compared to traditional voxel-based methods (i.e., 3D CNNs) that directly transform all points into 3D voxel grids, SparseConv stores only the non-empty voxels in a hash table and conducts convolutions only on these non-empty voxels in a more efficient way (a minimal sketch of this non-empty-voxel hashing follows this paragraph). Recently, many studies have used SparseConv to design more powerful network architectures. Cylinder3D [40] changes the original grid voxels to cylindrical ones and designs an asymmetrical network to boost the performance. AF2-S3Net [41] applies multiple branches with different kernel sizes, aggregating multi-scale features via an attention mechanism. 4) Very recently, there is a trend of exploiting multi-representation fusion methods. These methods combine the representations above (i.e., points, projection images, and voxels) and design feature fusion among the different branches. Tang et al. [10] combine point-wise MLPs in each sparse convolution block to learn a point-voxel representation and use NAS to search for a more efficient architecture. RPVNet [42] proposes a range-point-voxel fusion network to utilize information from the three representations. Nevertheless, these methods only take sparse and textureless LiDAR point clouds as inputs, and thus the appearance and texture in camera images have not been fully utilized.
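To illustrate the sparse-voxel bookkeeping that SparseConv relies on, here is a minimal sketch using a uniform grid; the dictionary-based hash table and names are illustrative only and are far simpler than the optimized GPU hash tables used by actual SparseConv libraries.

```python
import numpy as np

def voxelize_sparse(points, voxel_size=0.05):
    """Map points to non-empty voxels only, stored in a dict (hash table)
    keyed by integer voxel coordinates.

    points: (N, 3) array of LiDAR coordinates.
    Returns: dict {(ix, iy, iz): [point indices]} covering only occupied voxels.
    """
    coords = np.floor(points / voxel_size).astype(np.int32)  # (N, 3) voxel indices
    voxel_table = {}
    for idx, key in enumerate(map(tuple, coords)):
        voxel_table.setdefault(key, []).append(idx)
    return voxel_table

# A sparse convolution would then gather features only from occupied neighbors:
# for each occupied voxel, look up keys (ix+dx, iy+dy, iz+dz) in voxel_table and
# skip the (vast majority of) empty cells instead of convolving over a dense grid.
```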
2.2 Multi-Sensor Methods

Multi-sensor methods attempt to fuse information from the two complementary sensors and leverage the benefits of both camera and LiDAR [14,15,43,44]. RGBAL [14] converts RGB images to a polar-grid mapping representation and designs early and mid-level fusion strategies. PointPainting [15] exploits the segmentation logits of images and projects them to the LiDAR space by bird's-eye projection [23] or spherical projection [45] to improve the LiDAR network's performance. Recently, PMF [13] exploits a collaborative fusion of the two modalities in camera coordinates. However, these methods require multi-sensor inputs in both the training and inference phases. Moreover, the paired multi-modality data is usually computation-intensive to process and often unavailable in practical applications.

Fig. 2. 2D Priors Assisted Semantic Segmentation (2DPASS). It first crops a small patch from the original camera image as the 2D input. Then the cropped image patch and the LiDAR point cloud independently pass through the 2D and 3D encoders to generate multi-scale features in parallel. Afterwards, for each scale, complementary 2D knowledge is effectively transferred to the 3D network via the multi-scale fusion-to-single knowledge distillation (MSFSKD). The feature maps (in the form of either a pixel grid or a point set) are used to generate the final semantic scores with modal-specific decoders, which are supervised by pure 3D labels. (Figure components: camera image → 2D encoder/decoder supervised by generated 2D ground truth; LiDAR point cloud → 3D encoder/decoder supervised by 3D ground truth; MSFSKD (fusion + distillation) between the encoders; the 2D branch is used in training only, while the 3D branch is used in both training and inference.)

2.3 Cross-modal Knowledge Transfer

Knowledge distillation was initially proposed for compressing a large teacher network into a small student one [46]. Over the past few years, several subsequent studies have enhanced knowledge transfer by matching feature representations in different manners [47–50]; for instance, aligning attention maps [49] and Jacobian matrices [50] have been independently applied. With the development of multi-modal computer vision, recent research applies knowledge distillation to transfer priors across different modalities, e.g., exploiting extra 2D images in the training phase to improve performance at inference time [51–55]. Specifically, [56] introduces 2D-assisted pre-training, [57] inflates the kernels of 2D convolutions to 3D ones, and [58] applies a well-designed teacher-student framework. Inspired by but different from the above, we transfer 2D knowledge in a multi-scale fusion-to-single manner, which additionally takes care of the modal-specific knowledge.
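As background for the distillation discussed above, the following is a minimal PyTorch sketch of the classic logit-based knowledge distillation of [46], i.e., a KL divergence between temperature-softened teacher and student distributions. It is the generic formulation only; in a fusion-to-single setting the "teacher" signal would come from the auxiliary fused branch rather than a separate large network.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic soft-target distillation [46]: KL divergence between the
    temperature-softened teacher and student class distributions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)   # teacher is not updated
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

In our setting, the multi-scale fusion-to-single variant (MSFSKD) applies such an alignment at every scale while keeping the fusion branch training-only.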
3 Method

3.1 Framework Overview

This paper focuses on improving LiDAR point cloud semantic segmentation, which aims to assign a semantic label to each point. To handle the difficulties of large-scale outdoor LiDAR point clouds, i.e., sparsity, varying density, and lack of texture, we introduce strong regularization and priors from 2D camera images through fusion-to-single knowledge transfer.

The workflow of our 2D Priors Assisted Semantic Segmentation (2DPASS) is shown in Fig. 2. Since the camera images are fairly large (e.g., 1242 × 512), sending the original ones through our multi-modal pipeline is intractable. Therefore, we randomly sample a small patch (480 × 320) from the original camera image as the 2D input [17], which accelerates training without a performance drop. Then the cropped image patch and the LiDAR point cloud pass through independent 2D and 3D encoders, where multi-scale features from the two backbones are extracted in parallel. Afterwards, multi-scale fusion-to-single knowledge distillation (MSFSKD) is conducted to enhance the 3D network using the multi-modal features, i.e., fully utilizing the texture- and color-aware 2D priors while retaining the original 3D-specific knowledge. Finally, the 2D and 3D features at each scale are used to generate semantic segmentation predictions, which are supervised by pure 3D labels. During inference, the 2D-related branch can be discarded, and only the enhanced 3D network is deployed.
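The training-versus-inference split described above can be summarized with the following schematic PyTorch sketch. All module and variable names (`Toy2DPASS`, `fuse`, the point-to-pixel index pairs, etc.) are placeholders for illustration; they do not reflect the actual 2DPASS implementation, and tiny linear layers stand in for the real multi-scale encoders and decoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy2DPASS(nn.Module):
    """Schematic of the 2DPASS training scheme: the 2D branch and the fusion head
    are used only at training time; inference uses the 3D branch alone."""

    def __init__(self, c2d=64, c3d=64, num_classes=20):
        super().__init__()
        self.enc2d = nn.Linear(3, c2d)           # per-pixel features from the image patch
        self.enc3d = nn.Linear(4, c3d)           # per-point features from the point cloud
        self.dec3d = nn.Linear(c3d, num_classes)
        self.fuse = nn.Linear(c2d + c3d, num_classes)  # auxiliary fused head (training only)

    def forward(self, points, pixels=None, pt2px=None):
        feat3d = self.enc3d(points)
        logits3d = self.dec3d(feat3d)
        if self.training and pixels is not None:
            feat2d = self.enc2d(pixels)
            # Fuse image features into the points that have a valid pixel mapping.
            fused = self.fuse(torch.cat([feat3d[pt2px[:, 0]], feat2d[pt2px[:, 1]]], dim=-1))
            # Fusion-to-single distillation: push 3D predictions toward the fused ones.
            kd = F.kl_div(F.log_softmax(logits3d[pt2px[:, 0]], dim=-1),
                          F.softmax(fused.detach(), dim=-1), reduction="batchmean")
            return logits3d, fused, kd
        return logits3d  # inference: point cloud only, no 2D branch

# Usage sketch: at training time both modalities and the point-to-pixel index pairs
# are provided; at test time only the point cloud is needed.
model = Toy2DPASS()
pts = torch.randn(100, 4)                 # x, y, z, intensity
px = torch.randn(50, 3)                   # RGB of 50 mapped pixels
pairs = torch.stack([torch.randint(0, 100, (50,)), torch.arange(50)], dim=1)
logits3d, fused, kd_loss = model(pts, px, pairs)
model.eval()
logits_only3d = model(pts)
```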