2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Xu Yan1†, Jiantao Gao2†, Chaoda Zheng1†, Chao Zheng3, Ruimao Zhang1, Shuguang Cui1, Zhen Li1⋆

1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data, 2 Shanghai University, 3 Tencent Map, T Lab

⋆ Corresponding author: Zhen Li. † Equal first authorship.

arXiv:2207.04397v2 [cs.CV], 16 Sep 2022

Abstract. As camera and LiDAR sensors capture complementary information in autonomous driving, great efforts have been made to conduct semantic segmentation through multi-modality data fusion. However, fusion-based approaches require paired data, i.e., LiDAR point clouds and camera images with strict point-to-pixel mappings, as the inputs in both the training and inference stages. This seriously hinders their application in practical scenarios. Thus, in this work, we propose the 2D Priors Assisted Semantic Segmentation (2DPASS) method, a general training scheme, to boost representation learning on point clouds. The proposed 2DPASS method takes full advantage of 2D images with rich appearance during training, and then conducts semantic segmentation without strict paired-data constraints. In practice, by leveraging an auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD), 2DPASS acquires richer semantic and structural information from the multi-modal data, which is then distilled to the pure 3D network. As a result, our baseline model shows significant improvement with only point cloud inputs once equipped with 2DPASS. Specifically, it achieves state-of-the-art results on two large-scale recognized benchmarks (i.e., SemanticKITTI and NuScenes), ranking top-1 in both the single and multiple scan(s) competitions of SemanticKITTI. Code will be made available at https://github.com/yanx27/2DPASS.

Keywords: Semantic Segmentation, Multi-Modal, Knowledge Distillation, LiDAR Point Clouds

1 Introduction

Semantic segmentation plays a crucial role in large-scale outdoor scene understanding, which has broad applications in autonomous driving and robotics [1–3]. In the past few years, the research community has devoted significant effort to understanding natural scenes using either camera images [4–7] or LiDAR point clouds [2, 8–12] as the input. However, these single-modal methods inevitably face challenges in complex environments due to the inherent limitations of the input sensors.

Fig. 1. Limitation of fusion-based methods. When the self-driving car has only front-facing cameras with a limited perspective, as in the SemanticKITTI [16] dataset, while the 360-degree LiDAR has a much larger sensing range, fusion-based methods that require strict alignment between camera and LiDAR can only identify a small proportion of the point cloud (see the red region). (Figure panels: front-camera image and perspective projection; 360° LiDAR point cloud; point cloud in the camera perspective.)

Concretely, cameras provide dense color information and fine-grained texture, but they are ambiguous in depth sensing and unreliable in low-light conditions. In contrast, LiDARs robustly offer accurate and wide-ranging depth information regardless of lighting variance, but only capture sparse and textureless data. Since cameras and LiDARs complement each other, it is preferable to perceive the surroundings with both sensors.

Recently, many commercial cars have been equipped with both cameras and LiDARs.
This excites the research community to improve semantic segmentation by fusing the information from the two complementary sensors [13–15]. These approaches first establish a mapping between 3D points and 2D pixels by projecting the point clouds onto the image planes using the sensor calibrations. Based on this point-to-pixel mapping, the models fuse the corresponding image features into the point features, which are further processed to obtain the final semantic scores. Despite the improvements, fusion-based methods have the following unavoidable limitations: 1) Due to the difference in FOVs (fields of view) between cameras and LiDARs, the point-to-pixel mapping cannot be established for points that fall outside the image planes. Typically, the FOVs of the LiDAR and the cameras overlap only in a small portion (see Fig. 1), which significantly limits the applicability of fusion-based methods. 2) Fusion-based methods consume more computational resources, since they process both images and point clouds (in a multitask or cascaded manner) at runtime, which introduces a great burden on real-time applications.
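To make the point-to-pixel mapping and the FOV limitation just described concrete, the following is a minimal NumPy sketch. It assumes a KITTI-style calibration with a 4×4 LiDAR-to-camera extrinsic matrix and a 3×4 camera projection matrix; all names are illustrative and not taken from the 2DPASS codebase.

```python
import numpy as np

def point_to_pixel_mapping(points, T_lidar_to_cam, P_cam, img_h, img_w):
    """Project LiDAR points onto an image plane and keep only points inside the camera FOV.

    points:          (N, 3) LiDAR coordinates
    T_lidar_to_cam:  (4, 4) extrinsic transform from LiDAR to camera frame
    P_cam:           (3, 4) camera projection matrix (intrinsics x rectification)
    Returns the indices of points that project into the image and their (u, v) pixels.
    """
    # Homogeneous LiDAR coordinates -> camera frame.
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 4)
    pts_cam = (T_lidar_to_cam @ pts_h.T).T                                   # (N, 4)

    # Keep points in front of the camera (positive depth).
    in_front = pts_cam[:, 2] > 0.1

    # Perspective projection to pixel coordinates.
    uvw = (P_cam @ pts_cam.T).T                                              # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]

    # FOV check: the mapping only exists for pixels inside the image plane.
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    valid = np.where(in_front & inside)[0]
    return valid, uv[valid]
```

For a front-camera setup such as SemanticKITTI, `valid` typically covers only a small fraction of the 360° scan, which is exactly the limitation illustrated in Fig. 1.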
To address the above two issues, in this work we focus on improving semantic segmentation by leveraging both images and point clouds through an effective design. Considering that the sensors are moving through the scene, the part of the 360-degree LiDAR point cloud that does not overlap with the image at the same timestamp (see the gray region in the right part of Fig. 1) can be covered by images from other timestamps. Besides, the dense and structural information of images provides useful regularization for both seen and unseen point cloud regions. Based on these observations, we propose a "model-independent" training scheme, namely 2D Priors Assisted Semantic Segmentation (2DPASS), to enhance the representation learning of any 3D semantic segmentation network with minor structural modification. In practice, on the one hand, for the above-mentioned non-overlapping regions, 2DPASS takes pure point clouds as the inputs to train the segmentation model. On the other hand, for subregions with well-aligned point-to-pixel mappings, 2DPASS adopts an auxiliary multi-modal fusion to aggregate image and point features at each scale, and then aligns the 3D predictions with the fusion predictions. Unlike previous cross-modal alignment [17], which is apt to contaminate the modal-specific information, we design a multi-scale fusion-to-single knowledge distillation (MSFSKD) strategy to transfer extra knowledge to the 3D model while retaining its modal-specific ability. Compared with fusion-based methods, our solution has the following preferable properties: 1) Generality: it can be easily integrated with any 3D segmentation model with minor structural modification. 2) Flexibility: the fusion module is only used during training to enhance the 3D network; after training, the enhanced 3D model can be deployed without image inputs. 3) Effectiveness: even with only a small portion of overlapped multi-modality data, our method can significantly boost the performance. As a result, we evaluate 2DPASS with a simple yet strong baseline implemented with sparse convolutions [3]. The experiments show that 2DPASS brings noticeable improvements even over this strong baseline. Equipped with 2DPASS using multi-modal data, our model achieves the top-1 results on the single- and multiple-scan leaderboards of SemanticKITTI [16]. The state-of-the-art results on the NuScenes [18] dataset further confirm the generality of our method.

In general, the main contributions are summarized as follows.

– We propose 2D Priors Assisted Semantic Segmentation (2DPASS), which assists 3D LiDAR semantic segmentation with 2D priors from cameras. To the best of our knowledge, 2DPASS is the first method that distills multi-modal knowledge to the single point cloud modality for semantic segmentation.

– Equipped with the proposed multi-scale fusion-to-single knowledge distillation (MSFSKD) strategy, 2DPASS achieves significant performance gains on the SemanticKITTI and NuScenes benchmarks, ranking 1st on the single- and multiple-scan tracks of SemanticKITTI.

2 Related Work

2.1 Single-Sensor Methods

Camera-Based Methods. Camera-based semantic segmentation aims to predict pixel-wise labels for input 2D images. FCN [19] is the pioneer in semantic segmentation, proposing an end-to-end fully convolutional architecture based on image classification networks. Recent works have achieved significant improvements by exploring multi-scale feature learning [4,20,21], dilated convolution [5,22], and attention mechanisms [7,23]. However, camera-only methods are ambiguous in depth sensing and not robust in low-light conditions.

LiDAR-Based Methods. LiDAR data is generally represented as point clouds, and there are several mainstream ways to process point clouds with different representations. 1) Point-based methods approximate a permutation-invariant set function using a per-point Multi-Layer Perceptron (MLP). PointNet [24] is the pioneer in this field. Later on, many studies design point-wise MLP [25,26], adaptive-weight [27,28], and pseudo-grid [29,30] based methods to extract local features of point clouds, or exploit non-local operators [31–33] to learn long-distance dependency. However, point-based methods are not efficient in the LiDAR scenario, since their sampling and grouping algorithms are generally time-consuming. 2) Projection-based methods are very efficient approaches for LiDAR point clouds. They project point clouds onto 2D pixels so that traditional CNNs can play their normal role. Previous works project all points scanned by the rotating LiDAR onto 2D images by plane projection [34–36], spherical projection [37,38], or both [39]. However, the projection inevitably causes information loss, and projection-based methods currently face a bottleneck in segmentation accuracy. 3) Most recent works adopt voxel-based frameworks, since they balance efficiency and effectiveness, where sparse convolution (SparseConv) [3] is most commonly utilized. Compared to traditional voxel-based methods (i.e., 3D CNNs) that directly transform all points into 3D voxel grids, SparseConv stores only the non-empty voxels in a hash table and conducts convolutions only on these non-empty voxels in a more efficient way (a minimal sketch of this non-empty-voxel hashing follows this paragraph). Recently, many studies have used SparseConv to design more powerful network architectures. Cylinder3D [40] changes the original grid voxels to cylindrical ones and designs an asymmetrical network to boost the performance. AF2-S3Net [41] applies multiple branches with different kernel sizes, aggregating multi-scale features via an attention mechanism. 4) Very recently, there is a trend of exploiting multi-representation fusion methods. These methods combine the representations above (i.e., points, projection images, and voxels) and design feature fusion among the different branches. Tang et al. [10] combine point-wise MLPs in each sparse convolution block to learn a point-voxel representation and use NAS to search for a more efficient architecture. RPVNet [42] proposes a range-point-voxel fusion network to utilize information from the three representations. Nevertheless, these methods only take sparse and textureless LiDAR point clouds as inputs, and thus the appearance and texture in camera images have not been fully utilized.
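To illustrate the sparse-voxel bookkeeping that SparseConv relies on, here is a minimal sketch using a uniform grid; the dictionary-based hash table and names are illustrative only and are far simpler than the optimized GPU hash tables used by actual SparseConv libraries.

```python
import numpy as np

def voxelize_sparse(points, voxel_size=0.05):
    """Map points to non-empty voxels only, stored in a dict (hash table)
    keyed by integer voxel coordinates.

    points: (N, 3) array of LiDAR coordinates.
    Returns: dict {(ix, iy, iz): [point indices]} covering only occupied voxels.
    """
    coords = np.floor(points / voxel_size).astype(np.int32)  # (N, 3) voxel indices
    voxel_table = {}
    for idx, key in enumerate(map(tuple, coords)):
        voxel_table.setdefault(key, []).append(idx)
    return voxel_table

# A sparse convolution would then gather features only from occupied neighbors:
# for each occupied voxel, look up keys (ix+dx, iy+dy, iz+dz) in voxel_table and
# skip the (vast majority of) empty cells instead of convolving over a dense grid.
```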
2.2 Multi-Sensor Methods

Multi-sensor methods attempt to fuse information from the two complementary sensors and leverage the benefits of both camera and LiDAR [14,15,43,44]. RGBAL [14] converts RGB images to a polar-grid mapping representation and designs early and mid-level fusion strategies. PointPainting [15] exploits the segmentation logits of images and projects them to the LiDAR space by bird's-eye projection [23] or spherical projection [45] to improve the LiDAR network's performance. Recently, PMF [13] exploits a collaborative fusion of the two modalities in camera coordinates. However, these methods require multi-sensor inputs in both the training and inference phases. Moreover, the paired multi-modality data is usually computation-intensive to process and often unavailable in practical applications.

Fig. 2. 2D Priors Assisted Semantic Segmentation (2DPASS). It first crops a small patch from the original camera image as the 2D input. Then the cropped image patch and the LiDAR point cloud independently pass through the 2D and 3D encoders to generate multi-scale features in parallel. Afterwards, for each scale, complementary 2D knowledge is effectively transferred to the 3D network via the multi-scale fusion-to-single knowledge distillation (MSFSKD). The feature maps (in the form of either a pixel grid or a point set) are used to generate the final semantic scores with modal-specific decoders, which are supervised by pure 3D labels. (Figure components: camera image → 2D encoder/decoder supervised by generated 2D ground truth; LiDAR point cloud → 3D encoder/decoder supervised by 3D ground truth; MSFSKD (fusion + distillation) between the encoders; the 2D branch is used in training only, while the 3D branch is used in both training and inference.)

2.3 Cross-modal Knowledge Transfer

Knowledge distillation was initially proposed for compressing a large teacher network into a small student one [46]. Over the past few years, several subsequent studies have enhanced knowledge transfer by matching feature representations in different manners [47–50]; for instance, aligning attention maps [49] and Jacobian matrices [50] have been independently applied. With the development of multi-modal computer vision, recent research applies knowledge distillation to transfer priors across different modalities, e.g., exploiting extra 2D images in the training phase to improve performance at inference time [51–55]. Specifically, [56] introduces 2D-assisted pre-training, [57] inflates the kernels of 2D convolutions to 3D ones, and [58] applies a well-designed teacher-student framework. Inspired by but different from the above, we transfer 2D knowledge in a multi-scale fusion-to-single manner, which additionally takes care of the modal-specific knowledge.
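As background for the distillation discussed above, the following is a minimal PyTorch sketch of the classic logit-based knowledge distillation of [46], i.e., a KL divergence between temperature-softened teacher and student distributions. It is the generic formulation only; in a fusion-to-single setting the "teacher" signal would come from the auxiliary fused branch rather than a separate large network.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Classic soft-target distillation [46]: KL divergence between the
    temperature-softened teacher and student class distributions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits.detach() / t, dim=-1)   # teacher is not updated
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```

In our setting, the multi-scale fusion-to-single variant (MSFSKD) applies such an alignment at every scale while keeping the fusion branch training-only.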
3 Method

3.1 Framework Overview

This paper focuses on improving LiDAR point cloud semantic segmentation, which aims to assign a semantic label to each point. To handle the difficulties of large-scale outdoor LiDAR point clouds, i.e., sparsity, varying density, and lack of texture, we introduce strong regularization and priors from 2D camera images through fusion-to-single knowledge transfer.

The workflow of our 2D Priors Assisted Semantic Segmentation (2DPASS) is shown in Fig. 2. Since the camera images are fairly large (e.g., 1242 × 512), sending the original ones through our multi-modal pipeline is intractable. Therefore, we randomly sample a small patch (480 × 320) from the original camera image as the 2D input [17], which accelerates training without a performance drop. Then the cropped image patch and the LiDAR point cloud pass through independent 2D and 3D encoders, where multi-scale features from the two backbones are extracted in parallel. Afterwards, multi-scale fusion-to-single knowledge distillation (MSFSKD) is conducted to enhance the 3D network using the multi-modal features, i.e., fully utilizing the texture- and color-aware 2D priors while retaining the original 3D-specific knowledge. Finally, the 2D and 3D features at each scale are used to generate semantic segmentation predictions, which are supervised by pure 3D labels. During inference, the 2D-related branch can be discarded, and only the enhanced 3D network is deployed.
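The training-versus-inference split described above can be summarized with the following schematic PyTorch sketch. All module and variable names (`Toy2DPASS`, `fuse`, the point-to-pixel index pairs, etc.) are placeholders for illustration; they do not reflect the actual 2DPASS implementation, and tiny linear layers stand in for the real multi-scale encoders and decoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy2DPASS(nn.Module):
    """Schematic of the 2DPASS training scheme: the 2D branch and the fusion head
    are used only at training time; inference uses the 3D branch alone."""

    def __init__(self, c2d=64, c3d=64, num_classes=20):
        super().__init__()
        self.enc2d = nn.Linear(3, c2d)           # per-pixel features from the image patch
        self.enc3d = nn.Linear(4, c3d)           # per-point features from the point cloud
        self.dec3d = nn.Linear(c3d, num_classes)
        self.fuse = nn.Linear(c2d + c3d, num_classes)  # auxiliary fused head (training only)

    def forward(self, points, pixels=None, pt2px=None):
        feat3d = self.enc3d(points)
        logits3d = self.dec3d(feat3d)
        if self.training and pixels is not None:
            feat2d = self.enc2d(pixels)
            # Fuse image features into the points that have a valid pixel mapping.
            fused = self.fuse(torch.cat([feat3d[pt2px[:, 0]], feat2d[pt2px[:, 1]]], dim=-1))
            # Fusion-to-single distillation: push 3D predictions toward the fused ones.
            kd = F.kl_div(F.log_softmax(logits3d[pt2px[:, 0]], dim=-1),
                          F.softmax(fused.detach(), dim=-1), reduction="batchmean")
            return logits3d, fused, kd
        return logits3d  # inference: point cloud only, no 2D branch

# Usage sketch: at training time both modalities and the point-to-pixel index pairs
# are provided; at test time only the point cloud is needed.
model = Toy2DPASS()
pts = torch.randn(100, 4)                 # x, y, z, intensity
px = torch.randn(50, 3)                   # RGB of 50 mapped pixels
pairs = torch.stack([torch.randint(0, 100, (50,)), torch.arange(50)], dim=1)
logits3d, fused, kd_loss = model(pts, px, pairs)
model.eval()
logits_only3d = model(pts)
```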