End-to-end Learning of Multi-sensor 3D Tracking by Detection
Davi Frossard Raquel Urtasun
Uber Advanced Technologies Group
University of Toronto
{frossard, urtasun}@uber.com
Abstract— In this paper we propose a novel approach to
tracking by detection that can exploit both camera and
LIDAR data to produce very accurate 3D trajectories. Towards
this goal, we formulate the problem as a linear program that
can be solved exactly, and learn convolutional networks for
detection as well as matching in an end-to-end manner. We
evaluate our model on the challenging KITTI dataset and show
very competitive results.
I. INTRODUCTION
One of the fundamental tasks in perception systems for
autonomous driving is to track traffic participants.
This task, commonly referred to as multi-target tracking,
consists of identifying how many objects there are in each
frame, as well as linking their trajectories over time. Despite
many decades of research, tracking is still an open problem.
Challenges include dealing with object truncation, high-speed
targets, lighting conditions, sensor motion and complex
interactions between targets, which lead to occlusion and
path crossing.
Most modern computer vision approaches to multi-target
tracking are based on tracking by detection [1], where a set
of candidate objects is first identified via object detectors.
These detections are then associated over time in a
second step by solving a discrete problem. Both tracking and
detection are typically formulated in 2D, and a variety of
cues based on appearance and motion are exploited.
In robotics, tracking-by-filtering methods are more prevalent,
where the input is filtered in search of moving objects and their
state is predicted over time [2]. LIDAR-based approaches
are the most common option for 3D tracking, since this
sensor provides an accurate spatial representation of the world,
allowing for precise positioning of the objects of interest.
However, matching is more difficult, as LIDAR does not
capture appearance well when compared to the richness of
images.
In this paper, we propose an approach that can take
advantage of both LIDAR and camera data. Towards this goal,
we formulate the problem as inference in a deep structured
model, where the potentials are computed using convolutional
neural nets. Notably, our matching cost for associating two
detections exploits both appearance and motion via a siamese
network that processes images and motion representations
through convolutional layers. Inference in our model can be
done exactly and efficiently by a set of feedforward passes
followed by solving a linear program. Importantly, our model
is formulated such that it can be trained end-to-end to solve
both the detection and tracking problems. We refer the reader
to Figure 1 for an overview of our approach.
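To make the role of the matching cost concrete, below is a
minimal sketch in PyTorch-style Python. It is not the exact
architecture of this paper: the layer sizes, the 6-D motion
descriptor, and all names (MatchingNet, appearance, spatial)
are assumptions for illustration. The key property it shows is
weight sharing in the appearance branch, so both detections
are embedded in the same feature space before fusion with the
spatial/motion features.

import torch
import torch.nn as nn

class MatchingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Appearance branch: one conv stack with shared ("siamese")
        # weights, applied to the image crop of each detection.
        self.appearance = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Spatial/motion branch: consumes a motion descriptor, e.g. the
        # 3D box displacement after ego-motion compensation (assumed 6-D).
        self.spatial = nn.Sequential(nn.Linear(6, 32), nn.ReLU())
        # Fusion head producing a scalar matching score for the pair.
        self.score = nn.Sequential(
            nn.Linear(64 + 64 + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, crop_t, crop_t1, motion):
        f_t = self.appearance(crop_t)    # embedding of detection at frame t
        f_t1 = self.appearance(crop_t1)  # embedding of detection at frame t+1
        f_m = self.spatial(motion)       # embedding of the motion descriptor
        return self.score(torch.cat([f_t, f_t1, f_m], dim=1))

# Usage: score one candidate pair of 64x64 crops with a 6-D motion descriptor.
net = MatchingNet()
score = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64), torch.randn(1, 6))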
II. RELATED WORK
Recent work on multiple object tracking falls along
two fronts: filtering-based and batch-based methods.
Filtering-based methods rely on the Markov assumption
to estimate the posterior distribution of the trajectories.
Bayesian or Monte Carlo filtering methods such as Gaussian
Processes [3], Particle Filters and Kalman Filters [2] are
commonly employed. One advantage of filtering approaches
is their efficiency, which allows for real-time applications.
However, they suffer from the propagation of early errors,
which are hard to mitigate. To tackle this shortcoming, batch
methods utilize object hypotheses from a detector (tracking
by detection) over entire sequences to estimate trajectories,
which allows for global optimization and the use of higher-level
cues. Estimating trajectories then becomes a data association
problem, i.e., deciding which detections from the set should
be linked to form correct trajectories. The association can be
estimated with Markov Chain Monte Carlo (MCMC) [4], [5],
linear programming [6], [7] or with a flow graph [8].
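As a toy illustration of linear-programming-based association
(not the formulation of this paper or of any specific work
cited above), consider linking detections across two consecutive
frames. With at-most-one-link constraints per detection, the
constraint matrix is totally unimodular, so the LP relaxation is
integral and an off-the-shelf solver recovers the exact discrete
assignment. The affinity values below are made up for the example.

import numpy as np
from scipy.optimize import linprog

# scores[i, j]: affinity of linking detection i at frame t
# to detection j at frame t+1 (illustrative values).
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.1]])
n, m = scores.shape
c = -scores.flatten()  # linprog minimizes, so negate the affinities

# Each detection links to at most one counterpart:
# every row sum and column sum of the link matrix is <= 1.
A_ub = np.zeros((n + m, n * m))
for i in range(n):
    A_ub[i, i * m:(i + 1) * m] = 1  # row-sum constraint for detection i
for j in range(m):
    A_ub[n + j, j::m] = 1           # column-sum constraint for detection j
b_ub = np.ones(n + m)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1), method="highs")
links = res.x.reshape(n, m).round().astype(int)
print(links)  # [[1 0 0], [0 1 0]]: the two high-affinity links are chosen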
Online methods have also been proposed in order to tackle
the runtime limitations of batch methods [1], [9]. Milan et
al. [10] use Recurrent Neural Networks (RNNs) to encode the
state space and solve the association problem.
Our work also expands on previous research on pixel
matching, which has typically been used for stereo estimation
and includes methods such as random forest classifiers [11],
Markov random fields (MRFs) [12] and, more classically,
slanted-plane models [13]. In our research, we focus on a
deep learning approach to the matching problem by exploiting
convolutional siamese networks [14], [15]. Previous methods,
however, focused on matching pairs of small image patches.
In [16], deep learning is exploited for tracking. However, this
approach is only similar to our method at a very high level:
using deep learning in a tracking by detection framework.
Our appearance matching is based on a fully convolutional
network with no requirements for optical flow, and learning is
done strictly via backpropagation. Furthermore, we reason in
3D, and the spatial branch of our matching networks corrects
for things such as ego-motion and car resemblance. In contrast,
[16] uses optical flow and is trained piecewise using Gradient
Boosting.
Tracking methods usually employ hand-crafted feature
extractors with distance functions such as Chi-Square or