Are we ready for Autonomous Driving?
The KITTI Vision Benchmark Suite
Andreas Geiger and Philip Lenz
Karlsruhe Institute of Technology
{geiger,lenz}@kit.edu
Raquel Urtasun
Toyota Technological Institute at Chicago
rurtasun@ttic.edu
Abstract
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry / SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at:
www.cvlibs.net/datasets/kitti
1. Introduction
Developing autonomous systems that are able to assist humans in everyday tasks is one of the grand challenges in modern computer science. One example is autonomous driving systems, which can help decrease fatalities caused by traffic accidents. While a variety of novel sensors have been used in the past few years for tasks such as recognition, navigation and manipulation of objects, visual sensors are rarely exploited in robotics applications: autonomous driving systems rely mostly on GPS, laser range finders, radar and very accurate maps of the environment.
Figure 1. Recording platform with sensors (top-left), trajectory from our visual odometry benchmark (top-center), disparity and optical flow map (top-right) and 3D object labels (bottom).

In the past few years an increasing number of benchmarks have been developed to push forward the performance of visual recognition systems, e.g., Caltech-101 [17], Middlebury for stereo [41] and optical flow [2] evaluation. However, most of these datasets are simplistic, e.g., are taken in a controlled environment. A notable exception is the PASCAL VOC challenge [16] for detection and segmentation.
In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for stereo, optical flow, visual odometry / SLAM and 3D object detection. Our benchmarks are captured by driving around a mid-size city, in rural areas and on highways. Our recording platform is equipped with two high resolution stereo camera systems (grayscale and color), a Velodyne HDL-64E laser scanner that produces more than one million 3D points per second, and a state-of-the-art OXTS RT 3003 localization system which combines GPS, GLONASS, an IMU and RTK correction signals. The cameras, laser scanner and localization system are calibrated and synchronized, providing us with accurate ground truth. Table 1 summarizes our benchmarks and provides a comparison to existing datasets.
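Because the laser scanner and cameras are jointly calibrated, 3D laser points can be projected into the rectified camera images, which is how per-pixel ground truth is derived. The sketch below illustrates such a projection in its standard pinhole form; the matrix names and all numeric values are illustrative placeholders, not the actual calibration of our platform.

```python
import numpy as np

# Illustrative placeholder calibration (real values come from the
# dataset's calibration files, not from this paper excerpt).
P = np.array([[721.5, 0.0, 609.6, 44.9],
              [0.0, 721.5, 172.9, 0.2],
              [0.0, 0.0, 1.0, 0.003]])  # 3x4 rectified camera projection
R_rect = np.eye(4)                       # rectifying rotation (4x4)
Tr_velo_to_cam = np.eye(4)               # laser-to-camera rigid transform (4x4)

def project_velo_to_image(pts_velo):
    """Project Nx3 laser points into Nx2 pixel coordinates."""
    pts_h = np.hstack([pts_velo, np.ones((len(pts_velo), 1))])   # homogeneous Nx4
    cam = (P @ R_rect @ Tr_velo_to_cam @ pts_h.T).T              # Nx3 image-plane coords
    return cam[:, :2] / cam[:, 2:3]                              # perspective divide

# One point 10 m in front of the (identity-transformed) camera:
uv = project_velo_to_image(np.array([[2.0, 0.0, 10.0]]))
print(uv)  # pixel coordinates inside the 1240 x 376 image
```

Pixels for which no laser return exists remain unlabeled, which is why the resulting ground truth is semi-dense rather than complete.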
Our stereo matching and optical flow estimation benchmark comprises 194 training and 195 test image pairs at a resolution of 1240 × 376 pixels after rectification, with semi-dense (50%) ground truth. Compared to previous datasets [41, 2, 30, 29], this is the first one with realistic non-synthetic imagery and accurate ground truth. Dif-