K.-N. Lianos, J.L. Schönberger, M. Pollefeys, and T. Sattler
or to obtain richer map representations [39, 42]. Conversely, VO can be used to
improve object detection [11,15,25,38]. Most similar to our approach are object-
based SLAM [3, 5, 7, 15, 40, 43] and Structure-from-Motion [4, 16] approaches.
They use object detections as higher-level semantic features to improve camera
pose tracking [4, 7, 15, 16] and/or to detect and handle loop closures [3, 5, 15, 43].
While some approaches rely on a database of specific objects that are detected
online [7, 15,43], others use generic object detectors [3,5,16]. The former require
that all objects are known and mapped beforehand. The latter need to solve a
data association problem to resolve the ambiguities arising from detecting the
same object class multiple times in an image. Bowman et al. were the first to
jointly optimize over continuous camera poses, 3D point landmarks, and object
landmarks (represented by bounding volumes [5, 16]) as well as over discrete
data associations [5]. They use a probabilistic association model to avoid the
need for hard decisions. In contrast, our approach avoids discrete data
associations by considering continuous distances to object boundaries rather
than individual object detections. By focusing on the boundaries of semantic objects,
we are able to handle a larger corpus of semantic object classes. Specifically,
we can use both convex objects and semantic classes that cannot be described
by bounding boxes, such as street, sky, and building. Compared
to [5], who focus on handling loop closures, our approach aims at reducing drift
through medium-term continuous data associations.
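The difference between hard and probabilistic data association can be illustrated with a small sketch: instead of committing a landmark to its single nearest detection, every detection of the matching class receives a weight. The function name, the Gaussian weighting, and the use of detection centers below are illustrative assumptions, not the exact formulation of [5].

```python
import numpy as np

def soft_association_weights(landmark_proj, detections, sigma=10.0):
    """Soft data association: weight each candidate detection instead of
    picking one. `landmark_proj` is the (2,) projected image location of a
    3D landmark; `detections` is (N, 2) centers of N detections of the
    same object class. (Illustrative assumption, not the model of [5].)"""
    # Squared pixel distances between the projection and each detection.
    d2 = np.sum((np.asarray(detections) - np.asarray(landmark_proj)) ** 2,
                axis=1)
    # Gaussian likelihoods normalized to a distribution over detections
    # (a softmax over negative squared distances).
    logits = -d2 / (2.0 * sigma**2)
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

weights = soft_association_weights([100.0, 50.0],
                                   [[102.0, 51.0], [300.0, 60.0]])
```

Keeping all weights avoids a hard, possibly wrong commitment; a far-away detection simply receives negligible influence on the optimization.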
Semantic image-to-model alignment methods use semantics to align images
with 3D models [8, 45, 50, 51]. Cohen et al. stitch visually disconnected models
by measuring the quality of an alignment using 3D point projections into a
semantically segmented image. Taneja et al. estimate an initial alignment between
a panorama and a 3D model based on semantic segmentation [50]. They then
alternate between improving the segmentation and the alignment. Most closely
related to our approach is concurrent work by Toft et al. [51], who project
semantically labeled 3D points into semantically segmented images. Similar to us,
they construct error maps for each class via distance fields. Given an initial guess
for the camera pose, the errors associated with the 3D points are then used to
refine the pose. They apply their approach to visual localization and thus assume
a pre-built and pre-labeled 3D model. In contrast, our approach is designed for
VO and optimizes camera poses via a semantic error term while simultaneously
constructing a labeled 3D point cloud. Toft et al. incrementally include more
classes in the optimization and fix parts of the pose at some point. In contrast,
our approach directly considers all classes.
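The per-class error maps via distance fields, shared by [51] and our approach, can be sketched as follows: for each class, a map stores at every pixel the distance to the nearest pixel of that class, so a labeled 3D point projected into the image incurs a continuous cost. The function name and the use of a plain Euclidean distance transform are assumptions for illustration; the papers' exact error definitions may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def semantic_error_maps(seg, classes):
    """Build one error map per class from a (H, W) integer label image.
    Each map holds, at every pixel, the Euclidean distance to the nearest
    pixel of that class. (Illustrative sketch of a distance-field error.)"""
    # For class c, pixels where seg != c are nonzero, so the EDT returns
    # their distance to the nearest seg == c pixel; class-c pixels get 0.
    return {c: distance_transform_edt(seg != c) for c in classes}

# Toy segmentation: bottom three rows labeled class 1 (e.g. "road").
seg = np.zeros((5, 5), dtype=int)
seg[2:, :] = 1
maps = semantic_error_maps(seg, classes=[0, 1])
```

A 3D point labeled "road" that projects onto a road pixel then incurs zero error, while the error grows smoothly as the projection drifts away from the road region, yielding a differentiable cost without per-detection correspondences.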
3 Visual Semantic Odometry
The goal of this paper is to reduce drift in visual odometry by establishing
continuous medium-term correspondences. Since both direct and indirect VO
approaches are typically unable to track a point continuously over long periods
of time, we use scene semantics to establish such correspondences.