work. In summary, our contributions are as follows:
• A guide to absolute pose estimation with deep learning, providing both theoretical background and practical advice.
• A cross-comparison of the performance and characteristics of over 20 deep learning pose estimators.
• A summary of existing and emerging trends in deep pose estimation, and of the current challenges and limitations.
1.1. Problem Definition
Given an image $I$, captured by a camera $C$, an absolute pose estimator tries to predict the 3D pose (orientation and location) of $C$ in world coordinates, defined for some arbitrary reference 3D model (a 'scene').
The translation of $C$ with respect to the origin (its location) is specified by a vector $\mathbf{x} \in \mathbb{R}^3$. The orientation of $C$ can be described with several alternative representations, such as a rotation matrix, a quaternion, and Euler angles. Most commonly, the quaternion representation is used, specifying the orientation as a vector $\mathbf{q} \in \mathbb{R}^4$. This representation obviates the need for orthonormalization, which is required for rotation matrices, and can be converted to a valid rotation by normalizing it to unit length [7]. One caveat of the quaternion representation is its potential ambiguity, since the two quaternions $\mathbf{q}$ and $-\mathbf{q}$ map to the same rotation operation. A variant of the Euler angle representation has been used to address this problem in some solutions [10]. In practice, however, the majority of pose estimators predict the quaternion representation (for a more extended review of the different representations of pose orientation, see [11]). The overall pose of $C$ is thus specified by the tuple $p = \langle \mathbf{x}, \mathbf{q} \rangle$.
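To make this concrete, the following NumPy sketch (helper names are ours) normalizes a predicted 4D vector to a unit quaternion, resolves the $\mathbf{q}$/$-\mathbf{q}$ ambiguity by fixing the sign of the scalar component, and maps the result to a rotation matrix:

```python
import numpy as np

def quat_to_rotation_matrix(q):
    """Convert a predicted 4D vector (scalar-first: w, x, y, z) to a valid
    3x3 rotation matrix. Normalizing q to unit length yields a legitimate
    rotation, avoiding the orthonormalization needed for raw matrix outputs."""
    q = np.asarray(q, dtype=np.float64)
    q = q / np.linalg.norm(q)  # project onto the unit sphere
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def canonicalize_quat(q):
    """Resolve the q / -q ambiguity by enforcing a non-negative scalar part
    (both quaternions encode the same rotation)."""
    q = np.asarray(q, dtype=np.float64)
    return -q if q[0] < 0 else q
```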
The APE problem can now be formally defined as the problem of estimating a function $F_{APE}$ that takes an image $I$ captured by a camera $C$ and outputs its respective pose:

$F_{APE}(I) = p = \langle \mathbf{x}, \mathbf{q} \rangle$    (1)
Note that the definition given in Eq. (1) can be extended to include additional inputs about the camera and the image (e.g., depth and the camera frustum).
A related problem, which is often solved jointly or in parallel to APE (for example, in visual odometry systems), is the relative pose estimation (RPE) problem. In an RPE setting, the estimator takes two images, $I_1$ and $I_2$, captured by $C$, and aims to predict the relative pose between them. Eq. (1) can be modified to capture this problem:

$F_{RPE}(I_1, I_2) = p_{12} = \langle \mathbf{x}_{12}, \mathbf{q}_{12} \rangle$    (2)
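As a rough illustration of Eq. (1), the following PyTorch-style sketch regresses $\langle \mathbf{x}, \mathbf{q} \rangle$ from an image; the ResNet-18 backbone and head sizes are illustrative assumptions, not any specific published architecture:

```python
import torch.nn as nn
import torchvision.models as models

class AbsolutePoseRegressor(nn.Module):
    """Minimal APE network: a CNN backbone feeding two regression heads,
    one for the location x (3D) and one for the orientation q (4D)."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # illustrative choice
        backbone.fc = nn.Identity()               # keep the 512-d features
        self.backbone = backbone
        self.fc_x = nn.Linear(512, 3)             # translation head
        self.fc_q = nn.Linear(512, 4)             # quaternion head

    def forward(self, image):
        features = self.backbone(image)
        x = self.fc_x(features)
        q = self.fc_q(features)
        q = q / q.norm(dim=1, keepdim=True)       # unit-norm quaternion
        return x, q                               # p = <x, q>, as in Eq. (1)
```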
1.2. Evaluation Metrics
In order to evaluate the performance of a pose estimator, we require a set of images and the ground truth poses of the camera(s) that captured them. Since the camera pose is defined with respect to the coordinates of some 3D model, such a model needs to be available. Typically, a 3D point cloud, associated with a set of images for training and testing, is provided either by the scanning device (e.g., Microsoft Kinect) or through reconstruction using structure-from-motion (SfM) methods. Popular SfM tools include Bundler [12], COLMAP [13,14], and VisualSFM [15].
Given a ground truth pose $p = \langle \mathbf{x}, \mathbf{q} \rangle$ and an estimated pose $\hat{p} = \langle \hat{\mathbf{x}}, \hat{\mathbf{q}} \rangle$, the localization error of $\hat{p}$ is measured by the deviations between the translation (location) and rotation (orientation) of $p$ and $\hat{p}$.
The translation error $E_{\mathbf{x}}$ is typically measured in meters and defined as the Euclidean distance between the ground truth and estimated locations:

$E_{\mathbf{x}} = \| \mathbf{x} - \hat{\mathbf{x}} \|_2$    (3)
The rotation error $E_{\mathbf{q}}$ is typically measured in degrees and corresponds to the minimum rotation angle required to align the ground truth and estimated orientations [16,17]:

$E_{\mathbf{q}} = \cos^{-1}\left( \frac{\mathrm{tr}(\mathbf{R}^{T}\hat{\mathbf{R}}) - 1}{2} \right)$    (4)

where $\mathbf{R}$ and $\hat{\mathbf{R}}$ are the ground truth and estimated rotation matrices, respectively, and $\mathrm{tr}(\mathbf{M})$ is the trace of $\mathbf{M}$. Using the quaternion representation, $E_{\mathbf{q}}$ is given by:

$E_{\mathbf{q}} = 2 \cos^{-1}\left( \left| \langle \mathbf{q}, \hat{\mathbf{q}} \rangle \right| \right)$    (5)
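For concreteness, the following NumPy sketch implements Eqs. (3)-(5); the function names are ours:

```python
import numpy as np

def translation_error(x_gt, x_est):
    """Eq. (3): Euclidean distance between ground truth and estimate (meters)."""
    return np.linalg.norm(np.asarray(x_gt) - np.asarray(x_est))

def rotation_error_matrices(R_gt, R_est):
    """Eq. (4): minimum rotation angle (degrees) aligning the two orientations."""
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)  # guard against numerical drift
    return np.degrees(np.arccos(cos_angle))

def rotation_error_quats(q_gt, q_est):
    """Eq. (5): the same angle from unit quaternions; taking |<q, q_hat>|
    handles the q / -q ambiguity."""
    dot = np.abs(np.dot(q_gt, q_est))
    return np.degrees(2.0 * np.arccos(np.clip(dot, -1.0, 1.0)))
```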
The relative pose error is computed in a similar manner
to the absolute pose error, based on the deviation between
the ground truth and estimated relative poses. It is typically
measured in [m/s] and [degree/s] (for translation and
rotation, respectively), capturing the drift when computed
over a sequence.
The translation and rotation errors are commonly reported as summary statistics (e.g., the median). Alternatively, some papers report the localization rate, defined as the percentage of images localized within given translation and rotation error thresholds (for example, translation and rotation errors smaller than or equal to 0.25 meters and 2 degrees, respectively).
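A minimal sketch of the localization rate computation, assuming per-image error lists and the 0.25 m / 2 deg thresholds mentioned above:

```python
def localization_rate(t_errors, r_errors, t_thresh=0.25, r_thresh=2.0):
    """Percentage of images whose translation AND rotation errors both fall
    within the given thresholds (e.g., 0.25 m / 2 deg)."""
    hits = [(t <= t_thresh) and (r <= r_thresh)
            for t, r in zip(t_errors, r_errors)]
    return 100.0 * sum(hits) / len(hits)
```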
2. Deep Architectures for Visual Absolute Pose Estimation
Traditionally, visual APE has been achieved with image
retrieval or structure-based approaches. Structure-based
methods typically rely on SfM (hence the name) to localize.
Specifically, SfM associates 3D points with 2D images that
capture them and with their local descriptors (found through