COCO background images [21] while varying brightness and contrast. This lets the network generalize to real images and enables 6D detection at 10 Hz. Like us, they rely on Iterative Closest Point (ICP) post-processing using depth data for very accurate distance estimation. In contrast, we do not treat 3D orientation estimation as a classification task.
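As a rough illustration of this kind of augmentation, the following Python sketch (using PIL; the function name, crop size and jitter ranges are illustrative assumptions rather than details from [17]) pastes a rendered object crop onto a randomly chosen COCO background and varies brightness and contrast:

    import random
    from PIL import Image, ImageEnhance

    def augment_render(render_rgba, coco_paths, size=(128, 128)):
        """Paste an RGBA object rendering onto a random COCO background and jitter it."""
        bg = Image.open(random.choice(coco_paths)).convert("RGB").resize(size)
        obj = render_rgba.resize(size)
        bg.paste(obj, mask=obj.split()[-1])  # alpha channel masks out the rendering background
        bg = ImageEnhance.Brightness(bg).enhance(random.uniform(0.6, 1.4))
        bg = ImageEnhance.Contrast(bg).enhance(random.uniform(0.6, 1.4))
        return bg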
2.2 Learning representations of 3D orientations
We describe the difficulties of training with fixed SO(3) parameterizations, which will motivate the learning of object-specific representations.
Regression. Since rotations live in a continuous space, it seems natural to directly regress a fixed SO(3) parameterization such as quaternions. However, representational constraints and pose ambiguities can introduce convergence issues [32]. In practice, direct regression approaches to full 3D object orientation estimation have not been very successful [23].
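For illustration, a minimal PyTorch sketch of such direct quaternion regression (a generic, assumed formulation, not the method of this paper or of [32,23]) already has to renormalize the network output and account for the fact that q and -q describe the same rotation:

    import torch

    def quaternion_loss(pred, target):
        # pred: raw network output, target: ground-truth unit quaternion
        pred = pred / pred.norm(dim=-1, keepdim=True)  # enforce unit norm
        # q and -q encode the same rotation, so take the smaller of the two distances
        return torch.min((pred - target).pow(2).sum(-1),
                         (pred + target).pow(2).sum(-1)).mean()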
Classification of 3D object orientations requires a discretization of SO(3). Even rather coarse intervals of ∼5° lead to over 50,000 possible classes. Since each class appears only sparsely in the training data, this hinders convergence. In SSD6D [17] the 3D orientation is learned by separately classifying a discretized viewpoint and in-plane rotation, thus reducing the complexity to O(n²). However, for non-canonical views, e.g. if an object is seen from above, a change of viewpoint can be nearly equivalent to a change of in-plane rotation, which yields ambiguous class combinations. In general, the relation between different orientations is ignored when performing one-hot classification.
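A back-of-the-envelope count illustrates this reduction; the bin sizes below are illustrative assumptions and not values taken from [17]:

    n = 360 // 5                   # 72 bins per rotational degree of freedom (~5 degrees)
    joint_so3 = n * (n // 2) * n   # naive Euler-angle grid over SO(3): O(n^3) = 186624 classes
    separate  = n * (n // 2) + n   # viewpoint classes plus in-plane classes: O(n^2) = 2664 classes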
Symmetries are a severe issue when relying on fixed representations of 3D orientations since they cause pose ambiguities (Fig. 2). If not manually addressed, identical training images can be assigned different orientation labels, which can significantly disturb the learning process. In order to cope with ambiguous objects, most approaches in the literature are manually adapted [40,9,17,28]. The strategies range from ignoring one axis of rotation [40,9], over adapting the discretization to the object [17], to training an extra CNN to predict symmetries [28]. These are tedious, manual ways to filter out object symmetries (Fig. 2a) in advance, and ambiguities caused by self-occlusions (Fig. 2b) and occlusions (Fig. 2c) are even harder to address. Symmetries affect not only regression and classification methods, but any learning-based algorithm that discriminates object views solely by fixed SO(3) representations.
Descriptor Learning can be used to learn a representation that relates object views in a low-dimensional space. Wohlhart et al. [40] introduced a CNN-based descriptor learning approach using a triplet loss that minimizes the Euclidean distance between views with similar object orientations and maximizes it between dissimilar ones. Although