single scene from one of our datasets (see §3). Further
details about the architecture and precisely how we train our
networks can be found in the supplementary material.
2.3. Online ScoreNet Prediction Adaptation
Problem Formulation. A ScoreNet trained offline on an
RGB-D sequence of a scene, as in §2.2, can later be used
to relocalise new images in the same scene. This targets an
offline formulation of the relocalisation problem, in which
both training and testing are performed on the same scene,
and there are no constraints on the time available for training.
However, this formulation does not take into account
the practical requirements on a camera relocaliser for live
scenarios such as interactive dense SLAM [53], in which
it is infeasible to spend hours or even days training a relocaliser
on the scene of interest; rather, a relocaliser must be
trained online as the user moves around the scene, and then
be usable immediately when camera tracking fails.
To address such scenarios, we target the alternative online
formulation of the relocalisation problem proposed by
Cavallari et al. [13], in which there are three stages: offline
training (‘pre-training’), online training and testing. Offline
training is performed on sequences of RGB-D frames (with
known poses) from one or more scenes, generally other than
the target scene. Online training is then performed on a
single RGB-D sequence (again with known poses, e.g. as
produced by a camera tracker) from the target scene. Finally,
testing is performed on a single RGB or RGB-D image
whose pose is to be determined. (For interactive SLAM,
the idea is that a user will move around the scene at online
training time, either training a new relocaliser online, or
adapting a pre-trained relocaliser online to function in the
target scene. If and when camera tracking fails, the trained
relocaliser can then be used to recover the camera pose.)
Cavallari et al. [13, 12] described their online training
stage as ‘adaptation’ because they were adapting a pre-trained
regression forest to relocalise in the target scene.
In particular, they showed that the branching structure of
a scene coordinate regression forest can be seen as a
scene-independent way of clustering the pixels in an image based
on their appearance. Based on this insight, they adapted a
pre-trained forest to a new scene by emptying the reservoirs
in its leaves and refilling them with points from the new
scene at online training time, and then using the forest to
look up the reservoirs again to provide correspondences at
test time. Inspired by this approach, we show in this paper
how to adapt the predictions of a ScoreNet so that such
relocalisers can also be deployed in an online context.
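The empty-and-refill idea described above can be illustrated with a minimal Python sketch. This is not the implementation from [13, 12]: the class name `ReservoirAdapter`, the `route` callable (a stand-in for the fixed, pre-trained branching structure), the reservoir capacity and the use of standard reservoir sampling to decide which points to keep are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict


class ReservoirAdapter:
    """Hypothetical sketch: a fixed clustering function plus refillable reservoirs."""

    def __init__(self, route, capacity=4, seed=0):
        self.route = route          # assumed pre-trained and scene-independent
        self.capacity = capacity    # assumed fixed reservoir size
        self.rng = random.Random(seed)
        self.reservoirs = defaultdict(list)
        self.seen = defaultdict(int)

    def reset(self):
        """Empty every reservoir before adapting to a new scene."""
        self.reservoirs.clear()
        self.seen.clear()

    def add(self, feature, point):
        """Online training: store a 3D point from the new scene in its cluster."""
        leaf = self.route(feature)
        self.seen[leaf] += 1
        reservoir = self.reservoirs[leaf]
        if len(reservoir) < self.capacity:
            reservoir.append(point)
        else:
            # Reservoir sampling (an assumption): keep a uniform sample of points.
            j = self.rng.randrange(self.seen[leaf])
            if j < self.capacity:
                reservoir[j] = point

    def lookup(self, feature):
        """Test time: return candidate 3D points for a pixel's feature."""
        return list(self.reservoirs.get(self.route(feature), []))
```

In this sketch only the reservoir contents change between scenes; the `route` function that assigns pixels to clusters is left untouched, which mirrors the insight that the clustering itself is scene-independent.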
Reservoir Prediction. The adaptation scheme described
in [13, 12] was highly effective, but relied on the fact that
their forest does not predict points in any particular scene
directly, but instead predicts leaves containing reservoirs of
points, which can then be used to generate the needed
correspondences. These reservoirs can be refilled with points
from the new scene, which is what allowed their method to
work, but it is not straightforward to see how this scheme can be
transferred to ScoreNets that directly predict individual points in
the pre-training scene. To achieve this, we propose a
new scheme that, rather than clustering pixels into leaves
based on routing their associated feature vectors down a
regression forest, clusters them into cells in a grid placed
over their associated predictions in the pre-training scene
(see Figure 1). Note that this implicitly clusters pixels in
the input image based on their predicted pre-training scene
locations, rather than directly based on their appearance.
Intuitively, a ScoreNet, which has been deliberately trained to
map similar-looking pixels in an image to similar 3D points
in the pre-training scene, can in practice do this for images
of any scene, not just the one on which it was trained, and
hence pre-training scene location can be used as a reasonable
proxy for appearance (see §3.2 for a discussion).
As mentioned in §2.2, our ScoreNets take an RGB image
of size w×h as input, and produce as output a w/8×h/8×3
tensor that contains a predicted 3D point (in the scene on
which the ScoreNet was trained) for a regularly-spaced subset
of pixels in the image. We initially map each of these
predicted points, $p = (p_x, p_y, p_z) \in \mathbb{R}^3$, to a grid cell index
as follows. First, we imagine placing a bounded regular cubic
grid, with cells of side length $\ell$ and an overall side length
of $C\ell$, over the pre-training scene, as shown in Figure 1.
(The $C$ and $\ell$ values we use can be found in the supplementary
material.) Next, for each dimension $k \in \{x, y, z\}$, we
compute an index $g(p_k) \in [0 \,..\, C)$ via

$$g(p_k) = \mathrm{clamp}\left( \left\lfloor \frac{p_k}{\ell} + \frac{C}{2} \right\rfloor,\ 0,\ C - 1 \right). \tag{1}$$

Finally, we combine these three dimension-wise indices
into a grid cell index, G(p), via

$$G(p) = C^2 \, g(p_z) + C \, g(p_y) + g(p_x). \tag{2}$$

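Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, the per-axis value is assumed to be rounded down to an integer before clamping, and the values of C and the cell side length used in any call are placeholders (the real ones are given only in the supplementary material).

```python
import math


def clamp(value, lo, hi):
    """Restrict value to the closed interval [lo, hi]."""
    return max(lo, min(value, hi))


def axis_index(p_k, cell_size, C):
    """Equation (1): per-axis cell index in [0 .. C), assuming a floor."""
    return clamp(math.floor(p_k / cell_size + C / 2), 0, C - 1)


def grid_cell_index(p, cell_size, C):
    """Equation (2): raster index of the cell containing point p = (x, y, z)."""
    gx, gy, gz = (axis_index(coord, cell_size, C) for coord in p)
    return C * C * gz + C * gy + gx
```

With the placeholder values C = 4 and a cell side length of 1, the origin falls in cell 42 (each per-axis index is 2), while a point far outside the grid is clamped to a boundary cell, so that the result always lies in the range 0 to C^3 − 1.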
This initial raster-based mapping produces grid cell indices
in the range $[0 \,..\, C^3)$, but in practice, it is undesirable for
memory reasons to try to allocate a reservoir for every cell
in the grid. Each reservoir may need to store many point
clusters, and must be allocated upfront on the GPU with a
fixed size. As a result, if every cell in the grid must have a
reservoir, then C must be kept small to avoid exceeding the
available GPU memory, limiting the size of scene we can
handle with our approach.
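The cubic growth behind this memory constraint can be made concrete with a short calculation. The sketch below is illustrative only: the function name, the number of slots per reservoir and the bytes per slot are hypothetical values, not figures taken from the paper.

```python
def dense_reservoir_bytes(C, slots_per_reservoir, bytes_per_slot):
    """Bytes needed if each of the C^3 grid cells has its own fixed-size reservoir."""
    return C ** 3 * slots_per_reservoir * bytes_per_slot


# Hypothetical sizes, chosen only to show how quickly the total grows with C.
for C in (32, 64, 128):
    gib = dense_reservoir_bytes(C, slots_per_reservoir=1024, bytes_per_slot=16) / 2 ** 30
    print(f"C = {C:3d}: {gib:.1f} GiB")
```

Under these made-up sizes, doubling C from 64 to 128 raises the requirement from 4 GiB to 32 GiB, which is why a dense allocation forces C to stay small.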
Fortunately, however, there is no need for every grid cell
to have a reservoir: as noted by [54], most cells in a scene
are empty in practice, and we can exploit this observation to
store a sparse set of reservoirs for only those cells that contain
predicted points. To achieve this, rather than using the
grid cell indices produced as above directly, we instead allocate
a fixed-size buffer of N reservoirs upfront, and construct
a lookup table T during online training that can be