contrast, our method is free from object segmentation and hence circumvents the difficulty associated with motion segmentation in a dynamic setting.
The template-based approach is yet another method for
deformable surface reconstruction. Yu et al. [40] proposed
a direct approach to capture dense, detailed 3D geometry
of generic, complex non-rigid meshes using a single RGB
camera. While it works for generic surfaces, the requirement of a template prevents its wider application to more general scenes. Wang et al. [41] introduced a template-
free approach to reconstruct a poorly-textured, deformable
surface. Nevertheless, its success is restricted to a single
deforming surface rather than the entire dynamic scene.
Varol et al. [42] reconstructed deformable surfaces via piecewise reconstruction, assuming overlapping patches to be consistent over the entire surface, but this too is limited to the reconstruction of a single deformable surface.
While the conceptual idea of our work appeared in ICCV 2017, this journal version provides (i) an in-depth realization of our overall optimization; (ii) qualitative comparisons with [1] and Video-PopUp [39], as well as a statistical comparison with a deep-learning method [43]; (iii) a comprehensive ablation study showing the importance of each term in the overall optimization; (iv) an extensive performance analysis showing the effect of varying the number of superpixels, the choice of k-nearest neighbors, the choice of dense optical flow algorithm, and the shape of the superpixels; and (v) a detailed discussion of the failure cases, the choice of the Euclidean metric for nearest-neighbor graph construction, and the limitations of our work with possible directions for improvement.
3 MOTIVATION AND CONTRIBUTION
The formulation proposed in this work is motivated by the following observations about dense structure from motion for a dynamic scene.
3.1 Object-level motion segmentation
To solve dense reconstruction of an entire dynamic scene from perspective images, the first step usually practiced is to perform object-level motion segmentation to infer distinct motion models for the multiple rigidly moving objects in the scene. As alluded to before, dense segmentation of moving objects in a dynamic scene is in itself a challenging task. Moreover, non-rigidly moving objects may themselves be composed of a union of distinct motion models. Therefore, object-level segmentation built upon the assumption of per-object rigid motion will fail to describe a general dynamic scene. This motivates us to develop an algorithm that can recover a dense, detailed 3D model of a complex dynamic scene from its two perspective images, without object-level motion segmentation as an essential intermediate step.
3.2 Separate treatment for rigid SfM and non-rigid SfM
Our investigation shows that algorithms for deformable object 3D reconstruction often differ from those for rigidly moving objects. Not only the solutions, but even the assumptions vary significantly, e.g., orthographic projection and low-rank shape [11] [12] [13] [15]. The reason for such inadequacy is perfectly valid due to the under-constrained nature of the problem itself. This motivated us to develop an algorithm that can provide “3D reconstruction of the entire dynamic scene and the non-rigidly deforming objects under similar assumptions and formulation.”
Although accomplishing this goal for arbitrary non-rigid deformation remains an open problem, experiments suggest that our framework, under the aforementioned assumptions about the scene and the deformation, can reconstruct a general dynamic scene irrespective of the scene rigidity type. This is thanks to recent advances in dense optical flow algorithms [44] [45], which can reliably capture smooth non-rigid deformation over frames. These robust dense optical flow algorithms allow us to exploit the local motion of deforming surfaces. Thus, our formulation is able to bridge the gap between rigid and non-rigid SfM.
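To make this concrete, below is a minimal sketch of how a dense flow field between two consecutive frames can be obtained. It uses OpenCV's Farneback method purely as an illustrative stand-in for the flow algorithms of [44] [45]; the file names and parameter values are assumptions, not the settings used in our experiments.

```python
import cv2

def dense_flow(img_ref, img_next):
    """Per-pixel displacement field from the reference image to the next image."""
    gray_ref = cv2.cvtColor(img_ref, cv2.COLOR_BGR2GRAY)
    gray_next = cv2.cvtColor(img_next, cv2.COLOR_BGR2GRAY)
    # Farneback dense flow: flow[y, x] = (dx, dy) for pixel (x, y) of the reference.
    flow = cv2.calcOpticalFlowFarneback(gray_ref, gray_next, None,
                                        0.5, 4, 21, 3, 7, 1.5, 0)
    return flow

# Hypothetical usage: every pixel of the reference image receives a correspondence
# in the next image, which is what lets us exploit local motion per surface patch.
# I      = cv2.imread("frame_0.png")
# I_next = cv2.imread("frame_1.png")
# flow   = dense_flow(I, I_next)
# x_next = (x_ref + flow[y_ref, x_ref, 0], y_ref + flow[y_ref, x_ref, 1])
```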
The main contributions of our work are as follows:
1) A framework for dense 3D reconstruction of a complex dynamic scene that dispenses with object-level motion segmentation.
2) A common framework for dense two-frame 3D recon-
struction of a complex dynamic scene (including de-
formable objects), which achieves state-of-the-art per-
formance.
3) A new idea to resolve the inherent relative scale ambiguity problem in monocular 3D reconstruction by exploiting the as-rigid-as-possible (ARAP) constraint [46]; an illustrative sketch of this idea follows this list.
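The intuition behind contribution 3) can be caricatured as follows: each superpixel's local reconstruction carries an unknown scale, and an ARAP-style term asks neighboring reconstructions to preserve inter-point distances across the two frames once those scales are applied, which couples the scales together. The sketch below is only an illustration under these assumptions; the variable names, the use of anchor points, and the simple distance residual are ours, not the paper's exact energy.

```python
import numpy as np

def arap_residuals(X_ref, X_next, scales, neighbors):
    """Illustrative ARAP-style residuals between neighboring superpixels.

    X_ref[i], X_next[i]: 3D anchor point of superpixel i in the reference and
    next frame, each local reconstruction known only up to the scale scales[i].
    neighbors: list of (i, j) index pairs from a nearest-neighbor graph.
    """
    res = []
    for i, j in neighbors:
        d_ref = np.linalg.norm(scales[i] * X_ref[i] - scales[j] * X_ref[j])
        d_next = np.linalg.norm(scales[i] * X_next[i] - scales[j] * X_next[j])
        # As-rigid-as-possible: the distance between neighbors should be preserved
        # between the two frames, which constrains the ratio of their scales.
        res.append(d_next - d_ref)
    return np.asarray(res)

# Minimizing the sum of squared residuals over the per-superpixel scales
# (with one scale fixed, e.g. scales[0] = 1, to remove the remaining global
# ambiguity) ties the otherwise independent local reconstructions together.
```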
4 OUTLINE OF THE ALGORITHM
Before providing the details of our algorithm, we would
like to introduce some common notations that are used
throughout the paper.
4.1 Notation
We represent two consecutive images as $I, I' : \Omega \rightarrow \mathbb{R}^3$, $\Omega \subset \mathbb{Z}^2$, also referred to as the reference image and the next image respectively. Vectors are represented by bold lower-case letters, such as '$\mathbf{x}$', and matrices are represented by bold upper-case letters, such as '$\mathbf{X}$'. The subscripts '$a$' and '$b$' denote an anchor point and a boundary point respectively; e.g., $\mathbf{x}_{a_i}$, $\mathbf{x}_{b_i}$ represent an anchor point and a boundary point corresponding to the $i^{th}$ superpixel in the image space. The 1-norm and 2-norm of a vector are denoted as $\|\cdot\|_1$ and $\|\cdot\|_2$ respectively. For matrices, the Frobenius norm is denoted as $\|\cdot\|_F$.
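As a concrete reference for this notation (purely illustrative, not part of the method), the corresponding norms in numpy are:

```python
import numpy as np

x = np.array([3.0, -4.0])             # a vector, bold 'x' in the text
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])            # a matrix, bold 'X' in the text

norm_1 = np.linalg.norm(x, ord=1)     # ||x||_1 = 7.0
norm_2 = np.linalg.norm(x, ord=2)     # ||x||_2 = 5.0
norm_f = np.linalg.norm(X, ord='fro') # ||X||_F = sqrt(30) ≈ 5.48
```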
4.2 Overview
We first over-segment the reference image into superpixels, then model the deformation of the scene by a union of piecewise rigid motions of these superpixels. Specifically, we divide the overall non-rigid reconstruction into a local rigid reconstruction of each superpixel, followed by an assembly process that glues all these individual local reconstructions together in a globally coherent manner. A minimal sketch of the over-segmentation step is given below.
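The sketch assumes scikit-image's SLIC as the superpixel algorithm and an illustrative segment count; the specific choices made in our experiments may differ.

```python
import numpy as np
from skimage import io
from skimage.segmentation import slic

# Over-segment the reference image I into superpixels; each superpixel will
# later receive its own local rigid reconstruction before the global assembly.
img = io.imread("reference_frame.png")  # reference image I (hypothetical path)
labels = slic(img, n_segments=1000, compactness=10.0, start_label=0)

superpixel_ids = np.unique(labels)
print(f"{len(superpixel_ids)} superpixels to reconstruct rigidly and then glue")
```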
While the concept of this divide-and-conquer procedure looks simple, there is however a fundamental difficulty (of scale indeterminacy) in its implementation. Scale indeterminacy refers to the well-known fact that, using a moving camera, one can only recover the 3D structure up to an unknown scale. In our method,