Fig. 2. 3-DTV signal processing and data transmission chain consisting
of five functional building blocks: 1) 3-D content creation; 2) 3-D video
coding; 3) transmission; 4) “virtual” view synthesis; 5) 3-D display.
Fig. 3. Functionality of the Zcam active range camera. (a) An infrared light
wall is emitted by the camera. (b) The reflected light wall carries an imprint
of the captured 3-D scene. (c) The 3-D information is extracted by blocking
the remaining incoming light with a very fast shutter (from [24]).
supplementary depth-images can be compressed using any
of the newer, more efficient additions to the MPEG family
of standards such as MPEG-4 Visual [22] or the latest
Advanced Video Coding (H.264/AVC) [23].
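As a rough illustration of this coding step (not taken from any reference
implementation; the clipping planes, file name, and codec choice are
assumptions), the following sketch quantizes metric per-pixel depth to 8-bit
values between near and far clipping planes and writes the result as a
grayscale sequence through a standard MPEG-4 Visual encoder:

```python
# Minimal sketch: prepare per-pixel depth for transmission with a standard
# video codec. Metric depth is quantized to 8 bit between assumed near/far
# clipping planes (inverse-depth mapping gives finer resolution close to
# the camera) and encoded via the "XVID" (MPEG-4 Visual) fourcc.
import cv2
import numpy as np

Z_NEAR, Z_FAR = 1.0, 10.0   # assumed clipping planes in meters

def quantize_depth(z: np.ndarray) -> np.ndarray:
    """Map metric depth to 8 bit, finer resolution near the camera."""
    z = np.clip(z, Z_NEAR, Z_FAR)
    v = (1.0 / z - 1.0 / Z_FAR) / (1.0 / Z_NEAR - 1.0 / Z_FAR)
    return np.round(255.0 * v).astype(np.uint8)

h, w = 576, 720
writer = cv2.VideoWriter("depth.avi", cv2.VideoWriter_fourcc(*"XVID"),
                         25.0, (w, h), isColor=False)
for _ in range(25):                      # one second of synthetic depth
    z = np.random.uniform(Z_NEAR, Z_FAR, (h, w))
    writer.write(quantize_depth(z))
writer.release()
```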
To allow for an easier understanding of the fundamental
ideas, the envisioned signal processing and data transmission
chain of the outlined 3-DTV concept is illustrated in Fig. 2.
It consists of five functional building blocks: 1) 3-D content
creation; 2) 3-D video coding; 3) transmission; 4) “virtual”
view synthesis; and 5) 3-D display.
A. 3-D Content Creation
A number of approaches are applicable for the creation
of 3-D content. In one very appealing scenario, novel 3-D
material is generated by simultaneously capturing video and
associated per-pixel depth information with an active range
camera such as the Zcam developed by 3DV Systems, Ltd.
[24] or the NHK Axi-vision HDTV camera [25]. These de-
vices integrate a high-speed pulsed infrared light source into
a conventional broadcast TV camera, and they relate the time
of flight of the emitted and reflected light walls to direct
measurements of the depth characteristics of the 3-D scene
(Fig. 3).
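In essence, the scene depth follows from the round-trip time of the emitted
light as $Z = c\,\Delta t/2$, where $c$ denotes the speed of light and
$\Delta t$ the measured delay between emission and detection (the factor of
two accounts for the light traveling to the object and back); a delay of
13 ns, for example, corresponds to a depth of roughly 2 m.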
The main drawback of current 3-D cameras is that they
are suited only for indoor use in studio environments
and that they cannot record more than relatively
small-scale scenes (up to a few meters of depth). Thus,
alternative approaches are required for the generation of
3-D data for larger scale, outdoor scenes. Here, the most
promising concepts are based on the simultaneous capturing
of multiview data using either traditional stereo cameras or
synchronized multicamera systems (Fig. 4). Given several
images of the spatial scenery, the 3-D geometry can be
reconstructed by applying techniques from computer vision
(CV) and photogrammetry [15], [26]–[28].
Fig. 4. The Penn State multicamera system. A cluster of up to six FireWire
cameras is used to generate depth information of a human participant in an
immersive telepresence application (from [28]).
In general, most
existing methods involve five basic steps: 1) geometric and
photometric calibration of the individual cameras; 2) estima-
tion of geometrical relations between the different views; 3)
an optical flow or correlation-based search for corresponding
points in two or more image planes; 4) localization of the
corresponding 3-D space points; and 5) integration of the
entire depth information into one or more camera reference
frames.
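For concreteness, a minimal OpenCV-based sketch of steps 3)–5) for one
calibrated, rectified stereo pair might look as follows (the input files
and the identity disparity-to-depth matrix are placeholders, and the
calibration of steps 1) and 2) is assumed to be done beforehand):

```python
# Sketch of steps 3)-5): correspondences are found by correlation-based
# matching along rectified epipolar lines, then converted to 3-D points
# in the left camera's reference frame. Q is the 4x4 disparity-to-depth
# matrix produced, e.g., by cv2.stereoRectify during calibration.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # assumed inputs
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Step 3: correlation-based search for corresponding points.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64,
                                blockSize=9)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Steps 4-5: localize the corresponding 3-D space points and integrate
# the depth information into the left camera's reference frame.
Q = np.eye(4, dtype=np.float32)   # placeholder; use the real rectification
points_3d = cv2.reprojectImageTo3D(disparity, Q)
depth_map = points_3d[:, :, 2]    # per-pixel Z in the reference frame
```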
Even with these novel “3-D capture” technologies at hand,
it seems clear that the demand for high-quality 3-D content
can only partially be satisfied with new recordings.
It will therefore be necessary—especially in the introductory
phase of the new 3-DTV technology—to also convert already
existing 2-D video material into 3-D using so-called “struc-
ture from motion” algorithms. In principle, such (offline or
online) methods process one or more monoscopic color video
sequences to: 1) establish a dense set of image point cor-
respondences from which information about the recording
camera as well as the 3-D structure of the scene can be de-
rived [26], [27], [29], [30] or 2) infer approximate depth
information from the relative movements of automatically
tracked image segments [31].
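A two-frame sketch of the first variant could look like this (the intrinsic
matrix K and the file names are illustrative assumptions): correspondences
are tracked by optical flow, the camera motion is derived from the essential
matrix, and sparse 3-D structure follows by triangulation, up to a global
scale factor that is inherently unknown in structure from motion:

```python
# Two-frame "structure from motion" sketch with assumed intrinsics K.
import cv2
import numpy as np

K = np.array([[700.0, 0, 360], [0, 700.0, 288], [0, 0, 1]])  # assumed

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# 1) Establish image point correspondences by feature tracking.
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01,
                             minDistance=7)
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)
p0, p1 = p0[status.ravel() == 1], p1[status.ravel() == 1]

# 2) Derive the recording camera's relative motion (R, t) from the
#    essential matrix, with RANSAC to reject outlier matches.
E, inliers = cv2.findEssentialMat(p0, p1, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, p0, p1, K, mask=inliers)

# 3) Triangulate sparse 3-D scene structure (up to global scale).
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([R, t])
X = cv2.triangulatePoints(P0, P1, p0.reshape(-1, 2).T, p1.reshape(-1, 2).T)
X = (X[:3] / X[3]).T   # homogeneous -> Euclidean 3-D points
```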
B. “Virtual” View Synthesis
DIBR is defined as the process of synthesizing “virtual”
views of a real-world scene from still or moving images and
associated per-pixel depth information [32], [33]. Conceptu-
ally, this novel view generation method can be understood as
a two-step procedure: first, the original image points are
reprojected into the 3-D world, utilizing the respective depth
values. Thereafter, these intermediate space points are pro-
jected into the image plane of a “virtual” camera located at
the required viewing position. The concatenation of reprojec-
tion (2-D to 3-D) and subsequent projection (3-D to 2-D) is
usually referred to as “3-D image warping” in the computer
graphics (CG) literature.
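The following NumPy sketch illustrates this two-step procedure under
simplifying assumptions (a shared intrinsic matrix K, a purely translated
“virtual” camera without rotation, and illustrative parameter values; the
symbols are not the paper's notation):

```python
# Minimal 3-D image warping sketch: reproject pixels into the 3-D world
# using their depth values, then project the intermediate space points
# into the image plane of a translated "virtual" camera.
import numpy as np

def warp_to_virtual_view(depth: np.ndarray, K: np.ndarray,
                         t: np.ndarray) -> np.ndarray:
    """Return the warped pixel coordinates in the virtual view."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Step 1: reproject original image points into the 3-D world,
    # utilizing the respective per-pixel depth values.
    rays = np.linalg.inv(K) @ pix            # normalized viewing rays
    points = rays * depth.reshape(1, -1)     # scale rays by depth Z

    # Step 2: project the intermediate space points into the image
    # plane of the "virtual" camera displaced by translation t.
    proj = K @ (points - t.reshape(3, 1))
    return (proj[:2] / proj[2]).T.reshape(h, w, 2)

# Usage: a horizontal shift of the virtual camera yields purely
# horizontal pixel displacements (parallax) that scale with 1/Z.
K = np.array([[700.0, 0, 360], [0, 700.0, 288], [0, 0, 1]])
depth = np.full((576, 720), 5.0)             # flat scene, 5 m away
coords = warp_to_virtual_view(depth, K, np.array([0.03, 0.0, 0.0]))
```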
1) The “Virtual” Stereo Camera: Building on the de-
scribed 3-D image warping concept, the synthesis of stereo-
scopic images can be realized through the definition of two
“virtual” cameras—one for the left-eye and one for the right-
eye. With respect to the original (reference) view, these two
cameras are symmetrically displaced by half the interaxial
distance $t_c$ (Fig. 5). To establish the zero parallax setting
(ZPS), i.e., to choose the convergence distance $Z_c$ in the 3-D