Fig. 4. Example for hole-filling based on the bilateral filter [25]. (a) Raw
depth image. (b) Depth image after filtering.
turns out that Kinect is able to capture the relative 3-D coordinates of markers with minor errors (< 1 cm), provided that the sensor is positioned within an ideal range (1 m to 3 m) and has an effective field of view. In [15], the authors examine the accuracy of joint localization and the robustness of pose estimation with respect to more realistic setups. In the experiment, six exercises are conducted in which the subject is either seated or positioned next to a chair. These exercises are generally challenging for human pose recognition, since self-occlusion occurs frequently and the capturing view angle changes over time. The acquired 3-D location of each joint is then compared to the data generated by a marker-based motion capture system, which can be considered ground-truth data.
According to the results, Kinect has significant potential as a low-cost alternative for real-time motion capture and body tracking in healthcare applications. The accuracy of the Kinect joint estimation is comparable to marker-based motion capture for more controlled body poses (e.g., standing and exercising arms). However, for general poses, the typical error of Kinect skeletal tracking is about 10 cm. Moreover, the current Kinect algorithm frequently fails due to occlusions, indistinguishable depth (limbs close to the body), or clutter (other objects in the scene).
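To make such a comparison concrete, the following sketch computes the mean per-joint Euclidean error between Kinect skeletal data and time-aligned marker-based ground truth; the function name and array layout are illustrative assumptions, not part of the evaluation protocol of [15].

    import numpy as np

    def per_joint_error(kinect_joints, mocap_joints):
        # kinect_joints, mocap_joints: (T, J, 3) arrays of 3-D joint positions
        # over T time-aligned frames, both expressed in the same coordinate
        # system (e.g., after registering the Kinect data to the mocap frame).
        diff = kinect_joints - mocap_joints           # per-frame residuals
        dist = np.linalg.norm(diff, axis=2)           # (T, J) Euclidean errors
        return dist.mean(axis=0)                      # (J,) mean error per joint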
III. Preprocessing
The data obtained with Kinect normally cannot be fed directly into the designed computer vision algorithms. Most of the algorithms take advantage of the rich information (RGB and depth) attached to each point. In order to correctly combine the RGB image with the depth data, it is necessary to spatially align the output of the RGB camera with that of the depth camera. In addition, the raw depth data are very noisy, and many pixels in the image may have no depth due to multiple reflections, transparent objects, or scattering on certain surfaces (such as human tissue and hair). These inaccurate or missing depth values (holes) need to be recovered before use. Therefore, many systems based on Kinect start with a preprocessing module, which conducts application-specific camera recalibration and/or depth data filtering.
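As a rough illustration of the spatial alignment step, the sketch below back-projects a depth pixel into 3-D using the depth camera intrinsics, transforms it into the RGB camera's frame, and reprojects it with the RGB intrinsics; all parameter names are placeholders that would come from a prior calibration, not values read from the device.

    import numpy as np

    def map_depth_pixel_to_rgb(u, v, z, K_d, K_rgb, R, t):
        # K_d, K_rgb: 3x3 intrinsic matrices of the depth and RGB cameras.
        # R (3x3), t (3,): rigid transform from the depth frame to the RGB frame.
        # All parameters are assumed to come from a prior calibration.
        p_depth = z * np.linalg.inv(K_d) @ np.array([u, v, 1.0])  # back-project
        p_rgb = R @ p_depth + t                                    # change of frame
        uvw = K_rgb @ p_rgb                                        # reproject
        return uvw[:2] / uvw[2]                                    # RGB pixel coordinates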
A. Kinect Recalibration
In fact, Kinect is calibrated during manufacturing, and the resulting camera parameters are stored in the device's memory, where they can be used to fuse the RGB and depth information.
This calibration information is adequate for casual usage, such
as object tracking. However, it is not accurate enough for
reconstructing a 3-D map, for which a highly precise cloud of
3-D points should be obtained. Moreover, the manufacturer’s
calibration does not correct the depth distortion, and is thus
incapable of recovering the missing depth.
Zhang et al. [16] and Herrera et al. [17] develop a calibration-board-based technique, which is derived from Zhang's camera calibration technique for the RGB camera [18]. In this method, the 3-D coordinates of the feature points on the calibration card are obtained in the RGB camera's coordinate system. Feature-point matching between the RGB image and the depth image spatially correlates those feature points across the two images, and this spatial mapping assigns each feature point its true depth value in the RGB camera's coordinate system. Meanwhile, the depth camera measures the 3-D coordinates of those feature points in the IR camera's coordinate system. The method assumes that the depth values obtained by the depth camera can be transformed into the true depth values by an affine model. As a result, the key is to estimate the parameters of the affine model, which can be done by minimizing the distances between the two point sets.
This technique, combined with a calibration card, allows users to recalibrate the Kinect sensor when the initial calibration is not accurate enough for certain applications. The weakness of this method is that it does not specifically address depth distortion, whose correction may be unavoidable in most 3-D mapping scenarios.
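The affine-model estimation step can be pictured as a least-squares fit between the two 3-D point sets; the sketch below is a generic formulation under that assumption, not the exact procedure of [16], [17].

    import numpy as np

    def fit_affine_3d(src, dst):
        # src: (N, 3) points measured in the depth (IR) camera's frame.
        # dst: (N, 3) corresponding "true" points in the RGB camera's frame.
        # Solve dst_i ~ A @ src_i + b for all i in the least-squares sense.
        src_h = np.hstack([src, np.ones((len(src), 1))])   # homogeneous (N, 4)
        M, *_ = np.linalg.lstsq(src_h, dst, rcond=None)    # (4, 3) solution
        return M[:3].T, M[3]                               # A (3x3), b (3,)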
There are a few publications that discuss solutions for Kinect depth distortion correction. Smisek et al. [11] observe that the Kinect device exhibits radially symmetric distortions. To correct this distortion, a spatially varying offset is applied to the calculated depth. The offset at a given pixel position is calculated as the mean difference between the measured depth and the expected depth in metric coordinates.
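A possible realization of such a spatially varying offset is sketched below, assuming several calibration frames of a target with known geometry are available; the frame-averaging scheme and variable names are illustrative, not the exact procedure of [11].

    import numpy as np

    def estimate_offset_map(measured, expected):
        # measured, expected: (F, H, W) stacks of depth images (in metres) of a
        # target with known geometry (e.g., a flat wall at measured distances).
        # The offset at each pixel is the mean measured-minus-expected depth.
        return np.nanmean(measured - expected, axis=0)     # (H, W) offset map

    def correct_depth(depth, offset_map):
        # Subtract the spatially varying offset from a newly captured depth image.
        return depth - offset_map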
In [19], a disparity distortion correction method is proposed
based on the observation that a more accurate calibration can
be made by correcting the distortion directly in disparity units.
An interesting paper [20] deals with more practical issues, investigating the possible influence of thermal and environmental conditions when calibrating Kinect. The experiments show that variations in temperature and air draft have a notable influence on Kinect's images and range measurements. Based on these findings, temperature-related rules are established in the paper to reduce errors in the calibration and measurement process of the Kinect.
B. Depth Data Filtering
Another preprocessing step is depth data filtering, which can be used for depth-image denoising or for recovering missing depth (holes). A naive approach treats the depth data as a monochromatic image and applies existing image filters to it, such as a Gaussian filter. This simple method works only in regions where the signal statistics favor the underlying filter.
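A naive filter of this kind might look like the following sketch, which treats the depth map as a single-channel image, masks out missing (zero-valued) pixels, and keeps the Gaussian-smoothed result only where enough valid neighbors contribute; the validity threshold is an arbitrary assumption.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def naive_depth_smoothing(depth, sigma=2.0, min_weight=0.25):
        # Treat the depth map as a monochromatic image; missing depth is assumed
        # to be encoded as 0. Smoothed values are kept only where the local
        # fraction of valid pixels exceeds min_weight (an arbitrary threshold).
        valid = (depth > 0).astype(np.float64)
        blurred = gaussian_filter(depth * valid, sigma)    # sum of valid depths
        weight = gaussian_filter(valid, sigma)             # local valid fraction
        return np.where(weight > min_weight,
                        blurred / np.maximum(weight, 1e-6), 0.0)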
A more sophisticated algorithm [21] investigates the specific characteristics of a depth map created by Kinect, and finds that there are two types of occlusions/holes with different causes. The algorithm automatically separates