140 Viola and Jones
Figure 3. The sum of the pixels within rectangle D can be computed
with four array references. The value of the integral image at location
1isthe sum of the pixels in rectangle A. The value at location 2 is
A + B,atlocation 3 is A + C , and at location 4 is A + B + C + D.
The sum within D can be computed as 4 + 1 − (2 + 3).
(1999). The authors point out that in the case of linear
operations (e.g. f · g), any invertible linear operation
can be applied to f or g if its inverse is applied to the
result. For example in the case of convolution, if the
derivative operator is applied both to the image and the
kernel the result must then be double integrated:
f ∗ g =
( f
∗ g
).
The authors go on to show that convolution can be
significantly accelerated if the derivatives of f and g
are sparse (or can be made so). A similar insight is that
an invertible linear operation can be applied to f if its
inverse is applied to g:
( f
) ∗
g
= f ∗ g.
Viewed in this framework computation of the rect-
angle sum can be expressed as a dot product, i ·r , where
i is the image and r is the box car image (with value
1 within the rectangle of interest and 0 outside). This
operation can be rewritten
i · r =
i
· r
.
The integral image is in fact the double integral of the
image (first along rows and then along columns). The
second derivative of the rectangle (first in row and then
in column) yields four delta functions at the corners of
the rectangle. Evaluation of the second dot product is
accomplished with four array accesses.
2.2. Feature Discussion
Rectangle features are somewhat primitive when
compared with alternatives such as steerable filters
(Freeman and Adelson, 1991; Greenspan et al., 1994).
Steerable filters, and their relatives, are excellent for the
detailed analysis of boundaries, image compression,
and texture analysis. While rectangle features are also
sensitive to the presence of edges, bars, and other sim-
ple image structure, they are quite coarse. Unlike steer-
able filters, the only orientations available are vertical,
horizontal and diagonal. Since orthogonality is not cen-
tral to this feature set, we choose to generate a very
large and varied set of rectangle features. Typically the
representation is about 400 times overcomplete. This
overcomplete set provides features of arbitrary aspect
ratio and of finely sampled location. Empirically it ap-
pears as though the set of rectangle features provide
a rich image representation which supports effective
learning. The extreme computational efficiency of rect-
angle features provides ample compensation for their
limitations.
In order to appreciate the computational advantage
of the integral image technique, consider a more con-
ventional approach in which a pyramid of images is
computed. Like most face detection systems, our de-
tector scans the input at many scales; starting at the
base scale in which faces are detected at a size of
24 × 24 pixels, a 384 by 288 pixel image is scanned
at 12 scales each a factor of 1.25 larger than the last.
The conventional approach is to compute a pyramid of
12 images, each 1.25 times smaller than the previous
image. A fixed scale detector is then scanned across
each of these images. Computation of the pyramid,
while straightforward, requires significant time. Imple-
mented efficiently on conventional hardware (using bi-
linear interpolation to scale each level of the pyramid) it
takes around .05 seconds to compute a 12 level pyramid
of this size (on an Intel PIII 700 MHz processor).
5
In contrast we have defined a meaningful set of rect-
angle features, which have the property that a single
feature can be evaluated at any scale and location in a
few operations. We will show in Section 4 that effec-
tive face detectors can be constructed with as few as two
rectangle features. Given the computational efficiency
of these features, the face detection process can be com-
pleted for an entire image at every scale at 15 frames per