performance of the feature description of these architectures is not reported.
In addition, most existing works [16–19] have focused on partial implementations of the SIFT algorithm. More specifically, [16,17] focus on the feature point detection step of SIFT, and [18,19] on the pyramid construction, gradient orientation and magnitude computation, and descriptor extraction, respectively. In contrast, this paper presents a complete SIFT implementation that accommodates all components and steps of the original SIFT algorithm. It is a truly low-cost (specifically, in terms of power consumption and hardware resources), real-time implementation of the complete SIFT algorithm.
3. A brief introduction to SIFT
To make this paper self-contained, we briefly review the SIFT
algorithm [2] in this section. As mentioned in Section 1, SIFT can
be divided into four components, which we describe in turn below. For full details, please see [2].
3.1. DoG pyramid construction
The first step in SIFT is to determine the image positions that
exhibit significant local changes of the visual appearance. Such
places are the candidates for the SIFT features. In order to find
these feature candidates, we need to construct a DoG image pyra-
mid which approximates the image gradient field. Firstly, we need
to convolve the input image I(x, y) with a Gaussian kernel K(x, y; σ), where σ is the scale of the Gaussian kernel. The result is the Gaussian-filtered image, denoted as G(x, y; σ), i.e.

G(x, y; \sigma) = \mathrm{conv2}(I(x, y), K(x, y; \sigma))   (1)

where conv2(·,·) represents the 2-D convolution operation, and

K(x, y; \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/(2\sigma^2)}   (2)
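As a concrete illustration of Eq. (2), the following NumPy sketch samples the kernel on a discrete grid; the 3σ window radius and the normalization step are our own assumptions, not part of the original formulation.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Sample K(x, y; sigma) of Eq. (2) on an odd-sized discrete grid.

    The 3*sigma radius and the final normalization are illustrative
    assumptions; a hardware design would fix the window size up front.
    """
    if radius is None:
        radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(x, x)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return kernel / kernel.sum()  # normalize so filtering preserves brightness
```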
The DoG image is the difference of two Gaussian-filtered images over two consecutive scales, denoted by

D(x, y; \sigma) = G(x, y; k\sigma) - G(x, y; \sigma)   (3)

where k is a constant multiplicative factor.
To detect the candidate feature points, we need to construct the
DoG image pyramid. The process is illustrated in Fig. 1 (from [2]). As can be seen, there are 3 octaves (groups of images with the same resolution) and 6 scales per octave. In each octave, the Gaussian-filtered image G(x, y; k^(i+1)σ), i = 0, ..., 4, is generated by convolving G(x, y; k^i σ) with K(x, y; √(k² − 1) · k^i σ). Once a complete octave has been processed, we down-sample the Gaussian-filtered image by taking every other pixel in its rows and columns to form the first image of the next octave. The first image of Octave 0 is the original image.
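The octave-and-scale loop described above can be summarized by the following sketch (our own illustration, reusing the gaussian_kernel helper from the previous sketch; the scale step k, the initial σ, the choice of which image to down-sample, and the use of SciPy's convolution are assumptions, and a hardware pipeline would of course not be organized this way):

```python
import numpy as np
from scipy.ndimage import convolve

def build_dog_pyramid(image, num_octaves=3, scales_per_octave=6,
                      sigma=1.6, k=2 ** (1.0 / 3)):
    """Sketch of Section 3.1: incremental Gaussian blurring per octave,
    DoG images as differences of consecutive scales (Eq. (3)),
    and 2x down-sampling between octaves."""
    gaussians, dogs = [], []
    base = image.astype(np.float32)
    for _ in range(num_octaves):
        octave = [base]
        for i in range(scales_per_octave - 1):
            # incremental kernel K(x, y; sqrt(k^2 - 1) * k^i * sigma)
            inc_sigma = np.sqrt(k ** 2 - 1) * (k ** i) * sigma
            octave.append(convolve(octave[-1], gaussian_kernel(inc_sigma)))
        gaussians.append(octave)
        # Eq. (3): D = G(k*sigma) - G(sigma) for consecutive scales
        dogs.append([g2 - g1 for g1, g2 in zip(octave[:-1], octave[1:])])
        # take every other pixel in rows and columns for the next octave
        # (down-sampling the last image of the octave is our assumption)
        base = octave[-1][::2, ::2]
    return gaussians, dogs
```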
3.2. Stable feature detection
After the DoG image pyramid has been constructed, we detect
the local maxima and minima in the DoG images by comparing a
pixel to its 26 neighbors in 3 × 3 regions at the current and adjacent scales. These local maxima and minima are treated as candidate features.
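A minimal sketch of this 26-neighbor comparison (our own illustration, not the paper's hardware logic) is:

```python
import numpy as np

def is_extremum(dog_prev, dog_cur, dog_next, x, y):
    """Check whether dog_cur[y, x] is a local maximum or minimum among
    its 26 neighbors in the 3x3x3 cube over the adjacent scales."""
    cube = np.stack([d[y - 1:y + 2, x - 1:x + 2]
                     for d in (dog_prev, dog_cur, dog_next)])
    center = dog_cur[y, x]
    # the cube contains the center itself, so >= / <= tolerate ties
    return center >= cube.max() or center <= cube.min()
```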
Once the feature candidates have been located, we should eliminate low-contrast points and strong edge response points to make the features robust to noise. To eliminate low-contrast points, we compare the DoG image pixel value against a contrast threshold. A poorly defined peak in the DoG image has a large principal curvature across the edge but a small one in the perpendicular direction. The principal curvatures can be computed from the Hessian matrix H, that is
H = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix}   (4)

where D_{xx}, which is estimated by taking differences of neighboring points, is the second-order derivative in the x direction.
The eigenvalues of H are proportional to the principal curvatures of D. Let γ be the ratio between the larger eigenvalue α and the smaller one β, so that α = γβ; then

\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(\alpha + \beta)^2}{\alpha\beta} = \frac{(\gamma\beta + \beta)^2}{\gamma\beta^2} = \frac{(1 + \gamma)^2}{\gamma}   (5)
where Tr(H) denotes the trace of H and Det(H) its determinant. It is clear that γ represents the condition number of the principal curvature matrix, which indicates how degenerate the local appearance is.
Therefore, to eliminate strong edge response points, we only need to set the threshold γ and check whether (6) is satisfied, that is

\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(1 + \gamma)^2}{\gamma}   (6)
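A sketch of the edge-response test of Eqs. (4)–(6) may make the finite-difference estimation of H concrete; the threshold value γ = 10 is an assumption chosen for illustration, not a value taken from the paper.

```python
def passes_edge_test(dog, x, y, gamma=10.0):
    """Keep a candidate only if Eq. (6) holds; the Hessian entries are
    estimated by finite differences of neighboring DoG pixels."""
    d_xx = dog[y, x + 1] + dog[y, x - 1] - 2 * dog[y, x]
    d_yy = dog[y + 1, x] + dog[y - 1, x] - 2 * dog[y, x]
    d_xy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
            - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    tr = d_xx + d_yy                     # Tr(H)
    det = d_xx * d_yy - d_xy * d_xy      # Det(H)
    if det <= 0:
        return False                     # curvatures of opposite sign: reject
    return tr * tr / det < (1 + gamma) ** 2 / gamma   # Eq. (6)
```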
3.3. Gradient magnitude and orientation assignment
To extract the SIFT descriptor for the detected features, we have to compute the image gradient magnitude and orientation, denoted by m(x, y) and θ(x, y) respectively, for the points close to the feature point. The formulas for m(x, y) and θ(x, y) are

\theta(x, y) = \arctan\frac{G(x, y+1) - G(x, y-1)}{G(x+1, y) - G(x-1, y)}   (7)

m(x, y) = \sqrt{(G(x+1, y) - G(x-1, y))^2 + (G(x, y+1) - G(x, y-1))^2}   (8)
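Eqs. (7) and (8) translate directly into the following sketch; using atan2 to obtain a full 360° orientation is our assumption (the equations as written use a plain arctangent), and g is the Gaussian-filtered image at the feature's scale.

```python
import numpy as np

def gradient_mag_ori(g, x, y):
    """Gradient magnitude m(x, y) and orientation theta(x, y) per Eqs. (7)-(8)."""
    dx = g[y, x + 1] - g[y, x - 1]     # G(x+1, y) - G(x-1, y)
    dy = g[y + 1, x] - g[y - 1, x]     # G(x, y+1) - G(x, y-1)
    m = np.hypot(dx, dy)               # Eq. (8)
    theta = np.arctan2(dy, dx)         # Eq. (7), extended to the full circle
    return m, theta
```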
3.4. SIFT descriptor representation
In this stage, there are two main tasks: dominant orientation
computation and descriptor representation. Dominant orientation
computation needs to form an orientation histogram, which has
36 bins in this paper, from a circular region centered at the feature
point. Each sample added to the histogram is weighted by its gra-
dient magnitude and by a Gaussian-weighted circular window, as
presented in Fig. 2(a). The highest bin in the histogram defines the dominant orientation of the feature, as presented in Fig. 2(b). The image region is then rotated relative to the dominant orientation, as presented in Fig. 2(c). Finally, in the descriptor representation stage, the image gradient orientations and magnitudes of the points in a circular image region centered at the feature are used to extract a 16 × 8 = 128 dimensional SIFT feature descriptor, as presented in Fig. 2(d).
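The 36-bin dominant-orientation histogram described above can be sketched as follows; it reuses gradient_mag_ori from the previous sketch, and the neighborhood radius, the Gaussian window width, and the assumption that the region lies fully inside the image are ours, chosen only for illustration.

```python
import numpy as np

def dominant_orientation(g, x, y, radius=8, sigma_w=4.0, num_bins=36):
    """Build the 36-bin orientation histogram around a feature (Fig. 2(a))
    and return the center angle of its highest bin (Fig. 2(b))."""
    hist = np.zeros(num_bins)
    for j in range(-radius, radius + 1):
        for i in range(-radius, radius + 1):
            if i * i + j * j > radius * radius:
                continue                                   # circular region only
            m, theta = gradient_mag_ori(g, x + i, y + j)
            w = np.exp(-(i * i + j * j) / (2 * sigma_w ** 2))  # Gaussian window
            b = int((theta + np.pi) / (2 * np.pi) * num_bins) % num_bins
            hist[b] += w * m                               # magnitude-weighted vote
    peak = int(np.argmax(hist))
    return (peak + 0.5) * 2 * np.pi / num_bins - np.pi     # dominant orientation (rad)
```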
4. The proposed hardware architecture
SIFT has shown great success in many computer vision applications, such as 3D reconstruction, target tracking, and object recognition. However, its large computational complexity has been a challenge for most embedded implementations in real-time and resource-limited application scenarios.