
Additional discussion of the details of prior related work is reported in Sect. B.2.
2 Contributions
If SIFT is written as (1), then DSP-SIFT is given by
$$h_{\mathrm{DSP}}(\theta \mid I)[x] = \int h_{\mathrm{SIFT}}(\theta \mid I, \sigma)[x]\, E_s(\sigma)\, d\sigma, \qquad x \in \Lambda \qquad (2)$$
where s > 0 is the size-pooling scale and E_s is an exponential or other unilateral density function. This is our main contribution. The process is visualized in Fig. 1. Unlike SIFT, which is computed on a scale-selected lattice Λ(σ̂), DSP-SIFT is computed on a regularly sampled lattice Λ. Computed on a different lattice, the above can be considered a recipe for DSP-HOG [11]. Computed on a tree, it can be used to extend deformable-parts models (DPM) [16] to DSP-DPM. Replacing h_SIFT with another histogram-based descriptor "X" (for instance, SURF [2]), the above yields DSP-X. Applied to a hidden layer of a convolutional network, it yields a DSP-CNN, or a DSP-Deep-Fisher-Network [39]. The details of the implementation are in Sect. 3.
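To make the pooling operation in (2) concrete, the following is a minimal sketch (not the authors' code) that replaces the integral with a weighted sum over sampled domain sizes. The helper `hist_fn` is a hypothetical callable standing in for whatever histogram-based descriptor "X" is being pooled (SIFT, HOG, SURF, or a hidden CNN layer).

```python
import numpy as np

def dsp_pool(hist_fn, sigmas, weights):
    """Domain-size pooling, a sketch of Eq. (2): aggregate histograms
    computed at several domain sizes into a single descriptor.

    hist_fn : callable mapping a domain size sigma to the un-normalized
              histogram h_X(. | I, sigma)[x] as a 1-D array (hypothetical).
    sigmas  : sampled domain sizes (samples along the scale semi-orbit).
    weights : values of the pooling density E_s at those sizes.
    """
    weights = np.asarray(weights, dtype=float)
    hists = np.stack([np.asarray(hist_fn(s), dtype=float) for s in sigmas])
    # Weighted average approximates the integral over sigma in Eq. (2).
    return (weights[:, None] * hists).sum(axis=0) / weights.sum()
```

With uniform weights, this reduces to a plain average of the raw histograms across domain sizes, which is the choice adopted in Sect. 3.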
While the implementation of DS pooling is straightforward, its justification is less so. We report the summary highlights in Sect. 5, which represent contributions to the understanding of pooling and to the design and learning of local descriptors. The detailed derivation is described in Sect. B. It provides a theoretical justification for DS pooling and explicit conditions under which the resulting descriptors are valid. Nevertheless, one cannot forgo empirical validation on real images, where such conditions are routinely violated. In Sect. 4 we compare DSP-SIFT to alternative approaches.
Motivated by the experiments of [33, 34], which compare local descriptors on wide-baseline matching benchmarks and show SIFT to be a clear winner, we choose SIFT as a paragon and compare it to DSP-SIFT on the standard benchmark [33]. Motivated by [17], which compares SIFT to both supervised and unsupervised CNNs trained on Imagenet and Flickr respectively, with the latter emerging as the clear winner on the same benchmark [33], we submit DSP-SIFT to the same evaluation protocol. We also run the test on the new synthetic dataset introduced by [17], which yields the same qualitative assessment. It should be noted that the comparison is unfair in favor of the CNNs, due to their increased dimension compared to SIFT and DSP-SIFT. Moreover, the best performance of a CNN is obtained using its fourth-layer responses, which contain 8192 coefficients, a 64-fold complexity increase, even before accounting for the cost of learning, of which DSP-SIFT requires none.
Clearly, DS pooling of under-sampled semi-orbits cannot outperform fine sampling, so if we were to retain all the scale samples instead of aggregating them, performance would improve further. However, computing a large collection of SIFT descriptors across different scales would incur significantly increased computational and storage cost. To contain the latter, [22] assume that descriptors at different scales populate a linear subspace and fit a high-dimensional hyperplane. The resulting Scale-less SIFT (SLS) outperforms ordinary SIFT, as shown in Fig. 5. However, the linear-subspace assumption breaks down under large scale changes, so SLS is outperformed by DSP-SIFT despite the considerable difference in (memory and time) complexity.
3 Implementation and Parameters
Following common practice in evaluation protocols, we use maximally-stable extremal regions (MSER) [30] to detect
candidate regions, affine-normalize them, align them to the dominant orientation, and re-scale them for comparison
with [17]. For a detected scale σ̂, DSP-SIFT samples N_σ̂ scales within a neighborhood (λ_1 σ̂, λ_2 σ̂) around it. For each scale-sampled patch, a single-scale un-normalized SIFT descriptor (1) is computed on the SIFT scale-space octave corresponding to the detected scale. By choosing E_s to be a uniform density, these raw histograms of gradient orientations at different scales are accumulated and normalized (see footnote 6) to produce DSP-SIFT (2), which is compared to several descriptors. In the following evaluation, we use λ_1 = 1/6, λ_2 = 4/3, and N_σ̂ = 15. These parameters are
empirically selected on the Oxford dataset [32, 33]. Fig. 4(a) shows that mean average precision (defined in Sect. 4.3)
changes over the scale pooling range. An immediate advantage of DS pooling is observed when more than one scale
Footnote 6: We follow the practice of SIFT [29] to normalize, clamp, and re-normalize the histograms to make them more robust to contrast changes. The clamping threshold is set to 0.067 empirically.
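The following is a minimal sketch (under our own assumptions, not the reference implementation) of the scale sampling and post-processing just described, using the reported parameters λ_1 = 1/6, λ_2 = 4/3, N_σ̂ = 15, and the clamping threshold 0.067. The helper `sift_at_scale` is hypothetical: it stands for a routine returning the raw (un-normalized) SIFT histogram of the patch rescaled to a given domain size.

```python
import numpy as np

def dsp_sift(sift_at_scale, sigma_hat, n_samples=15,
             lam1=1.0 / 6.0, lam2=4.0 / 3.0, clamp=0.067):
    """Sketch of DSP-SIFT with a uniform pooling density E_s."""
    # Sample N scales in the neighborhood (lam1 * sigma_hat, lam2 * sigma_hat).
    sigmas = np.linspace(lam1 * sigma_hat, lam2 * sigma_hat, n_samples)
    # Uniform E_s: plain average of the raw histograms across domain sizes.
    h = np.mean([np.asarray(sift_at_scale(s), dtype=float) for s in sigmas],
                axis=0)
    # SIFT-style post-processing (footnote 6): normalize, clamp, re-normalize.
    h = h / (np.linalg.norm(h) + 1e-12)
    h = np.minimum(h, clamp)
    return h / (np.linalg.norm(h) + 1e-12)
```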