3) Output Stage (Hashing and Histograms): Each of the $L_1$ input images $I_i^{\ell}$ for the second stage has $L_2$ real-valued outputs $\{I_i^{\ell} * W_{\ell}^{2}\}_{\ell=1}^{L_2}$ from the second stage. We binarize these outputs and obtain $\{H(I_i^{\ell} * W_{\ell}^{2})\}_{\ell=1}^{L_2}$, where $H(\cdot)$ is a Heaviside step (like) function, whose value is one for positive entries and zero otherwise.
Around each pixel, we view the vector of $L_2$ binary bits as a decimal number. This converts the $L_2$ outputs in $O_i^{\ell}$ back into a single integer-valued “image”:
$$T_i^{\ell} \doteq \sum_{\ell=1}^{L_2} 2^{\ell-1} H(I_i^{\ell} * W_{\ell}^{2}), \qquad (8)$$
whose every pixel is an integer in the range $[0,\, 2^{L_2} - 1]$. The order and weights of the $L_2$ outputs are irrelevant because here we treat each integer as a distinct “word.”
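To make the hashing step concrete, below is a minimal NumPy sketch of Eq. (8), assuming the $L_2$ second-stage outputs for one first-stage map are stacked in an array of shape $(L_2, m, n)$; the function and variable names are ours and not part of the original implementation.

```python
import numpy as np

def binary_hash(stage2_outputs):
    """Eq. (8): binarize the L2 filter responses and pack them into one
    integer-valued map whose pixels lie in [0, 2^L2 - 1].

    stage2_outputs: array of shape (L2, m, n) with the real-valued
    second-stage convolution outputs for a single first-stage map.
    """
    L2 = stage2_outputs.shape[0]
    bits = (stage2_outputs > 0).astype(np.int64)      # Heaviside step H(.)
    weights = 2 ** np.arange(L2).reshape(L2, 1, 1)    # 2^(l-1), l = 1, ..., L2
    return (weights * bits).sum(axis=0)               # integer "image" T
```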
Each of the $L_1$ images $T_i^{\ell}$, $\ell = 1, \ldots, L_1$, is partitioned into $B$ blocks. We compute the histogram (with $2^{L_2}$ bins) of the decimal values in each block, concatenate all $B$ histograms into one vector, and denote this vector as $\mathrm{Bhist}(T_i^{\ell})$. After this encoding process, the “feature” of the input image $I_i$ is then defined to be the set of block-wise histograms, i.e.,
$$f_i \doteq \big[\mathrm{Bhist}(T_i^{1}), \ldots, \mathrm{Bhist}(T_i^{L_1})\big]^{T} \in \mathbb{R}^{(2^{L_2}) L_1 B}. \qquad (9)$$
The local blocks can be either overlapping or non-overlapping,
depending on the application. Our empirical experience sug-
gests that non-overlapping blocks are suitable for face images,
whereas overlapping blocks are appropriate for hand-written
digits, textures, and object images. Furthermore, the histogram
offers some degree of translation invariance in the extracted
features, as in hand-crafted features (e.g., scale-invariant fea-
ture transform (SIFT) [12] and histogram of oriented gradi-
ents (HOG) [13]), learned features (e.g., the bag-of-words (BoW)
model [14]), and the average and maximum pooling process
in ConvNet [3]–[5], [8], [9].
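As a rough illustration of the block-wise encoding in Eq. (9), the sketch below computes the concatenated histograms for one integer-valued map; the stride argument selects between overlapping (stride smaller than the block size) and non-overlapping (stride equal to the block size) blocks. The function names and the example block settings are ours, not those of the reference code.

```python
import numpy as np

def block_histograms(T, L2, block_size, stride):
    """Bhist(T) in Eq. (9): concatenate 2^L2-bin histograms computed over
    (possibly overlapping) blocks of the integer-valued map T."""
    m, n = T.shape
    bh, bw = block_size
    hists = []
    for r in range(0, m - bh + 1, stride):
        for c in range(0, n - bw + 1, stride):
            block = T[r:r + bh, c:c + bw]
            hists.append(np.bincount(block.ravel(), minlength=2 ** L2))
    return np.concatenate(hists)

# The feature f_i is the concatenation over the L1 maps T_i^l, e.g.
# f_i = np.concatenate([block_histograms(T, 8, (8, 8), 8) for T in maps])
```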
The hyper-parameters of the PCANet include the filter size $k_1, k_2$, the number of filters in each stage $L_1, L_2$, the number of stages, and the block size for local histograms in the output layer. PCA filter banks require that $k_1 k_2 \geq L_1, L_2$.
In our experiments in Section III and Section IV, excluding object recognition, we always set $L_1 = L_2 = 8$, which is inspired by the common setting of Gabor filters [15] with 8 orientations, although fine-tuning $L_1, L_2$ could lead to marginal performance improvements. The remaining hyper-parameters, such as the filter size $k_1, k_2$ and the block size for local histograms, are determined through a grid search with either cross-validation or a validation set. Moreover, we have empirically observed that a two-stage PCANet is in general sufficient to achieve good performance and that a deeper architecture does not necessarily lead to further improvements. In addition, a larger block size for local histograms provides greater translation invariance in the extracted feature $f_i$.
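For instance, the grid search over the filter size and histogram block size with a held-out validation set could be sketched as follows; the candidate grids and the train_pcanet_and_score helper are hypothetical placeholders, not settings reported in the paper.

```python
from itertools import product

def train_pcanet_and_score(train_set, val_set, k, block_size):
    # Hypothetical helper: train a PCANet (L1 = L2 = 8, filter size k x k,
    # the given histogram block size) plus a classifier, and return the
    # accuracy on the validation set.
    raise NotImplementedError

def grid_search(train_set, val_set,
                filter_sizes=(3, 5, 7),                   # candidate k1 = k2 (assumed)
                block_sizes=((7, 7), (8, 8), (15, 15))):  # assumed candidates
    best = None
    for k, blk in product(filter_sizes, block_sizes):
        acc = train_pcanet_and_score(train_set, val_set, k, blk)
        if best is None or acc > best[0]:
            best = (acc, k, blk)
    return best   # (best accuracy, filter size, block size)
```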
4) Comparison With ConvNet and ScatNet: Clearly,
PCANet shares various similarities with ConvNet [5]. The
patch-mean removal in PCANet is reminiscent of local contrast normalization in ConvNet.⁴ This operation centers all of the patches around the origin of the vector space so that the learned PCA filters can better capture major variations in the data. In addition, PCA can be viewed as the simplest class of auto-encoders, which minimizes reconstruction error.

⁴We have tested the PCANet without patch-mean removal and observed slightly degraded performance.
The PCANet contains no nonlinear processing within or between its stages, in contrast to the common wisdom of building deep learning networks, e.g., the absolute rectification layer in ConvNet [5] and the modulus layer in ScatNet [6], [10].
We have tested the PCANet with an absolute rectification layer
added immediately after the first stage, but we did not observe
any improvement in the final classification results. This could
be because the use of quantization plus a local histogram
(in the output layer) already introduces sufficient invariance
and robustness in the final feature.
The overall process prior to the output layer in the PCANet is completely linear. One may wonder what would happen if we merged the two stages into a single stage with an equivalent number of PCA filters and an equivalent receptive field size. Specifically, one may be interested in how a single-stage PCANet with $L_1 L_2$ filters of size $(2k_1 - 1) \times (2k_2 - 1)$ would perform compared to the two-stage PCANet described in Section II-A. We have experimented with such settings on faces and hand-written digits and observed that the two-stage PCANet outperforms this single-stage alternative in most cases; see the last several rows of Tables III, X, and XI.
In comparison to the filters learned by the single-stage alternative, the resulting two-stage PCA filters essentially have a low-rank factorization, possibly resulting in a lower chance of over-fitting the dataset. Regarding why the deep structure is needed, from a computational perspective, the single-stage alternative requires learning filters with $L_1 L_2 (2k_1 - 1)(2k_2 - 1)$ variables, whereas the two-stage PCANet learns filters with only $(L_1 + L_2) k_1 k_2$ variables in total. Another benefit of the two-stage PCANet is its larger receptive field: because it contains more holistic observations of the objects in images, learning invariance at this scale can capture more semantic information. Our comparative experiments verify that hierarchical architectures with large receptive fields and multiple stacked stages are more efficient in learning semantically related representations, which agrees with what has been observed in [7].
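As a quick numerical illustration of these counts (assuming, for example, $k_1 = k_2 = 7$ and $L_1 = L_2 = 8$; other filter sizes are used in some experiments, so the setting is only illustrative):

```python
# Filter variables to learn, single-stage alternative vs. two-stage PCANet,
# for the illustrative setting k1 = k2 = 7, L1 = L2 = 8.
k1 = k2 = 7
L1 = L2 = 8
single_stage = L1 * L2 * (2 * k1 - 1) * (2 * k2 - 1)   # 64 * 13 * 13 = 10816
two_stage = (L1 + L2) * k1 * k2                        # 16 * 49     = 784
print(single_stage, two_stage, single_stage / two_stage)  # roughly 13.8x more
```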
B. Computational Complexity
The components for constructing the PCANet are extremely basic and computationally efficient. To see how low the computational complexity of the PCANet is, let us take the two-stage PCANet as an example. In each stage of the PCANet, forming the patch-mean-removed matrix $X$ costs $k_1 k_2 + k_1 k_2 \tilde{m}\tilde{n}$ flops; the inner product $X X^{T}$ has a complexity of $2(k_1 k_2)^2 \tilde{m}\tilde{n}$ flops; and the complexity of the eigen-decomposition is $O((k_1 k_2)^3)$. The PCA filter convolution requires $L_i k_1 k_2 m n$ flops for stage $i$. In the output layer, the conversion of $L_2$ binary bits to a decimal number costs $2 L_2 \tilde{m}\tilde{n}$ flops, and the naive histogram operation is of complexity $O(m n B L_2 \log 2)$. With $\tilde{m} = m - k_1/2$, $\tilde{n} = n - k_2/2$ and assuming $mn \gg \max(k_1, k_2, L_1, L_2, B)$, the overall complexity of the PCANet is easily verified to be
$$O\big(m n k_1 k_2 (L_1 + L_2) + m n (k_1 k_2)^2\big).$$
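A minimal sketch that tallies these terms for a concrete setting is given below; the counts follow the approximate per-component expressions above rather than an exact flop audit, and the example sizes in the comment are hypothetical.

```python
import math

def pcanet_flops(m, n, k1, k2, L1, L2, B):
    """Approximate per-image cost of a two-stage PCANet, following the
    per-component counts given above (lower-order terms ignored)."""
    mt, nt = m - k1 // 2, n - k2 // 2            # \tilde{m}, \tilde{n}
    kk = k1 * k2
    pca_per_stage = kk + kk * mt * nt + 2 * kk**2 * mt * nt + kk**3
    conv = (L1 + L2) * kk * m * n                # PCA filter convolutions
    hashing = 2 * L2 * mt * nt                   # binary-to-decimal conversion
    hist = m * n * B * L2 * math.log(2)          # naive histogramming
    return 2 * pca_per_stage + conv + hashing + hist

# e.g. pcanet_flops(m=64, n=64, k1=7, k2=7, L1=8, L2=8, B=16)
```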