interactions. Gao et al. [15] proposed to approximate the
second-order statistics via Tensor Sketch [35]. Yin et al.
[12] aggregated higher-order statistics by iteratively apply-
ing the Tensor Sketch compression to the features. Cai et al.
[2] used high-order pooling to aggregate hierarchical convo-
lutional responses. Moreover, bilinear pooling and high-order pooling methods have also been applied to the Visual-Question-Answering task [14, 22, 56, 57]. However, unlike the above methods, which mainly exploit high-order statistics on top of feature pooling and thus yield high-dimensional feature representations that are unsuitable for efficient/fast pedestrian search, we instead aim to enhance feature discrimination via attention learning. We model a high-order attention mechanism to capture the high-order and subtle differences among pedestrians and to produce discriminative attention proposals.
Zero-Shot Learning: In ZSL, the model is required to learn from the seen classes and then be capable of utilizing the learned knowledge to distinguish the unseen classes. It has been studied in image classification [28, 4], video recognition [13] and image retrieval/clustering [5]. Interestingly, person ReID matches the setting of ZSL well, since training identities have no intersection with testing identities, yet most existing ReID works ignore the problem of ZSL. To this end, we propose the Mixed High-Order Attention Network (MHN) to explicitly suppress the 'biased learning behavior of deep models' [5, 6] caused by ZSL, allowing the learning of all-sided attention information that may be useful for unseen identities and preventing the learning of biased attention information that only benefits the seen identities.
3. Proposed Approach
In this section, we first provide the formulation of the general attention mechanism in Sec. 3.1, then detail the proposed High-Order Attention (HOA) module in Sec. 3.2, and finally show the overall framework of our Mixed High-Order Attention Network (MHN) in Sec. 3.3.
3.1. Problem Formulation
Attention acts as a tool to bias the allocation of available
resources towards the most informative parts of an input. In
convolutional neural networks (CNNs), it is commonly used to reweight the convolutional response maps so as to highlight the important parts and suppress the uninformative ones, e.g. spatial attention [25, 27] and channel attention [19, 27]. We extend these two attention methods to a general case. Specifically, consider a convolutional activation output, a 3D tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$, encoded by the CNN from a given input image, where $C$, $H$ and $W$ denote the number of channels, the height and the width, respectively. As aforementioned, the goal of attention is to reweight the convolutional output; we thus formulate this process as:
$$\mathcal{Y} = \mathcal{A}(\mathcal{X}) \odot \mathcal{X} \quad (1)$$

where $\mathcal{A}(\mathcal{X}) \in \mathbb{R}^{C \times H \times W}$ is the attention proposal output by a certain attention module and $\odot$ is the Hadamard product (element-wise product). As $\mathcal{A}(\mathcal{X})$ serves as a reweighting term, the value of each element of $\mathcal{A}(\mathcal{X})$ should lie in the interval $[0, 1]$. Based on the above general formulation of attention, $\mathcal{A}(\mathcal{X})$ can take many different forms. For example, if $\mathcal{A}(\mathcal{X}) = \mathrm{rep}[M]|_C$, where $M \in \mathbb{R}^{H \times W}$ is a spatial mask and $\mathrm{rep}[M]|_C$ means replicating this spatial mask $M$ along the channel dimension $C$ times, Eq. 1 is the implementation of spatial attention. And if $\mathcal{A}(\mathcal{X}) = \mathrm{rep}[V]|_{H,W}$, where $V \in \mathbb{R}^C$ is a scale vector and $\mathrm{rep}[V]|_{H,W}$ means replicating this scale vector along the height and width dimensions $H$ and $W$ times respectively, Eq. 1 is the implementation of channel attention.
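As a concrete illustration, the following minimal PyTorch sketch instantiates Eq. 1 for both special cases; the tensor sizes and the random, sigmoid-squashed masks are placeholder assumptions standing in for the outputs of learned attention modules.

```python
import torch

# Minimal sketch of the general reweighting Y = A(X) ⊙ X in Eq. 1,
# assuming a single feature map X of shape (C, H, W).
C, H, W = 256, 24, 8
X = torch.randn(C, H, W)

# Spatial attention: A(X) = rep[M]|_C, an H×W mask replicated over all channels.
M = torch.sigmoid(torch.randn(H, W))             # sigmoid keeps values in (0, 1)
Y_spatial = M.unsqueeze(0).expand(C, H, W) * X   # replicate mask along channel dim

# Channel attention: A(X) = rep[V]|_{H,W}, a C-dim scale vector replicated spatially.
V = torch.sigmoid(torch.randn(C))
Y_channel = V.view(C, 1, 1).expand(C, H, W) * X  # replicate vector over H and W
```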
However, in spatial attention or channel attention, $\mathcal{A}(\mathcal{X})$ is coarse and unable to capture the high-order and complex interactions among parts, resulting in less discriminative attention proposals that fail to capture the subtle differences among pedestrians. To this end, we devote ourselves to modeling $\mathcal{A}(\mathcal{X})$ with high-order statistics.
3.2. High-Order Attention Module
To model the complex and high-order interactions within attention, we first define a linear polynomial predictor on top of the high-order statistics of $\mathbf{x}$, where $\mathbf{x} \in \mathbb{R}^C$ denotes a local descriptor at a specific spatial location of $\mathcal{X}$:

$$a(\mathbf{x}) = \sum_{r=1}^{R} \langle \mathbf{w}^r, \otimes_r \mathbf{x} \rangle \quad (2)$$

where $\langle \cdot, \cdot \rangle$ indicates the inner product of two same-sized tensors, $R$ is the number of orders, $\otimes_r \mathbf{x}$ is the $r$-th order outer-product of $\mathbf{x}$ that comprises all the degree-$r$ monomials in $\mathbf{x}$, and $\mathbf{w}^r$ is the $r$-th order tensor to be learned that contains the weights of the degree-$r$ variable combinations in $\mathbf{x}$.
Considering that $\mathbf{w}^r$ with large $r$ will introduce excessive parameters and incur the problem of overfitting, we suppose that when $r > 1$, $\mathbf{w}^r$ can be approximated by $D^r$ rank-1 tensors via tensor decomposition [23], i.e. $\mathbf{w}^r = \sum_{d=1}^{D^r} \alpha^{r,d}\, \mathbf{u}_1^{r,d} \otimes \cdots \otimes \mathbf{u}_r^{r,d}$, where $\mathbf{u}_1^{r,d} \in \mathbb{R}^C, \ldots, \mathbf{u}_r^{r,d} \in \mathbb{R}^C$ are vectors, $\otimes$ is the outer-product, and $\alpha^{r,d}$ is the weight of the $d$-th rank-1 tensor. Then, according to tensor algebra, Eq. 2 can be reformulated as:
a(x) = hw
1
, xi +
R
X
r=2
h
D
r
X
d=1
α
r,d
u
r,d
1
⊗ · · · ⊗ u
r,d
r
, ⊗
r
xi
= hw
1
, xi +
R
X
r=2
D
r
X
d=1
α
r,d
r
Y
s=1
hu
r,d
s
, xi
= hw
1
, xi +
R
X
r=2
hα
r
, z
r
i (3)