Y. Fang, H. Zhang and Y. Ren / Knowledge-Based Systems 171 (2019) 69–80 71
smoothing matrix generated by a controlled parameter. A more
detailed definition is as follows:
X ≈ USV s.t. U_{ij} ≥ 0, V_{jl} ≥ 0 (2)
where the smoothing matrix S ∈ R^{k×k} is a positive symmetric matrix, which is defined as:

S = (1 − θ)I + (θ/k) 1_k 1_k^T (3)
where I ∈ R^{k×k} is the identity matrix, 1_k is the k-dimensional vector of ones, and θ ∈ [0, 1] is a smoothing parameter. If θ = 0, the smoothing matrix S has no effect and NsNMF degenerates to standard NMF. As θ → 1, the factorization reaches its smoothest (non-sparse) case, because every entry of S becomes non-zero.
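To make the role of θ concrete, the following sketch (our own illustration, not code from the paper) constructs S as in Eq. (3) and shows its effect on a sparse column at the two extremes of θ:

```python
# Illustrative sketch of the NsNMF smoothing matrix of Eq. (3):
# S = (1 - theta) * I + (theta / k) * 1_k 1_k^T.
import numpy as np

def smoothing_matrix(k, theta):
    """Return the k x k smoothing matrix of Eq. (3)."""
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))

k = 4
v = np.array([1.0, 0.0, 0.0, 0.0])  # a sparse column of V

# theta = 0: S is the identity, so NsNMF reduces to standard NMF.
S0 = smoothing_matrix(k, 0.0)
print(np.allclose(S0, np.eye(k)))   # True

# theta = 1: S averages all entries, so S @ v is uniform -- the
# non-sparse, "smoothest" case described in the text.
S1 = smoothing_matrix(k, 1.0)
print(S1 @ v)                       # [0.25 0.25 0.25 0.25]
```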
2.3. Symmetric nonnegative matrix factorization
Symmetric nonnegative matrix factorization (SNMF) [18] aims to factorize an adjacency matrix (graph similarity matrix) A ∈ R^{n×n} into the product of a nonnegative matrix H ∈ R^{n×r} and its transpose H^T, where H is the low-dimensional representation, under the column-orthonormal constraint H^T H = I. SNMF has been shown to be an effective graph clustering method: it operates directly on pairwise similarity values and is closely related to spectral clustering. The objective function is as follows:

J(H) = ∥A − HH^T∥_F^2 s.t. H_{ij} ≥ 0, H^T H = I (4)
In the following sections, we will employ the SNMF model to
reconstruct a similar graph between different modalities, i.e., the
inter-modal similarity graph.
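As a rough illustration of how the SNMF objective of Eq. (4) can be minimized, the sketch below (our own assumption, not the authors' implementation) uses the standard damped multiplicative update for min_H ∥A − HH^T∥_F^2 with H ≥ 0, relaxing the orthonormality constraint as is common in practice:

```python
# Illustrative SNMF solver via the damped multiplicative update
# H <- H * (1 - beta + beta * (A H) / (H H^T H)); the orthonormal
# constraint of Eq. (4) is relaxed in this sketch.
import numpy as np

def snmf(A, r, n_iter=500, beta=0.5, eps=1e-10, seed=0):
    """Factorize symmetric nonnegative A (n x n) as A ~ H H^T, H in R^{n x r}."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, r))
    for _ in range(n_iter):
        H *= (1.0 - beta) + beta * (A @ H) / (H @ (H.T @ H) + eps)
    return H

# Toy similarity matrix with two obvious clusters.
A = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
H = snmf(A, r=2)
print(np.linalg.norm(A - H @ H.T))  # small reconstruction error
```

In this toy case the two columns of H act as (soft) cluster indicators, which is the graph clustering interpretation mentioned above.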
3. Multi-modal graph regularized smooth matrix factorization
hashing
In this section, the MSFH framework, the optimization strategy and the corresponding algorithm are introduced in detail. The goal of our method is to extract the shared latent semantic features and generate unified binary hash codes for different modalities. The specific steps are as follows. First, the common latent semantic space is found by the joint smooth matrix factorization model with specific regularization. Then the extracted shared latent features are transformed into binary hash codes by learned hashing functions, so that the similarity between modalities can be accurately estimated. In addition, to generate more efficient hash codes, several regularization terms are integrated into the overall objective function. The overall framework of MSFH is shown in Fig. 1.
3.1. Objective function
For every modality X^t ∈ R^{d_t×n}, the dictionary matrix U^t ∈ R^{d_t×k} and the common latent feature representation V ∈ R^{k×n} can be learned by a joint smooth matrix factorization model, which is formulated as:

min_{U^t, V} L(U^t, V) = Σ_t α^t ∥X^t − U^t S V∥_F^2 (5)
where α^t is a weight parameter for the tth modality and S ∈ R^{k×k} is defined exactly as in Eq. (3). In the above formulation, the robustness of the model may be affected by outliers. Moreover, since the original non-negativity constraint is removed from the model, the trivial-solution problem is inevitable. Besides, the pseudo-inverse (U^t S)^† = [(U^t S)^T (U^t S)]^{−1} (U^t S)^T (computed by singular value decomposition) would inevitably be required for feature extraction at test time, i.e., v_{test} = (U^t S)^† x^t_{test}, and the computation of the pseudo-inverse may suffer from numerical instability. To address these drawbacks, a linear projection for each modality between the original data space and the shared semantic space is constructed as a regularization term of the objective function. The objective function can then be rewritten as:
min_{U^t, W^t, V} L = Σ_t α^t (∥X^t − U^t S V∥_F^2 + β∥V − W^t X^t∥_F^2) (6)
where β is a trade-off parameter for the linear projection regularization and W^t ∈ R^{k×d_t} is a projection matrix for the tth modality.
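To make the shapes and terms of Eq. (6) concrete, the following sketch (our own illustration; variable names and dimensions are assumptions) evaluates the objective for a list of modalities:

```python
# Illustrative evaluation of the MSFH objective of Eq. (6):
# L = sum_t alpha_t * ( ||X_t - U_t S V||_F^2 + beta * ||V - W_t X_t||_F^2 ).
import numpy as np

def smoothing_matrix(k, theta):
    # Eq. (3): S = (1 - theta) I + (theta / k) 1_k 1_k^T.
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))

def msfh_objective(Xs, Us, Ws, V, alphas, beta, theta):
    S = smoothing_matrix(V.shape[0], theta)
    loss = 0.0
    for X, U, W, a in zip(Xs, Us, Ws, alphas):
        loss += a * (np.linalg.norm(X - U @ S @ V, 'fro') ** 2
                     + beta * np.linalg.norm(V - W @ X, 'fro') ** 2)
    return loss

# Two toy modalities: d_1 = 5, d_2 = 3 features, n = 10 samples, k = 2.
rng = np.random.default_rng(0)
Xs = [rng.random((5, 10)), rng.random((3, 10))]
V = rng.random((2, 10))
Us = [rng.random((5, 2)), rng.random((3, 2))]
Ws = [rng.random((2, 5)), rng.random((2, 3))]
print(msfh_objective(Xs, Us, Ws, V, alphas=[0.5, 0.5], beta=0.1, theta=0.3))
```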
3.2. Multi-modal graph regularization
Furthermore, to preserve the geometric topological structure of the original data more effectively, a predefined graph for each modality is added as a regularization term in the above objective function to enrich the local information of the extracted feature V. However, if we force the local structure of each original modality to be preserved on the shared semantic feature V, this may result in the loss of crucial information and even damage the structure of the original modalities, because the topological structure of each modality is different. For this reason, we preserve the topological structure of each original modality on the aforementioned linear embedding, i.e., Y^t = W^t X^t.
In graph theory, a nearest-neighbor graph can be represented as an ordered pair G = (V, E), where V is a set of N = |V| vertices or nodes, and E is a set of m = |E| edges or links between the vertices [19]. Most graphs can be represented as edge-weighted ones, G = (V, E, a), where a : E → R assigns real values to edges [20]. For multi-modal data, a multiplex graph can be defined as G^i = (V, E^i, a^i) for modality i = 1, . . . , t. The number of nodes is the same in every graph, i.e., N = |V|, while the connectivity and the distribution of links in each graph are different, m^i = |E^i|.
More specifically, for each data point x^t_j ∈ R^{d_t}, we seek its nearest neighbors and place edges between x^t_j and its neighbors. In this paper, the most common weight matrix is chosen for the graph, and the similarity between vertices j and l is calculated as:

A^t_{jl} = e^{−∥x^t_l − x^t_j∥^2 / σ}, if x^t_l ∈ N(x^t_j); 0, otherwise (7)
where A^t_{jl} is the (j, l) entry of the weight matrix A^t and N(x^t_j) is the set of neighbors of x^t_j. The Euclidean distance between the new embedding representations y^t_i and y^t_j is computed as d(y^t_i, y^t_j) = ∥y^t_i − y^t_j∥_2. An undirected graph G^t = (V^t, A^t) for each modality can then be easily described based on the predefined weight matrix A^t and the data collection V^t. The graph regularization term of the tth modality can be defined as:
G^t = (1/2) Σ_{i,j=1}^n A^t_{ij} ∥y^t_i − y^t_j∥_2^2
    = Σ_{i=1}^n (y^t_i)^T y^t_i D^t_{ii} − Σ_{i,j=1}^n (y^t_i)^T y^t_j A^t_{ij}
    = tr(Y^t D^t (Y^t)^T) − tr(Y^t A^t (Y^t)^T)
    = tr(Y^t L^t (Y^t)^T)
    = tr(W^t X^t L^t (X^t)^T (W^t)^T) (8)
where y^t_i is the ith column of Y^t = W^t X^t and D^t is a diagonal matrix whose entries are the column sums of the symmetric matrix A^t, that is, D^t_{ii} = Σ_{j=1}^n A^t_{ij}. L^t = D^t − A^t is the graph Laplacian matrix.
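The identity derived in Eq. (8) is easy to check numerically. The sketch below (our own illustration, not the authors' code) builds a small k-NN heat-kernel weight matrix in the spirit of Eq. (7) and verifies that (1/2) Σ_{i,j} A_{ij} ∥y_i − y_j∥^2 equals tr(Y L Y^T) with L = D − A:

```python
# Illustrative check of the graph-Laplacian identity of Eq. (8).
import numpy as np

def knn_weight_matrix(X, n_neighbors=2, sigma=1.0):
    """X is d x n; returns a symmetrized n x n heat-kernel weight matrix."""
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared dists
    A = np.zeros((n, n))
    for j in range(n):
        idx = np.argsort(d2[j])[1:n_neighbors + 1]  # skip the point itself
        A[j, idx] = np.exp(-d2[j, idx] / sigma)
    return np.maximum(A, A.T)  # symmetrize

rng = np.random.default_rng(0)
X = rng.random((3, 6))   # one modality: d = 3 features, n = 6 points
W = rng.random((2, 3))   # projection W^t, k = 2
Y = W @ X                # linear embedding Y^t = W^t X^t

A = knn_weight_matrix(X)
D = np.diag(A.sum(axis=1))
L = D - A                # graph Laplacian L^t = D^t - A^t

lhs = 0.5 * sum(A[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
                for i in range(6) for j in range(6))
rhs = np.trace(Y @ L @ Y.T)
print(np.isclose(lhs, rhs))  # True
```

Note that the identity requires A to be symmetric, which is why the weight matrix is symmetrized after the k-NN step.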
In the unsupervised setting, label information cannot be employed to construct the similarity adjacency matrix between different modalities. Directly extracting the multi-modal graph structure by weighted averaging is often a very rough approximation which fails to capture the rich and complex relationships between modalities. For this reason, [21] utilized two modality