Y. Fang, H. Zhang and Y. Ren / Knowledge-Based Systems 171 (2019) 69–80 71
smoothing matrix generated by a controlled parameter. A more
detailed definition is as follows:
X ≈ USV s.t. U_{ij} ≥ 0, V_{jl} ≥ 0 (2)
where the smoothing matrix S ∈ R^{k×k} is a positive symmetric matrix, which is defined as:

S = (1 − θ)I + (θ/k) 1_k 1_k^T (3)
where I ∈ R^{k×k} is the identity matrix, 1_k is the k-dimensional vector of ones, and θ ∈ [0, 1] is a smoothing parameter. If θ = 0, the smoothing matrix S has no effect and NsNMF degenerates to standard NMF. As θ → 1, the factorization reaches its smoothest (non-sparse) case, because every entry of S becomes non-zero.
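To make the role of θ concrete, the following sketch (our own illustration, not code from the paper) constructs S as in Eq. (3) and shows its effect on a sparse column at the two extremes of θ:

```python
# Illustrative sketch of the NsNMF smoothing matrix of Eq. (3):
# S = (1 - theta) * I + (theta / k) * 1_k 1_k^T.
import numpy as np

def smoothing_matrix(k, theta):
    """Return the k x k smoothing matrix of Eq. (3)."""
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))

k = 4
v = np.array([1.0, 0.0, 0.0, 0.0])  # a sparse column of V

# theta = 0: S is the identity, so NsNMF reduces to standard NMF.
S0 = smoothing_matrix(k, 0.0)
print(np.allclose(S0, np.eye(k)))   # True

# theta = 1: S averages all entries, so S @ v is uniform -- the
# non-sparse, "smoothest" case described in the text.
S1 = smoothing_matrix(k, 1.0)
print(S1 @ v)                       # [0.25 0.25 0.25 0.25]
```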
2.3. Symmetric nonnegative matrix factorization
Symmetric nonnegative matrix factorization (SNMF) [18] aims to factorize an adjacency matrix (graph similarity matrix) A ∈ R^{n×n} into the product of a nonnegative matrix H ∈ R^{n×r} and its transpose H^T, where H is the low-dimensional representation, under the column-orthonormal constraint H^T H = I. SNMF has been shown to be an effective graph clustering method: it operates directly on pairwise similarity values and is closely related to spectral clustering. The objective function is as follows:

J(H) = ∥A − HH^T∥_F^2 s.t. H_{ij} ≥ 0, H^T H = I (4)
In the following sections, we will employ the SNMF model to
reconstruct a similar graph between different modalities, i.e., the
inter-modal similarity graph.
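As a rough illustration of how the SNMF objective of Eq. (4) can be minimized, the sketch below (our own assumption, not the authors' implementation) uses the standard damped multiplicative update for min_H ∥A − HH^T∥_F^2 with H ≥ 0, relaxing the orthonormality constraint as is common in practice:

```python
# Illustrative SNMF solver via the damped multiplicative update
# H <- H * (1 - beta + beta * (A H) / (H H^T H)); the orthonormal
# constraint of Eq. (4) is relaxed in this sketch.
import numpy as np

def snmf(A, r, n_iter=500, beta=0.5, eps=1e-10, seed=0):
    """Factorize symmetric nonnegative A (n x n) as A ~ H H^T, H in R^{n x r}."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    H = rng.random((n, r))
    for _ in range(n_iter):
        H *= (1.0 - beta) + beta * (A @ H) / (H @ (H.T @ H) + eps)
    return H

# Toy similarity matrix with two obvious clusters.
A = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
H = snmf(A, r=2)
print(np.linalg.norm(A - H @ H.T))  # small reconstruction error
```

In this toy case the two columns of H act as (soft) cluster indicators, which is the graph clustering interpretation mentioned above.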
3. Multi-modal graph regularized smooth matrix factorization
hashing
In this section, the MSFH framework, the optimization strategy and the corresponding algorithm are introduced in detail. The goal of our method is to extract the shared latent semantic features and generate unified binary hash codes for different modalities. The specific steps are as follows. First, the common latent semantic space is found by the joint smooth matrix factorization model with specific regularization. Then the extracted shared latent features are transformed into binary hash codes by learned hashing functions, so that the similarity between modalities can be accurately estimated. In addition, to generate more efficient hash codes, several regularization terms are integrated into the overall objective function. The overall framework of MSFH is shown in Fig. 1.
3.1. Objective function
For every modality X^t ∈ R^{d_t×n}, the dictionary matrix U^t ∈ R^{d_t×k} and the common latent feature representation V ∈ R^{k×n} can be learned by a joint smooth matrix factorization model, which is formulated as:

min_{U^t, V} L(U^t, V) = Σ_t α^t ∥X^t − U^t S V∥_F^2 (5)
where α^t is a weight parameter for the tth modality and S ∈ R^{k×k} is defined exactly as in Eq. (3). In the above formulation, the robustness of the model may be affected by outliers. Moreover, since the original non-negativity constraint is removed from the model, the trivial-solution problem is inevitable. Besides, the pseudo-inverse (U^t S)^† = [(U^t S)^T (U^t S)]^{−1} (U^t S)^T (computed by singular value decomposition) would inevitably be required for feature extraction at test time, i.e., v_{test} = (U^t S)^† x^t_{test}, and the computation of the pseudo-inverse may suffer from numerical instability. To address these drawbacks, a linear projection for each modality between the original data space and the shared semantic space is constructed as a regularization term of the objective function. The objective function can then be rewritten as:
min_{U^t, W^t, V} L = Σ_t α^t (∥X^t − U^t S V∥_F^2 + β∥V − W^t X^t∥_F^2) (6)
where β is a trade-off parameter for the linear projection regularization and W^t ∈ R^{k×d_t} is a projection matrix for the tth modality.
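To make the shapes and terms of Eq. (6) concrete, the following sketch (our own illustration; variable names and dimensions are assumptions) evaluates the objective for a list of modalities:

```python
# Illustrative evaluation of the MSFH objective of Eq. (6):
# L = sum_t alpha_t * ( ||X_t - U_t S V||_F^2 + beta * ||V - W_t X_t||_F^2 ).
import numpy as np

def smoothing_matrix(k, theta):
    # Eq. (3): S = (1 - theta) I + (theta / k) 1_k 1_k^T.
    return (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))

def msfh_objective(Xs, Us, Ws, V, alphas, beta, theta):
    S = smoothing_matrix(V.shape[0], theta)
    loss = 0.0
    for X, U, W, a in zip(Xs, Us, Ws, alphas):
        loss += a * (np.linalg.norm(X - U @ S @ V, 'fro') ** 2
                     + beta * np.linalg.norm(V - W @ X, 'fro') ** 2)
    return loss

# Two toy modalities: d_1 = 5, d_2 = 3 features, n = 10 samples, k = 2.
rng = np.random.default_rng(0)
Xs = [rng.random((5, 10)), rng.random((3, 10))]
V = rng.random((2, 10))
Us = [rng.random((5, 2)), rng.random((3, 2))]
Ws = [rng.random((2, 5)), rng.random((2, 3))]
print(msfh_objective(Xs, Us, Ws, V, alphas=[0.5, 0.5], beta=0.1, theta=0.3))
```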
3.2. Multi-modal graph regularization
Furthermore, to preserve the geometric topological structure of the original data more effectively, a predefined graph for each modality is added as a regularization term in the above objective function to enrich the local information of the extracted feature V. However, if we force the local structure of each original modality to be preserved on the shared semantic feature V, this may result in the loss of crucial information and even damage the structure of the original modalities, because the topological structure of each modality is different. For this reason, we preserve the topological structure of each original modality on the aforementioned linear embedding, i.e., Y^t = W^t X^t.
In graph theory, a nearest-neighbor graph can be represented as an ordered pair G = (V, E), where V is a set of N = |V| vertices or nodes, and E is a set of m = |E| edges or links between the vertices [19]. Most graphs can be represented as edge-weighted ones, G = (V, E, a), where a : E → R assigns real values to edges [20]. For multi-modal data, a multiplex graph can be defined as G^i = (V, E^i, a^i) for modality i = 1, . . . , t. The number of nodes is the same in every graph, i.e., N = |V|, while the connectivity and the distribution of links in each graph are different, m^i = |E^i|.
More specifically, for each data point x^t_j ∈ R^{d_t}, we seek its nearest neighbors and place edges between x^t_j and its neighbors. In this paper, the most common weight matrix is chosen for the graph, and the similarity between vertices j and l is calculated as:

A^t_{jl} = e^{−∥x^t_l − x^t_j∥^2 / σ}, if x^t_l ∈ N(x^t_j); 0, otherwise (7)
where A^t_{jl} is the (j, l) entry of the weight matrix A^t and N(x^t_j) is the set of neighbors of x^t_j. The Euclidean distance between the new embedding representations y^t_i and y^t_j is computed as d(y^t_i, y^t_j) = ∥y^t_i − y^t_j∥_2. An undirected graph G^t = (V^t, A^t) for each modality can then be easily described based on the predefined weight matrix A^t and the data collection V^t. The graph regularization term of the tth modality can be defined as:
G^t = (1/2) Σ_{i,j=1}^n A^t_{ij} ∥y^t_i − y^t_j∥_2^2
    = Σ_{i=1}^n (y^t_i)^T y^t_i D^t_{ii} − Σ_{i,j=1}^n (y^t_i)^T y^t_j A^t_{ij}
    = tr(Y^t D^t (Y^t)^T) − tr(Y^t A^t (Y^t)^T)
    = tr(Y^t L^t (Y^t)^T)
    = tr(W^t X^t L^t (X^t)^T (W^t)^T) (8)
where y^t_i is the ith column of Y^t = W^t X^t and D^t is a diagonal matrix whose entries are the column sums of the symmetric matrix A^t, that is, D^t_{ii} = Σ_{j=1}^n A^t_{ij}. L^t = D^t − A^t is the graph Laplacian matrix.
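The identity derived in Eq. (8) is easy to check numerically. The sketch below (our own illustration, not the authors' code) builds a small k-NN heat-kernel weight matrix in the spirit of Eq. (7) and verifies that (1/2) Σ_{i,j} A_{ij} ∥y_i − y_j∥^2 equals tr(Y L Y^T) with L = D − A:

```python
# Illustrative check of the graph-Laplacian identity of Eq. (8).
import numpy as np

def knn_weight_matrix(X, n_neighbors=2, sigma=1.0):
    """X is d x n; returns a symmetrized n x n heat-kernel weight matrix."""
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared dists
    A = np.zeros((n, n))
    for j in range(n):
        idx = np.argsort(d2[j])[1:n_neighbors + 1]  # skip the point itself
        A[j, idx] = np.exp(-d2[j, idx] / sigma)
    return np.maximum(A, A.T)  # symmetrize

rng = np.random.default_rng(0)
X = rng.random((3, 6))   # one modality: d = 3 features, n = 6 points
W = rng.random((2, 3))   # projection W^t, k = 2
Y = W @ X                # linear embedding Y^t = W^t X^t

A = knn_weight_matrix(X)
D = np.diag(A.sum(axis=1))
L = D - A                # graph Laplacian L^t = D^t - A^t

lhs = 0.5 * sum(A[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
                for i in range(6) for j in range(6))
rhs = np.trace(Y @ L @ Y.T)
print(np.isclose(lhs, rhs))  # True
```

Note that the identity requires A to be symmetric, which is why the weight matrix is symmetrized after the k-NN step.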
In the unsupervised setting, label information cannot be employed to construct the similarity adjacency matrix between different modalities. Directly extracting the multi-modal graph structure by weighted averaging is often a very rough approximation which fails to capture the rich and complex relationships between modalities. For this reason, [21] utilized two modality