Fig. 2. The overall architecture of the proposed AATE network. It consists of a visual attribute graph convolutional network for learning visual features, a hierarchical text embedding network for learning textual features, and a cross-modal adversarial learning module for learning modality-invariant and discriminative visual-textual representations.
which captures the visual appearance of a person at multiple scales via a comparative similarity loss on sample triplets.
Video-based Person Search. Video-based person search is an extension of image-based person search: it searches for the target pedestrian using one or more video clips of the pedestrian [47]–[51]. Compared to images, video sequences contain motion patterns of pedestrians as well as richer appearance information. Early methods were based on hand-crafted video representations and/or appropriate distance metrics. For example, You et al. [49] developed a top-push distance learning model to optimize the matching accuracy of top-ranked results. Recent works have proposed deep learning models for video-based person search [47], [52]–[55]. For example, Liu et al. [52] proposed a Dense 3D-Convolutional Network (D3DNet), which introduces multiple 3D dense blocks to learn spatio-temporal and appearance features of pedestrians. McLaughlin et al. [47] presented a recurrent neural network architecture, which combines optical flow, recurrent layers, and a mean-pooling layer to learn visual appearance and motion features of pedestrians. Li et al. [53] proposed to jointly learn local and global features in a CNN model by optimizing multiple classification losses in different contexts. Shen et al. [54] proposed a similarity-guided graph neural network that incorporates gallery-gallery similarities into the training process of a person re-identification model.
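To make the recurrent video representation described above concrete, here is a minimal PyTorch-style sketch in the spirit of McLaughlin et al. [47], with per-frame CNN features passed through a recurrent layer and mean-pooled over time; the class and parameter names are our own illustrative assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class RecurrentVideoEmbed(nn.Module):
    """Illustrative sketch: per-frame CNN features -> RNN -> temporal
    mean pooling, in the spirit of McLaughlin et al. [47]."""
    def __init__(self, cnn: nn.Module, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.cnn = cnn  # any per-frame feature extractor returning feat_dim vectors
        self.rnn = nn.RNN(feat_dim, hidden, batch_first=True)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.rnn(feats)   # hidden state at every time step
        return out.mean(dim=1)     # mean pooling over time -> clip embedding
```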
III. METHOD
Suppose a training set $\{I_i, T_i\}_{i=1}^{N}$ consisting of $N$ pairs of pedestrian images and text descriptions, where $\{I_i\}_{i=1}^{N}$ are pedestrian images taken by non-overlapping cameras and $\{T_i\}_{i=1}^{N}$ are the corresponding text descriptions of the pedestrians. The identity labels are $Y = \{y_i\}_{i=1}^{N}$, where $y_i \in \{1, 2, \cdots, K\}$ is the pedestrian ID. The task is to identify the target pedestrian's images in the gallery based on a text query. Figure 2 illustrates the architecture of the proposed adversarial attribute-text embedding (AATE) network, which consists of a visual attribute graph convolutional network, a hierarchical text embedding network, and a cross-modal adversarial learning module.
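Purely as a structural sketch of how these three modules fit together (the submodule classes below are placeholders under our assumptions, not the authors' implementation):

```python
import torch.nn as nn

class AATE(nn.Module):
    """Skeleton of the three-module AATE design described above."""
    def __init__(self, visual_net: nn.Module, text_net: nn.Module,
                 modality_disc: nn.Module):
        super().__init__()
        self.visual_net = visual_net        # visual attribute GCN (Sec. III-A)
        self.text_net = text_net            # hierarchical text embedding network
        self.modality_disc = modality_disc  # adversarial modality discriminator

    def forward(self, images, texts):
        v = self.visual_net(images)  # visual embedding
        t = self.text_net(texts)    # textual embedding
        # The discriminator tries to tell the two modalities apart, while the
        # embedding networks are trained to fool it, pushing the learned
        # representations toward modality invariance.
        return v, t, self.modality_disc(v), self.modality_disc(t)
```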
A. Visual Attribute Graph Convolutional Network
A text description usually mentions multiple attributes of the target pedestrian. Hence, detecting the visual attributes of a pedestrian is of great importance for person search. Moreover, visual attributes possess better descriptiveness, interpretability and robustness than appearance features. An attribute usually arises from one or more regions rather than the entire pedestrian image, so it is necessary to concentrate on the related regions during attribute learning. Furthermore, different attributes correlate semantically: the presence or absence of a certain attribute is often useful for inferring the presence or absence of other related attributes. For example, “wearing a dress” and “long hair” are likely to co-occur, while “carrying a bag” and “carrying a backpack” are likely to be mutually exclusive. Based on these observations, we develop a visual attribute graph convolutional network to learn effective attribute features of pedestrians. As illustrated in Figure 3, the network consists of a visual attention block and a graph convolutional network. The visual attention block infers a spatial attention map for each attribute and concentrates the network on the corresponding local regions during attribute learning. The graph network exploits the underlying semantic dependencies among attributes, which can effectively boost attribute learning [56]–[58].
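A minimal, hedged sketch of this attention-plus-graph-convolution idea in PyTorch follows; the 1×1-convolution attention, the single graph-convolution layer, and all shapes are our assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class AttributeAttentionGCN(nn.Module):
    """Sketch: per-attribute spatial attention over the CNN feature map,
    then one graph convolution over an attribute-correlation graph."""
    def __init__(self, num_attrs: int, feat_dim: int = 2048, out_dim: int = 512):
        super().__init__()
        self.attn = nn.Conv2d(feat_dim, num_attrs, kernel_size=1)  # one map per attribute
        self.gc = nn.Linear(feat_dim, out_dim, bias=False)         # GCN weight W

    def forward(self, fmap: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # fmap: (B, 2048, 7, 7) backbone feature map
        # adj:  (K, K) normalized attribute-adjacency matrix
        a = self.attn(fmap).flatten(2).softmax(dim=-1)  # (B, K, 49) spatial attention
        v = fmap.flatten(2)                             # (B, 2048, 49)
        nodes = torch.einsum('bkl,bcl->bkc', a, v)      # attention-pooled feature per attribute
        return torch.relu(adj @ self.gc(nodes))         # H' = ReLU(A H W), (B, K, out_dim)
```

In such a design, `adj` would typically be built from attribute co-occurrence statistics on the training set and symmetrically normalized before training.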
The ResNet-50 [59] is used as the base network to extract a feature map $V^r$ from the input image; the dimension of $V^r$ is $7 \times 7 \times 2048$. The appearance feature with 2,048 dimensions is