Learning Features from Large-Scale, Noisy and Social Image-Tag Collection
Hanwang Zhang†, Xindi Shang†, Huanbo Luan†§, Yang Yang‡, Tat-Seng Chua†
†National University of Singapore; ‡University of Electronic Science and Technology of China; §Tsinghua University
{hanwangzhang,xindi1992,luanhuanbo,dlyyang}@gmail.com; dcscts@nus.edu.sg
ABSTRACT
Feature representation for multimedia content is the key to the progress of many fundamental multimedia tasks. Although recent advances in deep feature learning offer a promising route towards these tasks, they are limited in application to domains where high-quality and large-scale training data are hard to obtain. In this paper, we propose a novel deep feature learning paradigm based on large, noisy and social image-tag collections, which can be acquired from the inexhaustible social multimedia content on the Web. Instead of learning features from high-quality image-label supervision, we propose to learn from image-word semantic relations by seeking a unified image-word embedding space in which the pairwise feature similarities preserve the semantic relations of the original image-word pairs. We offer an easy-to-use implementation of the proposed paradigm, which is fast and can be integrated into any state-of-the-art deep architecture. Experiments on the NUS-WIDE benchmark demonstrate that the features learned by our method significantly outperform other state-of-the-art ones.
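For concreteness, the following is a minimal PyTorch sketch of such a similarity-preserving image-word embedding. It is illustrative only, not the exact formulation used in this paper: the two-branch architecture, the dimensions, and the hinge ranking loss are assumptions made for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Project image features and word (tag) embeddings into one shared space."""
    def __init__(self, img_dim=4096, vocab_size=30000, embed_dim=300):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)         # image branch
        self.word_emb = nn.Embedding(vocab_size, embed_dim)   # word branch

    def forward(self, img_feats, tag_ids):
        v = F.normalize(self.img_proj(img_feats), dim=1)  # unit-norm image vectors
        w = F.normalize(self.word_emb(tag_ids), dim=1)    # unit-norm word vectors
        return v, w

def ranking_loss(v, w, margin=0.1):
    """Hinge ranking loss: each image should be closer to its own tag
    than to the tags of other images in the batch, and vice versa."""
    scores = v @ w.t()                                 # cosine similarity matrix
    pos = scores.diag().unsqueeze(1)                   # matched image-tag pairs
    cost_im = (margin + scores - pos).clamp(min=0)     # image vs. wrong tags
    cost_w = (margin + scores - pos.t()).clamp(min=0)  # tag vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_im[mask] = 0                                  # ignore the positive pairs
    cost_w[mask] = 0
    return cost_im.mean() + cost_w.mean()
```

Because both branches are L2-normalized, the dot product equals cosine similarity, so minimizing this loss directly shapes pairwise feature similarities to respect the observed image-word relations.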
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval Models
Keywords
feature learning; visual-semantic embedding; multimodal analysis
1. INTRODUCTION
The progress in multimedia applications is largely due to advances in feature representations for multimedia content. For example, over the past decades, we have witnessed the evolution of visual features from color histograms to SIFT interest points and to the recent deep learning features, which have helped move a wide variety of multimedia applications from academic prototypes into industrial products [14].
[Figure 1: (a) samples from ImageNet (labels "Desk", "Chow") and SBU (captions "Cleartrip's office area near my desk", "A gorgeous butterfly stopped by to check out dog park progress on 8-6-08."); (b) feature visualization.]
Figure 1: t-SNE visualization of the features of a million SBU images from Flickr. Different colors represent different semantic categories. One of the distributions is learned by deep learning on a million images across 1,000 categories from the high-quality labeled ImageNet, and the other is learned from a million images weakly labeled with over 30,000 unique tags by SBU (cf. Section 3 for details). Both sides show that the features are representative. Can you tell which side is learned from SBU? (see the answer below¹)
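A visualization like Figure 1(b) can be produced with off-the-shelf t-SNE. Below is a minimal sketch; the feature and label file names are hypothetical placeholders for the learned features and their semantic category ids.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.load("features.npy")   # assumed (N, D) array of learned deep features
labels = np.load("labels.npy")    # assumed category id per image, for coloring

# Embed the D-dimensional features into 2-D and color by semantic category.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=2, cmap="tab20")
plt.axis("off")
plt.show()
```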
Today, a general consensus is that features learned by deep neural networks can outperform most hand-engineered features and therefore free us to focus on designing algorithms and end applications.
In order to learn strong features, we need a large-scale and high-quality dataset. To this end, ImageNet, with millions of human-labeled images across thousands of semantic categories, has offered us a reliable incubator for developing features. However, this dataset was built from Web images five years ago and hence lags behind the fast-evolving semantic and visual diversity of real-world scenarios. For example, emerging vertical domains such as fashion (e.g., Taobao and Amazon) would need their own datasets in order to learn features specific to shoes and clothes. Moreover, videos from emerging popular social networks (e.g., Snapchat and Vine) call for features different from those learned from images. Building such datasets requires not only heavy and tedious labeling efforts but also expert domain knowledge, both of which are expensive. This awkward situation
¹Left: ImageNet. Right: SBU.