WebFace260M: A New Million-Scale Face Recognition Challenge (Cleaning and Using 4M Identities and 260M Faces)

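WebFace260M.pdf is a research paper on large-scale face recognition that explores what deep face recognition can achieve at million scale. It introduces WebFace260M, a dataset of 4M identities and 260M faces intended to advance deep face recognition in practical applications. The authors pay particular attention to data quality, because raw web data is noisy and that noise can degrade model performance.

The paper first describes data collection: the researchers assembled a list of 4M celebrity names and downloaded images for them from the internet, leaving 260M faces after pre-processing. To clean a dataset of this size, they developed a self-training pipeline called Cleaning Automatically by Self-Training (CAST). CAST is designed to clean large volumes of data efficiently and scalably, iteratively removing noise and mismatched faces. Applying CAST to WebFace260M yields WebFace42M, a cleaned set of 42M faces, which the paper describes as the largest cleaned public training set for face recognition. This helps narrow the gap in data availability between academia and industry.

Beyond the data, the paper proposes a time-constrained evaluation protocol, Face Recognition Under Inference Time conStraint (FRUITS), which lets researchers evaluate and tune their algorithms under realistic inference-time limits. Such a benchmark matters for real-time scenarios, where both accuracy and efficiency count. Together, WebFace260M and WebFace42M give the deep face recognition community a platform for comparing and improving algorithms, and a reference point for cleaning large, noisy datasets.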

Figure 2: Date of birth, nationality and profession of WebFace260M. Panels: (a) Date of Birth (axis from 1846 to 2019), (b) Nationality (labels include USA, CAN, AUS, India), and a profession panel (labels include Actor, Screenwriter, Film Director, Film Producer, Politician, Writer).

referring to different deployment scenarios.

• Based on the new benchmark, we perform extensive million-scale face recognition experiments. Enabled by a distributed training framework, comprehensive baselines are established on our test set under the FRUITS protocol. The results indicate substantial room for improvement in the light-weight track, as well as the necessity of innovation in the heavy-weight track.

2. WebFace260M and WebFace42M

Celebrity name list and image collection. The knowledge graph website Freebase [3] and the well-curated website IMDB [4] provide excellent resources for collecting celebrity names. Furthermore, commercial search engines such as Google [5] make it possible to collect images of a specific identity ranked by relevance. Our celebrity name list consists of two parts: the first is borrowed from MS1M (1M, constructed from Freebase) and the second is collected from the IMDB database. There are nearly 4M celebrity names on the IMDB website, but we found that some subjects have no public images in search engines. Therefore, only 3M celebrity names from IMDB are chosen for our benchmark. Based on the name list, celebrity faces are searched and downloaded via the Google image search engine. 200 images per identity are downloaded for the top 10% of subjects, while 100, 50 and 25 images are reserved for the remaining 20%, 30% and 40% of subjects, respectively. Finally, we collect 4M identities and 265M images.
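The tiered download quotas above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes exactly 4M identities and that every identity fills its quota, neither of which the text states, and the function names are illustrative.

```python
# (percentage of identities, images requested per identity), as listed in the text.
QUOTAS = [(10, 200), (20, 100), (30, 50), (40, 25)]


def nominal_image_count(num_identities: int) -> int:
    """Images downloaded if every identity fills its tier's quota (an assumption)."""
    return sum(num_identities * pct // 100 * per_id for pct, per_id in QUOTAS)


def mean_quota() -> float:
    """Average number of images requested per identity across all tiers."""
    return sum(pct * per_id for pct, per_id in QUOTAS) / 100


if __name__ == "__main__":
    print(mean_quota())                    # average quota per identity
    print(nominal_image_count(4_000_000))  # nominal total for a rounded 4M identities
```

With a rounded 4M identities the nominal total is 260M, the same order as the 265M images reported; the small difference suggests that the identity count and tier sizes in the text are rounded.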

Face pre-processing. Faces are detected and aligned through five landmarks predicted by RetinaFace [11]. For multi-face images, we only select the largest face with an above-threshold score, which filters out most improper faces (e.g. background faces or wrong decoding). After pre-processing, 4M identities and 260M faces remain (WebFace260M), as shown in Tab.1. The statistics of WebFace260M are illustrated in Fig.2, including date of birth, nationality and profession. Persons in WebFace260M come from more than 200 distinct countries/regions and more than 500 different professions, with dates of birth going back to 1846, which guarantees great diversity in our training data.
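The "largest face with an above-threshold score" rule can be sketched as follows. The `Detection` container and the 0.5 default threshold are hypothetical: the text names neither a threshold value nor a data format, and this is not RetinaFace's API, only a stand-in for whatever a detector returns.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Detection:
    """Hypothetical detector output: a box in pixels plus a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def select_largest_face(
    detections: Sequence[Detection], score_threshold: float = 0.5
) -> Optional[Detection]:
    """Return the largest box whose score reaches the threshold, or None."""
    confident = [d for d in detections if d.score >= score_threshold]
    if not confident:
        return None  # no usable face: the image would be dropped
    return max(confident, key=lambda d: d.area)
```

Filtering by score before comparing areas matters: a large but low-confidence box (for example a decoding artifact) should not win over a smaller, confident face.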

Figure 3: Pose (yaw), age and race of WebFace42M. Panels: (a) Pose (yaw axis from -90 to 90), (b) Age (axis from 0 to 99), and a race panel (Caucasian, M. East, E. Asian, African, Latino, Indian, SE. Asian).

Cleaned WebFace42M. We apply the CAST pipeline (Sec.3) to automatically clean the noisy WebFace260M and obtain a cleaned training set named WebFace42M, consisting of 42M faces of 2M subjects. The number of faces per identity varies from 3 to more than 300, and the average is 21 faces per identity. As shown in Fig.1 and Tab.1, WebFace42M offers the largest cleaned training data for face recognition. Compared with the MegaFace2 [38] dataset, the proposed WebFace42M includes 3 times more identities (2M vs. 672K) and nearly 10 times more images (42M vs. 4.7M). Compared with the widely used MS1M [21], our training set is 20 times (2M vs. 100K) and 4 times (42M vs. 10M) larger in terms of # identities and # photos. According to [64], there is more than 30% and 50% noise in MegaFace2 and MS1M, respectively, while the noise ratio of WebFace42M is lower than 10% (similar to CASIA-WebFace [84]) based on our sampling estimation. With such a large data size, we take a significant step towards closing the data gap between academia and industry.
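The size comparisons in this paragraph follow directly from the quoted counts. The short check below uses only the rounded figures given in the text; the dictionary layout and function names are illustrative.

```python
# (identities, images) as quoted in the text; all figures are rounded.
SIZES = {
    "WebFace42M": (2_000_000, 42_000_000),
    "MegaFace2": (672_000, 4_700_000),
    "MS1M": (100_000, 10_000_000),
}


def ratio_to(other: str) -> tuple[float, float]:
    """How many times larger WebFace42M is than `other` (identities, images)."""
    ids, imgs = SIZES["WebFace42M"]
    other_ids, other_imgs = SIZES[other]
    return ids / other_ids, imgs / other_imgs


def mean_faces_per_identity(name: str) -> float:
    ids, imgs = SIZES[name]
    return imgs / ids


if __name__ == "__main__":
    for name in ("MegaFace2", "MS1M"):
        id_ratio, img_ratio = ratio_to(name)
        print(f"{name}: {id_ratio:.1f}x identities, {img_ratio:.1f}x images")
    print(mean_faces_per_identity("WebFace42M"))
```

The ratios come out at roughly 3x and 9x against MegaFace2 and 20x and 4.2x against MS1M, consistent with the rounded "3 times", "nearly 10 times", "20 times" and "4 times" in the text, and the mean of 21 faces per identity matches 42M / 2M.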

Face attributes on WebFace42M. We further provide 7 face attribute annotations for WebFace42M, including pose, age, race, gender, hat, glasses, and mask. Fig.3 presents the distribution of our cleaned training data in different aspects. WebFace42M covers a large range of poses (Fig.3(a)), ages (Fig.3(b)) and most major races in the world (Fig.3(c)).

3. Cleaning Automatically by Self-Training

Since the images downloaded from the web are considerably noisy, it is necessary to perform a cleaning step to obtain high-quality training data. The original MS1M [21] does not perform any dataset cleaning, resulting in a noise ratio of nearly 50%, which significantly degrades the performance of the trained models. VGGFace [41], VGGFace2 [8] and IMDB-Face [64] adopt semi-automatic or manual cleaning pipelines, which require expensive labor. It becomes challenging to scale up the current annotation size to even more identities. Although the purification in MegaFace2 [38] is automatic, its procedure is complicated and considerably more than 30% noise remains [64]. Another relevant line of work clusters faces via unsupervised approaches [40, 35, 51] and supervised graph-based algorithms [85, 82, 81, 20, 72]. However, these methods assume the whole dataset is clean, which is not suitable for the extremely noisy WebFace260M.

Recently, self-training [77, 79, 42, 43], a standard approach in semi-supervised learning [48, 83], has been explored to significantly boost the performance of image classification.
