Fine-Grained Head Pose Estimation Without Keypoints
Nataniel Ruiz, Eunji Chong, James M. Rehg
Georgia Institute of Technology
{nataniel.ruiz, eunjichong, rehg}@gatech.edu
Abstract

Estimating the head pose of a person is a crucial problem with many applications, such as aiding gaze estimation, modeling attention, fitting 3D models to video, and performing face alignment. Traditionally, head pose is computed by estimating keypoints from the target face and solving the 2D-to-3D correspondence problem with a mean human head model. We argue that this is a fragile method because it relies entirely on landmark detection performance, an extraneous head model, and an ad-hoc fitting step. We present an elegant and robust way to determine pose by training a multi-loss convolutional neural network on 300W-LP, a large synthetically expanded dataset, to predict intrinsic Euler angles (yaw, pitch, and roll) directly from image intensities through joint binned pose classification and regression. We present empirical tests on common in-the-wild pose benchmark datasets which show state-of-the-art results. Additionally, we test our method on a dataset usually used for depth-based pose estimation and begin to close the gap with state-of-the-art depth pose methods. We open-source our training and testing code and release our pre-trained models.¹
1. Introduction
The related problems of head pose estimation and facial expression tracking have played an important role over the past 25 years in driving vision technologies for non-rigid registration and 3D reconstruction and in enabling new ways to manipulate multimedia content and interact with users. Historically, there have been several major approaches to face modeling, with two primary ones being discriminative/landmark-based approaches [26, 29] and parameterized appearance models, or PAMs [4, 15] (see [30] for additional discussion). In recent years, methods which directly extract 2D facial keypoints using modern deep learning tools [2, 35, 14] have become the dominant approach to facial expression analysis, due to their flexibility and robustness to occlusions and extreme pose changes. A by-product of keypoint-based facial expression analysis is the ability to recover the 3D pose of the head, by establishing correspondence between the keypoints and a 3D head model and performing alignment. However, in some applications the head pose may be all that needs to be estimated. In that case, is the keypoint-based approach still the best way forward? This question has not been thoroughly addressed using modern deep learning tools, a gap in the literature that this paper attempts to fill.

¹https://github.com/natanielruiz/deep-head-pose
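The keypoint-based recovery step described above can be written as a 2D-to-3D alignment (perspective-n-point) problem; the notation here is illustrative rather than taken from any specific implementation. Given detected 2D landmarks $\mathbf{x}_i$ and their corresponding points $\mathbf{X}_i$ on a mean 3D head model, the pose is the rotation $R$ and translation $\mathbf{t}$ minimizing the reprojection error

$$
R^{*}, \mathbf{t}^{*} \;=\; \arg\min_{R,\,\mathbf{t}} \sum_{i} \left\| \mathbf{x}_i - \pi\!\left( K \left( R\,\mathbf{X}_i + \mathbf{t} \right) \right) \right\|^{2},
$$

where $K$ is the camera intrinsics matrix and $\pi$ is perspective projection; Euler angles are then extracted from $R^{*}$. Errors in any landmark $\mathbf{x}_i$ or in the model points $\mathbf{X}_i$ propagate directly into the recovered pose.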
We demonstrate that a direct, holistic approach to estimating 3D head pose from image intensities using convolutional neural networks delivers superior accuracy compared to keypoint-based methods. While keypoint detectors have recently improved dramatically due to deep learning, head pose recovery is inherently a two-step process with numerous opportunities for error. First, if sufficient keypoints fail to be detected, then pose recovery is impossible. Second, the accuracy of the pose estimate depends on the quality of the 3D head model. Generic head models can introduce errors for any given participant, and the process of deforming the head model to adapt to each participant requires significant amounts of data and can be computationally expensive.
While it is common for deep learning methods based on keypoints to jointly predict head pose along with facial landmarks, the goal in this case is to improve the accuracy of the facial landmark predictions, and the head pose branch is not sufficiently accurate on its own: see, for example, [14, 20, 21], which are studied in Sections 4.1 and 4.3. A conv-net architecture which directly predicts head pose has the potential to be much simpler, more accurate, and faster. While other works have addressed direct regression of pose from images using conv-nets [31, 19, 3], they did not include a comprehensive set of benchmarks or leverage modern deep architectures.
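The joint binned pose classification and regression mentioned in the abstract can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: the bin count (66), bin width (3 degrees over roughly [-99, 99]), and the regression weight `alpha` are assumptions for illustration, and in practice the bin logits come from a CNN backbone rather than being handled directly.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the bin logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_angle(logits, num_bins=66, bin_width=3.0, angle_min=-99.0):
    # Decode a continuous angle as the expectation over bin centers,
    # turning the classifier's soft output into a fine-grained estimate.
    probs = softmax(logits)
    centers = [angle_min + bin_width * (i + 0.5) for i in range(num_bins)]
    return sum(p * c for p, c in zip(probs, centers))

def multi_loss(logits, target_angle, alpha=0.5,
               num_bins=66, bin_width=3.0, angle_min=-99.0):
    # Joint loss: cross-entropy on the ground-truth bin plus a weighted
    # mean-squared error on the expectation-decoded continuous angle.
    probs = softmax(logits)
    target_bin = int((target_angle - angle_min) // bin_width)
    ce = -math.log(probs[target_bin])
    pred = expected_angle(logits, num_bins, bin_width, angle_min)
    mse = (pred - target_angle) ** 2
    return ce + alpha * mse
```

One such loss is computed per Euler angle (yaw, pitch, roll); the coarse classification term stabilizes training while the expectation-based regression term recovers fine-grained angles.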
In applications where accurate head pose estimation is required, a common solution is to use RGBD (depth) cameras. These can be very accurate but suffer from a number of limitations. First, because they use active sensing, they can be difficult to use outdoors and in uncontrolled