Deep-Learning-Driven Human Motion Analysis and Applications


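This document covers the talk "Deep Learning of Human Motion", which focuses on how deep learning techniques are applied to the analysis of human motion. The author, Christian Wolf of LIRIS UMR CNRS 5205 at the Université de Lyon, France, presented this work on 28 April 2016. Deep learning plays the central role here and is used for a range of tasks related to human motion, including pose estimation, gesture recognition, group and individual activity recognition, semantic segmentation, and document analysis.

Interpreting data is the foundation of deep learning: by training on datasets such as the ADL dataset, a model learns to understand visual input. The document points out that human learning mixes supervised and unsupervised learning. Children learn to recognize objects from shapes and visual input alone, without explicit labels; when asked "What is a cat?", they grasp its characteristic features on their own.

The core application cases include pose estimation, where deep learning methods, such as one published in IEEE Transactions on Pattern Analysis and Machine Intelligence (2016) and another under review, achieve high-accuracy estimation of limb positions. Gesture recognition is another key area; work presented at conferences such as BMVC 2014 and ICANN 2010 shows that deep learning improves the understanding and classification of gestures. The research also covers recognition of group and individual activities (for example, the HBU 2011 and BMVC 2012 work) and semantic segmentation, that is, separating different objects or action regions in an image. Document analysis, as presented at ICDAR 2015, uses deep learning to recognize text content and is a further practical application.

Inspiration from biological neurons also appears in artificial intelligence: the work of Devin K. Phillips, for example, applies the structure and function of biological neurons to artificial neural networks such as the Perceptron, in order to process and predict complex motion data. Within the deep learning framework, a model learns parameters in order to predict specific values, for example associating a set of object classes with the corresponding labels, or performing biometric identification from motion data, as in the biometric recognition technique presented in IEEE Access 2016. Overall, the document examines the potential of deep learning for human motion analysis, shows the progress it has made in pose estimation, action recognition, and visual understanding, and stresses its role in driving further advances.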


Multi-layer Perceptron (MLP)

"Fully-connected" layers

5.1. Feed-forward Network Functions 229

notation for the two kinds of model. We shall see later how to give a probabilistic interpretation to a neural network.

As discussed in Section 3.1, the bias parameters in (5.2) can be absorbed into the set of weight parameters by defining an additional input variable $x_0$ whose value is clamped at $x_0 = 1$, so that (5.2) takes the form

$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i. \tag{5.8}$$

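We can similarly absorb the second-layer biases into the second-layer weights, so that the overall network function becomes

$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right). \tag{5.9}$$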
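The network function in (5.9) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the hidden activation $h$ is taken to be `tanh`, the function name `forward` and the weight values are hypothetical, and the biases are stored in column 0 of each weight matrix to mirror the clamped inputs $x_0 = 1$ and $z_0 = 1$.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, used here as the output activation sigma."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2, h=math.tanh):
    """Two-layer network of (5.9) with biases absorbed into the weights.

    x  : list of D inputs (without the clamped x_0)
    W1 : M rows of D + 1 first-layer weights; column 0 acts as the bias
    W2 : K rows of M + 1 second-layer weights; column 0 acts as the bias
    h  : hidden-unit activation (tanh is an assumption of this sketch)
    """
    x_ext = [1.0] + list(x)                                        # x_0 = 1
    a = [sum(w * xi for w, xi in zip(row, x_ext)) for row in W1]   # as in (5.8)
    z_ext = [1.0] + [h(a_j) for a_j in a]                          # z_0 = 1
    return [sigmoid(sum(w * zj for w, zj in zip(row, z_ext))) for row in W2]

# Hypothetical weights: D = 2 inputs, M = 2 hidden units, K = 1 output.
W1 = [[0.0, 1.0, -1.0],
      [0.5, 0.5, 0.5]]
W2 = [[0.0, 1.0, 1.0]]
y = forward([0.0, 0.0], W1, W2)
```

Storing the bias as an extra weight column keeps both layers as plain weighted sums, which is the point of the absorption trick.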

As can be seen from Figure 5.1, the neural network model comprises two stages of processing, each of which resembles the perceptron model of Section 4.1.7, and for this reason the neural network is also known as the multilayer perceptron, or MLP. A key difference compared to the perceptron, however, is that the neural network uses continuous sigmoidal nonlinearities in the hidden units, whereas the perceptron uses step-function nonlinearities. This means that the neural network function is differentiable with respect to the network parameters, and this property will play a central role in network training.

If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2, we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.
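The collapse of a linear-hidden-unit network into a single linear map can be checked numerically. The sketch below uses hypothetical weight matrices (3 inputs, 2 linear hidden units, 3 outputs, biases omitted for brevity) and compares the two-stage computation with the single matrix `W2 @ W1`. The zero determinant of the collapsed matrix illustrates the information lost at the two-unit bottleneck.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    """Matrix-matrix product A @ B for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def det3(A):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = A
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical weights: 3 inputs -> 2 linear hidden units -> 3 outputs.
W1 = [[1.0, 2.0, 0.0],
      [0.0, -1.0, 1.0]]
W2 = [[1.0, 1.0],
      [2.0, 0.0],
      [0.0, 3.0]]
x = [1.0, 2.0, 3.0]

two_stage = matvec(W2, matvec(W1, x))   # pass through the hidden layer
W = matmul(W2, W1)                      # equivalent network without hidden units
collapsed = matvec(W, x)
bottleneck_det = det3(W)                # singular: rank is at most 2
```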

The network architecture shown in Figure 5.1 is the most commonly used one in practice. However, it is easily generalized, for instance by considering additional layers of processing each consisting of a weighted linear combination of the form (5.4) followed by an element-wise transformation using a nonlinear activation function. Note that there is some confusion in the literature regarding the terminology for counting the number of layers in such networks. Thus the network in Figure 5.1 may be described as a 3-layer network (which counts the number of layers of units, and treats the inputs as units) or sometimes as a single-hidden-layer network (which counts the number of layers of hidden units). We recommend a terminology in which Figure 5.1 is called a two-layer network, because it is the number of layers of adaptive weights that is important for determining the network properties.

Another generalization of the network architecture is to include skip-layer connections, each of which is associated with a corresponding adaptive parameter. For ...

[Figure: multi-layer perceptron diagram with input layer, hidden layer, and output layer]

Learning by gradient descent

Iterative minimisation through gradient descent:

236 5. NEURAL NETWORKS

Figure 5.5: Geometrical view of the error function $E(\mathbf{w})$ as a surface sitting over weight space. Point $\mathbf{w}_A$ is a local minimum and $\mathbf{w}_B$ is the global minimum. At any point $\mathbf{w}_C$, the local gradient of the error surface is given by the vector $\nabla E$.

Following the discussion of Section 4.3.4, we see that the output unit activation function, which corresponds to the canonical link, is given by the softmax function

$$y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))} \tag{5.25}$$

which satisfies $0 \leqslant y_k \leqslant 1$ and $\sum_k y_k = 1$. Note that the $y_k(\mathbf{x}, \mathbf{w})$ are unchanged if a constant is added to all of the $a_k(\mathbf{x}, \mathbf{w})$, causing the error function to be constant for some directions in weight space. This degeneracy is removed if an appropriate regularization term (Section 5.5) is added to the error function.
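The two properties stated for the softmax (outputs sum to one, and invariance to adding a constant to every activation) are easy to confirm numerically. The following is a minimal sketch with hypothetical activation values; subtracting the maximum before exponentiating is a common numerical-stability habit that relies on exactly this invariance, not something taken from the text.

```python
import math

def softmax(a):
    """Softmax over a list of activations a_k."""
    m = max(a)                                  # shift is harmless by invariance
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

a = [1.0, 2.0, 3.0]                             # hypothetical activations
y = softmax(a)
y_shifted = softmax([a_k + 100.0 for a_k in a]) # same constant added to all a_k
```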

Once again, the derivative of the error function with respect to the activation for a particular output unit takes the familiar form (5.18) (Exercise 5.7).

In summary, there is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved. For regression we use linear outputs and a sum-of-squares error, for (multiple independent) binary classifications we use logistic sigmoid outputs and a cross-entropy error function, and for multiclass classification we use softmax outputs with the corresponding multiclass cross-entropy error function. For classification problems involving two classes, we can use a single logistic sigmoid output, or alternatively we can use a network with two outputs having a softmax output activation function.

5.2.1 Parameter optimization

We turn next to the task of finding a weight vector $\mathbf{w}$ which minimizes the chosen function $E(\mathbf{w})$. At this point, it is useful to have a geometrical picture of the error function, which we can view as a surface sitting over weight space as shown in Figure 5.5. First note that if we make a small step in weight space from $\mathbf{w}$ to $\mathbf{w} + \delta\mathbf{w}$ then the change in the error function is $\delta E \simeq \delta\mathbf{w}^{\mathrm{T}} \nabla E(\mathbf{w})$, where the vector $\nabla E(\mathbf{w})$ points in the direction of greatest rate of increase of the error function. Because the error $E(\mathbf{w})$ is a smooth continuous function of $\mathbf{w}$, its smallest value will occur at a

240 5. NEURAL NETWORKS

evaluations, each of which would require $O(W)$ steps. Thus, the computational effort needed to find the minimum using such an approach would be $O(W^3)$.

Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of $\nabla E$ brings $W$ items of information, we might hope to find the minimum of the function in $O(W)$ gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only $O(W)$ steps and so the minimum can now be found in $O(W^2)$ steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.

5.2.4 Gradient descent optimization

The simplest approach to using gradient information is to choose the weight update in (5.27) to comprise a small step in the direction of the negative gradient, so that

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)}) \tag{5.41}$$

where the parameter $\eta > 0$ is known as the learning rate. After each such update, the gradient is re-evaluated for the new weight vector and the process repeated. Note that the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate $\nabla E$. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent. Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008).
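The batch update (5.41) can be illustrated on a deliberately simple problem. The sketch below assumes a linear model with a sum-of-squares error, a hypothetical three-point data set, and hand-picked values for the learning rate and step count; none of these come from the text. Each step recomputes the gradient over the entire training set before moving the weights.

```python
def grad_E(w, data):
    """Gradient of E(w) = 0.5 * sum_n (w . x_n - t_n)^2 over the whole set."""
    g = [0.0] * len(w)
    for x, t in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - t
        for i, xi in enumerate(x):
            g[i] += err * xi
    return g

def gradient_descent(w, data, eta=0.1, steps=200):
    """Batch gradient descent: every step processes the entire training set."""
    for _ in range(steps):
        g = grad_E(w, data)
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # update of the form (5.41)
    return w

# Hypothetical toy data from t = 1 + 2 * x, with x_0 = 1 as the bias input.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = gradient_descent([0.0, 0.0], data)
```

On this convex toy problem the weights approach the generating values (1, 2); a learning rate that is too large for the data would make the same loop diverge.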

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent (Gill et al., 1981; Fletcher, 1987; Nocedal and Wright, 1999). Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point

$$E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}). \tag{5.42}$$

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time, so that

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)}). \tag{5.43}$$
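For contrast with the batch version, the on-line update (5.43) can be sketched as follows. The setup is an assumption made for illustration: a linear model with per-point error $E_n = \frac{1}{2}(\mathbf{w} \cdot \mathbf{x}_n - t_n)^2$, a hypothetical noise-free data set, a fixed seed, and a random visiting order in each pass over the data.

```python
import random

def grad_E_n(w, x, t):
    """Gradient of the single-point error E_n(w) = 0.5 * (w . x - t)^2."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - t
    return [err * xi for xi in x]

def sgd(w, data, eta=0.05, epochs=500, seed=0):
    """Stochastic gradient descent: one weight update per data point."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                          # visit points in random order
        for x, t in data:
            g = grad_E_n(w, x, t)
            w = [wi - eta * gi for wi, gi in zip(w, g)]   # update of the form (5.43)
    return w

# Hypothetical toy data from t = 1 + 2 * x, with x_0 = 1 as the bias input.
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = sgd([0.0, 0.0], data)
```

Because each update touches a single point, the cost per step does not grow with the size of the training set, which is what makes this variant attractive for large data sets.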

currently estimated label

5.2. Network Training 235

If we consider a training set of independent observations, then the error function, which is given by the negative log likelihood, is then a cross-entropy error function of the form

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\} \tag{5.21}$$

where $y_n$ denotes $y(\mathbf{x}_n, \mathbf{w})$. Note that there is no analogue of the noise precision $\beta$ because the target values are assumed to be correctly labelled. However, the model is easily extended to allow for labelling errors (Exercise 5.4). Simard et al. (2003) found that using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization.
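The cross-entropy error (5.21) is short enough to write out directly. The sketch below evaluates it for hypothetical output activations and targets, and then compares a finite-difference derivative with respect to one activation against $y_n - t_n$. That expression is assumed here to be the form the text refers to as (5.18); the excerpt itself does not spell it out.

```python
import math

def sigmoid(a):
    """Logistic sigmoid output activation."""
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(activations, targets):
    """E = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }, as in (5.21)."""
    total = 0.0
    for a, t in zip(activations, targets):
        y = sigmoid(a)
        total -= t * math.log(y) + (1 - t) * math.log(1 - y)
    return total

# Hypothetical output activations a_n and binary targets t_n.
a = [0.3, -1.2, 2.0]
t = [1, 0, 1]

# Central finite difference of E with respect to the first activation.
eps = 1e-6
numeric = (cross_entropy([a[0] + eps] + a[1:], t)
           - cross_entropy([a[0] - eps] + a[1:], t)) / (2 * eps)
analytic = sigmoid(a[0]) - t[0]   # assumed form of (5.18): y_n - t_n
```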

If we have $K$ separate binary classifications to perform, then we can use a network having $K$ outputs each of which has a logistic sigmoid activation function. Associated with each output is a binary class label $t_k \in \{0, 1\}$, where $k = 1, \ldots, K$. If we assume that the class labels are independent, given the input vector, then the conditional distribution of the targets is

$$p(\mathbf{t} | \mathbf{x}, \mathbf{w}) = \prod_{k=1}^{K} y_k(\mathbf{x}, \mathbf{w})^{t_k} \left[ 1 - y_k(\mathbf{x}, \mathbf{w}) \right]^{1 - t_k}. \tag{5.22}$$

Taking the negative logarithm of the corresponding likelihood function then gives the following error function (Exercise 5.5)

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \right\} \tag{5.23}$$

where $y_{nk}$ denotes $y_k(\mathbf{x}_n, \mathbf{w})$. Again, the derivative of the error function with respect to the activation for a particular output unit takes the form (5.18) just as in the regression case (Exercise 5.6).

It is interesting to contrast the neural network solution to this problem with the corresponding approach based on a linear classification model of the kind discussed in Chapter 4. Suppose that we are using a standard two-layer network of the kind shown in Figure 5.1. We see that the weight parameters in the first layer of the network are shared between the various outputs, whereas in the linear model each classification problem is solved independently. The first layer of the network can be viewed as performing a nonlinear feature extraction, and the sharing of features between the different outputs can save on computation and can also lead to improved generalization.

Finally, we consider the standard multiclass classification problem in which each input is assigned to one of $K$ mutually exclusive classes. The binary target variables $t_k \in \{0, 1\}$ have a 1-of-$K$ coding scheme indicating the class, and the network outputs are interpreted as $y_k(\mathbf{x}, \mathbf{w}) = p(t_k = 1 | \mathbf{x})$, leading to the following error function

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_k(\mathbf{x}_n, \mathbf{w}). \tag{5.24}$$
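The multiclass error (5.24) pairs naturally with softmax outputs. The following sketch assumes hypothetical activations for two data points and 1-of-K target vectors; the helper names are illustrative only. With 1-of-K coding, only the log-probability of the correct class contributes for each data point.

```python
import math

def softmax(a):
    """Softmax over a list of activations, shifted by the max for stability."""
    m = max(a)
    exps = [math.exp(a_k - m) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_cross_entropy(activations, targets):
    """E = -sum_n sum_k t_nk ln y_k(x_n, w), as in (5.24)."""
    total = 0.0
    for a_n, t_n in zip(activations, targets):
        y_n = softmax(a_n)
        total -= sum(t * math.log(y) for t, y in zip(t_n, y_n))
    return total

# Hypothetical activations for N = 2 points and K = 3 classes.
activations = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [[1, 0, 0], [0, 1, 0]]        # 1-of-K coding
E = multiclass_cross_entropy(activations, targets)
```

The second point has uniform outputs, so it contributes $\ln 3$ to the error regardless of which class is the target.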

Given an error (loss) function:

Can be blocked in a local minimum (not that it matters much …)

[C. Bishop, Pattern Recognition and Machine Learning, 2006]

target (ground-truth) label
