CS294A Lecture notes
Andrew Ng
Sparse autoencoder
1 Introduction
Supervised learning is one of the most powerful tools of AI, and has led to
automatic zip code recognition, speech recognition, self-driving cars, and a
continually improving understanding of the human genome. Despite its significant
successes, supervised learning today is still severely limited. Specifically,
most applications of it still require that we manually specify the input
features $x$ given to the algorithm. Once a good feature representation is
given, a supervised learning algorithm can do well. But in such domains as
computer vision, audio processing, and natural language processing, there're
now hundreds or perhaps thousands of researchers who've spent years of their
lives slowly and laboriously hand-engineering vision, audio or text features.
While much of this feature-engineering work is extremely clever, one has to
wonder if we can do better. Certainly this labor-intensive hand-engineering
approach does not scale well to new problems; further, ideally we'd like to
have algorithms that can automatically learn even better feature representations
than the hand-engineered ones.
These notes describe the sparse autoencoder learning algorithm, which
is one approach to automatically learn features from unlabeled data. In some
domains, such as computer vision, this approach is not by itself competitive
with the best hand-engineered features, but the features it can learn do turn
out to be useful for a range of problems (including ones in audio, text, etc).
Further, there're more sophisticated versions of the sparse autoencoder (not
described in these notes, but that you'll hear more about later in the class)
that do surprisingly well, and in many cases are competitive with or superior
to even the best hand-engineered representations.

These notes are organized as follows. We will first describe feedforward
neural networks and the backpropagation algorithm for supervised learning.
Then, we show how this is used to construct an autoencoder, which is an
unsupervised learning algorithm. Finally, we build on this to derive a sparse
autoencoder. Because these notes are fairly notation-heavy, the last page
also contains a summary of the symbols used.
2 Neural networks
Consider a supervised learning problem where we have access to labeled training
examples $(x^{(i)}, y^{(i)})$. Neural networks give a way of defining a complex,
non-linear form of hypotheses $h_{W,b}(x)$, with parameters $W, b$ that we can fit
to our data.
To describe neural networks, we will begin by describing the simplest
possible neural network, one which comprises a single “neuron.” We will use
the following diagram to denote a single neuron:
This “neuron” is a computational unit that takes as input $x_1, x_2, x_3$ (and
a $+1$ intercept term), and outputs $h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$,
where $f : \mathbb{R} \mapsto \mathbb{R}$ is called the activation function. In these notes, we will
choose $f(\cdot)$ to be the sigmoid function:
\[
f(z) = \frac{1}{1 + \exp(-z)}.
\]
Thus, our single neuron corresponds exactly to the input-output mapping
defined by logistic regression.
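As a concrete illustration, here is a minimal NumPy sketch of this single-neuron computation (added to these notes for illustration, not part of the original text); the particular values of x, W, and b are arbitrary examples.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    """Single-neuron hypothesis h_{W,b}(x) = f(W^T x + b)."""
    return sigmoid(np.dot(W, x) + b)

# Arbitrary example input and parameters (3 inputs; the +1 intercept enters via b).
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron(x, W, b))  # a single real number in (0, 1)
```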
Although these notes will use the sigmoid function, it is worth noting that
another common choice for $f$ is the hyperbolic tangent, or tanh, function:
\[
f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \tag{1}
\]
Here are plots of the sigmoid and tanh functions:

The $\tanh(z)$ function is a rescaled version of the sigmoid, and its output
range is $[-1, 1]$ instead of $[0, 1]$.
Note that unlike CS221 and (parts of) CS229, we are not using the convention
here of $x_0 = 1$. Instead, the intercept term is handled separately by
the parameter $b$.
Finally, one identity that'll be useful later: If $f(z) = 1/(1 + \exp(-z))$ is
the sigmoid function, then its derivative is given by $f'(z) = f(z)(1 - f(z))$.
(If $f$ is the tanh function, then its derivative is given by $f'(z) = 1 - (f(z))^2$.)
You can derive this yourself using the definition of the sigmoid (or tanh)
function.
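As a quick sanity check (an added sketch, not part of the original notes), the two derivative identities can be verified numerically against centered finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)
eps = 1e-6

# Centered finite-difference approximations of the derivatives.
sig_fd = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
tanh_fd = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

# Closed-form identities from the text.
sig_closed = sigmoid(z) * (1.0 - sigmoid(z))   # f'(z) = f(z)(1 - f(z))
tanh_closed = 1.0 - np.tanh(z) ** 2            # f'(z) = 1 - (f(z))^2

print(np.max(np.abs(sig_fd - sig_closed)))     # tiny, so the identity checks out
print(np.max(np.abs(tanh_fd - tanh_closed)))
```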
2.1 Neural network formulation
A neural network is put together by hooking together many of our simple
“neurons,” so that the output of a neuron can be the input of another. For
example, here is a small neural network:

In this figure, we have used circles to also denote the inputs to the net-
work. The circles labeled “+1” are called bias units, and correspond to the
intercept term. The leftmost layer of the network is called the input layer,
and the rightmost layer the output layer (which, in this example, has only
one node). The middle layer of nodes is called the hidden layer, because
its values are not observed in the training set. We also say that our example
neural network has 3 input units (not counting the bias unit), 3 hidden
units, and 1 output unit.
We will let $n_l$ denote the number of layers in our network; thus $n_l = 3$
in our example. We label layer $l$ as $L_l$, so layer $L_1$ is the input layer, and
layer $L_{n_l}$ the output layer. Our neural network has parameters $(W, b) =
(W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where we write $W^{(l)}_{ij}$ to denote the parameter (or weight)
associated with the connection between unit $j$ in layer $l$, and unit $i$ in layer
$l+1$. (Note the order of the indices.) Also, $b^{(l)}_i$ is the bias associated with unit
$i$ in layer $l+1$. Thus, in our example, we have $W^{(1)} \in \mathbb{R}^{3 \times 3}$, and $W^{(2)} \in \mathbb{R}^{1 \times 3}$.
Note that bias units don't have inputs or connections going into them, since
they always output the value +1. We also let $s_l$ denote the number of nodes
in layer $l$ (not counting the bias unit).
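To make the indexing convention concrete, here is a small sketch of the parameter shapes for the 3-3-1 example network (added for illustration; the small random initialization scale is an arbitrary choice, not something prescribed at this point in the notes):

```python
import numpy as np

# Layer sizes s_l for the example network: 3 input units, 3 hidden units, 1 output unit.
s = [3, 3, 1]

rng = np.random.default_rng(0)

# W^{(l)} has shape (s_{l+1}, s_l): entry (i, j) is the weight on the connection
# from unit j in layer l to unit i in layer l+1 -- note the order of the indices.
W1 = rng.normal(scale=0.01, size=(s[1], s[0]))   # W^{(1)} in R^{3x3}
W2 = rng.normal(scale=0.01, size=(s[2], s[1]))   # W^{(2)} in R^{1x3}
b1 = np.zeros(s[1])                              # b^{(1)}: one bias per unit in layer 2
b2 = np.zeros(s[2])                              # b^{(2)}: one bias for the output unit

print(W1.shape, W2.shape, b1.shape, b2.shape)    # (3, 3) (1, 3) (3,) (1,)
```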
We will write $a^{(l)}_i$ to denote the activation (meaning output value) of
unit $i$ in layer $l$. For $l = 1$, we also use $a^{(1)}_i = x_i$ to denote the $i$-th input.
Given a fixed setting of the parameters $W, b$, our neural network defines a
hypothesis $h_{W,b}(x)$ that outputs a real number. Specifically, the computation
that this neural network represents is given by:
\begin{align*}
a^{(2)}_1 &= f\big(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1\big) \tag{2} \\
a^{(2)}_2 &= f\big(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2\big) \tag{3} \\
a^{(2)}_3 &= f\big(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3\big) \tag{4} \\
h_{W,b}(x) = a^{(3)}_1 &= f\big(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1\big) \tag{5}
\end{align*}
In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit
$i$ in layer $l$, including the bias term (e.g., $z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$), so that
$a^{(l)}_i = f(z^{(l)}_i)$.
Note that this easily lends itself to a more compact notation. Specifically,
if we extend the activation function $f(\cdot)$ to apply to vectors in an element-wise
fashion (i.e., $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$), then we can write
\begin{align*}
z^{(2)} &= W^{(1)} x + b^{(1)} \\
a^{(2)} &= f(z^{(2)}) \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
h_{W,b}(x) &= a^{(3)} = f(z^{(3)}).
\end{align*}
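This vectorized form maps directly onto array operations. The following sketch (added for illustration, not from the original notes) computes the forward pass of the 3-3-1 example network with arbitrary placeholder parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Vectorized forward pass for the 3-3-1 network described above."""
    z2 = W1 @ x + b1        # z^{(2)} = W^{(1)} x + b^{(1)}
    a2 = sigmoid(z2)        # a^{(2)} = f(z^{(2)})
    z3 = W2 @ a2 + b2       # z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}
    return sigmoid(z3)      # h_{W,b}(x) = a^{(3)} = f(z^{(3)})

# Arbitrary placeholder parameters and input.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x = np.array([0.5, -1.0, 2.0])
print(forward(x, W1, b1, W2, b2))   # array containing a single real number
```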