
Machine Learning Lecture Notes and
Supplemental Materials
Introduction to Machine Learning (机器学习导论)
Professor Zhihua Zhang (张志华)

Machine Learning Basic Theory
Lecture Notes 1: Basic Theory
Professor: Zhihua Zhang    Scribes: Cheng Chen, Luo Luo, Cong Xie
1 Introduction
In machine learning, data is typically expressed in matrix form. Suppose we have n samples and p variables (or features). Then we can write
$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}_{n \times p}.
$$
The ith sample can be denoted as $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$.
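For concreteness, here is a minimal sketch (assuming NumPy; the sizes are arbitrary) of storing such data as an $n \times p$ array and reading off one sample:

```python
import numpy as np

n, p = 100, 5                 # number of samples and number of features (illustrative)
X = np.random.randn(n, p)     # data matrix: one row per sample, one column per feature

i = 3
x_i = X[i]                    # the ith sample (0-indexed here) as a length-p vector
print(x_i.shape)              # (5,)
```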
Machine learning mainly aims to solve the following problems:
(1) Dimension Reduction: Dimension reduction is the process of reducing the number of random variables (or features) under consideration. Formally, given $X_i \in \mathbb{R}^p$, we want to find $Z_i \in \mathbb{R}^q$ ($q < p$) to represent $X_i$.
If we use a linear transformation, then we need to find a matrix A such that $Z_i = A X_i$. Note that A should have full row rank (see the code sketch after this list of problems).
If we use a nonlinear transformation, then we need to find a nonlinear function f such that $Z_i = f(X_i)$.
(2) Clustering: Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). We can view the n samples as n points, and our objective is to cluster them into k clusters.
(3) Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Formally, in the training set we have a label $Y_i$ for each $X_i$, where $Y_i \in C$ and C is a non-empty finite set. If $Y_i \in \{-1, 1\}$ or $\{0, 1\}$, it is a binary classification problem. If $Y_i \in \{1, 2, \ldots, k\}$, it is a multi-class classification problem. There are also problems in which one observation belongs to more than one category; these are called multi-label or multi-output classification.
(4) Regression: Regression is a particular classification problem in which the label $Y_i \in \mathbb{R}$.
(5) Ranking: also called isotonic regression (IR). Isotonic regression involves finding a weighted least-squares fit $x \in \mathbb{R}^n$ to a vector $a \in \mathbb{R}^n$ with weight vector $w \in \mathbb{R}^n$, subject to a set of non-contradictory constraints of the kind $x_i \ge x_j$.
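As referenced in problem (1), here is a minimal sketch of one linear dimension reduction. PCA is used purely as an illustration (the notes do not prescribe a specific method); the matrix A is taken to be the top q right singular vectors of the centered data, assuming NumPy is available:

```python
import numpy as np

def linear_reduce(X, q):
    """Project n x p data X down to q dimensions via PCA (one choice of full-row-rank A)."""
    Xc = X - X.mean(axis=0)                          # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[:q]                                       # q x p projection matrix, full row rank
    Z = Xc @ A.T                                     # Z_i = A X_i for each centered sample
    return Z, A

X = np.random.randn(100, 5)
Z, A = linear_reduce(X, q=2)
print(Z.shape)   # (100, 2)
```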

Note that (1) and (2) are unsupervised learning, while (3), (4), and (5) are supervised learning. Unsupervised learning tries to find hidden structure in unlabelled data. Supervised learning is the machine learning task of inferring a function from labelled training data.
For supervised learning, the data is usually split into two or three parts.
(1) Training data: A set of examples used for learning, that is, to fit the parameters (e.g., the weights of a neural network) of the model.
(2) Validation data: Sometimes, we also need a validation set to tune the model, for
example to choose the number of hidden units in a neural network or for pruning a
decision tree. It is usually used to prevent overfitting and enhance the generalization
ability.
(3) Test data: This data set is used only for testing the final solution in order to confirm
the actual performance.
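A minimal sketch of such a split (the 60/20/20 proportions and the NumPy-based shuffling are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def split_data(X, y, train=0.6, val=0.2, seed=0):
    """Randomly split (X, y) into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle sample indices
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```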
1.1 Frequentist’s view vs. Bayesian view
1.1.1 Frequentist’s view
The frequentist approach views the model parameters as unknown constants and estimates them by matching the model to the training data using an appropriate metric.
Example 1.1 Suppose we have n pairs of samples $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and we want to fit a linear function $x_i^T a$ (more strictly, it should be $x_i^T a + b$, or we include a constant variable 1 in $x_i$) to predict $y_i$.
Using least squares, we have the loss function $L = \sum_{i=1}^n (y_i - x_i^T a)^2$, where a is an unknown fixed parameter. We can solve for a by minimizing the loss function.
Using maximum likelihood estimation, let $y_i \sim \mathcal{N}(x_i^T a, \sigma^2)$, namely,
$$p(y_i \mid x_i) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\left(-\frac{(y_i - x_i^T a)^2}{2\sigma^2}\right).$$
So the log likelihood is (assuming the samples are independent)
$$\ell = \log \prod_{i=1}^n p(y_i \mid x_i).$$
We can solve for a by maximizing the joint likelihood.
Under the above conditions, you can prove that maximum likelihood estimation is the
same as least squares.
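A one-line verification (a sketch, simply expanding the Gaussian log-likelihood defined above):
$$\ell = \sum_{i=1}^n \log p(y_i \mid x_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T a)^2,$$
so for fixed $\sigma^2$, maximizing $\ell$ over a is exactly minimizing the least-squares loss L.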
1.1.2 Bayesian view
The Bayesian approach views the model parameters as random variables and estimates them by using Bayes' theorem.
Example 1.2 Let us continue Example 1.1 and again let $y_i \sim \mathcal{N}(x_i^T a, \sigma^2)$. Here a and σ are random variables, not constants. Let $a \sim \mathcal{N}(0, \lambda^2)$ and $\sigma^2 \sim \Gamma(\alpha, \beta)$. Our interest is the posterior probability $P(a \mid x_i, y_i) \propto P(x_i, y_i \mid a)\, P(a)$. We can use maximum a posteriori (MAP) estimation or Bayesian estimation to solve for a.
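For intuition, here is a sketch of the resulting MAP objective, under the additional assumptions (mine, not stated in the notes) that $\sigma^2$ is held fixed and the prior $\mathcal{N}(0, \lambda^2)$ applies independently to each coordinate of a:
$$\hat{a}_{\mathrm{MAP}} = \arg\max_a \Big[\log P(y \mid X, a) + \log P(a)\Big] = \arg\min_a \Big[\sum_{i=1}^n (y_i - x_i^T a)^2 + \frac{\sigma^2}{\lambda^2}\|a\|_2^2\Big],$$
i.e., ridge-regularized least squares: the Gaussian prior acts as an $\ell_2$ penalty on a.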

1.2 Parametric vs. Nonparametric Models
In a parametric model, the number of parameters is fixed once and for all, independent of the amount of training data. In a nonparametric model, the number of parameters can change with the amount of training data.
Example 1.3 In the nearest neighbor method, the number of parameters is the number of training samples, so it is a nonparametric model. In logistic regression, the number of parameters is the dimension of the training samples, so it is a parametric model.
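A minimal sketch of this contrast (the toy 1-nearest-neighbor and logistic-regression predictors below are illustrative, assuming NumPy):

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    """1-nearest neighbor: the entire training set is stored, so the 'parameters' grow with n."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training point
    return y_train[np.argmin(d)]

def logistic_predict(w, b, x):
    """Logistic regression: only p weights plus one bias, no matter how large n is."""
    prob = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # sigmoid of a linear score
    return int(prob > 0.5)
```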
2 Random Variables and Basic Properties
2.1 Random Variables
Definition 2.1 Let $X = (X_1, \ldots, X_p)^T$ be a random vector. The cumulative distribution function (C.D.F.) associated with X is $F_X$, defined by
$$F_X(x) = \Pr(X \le x) = \Pr(X_1 \le x_1, \ldots, X_p \le x_p).$$
There are two kinds of random variables:
• Absolutely continuous
• Discrete
2.1.1 Absolutely Continuous Random Variables
Definition 2.2 X is an absolutely continuous random variable if there exists a probability density function (P.D.F.) $f_X(x)$ such that
$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du, \qquad \text{where } F_X(\infty) = \int_{-\infty}^{+\infty} f_X(u)\,du = 1.$$
2.1.2 Discrete Random Variables
Definition 2.3 X is a discrete random variable if it takes values in a countable (or finite) set of points $\{x_j, j = 1, \ldots\}$ and there exists a probability mass function (P.M.F.) with
$$f_X(x_j) = P(X = x_j), \qquad f_X(x) = 0 \text{ otherwise}.$$
(In the absolutely continuous case, $P(X \in D) = \int_D f(u)\,du$ for $D \subseteq \mathbb{R}^p$.)
2.1.3 Support Set
Definition 2.4 The support set is $S = \{x \in \mathbb{R}^p : f_X(x) > 0\}$.

2.1.4 Marginal and Conditional Distribution
For $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ where $x_1 \in \mathbb{R}^k$ and $x_2 \in \mathbb{R}^{p-k}$:
Definition 2.5 Marginal distribution: The marginal C.D.F. of $X_1$ is $\Pr(X_1 \le x_1) = F_X(x_1, \ldots, x_k, \infty, \ldots, \infty)$. If X has density f(x), then
$$f_1(x_1) = \int_{-\infty}^{+\infty} f(x_1, x_2)\,dx_2$$
is the marginal density of $X_1$.
Definition 2.6 Conditional distribution: By Bayes' theorem,
$$f_2(x_2 \mid X_1 = x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}$$
is the conditional density of $X_2$ given $X_1 = x_1$.
Example 2.1 For $f(x_1, x_2) = 1$ on $0 < x_1, x_2 < 1$, the marginal densities are $f_1(x_1) = 1$ and $f_2(x_2) = 1$.
Example 2.2 For $f(x_1, x_2) = 1 + \alpha(2x_1 - 1)(2x_2 - 1)$ on $0 < x_1, x_2 < 1$ with $-1 \le \alpha \le 1$, the marginal densities are again $f_1(x_1) = 1$ and $f_2(x_2) = 1$.
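A short verification of Example 2.2, working directly from the definition of the marginal density:
$$f_1(x_1) = \int_0^1 \big[1 + \alpha(2x_1 - 1)(2x_2 - 1)\big]\,dx_2 = 1 + \alpha(2x_1 - 1)\big[x_2^2 - x_2\big]_0^1 = 1,$$
since $\int_0^1 (2x_2 - 1)\,dx_2 = 0$; the same computation gives $f_2(x_2) = 1$.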
2.1.5 Independence
If $X_1$ and $X_2$ are independent, then the conditional P.D.F. satisfies $f_2(x_2 \mid x_1) = f_2(x_2)$.
Theorem 2.1 If $X_1$ and $X_2$ are statistically independent, then $f(x) = f(x_1, x_2) = f_1(x_1) f_2(x_2)$.
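Applying this to the earlier examples (a brief observation added for illustration): in Example 2.1, $f(x_1, x_2) = 1 = f_1(x_1) f_2(x_2)$, so $X_1$ and $X_2$ are independent. In Example 2.2, $f(x_1, x_2) = 1 + \alpha(2x_1 - 1)(2x_2 - 1) \ne 1 = f_1(x_1) f_2(x_2)$ unless $\alpha = 0$, so for $\alpha \ne 0$ the variables are dependent even though both marginals are the uniform density.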
2.2 Population Moments
2.2.1 Expectation
Definition 2.7 Let X be a random variable with C.D.F. $F_X(x)$. The expectation (or mean) of a scalar-valued function g(X) is
$$E(g(X)) = \int_{-\infty}^{+\infty} g(x)\,dF(x) = \begin{cases} \sum_x g(x) f(x) & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{+\infty} g(x) f(x)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$
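For instance, for the uniform density of Example 2.1 (a small worked case added for illustration),
$$E(X_1) = \int_0^1 x_1 \cdot 1\,dx_1 = \frac{1}{2}, \qquad E(X_1^2) = \int_0^1 x_1^2\,dx_1 = \frac{1}{3}.$$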
Proposition 2.1 Basic Properties of Expectation
• Linearity:
$$E(\alpha_1 g_1(x) + \alpha_2 g_2(x)) = \alpha_1 E(g_1(x)) + \alpha_2 E(g_2(x))$$
• Partition: