
Machine Learning Lecture Notes and Supplemental Materials

Introduction to Machine Learning

Professor Zhihua Zhang

Machine Learning Basic Theory

Lecture Notes 1: Basic Theory

Professor: Zhihua Zhang    Scribes: Cheng Chen, Luo Luo, Cong Xie

1 Introduction

In machine learning, data is typically expressed in matrix form. Suppose we have n samples and p variables (or features). Then we have

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}_{n \times p}$$

The $i$th sample can be denoted as $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})^T$.
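As a minimal illustration of this layout (the array values here are assumed placeholders, not from the notes), the data matrix and one sample row in NumPy:

```python
import numpy as np

# n samples by p features, matching the matrix X above
rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.normal(size=(n, p))

i = 2
X_i = X[i]   # the sample in row i, a vector in R^p
```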

Machine learning mainly aims to solve the following problems:

(1) Dimension Reduction: Dimension reduction is the process of reducing the number of random variables (or features) under consideration. Formally, let $X_i \in \mathbb{R}^p$; we want to find $Z_i \in \mathbb{R}^q$ ($q < p$) to represent $X_i$.

If we use a linear transformation, then we need to find a matrix $A$ such that $Z_i = AX_i$. Note that $A$ should have full row rank.

If we use a nonlinear transformation, then we need to find a nonlinear function $f$ such that $Z_i = f(X_i)$.
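Principal component analysis gives one such linear map: the rows of $A$ are the top $q$ principal directions of the centered data, so $A$ has full row rank. A minimal NumPy sketch on simulated data (the data and dimensions are assumptions; the notes do not prescribe PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated features

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
A = Vt[:q]                               # q x p: top-q right singular vectors, full row rank
Z = Xc @ A.T                             # each Z_i = A X_i (after centering), Z_i in R^q
```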

(2) Clustering: Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). We can view the n samples as n points, and our objective is to cluster them into k clusters.
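To make the task concrete, here are a few Lloyd iterations of k-means on two simulated blobs (the data, k, and the farthest-point initialization are assumptions; the notes do not prescribe an algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # blob around (0, 0)
               rng.normal(5.0, 0.3, (50, 2))])  # blob around (5, 5)

k = 2
# farthest-point initialization: first point, then the point farthest from it
centers = np.array([X[0], X[np.argmax(np.linalg.norm(X - X[0], axis=1))]])
for _ in range(10):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                        # assign each point to its nearest center
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```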

(3) Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Formally, in the training set we have a label $Y_i$ for each $X_i$, where $Y_i \in C$ and $C$ is a non-empty finite set. If $Y_i \in \{-1, 1\}$ or $\{0, 1\}$, it is a binary classification problem. If $Y_i \in \{1, 2, \ldots, k\}$, it is a multi-class classification problem. There are also problems in which one observation belongs to more than one category; these are called multi-label or multi-output classification.
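As a toy instance of binary classification with $Y_i \in \{-1, 1\}$, here is a nearest-centroid rule (the data and the rule are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(2.0, 0.5, (30, 2))    # class +1 around (2, 2)
X_neg = rng.normal(-2.0, 0.5, (30, 2))   # class -1 around (-2, -2)
X = np.vstack([X_pos, X_neg])
Y = np.array([1] * 30 + [-1] * 30)

mu_pos = X[Y == 1].mean(axis=0)
mu_neg = X[Y == -1].mean(axis=0)

def classify(x):
    # predict the label of the nearer class centroid
    return 1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else -1
```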

(4) Regression: Regression is the analogous problem in which the label $Y_i \in \mathbb{R}$.

(5) Ranking: also called isotonic regression (IR). Isotonic regression involves finding a weighted least-squares fit $x \in \mathbb{R}^n$ to a vector $a \in \mathbb{R}^n$ with a weight vector $w \in \mathbb{R}^n$, subject to a set of non-contradictory constraints of the kind $x_i \geq x_j$.
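For the common special case of a total order $x_1 \leq x_2 \leq \cdots \leq x_n$ with unit weights, isotonic regression can be solved by the pool-adjacent-violators algorithm; a minimal sketch (the notes do not specify an algorithm):

```python
def pava(a):
    """Pool-adjacent-violators: least-squares fit x to a with x1 <= ... <= xn."""
    blocks = []  # each block is [mean, count]; adjacent violators are pooled
    for v in a:
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    out = []
    for m, c in blocks:
        out.extend([m] * c)
    return out

fit = pava([1.0, 3.0, 2.0, 4.0])   # -> [1.0, 2.5, 2.5, 4.0]
```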


Note that (1) and (2) are unsupervised learning, while (3), (4), and (5) are supervised learning. Unsupervised learning tries to find hidden structure in unlabelled data. Supervised learning is the machine learning task of inferring a function from labelled training data.

For supervised learning, the data is usually split into two or three parts.

(1) Training data: A set of examples used for learning, that is, to fit the parameters (e.g., weights for neural networks) of the model.

(2) Validation data: Sometimes, we also need a validation set to tune the model, for

example to choose the number of hidden units in a neural network or for pruning a

decision tree. It is usually used to prevent overﬁtting and enhance the generalization

ability.

(3) Test data: This data set is used only for testing the ﬁnal solution in order to conﬁrm

the actual performance.
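A common concrete recipe is a shuffled 60/20/20 split of the sample indices (the proportions are an assumption, not fixed by the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
idx = rng.permutation(n)               # shuffle so each part is representative
train, val, test = idx[:60], idx[60:80], idx[80:]
```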

1.1 Frequentist’s view vs. Bayesian view

1.1.1 Frequentist’s view

The frequentist approach views the model parameters as unknown constants and estimates them by matching the model to the training data using an appropriate metric.

Example 1.1 Suppose we have n pairs of samples $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and we want to fit a linear function $x_i^T a$ (more strictly, it should be $x_i^T a + b$, or we can include a constant variable 1 in $x_i$) to predict $y_i$.

Using least squares, we have the loss function $L = \sum_{i=1}^n (y_i - x_i^T a)^2$, where $a$ is an unknown fixed parameter. We can solve for $a$ by minimizing the loss function.

Using maximum likelihood estimation, let $y_i \sim N(x_i^T a, \sigma^2)$, namely,

$$p(y_i \mid x_i) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\!\left(-\frac{(y_i - x_i^T a)^2}{2\sigma^2}\right).$$

So the log likelihood is (assuming the samples are independent)

$$l = \log \prod_{i=1}^n p(y_i \mid x_i).$$

We can solve for $a$ by maximizing the joint likelihood.

Under the above conditions, you can prove that maximum likelihood estimation is the

same as least squares.
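The equivalence can be checked numerically: the least-squares solution from the normal equations makes the gradient of the negative Gaussian log-likelihood vanish. A sketch with simulated data (all values assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
a_true = np.array([1.0, -2.0, 0.5])
y = X @ a_true + 0.1 * rng.normal(size=n)

# least-squares estimate via the normal equations: (X^T X) a = X^T y
a_ls = np.linalg.solve(X.T @ X, X.T @ y)

# gradient of the negative Gaussian log-likelihood w.r.t. a (sigma fixed):
# -X^T (y - X a) / sigma^2, which is zero exactly at the LS solution
sigma2 = 0.01
grad_nll = -X.T @ (y - X @ a_ls) / sigma2
```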

1.1.2 Bayesian view

The Bayesian approach views the model parameters as random variables and estimates them by using Bayes' theorem.

Example 1.2 Let us continue Example 1.1, letting $y_i \sim N(x_i^T a, \sigma^2)$ again. Here $a$ and $\sigma$ are random variables, not constants. Let $a \sim N(0, \lambda^2)$ and $\sigma^2 \sim \Gamma(\alpha, \beta)$. Our interest is the posterior probability $P(a \mid x_i, y_i) \propto P(x_i, y_i \mid a) P(a)$. We can use maximum a posteriori (MAP) estimation or Bayesian estimation to solve for $a$.
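As a simplification of Example 1.2, hold $\sigma^2$ fixed and take the prior $a \sim N(0, \lambda^2 I)$; then the MAP estimate coincides with ridge regression. A numerical sketch under assumed values (not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.5 * rng.normal(size=n)

sigma2, lam2 = 0.25, 1.0   # noise variance and prior variance, assumed known

# MAP maximizes log p(a | X, y) = -||y - Xa||^2/(2 sigma2) - ||a||^2/(2 lam2) + const;
# setting its gradient to zero gives the ridge solution:
a_map = np.linalg.solve(X.T @ X + (sigma2 / lam2) * np.eye(p), X.T @ y)

# stationarity check: the log-posterior gradient vanishes at a_map
grad = X.T @ (y - X @ a_map) / sigma2 - a_map / lam2
```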


1.2 Parametrics vs. Nonparametrics

In a parametric model, the number of parameters is fixed once and for all, independent of the amount of training data. In a nonparametric model, the number of parameters can change with the amount of training data.

Example 1.3 In the nearest-neighbor method, the number of parameters is the number of training samples, so this model is nonparametric. In logistic regression, the number of parameters is the dimension of the training samples, so this model is parametric.
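The point of Example 1.3 can be seen in code: a 1-nearest-neighbor predictor's "parameters" are the stored training set itself, so the model grows with n. A toy sketch (the `nn_predict` helper and the data are illustrative assumptions):

```python
import numpy as np

def nn_predict(X_train, y_train, x):
    # predict the label of the single nearest training point
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
pred = nn_predict(X_train, y_train, np.array([0.2, 0.1]))  # nearest to (0, 0)
```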

2 Random Variables and Basic Properties

2.1 Random Variables

Definition 2.1 Let $X = (X_1, \cdots, X_p)^T$ be a random vector. The cumulative distribution function (C.D.F.) associated with $X$ is $F_X(x) = Pr(X \leq x) = Pr(X_1 \leq x_1, \cdots, X_p \leq x_p)$.

There are two kinds of random variables:

• Absolutely continuous

• Discrete

2.1.1 Absolutely Continuous Random Variables

Definition 2.2 $X$ is an absolutely continuous random variable if there exists a probability density function (P.D.F.) $f_X(x)$ such that $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$, where $F_X(\infty) = \int_{-\infty}^{+\infty} f_X(u)\,du = 1$.

2.1.2 Discrete Random Variables

Definition 2.3 $X$ is a discrete random variable if there exists a probability mass function (P.M.F.) such that $X$ takes values in a countable (or finite) set of points $\{x_j, j = 1, \cdots\}$, on which the P.M.F. is

$$f_X(x_j) = P(X = x_j), \qquad f_X(x) = 0 \text{ otherwise},$$

where $P(X \in D) = \int_D f(u)\,du$ and $D \subseteq \mathbb{R}^p$.

2.1.3 Support Set

Definition 2.4 Support set: $S = \{x \in \mathbb{R}^p : f_X(x) > 0\}$.


2.1.4 Marginal and Conditional Distribution

For $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ where $x_1 \in \mathbb{R}^k$, $x_2 \in \mathbb{R}^{p-k}$,

Definition 2.5 Marginal distribution: $Pr(X_1 \leq x_1) = F_X(x_1, \cdots, x_k, \infty, \cdots, \infty)$, where the last $p-k$ arguments are $\infty$. If $X$ has density $f(x)$, then $f_1(x_1) = \int_{-\infty}^{+\infty} f(x_1, x_2)\,dx_2$ is the marginal density function.

Definition 2.6 Conditional distribution: By Bayes' theorem, $f_2(x_2 \mid X_1 = x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}$ is the conditional density function.

Example 2.1 For $f(x_1, x_2) = 1$ with $0 < x_1, x_2 < 1$, the marginal densities are $f_1(x_1) = 1$ and $f_2(x_2) = 1$.

Example 2.2 For $f(x_1, x_2) = 1 + \alpha(2x_1 - 1)(2x_2 - 1)$ with $0 < x_1, x_2 < 1$ and $-1 \leq \alpha \leq 1$, the marginal densities are again $f_1(x_1) = 1$ and $f_2(x_2) = 1$.
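The claim in Example 2.2 can be checked numerically: integrating out $x_2$ kills the $\alpha$ term because $\int_0^1 (2x_2 - 1)\,dx_2 = 0$. A midpoint-rule sketch (the particular $\alpha$ and grid are assumptions):

```python
import numpy as np

alpha = 0.7
m = 100000
x2 = (np.arange(m) + 0.5) / m        # midpoints of a fine grid on (0, 1)

marginals = []
for x1 in (0.1, 0.5, 0.9):
    f = 1 + alpha * (2 * x1 - 1) * (2 * x2 - 1)
    marginals.append(f.mean())       # midpoint-rule integral over x2 in (0, 1)
```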

2.1.5 Independence

If $x_1$ and $x_2$ are independent, then the conditional P.D.F. satisfies $f_2(x_2 \mid x_1) = f_2(x_2)$.

Theorem 2.1 If $X_1$ and $X_2$ are statistically independent, then $f(x) = f(x_1, x_2) = f_1(x_1) f_2(x_2)$.

2.2 Population Moments

2.2.1 Expectation

Definition 2.7 Let $X$ be a random variable with C.D.F. $F_X(x)$. Then the expectation or mean of a scalar-valued function $g(X)$ is

$$E(g(X)) = \int_{-\infty}^{+\infty} g(x)\,dF(x) = \begin{cases} \sum_x g(x) f(x) & \text{if } X \text{ is discrete}, \\ \int_{-\infty}^{+\infty} g(x) f(x)\,dx & \text{if } X \text{ is continuous}. \end{cases}$$

Proposition 2.1 Basic Properties of Expectation

• Linearity: $E(\alpha_1 g_1(x) + \alpha_2 g_2(x)) = \alpha_1 E(g_1(x)) + \alpha_2 E(g_2(x))$

• Partition:
