UCM Introduction to Machine Learning: Data-Driven Prediction and Techniques in Detail


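CSE176 Introduction to Machine Learning is a set of lecture notes by Prof. Miguel Carreira-Perpinan for his course at the University of California, Merced. The course introduces the basic concepts of machine learning and its main techniques: supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), reinforcement learning, and computational learning theory. The main reference is Ethem Alpaydin's "Introduction to Machine Learning" (MIT Press, 3rd edition, 2014), with some supplementary material.

The course starts from the definition of machine learning and stresses that, in the era of big data, data keeps growing and contains structure that can be used to predict outcomes or to gain knowledge. For example, analyzing purchase patterns at Amazon makes product recommendation possible. Compared with traditional hand-designed algorithms, tasks of this kind call for algorithms that learn from data.

The supervised learning part covers Bayesian methods, a probabilistic approach to prediction commonly used for classification and regression. Mixture models combine several component distributions and can represent complex data distributions. Decision trees are intuitive models that split the data according to feature values and apply to both classification and regression. Instance-based learning predicts new data from stored training instances. Neural networks, and deep learning in particular, are loosely inspired by biological neurons and handle many complex nonlinear problems. Kernel methods implicitly map the data to a high-dimensional feature space, where a linear method can solve a problem that is nonlinear in the original space; the kernel supplies the inner products in that space without constructing the mapping explicitly. Ensemble learning, such as random forests and gradient boosting machines, combines multiple weak learners into a stronger predictor, improving accuracy and stability.

The unsupervised learning part covers clustering, such as the K-means algorithm, which groups data to reveal its intrinsic structure, and dimensionality reduction techniques such as principal component analysis (PCA) and independent component analysis (ICA), used for data compression and visualization. Reinforcement learning is concerned with how an agent learns an optimal behavior policy by interacting with its environment, and is widely applied in games and robotics.

The course aims to give students the basic principles, tools and techniques of machine learning as a foundation for applying it in research, engineering or data analysis. The notes may be used for non-commercial educational purposes, in the spirit of open educational resources.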

K > 2 classes

• Again, the most basic measure is the classification error.

• Confusion matrix: K × K matrix where entry (i, j) contains the number of instances of class $C_i$ that are classified as $C_j$.

• It allows us to identify which types of misclassification errors tend to occur, e.g. if there are two classes that are frequently confused.

Ideal classifier: the confusion matrix is diagonal.

Ex: MNIST handwritten digits.

[Figure: confusion matrix over the digits 0 to 9; axes: true class and predicted class.]
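The confusion matrix and the classification error described above can be sketched in a few lines of Python. This is a minimal illustration: the function names and the toy labels for K = 3 classes are assumptions made for the example, not part of the notes.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Return a K x K list of lists; entry [i][j] counts the instances
    of class i that were classified as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def classification_error(matrix):
    """Fraction of instances that lie off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Hypothetical labels for a 3-class problem (made up for illustration).
true_labels = [0, 0, 1, 1, 2, 2, 2, 1]
predicted_labels = [0, 0, 1, 2, 2, 2, 1, 1]

cm = confusion_matrix(true_labels, predicted_labels, num_classes=3)
for row in cm:
    print(row)
print("classification error:", classification_error(cm))
```

In this toy data the only off-diagonal counts are between classes 1 and 2, which is the kind of pairwise confusion the matrix is meant to expose.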

4 (Univariate) parametric methods

• How to learn probability distributions from data (in order to use them to make decisions).

Joint distribution: $p(X = x, Y = y)$.

Conditioning (product rule): $p(Y = y \mid X = x) = \frac{p(X = x, Y = y)}{p(X = x)}$.

Marginalizing (sum rule): $p(X = x) = \sum_{y} p(X = x, Y = y)$.

Bayes’ theorem (inverse probability): $p(X = x \mid Y = y) = \frac{p(Y = y \mid X = x)\, p(X = x)}{p(Y = y)}$.

• We assume such distributions follow a particular parametric form (e.g. Gaussian), so we need to estimate its parameters ($\mu$, $\sigma$).

• Several ways to learn them:

– by optimizing an objective function (e.g. maximum likelihood)

– by Bayesian estimation.

• This chapter: univariate case; next chapter: multivariate case.
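As a small numerical check of the sum rule, the product rule and Bayes' theorem, the sketch below works on a discrete joint distribution over two binary variables. The probability table is hypothetical (chosen only so that the entries sum to 1), and the function names are assumptions for the example.

```python
# Hypothetical joint distribution p(X = x, Y = y); keys are (x, y).
joint = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.4,
}


def marginal_x(x):
    """Sum rule: p(X = x) = sum over y of p(X = x, Y = y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Sum rule applied to the other variable."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_y_given_x(y, x):
    """Product rule: p(Y = y | X = x) = p(X = x, Y = y) / p(X = x)."""
    return joint[(x, y)] / marginal_x(x)


def cond_x_given_y(x, y):
    """Direct conditioning on Y, used here as the reference value."""
    return joint[(x, y)] / marginal_y(y)


def bayes_x_given_y(x, y):
    """Bayes' theorem: p(X = x | Y = y) = p(Y = y | X = x) p(X = x) / p(Y = y)."""
    return cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)


print("p(X=1 | Y=1) by conditioning:", cond_x_given_y(1, 1))
print("p(X=1 | Y=1) by Bayes' theorem:", bayes_x_given_y(1, 1))
```

Both routes give the same number up to floating-point rounding, since Bayes' theorem is just the product rule applied twice.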

4.2 Maximum likelihood estimation: parametric density estimation

• Problem: estimating a density $p(x)$. Assume an iid sample $X = \{x_n\}_{n=1}^{N}$ drawn from a known probability density family $p(x; \Theta)$ with parameters $\Theta$. We want to estimate $\Theta$ from $X$.

• Log-likelihood of $\Theta$ given $X$: $L(\Theta; X) = \log p(X; \Theta) = \log \prod_{n=1}^{N} p(x_n; \Theta) = \sum_{n=1}^{N} \log p(x_n; \Theta)$.

• Maximum likelihood estimate (MLE): $\Theta_{\text{MLE}} = \arg\max_{\Theta} L(\Theta; X)$.

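• Examples:

– Bernoulli: $\Theta = \{\theta\}$, $p(x; \theta) = \theta^{x} (1 - \theta)^{1 - x} = \begin{cases} \theta, & x = 1 \\ 1 - \theta, & x = 0 \end{cases}$, $x \in \{0, 1\}$, $\theta \in [0, 1]$.

MLE: $\hat{\theta} = \frac{1}{N} \sum_{n=1}^{N} x_n$ (sample average).

– Gaussian: $\Theta = \{\mu, \sigma^2\}$, $p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^{2}\right)$, $x \in \mathbb{R}$, $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}^{+}$.

MLE: $\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n$ (sample average), $\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2$ (sample variance).

For more complicated distributions, we usually need an algorithm to find the MLE.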
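The two closed-form estimates above (sample average for the Bernoulli parameter, sample average and 1/N sample variance for the Gaussian) can be sketched as follows. The data values and function names are made up for illustration; the log-likelihood function is included only to check numerically that nearby parameter values score lower.

```python
import math


def bernoulli_mle(xs):
    """MLE of theta for a Bernoulli sample: the sample average."""
    return sum(xs) / len(xs)


def gaussian_mle(xs):
    """MLE of (mu, sigma^2) for a Gaussian sample.
    Note the variance divides by N, not N - 1."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


def gaussian_log_likelihood(xs, mu, var):
    """L(mu, sigma^2; X) = sum over n of log p(x_n; mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        for x in xs
    )


# Hypothetical samples.
coin_flips = [1, 0, 1, 1, 0, 1, 1, 0]
measurements = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

theta_hat = bernoulli_mle(coin_flips)
mu_hat, var_hat = gaussian_mle(measurements)

print("Bernoulli MLE:", theta_hat)
print("Gaussian MLE: mu =", mu_hat, "sigma^2 =", var_hat)
print("log-likelihood at the MLE:",
      gaussian_log_likelihood(measurements, mu_hat, var_hat))
print("log-likelihood at a shifted mean:",
      gaussian_log_likelihood(measurements, mu_hat + 0.5, var_hat))
```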

4.4 The Bayes’ estimator: parametric density estimation

• Consider the parameters $\Theta$ as random variables themselves (not as unknown numbers), and assume a prior distribution $p(\Theta)$ over them (based on domain information): how likely it is for the parameters to take a value before having observed any data.

• Posterior distribution $p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)} = \frac{p(X \mid \Theta)\, p(\Theta)}{\int p(X \mid \Theta')\, p(\Theta')\, d\Theta'}$: how likely it is for the parameters to take a value after having observed a sample $X$.

• Resulting estimate for the probability at a new point $x$: $p(x \mid X) = \int p(x \mid \Theta)\, p(\Theta \mid X)\, d\Theta$. Hence, rather than using the prediction of a single $\Theta$ value (“frequentist statistics”), we average the prediction of every parameter value $\Theta$ using its posterior distribution (“Bayesian statistics”).

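• Approximations: reduce $p(\Theta \mid X)$ to a single point $\Theta$.

– Maximum a posteriori (MAP) estimate: $\Theta_{\text{MAP}} = \arg\max_{\Theta} p(\Theta \mid X)$.

Particular case: if $p(\Theta)$ = constant, then $p(\Theta \mid X) \propto p(X \mid \Theta)$ and MAP estimate = MLE.

– Bayes’ estimator: $\Theta_{\text{Bayes}} = E\{\Theta \mid X\} = \int \Theta\, p(\Theta \mid X)\, d\Theta$.

Works well if $p(\Theta \mid X)$ is peaked around a single value.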
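To make the MAP and Bayes' estimators concrete, here is a minimal sketch for a Bernoulli parameter. It assumes a Beta(a, b) prior, which is a choice made for this example and is not stated in the notes; with that prior the posterior is Beta(a + k, b + N − k), where k is the number of ones, so the posterior mode and mean have closed forms. The function name and data are hypothetical.

```python
def beta_bernoulli_estimates(xs, a, b):
    """Return (MLE, MAP, Bayes) estimates of the Bernoulli parameter.

    Assumption for this sketch: the prior is Beta(a, b), so the posterior
    is Beta(a + k, b + N - k) with k the number of ones in the sample.
    """
    n = len(xs)
    k = sum(xs)
    post_a = a + k
    post_b = b + n - k
    if post_a <= 1 or post_b <= 1:
        raise ValueError("posterior mode is not in the interior of [0, 1]")
    mle = k / n
    map_estimate = (post_a - 1) / (post_a + post_b - 2)  # posterior mode
    bayes_estimate = post_a / (post_a + post_b)          # posterior mean
    return mle, map_estimate, bayes_estimate


# Hypothetical sample: three ones out of four observations.
sample = [1, 1, 1, 0]

# Constant prior (Beta(1, 1) is uniform on [0, 1]): MAP coincides with the MLE.
mle, map_uniform, bayes_uniform = beta_bernoulli_estimates(sample, a=1, b=1)

# Prior concentrated around 0.5: both estimates are pulled toward 0.5.
_, map_prior, bayes_prior = beta_bernoulli_estimates(sample, a=5, b=5)

print("MLE:", mle)
print("uniform prior: MAP =", map_uniform, "Bayes =", bayes_uniform)
print("Beta(5, 5) prior: MAP =", map_prior, "Bayes =", bayes_prior)
```

With only four observations the posterior is still broad, so the mode and the mean differ noticeably; they move closer together as the posterior becomes more peaked.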

4.6 Maximum likelihood estimation: parametric regression

• Assume there exists an unknown function $f$ that maps inputs $x$ to outputs $y = f(x)$, but that what we observe as output is a noisy version $y = f(x) + \epsilon$, where $\epsilon$ is a random error. We want to estimate $f$ by a parametric function $h(x; \Theta)$. In ch. 2 we saw the least-squares error was a good loss function to use for that purpose. We will now show that maximum likelihood estimation under Gaussian noise is equivalent to that.

• Log-likelihood of $\Theta$ given a sample $\{(x_n, y_n)\}_{n=1}^{N}$ drawn iid from $p(x, y)$: $L(\Theta; X) = \log \prod_{n=1}^{N} p(x_n, y_n) = \sum_{n=1}^{N} \log p(y_n \mid x_n; \Theta) + \text{constant}$.

• Assume an error $\epsilon \sim N(0, \sigma^2)$, so $p(y \mid x) \sim N(h(x; \Theta), \sigma^2)$. Then maximizing the log-likelihood is equivalent to minimizing $E(\Theta; X) = \sum_{n=1}^{N} (y_n - h(x_n; \Theta))^2$, i.e., the least-squares error. (p. 78)

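• Examples:

– Linear regression: $h(x; w_0, w_1) = w_1 x + w_0$. LSQ estimate: $w = A^{-1} y$, with $A = \begin{pmatrix} N & \sum_{n=1}^{N} x_n \\ \sum_{n=1}^{N} x_n & \sum_{n=1}^{N} x_n^2 \end{pmatrix}$, $w = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}$, $y = \begin{pmatrix} \sum_{n=1}^{N} y_n \\ \sum_{n=1}^{N} y_n x_n \end{pmatrix}$.

– Polynomial regression: $h(x; w_0, \dots, w_k) = w_k x^k + \cdots + w_1 x + w_0$. The model is still linear on the parameters. LSQ estimate also of the form $w = A^{-1} y$. (p. 79)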

[Figure: regression example; x: mileage, y: price.]
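The least-squares line fit can be sketched by solving the 2 × 2 system $A w = y$ directly, where $A$ holds $N$, $\sum x_n$ and $\sum x_n^2$, and $y$ holds $\sum y_n$ and $\sum y_n x_n$. The function names and the data below are hypothetical and only illustrate the formula; under the Gaussian-noise assumption the same weights are also the maximum likelihood estimate.

```python
def fit_line(xs, ys):
    """Least-squares fit of h(x) = w1 * x + w0 by solving A w = y,
    with A = [[N, sum x], [sum x, sum x^2]] and y = [sum y, sum x*y]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("A is singular: all x values are equal")
    # Explicit inverse of the 2 x 2 matrix A applied to y.
    w0 = (sxx * sy - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1


def squared_error(xs, ys, w0, w1):
    """E(w; X) = sum over n of (y_n - h(x_n; w))^2."""
    return sum((y - (w1 * x + w0)) ** 2 for x, y in zip(xs, ys))


# Hypothetical noiseless data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys_exact = [1.0, 3.0, 5.0, 7.0]
w0_exact, w1_exact = fit_line(xs, ys_exact)

# Hypothetical noisy observations of the same line.
ys_noisy = [1.1, 2.9, 5.2, 6.8]
w0, w1 = fit_line(xs, ys_noisy)

print("exact data: w0 =", w0_exact, "w1 =", w1_exact)
print("noisy data: w0 =", w0, "w1 =", w1)
print("error at the fit:", squared_error(xs, ys_noisy, w0, w1))
print("error at a perturbed fit:", squared_error(xs, ys_noisy, w0 + 0.1, w1))
```

For polynomial regression the same idea applies with a larger matrix, in which case a general linear solver is a better choice than an explicit inverse.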

4.3 Evaluating an estimator: bias and variance

• Statistic $d(X)$: any value that is calculated from a sample $X$ (e.g. average, maximum...). It is a r.v. with an expectation (over samples) $E_X\{d(X)\}$ and a variance $E_X\{(d(X) - E_X\{d(X)\})^2\}$.

• $X = (x_1, \dots, x_N)$ iid sample from $p(x; \theta)$. Let $d(X)$ be an estimator for $\theta$. How good is it? Mean square error of the estimator $d$: $\text{error}(d, \theta) = E_X\{(d(X) - \theta)^2\}$.

• Bias of the estimator: $b_\theta(d) = E_X\{d(X)\} - \theta$. How much the expected value of the estimator over samples differs from the true parameter value.

If $b_\theta(d) = 0$ for all $\theta$ values: unbiased estimator.

Ex: the sample average $\frac{1}{N} \sum_{n=1}^{N} x_n$ is an unbiased estimator of the true mean $\mu$.

• Variance of the estimator: $\text{var}\{d\} = E_X\{(d(X) - E_X\{d(X)\})^2\}$. How much the estimator varies around its expected value from one sample to another.

[Figure: estimates $d_i$, their expectation $E[d]$ and the true value $\theta$, with the variance and the bias marked.]

If, in addition to a vanishing bias, $\text{var}\{d\} \to 0$ as $N \to \infty$: consistent estimator.

Ex: the sample average is a consistent estimator of the true mean.
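Bias and variance are expectations over samples, so they can be approximated by drawing many samples and applying the estimator to each. The sketch below does this for the sample average and for the 1/N variance estimate, assuming standard normal data (true mean 0, true variance 1). The function names, sample sizes and seed are arbitrary choices for the example. It also relies on two standard facts not stated in the text above: the mean square error equals the variance plus the squared bias, and the 1/N variance estimate has expectation $(N-1)\sigma^2/N$.

```python
import random


def sample_mean(xs):
    return sum(xs) / len(xs)


def mle_variance(xs):
    """Variance estimate that divides by N (biased for finite N)."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def bias_variance_mse(estimator, true_value, sample_size, num_samples, rng):
    """Monte Carlo approximation of the bias, variance and mean square
    error of an estimator over samples of standard normal data."""
    estimates = []
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        estimates.append(estimator(sample))
    expected = sum(estimates) / num_samples
    bias = expected - true_value
    variance = sum((d - expected) ** 2 for d in estimates) / num_samples
    mse = sum((d - true_value) ** 2 for d in estimates) / num_samples
    return bias, variance, mse


rng = random.Random(0)  # fixed seed so the run is repeatable

mean_small = bias_variance_mse(sample_mean, 0.0, 5, 5000, rng)
mean_large = bias_variance_mse(sample_mean, 0.0, 50, 5000, rng)
var_small = bias_variance_mse(mle_variance, 1.0, 5, 5000, rng)

for name, (bias, variance, mse) in [
    ("sample mean, N = 5", mean_small),
    ("sample mean, N = 50", mean_large),
    ("1/N variance, N = 5", var_small),
]:
    print(f"{name}: bias = {bias:.3f}, variance = {variance:.3f}, mse = {mse:.3f}")
```

The sample mean shows a bias close to zero and a variance that shrinks as N grows, while the 1/N variance estimate shows a bias near −1/N = −0.2 for N = 5.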
