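Stanford CS229 Machine Learning Lecture Notes Explained: Supervised Learning and Pattern Recognition

PDF, 2.03 MB; updated 2024-07-19.

"The original lecture notes for Andrew Ng's Stanford open course CS229 Machine Learning, covering supervised learning, generative learning algorithms, support vector machines, learning theory, regularization and model selection, the perceptron and large-margin classifiers, the K-means clustering algorithm, Gaussian mixture models and the EM algorithm, the EM algorithm in detail, factor analysis, principal component analysis, independent component analysis, and reinforcement learning and control."

CS229 is a highly influential machine learning course taught at Stanford University by Andrew Ng. Its lecture notes introduce the key concepts and techniques of the field and are useful for both understanding and practicing machine learning.

Supervised learning trains a model on labeled data so that the model can make predictions on new data. The housing price problem used in the notes is a typical supervised learning task: given a set of examples, each with a house's living area (the feature) and its price (the label), we learn a model that predicts the price of houses it has not seen. Commonly used supervised learning algorithms include linear regression, logistic regression, decision trees, random forests, and neural networks.

The support vector machine (SVM) is another important supervised learning method. Its core idea is to find the separating hyperplane that maximizes the margin between classes. SVMs often perform well on small-sample, high-dimensional data, and maximizing the margin helps limit overfitting.

Learning theory studies how to evaluate and improve the performance of learning algorithms. It includes error analysis, risk and empirical risk, the VC dimension, and learning curves, which help relate a model's generalization ability to the size of the training set.

Regularization and model selection are the main strategies against overfitting. Regularization adds a penalty term, such as an L1 or L2 penalty, that limits model complexity. Model selection uses methods such as cross-validation and grid search to find the best combination of model parameters. The perceptron and large-margin classifiers are basic tools for binary classification; large-margin classifiers look for the decision boundary that maximizes the margin between the classes.

K-means clustering is an unsupervised learning method that partitions a data set into K non-overlapping subsets, each representing one cluster. Gaussian mixture models (GMM) and the expectation-maximization (EM) algorithm are commonly used for probabilistic modeling and for clustering unlabeled data; EM is an iterative method for fitting such models. Factor analysis and principal component analysis (PCA) are both dimensionality reduction techniques: the former tries to explain the latent relationships among variables, while the latter finds the principal directions of the data through a linear transformation. Independent component analysis (ICA) recovers independent, non-Gaussian source signals from data.

Finally, reinforcement learning and control studies how an agent learns an optimal policy by interacting with its environment so as to maximize long-term reward. Taken together, the CS229 lecture notes cover a broad range of machine learning topics and give learners both a theoretical foundation and practical guidance.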

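[Figure: three side-by-side plots of fits to the same data set, each with a horizontal axis from 0 to 7 and vertical axis ticks from 0.5 to 4.5. The text below refers to them as the left, middle, and rightmost figures.]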

Instead, if we had added an extra feature $x^2$, and fit $y = \theta_0 + \theta_1 x + \theta_2 x^2$, then we obtain a slightly better fit to the data (see middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th-order polynomial $y = \sum_{j=0}^{5} \theta_j x^j$. We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of, say, housing prices ($y$) for different living areas ($x$). Without formally defining what these terms mean, we'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting. (Later in this class, when we talk about learning theory we'll formalize some of these notions, and also define more carefully just what it means for a hypothesis to be good or bad.)
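The effect of adding polynomial features can be sketched numerically. The snippet below is a minimal illustration, not the data behind the figures: the six (x, y) pairs are hypothetical values made up for this example, and `training_error` is a helper name introduced here. It fits polynomials of degree 1, 2, and 5 by least squares and reports the training error, which shrinks as features are added and essentially vanishes once the polynomial has as many coefficients as there are data points.

```python
import numpy as np

# Hypothetical toy data (assumption): six (living area, price) pairs
# in arbitrary units, invented purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 2.1, 2.6, 3.2, 3.3, 4.4])


def training_error(degree):
    """Sum of squared residuals of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, degree)       # highest power first
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))


# Degree 1 is the straight line, degree 2 adds the x^2 feature,
# degree 5 has six coefficients and can interpolate all six points.
errors = {d: training_error(d) for d in (1, 2, 5)}
for d, e in errors.items():
    print(f"degree {d}: training error {e:.6f}")
```

A near-zero training error for the degree-5 fit says nothing about how well it predicts at new living areas, which is the point the text makes about overfitting.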

As discussed previously, and as shown in the example above, the choice of features is important to ensuring good performance of a learning algorithm. (When we talk about model selection, we'll also see algorithms for automatically choosing a good set of features.) In this section, let us talk briefly about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. This treatment will be brief, since you'll get a chance to explore some of the properties of the LWR algorithm yourself in the homework.

In the original linear regression algorithm, to make a prediction at a query point $x$ (i.e., to evaluate $h(x)$), we would:

1. Fit $\theta$ to minimize $\sum_i (y^{(i)} - \theta^T x^{(i)})^2$.

2. Output $\theta^T x$.

In contrast, the locally weighted linear regression algorithm does the following:

1. Fit $\theta$ to minimize $\sum_i w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$.

2. Output $\theta^T x$.

Here, the $w^{(i)}$'s are non-negative valued weights. Intuitively, if $w^{(i)}$ is large for a particular value of $i$, then in picking $\theta$, we'll try hard to make $(y^{(i)} - \theta^T x^{(i)})^2$ small. If $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be pretty much ignored in the fit.
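For fixed weights, the weighted objective has a closed-form minimizer analogous to the normal equations. The derivation below is a sketch under assumptions not stated in this excerpt: $X$ denotes the design matrix whose $i$-th row is $(x^{(i)})^T$, $\vec{y}$ stacks the targets $y^{(i)}$, $W$ is the diagonal matrix with entries $W_{ii} = w^{(i)}$, and $X^T W X$ is taken to be invertible.

$$
\begin{aligned}
J(\theta) &= \sum_i w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 = (X\theta - \vec{y})^T W (X\theta - \vec{y}) \\
\nabla_\theta J(\theta) &= 2 X^T W (X\theta - \vec{y}) = 0 \\
\theta &= (X^T W X)^{-1} X^T W \vec{y}
\end{aligned}
$$

Setting every $w^{(i)} = 1$ gives $W = I$ and recovers the ordinary least-squares solution $\theta = (X^T X)^{-1} X^T \vec{y}$.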

A fairly standard choice for the weights is

$$w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$$

Note that the weights depend on the particular point $x$ at which we're trying to evaluate $h(x)$. Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point $x$. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians, and in particular the $w^{(i)}$ are not random variables, normally distributed or otherwise.) The parameter $\tau$ controls how quickly the weight of a training example falls off with distance of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter, and is also something that you'll get to experiment with in your homework.
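The two steps of the algorithm, with the exponential weights above, can be sketched as follows. This is a minimal illustration rather than the course's reference implementation: it assumes scalar inputs, adds an intercept column to the design matrix, solves the weighted least-squares problem directly through its normal equations, and uses a sine curve as hypothetical training data. The function name `lwr_predict` is introduced here.

```python
import numpy as np


def lwr_predict(x_query, x_train, y_train, tau):
    """Locally weighted linear regression prediction at one scalar query."""
    # Weights w_i = exp(-(x_i - x)^2 / (2 tau^2)): close to 1 near the
    # query point, close to 0 far from it.
    w = np.exp(-((x_train - x_query) ** 2) / (2.0 * tau ** 2))
    # Design matrix with an intercept column (an assumption of this sketch).
    X = np.column_stack([np.ones_like(x_train), x_train])
    # Step 1: fit theta by solving (X^T W X) theta = X^T W y, W = diag(w).
    XtW = X.T * w                     # broadcasting scales column i by w_i
    theta = np.linalg.solve(XtW @ X, XtW @ y_train)
    # Step 2: output theta^T x at the query point.
    return float(theta[0] + theta[1] * x_query)


# Hypothetical training data (assumption): samples of a sine curve.
x_train = np.linspace(0.0, 6.0, 25)
y_train = np.sin(x_train)

# A small bandwidth follows the local shape of the data; a very large one
# weights all examples almost equally and approaches ordinary regression.
for tau in (0.3, 100.0):
    print(f"tau={tau}: prediction at 1.5 = "
          f"{lwr_predict(1.5, x_train, y_train, tau):.4f}")
```

Note that `theta` is refit for every query point, so each prediction costs one weighted least-squares solve over the whole training set.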

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set.

Footnote: If $x$ is vector-valued, the weight formula is generalized to be $w^{(i)} = \exp(-(x^{(i)} - x)^T (x^{(i)} - x)/(2\tau^2))$, or $w^{(i)} = \exp(-(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x)/2)$, for an appropriate choice of $\tau$ or $\Sigma$.
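The two vector-valued forms can be compared directly. The sketch below is illustrative only: the function names `weights_tau` and `weights_sigma` and the three training points are assumptions made for this example. It computes both forms and shows that they agree when $\Sigma = \tau^2 I$, so the $\Sigma$ form is the more general of the two.

```python
import numpy as np


def weights_tau(x_query, X_train, tau):
    """w_i = exp(-(x_i - x)^T (x_i - x) / (2 tau^2)) for each row x_i."""
    diff = X_train - x_query
    sq_dist = np.sum(diff ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))


def weights_sigma(x_query, X_train, Sigma):
    """w_i = exp(-(x_i - x)^T Sigma^{-1} (x_i - x) / 2) for each row x_i."""
    diff = X_train - x_query
    # Solve Sigma z = diff^T instead of forming the inverse explicitly.
    z = np.linalg.solve(Sigma, diff.T).T
    quad = np.sum(diff * z, axis=1)
    return np.exp(-quad / 2.0)


# Hypothetical 2-D training inputs (assumption), one example per row.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]])
x_query = np.array([0.0, 0.0])
tau = 0.8

w_a = weights_tau(x_query, X_train, tau)
w_b = weights_sigma(x_query, X_train, tau ** 2 * np.eye(2))
print(w_a)
print(w_b)
```

A non-diagonal or non-uniform $\Sigma$ lets the weights fall off at different rates along different input directions, which a single $\tau$ cannot express.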
