one regression.[16]
Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression, regression involving correlated responses such as time series and growth curves, regression in which the predictor or response variables are curves, images, graphs, or other complex data objects, regression methods accommodating various types of missing data, nonparametric regression, Bayesian methods for regression, regression in which the predictor variables are measured with error, regression with more predictor variables than observations, and causal inference with regression.
3.2 Regression models
Regression models involve the following variables:
• The unknown parameters, denoted as β, which
may represent a scalar or a vector.
• The independent variables, X.
• The dependent variable, Y.
In various fields of application, different terminologies are
used in place of dependent and independent variables.
A regression model relates Y to a function of X and β.
Y ≈ f(X, β)
The approximation is usually formalized as E(Y | X) = f(X, β). To carry out regression analysis, the form of
the function f must be specified. Sometimes the form of
this function is based on knowledge about the relationship
between Y and X that does not rely on the data. If no such
knowledge is available, a flexible or convenient form for
f is chosen.
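For concreteness, here is a minimal sketch of this step, assuming a convenient linear form f(X, β) = β₀ + β₁X and simulated data (all values illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=50)                  # independent variable X
    y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, 50)     # dependent variable Y with noise

    # Design matrix with a column of ones for the intercept beta0.
    X = np.column_stack([np.ones_like(x), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)                                  # estimates near (2.0, 0.5)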
Assume now that the vector of unknown parameters β is of length k. In order to perform a regression analysis, the user must provide information about the dependent variable Y:
• If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be performed: since the system of equations defining the regression model is underdetermined, there are not enough data to recover β.
• If exactly N = k data points are observed, and the function f is linear, the equations Y = f(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the rows of X are linearly independent (see the sketch after this list). If f is nonlinear, a solution may not exist, or many solutions may exist.
• The most common situation is where N > k data
points are observed. In this case, there is enough
information in the data to estimate a unique value
for β that best fits the data in some sense, and the
regression model when applied to the data can be
viewed as an overdetermined system in β.
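As a sketch of the N = k versus N > k cases for a linear f, using illustrative numbers not taken from the text:

    import numpy as np

    # N = k = 2: the system Y = f(X, beta) is solved exactly.
    X_exact = np.array([[1.0, 1.0],
                        [1.0, 3.0]])          # rows are linearly independent
    y_exact = np.array([2.0, 4.0])
    beta_exact = np.linalg.solve(X_exact, y_exact)

    # N = 5 > k = 2: an overdetermined system, solved in the least-squares sense.
    X_over = np.column_stack([np.ones(5), np.arange(5.0)])
    y_over = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
    beta_fit, *_ = np.linalg.lstsq(X_over, y_over, rcond=None)

    print(beta_exact)   # reproduces the two observations exactly
    print(beta_fit)     # best fit; residuals are generally nonzero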
In the last case, the regression analysis provides the tools
for:
1. Finding a solution for unknown parameters β that will, for example, minimize the distance between the measured and predicted values of the dependent variable Y; the method of least squares, for instance, minimizes the sum of squared differences.
2. Under certain statistical assumptions, the regression analysis uses the surplus of information to provide statistical information about the unknown parameters β and predicted values of the dependent variable Y.
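A minimal sketch of point 1 for a linear f: the least-squares estimate has a closed form via the normal equations (XᵀX)β = XᵀY (data here are illustrative):

    import numpy as np

    X = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
    y = np.array([1.0, 2.1, 2.9, 4.2])

    # beta_hat minimizes ||y - X @ beta||^2, the squared distance between
    # measured and predicted values of the dependent variable Y.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)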
3.2.1 Necessary number of independent measurements
Consider a regression model which has three unknown parameters, β₀, β₁, and β₂. Suppose an experimenter performs 10 measurements all at exactly the same value of the independent variable vector X (which contains the independent variables X₁, X₂, and X₃). In this case, regression analysis fails to give a unique set of estimated values for the three unknown parameters; the experimenter did not provide enough information. The best one can do is to estimate the average value and the standard deviation of the dependent variable Y. Similarly, measuring at two different values of X would give enough data for a regression with two unknowns, but not for three or more unknowns.
If the experimenter had performed measurements at
three different values of the independent variable vector
X, then regression analysis would provide a unique set of
estimates for the three unknown parameters in β.
In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX is invertible.
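A short sketch of this failure mode, with illustrative values: ten measurements at one fixed X leave XᵀX singular, while three linearly independent settings make it invertible.

    import numpy as np

    x_fixed = np.array([1.0, 2.0, 3.0])       # one fixed setting of (X1, X2, X3)
    X = np.tile(x_fixed, (10, 1))             # 10 measurements, all identical rows
    print(np.linalg.matrix_rank(X.T @ X))     # 1: X^T X is singular, beta not identified

    X3 = np.array([[1.0, 2.0, 3.0],           # three linearly independent settings
                   [2.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0]])
    print(np.linalg.matrix_rank(X3.T @ X3))   # 3: X^T X is invertible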
3.2.2 Statistical assumptions
When the number of measurements, N, is larger than the number of unknown parameters, k, and the measurement errors εᵢ are normally distributed, then the excess of information contained in (N − k) measurements is used to make statistical predictions about the unknown parameters. This excess of information is referred to as the degrees of freedom of the regression.
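A minimal sketch of how those N − k degrees of freedom are used, assuming normal errors and simulated data: the residual variance estimate divides by N − k, and standard errors for the parameter estimates follow from it.

    import numpy as np

    rng = np.random.default_rng(1)
    N, k = 30, 2
    x = np.linspace(0.0, 5.0, N)
    X = np.column_stack([np.ones(N), x])
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=N)   # normally distributed errors

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    residuals = y - X @ beta_hat
    s2 = residuals @ residuals / (N - k)               # variance estimate uses N - k
    cov_beta = s2 * np.linalg.inv(X.T @ X)             # covariance of beta_hat
    print(beta_hat, np.sqrt(np.diag(cov_beta)))        # estimates and standard errors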