Introduction to R: Machine Learning Fundamentals and Applications Explained

Updated 2024-07-18 · PDF, 2.4 MB

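This document is an introductory guide to machine learning, with particular emphasis on applications in R. The author, Michael Clark, writes against the background of a center for social research and aims to give a conceptual framework to professionals whose conventional statistical training may have included little exposure to machine learning methods. Machine learning is treated as a form of statistics that differs from traditional analytical practice in the social sciences and other fields: its core is the use of flexible, automated techniques to uncover patterns in data, with the emphasis on predicting future data.

Chapter overview:

1. **Introduction: Explanation and Prediction** - The document first introduces the basic concepts of machine learning and distinguishes them from traditional statistical analysis, focusing on data mining and improved predictive ability.
2. **Terminology** - To aid understanding, the document defines key machine learning terms, including but not limited to model, feature, training set, and test set.
3. **Tools You Already Have** - Covers the statistical and programming tools readers already know, including R, and the role they play in machine learning, for example the standard linear model (such as simple linear regression).
4. **Expansions of Those Tools** - Covers logistic regression (for classification problems), generalized linear models (GLMs), and generalized additive models (GAMs), which suit different types of data (continuous and discrete).
5. **Loss Functions** - Explains the role of loss functions in machine learning, such as squared error, absolute error, and log-likelihood loss, and how they behave in different prediction tasks (such as binary and multiclass classification).
6. **Applications in R** - Provides hands-on examples that demonstrate how to use these tools and concepts in R.
7. **Bias-Variance Tradeoff** - Discusses the relationship between model complexity and predictive performance, including the problems of high bias (underfitting) and high variance (overfitting), and ways to address them such as cross-validation (for example k-fold and leave-one-out) and regularization.
8. **Model Assessment and Selection** - Beyond accuracy, the document stresses the importance of other performance measures such as precision, recall, and the F1 score, and discusses how to use model assessment to choose the most suitable model.
9. **Summary and Other Content** - The document closes by mentioning further topics, such as model diagnostics, adding a validation set, and bootstrap sampling, as extensions for readers who want to go deeper.

With this document, readers can systematically learn how to apply machine learning techniques in R and how to understand and address the data analysis needs of practical problems. Both beginners and experienced professionals can find useful knowledge and practical experience in it.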

Machine Learning 8

Where y is a normally distributed vector of responses [target] with mean µ and constant variance σ². X is a typical model matrix, i.e. a matrix of predictor variables in which the first column is a vector of 1s for the intercept [bias], and β is the vector of coefficients [weights] corresponding to the intercept and predictors in the model.

Note: Yes, you will see 'bias' refer to an intercept, and also mean something entirely different in our discussion of bias vs. variance.

What might be given less focus in applied courses, however, is how often it won't be the best tool for the job or even applicable in the form it is presented. Because of this, many applied researchers are still hammering screws with it, even as the explosion of statistical techniques of the past quarter century has rendered obsolete many current introductory statistical texts that are written for disciplines. Even so, the concepts one gains in learning the standard linear model are generalizable, and even a few modifications of it, while still maintaining the basic design, can render it still very effective in situations where it is appropriate.

Typically in fitting [learning] a model we tend to talk about R-squared and statistical significance of the coefficients for a small number of predictors. For our purposes, let the focus instead be on the residual sum of squares with an eye towards its reduction and model comparison. We will not have a situation in which we are only considering one model fit, and so must find one that reduces the sum of the squared errors but without unnecessary complexity and overfitting, concepts we'll return to later. Furthermore, we will be much more concerned with the model fit on new data [generalization].

Note: The residual sum of squares is $\sum (y - f(x))^2$, where $f(x)$ is a function of the model predictors, and in this context a linear combination of them ($X\beta$).
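To make the residual sum of squares concrete, here is a minimal sketch in plain Python (the text itself works in R). The data values and the helper names `fit_simple_ols` and `rss` are made up for illustration. It compares an intercept-only model with a one-predictor linear model, then checks the fit on a couple of new points.

```python
# Minimal sketch: residual sum of squares (RSS) for f(x) = b0 + b1 * x.
# All data values are invented for illustration.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the RSS for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rss(y, fitted):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Training data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
rss_linear = rss(y, [b0 + b1 * xi for xi in x])

# Intercept-only model: every prediction is the mean of y.
mean_y = sum(y) / len(y)
rss_intercept_only = rss(y, [mean_y] * len(y))

# New data the model has not seen [generalization].
x_new = [6.0, 7.0]
y_new = [12.2, 13.8]
rss_new = rss(y_new, [b0 + b1 * xi for xi in x_new])

print("RSS, intercept only:", rss_intercept_only)
print("RSS, linear model:  ", rss_linear)
print("RSS on new data:    ", rss_new)
```

The comparison of the two training RSS values is the model comparison the text describes; the last line is the quantity we will care about most later on.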

Logistic Regression

Logistic regression is often used where the response is categorical in nature, usually with binary outcome in which some event occurs or does not occur [label]. One could still use the standard linear model here, but you could end up with nonsensical predictions that fall outside the 0-1 range regarding the probability of the event occurring, to go along with other shortcomings. Furthermore, it is no more effort nor is any understanding lost in using a logistic regression over the linear probability model. It is also good to keep logistic regression in mind as we discuss other classification approaches later on.
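The point about nonsensical predictions can be seen in a small sketch, again in plain Python with invented data. The logistic fit below uses simple gradient ascent with an untuned step size, which is an assumption for illustration; statistical software typically fits the model by iteratively reweighted least squares instead.

```python
import math

def fit_simple_ols(x, y):
    """Linear probability model: least squares on a 0/1 response."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sigmoid(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(x, y, lr=0.01, steps=5000):
    """Gradient ascent on the Bernoulli log-likelihood (illustrative only)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Invented binary outcome that becomes more likely as x grows.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

a0, a1 = fit_simple_ols(x, y)
lpm_pred = [a0 + a1 * xi for xi in x]

b0, b1 = fit_logistic(x, y)
logit_pred = [sigmoid(b0 + b1 * xi) for xi in x]

print("linear probability model range:", min(lpm_pred), max(lpm_pred))
print("logistic regression range:     ", min(logit_pred), max(logit_pred))
```

With these data the linear probability model predicts below 0 at the low end of x and above 1 at the high end, while the logistic predictions stay inside the 0-1 range by construction.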

Logistic regression is also typically covered in an introduction to statistics for applied disciplines because of the pervasiveness of binary responses, or responses that have been made as such. Like the standard linear model, just a few modifications can enable one to use it to provide better performance, particularly with new data. The gist is, it is not the case that we have to abandon familiar tools in the move toward a machine learning perspective.

Note: It is generally a bad idea to discretize continuous variables, especially the dependent variable. However contextual issues, e.g. disease diagnosis, might warrant it.


Expansions of Those Tools

Generalized Linear Models

To begin, logistic regression is a generalized linear model assuming a binomial distribution for the response and with a logit link function as follows:

$$y \sim \mathrm{Bin}(\mu, \mathrm{size} = 1)$$

$$\eta = g(\mu)$$

$$\eta = X\beta$$

This is the same presentation format as seen with the standard linear model presented before, except now we have a link function g(.) and so are dealing with a transformed response. In the case of the standard linear model, the distribution assumed is the gaussian and the link function is the identity link, i.e. no transformation is made. The link function used will depend on the analysis performed, and while there is choice in the matter, the distributions used have a typical, or canonical link function.

Note: As another example, for the Poisson distribution, the typical link function would be the log(µ).
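As a small illustration of link functions, the sketch below pairs each distribution mentioned so far with its typical link g and the inverse of that link. It is plain Python rather than R, and the dictionary `LINKS` and the function names are made up for this example. The inverse link is what turns any value of the linear predictor η into a valid mean µ.

```python
import math

def identity(value):
    """Identity link: no transformation."""
    return value

def logit(mu):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# distribution -> (link g, inverse link); a hypothetical lookup for this sketch.
LINKS = {
    "gaussian": (identity, identity),
    "binomial": (logit, inv_logit),
    "poisson": (math.log, math.exp),
}

eta = -2.0  # an arbitrary value of the linear predictor
for family, (link, inverse) in LINKS.items():
    mu = inverse(eta)
    print(family, "eta =", eta, "-> mu =", mu, "-> g(mu) =", link(mu))
```

For the gaussian case µ is just η; for the binomial case µ lands between 0 and 1; for the Poisson case µ is positive, as a mean count must be.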

Generalized linear models expand the standard linear model, which is a special case of generalized linear model, beyond the gaussian distribution for the response, and allow for better fitting models of categorical, count, and skewed response variables. We also have a counterpart to the residual sum of squares, though we'll now refer to it as the deviance.
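To see why the deviance is a counterpart to the residual sum of squares, here is a hedged sketch using the standard textbook deviance formulas (they are not given in the text above, so treat them as an addition). For the gaussian case the unscaled deviance is exactly the RSS; for a binary response it is minus twice the log-likelihood. The data and function names are invented, and the code is plain Python rather than R.

```python
import math

def gaussian_deviance(y, mu):
    """Unscaled gaussian deviance, which equals the residual sum of squares."""
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu))

def bernoulli_deviance(y, mu):
    """Deviance for a 0/1 response: -2 times the log-likelihood."""
    loglik = sum(
        yi * math.log(mi) + (1 - yi) * math.log(1.0 - mi)
        for yi, mi in zip(y, mu)
    )
    return -2.0 * loglik

def poisson_deviance(y, mu):
    """Deviance for counts; the y * log(y / mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total

# Binary example: fitted probabilities that track y versus a constant 0.5.
y_bin = [0, 0, 1, 1]
dev_informative = bernoulli_deviance(y_bin, [0.1, 0.2, 0.8, 0.9])
dev_constant = bernoulli_deviance(y_bin, [0.5, 0.5, 0.5, 0.5])

# Gaussian example: the deviance and the RSS are the same number.
y_num = [1.0, 2.0, 4.0]
mu_num = [1.5, 2.0, 3.0]
dev_gaussian = gaussian_deviance(y_num, mu_num)

# Count example: a fit that reproduces the counts exactly has zero deviance.
dev_poisson_perfect = poisson_deviance([0, 2, 5], [1e-9, 2.0, 5.0])

print(dev_informative, dev_constant, dev_gaussian, dev_poisson_perfect)
```

As with the RSS, smaller is better, and the same cautions about complexity and overfitting apply.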

Generalized Additive Models

Additive models extend the generalized linear model to incorporate nonlinear relationships of predictors to the response. We might note it as follows:

$$y \sim \mathrm{family}(\mu, \ldots)$$

$$\eta = g(\mu)$$

$$\eta = X\beta + f(X)$$

So we have the generalized linear model but also smooth functions f(X) of one or more predictors. More detail can be found in Wood (2006) and I provide an introduction here.
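The following is a rough sketch of the additive idea only, not a real GAM. It assumes a gaussian response with identity link, and it stands in for the smooth term f(X) with a fixed cubic polynomial basis, whereas GAM software typically estimates penalized smooths. The data are simulated from a sine curve, all names are hypothetical, and the code is plain Python rather than R.

```python
import math

def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef

def least_squares(rows, y):
    """Least squares coefficients via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def rss(rows, y, coef):
    """Residual sum of squares for a fitted basis expansion."""
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Simulated nonlinear relationship between one predictor and the response.
x = [i * 0.5 for i in range(13)]
y = [math.sin(xi) for xi in x]

# eta = X beta: intercept and a linear term only.
linear_rows = [[1.0, xi] for xi in x]
# eta = X beta + f(X): the extra columns play the role of the smooth term.
additive_rows = [[1.0, xi, xi ** 2, xi ** 3] for xi in x]

rss_linear = rss(linear_rows, y, least_squares(linear_rows, y))
rss_additive = rss(additive_rows, y, least_squares(additive_rows, y))

print("RSS, linear only:    ", rss_linear)
print("RSS, with smooth term:", rss_additive)
```

The drop in RSS shows what the nonlinear term buys on the training data; whether it also helps on new data is the overfitting question raised earlier.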

Things do start to get fuzzy with GAMs. It becomes more difficult to obtain statistical inference for the smoothed terms in the model, and the nonlinearity does not always lend itself to easy interpretation. However really this just means that we have a little more work to get the desired level of understanding. GAMs can be seen as a segue toward more black box/algorithmic techniques. Compared to some of those techniques in machine learning, GAMs are notably more inter-
