A Beginner's Guide to Machine Learning with R


"Introduction to Machine Learning with R" 这本书是"Introduction to Machine Learning with R"的PDF版，由Scott V. Burger撰写，版权属于2018年的Scott Burger。它由O'Reilly Media, Inc.在美国出版，该出版社也提供在线版本的书籍。这本书主要针对教育、商业或销售推广用途，同时也可作为团体购买的选择。这本书的内容涵盖了严谨的数学分析和机器学习的基础知识，适合使用R语言进行机器学习实践的学习者。编辑团队包括Rachel Roumeliotis和Heather Scherer，生产编辑是Kristen Brown，而Bob Russell负责校对工作，Octal Publishing, Inc.的Bob Russell担任复制编辑，Jasmine Kwityn进行校样阅读，WordCo Indexing Services, Inc.编制索引，David Futato设计内页，Karen Montgomery设计封面，Rebecca Demarest则负责插图绘制。该书的初版于2018年3月发布。 "Introduction to Machine Learning with R"深入浅出地介绍了机器学习的理论基础，结合R语言的实际应用，帮助读者理解和应用各种机器学习算法。书中可能包含了数据预处理、模型选择、监督学习（如回归和分类）、无监督学习（聚类和降维）、集成学习（如随机森林和梯度提升）以及评估和验证模型的方法等内容。此外，读者还能学习如何使用R中的库，如 caret、randomForest、e1071等，进行机器学习项目。通过本书，读者将能够掌握如何在R环境中构建和优化机器学习模型，理解如何通过统计学方法解释模型结果，并学会在实际问题中应用这些知识。书中的实例和练习将帮助读者巩固理论知识，提高解决实际问题的能力。此外，O'Reilly的网站上提供了有关该书的错误报告和修订历史，确保了内容的准确性和时效性。对于希望提升R语言编程和机器学习技能的读者来说，"Introduction to Machine Learning with R"是一本不可多得的参考书籍。无论是初学者还是有一定经验的数据科学家，都能从中受益匪浅，进一步提升自己的机器学习能力。

The Bad

R has some drawbacks, as well. Many algorithms in its ecosystem are provided by the community or other third parties, so there can be some inconsistency between them and other tools. Each package in R is like its own mini-ecosystem that requires a little bit of understanding first before going all out with it. Some of these packages were developed a long time ago and it’s not obvious what the current “killer app” is for a particular machine learning model. You might want to do a simple neural network model, for example, but you also want to visualize it. Sometimes, you might need to select a package you’re less familiar with for its specific functionality and leave your favorite one behind.

Sometimes, documentation for more obscure packages can be inconsistent, as well. As referenced earlier, you can pull up the help file or manual page for a given function in R by doing something like ?lm() or ?rf(). In a lot of cases, these include helpful examples at the bottom of the page for how to run the function. However, some cases are needlessly complex and can be simplified to a great extent. One goal of this book is to try to present examples in the simplest cases to build an understanding of the model and then expand on the complexity of its workings from there.
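As a small illustration of the help system mentioned above, the sketch below uses only base R functions. Which examples a help page carries depends on the package, so treat the output as indicative rather than guaranteed.

```r
# help("lm") opens the same page as ?lm.
help("lm")

# args() prints a function's signature without opening the full help page.
print(args(lm))

# example() runs the code from the "Examples" section at the bottom of the help page.
example("lm")
```

When a help page's examples feel too dense, running them one line at a time and inspecting each intermediate object is often the quickest way to see what the function expects.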

Finally, the way R operates from a programmatic standpoint can drive some professional developers up a wall with how it handles things like type casting of data structures. People accustomed to working in a very strict object-oriented language for which you allocate specific amounts of memory for things will find R to be rather lax in its treatment of boundaries like those. It’s easy to pick up some bad habits as a result of such pitfalls, but this book aims to steer clear of those in favor of simplicity to explain the machine learning landscape.
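To make the point about lax type casting concrete, here is a minimal sketch of R's implicit coercion rules using base R only. The variable names are illustrative and do not come from the book.

```r
# Mixing types in one vector silently coerces everything to the most general type.
mixed <- c(1, "a", TRUE)
print(class(mixed))   # "character"
print(mixed)          # "1" "a" "TRUE"

# Logical values are promoted to numbers when combined with numerics.
nums <- c(1.5, TRUE, FALSE)
print(nums)           # 1.5 1.0 0.0

# The same promotion lets you sum a logical vector to count TRUE values.
total <- sum(c(TRUE, TRUE, FALSE))
print(total)          # 2

# A failed conversion yields NA with a warning rather than an error.
as_num <- suppressWarnings(as.numeric("abc"))
print(as_num)         # NA
```

Checking `class()` or `str()` on an object before modeling is a cheap guard against a numeric column that has quietly become character data.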

Summary

In this chapter we’ve scoped out the vision for our exploration of machine learning using the R programming language.

First we explored what makes up a model and how that differs from a report. You saw that a static report doesn’t tell us much in terms of predictability. You can turn a report into something more like a model by first introducing another feature and examining if there is some kind of relationship in the data. You then fit a simple linear regression model using the lm() function and got an equation as your final result. One feature of R that is quite powerful for developing models is the formula operator ~. You can use this operator with great effect for symbolically representing the formulas that you are trying to model.
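The sketch below shows how the ~ operator builds a formula object that can be inspected, extended, and handed to lm(). It assumes the built-in mtcars dataset; the names f, f2, and fit are illustrative.

```r
# A formula is an ordinary R object: the response sits left of ~, the predictors right.
f <- mpg ~ disp
print(class(f))      # "formula"
print(all.vars(f))   # "mpg" "disp"

# update() edits a formula symbolically; "." stands for "what was already there".
f2 <- update(f, . ~ . + wt)
print(f2)            # mpg ~ disp + wt

# The same formula object can be passed to a modeling function with a data frame.
fit <- lm(f2, data = mtcars)
print(coef(fit))
```

Keeping the formula separate from the data makes it straightforward to try the same model specification on different subsets of rows.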

We then explored the semantics of what defines a model. A machine learning model like linear regression relies on an optimization procedure in the background; gradient descent is one such algorithm, although R’s lm() solves the least-squares problem directly with a QR decomposition. You call linear regression in R by using the lm() function and then extract the coefficients from the model, using those to build your equation.
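To illustrate what a background optimization procedure such as gradient descent does, the sketch below fits the same mpg-versus-disp line by gradient descent and compares the result with lm(). The learning rate, iteration count, and standardization step are choices made for this illustration, not settings taken from the book.

```r
x <- mtcars$disp
y <- mtcars$mpg

# Standardize x so a single learning rate works well for both parameters.
x_s <- (x - mean(x)) / sd(x)

b0 <- 0              # intercept on the standardized scale
b1 <- 0              # slope on the standardized scale
learning_rate <- 0.1

for (i in seq_len(1000)) {
  err <- (b0 + b1 * x_s) - y
  # Gradients of the mean squared error divided by two.
  b0 <- b0 - learning_rate * mean(err)
  b1 <- b1 - learning_rate * mean(err * x_s)
}

# Convert the parameters back to the original units of disp.
slope <- b1 / sd(x)
intercept <- b0 - slope * mean(x)

print(c(intercept = intercept, slope = slope))
print(coef(lm(mpg ~ disp, data = mtcars)))
```

Both approaches minimize the same squared-error objective, so they should agree to many decimal places once gradient descent has converged.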

An important step with machine learning and modeling in general is to understand the limits of the models. Having a robust model of a complex set of data does not prevent the model itself from being limited in scope from a time perspective, as we saw with our mtcars data. Further, all models have some kind of error tied to them. We explore error assessment on a model-by-model basis, given that we can’t directly compare some types to others.

Lots of machine learning models rely on complicated statistical algorithms to compute what we want. In this book, we cover the basics of these algorithms, but focus more on implementation and interpretation of the code. When statistical concepts become more of a focus than the underlying code for a given chapter, we give special attention to those concepts in the appendixes where appropriate. The statistical techniques that go into how we shape the data for training and testing purposes, however, are discussed in detail. Oftentimes, it is very important to know how to specifically tune the machine learning model of choice, which requires good knowledge of how to handle training sets before passing test data through the fully optimized model.

To cap off this chapter, we make the case for why R is a suitable tool for machine learning. R has its pedigree and history in the field of statistics, which makes it a good platform on which to build modeling frameworks that utilize those statistics. Although some operations in R can be a little different from those in other programming languages, on the whole R is a relatively simple-to-use interface for a lot of complicated machine learning concepts and functions.

Being an open source programming language, R offers a lot of cutting-edge machine learning models and statistical algorithms. This can be a double-edged sword in terms of help files or manual pages, but this book aims to help simplify some of the more impenetrable examples encountered when looking for help.

In Chapter 2, we explore some of the most popular machine learning models and how we use them in R. Each model is presented in an introductory fashion with some worked examples. We further expand on each subject in a more in-depth dedicated chapter for each topic.

Box, G. E. P., J. S. Hunter, and W. G. Hunter. Statistics for Experimenters. 2nd ed. John Wiley & Sons, 2005.

Chapter 2. Supervised and Unsupervised Machine Learning

In the universe of machine learning algorithms, there are two major types: supervised and unsupervised. Supervised learning models are those in which a machine learning model is scored and tuned against some sort of known quantity. The majority of machine learning algorithms are supervised learners. Unsupervised learning models are those in which the model derives patterns and information from the data on its own, with no known quantity to be scored against. These are rarer in practice, but are useful in their own right and can help guide our thinking on where to explore the data for further analysis.

An example of supervised learning might be something like this: we have a model we’ve built that says “any business that sells fewer than 10 units is a poor performer, and more than 10 units is a good performer.” We then have a set of data we want to test against that statement. Suppose that our data includes a store that sells eight units. That is fewer than 10, so according to our model definition, it is classified as a poor performer. In this situation, we have a model that ingests data in which we’re interested and gives us an output as decided by the conditions in the model.
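The store rule described above can be written as a tiny function. This is only a sketch: the function name classify_store, the example stores, and the treatment of a store selling exactly 10 units (which the text leaves undefined) are assumptions made here.

```r
# Rule from the text: fewer than 10 units means a poor performer.
# Assumption: exactly 10 units is labeled "good", since the text does not cover that case.
classify_store <- function(units_sold, threshold = 10) {
  ifelse(units_sold < threshold, "poor", "good")
}

# Hypothetical stores, invented for illustration.
stores <- data.frame(store = c("A", "B", "C"), units = c(8, 10, 15))
stores$performance <- classify_store(stores$units)
print(stores)
```

Here the threshold is fixed by hand; in a trained classifier the boundary would instead be learned from examples whose labels are already known.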

In contrast, an unsupervised learning model might be something like this: we have a bunch of data and we want to know how to separate it into meaningful groups. We could have a bunch of data from a survey about people’s height and weight. We can use some algorithms in the unsupervised branch to figure out a way to group the data into meaningful clusters for which we might define clothing sizes. In this case, the model doesn’t have an answer telling it, “For this person’s given height and weight, I should classify them as a small pant size”; it must figure that out for itself.
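One way to sketch the height-and-weight grouping is with k-means clustering from base R's stats package. The survey data below is simulated purely for illustration, and both the choice of k-means and of three clusters are assumptions; the text does not name a specific algorithm here.

```r
set.seed(42)  # make the simulated data and the cluster starts reproducible

# Synthetic survey: three loosely separated groups of heights (cm) and weights (kg).
n_per_group <- 50
survey <- data.frame(
  height = c(rnorm(n_per_group, 160, 4), rnorm(n_per_group, 172, 4), rnorm(n_per_group, 185, 4)),
  weight = c(rnorm(n_per_group, 55, 5), rnorm(n_per_group, 70, 5), rnorm(n_per_group, 90, 5))
)

# Standardize so height and weight contribute on the same scale.
scaled <- scale(survey)

# Ask for three clusters; no size labels are supplied to the algorithm.
fit <- kmeans(scaled, centers = 3, nstart = 10)

survey$cluster <- fit$cluster
print(table(survey$cluster))
print(aggregate(cbind(height, weight) ~ cluster, data = survey, FUN = mean))
```

The cluster numbers carry no meaning on their own; a person still has to look at the group averages and decide which cluster corresponds to which clothing size.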

Supervised Models

Supervised models are more common than their unsupervised counterparts. They come in three major flavors:

Regression

These models are very common, and it’s likely that you encountered one in high school math classes. They are primarily used for looking at how data evolves with respect to another variable (e.g., time) and examining what you can do to predict values in the future.

Classification

These models are used to organize your data into schemes that make categorical sense. For instance, consider the aforementioned store labeling example: stores that sell more than 10 units per week could be classified as good performers, whereas those selling fewer than that number would be classified as poor.

Mixed

These models can often rely on parts of regression to inform how to do classification, or sometimes the opposite. One case might be looking at sales data over time and whether there is a rapid change in the slope of the line in some time period.

Regression

Regression modeling is something you most likely have done numerous times without realizing you’re doing machine learning. At its core, a regression line is one that we fit to data that has an x and a y element. We then use an equation to predict what the corresponding output, y, should be for any given input, x. This is always done on numeric data.

Let’s take a look at an example regression problem:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

This is one of the many built-in datasets featured in R: the mtcars dataset. It contains data about 32 cars from a 1974 issue of Motor Trend magazine. We have 11 features ranging from the car’s fuel efficiency in miles per US gallon, to weight, and even whether the car has a manual or automatic transmission. Figure 2-1 plots the fuel efficiency of the cars (mpg) in the dataset as a function of their engine size, or displacement (disp).
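The figure and the model-fitting code are not reproduced in this excerpt. A minimal sketch consistent with the Call line in the summary output shown later, lm(formula = mtcars$mpg ~ mtcars$disp), might look like the following; the axis labels are guesses rather than the book's own.

```r
# Scatter plot of fuel efficiency against displacement (a stand-in for Figure 2-1).
plot(
  mtcars$disp, mtcars$mpg,
  xlab = "Engine Size (cubic inches)",
  ylab = "Fuel Efficiency (miles per gallon)"
)

# Fit mpg as a linear function of disp.
model <- lm(mtcars$mpg ~ mtcars$disp)

# The intercept and slope of the fitted line.
print(coef(model))
```

The two coefficients printed at the end are the intercept b and the slope m of the fitted line.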

The cornerstone for regression modeling in R is the lm() function. We are also using another powerful operator featured in R: the formula operator as denoted by ~. You might recall that regression modeling is of the form y = mx + b, where the output y is determined from a given slope m, intercept b, and input data x. Your linear model in this case is given by the coefficients that you just computed, so the model looks like the following:

Fuel Efficiency = –0.041 × Engine Size + 29.599

You now have a very simple machine learning model! You can use any input for the engine size and get a value out. Let’s look at the fuel efficiency for a car with a 200-cubic-inch displacement:

-0.041 * 200 + 29.599

## [1] 21.399

Another, more accurate, way to do this is to call the coefficients from the model directly:

coef(model)[2] * 200 + coef(model)[1]

## mtcars$disp
##    21.35683
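An alternative to assembling the equation by hand is base R's predict() function. The sketch below refits the same model with a `data =` argument, because a model written as mtcars$mpg ~ mtcars$disp cannot match new inputs by column name. The names model_df and new_car are illustrative.

```r
# Same regression as before, but with variables looked up in the data frame.
model_df <- lm(mpg ~ disp, data = mtcars)

# New input: a hypothetical car with a 200-cubic-inch engine.
new_car <- data.frame(disp = 200)

predicted_mpg <- predict(model_df, newdata = new_car)
print(predicted_mpg)
```

This should agree with the coefficient-based calculation above, and it scales naturally to many rows of new data.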

You can repeat this with any numerical input in which you’re interested. Yet you might want to expand this analysis to include other features. You might want a model that computes fuel efficiency as a function not only of engine size, but maybe the number of cylinders, horsepower, number of gears, and so on. You might also want to try different functions to fit to the data, because if we evaluate the model at a theoretical engine size of 50,000 cubic inches, the fuel efficiency goes negative! We explore these types of modeling approaches in greater depth in Chapter 4, which focuses exclusively on regression models in R.
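Both points in this paragraph can be checked in a few lines. The sketch below is one possible formulation, not the book's: it adds the features named in the text to the formula, then shows where the single-feature line predicts a negative fuel efficiency.

```r
# A model with several features; extra predictors are joined with + in the formula.
multi_model <- lm(mpg ~ disp + cyl + hp + gear, data = mtcars)
print(coef(multi_model))

# The single-feature line extrapolates badly: a 50,000-cubic-inch engine gets negative mpg.
simple_model <- lm(mpg ~ disp, data = mtcars)
huge_engine <- predict(simple_model, newdata = data.frame(disp = 50000))
print(huge_engine)

# Displacement at which the fitted line crosses zero mpg.
zero_crossing <- -coef(simple_model)[["(Intercept)"]] / coef(simple_model)[["disp"]]
print(zero_crossing)
```

The largest engine in mtcars is 472 cubic inches, so any prediction far beyond that is extrapolation the data cannot support.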

Training and Testing of Data

Before we jump into the other major realm of supervised learning, we need to bring up the topic of training and testing data. As we’ve seen with simple linear regression modeling thus far, we have a model that we can use to predict future values. Yet, we know nothing about how accurate the model is for the moment. One way to determine model accuracy is to look at the R-squared value from the model:

summary(model)

## Call:
## lm(formula = mtcars$mpg ~ mtcars$disp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8922 -2.2022 -0.9631  1.6272  7.2305
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## mtcars$disp -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

The function call summary() on our model object gives us a lot of information. The accuracy measure that’s most important to us at the moment is the Adjusted R-squared value. R-squared is the proportion of the variation in the output that the model explains; the closer the value is to 1, the more closely the data follows a straight line with some kind of slope. The reason we are focusing on the adjusted value instead of the multiple value is for future scenarios in which we use more features in a model. For low numbers of features the adjusted and multiple R-squared values are basically the same thing. For models that have many features, we want to use the adjusted R-squared value instead, because multiple R-squared never decreases when a feature is added, whereas the adjusted value penalizes each additional feature and so gives a more honest assessment of the fit.
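Both R-squared values can be reproduced by hand from the residuals, which makes the role of the feature count explicit. The sketch below uses the standard definitions; the variable names are illustrative.

```r
model <- lm(mpg ~ disp, data = mtcars)

y <- mtcars$mpg
n <- length(y)   # number of observations (32)
p <- 1           # number of features in the model

ss_res <- sum(residuals(model)^2)   # variation the model leaves unexplained
ss_tot <- sum((y - mean(y))^2)      # total variation in mpg

r2 <- 1 - ss_res / ss_tot

# The adjustment inflates the unexplained share as p grows relative to n.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(multiple = r2, adjusted = adj_r2))
print(c(summary(model)$r.squared, summary(model)$adj.r.squared))
```

With a single feature the factor (n - 1) / (n - p - 1) is 31/30, which is why the two values differ only in the second decimal place here.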

But what does this tell us as far as an error estimate for the model? We have standard error values from the output, but there’s an issue with the model being trained on all the data, then being tested on the same data. What we want to do, in order to ensure an unbiased estimate of error, is to split our starting dataset into a training dataset and test dataset.

A common convention is to split the dataset you have into 80% training data and 20% test data. You can tinker with those numbers to your taste, but you generally want more training data than test data:

split_size <- 0.8

sample_size <- floor(split_size * nrow(mtcars))

set.seed(123)

train_indices <- sample(seq_len(nrow(mtcars)), size = sample_size)

train <- mtcars[train_indices, ]

test <- mtcars[-train_indices, ]

This example sets the split size at 80% and then the sample size for the training set to be 80% of the total number of rows in the mtcars data, rounded down.
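The excerpt ends before showing how the split is used. One plausible next step, sketched here under that assumption, is to fit the model on the training rows only and measure the error on the held-out test rows. Root mean squared error is chosen for illustration and is not necessarily the book's metric.

```r
split_size <- 0.8
sample_size <- floor(split_size * nrow(mtcars))   # 25 of the 32 rows

set.seed(123)  # fix the random draw so the split is repeatable
train_indices <- sample(seq_len(nrow(mtcars)), size = sample_size)

train <- mtcars[train_indices, ]    # rows chosen for training
test <- mtcars[-train_indices, ]    # everything else is held out

# Fit on the training rows only.
fit <- lm(mpg ~ disp, data = train)

# Predict the held-out rows and compare with their true mpg values.
test_pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - test_pred)^2))
print(rmse)
```

Because the test rows played no part in fitting, this error reflects how the model behaves on unseen data, although with only seven test cars it will vary noticeably from one seed to another.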
