1. Introduction
of least squares, which implemented the earliest form of what is now known
as linear regression. The approach was first successfully applied to problems
in astronomy. Linear regression is used for predicting quantitative values,
such as an individual's salary. In order to predict qualitative values, such as
whether a patient survives or dies, or whether the stock market increases
or decreases, Fisher proposed linear discriminant analysis in 1936. In the
1940s, various authors put forth an alternative approach, logistic regression.
In the early 1970s, Nelder and Wedderburn coined the term generalized
linear models for an entire class of statistical learning methods that include
both linear and logistic regression as special cases.
By the end of the 1970s, many more techniques for learning from data
were available. However, they were almost exclusively linear methods, because
fitting non-linear relationships was computationally infeasible at the
time. By the 1980s, computing technology had finally improved sufficiently
that non-linear methods were no longer computationally prohibitive. In the
mid-1980s, Breiman, Friedman, Olshen, and Stone introduced classification and
regression trees, and were among the first to demonstrate the power of a
detailed practical implementation of a method, including cross-validation
for model selection. Hastie and Tibshirani coined the term generalized addi-
tive models in 1986 for a class of non-linear extensions to generalized linear
models, and also provided a practical software implementation.
Since that time, inspired by the advent of machine learning and other
disciplines, statistical learning has emerged as a new subfield in statistics,
focused on supervised and unsupervised modeling and prediction. In recent
years, progress in statistical learning has been marked by the increasing
availability of powerful and relatively user-friendly software, such as the
popular and freely available R system. This has the potential to continue
the transformation of the field from a set of techniques used and developed
by statisticians and computer scientists to an essential toolkit for a much
broader community.
This Book
The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and
Friedman was first published in 2001. Since that time, it has become an
important reference on the fundamentals of statistical machine learning.
Its success derives from its comprehensive and detailed treatment of many
important topics in statistical learning, as well as the fact that (relative to
many upper-level statistics textbooks) it is accessible to a wide audience.
However, the greatest factor behind the success of ESL has been its topical
nature. At the time of its publication, interest in the field of statistical