
A Solution Manual and Notes for:
The Elements of Statistical Learning
by Jerome Friedman, Trevor Hastie, and Robert Tibshirani

John L. Weatherwax∗
David Epstein†

5 March 2018

Introduction

The Elements of Statistical Learning is an influential and widely studied book in the fields of machine learning, statistical inference, and pattern recognition. It is a standard recommended text in many graduate courses on these topics. It is also very challenging, particularly if one faces it without the support of teachers who are expert in the subject matter. For various reasons, both authors of these notes felt the need to understand the book well, and therefore to produce notes on the text where we found it difficult at first reading, together with answers to the exercises. Gaining understanding is time-consuming and intellectually demanding, so it seemed sensible to record our efforts in LaTeX and make them available on the web to other readers. A valuable by-product of writing up mathematical material is that one often finds gaps and errors in what one has written.

Now it is well known in all branches of learning, but in particular in mathematical learning, that the way to learn is to do, rather than to read. It is all too common to read through some material, convince oneself that one understands it well, but then find oneself at sea when trying to apply it in even a slightly different situation. Moreover, material understood only at a shallow level is easily forgotten.

It is therefore our strong recommendation that readers of the book should not look at our responses to any of the exercises before making a substantial effort to understand them without this aid. Any time spent in this way, even if it ends without success, will make our solutions

∗ wax@alum.mit.edu
† David.Epstein@warwick.ac.uk


easier to understand and more memorable. Quite likely you will find better solutions than those provided here; let us know when you do!

As far as teachers of courses based on the text are concerned, our notes may be regarded as a disadvantage, because it may be felt that the exercises in the book can no longer be used as a source of homework or questions for tests and examinations. This is a slight conflict of interest between teachers of courses on the one hand, and independent learners on the other hand. Finding ourselves in the ranks of the independent learners, the position we take is hardly surprising. Teachers of courses might benefit by comparing their own solutions with ours, as might students in classes and independent learners. If you, the reader, find a problem difficult and become stuck, our notes might enable you to unstick yourself.

In addition, there is a myriad of materials on any of the topics this book covers. A search for "statistical learning theory" on Google (as of 1 January 2010) gave over 4 million hits. The possible disadvantage of not being able to use the book's problems in an academic course is not really such a large one. Obtaining additional supplementary problems at the right level for an academic course is simply a matter of a visit to the library, or a little time spent surfing the net. And, of course, although one is not supposed to say this, teachers of courses can also get stuck.

Acknowledgments

We are hoping that people will find errors, offer insightful suggestions, and generally improve the understanding of the text, the exercises, and of our notes, and that they write to us about this. We will incorporate additions that we think are helpful. Our plan is to gradually add material as we work through the book. Comments and criticisms are always welcome. Special thanks to (most recent comments are listed first): Nicola Doninelli and Mark-Jan Nederhof for solutions in Chapter 5, Dan Wang for his bug report in the AdaBoost code, Liuzhou Zhuo for his comments on Exercise 3.25, and Ruchi Dhiman for his comments on Chapter 4.

We use the numbering found in the on-line (second edition) version of this text. The first edition should be similar but may have some differences.


Chapter 2 (Overview of Supervised Learning)

Statistical Decision Theory

We assume a linear model: that is, we assume $y = f(x) + \varepsilon$, where $\varepsilon$ is a random variable with mean 0 and variance $\sigma^2$, and $f(x) = x^T \beta$. Our expected prediction error (EPE) under the squared error loss is
\[
\mathrm{EPE}(\beta) = \int (y - x^T \beta)^2 \Pr(dx, dy). \tag{1}
\]

We regard this expression as a function of $\beta$, a column vector of length $p + 1$. In order to find the value of $\beta$ for which it is minimized, we equate the vector derivative with respect to $\beta$ to zero. We have

\[
\frac{\partial\, \mathrm{EPE}}{\partial \beta} = \int 2 (y - x^T \beta)(-1)\, x \Pr(dx, dy) = -2 \int (y - x^T \beta)\, x \Pr(dx, dy). \tag{2}
\]

Now this expression has two parts. The first has integrand $yx$ and the second has integrand $(x^T \beta) x$.

Before proceeding, we make a quick general remark about matrices. Suppose that $A$, $B$ and $C$ are matrices of size $1 \times p$, $p \times 1$ and $q \times 1$ respectively, where $p$ and $q$ are positive integers. Then $AB$ can be regarded as a scalar, and we have $(AB)C = C(AB)$, each of these expressions meaning that each component of $C$ is multiplied by the scalar $AB$. If $q > 1$, the expressions $BC$, $A(BC)$ and $ABC$ are meaningless, and we must avoid writing them. On the other hand, $CAB$ is meaningful as a product of three matrices, and the result is the $q \times 1$ matrix $(AB)C = C(AB) = CAB$. In our situation we obtain $(x^T \beta) x = x x^T \beta$.
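This scalar-versus-matrix identity is easy to sanity-check numerically. Below is a minimal sketch; the dimension and the random vectors are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
x = rng.normal(size=(p, 1))      # column vector x
beta = rng.normal(size=(p, 1))   # column vector beta

lhs = float(x.T @ beta) * x      # (x^T beta) x: scalar times vector
rhs = (x @ x.T) @ beta           # (x x^T) beta: rank-one matrix times vector

print(np.allclose(lhs, rhs))     # True
```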

From $\partial\, \mathrm{EPE} / \partial \beta = 0$ we deduce
\[
E[yx] - E[x x^T \beta] = 0 \tag{3}
\]
for the particular value of $\beta$ that minimizes the EPE. Since this value of $\beta$ is a constant, it can be taken out of the expectation to give
\[
\beta = E[x x^T]^{-1} E[yx], \tag{4}
\]
which gives Equation 2.16 in the book.
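Equation (4) can be checked by Monte Carlo: estimate $E[x x^T]$ and $E[yx]$ from a large sample and compare the resulting coefficient vector with the true $\beta$. This is only an illustrative sketch; the coefficients, noise level, and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 200_000
beta = np.array([1.5, -2.0, 0.5])               # true coefficients (illustrative)

X = rng.normal(size=(n, p))                     # draws of x
y = X @ beta + rng.normal(scale=0.3, size=n)    # y = x^T beta + eps

Exx = X.T @ X / n                               # sample estimate of E[x x^T]
Eyx = X.T @ y / n                               # sample estimate of E[y x]

beta_hat = np.linalg.solve(Exx, Eyx)            # E[x x^T]^{-1} E[y x]
print(np.allclose(beta_hat, beta, atol=0.02))   # True at this sample size
```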

We now discuss some points around Equations 2.26 and 2.27. We have
\[
\hat\beta = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X \beta + \varepsilon) = \beta + (X^T X)^{-1} X^T \varepsilon.
\]

So
\[
\hat y_0 = x_0^T \hat\beta = x_0^T \beta + x_0^T (X^T X)^{-1} X^T \varepsilon. \tag{5}
\]

This immediately gives
\[
\hat y_0 = x_0^T \beta + \sum_{i=1}^{N} \ell_i(x_0)\, \varepsilon_i
\]


where $\ell_i(x_0)$ is the $i$-th element of the $N$-dimensional column vector $X (X^T X)^{-1} x_0$, as stated at the bottom of page 24 of the book.
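The decomposition above is an exact algebraic identity, so it can be verified directly on simulated data. A minimal sketch, where the sizes and coefficients are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 4
beta = rng.normal(size=p)                      # true coefficients (illustrative)
X = rng.normal(size=(N, p))
eps = rng.normal(scale=0.5, size=N)
y = X @ beta + eps

x0 = rng.normal(size=p)                        # query point

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least-squares fit
y0_hat = x0 @ beta_hat                         # prediction at x0

ell = X @ np.linalg.solve(X.T @ X, x0)         # the N-vector of l_i(x0)
y0_decomp = x0 @ beta + ell @ eps              # x0^T beta + sum_i l_i(x0) eps_i

print(np.allclose(y0_hat, y0_decomp))          # True
```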

We now consider Equations 2.27 and 2.28. The variation is over all training sets $\mathcal{T}$, and over all values of $y_0$, while keeping $x_0$ fixed. Note that $x_0$ and $y_0$ are chosen independently of $\mathcal{T}$, and so the expectations commute: $E_{y_0|x_0} E_{\mathcal{T}} = E_{\mathcal{T}} E_{y_0|x_0}$. Also $E_{\mathcal{T}} = E_X E_{Y|X}$.

We write $y_0 - \hat y_0$ as the sum of three terms:
\[
y_0 - \hat y_0 = \left(y_0 - x_0^T \beta\right) - \left(\hat y_0 - E_{\mathcal{T}}(\hat y_0)\right) - \left(E_{\mathcal{T}}(\hat y_0) - x_0^T \beta\right) = U_1 - U_2 - U_3. \tag{6}
\]

In order to prove Equations 2.27 and 2.28, we need to square the expression in Equation 6 and then apply various expectation operators. First we consider the properties of each of the three terms $U_i$ in Equation 6. We have $E_{y_0|x_0} U_1 = 0$ and $E_{\mathcal{T}} U_1 = U_1 E_{\mathcal{T}}$. Despite the notation, $\hat y_0$ does not involve $y_0$. So $E_{y_0|x_0} U_2 = U_2 E_{y_0|x_0}$, and clearly $E_{\mathcal{T}} U_2 = 0$. Equation 5

gives
\[
U_3 = E_{\mathcal{T}}(\hat y_0) - x_0^T \beta = x_0^T\, E_X\!\left[ (X^T X)^{-1} X^T\, E_{Y|X}[\varepsilon] \right] = 0 \tag{7}
\]

since the expectation of the length-$N$ vector $\varepsilon$ is zero. This shows that the bias $U_3$ is zero.

We now square the remaining part of Equation 6 and then apply $E_{y_0|x_0} E_{\mathcal{T}}$. The cross-term $U_1 U_2$ gives zero, since $E_{y_0|x_0}(U_1 U_2) = U_2\, E_{y_0|x_0}(U_1) = 0$. (This works in the same way if $E_{\mathcal{T}}$ replaces $E_{y_0|x_0}$.)

We are left with two squared terms, and the definition of variance enables us to deal immediately with the first of these: $E_{y_0|x_0} E_{\mathcal{T}}\, U_1^2 = \mathrm{Var}(y_0|x_0) = \sigma^2$. It remains to deal with the term $E_{\mathcal{T}}(\hat y_0 - E_{\mathcal{T}}\hat y_0)^2 = \mathrm{Var}_{\mathcal{T}}(\hat y_0)$ in Equation 2.27. Since the bias $U_3 = 0$, we know that $E_{\mathcal{T}}\hat y_0 = x_0^T \beta$.

If $m$ is the $1 \times 1$ matrix with entry $\mu$, then $m m^T$ is the $1 \times 1$ matrix with entry $\mu^2$. It follows from Equation 5 that the variance term in which we are interested is equal to

\[
E_{\mathcal{T}}\!\left[ x_0^T (X^T X)^{-1} X^T \varepsilon\, \varepsilon^T X (X^T X)^{-1} x_0 \right].
\]

Since $E_{\mathcal{T}} = E_X E_{Y|X}$, and the expectation of $\varepsilon \varepsilon^T$ is $\sigma^2 I_N$, this is equal to

\[
\sigma^2\, x_0^T\, E_{\mathcal{T}}\!\left[ (X^T X)^{-1} \right] x_0 = \sigma^2\, x_0^T\, E_X\!\left[ \left( X^T X / N \right)^{-1} \right] x_0 / N. \tag{8}
\]

We suppose, as stated by the authors, that the mean of the distribution giving rise to $X$ and $x_0$ is zero. For large $N$, $X^T X / N$ is then approximately equal to $\mathrm{Cov}(X) = \mathrm{Cov}(x_0)$, the $p \times p$ variance-covariance matrix relating the $p$ components of a typical sample vector $x$; as far as $E_X$ is concerned, this is a constant. Applying $E_{x_0}$ to Equation 8 as in Equation 2.28, we obtain (approximately)

\[
\begin{aligned}
\sigma^2\, E_{x_0}\!\left[ x_0^T\, \mathrm{Cov}(X)^{-1} x_0 \right] / N
&= \sigma^2\, E_{x_0}\!\left[ \mathrm{trace}\!\left( x_0^T\, \mathrm{Cov}(X)^{-1} x_0 \right) \right] / N \\
&= \sigma^2\, E_{x_0}\!\left[ \mathrm{trace}\!\left( \mathrm{Cov}(X)^{-1}\, x_0 x_0^T \right) \right] / N \\
&= \sigma^2\, \mathrm{trace}\!\left( \mathrm{Cov}(X)^{-1}\, \mathrm{Cov}(x_0) \right) / N \\
&= \sigma^2\, \mathrm{trace}(I_p) / N \\
&= \sigma^2 p / N.
\end{aligned} \tag{9}
\]


This completes our discussion of Equations 2.27 and 2.28.
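The $\sigma^2 p / N$ variance term, plus the irreducible $\sigma^2$, can be observed in simulation. The sketch below repeatedly draws a training set and a test point from zero-mean Gaussians and averages the squared prediction error; the dimensions, noise level, and trial count are arbitrary illustrative choices, and the agreement is only approximate because Equation 9 is itself a large-$N$ approximation:

```python
import numpy as np

rng = np.random.default_rng(3)
p, N, sigma, trials = 5, 40, 1.0, 20_000
beta = np.ones(p)                              # fixed true coefficients (illustrative)

sq_err = np.empty(trials)
for t in range(trials):
    X = rng.normal(size=(N, p))                # zero-mean design, Cov = I_p
    y = X @ beta + rng.normal(scale=sigma, size=N)
    x0 = rng.normal(size=p)                    # fresh test point
    y0 = x0 @ beta + rng.normal(scale=sigma)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sq_err[t] = (y0 - x0 @ beta_hat) ** 2

print(sq_err.mean())                           # roughly sigma^2 * (1 + p/N)
```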

Notes on Local Methods in High Dimensions

The most common error metric used to compare different predictions of the true (but unknown) mapping value $f(x_0)$ is the mean square error (MSE). The unknown in the above discussion is the specific mapping $f(\cdot)$, which can be obtained via different methods, many of which are discussed in the book. In supervised learning, to help with the construction of an appropriate prediction $\hat y_0$, we have access to a set of "training samples" that contains a notion of randomness, in that these points are not under the complete control of the experimenter. One could ask how much square error our prediction at the input point $x_0$ will have on average when we consider all possible training sets $\mathcal{T}$. We can compute this by inserting the "expected value of the predictor obtained over all training sets", $E_{\mathcal{T}}(\hat y_0)$, into the definition of the quadratic (MSE) error as

\[
\begin{aligned}
\mathrm{MSE}(x_0) &= E_{\mathcal{T}}\!\left[ (f(x_0) - \hat y_0)^2 \right] \\
&= E_{\mathcal{T}}\!\left[ \left( \hat y_0 - E_{\mathcal{T}}(\hat y_0) + E_{\mathcal{T}}(\hat y_0) - f(x_0) \right)^2 \right] \\
&= E_{\mathcal{T}}\!\left[ (\hat y_0 - E_{\mathcal{T}}(\hat y_0))^2 + 2 (\hat y_0 - E_{\mathcal{T}}(\hat y_0)) (E_{\mathcal{T}}(\hat y_0) - f(x_0)) + (E_{\mathcal{T}}(\hat y_0) - f(x_0))^2 \right] \\
&= E_{\mathcal{T}}\!\left[ (\hat y_0 - E_{\mathcal{T}}(\hat y_0))^2 \right] + (E_{\mathcal{T}}(\hat y_0) - f(x_0))^2.
\end{aligned}
\]

Here we have expanded the quadratic, distributed the expectation across all terms, and noted that the middle term vanishes, since it is equal to
\[
E_{\mathcal{T}}\!\left[ 2 (\hat y_0 - E_{\mathcal{T}}(\hat y_0)) (E_{\mathcal{T}}(\hat y_0) - f(x_0)) \right] = 0,
\]
because $E_{\mathcal{T}}(\hat y_0) - E_{\mathcal{T}}(\hat y_0) = 0$. We are left with

\[
\mathrm{MSE}(x_0) = E_{\mathcal{T}}\!\left[ (\hat y_0 - E_{\mathcal{T}}(\hat y_0))^2 \right] + (E_{\mathcal{T}}(\hat y_0) - f(x_0))^2. \tag{10}
\]

The first term in the above expression, $E_{\mathcal{T}}[(\hat y_0 - E_{\mathcal{T}}(\hat y_0))^2]$, is the variance of our estimator $\hat y_0$, and the second term, $(E_{\mathcal{T}}(\hat y_0) - f(x_0))^2$, is the (squared) bias of our estimator. This notion of variance and bias with respect to our estimate $\hat y_0$ is to be understood relative to the possible training sets $\mathcal{T}$ and the specific computational method used in computing the estimate $\hat y_0$ given that training set.
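Since Equation 10 is an algebraic identity for the empirical distribution of predictions as well, the variance-plus-squared-bias split can be checked exactly over simulated training sets. Below, a straight line is fit to data from a quadratic truth, a deliberately biased model; the function, noise level, and sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x ** 2                       # true function (illustrative)
x0, sigma, N, trials = 0.8, 0.2, 30, 5000

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(-1, 1, size=N)
    y = f(x) + rng.normal(scale=sigma, size=N)
    a, b = np.polyfit(x, y, deg=1)         # fit a straight line (biased model)
    preds[t] = a * x0 + b                  # prediction at x0 from this training set

mse = np.mean((f(x0) - preds) ** 2)        # MSE over training sets
variance = preds.var()                     # variance of the estimator
bias_sq = (preds.mean() - f(x0)) ** 2      # squared bias of the estimator
print(abs(mse - (variance + bias_sq)) < 1e-9)   # True: the decomposition is exact
```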

Exercise Solutions

Ex. 2.1 (target coding)

The authors have suppressed the context here, making the question a little mysterious. For example, why use the notation $\bar y$ instead of simply $y$? We imagine that the background is something like the following. We have some input data $x$. Some algorithm assigns to $x$ the probability $y_k$ that $x$ is a member of the $k$-th class. This would explain why we are told to assume that the sum of the $y_k$ is equal to one. (But, if that reason is valid, then we should

