High-Dimensional Problems and Big-Data Prediction: A Look at Statistical Learning Methods


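"The Elements of Statistical Learning (2nd edition) (Trevor Hastie 2008)_18. High-Dimensional Problems.pdf"

The book is written by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, three professors of statistics at Stanford University. It addresses how to understand and handle the large volumes of data produced in fields such as medicine, biology, finance, and marketing, against a background of explosive growth in computation and information technology. The growth in data has brought new tools to statistics and new areas such as data mining, machine learning, and bioinformatics; many of these tools share common underpinnings but are expressed in different terminology. The book explains the important ideas in a common conceptual framework, emphasizes concepts rather than mathematics, and illustrates them with many color graphics. It is aimed at statisticians and at anyone interested in data mining in science or industry.

Chapter 18 focuses on high-dimensional problems, in which the number of features p is much larger than the number of samples N, usually written p ≫ N. Prediction is especially difficult in this setting because high variance and overfitting are the main concerns, so simple, highly regularized approaches often become the methods of choice. The first part of the chapter discusses prediction in both the classification and regression settings; the second part turns to the more basic problem of feature selection and assessment.

To illustrate the p ≫ N setting, the chapter describes a small simulation study. There are N = 100 samples; the features are standard Gaussian variables with pairwise correlation 0.2, and the number of features p is varied. The outcome Y is generated from a linear model with a coefficient for every feature plus a Gaussian error term. The study shows that fitting too many features is not ideal, and that simpler, more heavily regularized models perform better.

Topics covered by the book include neural networks, support vector machines, classification trees, and boosting; the book's description calls its treatment of boosting the first comprehensive one in any book. Hastie and Tibshirani developed generalized additive models and wrote a monograph on them. Hastie co-developed much of the statistical modeling software in S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools, including CART, MARS, and projection pursuit.

The Elements of Statistical Learning is a comprehensive statistics text that introduces the key concepts and techniques of high-dimensional data analysis and supports them with many worked examples. It is a valuable resource for researchers and practitioners who need to extract useful information from large data sets.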

18.3 Linear Classifiers with Quadratic Regularization

much like ridge regression (Section 3.4.1), which shrinks the total covariance matrix of the features towards a diagonal (scalar) matrix. In fact, viewing linear discriminant analysis as linear regression with optimal scoring of the categorical response [see (12.58) in Section 12.6], the equivalence becomes more precise.

The computational burden of inverting this large p × p matrix is overcome using the methods discussed in Section 18.3.5. The value of γ was chosen by cross-validation in line 2 of Table 18.1; all values of γ ∈ (0.002, 0.550) gave the same CV and test error. Further development of RDA, including shrinkage of the centroids in addition to the covariance matrix, can be found in Guo et al. (2006).
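To see why shrinking the covariance matrix matters when p > N, here is a minimal sketch. It assumes the regularized estimate has the form γ·Σ̂ + (1 − γ)·diag(Σ̂) with γ in [0, 1]; that formula is not shown in this excerpt and is an assumption here. The function names, the simulated data, and the value γ = 0.5 are all hypothetical choices for illustration.

```python
import numpy as np

def pooled_within_cov(X, g):
    """Pooled within-class covariance of the rows of X (class labels in g)."""
    N, p = X.shape
    classes = np.unique(g)
    S = np.zeros((p, p))
    for k in classes:
        Xk = X[g == k]
        D = Xk - Xk.mean(axis=0)      # center each class at its own centroid
        S += D.T @ D
    return S / (N - len(classes))

def shrink_to_diagonal(S, gamma):
    """Assumed form: gamma * S + (1 - gamma) * diag(S), with 0 <= gamma <= 1."""
    return gamma * S + (1.0 - gamma) * np.diag(np.diag(S))

rng = np.random.default_rng(0)
N, p = 20, 50                          # more features than samples
g = np.arange(N) % 2                   # two balanced classes
X = rng.normal(size=(N, p))

S = pooled_within_cov(X, g)
S_reg = shrink_to_diagonal(S, gamma=0.5)

rank_S = np.linalg.matrix_rank(S)                # at most N - 2, so S is singular
min_eig_reg = np.linalg.eigvalsh(S_reg).min()    # strictly positive after shrinkage
print(rank_S, min_eig_reg > 0)
```

The unregularized estimate cannot be inverted because its rank is below p, while the shrunken version is positive definite for any γ < 1.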

18.3.2 Logistic Regression with Quadratic Regularization

Logistic regression (Section 4.4) can be modified in a similar way, to deal with the p ≫ N case. With K classes, we use a symmetric version of the multiclass logistic model (4.17) on page 119:

$$
\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + x^T \beta_k)}{\sum_{\ell=1}^{K} \exp(\beta_{\ell 0} + x^T \beta_\ell)}. \tag{18.10}
$$

This has K coefficient vectors of log-odds parameters $\beta_1, \beta_2, \ldots, \beta_K$. We regularize the fitting by maximizing the penalized log-likelihood

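$$
\max_{\{\beta_{k0},\, \beta_k\}_1^K} \left[ \sum_{i=1}^{N} \log \Pr(g_i \mid x_i) - \frac{\lambda}{2} \sum_{k=1}^{K} \|\beta_k\|_2^2 \right]. \tag{18.11}
$$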

This regularization automatically resolves the redundancy in the parametrization, and forces $\sum_{k=1}^{K} \hat\beta_{kj} = 0$, $j = 1, \ldots, p$ (Exercise 18.3). Note that the constant terms $\beta_{k0}$ are not regularized (and so one should be set to zero). The resulting optimization problem is convex, and can be solved by a Newton algorithm or other numerical techniques. Details are given in Zhu and Hastie (2004). Friedman et al. (2008a) provide software for computing the regularization path for the two- and multiclass logistic regression models. Table 18.1, line 6 reports the results for the multiclass logistic regression model, referred to there as "multinomial". It can be shown (Rosset et al., 2004a) that for separable data, as λ → 0, the regularized (two-class) logistic regression estimate (renormalized) converges to the maximal margin classifier (Section 12.2). This gives an attractive alternative to the support-vector machine, discussed next, especially in the multiclass case.
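The sum-to-zero property can be checked numerically. The following is a minimal sketch, not the Newton algorithm or the path software mentioned above: it maximizes the quadratically penalized multinomial log-likelihood by plain gradient ascent with a conservative fixed step. The function names, the simulated data, and λ = 1 are hypothetical. The coefficients are deliberately started away from the sum-to-zero subspace so that the penalty has to pull them back.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_l2_multinomial(X, g, K, lam=1.0, n_iter=3000, seed=0):
    """Gradient ascent on sum_i log Pr(g_i | x_i) - (lam / 2) * sum_k ||beta_k||^2."""
    N, p = X.shape
    Y = np.eye(K)[g]                          # one-hot class indicators, N x K
    rng = np.random.default_rng(seed)
    B = 0.1 * rng.normal(size=(p, K))         # columns are beta_k; rows do not sum to zero
    b0 = np.zeros(K)                          # unpenalized constant terms
    Xa = np.hstack([np.ones((N, 1)), X])
    # Step 1/L, where L bounds the curvature: 0.5 * sigma_max(Xa)^2 + lam
    step = 1.0 / (0.5 * np.linalg.norm(Xa, 2) ** 2 + lam)
    for _ in range(n_iter):
        R = Y - softmax(b0 + X @ B)           # indicator minus fitted probability
        B += step * (X.T @ R - lam * B)       # penalized gradient for the slopes
        b0 += step * R.sum(axis=0)            # unpenalized gradient for the constants
    b0 -= b0[0]                               # fix the redundancy: set one constant to zero
    return b0, B

rng = np.random.default_rng(1)
N, p, K = 30, 50, 3                           # p > N
g = np.arange(N) % K
X = rng.normal(size=(N, p))
X[:, :K] += 2.0 * np.eye(K)[g]                # shift feature k upward for class k

b0, B = fit_l2_multinomial(X, g, K, lam=1.0)
P = softmax(b0 + X @ B)
max_abs_sum = np.abs(B.sum(axis=1)).max()     # largest |sum_k beta_kj| over features j
print(max_abs_sum)
```

Each gradient step multiplies the across-class coefficient sums by (1 − step·λ), because the likelihood part of the gradient sums to zero over classes. This is why the penalty alone drives those sums to zero.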

18.3.3 The Support Vector Classifier

The support vector classifier is described for the two-class case in Section 12.2. When p > N, it is especially attractive because in general the classes are perfectly separable by a hyperplane unless there are identical feature vectors in different classes. Without any regularization the support vector classifier finds the separating hyperplane with the largest margin; that is, the hyperplane yielding the biggest gap between the classes in the training data. Somewhat surprisingly, when p ≫ N the unregularized support vector classifier often works about as well as the best regularized version. Overfitting often does not seem to be a problem, partly because of the insensitivity of misclassification loss.
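The separability claim follows from a short linear-algebra argument. The sketch below assumes the N feature vectors are linearly independent, which requires p ≥ N and in particular rules out identical feature vectors. It omits the intercept, and the symbols X, y, and β̃ are notation introduced here for illustration.

```latex
\begin{aligned}
& X \in \mathbb{R}^{N \times p} \text{ with rows } x_i^T, \qquad
  y \in \{-1, +1\}^N, \qquad \operatorname{rank}(X) = N \le p, \\
& \tilde\beta = X^T (X X^T)^{-1} y
  \;\Longrightarrow\; X \tilde\beta = y, \\
& y_i \, x_i^T \tilde\beta = y_i^2 = 1 > 0
  \quad \text{for all } i = 1, \ldots, N .
\end{aligned}
```

So the hyperplane {x : xᵀβ̃ = 0} puts every training point on the correct side, whatever the labels are. It is one separating hyperplane, not necessarily the one with the largest margin.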

There are many different methods for generalizing the two-class support-vector classifier to K > 2 classes. In the "one versus one" (ovo) approach, we compute all $\binom{K}{2}$ pairwise classifiers. For each test point, the predicted class is the one that wins the most pairwise contests. In the "one versus all" (ova) approach, each class is compared to all of the others in K two-class comparisons. To classify a test point, we compute the confidences (signed distance from the hyperplane) for each of the K classifiers. The winner is the class with the highest confidence. Finally, Vapnik (1998) and Weston and Watkins (1999) suggested (somewhat complex) multiclass criteria which generalize the two-class criterion (12.6).
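The two voting schemes can be sketched independently of how each two-class classifier is fitted. In the code below the two-class fit is a stand-in, not a support vector classifier: it is a minimum-norm least squares fit of ±1 labels, chosen only because it needs nothing beyond NumPy. All function names and the simulated data are hypothetical.

```python
from itertools import combinations
import numpy as np

def fit_two_class(X, y):
    """Stand-in two-class fit (NOT an SVM): minimum-norm least squares on +/-1 labels."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.linalg.pinv(Xa) @ y
    return w[0], w[1:]                        # intercept, weight vector

def signed_distance(model, X):
    b, w = model
    return (b + X @ w) / np.linalg.norm(w)    # signed distance from the hyperplane

def predict_ovo(X_train, g_train, X_test, K):
    """One versus one: every pair of classes votes; most wins decides."""
    pairs = list(combinations(range(K), 2))   # K * (K - 1) / 2 pairwise problems
    votes = np.zeros((X_test.shape[0], K), dtype=int)
    for a, b in pairs:
        mask = (g_train == a) | (g_train == b)
        y = np.where(g_train[mask] == a, 1.0, -1.0)
        d = signed_distance(fit_two_class(X_train[mask], y), X_test)
        votes[d > 0, a] += 1                  # positive side votes for class a
        votes[d <= 0, b] += 1                 # otherwise the vote goes to class b
    return votes.argmax(axis=1), len(pairs)

def predict_ova(X_train, g_train, X_test, K):
    """One versus all: K classifiers; highest confidence decides."""
    conf = np.column_stack([
        signed_distance(
            fit_two_class(X_train, np.where(g_train == k, 1.0, -1.0)), X_test)
        for k in range(K)
    ])
    return conf.argmax(axis=1)

rng = np.random.default_rng(0)
N, p, K = 24, 60, 4                           # p > N, so each two-class fit separates its data
g = np.arange(N) % K
X = rng.normal(size=(N, p))

pred_ovo, n_pairs = predict_ovo(X, g, X, K)
pred_ova = predict_ova(X, g, X, K)
print(n_pairs, (pred_ovo == g).mean(), (pred_ova == g).mean())
```

Because p > N here, each stand-in classifier interpolates its training labels. Both schemes therefore recover the training classes exactly, which shows the mechanics but says nothing about test error.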

Tibshirani and Hastie (2007) propose the margin tree classifier, in which support-vector classifiers are used in a binary tree, much as in CART (Chapter 9). The classes are organized in a hierarchical manner, which can be useful for classifying patients into different cancer types, for example.

Line 3 of Table 18.1 shows the results for the support vector classifier using the ova method; Ramaswamy et al. (2001) reported (and we confirmed) that this approach worked best for this problem. The errors are very similar to those in line 6, as we might expect from the comments at the end of the previous section. The error rates are insensitive to the choice of C [the regularization parameter in (12.8) on page 420], for values of C > 0.001. Since p > N, the support vector hyperplane can perfectly separate the training data by setting C = ∞.

18.3.4 Feature Selection

Feature selection is an important scientific requirement for a classifier when p is large. Neither discriminant analysis, logistic regression, nor the support-vector classifier performs feature selection automatically, because all use quadratic regularization. All features have nonzero weights in these models. Ad hoc methods for feature selection have been proposed, for example, removing genes with small coefficients and refitting the classifier. This is done in a backward stepwise manner, starting with the smallest weights and moving on to larger weights. This is known as recursive feature elimination (Guyon et al., 2002). It was not successful in this example; Ramaswamy et al. (2001) report, for example, that the accuracy of the support-vector classifier starts to degrade as the number of genes is reduced from the full
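The backward elimination loop described in this paragraph can be sketched as follows. The classifier that is refitted at each step is a stand-in, not a support vector classifier: it is ridge regression on ±1 labels, solved through the N × N dual system. The function names, the fraction dropped per round, and the simulated data are hypothetical.

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Ridge coefficients via the N x N dual system, convenient when p > N.
    Stand-in for refitting the classifier (NOT a support vector fit)."""
    N = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

def recursive_elimination(X, y, n_keep, drop_frac=0.5, lam=1.0):
    """Repeatedly refit and discard the features with the smallest |weights|.
    drop_frac must lie in (0, 1]; it is the share of features removed per round."""
    active = np.arange(X.shape[1])            # indices of surviving features
    while len(active) > n_keep:
        beta = ridge_coef(X[:, active], y, lam)
        n_next = max(n_keep, int(len(active) * (1.0 - drop_frac)))
        keep = np.argsort(-np.abs(beta))[:n_next]   # largest |weights| survive
        active = np.sort(active[keep])
    return active

rng = np.random.default_rng(2)
N, p = 40, 200
y = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)      # balanced +/-1 labels
X = rng.normal(size=(N, p))
X[:, :5] += 1.5 * y[:, None]                  # only the first five features carry signal
X -= X.mean(axis=0)                           # center columns so no intercept is needed

selected = recursive_elimination(X, y, n_keep=10)
print(selected)
```

Dropping a fraction of the features per round, rather than one at a time, keeps the number of refits small when p is in the thousands. It is one of several reasonable schedules.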
