A Statistical View of Boosting: Additive Logistic Regression


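The paper "Additive Logistic Regression - A Statistical View of Boosting" examines boosting, and AdaBoost in particular, as an ensemble learning method. AdaBoost applies a classification algorithm sequentially to reweighted versions of the training data and combines the resulting classifiers by a weighted majority vote. At each iteration the observation weights are adjusted so that points that are hard to classify receive more attention in later rounds. The authors, Jerome Friedman, Trevor Hastie and Robert Tibshirani of Stanford University, explain boosting in terms of two established statistical ideas: additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale, with maximum Bernoulli likelihood as the criterion. The authors develop more direct approximations that give results nearly identical to boosting in practice, and for the multiclass problem they propose direct generalizations based on the multinomial likelihood that perform well.

Boosting combines a sequence of weak classifiers into a strong one. Giving more weight at each iteration to the points that were misclassified gradually improves the accuracy of the combined model. The same reweighting also lets mislabeled points and outliers accumulate large weights, so AdaBoost can be sensitive to label noise. AdaBoost and other boosting variants such as gradient boosting have become standard tools for classification and regression. They are widely used in data mining, computer vision, natural language processing and bioinformatics. They cope well with high-dimensional data, and model complexity can be controlled through parameters such as the number of iterations and the learning rate to limit overfitting.

The paper gives a statistical foundation for understanding how boosting works and offers guidance for improving it in practice.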

348 J. FRIEDMAN, T. HASTIE AND R. TIBSHIRANI

Note that after each Newton step, the weights change, and hence the tree configuration will change as well. This adds an adaptive twist to the data version of a Newton-like algorithm.

Parts of this derivation for AdaBoost can be found in Breiman (1997) and Schapire and Singer (1998), but without making the connection to additive logistic regression models.

Corollary 2. After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

Proof. This follows by noting that the $c$ that minimizes $J(F + cf)$ satisfies

$$\frac{\partial J(F + cf)}{\partial c} = -E\left[e^{-y(F(x) + cf(x))}\, y f(x)\right] = 0. \tag{21}$$

The result follows since $y f(x)$ is 1 for a correct and $-1$ for an incorrect classification. □
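Corollary 2 can be checked numerically. The sketch below assumes the Discrete AdaBoost step size $c = \frac{1}{2}\log\frac{1 - \mathrm{err}}{\mathrm{err}}$ and a weak learner whose weighted error lies strictly between 0 and 1. The function names and the toy labels are hypothetical and serve only as illustration.

```python
import math

def weighted_error(weights, y, pred):
    """Weighted misclassification error of predictions in {-1, 1}."""
    wrong = sum(w for w, yi, pi in zip(weights, y, pred) if yi != pi)
    return wrong / sum(weights)

def adaboost_reweight(weights, y, pred):
    """One Discrete AdaBoost reweighting step; assumes 0 < err < 1."""
    err = weighted_error(weights, y, pred)
    c = 0.5 * math.log((1.0 - err) / err)      # minimizer of J(F + c f)
    new_w = [w * math.exp(-c * yi * pi) for w, yi, pi in zip(weights, y, pred)]
    total = sum(new_w)
    return c, [w / total for w in new_w]       # renormalize to sum to 1

# Hypothetical toy data: the weak learner makes 2 mistakes out of 8.
y    = [1, 1,  1, -1, -1, -1, 1, -1]
pred = [1, 1, -1, -1, -1,  1, 1, -1]
w0 = [1.0 / len(y)] * len(y)

c, w1 = adaboost_reweight(w0, y, pred)
err_before = weighted_error(w0, y, pred)   # error under the old weights
err_after = weighted_error(w1, y, pred)    # error of the same learner, new weights
print(err_before, err_after)
```

Under the updated weights the same weak learner is no better than a coin flip, which is why the next round must find a different classifier.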

Schapire and Singer (1998) give the interpretation that the weights are updated to make the new weighted problem maximally difficult for the next weak learner.

The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier $f(x) \in \{-1, 1\}$. Result 1 requires minor modifications to accommodate $f(x) \in \mathbb{R}$, as in the generalized AdaBoost algorithms [Freund and Schapire (1996b), Schapire and Singer (1998)]; the estimate for $c_m$ differs.

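Fixing $f$, we see that the minimizer of (20) must satisfy

$$E_w\left[y f(x)\, e^{-c y f(x)}\right] = 0, \tag{22}$$

where $E_w$ denotes expectation weighted by $w(x, y) = e^{-yF(x)}$. If $f$ is not discrete, this equation has no closed-form solution for $c$, and requires an iterative solution such as Newton-Raphson.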
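The Newton-Raphson solution of (22) can be sketched on a finite sample, replacing the weighted expectation by a weighted sum over margins $m_i = y_i f(x_i)$. This is a minimal sketch rather than the authors' implementation: the function name, starting value and tolerances are assumptions, and a finite solution exists only when the sample contains both positive and negative margins.

```python
import math

def solve_step_size(weights, margins, tol=1e-12, max_iter=100):
    """Newton-Raphson for c solving sum_i w_i m_i exp(-c m_i) = 0.

    Equivalent to minimizing J(c) = sum_i w_i exp(-c m_i), which is convex:
    J'(c) = -g(c) and J''(c) = h(c) > 0, so the update is c <- c + g / h.
    """
    c = 0.0
    for _ in range(max_iter):
        g = sum(w * m * math.exp(-c * m) for w, m in zip(weights, margins))
        h = sum(w * m * m * math.exp(-c * m) for w, m in zip(weights, margins))
        step = g / h
        c += step
        if abs(step) < tol:
            break
    return c

# Hypothetical real-valued margins y_i * f(x_i) with uniform weights.
margins = [0.8, 0.5, 1.2, -0.4, 0.3, -0.9]
weights = [1.0 / len(margins)] * len(margins)
c_real = solve_step_size(weights, margins)
residual = sum(w * m * math.exp(-c_real * m) for w, m in zip(weights, margins))

# With margins in {-1, 1} the root is the closed form (1/2) log((1 - err) / err).
c_disc = solve_step_size([0.125] * 8, [1, 1, -1, 1, 1, -1, 1, 1])
print(c_real, residual, c_disc)
```

In the discrete case the iteration reproduces the closed-form step size, which gives a quick sanity check on the solver.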

We now derive the Real AdaBoost algorithm, which uses weighted probability estimates to update the additive logistic model, rather than the classifications themselves. Again we derive the population updates and then apply them to data by approximating conditional expectations by terminal-node averages in trees.

Result 2. The Real AdaBoost algorithm fits an additive logistic regression model by stagewise and approximate optimization of $J(F) = E[e^{-yF(x)}]$.

Proof. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + f(x)$ by minimizing $J(F(x) + f(x))$ at each $x$.

$$\begin{aligned}
J(F(x) + f(x)) &= E\left[e^{-yF(x)} e^{-yf(x)} \mid x\right] \\
&= e^{-f(x)} E\left[e^{-yF(x)} 1_{[y=1]} \mid x\right] + e^{f(x)} E\left[e^{-yF(x)} 1_{[y=-1]} \mid x\right].
\end{aligned}$$

Dividing through by $E[e^{-yF(x)} \mid x]$ and setting the derivative w.r.t. $f(x)$ to zero we get

fx=

log

1

y=1

x

1

y=−1

x

(23)

ADDITIVE LOGISTIC REGRESSION 349

log

y = 1x

y =−1x

(24)

where wx y=exp−yFx. The weights get updated by

wx y←wx ye

−yfx



The algorithm as presented would stop after one iteration. In practice we

use crude approximations to conditional expectation, such as decision trees or

other constrained models, and hence many steps are required.
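A minimal sketch of one Real AdaBoost update on data follows. It assumes the partition of $x$ into terminal nodes is already given, so that $P_w(y = 1 \mid x)$ is approximated by the weighted fraction of $y = 1$ in each node. The node ids, the function name and the clipping constant `eps` are assumptions added for illustration; the clipping only guards against taking the log of zero in pure nodes.

```python
import math

def real_adaboost_step(weights, y, node, eps=1e-6):
    """One Real AdaBoost update given a fixed partition into terminal nodes.

    node[i] is the (hypothetical) terminal-node id of observation i.
    Returns the per-node update f and the renormalized weights.
    """
    pos, tot = {}, {}
    for w, yi, k in zip(weights, y, node):
        tot[k] = tot.get(k, 0.0) + w
        pos[k] = pos.get(k, 0.0) + (w if yi == 1 else 0.0)
    f_node = {}
    for k in tot:
        p = pos[k] / tot[k]                      # weighted estimate of P_w(y=1 | node)
        p = min(max(p, eps), 1.0 - eps)          # added safeguard, not in the source
        f_node[k] = 0.5 * math.log(p / (1.0 - p))
    new_w = [w * math.exp(-yi * f_node[k]) for w, yi, k in zip(weights, y, node)]
    total = sum(new_w)
    return f_node, [w / total for w in new_w]

# Hypothetical toy data: two terminal nodes with four observations each.
y    = [1, 1, -1, 1, -1, -1, -1, 1]
node = [0, 0,  0, 0,  1,  1,  1, 1]
w0 = [1.0 / len(y)] * len(y)

f_node, w1 = real_adaboost_step(w0, y, node)
# Weighted sum of y inside each node after the update.
node_means = {k: sum(w * yi for w, yi, kk in zip(w1, y, node) if kk == k)
              for k in f_node}
print(f_node, node_means)
```

After the update the weighted sum of $y$ inside each node is zero, so a second fit on the same partition would return $f = 0$. This is the finite-sample counterpart of the remark that the population algorithm stops after one iteration.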

Corollary 3. At the optimal $F(x)$, the weighted conditional mean of $y$ is 0.

Proof. If $F(x)$ is optimal, we have

$$\frac{\partial J(F(x))}{\partial F(x)} = -E\left[e^{-yF(x)}\, y\right] = 0. \tag{25}$$

□

We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function $F$, there is no further information about $F$ in the weighted conditional distribution of $y$. If there is, we use it to update $F$.

At iteration $M$ in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

$$F(x) = \sum_{m=1}^{M} f_m(x), \tag{26}$$

where each of the components is found in a greedy forward stagewise fashion, fixing the earlier components. Our term "stagewise" refers to a similar approach in statistics:

1. Variables are included sequentially in a stepwise regression.

2. The coefficients of variables already included receive no further adjustment.

4.2. Why $E e^{-yF(x)}$? So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:

1. Schapire and Singer (1998) motivate $e^{-yF(x)}$ as a differentiable upper bound to misclassification error $1_{[yF<0]}$ (see Figure 2).

2. The AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.

Let $y^* = (y + 1)/2$, taking values 0, 1, and parametrize the binomial probabilities by

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.$$

The binomial log-likelihood is

$$\begin{aligned}
l(y^*, p(x)) &= y^* \log(p(x)) + (1 - y^*) \log(1 - p(x)) \\
&= -\log\left(1 + e^{-2yF(x)}\right).
\end{aligned} \tag{27}$$
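The second equality in (27) can be verified case by case. The worked step below uses only the definitions above ($y \in \{-1, 1\}$, $y^* = (y+1)/2$, and the stated form of $p(x)$); it is added here as a check and is not part of the original derivation.

```latex
% Rewrite p(x) and 1 - p(x) by dividing through by e^{F(x)} or e^{-F(x)}.
\begin{align*}
p(x) &= \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} = \frac{1}{1 + e^{-2F(x)}}, \\
1 - p(x) &= \frac{e^{-F(x)}}{e^{F(x)} + e^{-F(x)}} = \frac{1}{1 + e^{2F(x)}}. \\
\intertext{For $y = 1$ (so $y^* = 1$):}
l(y^*, p(x)) &= \log p(x) = -\log\bigl(1 + e^{-2F(x)}\bigr). \\
\intertext{For $y = -1$ (so $y^* = 0$):}
l(y^*, p(x)) &= \log\bigl(1 - p(x)\bigr) = -\log\bigl(1 + e^{2F(x)}\bigr).
\end{align*}
% Both cases agree with l = -\log(1 + e^{-2 y F(x)}).
```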


One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).

Why not estimate the $f_m$ by minimizing the squared error $E(y - F(x))^2$? If $F_{m-1}(x) = \sum_{j=1}^{m-1} f_j(x)$ is the current prediction, this leads to a forward stagewise procedure that does an unweighted fit to the response $y - F_{m-1}(x)$ at step $m$ as in (6). Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the nonmonotonicity of squared error loss (Figure 2) is the reason. Correct classifications, but with $yF(x) > 1$, incur increasing loss for increasing values of $|F(x)|$. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.
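The contrast between the criteria can be seen by evaluating each one as a function of the margin $yF$. The sketch below is illustrative only: the function names are hypothetical, and the squared error is written as $(1 - yF)^2$, which equals $(y - F)^2$ because $y^2 = 1$.

```python
import math

def misclassification(margin):
    """Indicator 1[yF < 0]."""
    return 1.0 if margin < 0 else 0.0

def exponential(margin):
    """Exponential criterion e^{-yF}."""
    return math.exp(-margin)

def neg_log_likelihood(margin):
    """Negative binomial log-likelihood log(1 + e^{-2yF})."""
    return math.log(1.0 + math.exp(-2.0 * margin))

def squared_error(margin):
    """(y - F)^2 = (1 - yF)^2 for y in {-1, 1}."""
    return (1.0 - margin) ** 2

margins = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
table = [(m, misclassification(m), exponential(m),
          neg_log_likelihood(m), squared_error(m)) for m in margins]
for row in table:
    print("%5.1f  %3.0f  %7.3f  %7.3f  %6.2f" % row)
```

The exponential and log-likelihood columns decrease monotonically in the margin. The squared error reaches its minimum at margin 1 and rises again, so a confident correct prediction with margin 3 incurs the same loss (4) as a misclassification with margin $-1$.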

4.3. Direct optimization of the binomial log-likelihood. In this section we explore algorithms for fitting additive logistic regression models by stagewise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case and will use a 0/1 response $y^*$ to represent the outcome. We represent the probability of $y^* = 1$ by $p(x)$, where

$$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}. \tag{30}$$

Algorithm 3 gives the details.

LogitBoost (two classes)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p(x_i) = \frac{1}{2}$.

2. Repeat for $m = 1, 2, \ldots, M$:

(a) Compute the working response and weights

$$z_i = \frac{y_i^* - p(x_i)}{p(x_i)(1 - p(x_i))}, \qquad w_i = p(x_i)(1 - p(x_i)).$$

(b) Fit the function $f_m(x)$ by a weighted least-squares regression of $z_i$ to $x_i$ using weights $w_i$.

(c) Update $F(x) \leftarrow F(x) + \frac{1}{2} f_m(x)$ and $p(x) \leftarrow e^{F(x)} / (e^{F(x)} + e^{-F(x)})$.

3. Output the classifier $\operatorname{sign}[F(x)] = \operatorname{sign}\left[\sum_{m=1}^{M} f_m(x)\right]$.

Algorithm 3. An adaptive Newton algorithm for fitting an additive logistic regression model.
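The steps of Algorithm 3 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weak learner is a regression stump on a single feature, the helper names (`fit_stump`, `logitboost`, `score`, `predict`) are hypothetical, and the bounds `z_max` and `w_min` are numerical safeguards added here to keep the working response and weights finite.

```python
import math

def fit_stump(x, z, w):
    """Weighted least-squares regression stump on a 1-D feature."""
    xs = sorted(set(x))
    best = None
    for t in [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]:
        sse, means = 0.0, []
        for side in (True, False):
            pts = [(zi, wi) for xi, zi, wi in zip(x, z, w) if (xi <= t) == side]
            mean = sum(zi * wi for zi, wi in pts) / sum(wi for _, wi in pts)
            sse += sum(wi * (zi - mean) ** 2 for zi, wi in pts)
            means.append(mean)
        if best is None or sse < best[0]:
            best = (sse, t, means[0], means[1])
    _, t, left, right = best
    return lambda v: left if v <= t else right

def logitboost(x, y_star, n_rounds=10, z_max=4.0, w_min=1e-10):
    """Two-class LogitBoost; y_star holds 0/1 responses."""
    F = [0.0] * len(x)                 # step 1: F(x) = 0
    p = [0.5] * len(x)                 # step 1: p(x_i) = 1/2
    learners = []
    for _ in range(n_rounds):
        # (a) weights and working response (bounded: added safeguards)
        w = [max(pi * (1.0 - pi), w_min) for pi in p]
        z = [max(-z_max, min(z_max, (yi - pi) / wi))
             for yi, pi, wi in zip(y_star, p, w)]
        # (b) weighted least-squares fit of z on x
        f = fit_stump(x, z, w)
        learners.append(f)
        # (c) half Newton step, then refresh the probabilities
        F = [Fi + 0.5 * f(xi) for Fi, xi in zip(F, x)]
        p = [1.0 / (1.0 + math.exp(-2.0 * Fi)) for Fi in F]
    return learners

def score(learners, v):
    """F(v) = (1/2) * sum_m f_m(v)."""
    return sum(0.5 * f(v) for f in learners)

def predict(learners, v):
    """Step 3, reported on the 0/1 scale of y*."""
    return 1 if score(learners, v) >= 0 else 0

# Hypothetical toy data, separable at x = 1.1.
x      = [0.1, 0.4, 0.5, 0.9, 1.3, 1.6, 2.0, 2.4]
y_star = [0,   0,   0,   0,   1,   1,   1,   1]
learners = logitboost(x, y_star, n_rounds=10)
predictions = [predict(learners, v) for v in x]
print(predictions, score(learners, 2.4))
```

The probability update uses $1 / (1 + e^{-2F})$, which is algebraically the same as $e^{F} / (e^{F} + e^{-F})$. On this toy data every round selects the same split, and $|F|$ keeps growing even though the training error is already zero, so the probability estimates move toward 0 and 1.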
