Gradient Boosting Machine (GBDT): Principles and Applications


"这篇论文是GBDT（Gradient Boosting Decision Trees）的原始出处，由Jerome H. Friedman在2001年发表于《统计学年鉴》上。该论文详细阐述了GBDT的基本思想和算法，对于理解GBDT的理论基础具有重要意义。" GBDT（Gradient Boosting Decision Trees）是一种迭代的决策树算法，它通过构建一系列弱学习器（如决策树），并将它们的结果累加起来，形成一个强学习器。Friedman在论文中提出了将函数拟合视为参数空间中的数值优化问题，而不仅仅是参数空间的问题。他引入了一种称为“梯度提升”的通用框架，该框架基于任何拟合标准进行逐步添加。论文中的关键概念包括： 1. **梯度提升**：这是一种优化方法，它不是直接最小化损失函数，而是沿着损失函数梯度的负方向构建新的模型。每次迭代，新模型的目标是最小化前一轮所有模型残差的梯度，从而逐渐减少整体误差。 2. **加法模型**：GBDT采用加法模型，即模型由多个弱学习器的输出相加构成。每个弱学习器专注于纠正之前模型的错误，使得整体性能逐步提高。 3. **损失函数**：论文提到了几种特定的损失函数，如最小二乘法、绝对偏差和Huber损失函数，用于回归任务；多类逻辑似然函数用于分类任务。这些损失函数的选择影响了GBDT的学习过程和最终性能。 4. **决策树作为基学习器**：Friedman特别指出，当个体的加性组件是决策树时，GBDT的性能尤为突出。决策树的分枝结构使其能够处理非线性关系，并且具有良好的可解释性。他还讨论了如何优化树的构建过程，以提高"TreeBoost"模型的效率和效果。 5. **解读与增强**：论文中还提供了解析"TreeBoost"模型的方法，这对于理解和解释模型的预测行为至关重要。这在数据不清晰或噪声较大的情况下尤其有用，因为GBDT能够生成可解释性强的模型，即使面对复杂的数据模式。 6. **鲁棒性和竞争性**：GBDT因其高鲁棒性和强大的泛化能力，在回归和分类任务中表现优秀，特别是处理有缺失值或异常值的数据集时。这篇原始论文深入探讨了GBDT的数学基础和算法实现，对于想要深入理解GBDT工作原理的读者，它是不可或缺的参考资料。通过阅读论文，我们可以了解GBDT如何通过迭代优化和决策树集成来达到高效、稳健的预测效果。

1196 J. H. FRIEDMAN

The update (16) can be alternatively expressed as

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\, 1(x \in R_{jm}) \tag{17}$$

with $\gamma_{jm} = \rho_m b_{jm}$. One can view (17) as adding $J$ separate basis functions at each step, $\{1(x \in R_{jm})\}_1^J$, instead of a single additive one as in (16). Thus, in this case one can further improve the quality of the fit by using the optimal coefficients for each of these separate basis functions (17). These optimal coefficients are the solution to

$$\{\gamma_{jm}\}_1^J = \arg\min_{\{\gamma_j\}_1^J} \sum_{i=1}^{N} L\Bigl(y_i,\; F_{m-1}(x_i) + \sum_{j=1}^{J} \gamma_j\, 1(x_i \in R_{jm})\Bigr).$$

Owing to the disjoint nature of the regions produced by regression trees, this

reduces to

$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\bigr). \tag{18}$$

This is just the optimal constant update in each terminal node region, based on the loss function $L$, given the current approximation $F_{m-1}(x)$.

For the case of LAD regression (18) becomes

$$\gamma_{jm} = \operatorname{median}_{x_i \in R_{jm}} \{\, y_i - F_{m-1}(x_i) \,\},$$

which is simply the median of the current residuals in the $j$th terminal node at the $m$th iteration. At each iteration a regression tree is built to best predict the sign of the current residuals $y_i - F_{m-1}(x_i)$, based on a least-squares criterion. Then the approximation is updated by adding the median of the residuals in each of the derived terminal nodes.
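The per-node median update can be illustrated with a small sketch. The residual values and node assignments below are made up for illustration, and the helper names (`lad_node_updates`, `abs_loss`) are hypothetical, not from the paper. The sketch groups residuals by terminal node, takes the median in each node, and compares the absolute loss of the median against that of the mean in a node containing an outlier.

```python
from statistics import median

def lad_node_updates(residuals, node_ids):
    """Return {node: median of the residuals in that node}, i.e. gamma_jm for LAD loss."""
    by_node = {}
    for r, j in zip(residuals, node_ids):
        by_node.setdefault(j, []).append(r)
    return {j: median(rs) for j, rs in by_node.items()}

def abs_loss(residuals, gamma):
    """Sum of |r - gamma| over the residuals of one node."""
    return sum(abs(r - gamma) for r in residuals)

# Made-up residuals y_i - F_{m-1}(x_i) and the terminal node each x_i falls in.
residuals = [0.5, -1.0, 8.0, 0.25, -0.5, 0.125]
node_ids = [0, 0, 0, 1, 1, 1]

gammas = lad_node_updates(residuals, node_ids)

# In node 0 the outlier 8.0 pulls the mean to 2.5, but the median stays at 0.5.
node0 = residuals[:3]
loss_at_median = abs_loss(node0, gammas[0])
loss_at_mean = abs_loss(node0, sum(node0) / len(node0))
```

Here the median of node 0 gives a smaller absolute loss than the mean, which is the sense in which the constant in (18) is optimal for LAD.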

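Algorithm 3 (LAD_TreeBoost).

$F_0(x) = \operatorname{median}\{y_i\}_1^N$

For $m = 1$ to $M$ do:

    $\tilde{y}_i = \operatorname{sign}(y_i - F_{m-1}(x_i)),\ i = 1, N$

    $\{R_{jm}\}_1^J = J$-terminal node tree$(\{\tilde{y}_i, x_i\}_1^N)$

    $\gamma_{jm} = \operatorname{median}_{x_i \in R_{jm}} \{\, y_i - F_{m-1}(x_i) \,\},\ j = 1, J$

    $F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm}\, 1(x \in R_{jm})$

endFor

end Algorithm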
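A minimal sketch of Algorithm 3 in plain Python follows. It makes several simplifying assumptions that are not in the paper: the input is one-dimensional, each tree is a stump ($J = 2$ terminal nodes) found by exhaustive least-squares search, the inputs contain at least two distinct values, and no shrinkage is applied. All function names and the toy data are hypothetical.

```python
from statistics import median

def sign(v):
    return (v > 0) - (v < 0)

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs: return the threshold t that minimizes
    squared error when each side (x <= t, x > t) predicts its own mean."""
    order = sorted(set(xs))
    best_t, best_sse = None, float("inf")
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2
        sse = 0.0
        for side in ([z for x, z in zip(xs, targets) if x <= t],
                     [z for x, z in zip(xs, targets) if x > t]):
            mean = sum(side) / len(side)
            sse += sum((z - mean) ** 2 for z in side)
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t

def lad_treeboost(xs, ys, n_iter=10):
    """Sketch of LAD_TreeBoost with J = 2. Returns (F0, [(t, gamma_left, gamma_right), ...])."""
    f0 = median(ys)                                   # F_0(x) = median{y_i}
    preds = [f0] * len(ys)
    stages = []
    for _ in range(n_iter):
        resid = [y - p for y, p in zip(ys, preds)]
        pseudo = [sign(r) for r in resid]             # pseudoresponses
        t = fit_stump(xs, pseudo)                     # regions from a least-squares tree
        g_left = median([r for x, r in zip(xs, resid) if x <= t])
        g_right = median([r for x, r in zip(xs, resid) if x > t])
        stages.append((t, g_left, g_right))
        preds = [p + (g_left if x <= t else g_right) for x, p in zip(xs, preds)]
    return f0, stages

def predict(model, x):
    f0, stages = model
    return f0 + sum(gl if x <= t else gr for t, gl, gr in stages)

def mae(model, xs, ys):
    return sum(abs(y - predict(model, x)) for x, y in zip(xs, ys)) / len(ys)

# Toy step function with one gross outlier at x = 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.25, 0.75, 1.0, 5.0, 5.25, 4.75, 50.0]

initial_mae = mae((median(ys), []), xs, ys)
model = lad_treeboost(xs, ys, n_iter=10)
final_mae = mae(model, xs, ys)
```

Because each terminal-node update is the median of that node's residuals, the training absolute error cannot increase from one iteration to the next in this sketch: a zero update is always among the candidates that the median beats or ties.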

This algorithm is highly robust. The trees use only order information on the individual input variables $x_j$, and the pseudoresponses $\tilde{y}_i$ (13) have only two values, $\tilde{y}_i \in \{-1, 1\}$. The terminal node updates are based on medians.

GREEDY FUNCTION APPROXIMATION 1197

An alternative approach would be to build a tree to directly minimize the loss

criterion,

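$$\operatorname{tree}_m(x) = \arg\min_{J\text{-node tree}} \sum_{i=1}^{N} \bigl| y_i - F_{m-1}(x_i) - \operatorname{tree}(x_i) \bigr|$$

$$F_m(x) = F_{m-1}(x) + \operatorname{tree}_m(x).$$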

However, Algorithm 3 is much faster since it uses least-squares to induce the

trees. Squared-error loss is much more rapidly updated than mean absolute

deviation when searching for splits during the tree building process.
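One way to see why the least-squares criterion is cheap to update is the sketch below, which is an illustration and not code from the paper. For a single sorted feature, the squared error of a split depends only on the running sum and count on each side, so moving the split point by one observation costs constant time. An absolute-deviation criterion would instead need the median of each side at every candidate split. The function name `best_ls_split` is hypothetical.

```python
def best_ls_split(xs, zs):
    """Scan the sorted points once and return (threshold, sse) of the best
    least-squares split, using SSE = sum z^2 - S_L^2 / n_L - S_R^2 / n_R."""
    pairs = sorted(zip(xs, zs))
    n = len(pairs)
    total = sum(z for _, z in pairs)
    total_sq = sum(z * z for _, z in pairs)
    left_sum = 0.0
    best = (None, float("inf"))
    for k in range(1, n):
        left_sum += pairs[k - 1][1]          # O(1) update as the split moves right
        if pairs[k - 1][0] == pairs[k][0]:
            continue                         # cannot split between equal x values
        right_sum = total - left_sum
        sse = total_sq - left_sum ** 2 / k - right_sum ** 2 / (n - k)
        if sse < best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, sse)
    return best

# Sign-valued targets, as in LAD_TreeBoost: the best split separates -1 from +1.
threshold, sse = best_ls_split([1, 2, 3, 4], [-1, -1, 1, 1])
```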

4.4. M-Regression. M-regression techniques attempt resistance to long-tailed error distributions and outliers while maintaining high efficiency for normally distributed errors. We consider the Huber loss function [Huber (1964)]

$$L(y, F) = \begin{cases} \tfrac{1}{2}(y - F)^2, & |y - F| \le \delta, \\ \delta\,\bigl(|y - F| - \delta/2\bigr), & |y - F| > \delta. \end{cases} \tag{19}$$

Here the pseudoresponse is

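$$\tilde{y}_i = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} = \begin{cases} y_i - F_{m-1}(x_i), & |y_i - F_{m-1}(x_i)| \le \delta, \\ \delta \cdot \operatorname{sign}\bigl(y_i - F_{m-1}(x_i)\bigr), & |y_i - F_{m-1}(x_i)| > \delta, \end{cases}$$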

and the line search becomes

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\bigl(y_i,\; F_{m-1}(x_i) + \rho\, h(x_i; a_m)\bigr) \tag{20}$$

with L given by (19). The solution to (19), (20) can be obtained by standard

iterative methods [see Huber (1964)].
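The Huber loss (19), its pseudoresponse and the line search (20) can be sketched as follows. The ternary search used for the line search is an assumption chosen for brevity: it relies on the summed loss being convex in $\rho$ and on a bracketing interval supplied by the caller, and it is not the iterative method of Huber (1964) that the paper refers to. All names and numbers are hypothetical.

```python
def huber_loss(y, f, delta):
    """Huber loss (19): quadratic for small residuals, linear beyond delta."""
    r = abs(y - f)
    return 0.5 * r * r if r <= delta else delta * (r - delta / 2)

def huber_pseudoresponse(y, f, delta):
    """Negative gradient of (19) with respect to f: the residual, clipped to [-delta, delta]."""
    r = y - f
    if abs(r) <= delta:
        return r
    return delta * (1 if r > 0 else -1)

def line_search_rho(ys, fs, hs, delta, lo=-10.0, hi=10.0, iters=100):
    """Ternary search for the rho minimizing sum_i L(y_i, f_i + rho * h_i).
    Assumes the minimizer lies in [lo, hi]; the bracket is a caller-supplied guess."""
    def total(rho):
        return sum(huber_loss(y, f + rho * h, delta) for y, f, h in zip(ys, fs, hs))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if total(m1) <= total(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Residual 2.0 exceeds delta = 1.0, so the pseudoresponse is clipped to 1.0.
clipped = huber_pseudoresponse(3.0, 1.0, 1.0)
# Residual 0.5 is within delta, so the pseudoresponse is the residual itself.
unclipped = huber_pseudoresponse(1.5, 1.0, 1.0)

# With a very large delta the loss is purely quadratic and the exact minimizer is rho = 2.
rho = line_search_rho([2.0, 4.0], [0.0, 0.0], [1.0, 2.0], delta=100.0)
```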

The value of the transition point $\delta$ defines those residual values that are considered to be “outliers,” subject to absolute rather than squared-error loss. An optimal value will depend on the distribution of $y - F^*(x)$, where $F^*$ is the true target function (1). A common practice is to choose the value of $\delta$ to be the $\alpha$-quantile of the distribution of $|y - F^*(x)|$, where $(1 - \alpha)$ controls the breakdown point of the procedure. The “breakdown point” is the fraction of observations that can be arbitrarily modified without seriously degrading the quality of the result. Since $F^*(x)$ is unknown one uses the current estimate $F_{m-1}(x)$ as an approximation at the $m$th iteration. The distribution of $|y - F_{m-1}(x)|$ is estimated by the current residuals, leading to

$$\delta_m = \operatorname{quantile}_{\alpha} \{\, |y_i - F_{m-1}(x_i)| \,\}_1^N.$$
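As a small illustration of choosing the transition point from the current residuals, the sketch below computes $\delta_m$ with a nearest-rank quantile. The quantile convention is an assumption (the text does not fix one), and the function name and the data are hypothetical.

```python
import math

def delta_from_residuals(ys, preds, alpha=0.9):
    """Return the alpha-quantile of |y_i - F_{m-1}(x_i)| using the nearest-rank rule."""
    abs_res = sorted(abs(y - p) for y, p in zip(ys, preds))
    k = max(0, math.ceil(alpha * len(abs_res)) - 1)   # zero-based nearest-rank index
    return abs_res[k]

# Made-up targets and current predictions; the last point is a gross outlier.
ys = [1.125, 0.75, 2.25, 1.625, 3.5, 4.0, 7.0, 45.0]
preds = [1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 5.0, 5.0]

delta_m = delta_from_residuals(ys, preds, alpha=0.75)
# Residuals larger in magnitude than delta_m receive the linear (absolute) branch of the loss.
n_outliers = sum(abs(y - p) > delta_m for y, p in zip(ys, preds))
```

With $\alpha = 0.75$ and eight observations, the two largest absolute residuals fall above $\delta_m$ in this example.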





With regression trees as base learners we use the strategy of Section 4.3, that is, a separate update (18) in each terminal node $R_{jm}$. For the Huber loss
