Gradient Boosting Machine (GBM): Principles and Parameter Selection

Updated 2024-07-21 · PDF, 187 KB

"GBM是Gradient Boosting Machine的缩写，是一种广泛应用的数据挖掘算法，它通过构建一系列弱预测模型并逐步提升其性能来创建一个强预测模型。该算法基于提升方法，结合了梯度下降和决策树的概念。在R语言中，有一个名为`gbm`的包用于实现GBM，该包提供了多种回归和分类任务的支持，包括最小二乘法、绝对损失、t分布损失、分位数回归、逻辑回归、多分类逻辑回归、泊松回归以及Cox比例风险模型等。此外，GBM还可以用于AdaBoost算法和学习排名（LambdaMart）的实现。`gbm`包依赖于`R`、`survival`、`lattice`、`splines`和`parallel`等其他包，建议使用`RUnit`进行测试。" GBM算法的核心思想是通过迭代构建一系列弱预测器，并根据它们的预测误差来调整权重，最终将这些弱预测器组合成一个强预测模型。在每个迭代步骤中，GBM试图最小化残差平方和的负梯度，因此得名梯度增强。这个过程可以视为优化目标函数的过程，目标函数通常是对模型预测值与真实值之差的某个损失函数。在实际应用中，GBM的关键参数包括： 1. **n.trees**: 指定了要构建的决策树数量。增加树的数量可以提高模型的复杂度和准确性，但可能导致过拟合。 2. **interaction.depth**: 决策树中允许的最大节点深度，控制模型的复杂度和计算量。 3. **n.minobsinnode**: 每个内部节点需要的最少样本数，防止树过于稀疏或过拟合。 4. **shrinkage**: 学习率，用于控制每次迭代的步长，降低模型复杂度，防止过拟合。 5. **bag.fraction**: 随机抽样的子集比例，用于 Bagging，可以减少过拟合并提高泛化能力。 6. **cv.folds**: 交叉验证的折数，用于评估模型性能和调整超参数。 7. **verbose**: 控制输出信息的详细程度。 `gbm`包中的函数如`gbm()`用于训练模型，`predict.gbm()`用于预测，`plot.gbm()`用于可视化模型，`gbm.perf()`用于评估模型性能，`gbm.crossval()`用于交叉验证，还有其他辅助函数如`calibrate.plot()`用于校准和展示模型的预测概率。 GBM的优势在于其灵活性和对非线性关系的良好处理能力，但需要注意的是，由于其迭代和树构建的特性，GBM可能会消耗大量计算资源，尤其是在大数据集上。为了提高效率，可以使用并行计算功能（如`parallel`包），并合理设置参数以平衡模型复杂度和计算成本。同时，正则化和特征选择也是优化GBM性能的重要手段。

        distribution = "bernoulli",
        w = NULL,
        var.monotone = NULL,
        n.trees = 100,
        interaction.depth = 1,
        n.minobsinnode = 10,
        shrinkage = 0.001,
        bag.fraction = 0.5,
        nTrain = NULL,
        train.fraction = NULL,
        keep.data = TRUE,
        verbose = TRUE,
        var.names = NULL,
        response.name = "y",
        group = NULL)

    gbm.more(object,
             n.new.trees = 100,
             data = NULL,
             weights = NULL,
             offset = NULL,
             verbose = NULL)
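As a rough illustration of how these arguments fit together, the sketch below fits a small model through the formula interface `gbm()` and then extends it with `gbm.more()`. It assumes the `gbm` package is installed; the data frame `d`, its columns, and the chosen parameter values are made up for the example.

```r
# Assumes the gbm package is installed; the data are simulated for illustration.
library(gbm)

set.seed(1)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- rbinom(n, 1, plogis(2 * d$x1 - 1))   # 0-1 outcome

fit <- gbm(y ~ x1 + x2,
           data = d,
           distribution = "bernoulli",
           n.trees = 100,
           interaction.depth = 1,
           n.minobsinnode = 10,
           shrinkage = 0.05,
           bag.fraction = 0.5,
           keep.data = TRUE,   # keep the data so gbm.more can reuse it
           verbose = FALSE)

# Add 100 more trees to the existing fit instead of refitting from scratch.
fit2 <- gbm.more(fit, n.new.trees = 100, verbose = FALSE)

# Predicted probabilities using all trees in the extended model.
p <- predict(fit2, newdata = d, n.trees = fit2$n.trees, type = "response")
```

With `keep.data = FALSE` in the first call, the data (and any offset) would have to be passed to `gbm.more` again.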

Arguments

formula: a symbolic description of the model to be fit. The formula may include an offset term (e.g. y~offset(n)+x). If keep.data=FALSE in the initial call to gbm then it is the user’s responsibility to resupply the offset to gbm.more.

distribution: either a character string specifying the name of the distribution to use or a list with a component name specifying the distribution and any additional parameters needed. If not specified, gbm will try to guess: if the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed; otherwise, if the response has class "Surv", coxph is assumed; otherwise, gaussian is assumed.

Currently available options are "gaussian" (squared error), "laplace" (absolute loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (huberized hinge loss for 0-1 outcomes), "multinomial" (classification when there are more than 2 classes), "adaboost" (the AdaBoost exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right censored observations), "quantile", or "pairwise" (ranking measure using the LambdaMart algorithm).

If quantile regression is specified, distribution must be a list of the form list(name="quantile",alpha=0.25) where alpha is the quantile to estimate. The current version’s quantile regression method does not handle non-constant weights and will stop.

If "tdist" is specified, the default degrees of freedom is 4 and this can be controlled by specifying distribution=list(name="tdist", df=DF) where DF is your chosen degrees of freedom.
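The list form of distribution described above might be used as in the following sketch. It assumes the `gbm` package is installed; the simulated data frame `d`, the choice of df = 6, and the other parameter values are illustrative assumptions, not recommendations from the manual.

```r
# Assumes the gbm package is installed; the data are simulated for illustration.
library(gbm)

set.seed(2)
n <- 400
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 3 * d$x1 + rnorm(n)

# Quantile regression for the 25th percentile: distribution is a list
# carrying the extra parameter alpha. No weights are supplied, so they
# stay constant, as the quantile method requires.
fit_q <- gbm(y ~ x1 + x2, data = d,
             distribution = list(name = "quantile", alpha = 0.25),
             n.trees = 200, shrinkage = 0.05, verbose = FALSE)

# t-distribution loss with 6 degrees of freedom instead of the default 4.
fit_t <- gbm(y ~ x1 + x2, data = d,
             distribution = list(name = "tdist", df = 6),
             n.trees = 200, shrinkage = 0.05, verbose = FALSE)

# Share of training observations below the fitted 25% quantile.
pred_q <- predict(fit_q, newdata = d, n.trees = 200)
share_below <- mean(d$y < pred_q)
```

Distributions without extra parameters can be given either as a plain string such as "gaussian" or as a list with only a name component.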
