Building Predictive Models with the caret Package in R: Feature Demonstration and Parameter Tuning


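Building predictive models in R is an important skill that is widely applied in science, finance, and other fields. The caret package, short for classification and regression training, is a powerful toolkit that brings together many of R's classification and regression models, with the aim of simplifying model training and tuning across a range of machine learning techniques. Beyond data pre-processing functions such as feature selection and transformation, the package also includes methods for computing variable importance and tools for visualizing models. Its core features include:

1. **Model development tools**: caret integrates a variety of classification and regression algorithms, such as logistic regression, decision trees, and support vector machines, behind a convenient interface for building models and comparing their performance.
2. **Model training and tuning**: by wrapping the training process, the package simplifies setting and tuning parameters, which helps improve model accuracy. Built-in grid search and cross-validation help optimize model parameters.
3. **Data pre-processing**: the package includes pre-processing steps such as standardization, normalization, and missing-value handling, so that models are trained on high-quality data.
4. **Variable importance assessment**: methods such as recursive feature elimination (RFE) or random-forest-based measures gauge each feature's contribution to the model's predictions and help in understanding the relationships and influence among variables.
5. **Model visualization**: graphical displays of model performance, such as ROC curves and confusion matrices, help in understanding and interpreting model results.
6. **Parallel processing support**: for large data sets, caret uses multi-core computing to speed up model training, which matters especially for time-sensitive applications.

Using a real data set from computational chemistry as an example, the paper shows how to build models with caret and compares the performance gains of different models under parallel processing. Through this case study, readers can work hands-on and experience the convenience and efficiency that caret offers.

Keywords: model building, parameter tuning, parallel computing, R, workspace.

The caret package is a powerful tool for building and optimizing predictive models in R, and both beginners and experienced data scientists can benefit from it. Mastering it can substantially improve productivity and model quality in real projects.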

Journal of Statistical Software 5

correlations are below a threshold:

repeat

    Find the pair of predictors with the largest absolute correlation;

    For both predictors, compute the average correlation between each predictor and all of the other variables;

    Flag the variable with the largest mean correlation for removal;

    Remove this row and column from the correlation matrix;

until no correlations are above a threshold;

This algorithm can be used to find the minimal set of predictors that can be removed so that the pairwise correlations are below a specific threshold. Note that, if two variables have a high correlation, the algorithm determines which one is involved with the most pairwise correlations and removes it.
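The loop described above can be sketched in base R. This is only an illustration of the greedy procedure, not caret's own implementation: `filter_correlated` is a hypothetical helper name, the toy data are made up, and the sketch assumes a square correlation matrix with no missing values.

```r
# Greedy correlation filter (illustrative sketch, not caret's findCorrelation).
# `filter_correlated` is a hypothetical helper name.
filter_correlated <- function(corr, cutoff = 0.90) {
  stopifnot(is.matrix(corr), nrow(corr) == ncol(corr))
  a <- abs(corr)
  diag(a) <- 0                       # ignore self-correlations
  keep <- seq_len(ncol(a))           # columns still under consideration
  removed <- integer(0)
  while (length(keep) > 1) {
    sub <- a[keep, keep, drop = FALSE]
    if (max(sub) <= cutoff) break    # no remaining pair exceeds the cutoff
    # Pair of predictors with the largest absolute correlation
    pair <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    # Mean absolute correlation of each member with the other remaining variables
    mean_i <- sum(sub[pair[1], ]) / (length(keep) - 1)
    mean_j <- sum(sub[pair[2], ]) / (length(keep) - 1)
    drop_pos <- if (mean_i >= mean_j) pair[1] else pair[2]
    removed <- c(removed, keep[drop_pos])   # flag for removal
    keep <- keep[-drop_pos]                 # drop its row and column
  }
  removed
}

# Toy data: x2 is almost a copy of x1, x3 is unrelated
set.seed(1)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.01)
x3 <- rnorm(50)
dat <- cbind(x1, x2, x3)
drop_idx <- filter_correlated(cor(dat), cutoff = 0.90)
```

One practical point when using the returned indices: in R, `dat[, -integer(0)]` selects zero columns, so it is worth checking `length(drop_idx) > 0` before subsetting with negative indices.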

For illustration, predictors that result in absolute pairwise correlations greater than 0.90 can be removed using the findCorrelation function. This function returns an index of column numbers for removal.

R> ncol(trainDescr)

[1] 1576

R> descrCorr <- cor(trainDescr)

R> highCorr <- findCorrelation(descrCorr, 0.90)

R> trainDescr <- trainDescr[, -highCorr]

R> testDescr <- testDescr[, -highCorr]

R> ncol(trainDescr)

[1] 650

For chemical descriptors, it is not uncommon to have many very large correlations between the predictors. In this case, using a threshold of 0.90, we eliminated 926 descriptors from the data.

Once the final set of predictors is determined, the values may require transformations before being used in a model. Some models, such as partial least squares, neural networks and support vector machines, need the predictor variables to be centered and/or scaled. The preProcess function can be used to determine values for predictor transformations using the training set and can be applied to the test set or future samples. The function has an argument, method, that can have possible values of "center", "scale", "pca" and "spatialSign". The first two options provide simple location and scale transformations of each predictor (and are the default values of method). The predict method for this class is then used to apply the processing to new samples:

R> xTrans <- preProcess(trainDescr)

R> trainDescr <- predict(xTrans, trainDescr)

R> testDescr <- predict(xTrans, testDescr)
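The key idea in the step above is that centering and scaling values are estimated once from the training set and then reused for any other data. A minimal base R sketch of that idea follows; it does not call preProcess, and `train_x` and `test_x` are hypothetical toy stand-ins for the descriptor matrices.

```r
# Toy stand-ins for the training and test predictors (hypothetical data)
set.seed(42)
train_x <- matrix(rnorm(40, mean = 5, sd = 2), ncol = 2,
                  dimnames = list(NULL, c("d1", "d2")))
test_x  <- matrix(rnorm(10, mean = 5, sd = 2), ncol = 2,
                  dimnames = list(NULL, c("d1", "d2")))

# Estimate location and scale from the training set only
centers <- colMeans(train_x)
scales  <- apply(train_x, 2, sd)

# Apply the same training-set statistics to both data sets
train_scaled <- scale(train_x, center = centers, scale = scales)
test_scaled  <- scale(test_x,  center = centers, scale = scales)
```

After this, the training columns have mean 0 and standard deviation 1, while the test columns generally do not, which is expected: the test set is transformed with the training set's parameters rather than its own.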

6 caret: Building Predictive Models in R

The "pca" option computes loadings for principal component analysis that can be applied to any other data set. In order to determine how many components should be retained, the preProcess function has an argument called thresh that is a threshold for the cumulative percentage of variance captured by the principal components. The function will add components until the cumulative percentage of variance is above the threshold. Note that the data are automatically scaled when method = "pca", even if the method value did not indicate that scaling was needed. For PCA transformations, the predict method generates values with column names "PC1", "PC2", etc.
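The cumulative-variance rule can be illustrated with base R's prcomp. This is a sketch of the idea rather than of preProcess internals: the toy matrix, the 0.95 threshold, and the use of `>=` for the comparison are assumptions made for illustration.

```r
# Toy predictors with two nearly redundant columns (hypothetical data)
set.seed(7)
x <- matrix(rnorm(200), ncol = 4)
x[, 2] <- x[, 1] + rnorm(50, sd = 0.1)

# PCA on centered and scaled predictors
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance captured by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

thresh <- 0.95                           # assumed threshold for this example
n_comp <- which(cum_var >= thresh)[1]    # smallest number of components reaching it

# Scores for the retained components; columns are named PC1, PC2, ...
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
colnames(scores)
```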

Specifying method = "spatialSign" applies the spatial sign transformation (Serneels et al. 2006) where the predictor values for each sample are projected onto a unit circle using $x^* = x/\|x\|$. This transformation may help when there are outliers in the x space of the training set.
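As a small worked example of the projection, the following base R sketch divides each sample (row) by its Euclidean norm. `spatial_sign` is a hypothetical helper written for illustration, and leaving all-zero rows unchanged is a choice made here, not something the text specifies.

```r
# Spatial sign: project each row onto the unit sphere (illustrative helper)
spatial_sign <- function(x) {
  x <- as.matrix(x)
  norms <- sqrt(rowSums(x^2))   # Euclidean norm of each sample
  norms[norms == 0] <- 1        # leave all-zero rows unchanged (assumption)
  x / norms                     # the vector recycles down columns, so row i is divided by norms[i]
}

m <- rbind(c(3, 4),
           c(0.1, 0),
           c(100, -100))        # the last row plays the role of an outlier
s <- spatial_sign(m)
s[1, ]                          # 0.6 0.8
rowSums(s^2)                    # every row now has unit length
```

After the transformation the extreme third row has the same length as every other sample, which is why the transformation can reduce the influence of outliers. Because the result depends on the origin and units of the predictors, it is usually sensible to center and scale them first.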

4. Building and tuning models

The train function can be used to select values of model tuning parameters (if any) and/or estimate model performance using resampling. As an example, a radial basis function support vector machine (SVM) can be used to classify the samples in our computational chemistry data. This model has two tuning parameters. The first is the scale parameter σ in the radial basis function

$$K(a, b) = \exp(-\sigma \|a - b\|^2)$$

and the other is the cost value C used to control the complexity of the decision boundary.
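To make the role of σ concrete, here is a direct base R transcription of the radial basis function kernel with the squared Euclidean distance in the exponent. `rbf_kernel` is a hypothetical helper for illustration only.

```r
# Radial basis function kernel: exp(-sigma * ||a - b||^2)
rbf_kernel <- function(a, b, sigma) {
  exp(-sigma * sum((a - b)^2))
}

a <- c(1, 2)
b <- c(2, 4)                      # squared distance to a is 1 + 4 = 5

rbf_kernel(a, a, sigma = 0.5)     # identical points: exp(0) = 1
k_small <- rbf_kernel(a, b, sigma = 0.1)   # exp(-0.5)
k_large <- rbf_kernel(a, b, sigma = 1)     # exp(-5)
```

Larger values of σ make the kernel decay faster with distance, so each training sample influences a smaller neighborhood and the decision boundary can become more irregular.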

We can create a grid of candidate tuning values to evaluate. Using resampling methods, such as the bootstrap or cross-validation, a set of modified data sets is created from the training samples. Each data set has a corresponding set of hold-out samples. For each candidate tuning parameter combination, a model is fit to each resampled data set and is used to predict the corresponding held-out samples. The resampling performance is estimated by aggregating the results of each hold-out sample set. These performance estimates are used to evaluate which combination(s) of the tuning parameters are appropriate. Once the final tuning values are assigned, the final model is refit using the entire training set.
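The resampling procedure above can be sketched in base R without caret. To keep the example self-contained, it swaps the SVM for a polynomial regression whose degree plays the role of the tuning parameter; the simulated data, the candidate grid, and the 25 bootstrap iterations are all assumptions made for illustration.

```r
# Simulated training data (hypothetical)
set.seed(123)
n <- 100
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.2)

degrees <- 1:5      # grid of candidate tuning values (polynomial degree)
n_boot  <- 25       # number of bootstrap resamples

rmse <- matrix(NA_real_, nrow = n_boot, ncol = length(degrees))
for (b in seq_len(n_boot)) {
  in_bag  <- sample(n, replace = TRUE)        # resampled training rows
  out_bag <- setdiff(seq_len(n), in_bag)      # corresponding hold-out rows
  for (j in seq_along(degrees)) {
    # Fit on the resampled data, predict the held-out samples
    fit  <- lm(y ~ poly(x, degree = degrees[j]), data = dat[in_bag, ])
    pred <- predict(fit, newdata = dat[out_bag, ])
    rmse[b, j] <- sqrt(mean((dat$y[out_bag] - pred)^2))
  }
}

# Aggregate the hold-out results for each candidate value
mean_rmse   <- colMeans(rmse)
best_degree <- degrees[which.min(mean_rmse)]

# Refit the final model on the entire training set
final_fit <- lm(y ~ poly(x, degree = best_degree), data = dat)
```

The same three stages appear regardless of the model: fit and predict inside each resample, aggregate per candidate value, then refit once with the selected value.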

For the train function, the possible resampling methods are: bootstrapping, k-fold cross-validation, leave-one-out cross-validation, and leave-group-out cross-validation (i.e., repeated splits without replacement). By default, 25 iterations of the bootstrap are used as the resampling scheme. In this case, the number of iterations was increased to 200 due to the large number of samples in the training set.

For this particular model, it turns out that there is an analytical method for directly estimating a suitable value of σ from the training data (Caputo et al. 2002). By default, the train function uses the sigest function in the kernlab package (Karatzoglou et al. 2004) to initialize this parameter. With σ set this way, the cost parameter C is the only remaining tuning parameter.
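A hedged sketch of how these pieces might fit together is shown below. It assumes the caret and kernlab packages are installed, uses the built-in iris data instead of the chemistry data, and uses only 10 bootstrap iterations so that it runs quickly; none of these choices come from the text, and the tuning grid is left to train's defaults via tuneLength.

```r
# Runs only if both packages are available (assumption: they are installed)
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("kernlab", quietly = TRUE)) {
  suppressPackageStartupMessages(library(caret))

  set.seed(2)
  x <- as.matrix(iris[, 1:4])    # numeric predictors only
  y <- iris$Species              # factor outcome, so this is classification

  # Data-driven estimate of a reasonable range for sigma
  sig_range <- kernlab::sigest(x, scaled = TRUE)

  # Bootstrap resampling with a small budget for illustration
  ctrl <- trainControl(method = "boot", number = 10)

  # Radial basis function SVM; tuneLength sets the size of the default grid
  svm_fit <- train(x, y, method = "svmRadial",
                   tuneLength = 4, trControl = ctrl)

  print(sig_range)
  print(svm_fit$bestTune)        # selected sigma and C
}
```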

The train function has the following arguments:

x: a matrix or data frame of predictors. Currently, the function only accepts numeric values (i.e., no factors or character variables). In some cases, the model.matrix function may be needed to generate a data frame or matrix of purely numeric data.

y: a numeric or factor vector of outcomes. The function determines the type of problem (classification or regression) from the type of the response given in this argument.

method: a character string specifying the type of model to be used. See Table 1 for the possible values.
