are automatically scaled when method = "pca", even if the method value did not indicate
that scaling was needed. For PCA transformations, the predict method generates values with
column names "PC1", "PC2", etc.
Specifying method = "spatialSign" applies the spatial sign transformation (Serneels et al.
2006), where the predictor values for each sample are projected onto a unit circle using
$x^* = x/\|x\|$. This transformation may help when there are outliers in the x space of the
training set.
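The projection above can be illustrated with a few lines of base R. This is a minimal sketch rather than the package's own implementation: the helper name `spatial_sign` and the toy matrix are assumptions made for illustration, and the predictors are assumed to have been centered and scaled beforehand.

```r
# Project each row (sample) of a numeric matrix onto the unit sphere:
# x* = x / ||x||, where ||x|| is the Euclidean norm of the row.
# `spatial_sign` is a hypothetical helper name, not a caret function.
spatial_sign <- function(x) {
  x <- as.matrix(x)
  norms <- sqrt(rowSums(x^2))
  norms[norms == 0] <- 1  # leave all-zero rows unchanged instead of dividing by zero
  x / norms               # the vector recycles down columns, so row i is divided by norms[i]
}

# Toy data: the third row is an outlier in the predictor space.
x <- rbind(c(3, 4), c(0.1, 0.2), c(300, -400))
ss <- spatial_sign(x)
rowSums(ss^2)  # squared length of every transformed row
```

After the transformation every sample has the same length, so an extreme sample such as the third row can no longer dominate distance-based calculations.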
4. Building and tuning models
The train function can be used to select values of model tuning parameters (if any) and/or
estimate model performance using resampling. As an example, a radial basis function support
vector machine (SVM) can be used to classify the samples in our computational chemistry
data. This model has two tuning parameters. The first is the scale parameter σ in the radial
basis function
$$K(a, b) = \exp(-\sigma \|a - b\|^2)$$
and the other is the cost value C used to control the complexity of the decision boundary.
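To make the role of σ concrete, the kernel can be written directly in base R. This is only an illustrative sketch: the function name `rbf_kernel` and the example vectors are assumptions, and an SVM implementation would normally evaluate the kernel internally.

```r
# Radial basis function kernel: K(a, b) = exp(-sigma * ||a - b||^2).
# `rbf_kernel` is a hypothetical helper written for illustration.
rbf_kernel <- function(a, b, sigma) {
  exp(-sigma * sum((a - b)^2))
}

a <- c(1, 2)
b <- c(2, 4)

k_same  <- rbf_kernel(a, a, sigma = 0.5)  # identical points: exp(0)
k_small <- rbf_kernel(a, b, sigma = 0.5)  # ||a - b||^2 = 5, so exp(-2.5)
k_large <- rbf_kernel(a, b, sigma = 5)    # larger sigma: similarity decays faster
```

Larger values of σ make the kernel value fall off more quickly with distance, which is why the scale parameter has to be chosen with some care.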
We can create a grid of candidate tuning values to evaluate. Using resampling methods, such
as the bootstrap or cross-validation, a set of modified data sets is created from the training
samples. Each data set has a corresponding set of hold-out samples. For each candidate tuning
parameter combination, a model is fit to each resampled data set and is used to predict the
corresponding hold-out samples. The resampling performance is estimated by aggregating the
results across the hold-out sample sets. These performance estimates are used to evaluate which
combination(s) of the tuning parameters are appropriate. Once the final tuning values are
chosen, the final model is refit using the entire training set.
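The procedure just described can be sketched in base R. To keep the example self-contained, a polynomial regression fit with lm stands in for the SVM, and the polynomial degree plays the role of the tuning parameter; the simulated data, the grid of degrees, and all variable names are assumptions made for illustration.

```r
set.seed(42)

# Simulated training set (a stand-in for real data).
n <- 100
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

degrees <- 1:5      # grid of candidate tuning values
n_resamples <- 25   # number of bootstrap resamples

# rmse[i, j]: hold-out error of candidate j on resample i.
rmse <- matrix(NA_real_, nrow = n_resamples, ncol = length(degrees))

for (i in seq_len(n_resamples)) {
  in_bag  <- sample(n, replace = TRUE)      # bootstrap sample of row indices
  out_bag <- setdiff(seq_len(n), in_bag)    # corresponding hold-out samples
  for (j in seq_along(degrees)) {
    fit  <- lm(y ~ poly(x, degrees[j]), data = dat[in_bag, ])
    pred <- predict(fit, newdata = dat[out_bag, ])
    rmse[i, j] <- sqrt(mean((dat$y[out_bag] - pred)^2))
  }
}

# Aggregate over resamples and pick the best candidate.
resampled_rmse <- colMeans(rmse)
best_degree <- degrees[which.min(resampled_rmse)]

# Refit the final model on the entire training set.
final_fit <- lm(y ~ poly(x, best_degree), data = dat)
```

The train function automates this loop for the models it supports, so the sketch is only meant to show the logic, not to replace it.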
For the train function, the possible resampling methods are: bootstrapping, k-fold
cross-validation, leave-one-out cross-validation, and leave-group-out cross-validation (i.e.,
repeated splits without replacement). By default, 25 iterations of the bootstrap are used as the
resampling scheme. In this case, the number of iterations was increased to 200 due to the large
number of samples in the training set.
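As a hedged illustration, the resampling schemes listed above might be requested through the trainControl function. The method strings and argument names below follow current caret releases and may differ from the version described in this paper; the object names and the fold, split, and proportion values (other than the 200 bootstrap iterations) are assumptions.

```r
# Assumes the caret package is installed; the guard skips the example otherwise.
if (requireNamespace("caret", quietly = TRUE)) {
  # Bootstrap with 200 iterations, as used for this example.
  boot_ctrl <- caret::trainControl(method = "boot", number = 200)

  # 10-fold cross-validation.
  cv_ctrl <- caret::trainControl(method = "cv", number = 10)

  # Leave-one-out cross-validation.
  loo_ctrl <- caret::trainControl(method = "LOOCV")

  # Leave-group-out cross-validation: 25 repeated splits, 75% used for fitting.
  lgo_ctrl <- caret::trainControl(method = "LGOCV", number = 25, p = 0.75)
}
```

The resulting control object would then be passed to train through its trControl argument.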
For this particular model, it turns out that there is an analytical method for directly estimating
a suitable value of σ from the training data (Caputo et al. 2002). By default, the train
function uses the sigest function in the kernlab package (Karatzoglou et al. 2004) to initialize
this parameter. With σ estimated in this way, the cost parameter C is the only tuning parameter
left to optimize.
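The estimate can also be inspected directly by calling sigest. The sketch below assumes the kernlab package is installed and uses the built-in iris measurements purely as stand-in data; the exact form of the returned value may vary between kernlab versions.

```r
# Assumes the kernlab package is installed; the guard skips the example otherwise.
if (requireNamespace("kernlab", quietly = TRUE)) {
  x <- as.matrix(iris[, 1:4])  # numeric predictors only (stand-in data)

  set.seed(1)  # sigest works on a random subsample of the rows
  sigma_range <- kernlab::sigest(x, frac = 0.5, scaled = TRUE)

  # A small set of sigma values spanning a range judged reasonable for these data.
  print(sigma_range)
}
```

Any value within the reported range is typically a reasonable choice for σ, which is what allows the tuning effort to be spent on C alone.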
The train function has the following arguments:
x: a matrix or data frame of predictors. Currently, the function only accepts numeric
values (i.e., no factors or character variables). In some cases, the model.matrix function
may be needed to generate a data frame or matrix of purely numeric data.
y: a numeric or factor vector of outcomes. The function determines the type of problem
(classification or regression) from the type of the response given in this argument.
method: a character string specifying the type of model to be used. See Table 1 for the
possible values.