1 Variability, Information, and Prediction
Machine learning refers to the use of formal structures (machines) to do inference
(learning). This includes what empirical scientists mean by model building – proposing
mathematical expressions that encapsulate the mechanism by which a physical process
gives rise to observations – but much else besides. In particular, it includes many tech-
niques that do not correspond to physical modeling, provided they process data into
information. Here, information usually means anything that helps reduce uncertainty.
So, for instance, a posterior distribution represents “information” or is a “learner” be-
cause it reduces the uncertainty about a parameter.
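As a concrete instance (a standard conjugate-normal calculation, supplied here for illustration rather than taken from the text): if $X_1,\dots,X_n \sim N(\theta,\sigma^2)$ with $\sigma^2$ known and prior $\theta \sim N(\mu_0,\tau^2)$, the posterior variance is

\[
\operatorname{Var}(\theta \mid X_1,\dots,X_n) \;=\; \left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)^{-1} \;<\; \tau^2 ,
\]

strictly smaller than the prior variance, so the posterior genuinely carries information in the sense of reduced uncertainty about $\theta$.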
The fusion of statistics, computer science, electrical engineering, and database man-
agement with new questions led to a new appreciation of sources of errors. In narrow
parametric settings, increasing the sample size gives smaller standard errors. However,
if the model is wrong (and they all are), there comes a point in data gathering where
it is better to use some of your data to choose a new model rather than just to continue refining an existing estimate. That is, once you admit model uncertainty, the variance of your estimate can be made smaller and smaller, but the bias from misspecification remains constant. This is familiar from decomposing a mean squared error into its variance and bias components.
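A small simulation makes this concrete (a sketch; the quadratic truth, the straight-line fit, and all numerical choices below are illustrative assumptions, not from the text). Fitting a wrong model, a line, to data whose true mean function is quadratic, the variance of the fitted value at a point shrinks as the sample size grows while the bias does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitted_value_at(x0, n, reps=500):
    """Repeatedly fit a (misspecified) line to quadratic data;
    return the fitted values at x0 across replications."""
    vals = []
    for _ in range(reps):
        x = rng.uniform(-1, 1, n)
        y = x**2 + rng.normal(0, 0.1, n)   # true mean is quadratic
        b1, b0 = np.polyfit(x, y, 1)        # wrong model: a straight line
        vals.append(b0 + b1 * x0)
    return np.array(vals)

true_mean = 0.5**2                          # E[Y | x = 0.5]
for n in (20, 2000):
    f = fitted_value_at(0.5, n)
    print(f"n={n:5d}  variance={f.var():.5f}  bias={f.mean() - true_mean:+.3f}")
```

Under these choices the fitted line converges to intercept E[x^2] = 1/3 and slope 0, so the bias at x = 0.5 settles near 1/3 - 1/4, roughly +0.08, no matter how large n becomes, while the variance column keeps shrinking: exactly the bias-variance decomposition of mean squared error.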
Extensions of this animate DMML. Shrinkage methods (not the classical shrinkage,
but the shrinking of parameters to zero as in, say, penalized methods) represent a trade-
off among variable selection, parameter estimation, and sample size. The ideas become
trickier when one must select a basis as well. Just as there are well-known sums of
squares in ANOVA for quantifying the variability explained by different aspects of
the model, so will there be an extra variability corresponding to basis selection. In
addition, if one averages models, as in stacking or Bayes model averaging, extra layers
of variability (from the model weights and model list) must be addressed. Clearly,
good inference requires trade-offs among the biases and variances from each level of
modeling. It may be better, for instance, to “stack” a small collection of shrinkage-
derived models than to estimate the parameters in a single huge model.
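The last point can be sketched in code (an illustrative toy under assumed choices of data, penalties, and split, not the authors' procedure): fit a small collection of ridge-shrunk linear models with different penalties, then stack them by choosing combination weights via least squares on held-out predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 10 predictors, only the first 3 matter.
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + rng.normal(0, 1.0, n)

train, hold = slice(0, 150), slice(150, 200)

def ridge(X, y, lam):
    """Closed-form ridge estimate (shrinks coefficients toward zero)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# A small collection of shrinkage-derived models ...
lams = [0.1, 10.0, 1000.0]
coefs = [ridge(X[train], y[train], lam) for lam in lams]

# ... stacked: weights chosen by least squares on held-out predictions.
P = np.column_stack([X[hold] @ b for b in coefs])
w, *_ = np.linalg.lstsq(P, y[hold], rcond=None)
print("stacking weights:", np.round(w, 3))
```

By construction, the stacked predictor's held-out squared error is no worse than that of any single model in the collection, since each single model corresponds to one feasible weight vector in the least-squares problem.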
Among the sources of variability that must be balanced – random error, parameter
uncertainty and bias, model uncertainty or misspecification, model class uncertainty,
generalization error – there is one that stands out: model uncertainty. In the conven-
tional paradigm with fixed parametric models, there is no model uncertainty; only
parameter uncertainty remains. In conventional nonparametrics, there is only model
uncertainty; there is no parameter, and the model class is so large it is sure to con-
tain the true model. DMML is between these two extremes: The model class is rich
beyond parametrization, and may contain the true model in a limiting sense, but the
true model cannot be assumed to have the form the model class defines. Thus, there
are many parameters, leading to larger standard errors, but when these standard errors
are evaluated within the model, they are invalid: The adequacy of the model cannot be
assumed, so the standard error of a parameter is about a value that may not be meaningful. It is in these high-variability settings, in the mid-range of uncertainty between the parametric and nonparametric extremes, that careful handling of model uncertainty usually becomes the dominant issue, one that can only be assessed by predictive criteria.
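A minimal cross-validation sketch illustrates what a predictive criterion looks like (the cubic truth, candidate polynomial degrees, and fold scheme are assumptions made for the example): candidate mean functions are compared by held-out squared error rather than by within-model standard errors.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data from a cubic mean function with noise.
x = rng.uniform(-1, 1, 120)
y = x**3 - x + rng.normal(0, 0.2, 120)

def cv_error(degree, folds=5):
    """5-fold cross-validated squared error of a degree-d polynomial fit."""
    idx = np.arange(len(x))
    np.random.default_rng(0).shuffle(idx)   # fixed shuffle for reproducible folds
    errs = []
    for f in np.array_split(idx, folds):
        train = np.setdiff1d(idx, f)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[f]) - y[f]) ** 2))
    return float(np.mean(errs))

for d in (1, 3, 9):
    print(f"degree {d}: CV error {cv_error(d):.4f}")
```

The underfit line pays a bias penalty and the degree-9 fit pays a variance penalty, and the predictive criterion registers both without ever asking whether any candidate is the true model.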
Other perspectives on DMML exist, such as rule mining, fuzzy learning,
observational studies, and computational learning theory. To an extent, these can be
regarded as elaborations or variations of aspects of the perspective presented here,