Second, construct the optimal decision rule using the estimated parameters.
To estimate the densities, Fisher suggested the maximum likelihood method.
This scheme was later generalized to the case when the unknown density belongs to a nonparametric family. To estimate these generative models, the methods of nonparametric statistics were used (see the example in Chapter 2, Section 2.3.5). However, the main principle of finding the desired rule remained the same: first estimate the generative models of the data, and then use these models to find the discriminative rule.
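A minimal sketch of this two-step scheme, in Python and under the simplest parametric assumption (Gaussian class-conditional densities; the function names here are illustrative, not from EDBED): first estimate the densities of the two classes by maximum likelihood, then plug the estimates into the Bayes decision rule.

    import numpy as np

    def fit_gaussian_ml(X):
        """Maximum-likelihood estimates (mean, covariance) of a Gaussian from the rows of X."""
        mu = X.mean(axis=0)
        Xc = X - mu
        Sigma = Xc.T @ Xc / X.shape[0]   # the ML estimate divides by m, not m - 1
        return mu, Sigma

    def log_density(x, mu, Sigma):
        """Log-density of N(mu, Sigma) at the point x."""
        d = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet + len(x) * np.log(2 * np.pi))

    def generative_rule(X0, X1, x):
        """Step 1: estimate the generative models; step 2: use them as the decision rule."""
        (mu0, S0), (mu1, S1) = fit_gaussian_ml(X0), fit_gaussian_ml(X1)
        p0 = len(X0) / (len(X0) + len(X1))
        score0 = log_density(x, mu0, S0) + np.log(p0)
        score1 = log_density(x, mu1, S1) + np.log(1.0 - p0)
        return int(score1 > score0)

The same skeleton applies when the densities are estimated nonparametrically: only the first step changes, while the plug-in decision rule stays the same.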
This idea of constructing a decision rule after finding the generative models was later named the generative model of induction. This model is based on an understanding of how the data are generated. In a wide philosophical sense, an understanding of how the data are generated reflects an understanding of the corresponding law of nature.
By the time the Perceptron was introduced, classical discriminant analysis based on Gaussian distribution functions had been studied in great detail. One of the important results obtained for a particular model (two Gaussian distributions with the same covariance matrix) was the introduction of a concept called the Mahalanobis distance. A bound on the classification accuracy of the constructed linear discriminant rule depends on the value of the Mahalanobis distance.
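For this particular model (two Gaussian classes with means μ0 and μ1, a common covariance matrix Σ, and, for simplicity, equal priors; the parameters are assumed known, which is an idealization), the relevant quantities take only a few lines: the error of the optimal linear rule is Φ(−Δ/2), where Δ is the Mahalanobis distance between the means.

    import numpy as np
    from scipy.stats import norm

    def mahalanobis(mu0, mu1, Sigma):
        """Mahalanobis distance between the class means under the shared covariance Sigma."""
        d = mu1 - mu0
        return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

    def linear_rule_error(mu0, mu1, Sigma):
        """Error probability of the optimal linear rule for two equal-covariance Gaussians
        with equal priors; it depends on the model only through the Mahalanobis distance."""
        return norm.cdf(-mahalanobis(mu0, mu1, Sigma) / 2.0)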
However, to construct this model using classical methods requires the estimation of about 0.5n² parameters, where n is the dimensionality of the space. Roughly speaking, to estimate one parameter of the model requires C examples. Therefore, to solve the ten-digit recognition problem using the classical technique one needs ≈ 10(400)²C examples. The Perceptron used only 512.
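The arithmetic behind these figures, with C kept symbolic since the text does not fix its value (a back-of-the-envelope check, not a computation from EDBED):

    n = 400                          # a 20 x 20 pixel input
    cov_params = n * (n + 1) // 2    # ~0.5 * n**2 entries of a symmetric covariance matrix
    print(cov_params)                # 80200, i.e. about 0.5 * 400**2
    print(10 * 400**2)               # 1600000: the 10(400)^2 factor quoted above,
                                     # still to be multiplied by C examples per parameter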
This shocked theorists. It looked as if the classical statistical approach failed to
overcome the curse of dimensionality in a situation where a heuristic method that min-
imized the empirical loss easily overcame this curse.
Later, the methods based on the idea of minimizing different types of empirical losses were called the predictive (discriminative) models of induction, in contrast to the classical generative models. In a wide philosophical sense, predictive models do not necessarily connect the prediction of an event with an understanding of the law that governs the event; they just look for a function that explains the data best.⁴
The VC theory was constructed to justify the empirical risk minimization induction
principle: according to VC theory the generalization bounds for the methods that min-
imize the empirical loss do not depend directly on the dimension of the space. Instead
they depend on the so-called capacity factors of the admissible set of functions — the
VC entropy, the Growth function, or the VC dimension — that can be much smaller
than the dimensionality. (In EDBED they are called Entropy and Capacity; the names
VC entropy and VC dimension as well as VC theory appeared later due to R. Dudley.)
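For concreteness, one commonly quoted later form of such a bound (reproduced here from the general VC literature rather than from EDBED): for a set of indicator functions with VC dimension h, with probability at least 1 − η, simultaneously for all functions in the set,

\[
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}},
\]

where ℓ is the number of training examples. The dimensionality n of the input space does not enter the bound; only the capacity h does.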
⁴ It is interesting to note that, along with the classical generative models (which he was able to justify), Fisher also suggested a heuristic solution (one that belongs to the discriminative models), now called Fisher's linear discriminant function. This function minimizes an empirical loss functional whose construction is similar to the Mahalanobis distance. For a long time this heuristic of Fisher's was not considered an important result (it was ignored in most classical statistics textbooks). Only recently (after computers appeared and statistical learning theory became a subject not only of theoretical but also of practical justification) did Fisher's suggestion become a subject of interest.
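A minimal sketch of the discriminative reading of Fisher's suggestion (an illustration under standard assumptions, not Fisher's original derivation): fit a linear score to ±1-coded class labels by least squares. For two classes the resulting weight vector is proportional to the pooled-covariance direction Σ̂⁻¹(μ̂₁ − μ̂₀), so minimizing this particular empirical loss recovers the same linear discriminant up to scaling.

    import numpy as np

    def fisher_discriminant_by_least_squares(X0, X1):
        """Fit a linear score w.x + b by minimizing the empirical squared loss
        against labels -1 (class 0) and +1 (class 1)."""
        X = np.vstack([X0, X1])
        y = np.concatenate([-np.ones(len(X0)), np.ones(len(X1))])
        A = np.hstack([X, np.ones((len(X), 1))])    # append an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        w, b = coef[:-1], coef[-1]                  # w is proportional to Sigma_pooled^{-1}(mu1 - mu0)
        return w, b                                 # classify by the sign of w.x + b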