Second, construct the optimal decision rule using the estimated parameters.
To estimate the densities, Fisher suggested the maximum likelihood method.
This scheme was later generalized to the case when the unknown density belongs to a nonparametric family. To estimate these generative models, the methods of nonparametric statistics were used (see the example in Chapter 2, Section 2.3.5). However, the main principle of finding the desired rule remained the same: first estimate the generative models of the data, and then use these models to find the discriminative rule.
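A minimal sketch of this two-step scheme, in Python and under the simplest parametric assumption (Gaussian class-conditional densities; the function names here are illustrative, not from EDBED): first estimate the densities of the two classes by maximum likelihood, then plug the estimates into the Bayes decision rule.

    import numpy as np

    def fit_gaussian_ml(X):
        """Maximum-likelihood estimates (mean, covariance) of a Gaussian from the rows of X."""
        mu = X.mean(axis=0)
        Xc = X - mu
        Sigma = Xc.T @ Xc / X.shape[0]   # the ML estimate divides by m, not m - 1
        return mu, Sigma

    def log_density(x, mu, Sigma):
        """Log-density of N(mu, Sigma) at the point x."""
        d = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (d @ np.linalg.solve(Sigma, d) + logdet + len(x) * np.log(2 * np.pi))

    def generative_rule(X0, X1, x):
        """Step 1: estimate the generative models; step 2: use them as the decision rule."""
        (mu0, S0), (mu1, S1) = fit_gaussian_ml(X0), fit_gaussian_ml(X1)
        p0 = len(X0) / (len(X0) + len(X1))
        score0 = log_density(x, mu0, S0) + np.log(p0)
        score1 = log_density(x, mu1, S1) + np.log(1.0 - p0)
        return int(score1 > score0)

The same skeleton applies when the densities are estimated nonparametrically: only the first step changes, while the plug-in decision rule stays the same.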
This idea of constructing a decision rule after finding the generative models was later named the generative model of induction. This model is based on an understanding of how the data are generated. In a wide philosophical sense, an understanding of how the data are generated reflects an understanding of the corresponding law of nature.
By the time the Perceptron was introduced, classical discriminant analysis based on Gaussian distribution functions had been studied in great detail. One of the important results obtained for a particular model (two Gaussian distributions with the same covariance matrix) was the introduction of a concept called the Mahalanobis distance. A bound on the classification accuracy of the constructed linear discriminant rule depends on the value of the Mahalanobis distance.
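For this particular model (two Gaussian classes with means μ0 and μ1, a common covariance matrix Σ, and, for simplicity, equal priors; the parameters are assumed known, which is an idealization), the relevant quantities take only a few lines: the error of the optimal linear rule is Φ(−Δ/2), where Δ is the Mahalanobis distance between the means.

    import numpy as np
    from scipy.stats import norm

    def mahalanobis(mu0, mu1, Sigma):
        """Mahalanobis distance between the class means under the shared covariance Sigma."""
        d = mu1 - mu0
        return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

    def linear_rule_error(mu0, mu1, Sigma):
        """Error probability of the optimal linear rule for two equal-covariance Gaussians
        with equal priors; it depends on the model only through the Mahalanobis distance."""
        return norm.cdf(-mahalanobis(mu0, mu1, Sigma) / 2.0)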
However, to construct this model using classical methods requires the estimation of about 0.5n² parameters, where n is the dimensionality of the space. Roughly speaking, to estimate one parameter of the model requires C examples. Therefore, to solve the ten-digit recognition problem using the classical technique one needs ≈ 10(400)²C examples. The Perceptron used only 512.
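The arithmetic behind these figures, with C kept symbolic since the text does not fix its value (a back-of-the-envelope check, not a computation from EDBED):

    n = 400                          # a 20 x 20 pixel input
    cov_params = n * (n + 1) // 2    # ~0.5 * n**2 entries of a symmetric covariance matrix
    print(cov_params)                # 80200, i.e. about 0.5 * 400**2
    print(10 * 400**2)               # 1600000: the 10(400)^2 factor quoted above,
                                     # still to be multiplied by C examples per parameter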
This shocked theorists. It looked as if the classical statistical approach failed to
overcome the curse of dimensionality in a situation where a heuristic method that min-
imized the empirical loss easily overcame this curse.
Later, the methods based on the idea of minimizing different types of empirical losses were called the predictive (discriminative) models of induction, in contrast to the classical generative models. In a wide philosophical sense, predictive models do not necessarily connect the prediction of an event with an understanding of the law that governs the event; they just look for a function that explains the data best.⁴
The VC theory was constructed to justify the empirical risk minimization induction
principle: according to VC theory the generalization bounds for the methods that min-
imize the empirical loss do not depend directly on the dimension of the space. Instead
they depend on the so-called capacity factors of the admissible set of functions — the
VC entropy, the Growth function, or the VC dimension — that can be much smaller
than the dimensionality. (In EDBED they are called Entropy and Capacity; the names
VC entropy and VC dimension as well as VC theory appeared later due to R. Dudley.)
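For concreteness, one commonly quoted later form of such a bound (reproduced here from the general VC literature rather than from EDBED): for a set of indicator functions with VC dimension h, with probability at least 1 − η, simultaneously for all functions in the set,

\[
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}},
\]

where ℓ is the number of training examples. The dimensionality n of the input space does not enter the bound; only the capacity h does.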
⁴ It is interesting to note that, along with the classical generative models (which he was able to justify), Fisher also suggested a heuristic solution (one that belongs to the discriminative models), now called Fisher's linear discriminant function. This function minimizes an empirical loss functional whose construction is similar to the Mahalanobis distance. For a long time this heuristic of Fisher's was not considered an important result (it was ignored in most classical statistics textbooks). Only recently (after computers appeared and statistical learning theory became a subject not only of theoretical but also of practical justification) did Fisher's suggestion become a subject of interest.
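A minimal sketch of the discriminative reading of Fisher's suggestion (an illustration under standard assumptions, not Fisher's original derivation): fit a linear score to ±1-coded class labels by least squares. For two classes the resulting weight vector is proportional to the pooled-covariance direction Σ̂⁻¹(μ̂₁ − μ̂₀), so minimizing this particular empirical loss recovers the same linear discriminant up to scaling.

    import numpy as np

    def fisher_discriminant_by_least_squares(X0, X1):
        """Fit a linear score w.x + b by minimizing the empirical squared loss
        against labels -1 (class 0) and +1 (class 1)."""
        X = np.vstack([X0, X1])
        y = np.concatenate([-np.ones(len(X0)), np.ones(len(X1))])
        A = np.hstack([X, np.ones((len(X), 1))])    # append an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        w, b = coef[:-1], coef[-1]                  # w is proportional to Sigma_pooled^{-1}(mu1 - mu0)
        return w, b                                 # classify by the sign of w.x + b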