Pr[x₁(t₁) ≤ ξ₁|y(t₀) = η(t₀), …, y(t) = η(t)] = F(ξ₁)   (1)
Evidently, F(ξ₁) represents all the information which the measurement of the random variables y(t₀), …, y(t) has conveyed about the random variable x₁(t₁). Any statistical estimate of the random variable x₁(t₁) will be some function of this distribution and therefore a (nonrandom) function of the random variables y(t₀), …, y(t). This statistical estimate is denoted by X₁(t₁|t), or by just X₁(t₁) or X₁ when the set of observed random variables or the time at which the estimate is required are clear from context.
Suppose now that X₁ is given as a fixed function of the random variables y(t₀), …, y(t). Then X₁ is itself a random variable and its actual value is known whenever the actual values of y(t₀), …, y(t) are known. In general, the actual value of X₁(t₁) will be different from the (unknown) actual value of x₁(t₁). To arrive at a rational way of determining X₁, it is natural to assign a penalty or loss for incorrect estimates. Clearly, the loss should be a (i) positive, (ii) nondecreasing function of the estimation error ε = x₁(t₁) – X₁(t₁).
Thus we define a loss function by
L(0) = 0
L(ε₂) ≥ L(ε₁) ≥ 0 when ε₂ ≥ ε₁ ≥ 0   (2)
L(ε) = L(–ε)
Some common examples of loss functions are: L(ε) = aε², aε⁴, a|ε|, a[1 – exp(–ε²)], etc., where a is a positive constant.
One (but by no means the only) natural way of choosing the random variable X₁ is to require that this choice should minimize the average loss or risk

E{L[x₁(t₁) – X₁(t₁)]} = E[E{L[x₁(t₁) – X₁(t₁)]|y(t₀), …, y(t)}]   (3)
Since the first expectation on the right-hand side of (3) does not depend on the choice of X₁ but only on y(t₀), …, y(t), it is clear that minimizing (3) is equivalent to minimizing
E{L[x₁(t₁) – X₁(t₁)]|y(t₀), …, y(t)}   (4)
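The identity (3) is easily checked numerically; the sketch below uses a made-up discrete joint distribution and an arbitrary fixed estimator X and computes the risk both directly and as the average over the observations of the conditional risk (4):

    import numpy as np

    # joint probabilities p[i, j] = Pr[x = xv[i], y = yv[j]]  (made-up numbers)
    p = np.array([[0.10, 0.20],
                  [0.25, 0.15],
                  [0.05, 0.25]])
    xv = np.array([-1.0, 0.0, 2.0])
    yv = np.array([0.0, 1.0])

    L = lambda e: abs(e)                       # an admissible loss of type (2)
    X = lambda y: 0.5 * y                      # some fixed estimator X1 = X(y)

    # risk (3) computed directly over the joint distribution
    risk_direct = sum(p[i, j] * L(xv[i] - X(yv[j]))
                      for i in range(3) for j in range(2))

    # risk computed as the average over y of the conditional risk (4)
    p_y = p.sum(axis=0)
    risk_nested = sum(p_y[j] * sum(p[i, j] / p_y[j] * L(xv[i] - X(yv[j]))
                                   for i in range(3))
                      for j in range(2))

    print(risk_direct, risk_nested)            # the two figures agree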
Under just slight additional assumptions, optimal estimates can be
characterized in a simple way.
Theorem 1. Assume that L is of type (2) and that the conditional distribution function F(ξ) defined by (1) is:
(A) symmetric about the mean ξ̄:
F(ξ – ξ̄) = 1 – F(ξ̄ – ξ)
(B) convex for ξ ≤ ξ̄:
F(λξ₁ + (1 – λ)ξ₂) ≤ λF(ξ₁) + (1 – λ)F(ξ₂)
for all ξ₁, ξ₂ ≤ ξ̄ and 0 ≤ λ ≤ 1
Then the random variable x₁*(t₁|t) which minimizes the average loss (3) is the conditional expectation

x₁*(t₁|t) = E[x₁(t₁)|y(t₀), …, y(t)]   (5)
Proof: As pointed out recently by Sherman [25], this theorem
follows immediately from a well-known lemma in probability
theory.
Corollary. If the random processes {x₁(t)}, {x₂(t)}, and {y(t)} are gaussian, Theorem 1 holds.
Proof: By Theorem 5, (A) (see Appendix), conditional distributions on a gaussian random process are gaussian. Hence the requirements of Theorem 1 are always satisfied.
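A rough Monte Carlo illustration of the Corollary follows; the pair (x₁, y) below is a single jointly gaussian pair constructed for the purpose, and the conditional mean is compared against a deliberately biased estimate under several losses of type (2):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    y = rng.normal(size=n)
    x1 = 0.8 * y + 0.6 * rng.normal(size=n)    # jointly gaussian with y

    cond_mean = 0.8 * y                        # E[x1|y] for this construction
    biased = 0.8 * y + 0.3                     # some other estimate, for comparison

    for L in (lambda e: e**2, np.abs, lambda e: e**4):
        print(np.mean(L(x1 - cond_mean)), np.mean(L(x1 - biased)))
    # in every row the first (conditional-mean) figure is the smaller one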
In the control system literature, this theorem appears sometimes in a form which is more restrictive in one way and more general in another way:
Theorem 1-a. If L(ε) = ε², then Theorem 1 is true without assumptions (A) and (B).
Proof: Expand the conditional expectation (4):

E[x₁²(t₁)|y(t₀), …, y(t)] – 2X₁(t₁)E[x₁(t₁)|y(t₀), …, y(t)] + X₁²(t₁)

and differentiate with respect to X₁(t₁). This is not a completely rigorous argument; for a simple rigorous proof see Doob [15], pp. 77–78.
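The omitted differentiation can be checked symbolically; in the sketch below m and s merely stand for the conditional moments E[x₁(t₁)|y(t₀), …, y(t)] and E[x₁²(t₁)|y(t₀), …, y(t)]:

    import sympy as sp

    X, m, s = sp.symbols('X m s', real=True)   # X = candidate estimate; m, s = conditional moments
    risk = s - 2*X*m + X**2                    # the expanded conditional expectation above
    print(sp.solve(sp.diff(risk, X), X))       # [m]: the minimizing X1(t1) is the conditional mean
    print(sp.diff(risk, X, 2))                 # 2 > 0, so the stationary point is a minimum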
Remarks. (a) As far as the author is aware, it is not known what is the most general class of random processes {x₁(t)}, {x₂(t)} for which the conditional distribution function satisfies the requirements of Theorem 1.
(b) Aside from the note of Sherman, Theorem 1 apparently has
never been stated explicitly in the control systems literature. In
fact, one finds many statements to the effect that loss functions of
the general type (2) cannot be conveniently handled mathematically.
(c) In the sequel, we shall be dealing mainly with vector-valued random variables. In that case, the estimation problem is stated as: Given a vector-valued random process {x(t)} and observed random variables y(t₀), …, y(t), where y(t) = Mx(t) (M being a singular matrix; in other words, not all co-ordinates of x(t) can be observed), find an estimate X(t₁) which minimizes the expected loss E[L(||x(t₁) – X(t₁)||)], || || being the norm of a vector.
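A minimal sketch of this setup (with illustrative numbers only) shows how a singular M hides some co-ordinates of x from the observer:

    import numpy as np

    M = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])            # singular: the third co-ordinate is not observed
    x = np.array([0.7, -1.2, 2.5])             # (unknown) state
    y = M @ x                                  # observation y = Mx
    print(y)                                   # [ 0.7 -1.2  0. ]; x3 never enters the observation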
Theorem 1 remains true in the vector case also, provided we require that the conditional distribution function of the n co-ordinates of the vector x(t₁),

Pr[x₁(t₁) ≤ ξ₁, …, xₙ(t₁) ≤ ξₙ|y(t₀), …, y(t)] = F(ξ₁, …, ξₙ)

be symmetric with respect to the n variables ξ₁ – ξ̄₁, …, ξₙ – ξ̄ₙ and convex in the region where all of these variables are negative.
Orthogonal Projections
The explicit calculation of the optimal estimate as a function of
the observed variables is, in general, impossible. There is an
important exception: The processes {x₁(t)}, {x₂(t)} are gaussian.
On the other hand, if we attempt to get an optimal estimate
under the restriction L(ε) = ε² and the additional requirement that
the estimate be a linear function of the observed random
variables, we get an estimate which is identical with the optimal
estimate in the gaussian case, without the assumption of linearity
or quadratic loss function. This shows that results obtainable by
linear estimation can be bettered by nonlinear estimation only
when (i) the random processes are nongaussian and even then (in
view of Theorem 5, (C)) only (ii) by considering at least third-
order probability distribution functions.
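A brief sketch of this point (assuming zero-mean variables and illustrative numbers only): the best linear coefficients follow from the covariances alone, via the normal equations, and for gaussian data they reproduce the conditional expectation:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    Y = rng.normal(size=(n, 3))                # samples of three observed variables y(t0), ..., y(t)
    x1 = Y @ np.array([0.5, -0.2, 0.3]) + 0.4 * rng.normal(size=n)   # gaussian, linear in Y

    Syy = (Y.T @ Y) / n                        # covariance of the observations
    Sxy = (Y.T @ x1) / n                       # cross-covariance with x1(t1)
    a = np.linalg.solve(Syy, Sxy)              # coefficients of the best linear estimate
    print(a)                                   # close to [0.5, -0.2, 0.3], i.e., E[x1|y] here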
In the special cases just mentioned, the explicit solution of the
estimation problem is most easily understood with the help of a
geometric picture. This is the subject of the present section.
Consider the (real-valued) random variables y(t₀), …, y(t). The set of all linear combinations of these random variables with real coefficients

∑_{i=t₀}^{t} aᵢy(tᵢ)   (6)

forms a vector space (linear manifold) which we denote by Y(t). We regard, abstractly, any expression of the form (6) as a "point" or "vector" in Y(t); this use of the word "vector" should not be confused, of course, with "vector-valued" random variables, etc. Since we do not want to fix the value of t (i.e., the total number of possible observations), Y(t) should be regarded as a finite-dimensional subspace of the space of all possible observations.
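Concretely, an element of Y(t) may be pictured as follows (illustrative numbers; each observed variable is represented by a column of samples, and a choice of real coefficients aᵢ picks out one "vector" of the form (6)):

    import numpy as np

    rng = np.random.default_rng(2)
    Y = rng.normal(size=(1000, 4))             # samples of y(t0), ..., y(t), here t - t0 = 3
    a = np.array([0.3, -1.0, 0.0, 2.0])        # real coefficients a_i
    u = Y @ a                                  # one "vector" in Y(t), as in (6)
    print(u.shape)                             # a single random variable, represented by 1000 samples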