Remark 2.2 Note that the history $h_n$ here generalizes that in discrete-time models by taking into account the decision epochs $t_n$ as well as the states $i_n$ and actions $a_n$; see Hernández-Lerma [5, 6] and Puterman [15], for instance. However, we can view discrete-time models as special cases of semi-Markov models, in which $t_n = n$ for all $n$.
Now we are in a position to introduce the concept of a policy.
Definition 2.1 A randomized history-dependent policy is a sequence $\pi := \{\pi_n, n = 0, 1, \ldots\}$ of stochastic kernels $\pi_n$ on the action space $A$ given $H_n$ satisfying
$$\pi_n(A(i_n) \mid h_n) = 1 \quad \forall\, h_n \in H_n,\ n \geq 0.$$
The set of all randomized history-dependent policies is denoted by $\Pi$.
Let $\Phi$ represent the set of all stochastic kernels $\varphi$ on $A$ given $S$ such that $\varphi(A(i) \mid i) = 1$ for all $i \in S$, and let $F$ denote the set of all decision functions $f: S \to A$ such that $f(i)$ is in $A(i)$ for all $i \in S$. A policy $\pi$ is said to be a randomized Markov policy if there is a sequence $\{\varphi_n\}$ of stochastic kernels $\varphi_n \in \Phi$ such that $\pi_n(\cdot \mid h_n) = \varphi_n(\cdot \mid i_n)$ for every $h_n \in H_n$ and $n \geq 0$. A randomized Markov policy is said to be randomized stationary if there is a stochastic kernel $\varphi \in \Phi$ such that $\pi_n(\cdot \mid h_n) = \varphi(\cdot \mid i_n)$ for every $h_n \in H_n$ and $n \geq 0$. In this case, we write $\pi$ as $\varphi$ for simplicity. Further, a randomized Markov policy is said to be deterministic if there is a sequence $\{f_n\}$ of decision functions $f_n \in F$ such that $\pi_n(\cdot \mid h_n)$ is the Dirac measure at $f_n(i_n)$ for all $h_n \in H_n$ and $n \geq 0$. Thus, we write such policies as $\pi = \{f_n\}$. A deterministic Markov policy is said to be stationary if there is a decision function $f \in F$ such that $\pi_n(\cdot \mid h_n)$ is the Dirac measure at $f(i_n)$ for all $h_n \in H_n$ and $n \geq 0$. A deterministic stationary policy is simply referred to as a stationary policy and is denoted by $f$. We denote by $\Pi^{RM}$, $\Pi^{RS}$, $\Pi^{DM}$, and $\Pi^{DS}$ the families of all randomized Markov, randomized stationary, deterministic Markov, and stationary policies, respectively. Obviously, $\Pi^{RS} \subset \Pi^{RM} \subset \Pi$ and $\Pi^{DS} \subset \Pi^{DM} \subset \Pi$. Moreover, for a policy $\pi = \{\varphi_n\} \in \Pi^{RM}$ and $m \geq 1$, we let ${}^{(m)}\pi := \{\varphi_m, \varphi_{m+1}, \ldots\}$ denote the $m$-remainder policy of $\pi$.
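To make the distinction between these policy classes concrete, the following sketch is illustrative only: the two-state space, admissible action sets, and probabilities are hypothetical and not taken from the paper. It represents a randomized stationary kernel $\varphi \in \Phi$ and a stationary policy $f \in F$ as plain Python mappings, and checks the defining requirement $\varphi(A(i) \mid i) = 1$.

```python
# Hypothetical two-state example (not from the paper), for illustration only.
S = ["i1", "i2"]                                  # state space S
A = {"i1": ["a", "b"], "i2": ["a"]}               # admissible action sets A(i)

# A randomized stationary policy phi in Phi: phi(. | i) is a distribution on A(i).
phi = {"i1": {"a": 0.3, "b": 0.7}, "i2": {"a": 1.0}}

# A (deterministic) stationary policy f in F: f(i) is a single action in A(i).
f = {"i1": "b", "i2": "a"}

# f can be identified with the Dirac kernel at f(i), i.e. a special element of Phi.
phi_from_f = {i: {f[i]: 1.0} for i in S}

# The requirement phi(A(i) | i) = 1 for all i in S.
assert all(abs(sum(phi[i].values()) - 1.0) < 1e-12 for i in S)
assert all(set(phi[i]) <= set(A[i]) for i in S)
```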
For each $(s, i) \in \mathbb{R}_+ \times S$ and $\pi \in \Pi$, by the well-known Tulcea's theorem, there exist a unique probability measure space $(\Omega, \mathcal{F}, P^\pi_{(s,i)})$ and a stochastic process $\{S_n, J_n, A_n, n \geq 0\}$ such that, for each $t \in \mathbb{R}_+$, $j \in S$, $a \in A$ and $n \geq 0$,
$$P^\pi_{(s,i)}(S_0 = s, J_0 = i) = 1, \qquad (3)$$
$$P^\pi_{(s,i)}(A_n = a \mid h_n) = \pi_n(a \mid h_n), \qquad (4)$$
$$P^\pi_{(s,i)}(S_{n+1} - S_n \leq t, J_{n+1} = j \mid h_n, a_n) = Q(t, j \mid i_n, a_n), \qquad (5)$$
where $S_n$, $J_n$ and $A_n$ denote the $n$th decision epoch, the state, and the action chosen at the $n$th decision epoch, respectively. The expectation operator with respect to $P^\pi_{(s,i)}$ is denoted by $E^\pi_{(s,i)}$. For simplicity, $P^\pi_{(0,i)}$ and $E^\pi_{(0,i)}$ are denoted by $P^\pi_i$ and $E^\pi_i$, respectively.
Remark 2.3 The construction of the probability measure space $(\Omega, \mathcal{F}, P^\pi_{(s,i)})$ and the above properties (3)-(5) of the stochastic process $\{S_n, J_n, A_n, n \geq 0\}$ follow from those in Limnios [7, p. 33] and Puterman [15, pp. 534-535].
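As a complement, the following minimal sketch indicates how a trajectory of $\{S_n, J_n, A_n\}$ could be simulated under a randomized stationary policy so that properties (3)-(5) hold. The helper names are assumptions made here for illustration: `phi(i)` is assumed to return the distribution $\varphi(\cdot \mid i)$ as a dictionary, and `sample_Q(i, a)` is assumed to draw a pair (sojourn time $S_{n+1}-S_n$, next state $J_{n+1}$) from the semi-Markov kernel $Q(\cdot, \cdot \mid i, a)$; the sketch makes no claim about the paper's specific kernel.

```python
import random

def simulate(phi, sample_Q, s0, i0, n_steps, rng=random):
    """Generate (S_n, J_n, A_n) for n = 0, ..., n_steps - 1 under a
    randomized stationary policy, following properties (3)-(5).

    phi(i)         -- dict {action: probability}, i.e. phi(. | i)          (cf. (4))
    sample_Q(i, a) -- returns (sojourn, next_state) drawn from Q(.,.| i,a) (cf. (5))
    (s0, i0)       -- initial decision epoch and state, so that (3) holds
    """
    s, i = s0, i0                       # (S_0, J_0) = (s, i), property (3)
    path = []
    for _ in range(n_steps):
        dist = phi(i)
        actions = list(dist)
        a = rng.choices(actions, weights=[dist[b] for b in actions])[0]  # A_n ~ phi(. | J_n)
        path.append((s, i, a))
        sojourn, j = sample_Q(i, a)     # (S_{n+1} - S_n, J_{n+1}) ~ Q(., . | J_n, A_n)
        s, i = s + sojourn, j
    return path
```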