Remark 2.2 Note that the history $h_n$ here generalizes that in discrete-time models by taking into account the decision epochs $t_n$ as well as the states $i_n$ and actions $a_n$; see Hernández-Lerma [5, 6] and Puterman [15], for instance. However, we can view discrete-time models as special cases of semi-Markov models, in which $t_n = n$ for all $n$.
Now we are in a position to introduce the concept of a policy.
Definition 2.1 A randomized history-dependent policy is a sequence $\pi := \{\pi_n, n = 0, 1, \ldots\}$ of stochastic kernels $\pi_n$ on the action space $A$ given $H_n$ satisfying
$$\pi_n(A(i_n) \mid h_n) = 1 \quad \forall\, h_n \in H_n,\ n \geq 0.$$
The set of all randomized history-dependent policies is denoted by $\Pi$.
Let $\Phi$ represent the set of all stochastic kernels $\varphi$ on $A$ given $S$ such that $\varphi(A(i) \mid i) = 1$ for all $i \in S$, and let $F$ denote the set of all decision functions $f: S \to A$ such that $f(i)$ is in $A(i)$ for all $i \in S$. A policy $\pi$ is said to be a randomized Markov policy if there is a sequence $\{\varphi_n\}$ of stochastic kernels $\varphi_n \in \Phi$ such that $\pi_n(\cdot \mid h_n) = \varphi_n(\cdot \mid i_n)$ for every $h_n \in H_n$ and $n \geq 0$. A randomized Markov policy is said to be randomized stationary if there is a stochastic kernel $\varphi \in \Phi$ such that $\pi_n(\cdot \mid h_n) = \varphi(\cdot \mid i_n)$ for every $h_n \in H_n$ and $n \geq 0$. In this case, we write $\pi$ as $\varphi$ for simplicity. Further, a randomized Markov policy is said to be deterministic if there is a sequence $\{f_n\}$ of decision functions $f_n \in F$ such that $\pi_n(\cdot \mid h_n)$ is the Dirac measure at $f_n(i_n)$ for all $h_n \in H_n$ and $n \geq 0$. Thus, we write such policies as $\pi = \{f_n\}$. A deterministic Markov policy is said to be stationary if there is a decision function $f \in F$ such that $\pi_n(\cdot \mid h_n)$ is the Dirac measure at $f(i_n)$ for all $h_n \in H_n$ and $n \geq 0$. A deterministic stationary policy is simply referred to as a stationary policy and is denoted by $f$. We denote by $\Pi^{RM}$, $\Pi^{RS}$, $\Pi^{DM}$, and $\Pi^{DS}$ the families of all randomized Markov, randomized stationary, deterministic Markov, and stationary policies, respectively. Obviously, $\Pi^{RS} \subset \Pi^{RM} \subset \Pi$ and $\Pi^{DS} \subset \Pi^{DM} \subset \Pi$. Moreover, for a policy $\pi = \{\varphi_n\} \in \Pi^{RM}$ and $m \geq 1$, we let ${}^{(m)}\pi := \{\varphi_m, \varphi_{m+1}, \ldots\}$ denote the $m$-remainder policy of $\pi$.
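To make the distinction between these policy classes concrete, the following sketch is illustrative only: the two-state space, admissible action sets, and probabilities are hypothetical and not taken from the paper. It represents a randomized stationary kernel $\varphi \in \Phi$ and a stationary policy $f \in F$ as plain Python mappings, and checks the defining requirement $\varphi(A(i) \mid i) = 1$.

```python
# Hypothetical two-state example (not from the paper), for illustration only.
S = ["i1", "i2"]                                  # state space S
A = {"i1": ["a", "b"], "i2": ["a"]}               # admissible action sets A(i)

# A randomized stationary policy phi in Phi: phi(. | i) is a distribution on A(i).
phi = {"i1": {"a": 0.3, "b": 0.7}, "i2": {"a": 1.0}}

# A (deterministic) stationary policy f in F: f(i) is a single action in A(i).
f = {"i1": "b", "i2": "a"}

# f can be identified with the Dirac kernel at f(i), i.e. a special element of Phi.
phi_from_f = {i: {f[i]: 1.0} for i in S}

# The requirement phi(A(i) | i) = 1 for all i in S.
assert all(abs(sum(phi[i].values()) - 1.0) < 1e-12 for i in S)
assert all(set(phi[i]) <= set(A[i]) for i in S)
```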
For each $(s, i) \in \mathbb{R}_+ \times S$ and $\pi \in \Pi$, by the well-known Tulcea's theorem, there exist a unique probability measure space $(\Omega, \mathcal{F}, P^\pi_{(s,i)})$ and a stochastic process $\{S_n, J_n, A_n, n \geq 0\}$ such that, for each $t \in \mathbb{R}_+$, $j \in S$, $a \in A$ and $n \geq 0$,
$$P^\pi_{(s,i)}(S_0 = s, J_0 = i) = 1, \qquad (3)$$
$$P^\pi_{(s,i)}(A_n = a \mid h_n) = \pi_n(a \mid h_n), \qquad (4)$$
$$P^\pi_{(s,i)}(S_{n+1} - S_n \leq t, J_{n+1} = j \mid h_n, a_n) = Q(t, j \mid i_n, a_n), \qquad (5)$$
where $S_n$, $J_n$ and $A_n$ denote the $n$th decision epoch, the state, and the action chosen at the $n$th decision epoch, respectively. The expectation operator with respect to $P^\pi_{(s,i)}$ is denoted by $E^\pi_{(s,i)}$. For simplicity, $P^\pi_{(0,i)}$ and $E^\pi_{(0,i)}$ are denoted by $P^\pi_i$ and $E^\pi_i$, respectively.
Remark 2.3 The construction of the probability measure space $(\Omega, \mathcal{F}, P^\pi_{(s,i)})$ and the above properties (3)-(5) of the stochastic process $\{S_n, J_n, A_n, n \geq 0\}$ follow from those in Limnios [7, p. 33] and Puterman [15, pp. 534-535].
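As a complement, the following minimal sketch indicates how a trajectory of $\{S_n, J_n, A_n\}$ could be simulated under a randomized stationary policy so that properties (3)-(5) hold. The helper names are assumptions made here for illustration: `phi(i)` is assumed to return the distribution $\varphi(\cdot \mid i)$ as a dictionary, and `sample_Q(i, a)` is assumed to draw a pair (sojourn time $S_{n+1}-S_n$, next state $J_{n+1}$) from the semi-Markov kernel $Q(\cdot, \cdot \mid i, a)$; the sketch makes no claim about the paper's specific kernel.

```python
import random

def simulate(phi, sample_Q, s0, i0, n_steps, rng=random):
    """Generate (S_n, J_n, A_n) for n = 0, ..., n_steps - 1 under a
    randomized stationary policy, following properties (3)-(5).

    phi(i)         -- dict {action: probability}, i.e. phi(. | i)          (cf. (4))
    sample_Q(i, a) -- returns (sojourn, next_state) drawn from Q(.,.| i,a) (cf. (5))
    (s0, i0)       -- initial decision epoch and state, so that (3) holds
    """
    s, i = s0, i0                       # (S_0, J_0) = (s, i), property (3)
    path = []
    for _ in range(n_steps):
        dist = phi(i)
        actions = list(dist)
        a = rng.choices(actions, weights=[dist[b] for b in actions])[0]  # A_n ~ phi(. | J_n)
        path.append((s, i, a))
        sojourn, j = sample_Q(i, a)     # (S_{n+1} - S_n, J_{n+1}) ~ Q(., . | J_n, A_n)
        s, i = s + sojourn, j
    return path
```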