Constrained Optimality for First-Passage Markov Decision Processes

Updated 2024-07-15 · 174 KB PDF

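"This paper studies constrained optimality for discrete-time Markov decision processes (DTMDPs), focusing on first-passage DTMDPs with constraints, state-dependent discount factors, and possibly unbounded costs. Using properties of the occupation measures of policies, the authors recast the constrained optimality problem as an infinite-dimensional linear program and prove that an optimal policy exists under suitable conditions. For finite state and action spaces, the paper gives the exact form of an optimal policy. Finally, a controlled queueing system illustrates how the results apply."

A Markov decision process (MDP) is a core model in decision theory that describes how a decision maker (or agent) interacts with an environment. The state evolves according to the Markov property: the next state depends only on the current state and the chosen action, not on the rest of the history. At each step the decision maker chooses an action based on the current state, which influences the transition and incurs a cost (or earns a reward).

The paper focuses on first-passage discrete-time Markov decision processes with constraints and state-dependent discount factors. The constraints make the problem harder, because the objective must be optimized while the expected constrained costs stay within prescribed bounds. A discount factor expresses that future costs matter less as time passes, which is natural in time-sensitive settings or when resources are limited. A state-dependent discount factor means that future costs are not discounted at one uniform rate; the rate varies with the state, so the decision maker must weigh the long-run effect of visiting each state.

Possibly unbounded costs make it especially hard to find an optimal policy. The paper shows that, through the occupation measures of policies, the constrained optimality problem can be converted into an infinite-dimensional linear program. Roughly, an occupation measure records how much discounted time the process spends in each state-action pair under a given policy. Using this equivalence, the paper then determines the specific form of an optimal policy when both the state space and the action space are finite.

Finally, the authors apply the results to a controlled queueing system. Such systems are a common model for analyzing and optimizing service operations such as call centers, production lines, and traffic management.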

1008 Xiao WU et al.

$c_0(i, a)$ and $c_l(i, a)$ $(1 \le l \le q)$ denote the objective cost and constrained cost functions, respectively, which are assumed to be measurable on $A(i)$ for each fixed $i \in S$. Finally, the real numbers $d_l$ $(1 \le l \le q)$ denote the constraints, and $\gamma$ denotes the initial distribution on $S$.

The description of the control process $M$ is as follows. Suppose that the system is in the state $i_m = i \in S$ at time $m$. The controllers select an action $a_m = a \in A(i)$ which is imposed on the system according to a policy. Then, one-stage costs $c_l(i_m, a_m)$ $(l = 0, 1, \ldots, q)$ are paid immediately; in general, for the stage $m$, the discounted costs are $\prod_{k=0}^{m-1} \alpha(i_k)\, c_l(i_m, a_m)$. At time $m + 1$, the system visits a new state $j$ according to the transition law:

$$Q(j \mid i, a) = \Pr\{i_{m+1} = j \mid i_m = i,\ a_m = a\}.$$

Once the system is in the new state $j \in S$, the process is repeated.
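As a minimal sketch of the cost accounting just described, the following Python function computes the discounted costs along a given trajectory. It assumes that the cost charged at stage $m$ is the one-stage cost multiplied by the product of the state-dependent discount factors of all earlier states. The names `discounted_costs`, `alpha_table`, and `cost_table`, and all numbers, are illustrative and do not come from the paper.

```python
from typing import Callable, Hashable, List, Sequence


def discounted_costs(
    states: Sequence[Hashable],
    actions: Sequence[Hashable],
    alpha: Callable[[Hashable], float],
    cost: Callable[[Hashable, Hashable], float],
) -> List[float]:
    """Return [prod_{k<m} alpha(i_k) * cost(i_m, a_m) for m = 0, 1, ...]."""
    if len(states) != len(actions):
        raise ValueError("need exactly one action per visited state")
    out = []
    discount = 1.0  # empty product at stage m = 0
    for i, a in zip(states, actions):
        out.append(discount * cost(i, a))
        discount *= alpha(i)  # factor contributed by the state just visited
    return out


# Hypothetical two-state example; the values are chosen only for illustration.
alpha_table = {0: 0.5, 1: 0.8}
cost_table = {
    (0, "wait"): 1.0,
    (0, "serve"): 2.0,
    (1, "wait"): 4.0,
    (1, "serve"): 3.0,
}

stage_costs = discounted_costs(
    states=[0, 1, 1],
    actions=["serve", "wait", "serve"],
    alpha=lambda i: alpha_table[i],
    cost=lambda i, a: cost_table[(i, a)],
)
```

Because the running product is updated after each stage, the first cost is undiscounted and each later cost is discounted by the factors of the states already visited, not by a power of a single constant.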

Let $H_m$ be the family of admissible histories up to time $m$ for each $m = 0, 1, \ldots$, that is,

$$H_0 := S, \qquad H_m := K^m \times S, \quad m = 1, 2, \ldots,$$

and the control policies are given as follows.

Definition 2.1 A randomized history-dependent policy is a sequence $\pi = \{\pi_m,\ m = 0, 1, \ldots\}$ of stochastic kernels $\pi_m$ on $A$ given $H_m$ such that

$$\pi_m(A(i_m) \mid h_m) = 1, \quad \forall\, h_m := (i_0, a_0, \ldots, a_{m-1}, i_m) \in H_m,\ m = 0, 1, \ldots.$$

Definition 2.2 A randomized history-dependent policy $\pi = \{\phi_m,\ m = 0, 1, \ldots\}$ is said to be (randomized) stationary if the kernels $\phi_m$ are independent of $m$, i.e., $\pi$ is of the form $\pi = \{\phi, \phi, \ldots\}$, and $\phi_m(\cdot \mid h_m) = \phi(\cdot \mid i_m)$. In this case, the policy $\pi$ is also denoted by $\phi$.
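To make the distinction between history-dependent and stationary policies concrete, here is a small Python sketch. It assumes finite state and action sets, so that a kernel is just a dictionary from actions to probabilities. The helper `make_stationary` and the example tables are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Hashable, Sequence, Set

Kernel = Dict[Hashable, float]  # action -> probability


def make_stationary(
    phi: Dict[Hashable, Kernel],
    admissible: Dict[Hashable, Set[Hashable]],
) -> Callable[[int, Sequence[Hashable]], Kernel]:
    """Wrap a state -> distribution table as a policy pi_m(. | h_m)."""
    for i, dist in phi.items():
        # The kernel must put all of its mass on the admissible set A(i).
        if not set(dist) <= admissible[i]:
            raise ValueError(f"phi puts mass outside A({i!r})")
        if abs(sum(dist.values()) - 1.0) > 1e-12:
            raise ValueError(f"phi(. | {i!r}) does not sum to 1")

    def pi(m: int, history: Sequence[Hashable]) -> Kernel:
        current_state = history[-1]  # h_m = (i_0, a_0, ..., i_m)
        return phi[current_state]    # ignores m and the earlier history

    return pi


# Hypothetical example with two states.
admissible = {0: {"wait", "serve"}, 1: {"serve"}}
phi = {0: {"wait": 0.25, "serve": 0.75}, 1: {"serve": 1.0}}
pi = make_stationary(phi, admissible)

# Two different histories ending in the same state yield the same kernel.
same = pi(0, (0,)) == pi(3, (1, "serve", 0, "wait", 1, "serve", 0))
```

A general history-dependent policy would be any function of `(m, history)` that returns such a kernel; the stationary case is the special one where only the last entry of the history is read.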

The sets of all randomized history-dependent policies and of all stationary policies are denoted by $\Pi$ and $\Phi$, respectively.

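For any given initial distribution $\gamma$ on $S$ and $\pi = \{\pi_m\} \in \Pi$, by the well-known Tulcea's theorem [14, p. 178], there exist a unique probability space $(\Omega, \mathcal{F}, P_\gamma^\pi)$ and a state-action process $\{(i_m, a_m),\ m = 0, 1, \ldots\}$ defined on this space such that, for each $C \subseteq S$, $\Gamma \in \mathcal{B}(A)$, and $m \ge 0$,

$$P_\gamma^\pi(i_0 \in C) = \gamma(C), \quad P_\gamma^\pi(a_m \in \Gamma \mid h_m) = \pi_m(\Gamma \mid h_m), \quad P_\gamma^\pi(i_{m+1} \in C \mid h_m, a_m) = Q(C \mid i_m, a_m);$$

see, e.g., [14, p. 16] for the construction of $P_\gamma^\pi$. Denote by $E_\gamma^\pi$ the expectation operation corresponding to $P_\gamma^\pi$. If $\gamma$ is concentrated on some state $i$, then we write $P_\gamma^\pi$ as $P_i^\pi$ and $E_\gamma^\pi$ as $E_i^\pi$.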
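In the special case of finite state and action sets, the three defining properties above determine the probability of any finite path as a product of the initial distribution, the policy kernels, and the transition law. The following Python sketch computes that product. It is an illustration under the finiteness assumption (the paper allows a more general action space), and `path_probability` and the example data are hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Kernel = Dict[Hashable, float]


def path_probability(
    path: Sequence[Hashable],
    gamma: Dict[Hashable, float],
    policy: Callable[[int, Sequence[Hashable]], Kernel],
    Q: Dict[Tuple[Hashable, Hashable], Dict[Hashable, float]],
) -> float:
    """Probability of the path (i_0, a_0, i_1, ..., i_n) under gamma and policy."""
    if len(path) % 2 == 0:
        raise ValueError("path must alternate states and actions, ending in a state")
    prob = gamma.get(path[0], 0.0)  # P(i_0 = path[0])
    for m in range((len(path) - 1) // 2):
        history = path[: 2 * m + 1]  # h_m = (i_0, a_0, ..., i_m)
        i_m, a_m, i_next = path[2 * m], path[2 * m + 1], path[2 * m + 2]
        prob *= policy(m, history).get(a_m, 0.0)   # pi_m(a_m | h_m)
        prob *= Q[(i_m, a_m)].get(i_next, 0.0)     # Q(i_{m+1} | i_m, a_m)
    return prob


# Hypothetical two-state example.
gamma = {0: 1.0}  # initial distribution concentrated on state 0
Q = {
    (0, "wait"): {0: 0.5, 1: 0.5},
    (0, "serve"): {0: 1.0},
    (1, "serve"): {0: 0.5, 1: 0.5},
}


def policy(m, history):
    # A stationary rule: it depends only on the current state history[-1].
    if history[-1] == 0:
        return {"wait": 0.25, "serve": 0.75}
    return {"serve": 1.0}


p = path_probability((0, "wait", 1, "serve", 0), gamma, policy, Q)
# p = 1.0 * 0.25 * 0.5 * 1.0 * 0.5 = 0.0625
```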

Definition 2.3 For each $\pi \in \Pi$, $0 \le l \le q$, and initial distribution $\gamma$, the first
