Reinforcement Learning and Optimal Control: MIT Textbook Draft

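"This is a draft textbook on machine learning and reinforcement learning, titled Reinforcement Learning and Optimal Control, written by Professor Dimitri P. Bertsekas of the Massachusetts Institute of Technology (MIT). The book focuses on large-scale, multistage decision problems that are computationally hard to solve exactly: in principle they can be solved by dynamic programming (DP), but in practice the computational cost is prohibitive. It discusses solution methods that rely on approximations to produce suboptimal policies with adequate performance. The draft is still being refined, may contain errors, and its literature references are incomplete. Readers may send feedback and suggestions to the author. The last revision date is February 6, 2019. Information about the book and ordering is available from the official website of the publisher, Athena Scientific."

In Reinforcement Learning and Optimal Control, Professor Bertsekas examines the theory and practice of reinforcement learning and optimal control. Reinforcement learning is a key area of artificial intelligence in which an agent learns an optimal behavior policy by interacting with its environment. The book combines reinforcement learning with classical dynamic programming theory, a powerful tool for multistep decision problems, particularly for characterizing theoretically optimal solutions. The main contents of the book may include the following:

1. Reinforcement learning basics: the basic concepts of reinforcement learning, such as states, actions, reward functions, and Markov decision processes (MDP).
2. Dynamic programming theory: the basic principles of dynamic programming, including the Bellman equation and algorithms such as value iteration and policy iteration.
3. Approximation methods: because large-scale problems are too complex to solve exactly, the book discusses how to use approximate dynamic programming (ADP) and function approximation to obtain near-optimal policies.
4. Learning policies: online learning algorithms such as Q-learning and SARSA, and the neural network models used in deep reinforcement learning (Deep RL).
5. Real-time decision problems: how to handle partial observability and uncertainty in real environments, and how to design robust control policies.
6. Optimal control theory: how to achieve optimal control in dynamic systems, drawing on classical control theory such as linear-quadratic optimal control (LQR) and Lyapunov stability analysis.
7. Application cases: possibly case studies from practical domains such as robot control, resource management, or game strategy, showing how the theory applies to real problems.
8. Numerical methods and algorithms: numerical techniques and algorithm implementations described in enough detail for readers to understand and implement them.

The book is a valuable resource for researchers and engineers who want a deep understanding of reinforcement learning and optimal control theory and who want to apply it to practical problems. By studying it, readers can learn how to design effective decision policies for complex problems that cannot be solved exactly.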

Sec. 1.1 Deterministic Dynamic Programming 3

...

[Figure 1.1.1: diagram of stage $k$, showing the control $u_k$, the stage cost $g_k(x_k, u_k)$, the next state $x_{k+1}$, the future stages, and the terminal cost $g_N(x_N)$.]

Figure 1.1.1 Illustration of a deterministic N-stage optimal control problem. Starting from state $x_k$, the next state under control $u_k$ is generated nonrandomly, according to $x_{k+1} = f_k(x_k, u_k)$, and a stage cost $g_k(x_k, u_k)$ is incurred.

$f_k$ is a function of $(x_k, u_k)$ that describes the mechanism by which the state is updated from time $k$ to time $k + 1$.

$N$ is the horizon or number of times control is applied.

The set of all possible $x_k$ is called the state space at time $k$. It can be any set and can depend on $k$; this generality is one of the great strengths of the DP methodology. Similarly, the set of all possible $u_k$ is called the control space at time $k$. Again it can be any set and can depend on $k$.

The problem also involves a cost function that is additive in the sense that the cost incurred at time $k$, denoted by $g_k(x_k, u_k)$, accumulates over time. Formally, $g_k$ is a function of $(x_k, u_k)$ that takes real number values, and may depend on $k$. For a given initial state $x_0$, the total cost of a control sequence $\{u_0, \ldots, u_{N-1}\}$ is

$$J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k), \tag{1.2}$$

where $g_N(x_N)$ is a terminal cost incurred at the end of the process. This cost is a well-defined number, since the control sequence $\{u_0, \ldots, u_{N-1}\}$ together with $x_0$ determines exactly the state sequence $\{x_1, \ldots, x_N\}$ via the system equation (1.1). We want to minimize the cost (1.2) over all sequences $\{u_0, \ldots, u_{N-1}\}$ that satisfy the control constraints, thereby obtaining the optimal value†

$$J^*(x_0) = \min_{\substack{u_k \in U_k(x_k) \\ k = 0, \ldots, N-1}} J(x_0; u_0, \ldots, u_{N-1})$$

as a function of $x_0$. Figure 1.1.1 illustrates the main elements of the problem.
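To make the cost (1.2) and the optimal value concrete, here is a minimal Python sketch. It evaluates $J$ by rolling the system equation forward, and it finds $J^*(x_0)$ by exhaustive enumeration, which is only feasible for tiny problems. The toy problem (`toy_f`, `toy_g`, `toy_g_terminal`, `toy_U`) is a hypothetical illustration, not an example from the book.

```python
def total_cost(x0, controls, f, g, g_terminal):
    """Evaluate J(x0; u_0..u_{N-1}) = g_N(x_N) + sum_k g_k(x_k, u_k)."""
    x, cost = x0, 0.0
    for k, u in enumerate(controls):
        cost += g(k, x, u)   # stage cost g_k(x_k, u_k)
        x = f(k, x, u)       # system equation x_{k+1} = f_k(x_k, u_k)
    return cost + g_terminal(x)


def optimal_value(x0, N, U, f, g, g_terminal):
    """Brute-force J*(x0): try every feasible control sequence of length N."""
    best_cost, best_seq = float("inf"), None

    def extend(k, x, cost, seq):
        nonlocal best_cost, best_seq
        if k == N:
            total = cost + g_terminal(x)
            if total < best_cost:
                best_cost, best_seq = total, list(seq)
            return
        for u in U(k, x):    # control constraint set U_k(x_k)
            seq.append(u)
            extend(k + 1, f(k, x, u), cost + g(k, x, u), seq)
            seq.pop()

    extend(0, x0, 0.0, [])
    return best_cost, best_seq


# Hypothetical toy problem: integer state, three controls, quadratic costs.
def toy_f(k, x, u):
    return x + u

def toy_g(k, x, u):
    return u * u

def toy_g_terminal(x):
    return 3 * (x - 2) ** 2

def toy_U(k, x):
    return (-1, 0, 1)

best, sequence = optimal_value(0, 2, toy_U, toy_f, toy_g, toy_g_terminal)
print(best, sequence)
```

The number of sequences grows exponentially with $N$, which is the reason exhaustive search is not a practical solution method.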

We will next illustrate deterministic problems with some examples.

† We use throughout "min" (in place of "inf") to indicate minimal value over a feasible set of controls, even when we are not sure that the minimum is attained by some feasible control.

4 Exact Dynamic Programming Chap. 1

[Figure 1.1.2: transition graph with initial state $s$, Stage 0, Stage 1, Stage 2, ..., Stage $N - 1$, Stage $N$, and an artificial terminal node $t$ reached by terminal arcs with cost equal to the terminal cost.]

Figure 1.1.2 Transition graph for a deterministic finite-state system. Nodes correspond to states $x_k$. Arcs correspond to state-control pairs $(x_k, u_k)$. An arc $(x_k, u_k)$ has start and end nodes $x_k$ and $x_{k+1} = f_k(x_k, u_k)$, respectively. We view the cost $g_k(x_k, u_k)$ of the transition as the length of this arc. The problem is equivalent to finding a shortest path from initial node $s$ to terminal node $t$.

Discrete Optimal Control Problems

There are many situations where the state and control are naturally discrete and take a finite number of values. Such problems are often conveniently specified in terms of an acyclic graph specifying for each state $x_k$ the possible transitions to next states $x_{k+1}$. The nodes of the graph correspond to states $x_k$ and the arcs of the graph correspond to state-control pairs $(x_k, u_k)$. Each arc with start node $x_k$ corresponds to a choice of a single control $u_k \in U_k(x_k)$ and has as end node the next state $f_k(x_k, u_k)$. The cost of an arc $(x_k, u_k)$ is defined as $g_k(x_k, u_k)$; see Fig. 1.1.2. To handle the final stage, an artificial terminal node $t$ is added. Each state $x_N$ at stage $N$ is connected to the terminal node $t$ with an arc having cost $g_N(x_N)$.

Note that control sequences correspond to paths originating at the initial state (node $s$ at stage 0) and terminating at one of the nodes corresponding to the final stage $N$. If we view the cost of an arc as its length, we see that a deterministic finite-state finite-horizon problem is equivalent to finding a minimum-length (or shortest) path from the initial node $s$ of the graph to the terminal node $t$. Here, by a path we mean a sequence of arcs such that given two successive arcs in the sequence the end node of the first arc is the same as the start node of the second. By the length of a path we mean the sum of the lengths of its arcs.†
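The graph construction above can be sketched in Python as follows. The code builds the transition graph with an artificial terminal node `"t"` and finds a shortest path by relaxing arcs in stage order, a standard method for acyclic graphs that is used here as one convenient choice. The two-stage graph, its state names, and its arc costs are hypothetical values chosen for illustration.

```python
def shortest_path(transitions, terminal_cost, N, start):
    """Shortest path from node (0, start) to the artificial terminal node 't'.

    transitions[(k, x)] maps each control u to (next state, arc cost g_k(x, u)).
    terminal_cost[x] is the terminal cost g_N(x) of a stage-N state x.
    """
    # Build the arc lists; nodes are (stage, state) pairs plus the terminal 't'.
    arcs = {}
    for (k, x), choices in transitions.items():
        arcs[(k, x)] = [((k + 1, nxt), cost) for nxt, cost in choices.values()]
    for x, cost in terminal_cost.items():
        arcs[(N, x)] = [("t", cost)]   # terminal arc with length g_N(x_N)

    dist, pred = {(0, start): 0.0}, {}
    # Processing nodes stage by stage is a topological order of this graph.
    for node in sorted(arcs, key=lambda n: n[0]):
        if node not in dist:
            continue                   # node not reachable from the start
        for nxt, length in arcs[node]:
            if dist[node] + length < dist.get(nxt, float("inf")):
                dist[nxt] = dist[node] + length
                pred[nxt] = node

    # Recover the path by following predecessors back from 't'.
    path = ["t"]
    while path[-1] != (0, start):
        path.append(pred[path[-1]])
    return dist["t"], path[::-1]


# Hypothetical 2-stage example.
transitions = {
    (0, "s"): {"a": ("A", 1.0), "b": ("B", 4.0)},
    (1, "A"): {"a": ("C", 5.0), "b": ("D", 2.0)},
    (1, "B"): {"a": ("C", 2.0), "b": ("D", 3.0)},
}
terminal_cost = {"C": 0.0, "D": 2.0}

length, path = shortest_path(transitions, terminal_cost, N=2, start="s")
print(length, path)
```

Each path from `(0, "s")` to `"t"` corresponds to one control sequence, and its length equals the total cost of that sequence including the terminal cost.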

† It turns out also that any shortest path problem (with a possibly nonacyclic graph) can be reformulated as a finite-state deterministic optimal control problem, as we will see in Section 1.3.1. See also [Ber17], Section 2.1, and [Ber98] for an extensive discussion of shortest path methods, which connects with our discussion here.


... determined from the preceding three). It is appropriate to consider as state the set of operations already performed, the initial state being an artificial state corresponding to the beginning of the decision process. The possible state transitions corresponding to the possible states and decisions for this problem are shown in Fig. 1.1.3. Here the problem is deterministic, i.e., at a given state, each choice of control leads to a uniquely determined state. For example, at state AC the decision to perform operation D leads to state ACD with certainty, and has cost $C_{CD}$. Thus the problem can be conveniently represented in terms of the transition graph of Fig. 1.1.3. The optimal solution corresponds to the path that starts at the initial state and ends at some state at the terminal time and has minimum sum of arc costs plus the terminal cost.

Continuous-Spaces Optimal Control Problems

Many classical problems in control theory involve a state that belongs to a Euclidean space, i.e., the space of n-dimensional vectors of real variables, where n is some positive integer. The following is representative of the class of linear-quadratic problems, where the system equation is linear, the cost function is quadratic, and there are no control constraints. In our example, the states and controls are one-dimensional, but there are multidimensional extensions, which are very popular (see [Ber17], Section 3.1).

Example 1.1.2 (A Linear-Quadratic Problem)

A certain material is passed through a sequence of $N$ ovens (see Fig. 1.1.4). Denote

$x_0$: initial temperature of the material,

$x_k$, $k = 1, \ldots, N$: temperature of the material at the exit of oven $k$,

$u_{k-1}$, $k = 1, \ldots, N$: heat energy applied to the material in oven $k$.

In practice there will be some constraints on $u_k$, such as nonnegativity. However, for analytical tractability one may also consider the case where $u_k$ is unconstrained, and check later if the solution satisfies some natural restrictions in the problem at hand.

We assume a system equation of the form

$$x_{k+1} = (1 - a)x_k + a u_k, \qquad k = 0, 1, \ldots, N - 1,$$

where $a$ is a known scalar from the interval $(0, 1)$. The objective is to get the final temperature $x_N$ close to a given target $T$, while expending relatively little energy. We express this with a cost function of the form

$$r(x_N - T)^2 + \sum_{k=0}^{N-1} u_k^2,$$

where $r > 0$ is a given scalar.
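As a quick numerical illustration of this example, the following Python sketch simulates the oven temperatures and evaluates the cost for a given control sequence. The function names and the numbers used ($a = 0.5$, $r = 4$, $T = 2$, $x_0 = 0$, controls $2, 2$) are hypothetical choices, not values from the book.

```python
def simulate_ovens(x0, controls, a):
    """Return [x_0, x_1, ..., x_N] under x_{k+1} = (1 - a) x_k + a u_k."""
    temperatures = [x0]
    for u in controls:
        temperatures.append((1 - a) * temperatures[-1] + a * u)
    return temperatures


def oven_cost(x0, controls, a, r, T):
    """Quadratic cost r (x_N - T)^2 + sum_k u_k^2."""
    x_final = simulate_ovens(x0, controls, a)[-1]
    return r * (x_final - T) ** 2 + sum(u * u for u in controls)


temperatures = simulate_ovens(x0=0.0, controls=[2.0, 2.0], a=0.5)
cost = oven_cost(x0=0.0, controls=[2.0, 2.0], a=0.5, r=4.0, T=2.0)
print(temperatures, cost)
```

With these values the temperatures are 0.0, 1.0, 1.5 and the cost is 9.0: the final temperature misses the target by 0.5, contributing 1.0, and the energy term contributes 8.0.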


[Figure 1.1.4: the material enters at initial temperature $x_0$, passes through Oven 1 and Oven 2, and exits at the final temperature $x_2$.]

Figure 1.1.4 The linear-quadratic problem of Example 1.1.2 for $N = 2$. The temperature of the material evolves according to the system equation $x_{k+1} = (1 - a)x_k + a u_k$, where $a$ is some scalar with $0 < a < 1$.

Linear-quadratic problems with no constraints on the state or the control admit a nice analytical solution, as we will see later in Section 1.3.6. In another frequently arising optimal control problem there are linear constraints on the state and/or the control. In the preceding example it would have been natural to require that $a_k \le x_k \le b_k$ and/or $c_k \le u_k \le d_k$, where $a_k$, $b_k$, $c_k$, $d_k$ are given scalars. Then the problem would be solvable not only by DP but also by quadratic programming methods. Generally deterministic optimal control problems with continuous state and control spaces (in addition to DP) admit a solution by nonlinear programming methods, such as gradient, conjugate gradient, and Newton's method, which can be suitably adapted to their special structure.
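As a rough illustration of the nonlinear programming route, the sketch below applies a plain gradient method with a constant step size to the unconstrained oven problem of Example 1.1.2. The step size, iteration count, and numerical values are assumptions made for this sketch. The gradient uses the fact that $x_N$ depends linearly on each control, with $\partial x_N / \partial u_k = a(1 - a)^{N-1-k}$.

```python
def final_temperature(x0, controls, a):
    """Roll the system equation x_{k+1} = (1 - a) x_k + a u_k forward to x_N."""
    x = x0
    for u in controls:
        x = (1 - a) * x + a * u
    return x


def cost_gradient(x0, controls, a, r, T):
    """Gradient of r (x_N - T)^2 + sum_k u_k^2 with respect to the controls."""
    N = len(controls)
    error = final_temperature(x0, controls, a) - T
    # d x_N / d u_k = a (1 - a)^(N - 1 - k), so the chain rule gives:
    return [2 * r * error * a * (1 - a) ** (N - 1 - k) + 2 * u
            for k, u in enumerate(controls)]


def gradient_method(x0, N, a, r, T, step=0.1, iterations=500):
    """Constant-step gradient descent, starting from zero heat in every oven."""
    controls = [0.0] * N
    for _ in range(iterations):
        grad = cost_gradient(x0, controls, a, r, T)
        controls = [u - step * d for u, d in zip(controls, grad)]
    return controls


# Hypothetical values: two ovens, a = 0.5, r = 4, target T = 2, x_0 = 0.
controls = gradient_method(x0=0.0, N=2, a=0.5, r=4.0, T=2.0)
print(controls)
```

Because the cost is a convex quadratic in the controls, a sufficiently small constant step converges to the unique minimizer. For these values the iterates approach $u_0 = 8/9$ and $u_1 = 16/9$, which can be checked by setting the gradient to zero by hand.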

1.1.2 The Dynamic Programming Algorithm

The DP algorithm rests on a simple idea, the principle of optimality, which roughly states the following; see Fig. 1.1.5.

Principle of Optimality

Let $\{u_0^*, \ldots, u_{N-1}^*\}$ be an optimal control sequence, which together with $x_0$ determines the corresponding state sequence $\{x_1^*, \ldots, x_N^*\}$ via the system equation (1.1). Consider the subproblem whereby we start at $x_k^*$ at time $k$ and wish to minimize the "cost-to-go" from time $k$ to time $N$,

$$g_k(x_k^*, u_k) + \sum_{m=k+1}^{N-1} g_m(x_m, u_m) + g_N(x_N),$$

over $\{u_k, \ldots, u_{N-1}\}$ with $u_m \in U_m(x_m)$, $m = k, \ldots, N - 1$. Then the truncated optimal control sequence $\{u_k^*, \ldots, u_{N-1}^*\}$ is optimal for this subproblem.
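The principle can be checked numerically on a small problem. The Python sketch below finds an optimal sequence by exhaustive search, then verifies that each of its tails attains the optimal cost-to-go of the corresponding tail subproblem. The toy system, costs, horizon, and the assumption that the control set is the same at every state and stage are all hypothetical simplifications for this sketch.

```python
from itertools import product

CONTROLS = (-1, 0, 1)   # assumption: same control set at every state and stage
N = 3                   # horizon of the hypothetical toy problem


def f(k, x, u):
    return x + u                 # system equation x_{k+1} = f_k(x_k, u_k)

def g(k, x, u):
    return u * u + abs(x)        # stage cost g_k(x_k, u_k)

def g_N(x):
    return 3 * (x - 2) ** 2      # terminal cost g_N(x_N)


def tail_cost(k, x, controls):
    """Cost-to-go from state x at time k under controls u_k, ..., u_{N-1}."""
    cost = 0
    for m, u in enumerate(controls, start=k):
        cost += g(m, x, u)
        x = f(m, x, u)
    return cost + g_N(x)


def best_tail(k, x):
    """Exhaustively minimize the cost-to-go of the tail subproblem at (k, x)."""
    return min(product(CONTROLS, repeat=N - k),
               key=lambda seq: tail_cost(k, x, seq))


def principle_holds(x0):
    u_star = best_tail(0, x0)    # optimal sequence for the full problem
    x = x0
    for k in range(N):
        # The tail u_k*, ..., u_{N-1}* must attain the optimal cost-to-go at x_k*.
        optimal = tail_cost(k, x, best_tail(k, x))
        if tail_cost(k, x, u_star[k:]) != optimal:
            return False
        x = f(k, x, u_star[k])   # move along the optimal state sequence
    return True


print(principle_holds(0))
```

The check compares costs rather than sequences, so it remains valid when a tail subproblem has several optimal solutions.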

Stated succinctly, the principle of optimality says that the tail of an optimal sequence is optimal for the tail subproblem. Its intuitive justification is simple. If the truncated control sequence $\{u_k^*, \ldots, u_{N-1}^*\}$ were not optimal as stated, we would be able to reduce the cost further by switching
