Introductory Reinforcement Learning Exercise Solutions: Policies and Applications of Symmetry


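Reinforcement Learning: An Introduction is the well-known textbook by Richard S. Sutton and Andrew G. Barto, which presents the basic principles and applications of reinforcement learning in an accessible way. Its exercise solutions are important for understanding and practicing reinforcement learning, especially for students and researchers who want to apply the theory to real problems.

Exercise 1.1 explores self-play. The solution points out that a reinforcement learning algorithm playing against itself may develop a strategy of alternating "good" and "bad" moves so that one side wins every game. This prevents the algorithm from learning the standard minimax strategy, because minimax assumes a rational opponent, and the "help" given during self-play does not match the behavior expected in a real game. The learned policy may therefore fail to generalize to a genuine opponent.

Exercise 1.2 concerns the use of symmetries in reinforcement learning. By simplifying the definition of states and actions, we can reduce the size of the state space, so that the algorithm learns over fewer, more representative states and each estimate is backed by more data, which improves the statistical significance of the learned results. In a game such as Tic-Tac-Toe, if the opponent exploits symmetries, our algorithm can play better against that opponent by recognizing and handling those symmetries. The algorithm therefore needs not only to learn the basic rules of the game but also to adapt to, and outperform, opponents that play symmetric strategies.

Together, these two exercises show the challenges reinforcement learning faces in complex decision problems, and they stress the importance of understanding the structure of the problem and designing a compact state space. Working through them helps the reader grasp the basic concepts of reinforcement learning and see how an algorithm can be tuned for more complex settings.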
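To make the symmetry idea from Exercise 1.2 concrete, here is a minimal sketch of collapsing Tic-Tac-Toe positions that differ only by a rotation or reflection into one representative state. The board encoding (a tuple of 9 cells in row-major order) and the function names `rotate`, `reflect`, `symmetries`, and `canonical` are illustrative assumptions, not taken from the book or its solutions.

```python
def rotate(board):
    """Rotate a 3x3 board (tuple of 9 cells, row-major) 90 degrees clockwise."""
    # After a clockwise rotation, new[r][c] holds old[2 - c][r].
    return tuple(board[3 * (2 - c) + r] for r in range(3) for c in range(3))


def reflect(board):
    """Mirror the board left-to-right."""
    return tuple(board[3 * r + (2 - c)] for r in range(3) for c in range(3))


def symmetries(board):
    """Return the 8 rotations/reflections of the board (may contain duplicates)."""
    result = []
    current = board
    for _ in range(4):
        result.append(current)
        result.append(reflect(current))
        current = rotate(current)
    return result


def canonical(board):
    """Pick one representative per symmetry class: the lexicographically smallest."""
    return min(symmetries(board))


# Hypothetical usage: a value table keyed by canonical(board) shares one
# estimate among all symmetric positions.
corner_a = tuple("X........")
corner_b = tuple("........X")
same_state = canonical(corner_a) == canonical(corner_b)
```

With this encoding, the nine possible first moves collapse into three classes (corner, edge, center), so a value table keyed by `canonical(board)` needs far fewer entries than one keyed by the raw board.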

Exercise 2.15:

One idea is to have the algorithm perform an ε-pursuit algorithm, where with probability ε we try an arm at random and with the remaining probability 1 − ε we follow the standard pursuit algorithm. Thus even if the action probabilities π(a) converge to something incorrect, the fact that a fraction ε of the time we select a random arm means that in enough trials we will explore the entire space.

As a practical detail, we could code our method so that on the fraction ε of the time it is exploring, it is explicitly forbidden from drawing on the “greedy” arm. This is the arm that would most likely be selected using the action probabilities π. That is, don’t draw on the arm $a^*$ given by

$$a^* = \arg\max_a \pi(a) .$$

This modification may help the exploration process since we don’t waste an exploratory draw.
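The ε-pursuit idea described in this exercise can be sketched as follows. This is only an illustration under stated assumptions: the function name `epsilon_pursuit`, the step size `beta`, the Gaussian unit-variance rewards, and the sample-average value estimates are choices made for this sketch, and the probability update is the commonly described pursuit rule (move π toward 1 for the arm with the highest estimated value and toward 0 for the others).

```python
import random


def epsilon_pursuit(true_means, steps=5000, epsilon=0.1, beta=0.01,
                    exclude_greedy=True, seed=0):
    """Run an epsilon-pursuit bandit; return (pi, q, counts)."""
    rng = random.Random(seed)
    n = len(true_means)
    pi = [1.0 / n] * n   # action probabilities
    q = [0.0] * n        # sample-average value estimates
    counts = [0] * n     # number of pulls per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploratory draw: pick uniformly, optionally skipping the arm
            # a* = argmax_a pi(a) so the draw is not "wasted" on it.
            candidates = list(range(n))
            if exclude_greedy and n > 1:
                a_star = max(range(n), key=lambda a: pi[a])
                candidates.remove(a_star)
            arm = rng.choice(candidates)
        else:
            # Standard pursuit step: sample an arm according to pi.
            arm = rng.choices(range(n), weights=pi)[0]

        # Assumed reward model: Gaussian with unit variance.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]

        # Pursue the arm that currently has the highest estimated value.
        greedy = max(range(n), key=lambda a: q[a])
        for a in range(n):
            target = 1.0 if a == greedy else 0.0
            pi[a] += beta * (target - pi[a])

    return pi, q, counts


pi, q, counts = epsilon_pursuit([0.1, 0.5, 1.5])
```

Setting `exclude_greedy=False` gives the plain ε-pursuit variant, in which exploratory draws may also land on the arm that π already favors.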

Associative Search

Exercise 2.16 (unknown bandits):

If we are not told which case of the problem we face, we can assume that on average half of the time we face a case A instance and half of the time a case B instance. If we knew which case we were facing, we would pull different arms: the optimal play is to pull the second arm in case A and the first arm in case B. Since we have no such knowledge, the best we can obtain is the average reward. The expected reward of each arm, averaged over the two equally likely cases, is

E[arm 1] = 0.5(0.1) + 0.5(0.9) = 0.5
E[arm 2] = 0.5(0.2) + 0.5(0.8) = 0.5 .

Both arms have the same expected reward, so if we select an arm uniformly at random and play many games, the expected reward is

0.5(0.5) + 0.5(0.5) = 0.5 .

If instead we are told which case we are playing, we can learn the optimal arm for each case and then play that arm repeatedly. The expected reward over many plays is then

0.5(0.2) + 0.5(0.9) = 0.55 ,

so when we know the type of game we are playing our expected reward increases, as one would expect.
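The arithmetic in this exercise can be checked with a short script. The per-arm values below (0.1 and 0.2 in case A, 0.9 and 0.8 in case B) are read off the formulas above together with the stated optimal arms; the variable and function names are illustrative assumptions.

```python
# Expected reward of each arm in each case: values[case][arm_index].
# Assumed assignment: arm 2 is best in case A, arm 1 is best in case B.
values = {"A": [0.1, 0.2], "B": [0.9, 0.8]}
p_case = {"A": 0.5, "B": 0.5}  # both cases are equally likely


def expected_reward_fixed_arm(arm):
    """Expected reward of always pulling one arm when the case is unknown."""
    return sum(p_case[c] * values[c][arm] for c in values)


def expected_reward_informed():
    """Expected reward when the case is known and the best arm is pulled."""
    return sum(p_case[c] * max(values[c]) for c in values)


uninformed = [expected_reward_fixed_arm(arm) for arm in range(2)]
informed = expected_reward_informed()
print(uninformed, informed)
```

Without knowledge of the case, both arms are worth the same on average, so no choice of arm can beat 0.5. Knowing the case lifts the expected reward to 0.55.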
