Reinforcement Learning: A Survey from a Computer Science Perspective

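"This PDF is a comprehensive survey of reinforcement learning written by Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. It presents the field from a computer-science perspective, in a form accessible to researchers familiar with machine learning. The article reviews the historical background of reinforcement learning and summarizes a broad range of current work. Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interaction with a dynamic environment. The work described bears some resemblance to work in psychology, but differs in its details and applications."

Reinforcement learning (RL) is an important branch of machine learning. It studies how an agent, while interacting with its environment, learns an optimal policy from rewards and penalties. The paper "Reinforcement Learning: A Survey" gives a comprehensive overview of the field, aimed at researchers who already have some background in machine learning.

The core idea of RL is that an agent takes actions in an environment and adjusts its policy according to the outcome of those actions (a reward or a penalty). This learning process can be seen as optimization through repeated trial and feedback, similar to how animals or humans learn, although the algorithms are designed more formally and with more attention to efficiency and performance.

In its historical background, the paper may cover early theoretical foundations such as the Bellman equation and dynamic programming, which are cornerstones of reinforcement-learning theory. They provide methods for computing an optimal policy when the model of the environment is known. As research progressed, attention turned to the case where the model is unknown, which led to model-free algorithms such as Q-learning and SARSA.

In its summary of current work, the paper may discuss the breakthroughs of deep reinforcement learning, for example Deep Q-Networks (DQN), which combine RL with deep neural networks (DNNs) to handle high-dimensional state spaces and have enabled major advances in complex environments such as Atari games and Go. It may also touch on strategies for balancing exploration and exploitation, experience replay buffers, and Double DQN.

Beyond algorithms and methods, the paper may discuss the challenges RL faces in practice, such as uncertainty in environment modeling, delayed reward, long-term credit assignment, and improving exploration and generalization. Important applications of RL in continuous control, robotics, recommender systems, and resource scheduling may also be mentioned.

Finally, although reinforcement learning is related to the theory of operant conditioning in psychology, RL places more emphasis on computational efficiency and scalability in its algorithm design and objectives, which gives it distinctive value in both engineering problems and theoretical research. The paper is a comprehensive overview of the field and a valuable reference for researchers who want to understand and apply reinforcement learning in depth.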

Reinforcement Learning: A Survey

Figure 3: A Tsetlin automaton with 2N states. The top row shows the state transitions that are made when the previous action resulted in a reward of 1; the bottom row shows transitions after a reward of 0. In states in the left half of the figure, action 0 is taken; in those on the right, action 1 is taken.

Because of the guarantee of optimal exploration and the simplicity of the technique (given the table of index values), this approach holds a great deal of promise for use in more complex applications. This method proved useful in an application to robotic manipulation with immediate reward (Salganicoff & Ungar, 1995). Unfortunately, no one has yet been able to find an analog of index values for delayed reinforcement problems.

2.1.3 Learning Automata

A branch of the theory of adaptive control is devoted to learning automata, surveyed by Narendra and Thathachar (1989), which were originally described explicitly as finite state automata. The Tsetlin automaton shown in Figure 3 provides an example that solves a 2-armed bandit arbitrarily near optimally as $N$ approaches infinity.
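As a rough illustration of how such an automaton behaves, here is a minimal Python sketch. It assumes the usual Tsetlin transition structure that the description of Figure 3 suggests: a reward of 1 moves the automaton deeper into the half belonging to its current action, while a reward of 0 moves it toward the boundary between the halves and, from the boundary state, over to the other action. The class name `TsetlinAutomaton`, the (action, depth) state encoding, and the `run` helper are illustrative assumptions, not taken from the survey.

```python
import random


class TsetlinAutomaton:
    """Sketch of a two-action Tsetlin automaton with 2N states.

    Assumed encoding: instead of numbering states 1..2N, the state is kept
    as (action, depth), where depth runs from 1 (next to the boundary
    between the two halves) to N (farthest from the boundary).
    """

    def __init__(self, n):
        self.n = n
        self.action = 0
        self.depth = 1

    def act(self):
        return self.action

    def update(self, reward):
        if reward == 1:
            # Reward: commit more strongly to the current action.
            self.depth = min(self.depth + 1, self.n)
        elif self.depth > 1:
            # No reward: drift back toward the boundary.
            self.depth -= 1
        else:
            # No reward at the boundary: switch to the other action.
            self.action = 1 - self.action


def run(automaton, success_probs, steps, rng):
    """Play a Bernoulli 2-armed bandit and count pulls of each arm."""
    pulls = [0, 0]
    for _ in range(steps):
        a = automaton.act()
        reward = 1 if rng.random() < success_probs[a] else 0
        automaton.update(reward)
        pulls[a] += 1
    return pulls


if __name__ == "__main__":
    print(run(TsetlinAutomaton(n=5), [0.3, 0.7], 10_000, random.Random(0)))
```

In this sketch a larger `n` means a longer run of failures is needed before the automaton abandons an action, which is one way to read the claim about behavior as the number of states grows.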

It is inconvenient to describe algorithms as finite-state automata, so a move was made to describe the internal state of the agent as a probability distribution according to which actions would be chosen. The probabilities of taking different actions would be adjusted according to their previous successes and failures.

An example, which stands among a set of algorithms independently developed in the mathematical psychology literature (Hilgard & Bower, 1975), is the linear reward-inaction algorithm. Let $p_i$ be the agent's probability of taking action $i$.

When action $a_i$ succeeds,

$p_i := p_i + \alpha (1 - p_i)$

$p_j := p_j - \alpha p_j$ for $j \neq i$.

When action $a_i$ fails, $p_j$ remains unchanged (for all $j$).
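The update rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the step size is called `alpha`, a success raises the chosen action's probability toward 1 while scaling the others down by the same factor, and a failure changes nothing. The function names and the `simulate` harness (a Bernoulli bandit with made-up success probabilities) are hypothetical.

```python
import random


def lri_update(probs, action, succeeded, alpha):
    """One linear reward-inaction step; returns a new probability list."""
    if not succeeded:
        # Inaction: a failure leaves every probability unchanged.
        return list(probs)
    # Success: shrink every probability by a factor (1 - alpha) ...
    updated = [p - alpha * p for p in probs]
    # ... and move the chosen action's probability toward 1.
    updated[action] = probs[action] + alpha * (1.0 - probs[action])
    return updated


def simulate(success_probs, alpha, steps, rng):
    """Run the update on a Bernoulli bandit (hypothetical test harness)."""
    probs = [1.0 / len(success_probs)] * len(success_probs)
    for _ in range(steps):
        action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        succeeded = rng.random() < success_probs[action]
        probs = lri_update(probs, action, succeeded, alpha)
    return probs


if __name__ == "__main__":
    print(simulate([0.2, 0.8], alpha=0.05, steps=2000, rng=random.Random(0)))
```

The success update preserves the total probability: the chosen action gains `alpha * (1 - p)` and the remaining actions together lose exactly that amount.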

This algorithm converges with probability 1 to a vector containing a single 1 and the rest 0's (choosing a particular action with probability 1). Unfortunately, it does not always converge to the correct action; but the probability that it converges to the wrong one can be made arbitrarily small by making $\alpha$ small (Narendra & Thathachar, 1974). There is no literature on the regret of this algorithm.


Kaelbling, Littman, & Moore

2.2 Ad-Hoc Techniques

In reinforcement-learning practice, some simple, ad hoc strategies have been popular. They are rarely, if ever, the best choice for the models of optimality we have used, but they may be viewed as reasonable, computationally tractable, heuristics. Thrun (1992) has surveyed a variety of these techniques.

2.2.1 Greedy Strategies

The first strategy that comes to mind is to always choose the action with the highest estimated payoff. The flaw is that early unlucky sampling might indicate that the best action's reward is less than the reward obtained from a suboptimal action. The suboptimal action will always be picked, leaving the true optimal action starved of data and its superiority never discovered. An agent must explore to ameliorate this outcome.

A useful heuristic is optimism in the face of uncertainty, in which actions are selected greedily, but strongly optimistic prior beliefs are put on their payoffs so that strong negative evidence is needed to eliminate an action from consideration. This still has a measurable danger of starving an optimal but unlucky action, but the risk of this can be made arbitrarily small. Techniques like this have been used in several reinforcement learning algorithms including the interval exploration method (Kaelbling, 1993b) (described shortly), the exploration bonus in Dyna (Sutton, 1990), curiosity-driven exploration (Schmidhuber, 1991a), and the exploration mechanism in prioritized sweeping (Moore & Atkeson, 1993).
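One simple way to realize greedy selection with optimistic priors is sketched below in Python. It is only an illustration, not any of the cited algorithms: the assumption here is that each action's prior is encoded as a number of imaginary observations of a high reward, so an action's estimate only falls below the others after real evidence accumulates. The class name and the parameters `prior_value` and `prior_weight` are hypothetical.

```python
class OptimisticGreedy:
    """Greedy action selection with optimistic prior beliefs (a sketch).

    Assumption: each action's prior is encoded as `prior_weight` imaginary
    observations of reward `prior_value`; both names are hypothetical.
    """

    def __init__(self, n_actions, prior_value=1.0, prior_weight=5):
        self.totals = [prior_value * prior_weight] * n_actions
        self.counts = [prior_weight] * n_actions

    def estimates(self):
        # Mean reward per action, imaginary observations included.
        return [t / c for t, c in zip(self.totals, self.counts)]

    def select(self):
        est = self.estimates()
        # Purely greedy; ties go to the lowest-numbered action.
        return max(range(len(est)), key=lambda a: est[a])

    def update(self, action, reward):
        self.totals[action] += reward
        self.counts[action] += 1


if __name__ == "__main__":
    agent = OptimisticGreedy(n_actions=3)
    a = agent.select()
    agent.update(a, reward=0.0)
    print(a, agent.estimates())
```

A larger `prior_weight` means more negative evidence is required before an action is dropped, which lowers the risk of starving a good but unlucky action at the cost of more early exploration.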

2.2.2 Randomized Strategies

Another simple exploration strategy is to take the action with the best estimated expected reward by default, but with probability $p$, choose an action at random. Some versions of this strategy start with a large value of $p$ to encourage initial exploration, which is slowly decreased.
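A minimal Python sketch of this strategy follows. The exploration probability is passed in as `p`, and `decayed_p` shows one possible way to shrink it over time; the geometric schedule and both function names are assumptions made for illustration, since the text does not specify how the value is decreased.

```python
import random


def choose_action(estimates, p, rng):
    """Greedy by default; with probability p pick uniformly at random."""
    if rng.random() < p:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])


def decayed_p(p0, decay, t):
    """Hypothetical schedule: geometric decay of p with the step count t."""
    return p0 * decay ** t


if __name__ == "__main__":
    rng = random.Random(0)
    estimates = [0.2, 0.5, 0.1]
    for t in range(5):
        p = decayed_p(0.9, 0.5, t)
        print(t, round(p, 4), choose_action(estimates, p, rng))
```

Note that the random branch samples uniformly over all actions, including the greedy one.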

An objection to the simple strategy is that when it experiments with a non-greedy action it is no more likely to try a promising alternative than a clearly hopeless alternative. A slightly more sophisticated strategy is Boltzmann exploration. In this case, the expected reward for taking action $a$, $ER(a)$, is used to choose an action probabilistically according to the distribution

$P(a) = \frac{e^{ER(a)/T}}{\sum_{a' \in A} e^{ER(a')/T}}$.

The temperature parameter $T$ can be decreased over time to decrease exploration. This method works well if the best action is well separated from the others, but suffers somewhat when the values of the actions are close. It may also converge unnecessarily slowly unless the temperature schedule is manually tuned with great care.
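Boltzmann exploration can be sketched in Python as a softmax over the estimated expected rewards divided by a temperature. The function names below are hypothetical. Subtracting the maximum estimate before exponentiating is a numerical-stability choice added here; it leaves the resulting distribution unchanged.

```python
import math
import random


def boltzmann_probs(expected_rewards, temperature):
    """Action probabilities proportional to exp(expected_reward / T)."""
    m = max(expected_rewards)
    # Shifting by the maximum avoids overflow and cancels in the ratio.
    weights = [math.exp((er - m) / temperature) for er in expected_rewards]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_action(expected_rewards, temperature, rng):
    """Sample one action index from the Boltzmann distribution."""
    probs = boltzmann_probs(expected_rewards, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    estimates = [0.2, 0.5, 0.45]
    for temperature in (1.0, 0.1, 0.01):
        print(temperature, boltzmann_probs(estimates, temperature))
    print(boltzmann_action(estimates, 0.1, random.Random(0)))
```

Printing the probabilities at several temperatures shows the behavior described above: at high temperature the actions are chosen almost uniformly, and the two close estimates stay hard to tell apart until the temperature is quite low.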

2.2.3 Interval-based Techniques

Exploration is often more efficient when it is based on second-order information about the certainty or variance of the estimated values of actions. Kaelbling's interval estimation algorithm (1993b) stores statistics for each action $a_i$: $w_i$ is the number of successes and $n_i$ the number of trials. An action is chosen by computing the upper bound of a $100 \cdot (1 - \alpha)\%$
