Associative Learning and Replayed Experience in Reinforcement Learning

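"Associative Learning from Replayed Experience" is a paper by Elliot A. Ludvig, Mahdieh S. Mirian, E. James Kehoe, and Richard S. Sutton. It examines associative learning from replayed experience, an idea closely related to the replay techniques used to improve the stability and sample efficiency of reinforcement learning algorithms, including game-playing agents.

Reinforcement learning is a machine learning approach in which an agent optimizes its policy through interaction with an environment so as to maximize long-term reward. In this paper, the authors extend the Rescorla-Wagner (RW) model, a classic model of associative learning that describes how animals learn from the relationship between conditioned and unconditioned stimuli. The traditional RW model assumes that learning occurs only on the current trial. The new model adds one idea: the animal (or, in machine learning terms, the agent) learns not only from current experience but also stores past trials and replays them. This process resembles experience replay in deep reinforcement learning, where an agent randomly samples past experience to update its policy. During replay, the same learning rule that is applied to real trials (here the RW rule, rather than a Q-learning variant such as DQN) is applied to the replayed trials.

The approach offers a unified theoretical framework for phenomena that were previously hard to explain with a single theory. By analogy, replay may also help with common reinforcement learning problems such as overfitting, low sample efficiency, and training instability. In game settings, replay may be especially useful because game environments tend to have complex dynamics and uncertainty: by replaying past decisions and outcomes, an agent can better capture patterns in the environment, improve its policy, and converge more efficiently toward a good solution. Replay can also help an agent make better decisions in similar situations, since it can learn repeatedly from past failures.

Overall, the paper offers new insight for reinforcement learning by modeling the memory and replay mechanisms of biological learning. The idea is of theoretical interest and may also inform practical applications such as game AI design. Extending learning beyond the current trial opens possibilities for more adaptive machine learning systems.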

Associative Learning & Replay

model: Acquisition trials (X+) are sampled from the trial memory and replayed, leading to further increments in the associative strength. In this simple-acquisition scenario, the additional processing leads to faster learning of the association between X and the reward. As may be apparent, for this simple-acquisition scenario, the additional processing would be invisible. As will be shown in the next sections, however, the impact of the additional processing becomes highly visible as more variables are manipulated.
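The effect of replay on simple acquisition can be illustrated with a minimal sketch. It assumes a single stimulus X, a single learning rate `alpha`, and a fixed number of replays after each real trial (`replays_per_trial`); these names and values are illustrative assumptions, not parameters taken from the paper.

```python
import random


def rw_update(v, outcome, alpha):
    """One Rescorla-Wagner step for a single stimulus: V <- V + alpha * (outcome - V)."""
    return v + alpha * (outcome - v)


def acquire(n_trials, replays_per_trial, alpha=0.1, seed=0):
    """Simulate X+ acquisition with optional replay from a trial memory.

    All parameter names and defaults are hypothetical choices for illustration.
    Returns the associative strength V recorded after each real trial.
    """
    rng = random.Random(seed)
    v = 0.0
    memory = []   # stored outcomes of past trials (1.0 means reinforced, X+)
    history = []
    for _ in range(n_trials):
        outcome = 1.0                      # real X+ trial
        v = rw_update(v, outcome, alpha)
        memory.append(outcome)
        for _ in range(replays_per_trial):
            # Replay: sample a stored trial and apply the same learning rule.
            v = rw_update(v, rng.choice(memory), alpha)
        history.append(v)
    return history


no_replay = acquire(n_trials=20, replays_per_trial=0)
with_replay = acquire(n_trials=20, replays_per_trial=3)
```

Because every stored trial here is X+, each replay only moves V further toward 1. In this sketch, replay is therefore equivalent to a larger effective learning rate, which is one way to read the remark that the extra processing is hard to see in simple acquisition.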

Spontaneous Recovery

Spontaneous recovery has proven to be particularly difficult to reconcile with the RW model and similar associative accounts of learning (Bouton, 1993; Pavlov, 1927; Kehoe, 1988; Rescorla, 2004; Sissons & Miller, 2009). In a spontaneous recovery experiment, acquisition training using reinforced presentations of a conditioned stimulus (as above) is followed by extinction training containing stimulus-alone presentations. During the acquisition training, animals learn to respond to the stimulus (see Fig 2), but then they progressively cease responding during extinction training. If, however, after the end of extinction training, the stimulus is re-presented to the animals after a delay, the extinguished response reappears, sometimes at nearly full strength (Kehoe, 2006; Napier et al., 1992). The degree of recovery increases as the delay between the end of extinction training and recovery testing is increased (e.g., Haberlandt et al., 1978; see Figure 3C).

According to the RW model, at the end of extinction, the associative strength (V) approaches 0. There is no mechanism for any change in this associative strength during the intervening interval before the recovery test. Thus, the original RW model cannot account for spontaneous recovery. Within the wider associative framework, numerous explanations for spontaneous recovery have been suggested, including different decay rates for excitatory and inhibitory processes (Bouton, 1993; Pan et al., 2008), different decay rates for the first and second things learned (Bouton, 1993; Devenport, 1998; Sissons & Miller, 2009), changes in the sampled stimulus characteristics over time (Estes, 1955), and the inference of different underlying states of the world for acquisition and extinction (Gershman, Blei & Niv, 2010; Gershman & Niv, 2012).

bioRxiv preprint first posted online Jan. 16, 2017; doi: http://dx.doi.org/10.1101/100800. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.

The replay model augments the RW model by assuming that during the delay between the end of extinction and the recovery test, both acquisition trials and extinction trials stored in the trial memory are replayed randomly. As a result of the replays of the acquisition trials, the associative strength recovers to an intermediate value. Figure 3 shows a simulation of spontaneous recovery. In the simulation, the replay model was first trained with 100 acquisition trials (X+) followed by 100 extinction trials (X-). Associative strength increases to an asymptote near 1 during acquisition and then decreases toward 0 during extinction (Fig. 3A).
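The recovery mechanism described above can be sketched in a few lines. This is a simplified illustration, not the authors' simulation code: it assumes a single learning rate `alpha`, replay only during the post-extinction delay, and uniform sampling from the trial memory. The function name and all parameter values other than the 100 + 100 trial counts are assumptions.

```python
import random


def rw_update(v, outcome, alpha):
    """One Rescorla-Wagner step for a single stimulus."""
    return v + alpha * (outcome - v)


def simulate_recovery(n_acq=100, n_ext=100, n_replays=2000, alpha=0.05, seed=0):
    """Acquisition (X+), extinction (X-), then replay-only delay.

    alpha, n_replays and seed are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    v = 0.0
    memory = []          # trial memory: 1.0 for X+ trials, 0.0 for X- trials
    v_after_acq = None

    # Real training trials: 100 reinforced, then 100 stimulus-alone.
    for outcome in [1.0] * n_acq + [0.0] * n_ext:
        v = rw_update(v, outcome, alpha)
        memory.append(outcome)
        if len(memory) == n_acq:
            v_after_acq = v
    v_after_ext = v

    # Delay: no new trials, only replays drawn uniformly from the memory.
    replay_vs = []
    for _ in range(n_replays):
        v = rw_update(v, rng.choice(memory), alpha)
        replay_vs.append(v)
    return v_after_acq, v_after_ext, replay_vs


v_acq, v_ext, replay_vs = simulate_recovery()
# With a constant learning rate, V fluctuates from replay to replay, so the
# trend is summarized by averaging over the later replays.
late_mean = sum(replay_vs[-1000:]) / 1000
```

In this sketch, `v_acq` ends near 1 and `v_ext` near 0, while `late_mean` settles at an intermediate value because the memory holds equal numbers of X+ and X- trials. Changing `n_acq` or `n_ext` shifts the proportion of reinforced trials in the memory and therefore the level that replay drives V toward.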

At the end of the extinction phase, the model is not given any further training trials, but is allowed to continue replaying previously experienced trials. In this simplest formulation of the replay model, the trials replayed only depend on the frequency with which those trial types have appeared in the past. Because both acquisition trials and extinction trials appeared equally often in the initial training, with further replay (i.e., more time), the associative strength of stimulus X approaches 0.5 (Fig. 3B). This pattern resembles the degree of recovery exhibited by the conditioned eyeblink response in rabbits as the delay between extinction and the recovery test is increased (Haberlandt et al., 1978; Fig. 3C). The replay model makes some clearly testable predictions: for example, the degree of spontaneous recovery should depend on the relative number of acquisition and extinction trials. With an increased number of extinction trials,
