$\mathrm{output}_i \leftarrow \mathrm{Linear}(\mathrm{concat}(h_i^{\mathrm{in}}, \mathrm{result}_i, (x_i, y_i, z_i), s_{\mathrm{robot}}))$. In practice, we use multiple query heads per block, so that the size of each $\mathrm{result}_i$ will be proportional to the number of query heads.
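To make the role of the query heads concrete, here is a NumPy sketch of one such attention block; the weight names, scaling, and dimensions are our own illustrative assumptions rather than the model's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def neighborhood_attention_block(h, pos, s_robot, W_q, W_k, W_v, W_out):
    """One neighborhood-attention block over B blocks (illustrative).

    h       : (B, D)   per-block input embeddings h_i^in
    pos     : (B, 3)   (x_i, y_i, z_i) position of each block
    s_robot : (R,)     robot state
    W_q, W_k, W_v : (H, D, Dk)  one projection per query head
    W_out   : (D + H*Dk + 3 + R, D_out)  final linear projection
    """
    B, D = h.shape
    H, _, Dk = W_q.shape
    results = []
    for head in range(H):
        q = h @ W_q[head]                            # (B, Dk): query per block
        k = h @ W_k[head]                            # (B, Dk)
        v = h @ W_v[head]                            # (B, Dk)
        w = softmax(q @ k.T / np.sqrt(Dk), axis=-1)  # attend over all blocks
        results.append(w @ v)                        # (B, Dk): result_i per head
    result = np.concatenate(results, axis=-1)        # grows with number of heads
    robot = np.tile(s_robot, (B, 1))                 # broadcast robot state
    # output_i = Linear(concat(h_i^in, result_i, (x_i, y_i, z_i), s_robot))
    return np.concatenate([h, result, pos, robot], axis=-1) @ W_out
```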
4.2 Context network
The context network is the crux of our model. It processes both the current state and the embedding
produced by the demonstration network, and outputs a context embedding, whose dimension does
not depend on the length of the demonstration, or the number of blocks in the environment. Hence, it
is forced to capture only the relevant information, which will be used by the manipulation network.
Attention over demonstration: The context network starts by computing a query vector as a function
of the current state, which is then used to attend over the different time steps in the demonstration
embedding. The attention weights over different blocks within the same time step are summed
together, to produce a single weight per time step. The result of this temporal attention is a vector
whose size is proportional to the number of blocks in the environment. We then apply neighborhood
attention to propagate the information across the embeddings of each block. This process is repeated
multiple times, where the state is advanced using an LSTM cell with untied weights.
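A minimal sketch of one plausible reading of this temporal attention step follows; the shapes, and the exact point at which per-block weights are summed, are our own assumptions:

```python
import numpy as np

def temporal_attention(query, demo_emb):
    """Illustrative temporal attention over the demonstration embedding.

    query    : (D,)       computed from the current state
    demo_emb : (T, B, D)  T time steps, B blocks, D features per block
    Returns a vector whose size is proportional to the number of blocks.
    """
    T, B, D = demo_emb.shape
    scores = demo_emb @ query                 # (T, B): one score per block per step
    flat = scores.reshape(-1)
    w = np.exp(flat - flat.max())
    w = (w / w.sum()).reshape(T, B)           # attention weights over all entries
    w_step = w.sum(axis=1)                    # (T,): sum block weights within a step
    attended = np.einsum('t,tbd->bd', w_step, demo_emb)  # weighted sum over time
    return attended.reshape(-1)               # (B * D,): one vector per block
```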
Attention over current state: The previous operations produce an embedding whose size is inde-
pendent of the length of the demonstration, but still dependent on the number of blocks. We then
apply standard soft attention over the current state to produce fixed-dimensional vectors, where the
memory content consists only of the positions of each block. Together with the robot's state, these
form the context embedding, which is then passed to the manipulation network.
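A rough sketch of this soft attention read, again with illustrative shapes; the use of per-block embeddings as keys is our assumption, while the memory content (block positions) follows the description above:

```python
import numpy as np

def state_attention(query, block_keys, block_pos, s_robot):
    """Illustrative soft attention over the current state.

    query      : (Dk,)    derived from the previous attention stages
    block_keys : (B, Dk)  per-block embeddings used as attention keys
    block_pos  : (B, 3)   memory content: the position of each block
    s_robot    : (R,)     the robot's own state

    Returns a context vector whose size does not depend on B.
    """
    scores = block_keys @ query
    w = np.exp(scores - scores.max())
    w = w / w.sum()                     # soft attention weights over blocks
    attended_pos = w @ block_pos        # (3,): weighted combination of positions
    return np.concatenate([attended_pos, s_robot])  # fixed-size context embedding
```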
Intuitively, although the number of objects in the environment may vary, at each stage of the
manipulation operation, the number of relevant objects is small and usually fixed. For the block
stacking environment specifically, the robot should only need to pay attention to the position of the
block it is trying to pick up (the source block), as well as the position of the block it is trying to place
on top of (the target block). Therefore, a properly trained network can learn to match the current
state with the corresponding stage in the demonstration, and infer the identities of the source and
target blocks expressed as soft attention weights over different blocks, which are then used to extract
the corresponding positions to be passed to the manipulation network. Although we do not enforce
this interpretation during training, our experimental analysis supports this account of how the learned
policy works internally.
4.3 Manipulation network
The manipulation network is the simplest component. After extracting the information of the source
and target blocks, it computes the action needed to complete the current stage of stacking one block
on top of another one, using a simple MLP network.¹ This division of labor opens up the possibility
of modular training: the manipulation network may be trained to complete this simple procedure
without any knowledge of demonstrations, or of environments containing more than two blocks. We
leave this possibility for future work.
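A minimal sketch of what such an MLP might look like; the layer count and parameter names are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def manipulation_network(context, params):
    """Illustrative manipulation network: a plain MLP mapping the fixed-size
    context embedding (source/target block positions plus robot state) to an
    action for the current stage. Two hidden layers are an assumption; the
    text only specifies a simple MLP."""
    h1 = relu(context @ params["W1"] + params["b1"])
    h2 = relu(h1 @ params["W2"] + params["b2"])
    return h2 @ params["W3"] + params["b3"]
```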
5 Experiments
We conduct experiments with the block stacking tasks described in Section 3.2.² These experiments
are designed to answer the following questions:
• How does training with behavioral cloning compare with DAGGER?
• How does conditioning on the entire demonstration compare to conditioning on the final state, even when it already has enough information to fully specify the task?
• How does conditioning on the entire demonstration compare to conditioning on a “snapshot” of the trajectory, which is a small subset of frames that are most informative?
¹ In principle, one can replace this module with an RNN module. But we did not find this necessary for the tasks we consider.
² Additional experimental results are available in the Appendix, including a simple illustrative example of particle reaching tasks and further analysis of block stacking.