Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents

Wenhan Xiong¹, Xiaoxiao Guo², Mo Yu², Shiyu Chang², Bowen Zhou³, William Yang Wang¹

¹ University of California, Santa Barbara
² IBM Research
³ JD AI Research

{xwhan,william}@cs.ucsb.edu
Abstract

We investigate the task of learning to follow natural language instructions by jointly reasoning with visual observations and language inputs. In contrast to existing methods, which start with learning from demonstrations (LfD) and then use reinforcement learning (RL) to fine-tune the model parameters, we propose a novel policy optimization algorithm that dynamically schedules demonstration learning and RL. The proposed training paradigm provides more efficient exploration and better generalization than existing methods. Compared to existing ensemble models, the best single model trained with our proposed method reduces the execution error by over 50% in a block-world environment. To further illustrate the exploration strategy of our RL algorithm, we also include systematic studies on the evolution of policy entropy during training.
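To make the scheduling idea concrete, here is a minimal sketch of one such dynamic schedule, not the authors' implementation: a toy categorical policy takes a supervised step on the demonstrated action whenever an episode's reward falls below a hypothetical success threshold, and a policy-gradient step otherwise. The environment, the demo_action oracle, and the 0.5 threshold are all illustrative assumptions.

```python
# Minimal sketch of a dynamically scheduled LfD/RL loop. All names
# (rollout, demo_action, the 0.5 success threshold) are hypothetical
# placeholders, not the authors' implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 8, 4
policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(),
                       nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def rollout():
    """One toy 'episode': a random state, a sampled action, a fake reward.
    Stands in for interacting with the block-world environment."""
    state = torch.randn(N_STATES)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    reward = torch.rand(()).item()  # placeholder environment reward
    return state, dist.log_prob(action), reward

def demo_action(state):
    """Placeholder oracle returning the demonstrated action."""
    return torch.tensor(0)

for step in range(100):
    state, logp, reward = rollout()
    if reward < 0.5:
        # Failed episode: fall back to demonstration learning
        # (cross-entropy against the demonstrated action).
        logits = policy(state).unsqueeze(0)
        loss = nn.functional.cross_entropy(logits,
                                           demo_action(state).unsqueeze(0))
    else:
        # Successful episode: REINFORCE-style policy-gradient step.
        loss = -logp * reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the schedule is that the demonstration signal is invoked on demand, per episode, rather than being confined to a fixed pre-training phase.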
1 Introduction

Language is a natural form for humans to express their intention. In recent years, although researchers have successfully built intelligent systems that are able to accomplish complicated tasks [Levine et al., 2016; Silver et al., 2017], few of them are able to cooperate with humans via natural language. To build better AI systems that can safely and robustly work alongside people, it is necessary to teach machines to understand free-form human language instructions and output low-level working actions. This is a challenging task, mainly due to the ambiguity of human language and the complexity of the working environment.
In this work, we aim at developing an intelligent agent that takes as inputs human language instructions as well as environment observations, and finishes the task specified by the instructions in a simulated working environment [Bisk et al., 2016; Misra et al., 2017]. The specific task is illustrated in Figure 1. In order to accomplish the task, the agent should be able to recognize potential obstacles in the environment and move around them. Besides, since the same task may be described by different humans, the agent must also be robust to variations in instruction language.

[Figure 1: Task illustration. The intelligent agent is expected to understand human language instructions and make sequential actions based on its observations of the working environment. Example instruction: "Move the Adidas block to the same column as the Nvidia block, and one and a half rows above the Texaco block."]
Early methods for similar tasks [Chen and Mooney, 2011; Matuszek et al., 2010; Tellex et al., 2011] rely on human-defined spatial or language features to parse the language. Meticulous engineering of the environment domain and the language lexicon is often required. In this work, we focus on developing a neural-network-based model that can be trained end-to-end with minimal domain and linguistic knowledge.
More recently, the task of mapping natural language into low-level actions or programs has been tackled with neural-network-based methods [Mei et al., 2016; Liang et al., 2016]. In the simplest case, a cross-entropy loss can be used to train the model to imitate the human-demonstrated actions. However, the purely supervised model fails to explore the state-action space outside the demonstration path, which undermines the model's generalization ability.
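As an illustration of this supervised baseline, here is a minimal behavioral-cloning sketch; the (state, action) pairs are random placeholders standing in for real demonstration trajectories from the block world.

```python
# Minimal sketch of the purely supervised (imitation) baseline:
# cross-entropy against human-demonstrated actions. The demonstration
# data here are random placeholders, not real trajectories.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 8, 4

# Placeholder demonstrations: (state, demonstrated action) pairs.
states = torch.randn(256, N_STATES)
actions = torch.randint(0, N_ACTIONS, (256,))

policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(),
                       nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(20):
    logits = policy(states)
    # Imitate the demonstrated actions; the model only ever sees
    # states that lie on the demonstration path.
    loss = nn.functional.cross_entropy(logits, actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because every gradient comes from states on the demonstration path, the model gets no signal about how to recover once it drifts off that path, which is exactly the generalization failure noted above.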
To develop a model that is able not only to imitate but also to generalize, Misra et al. [2017] apply various deep reinforcement learning (RL) techniques to this task. The RL agent is able to explore more of the state-action space via its stochastic policy (a probability distribution over actions). Since RL from scratch can be highly data-inefficient due to sparse rewards and the large action space, Misra et al. [2017] warm-start the network parameters with several epochs of supervised learning that imitates human actions. The RL algorithm is then adopted to fine-tune the parameters, as sketched below.
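A minimal sketch of this warm-start-then-fine-tune paradigm follows: a REINFORCE-style update on a toy categorical policy, which also logs the policy entropy discussed in the abstract. The environment, reward, and warm start are placeholder assumptions, not the paper's setup.

```python
# Minimal sketch of RL fine-tuning after a supervised warm start:
# a REINFORCE-style update on a toy categorical policy, logging the
# policy entropy. Environment and reward are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 8, 4
policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(),
                       nn.Linear(32, N_ACTIONS))
# ... assume `policy` was warm-started with several epochs of imitation ...
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(100):
    state = torch.randn(N_STATES)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    reward = torch.rand(()).item()          # placeholder sparse reward
    loss = -dist.log_prob(action) * reward  # REINFORCE gradient estimator
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:
        # Entropy of the stochastic policy: high means exploratory,
        # low means the distribution has collapsed onto few actions.
        print(f"step {step:3d}  policy entropy {dist.entropy().item():.3f}")
```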
This training paradigm is successful at speeding up training. However, we show by experiments that the supervised pre-training often results in