Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents

Wenhan Xiong¹, Xiaoxiao Guo², Mo Yu², Shiyu Chang², Bowen Zhou³, William Yang Wang¹

¹ University of California, Santa Barbara
² IBM Research
³ JD AI Research

{xwhan,william}@cs.ucsb.edu
Abstract

We investigate the task of learning to follow natural language instructions by jointly reasoning with visual observations and language inputs. In contrast to existing methods, which start with learning from demonstrations (LfD) and then use reinforcement learning (RL) to fine-tune the model parameters, we propose a novel policy optimization algorithm that dynamically schedules demonstration learning and RL. The proposed training paradigm provides more efficient exploration and better generalization than existing methods. Compared to existing ensemble models, the best single model trained with our proposed method reduces the execution error by over 50% in a block-world environment. To further illustrate the exploration strategy of our RL algorithm, we also include systematic studies on the evolution of policy entropy during training.
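To make the scheduling idea concrete, here is a minimal sketch of one such dynamic schedule, not the authors' implementation: a toy categorical policy takes a supervised step on the demonstrated action whenever an episode's reward falls below a hypothetical success threshold, and a policy-gradient step otherwise. The environment, the demo_action oracle, and the 0.5 threshold are all illustrative assumptions.

```python
# Minimal sketch of a dynamically scheduled LfD/RL loop. All names
# (rollout, demo_action, the 0.5 success threshold) are hypothetical
# placeholders, not the authors' implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 8, 4
policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(),
                       nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def rollout():
    """One toy 'episode': a random state, a sampled action, a fake reward.
    Stands in for interacting with the block-world environment."""
    state = torch.randn(N_STATES)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    reward = torch.rand(()).item()  # placeholder environment reward
    return state, dist.log_prob(action), reward

def demo_action(state):
    """Placeholder oracle returning the demonstrated action."""
    return torch.tensor(0)

for step in range(100):
    state, logp, reward = rollout()
    if reward < 0.5:
        # Failed episode: fall back to demonstration learning
        # (cross-entropy against the demonstrated action).
        logits = policy(state).unsqueeze(0)
        loss = nn.functional.cross_entropy(logits,
                                           demo_action(state).unsqueeze(0))
    else:
        # Successful episode: REINFORCE-style policy-gradient step.
        loss = -logp * reward
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the schedule is that the demonstration signal is invoked on demand, per episode, rather than being confined to a fixed pre-training phase.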
1 Introduction

Language is a natural form for humans to express their intention. In recent years, although researchers have successfully built intelligent systems that are able to accomplish complicated tasks [Levine et al., 2016; Silver et al., 2017], few of them are able to cooperate with humans via natural language. To build better AI systems that can safely and robustly work alongside people, it is necessary to teach machines to understand free-form human language instructions and output low-level working actions. This is a challenging task, mainly due to the ambiguity of human language and the complexity of the working environment.
In this work, we aim at developing an intelligent agent that takes as inputs human language instructions as well as environment observations, and finishes the task specified by the instructions in a simulated working environment [Bisk et al., 2016; Misra et al., 2017]. The specific task is illustrated in Figure 1. In order to accomplish the task, the agent should be able to recognize potential obstacles in the environment and move around them. Besides, since the same task may be described by different humans, the agent must also be robust to variations in instruction language.

[Figure 1: Task illustration. The intelligent agent is expected to understand human language instructions and make sequential actions based on its observations of the working environment. Example instruction: "Move the Adidas block to the same column as the Nvidia block, and one and a half rows above the Texaco block."]
Early methods for similar tasks [Chen and Mooney, 2011; Matuszek et al., 2010; Tellex et al., 2011] rely on human-defined spatial or language features to parse the language. Meticulous engineering of the environment domain and the language lexicon is often required. In this work, we focus on developing a neural-network-based model that can be trained end-to-end with minimal domain and linguistic knowledge.
More recently, the task of mapping natural language into low-level actions or programs has been tackled with neural-network-based methods [Mei et al., 2016; Liang et al., 2016]. In the simplest case, a cross-entropy loss can be used to train the model to imitate the human-demonstrated actions. However, the purely supervised model fails to explore the state-action space outside the demonstration path, which undermines the model's generalization ability.
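As an illustration of this supervised baseline, here is a minimal behavioral-cloning sketch; the (state, action) pairs are random placeholders standing in for real demonstration trajectories from the block world.

```python
# Minimal sketch of the purely supervised (imitation) baseline:
# cross-entropy against human-demonstrated actions. The demonstration
# data here are random placeholders, not real trajectories.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 8, 4

# Placeholder demonstrations: (state, demonstrated action) pairs.
states = torch.randn(256, N_STATES)
actions = torch.randint(0, N_ACTIONS, (256,))

policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(),
                       nn.Linear(32, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(20):
    logits = policy(states)
    # Imitate the demonstrated actions; the model only ever sees
    # states that lie on the demonstration path.
    loss = nn.functional.cross_entropy(logits, actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because every gradient comes from states on the demonstration path, the model gets no signal about how to recover once it drifts off that path, which is exactly the generalization failure noted above.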
To develop a model that is able not only to imitate but also to generalize, Misra et al. [2017] apply various deep reinforcement learning (RL) techniques to this task. The RL agent is able to explore more of the state-action space via its stochastic policy (a probability distribution over actions). Since RL from scratch can be highly data-inefficient due to sparse rewards and the large action space, Misra et al. [2017] warm-start the network parameters with several epochs of supervised learning that imitates human actions. The RL algorithm is then adopted to fine-tune the parameters, as sketched below.
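A minimal sketch of this warm-start-then-fine-tune paradigm follows: a REINFORCE-style update on a toy categorical policy, which also logs the policy entropy discussed in the abstract. The environment, reward, and warm start are placeholder assumptions, not the paper's setup.

```python
# Minimal sketch of RL fine-tuning after a supervised warm start:
# a REINFORCE-style update on a toy categorical policy, logging the
# policy entropy. Environment and reward are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
N_STATES, N_ACTIONS = 8, 4
policy = nn.Sequential(nn.Linear(N_STATES, 32), nn.Tanh(),
                       nn.Linear(32, N_ACTIONS))
# ... assume `policy` was warm-started with several epochs of imitation ...
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(100):
    state = torch.randn(N_STATES)
    dist = torch.distributions.Categorical(logits=policy(state))
    action = dist.sample()
    reward = torch.rand(()).item()          # placeholder sparse reward
    loss = -dist.log_prob(action) * reward  # REINFORCE gradient estimator
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:
        # Entropy of the stochastic policy: high means exploratory,
        # low means the distribution has collapsed onto few actions.
        print(f"step {step:3d}  policy entropy {dist.entropy().item():.3f}")
```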
This training paradigm is successful at speeding up training. However, we show by experiments that the supervised pre-training often results in