Interactive Learning from Policy-Dependent Human Feedback: The COACH Algorithm and Robot Behavior Optimization


Interactive Learning from Policy-Dependent Human Feedback

James MacGlashan

Mark K Ho

Robert Loftin

Bei Peng

Guan Wang

David L. Roberts

Matthew E. Taylor

Michael L. Littman

Abstract

This paper investigates the problem of interactively learning behaviors communicated by a human teacher using positive and negative feedback. Much previous work on this problem has made the assumption that people provide feedback for decisions that is dependent on the behavior they are teaching and is independent from the learner's current policy. We present empirical results that show this assumption to be false: whether human trainers give positive or negative feedback for a decision is influenced by the learner's current policy. Based on this insight, we introduce Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent feedback that converges to a local optimum. Finally, we demonstrate that COACH can successfully learn multiple behaviors on a physical robot.

1. Introduction

Programming robots is very difficult, in part because the real world is inherently rich and, to some degree, unpredictable. In addition, our expectations for physical agents are quite high and often difficult to articulate. Nevertheless, for robots to have a significant impact on the lives of individuals, even non-programmers need to be able to specify and customize behavior. Because of these complexities, relying on end-users to provide instructions to robots programmatically seems destined to fail.

Reinforcement learning (RL) from human trainer feedback provides a compelling alternative to programming because agents can learn complex behavior from very simple positive and negative signals. Furthermore, real-world animal training is an existence proof that people can train complex behavior using these simple signals. Indeed, animals have been successfully trained to guide the blind, locate mines in the ocean, detect cancer or explosives, and even solve complex, multi-stage puzzles.

Equal contribution. Cogitai; Brown University; North Carolina State University; Washington State University. Correspondence to: James MacGlashan <james@cogitai.com>.

Proceedings of the 34th International Conference on Machine

by the author(s).

Despite success when learning from environmental reward, traditional reinforcement-learning algorithms have yielded limited success when the reward signal is provided by humans. This failure underscores the importance of basing algorithms for learning from humans on appropriate models of human feedback. Indeed, much human-centered RL work has investigated and employed different models of human feedback (Knox & Stone, 2009b; Thomaz & Breazeal, 2006; 2007; 2008; Griffith et al., 2013; Loftin et al., 2015). Many of these algorithms leverage the observation that people tend to give feedback that is best interpreted as guidance on the policy the agent should be following, rather than as a numeric value to be maximized by the agent. However, these approaches assume models of feedback that are independent of the policy the agent is currently following. We present empirical results that demonstrate that this assumption is incorrect and further demonstrate cases in which policy-independent learning algorithms suffer from this assumption. Following this result, we present Convergent Actor-Critic by Humans (COACH), an algorithm for learning from policy-dependent human feedback. COACH is based on the insight that the advantage function (a value roughly corresponding to how much better or worse an action is compared to the current policy) provides a better model of human feedback, capturing human-feedback properties like diminishing returns, rewarding improvement, and giving 0-valued feedback a semantic meaning that combats forgetting. We compare COACH to other approaches in a simple domain with simulated feedback. Then, to validate that COACH scales to complex problems, we train five different behaviors on a TurtleBot robot.
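To make the advantage-function idea concrete, here is a minimal Python sketch; it is not the authors' implementation. It assumes the standard definition A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted average of the Q-values, and it uses a hypothetical one-step task with two actions. The `advantages` helper and the Q-values are illustrative names and numbers, not taken from the paper.

```python
from typing import List


def advantages(q_values: List[float], policy: List[float]) -> List[float]:
    """Return A(s, a) = Q(s, a) - V(s) for every action in one state.

    V(s) is the expected Q-value under the current policy.
    """
    if len(q_values) != len(policy):
        raise ValueError("q_values and policy must have the same length")
    state_value = sum(p * q for p, q in zip(policy, q_values))
    return [q - state_value for q in q_values]


# Hypothetical one-step task: action 0 is the behavior being taught,
# action 1 is the wrong behavior. These Q-values are assumed for illustration.
Q_VALUES = [1.0, 0.0]

# Evaluate the same two actions under progressively better policies.
for p_good in (0.5, 0.9, 1.0):
    policy = [p_good, 1.0 - p_good]
    adv = advantages(Q_VALUES, policy)
    print(f"P(good action) = {p_good:.1f} -> advantages = "
          f"[{adv[0]:+.2f}, {adv[1]:+.2f}]")
```

In this sketch, as the probability of the taught action rises from 0.5 to 1.0, its advantage shrinks from 0.5 toward 0, which mirrors the diminishing-returns property described above, while the wrong action's advantage stays negative.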

2. Background

For modeling the underlying decision-making problem of an agent being taught by a human, we adopt the Markov Decision Process (MDP) formalism. An MDP is a 5-tuple: ⟨S, A, T, R, γ⟩, where S is the set of possible states of the
