Scalable Operator Allocation for Multi-Robot Assistance: A Restless Bandit Approach


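The paper "Scalable Operator Allocation for Multi-Robot Assistance: A Restless Bandit Approach" was published in IEEE Transactions on Control of Network Systems in 2021. Its authors are Abhinav Dahiya, Nima Akbarzadeh, Aditya Mahajan, and Stephen L. Smith.

The paper addresses the allocation of human operators in a multi-robot system. In such a system, several semi-autonomous robots each carry out a sequence of independent tasks, and a robot may fail during any task and enter a fault state. When needed, a human operator can assist or teleoperate a robot. Conventional Markov decision process (MDP) techniques scale poorly on this problem because the state and action spaces grow exponentially with the number of robots and operators.

To overcome this, the paper proposes a restless bandit approach. In bandit theory, "restless" means that the state of each arm (each available choice) keeps evolving even when the arm is not selected, which matches the fact that a robot can enter a fault state whether or not an operator is assisting it. The authors derive conditions under which the operator allocation problem is indexable, which allows the Whittle index heuristic to be applied. These conditions are easy to check, and the paper shows that they hold for a broad range of problem settings. The key insight is to exploit the structure of the value function of a single robot, which yields conditions that can be verified separately. As a result, the complexity of the problem no longer grows exponentially with the number of robots, which makes the algorithm scalable. Operators can then be assigned more effectively to the robots that need them, improving overall system performance. The paper thus offers a restless-bandit-based solution to operator allocation in large multi-robot systems that addresses the scalability limits of conventional MDP methods.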

DAHIYA et al.: OPERATOR ALLOCATION FOR MULTI-ROBOT ASSISTANCE 3

B. Organization

The contents of this paper are organized as follows. The multi-robot assistance problem is presented in Section II. We discuss the general restless bandit problem and define the property of indexability in Section III. In Section V, we present two practical classes of transition functions and establish conditions under which indexability of the problem is ensured. In Section VI, we cover the calculation of the Whittle index heuristic and present an efficient policy for the problem. Next, we present simulations of the problem in Section VII to examine the validity and performance of the presented policy. The paper ends with a brief discussion and conclusion.

II. MULTI-ROBOT ASSISTANCE PROBLEM

Consider a decision support system (DSS) consisting of a team of $M$ human operators supervising a fleet of $K$ semi-autonomous robots. Each robot $k \in \mathcal{K} = \{1, \dots, K\}$ is required to complete a sequence of $N^k$ tasks to reach its goal. We will use a fleet of robots delivering packages in a city as a running example, but similar interpretations hold for other applications mentioned in previous sections (e.g., robots reaching a sequence of configurations [12]). In this case, the robot's trajectory would correspond to a series of waypoints that a robot needs to navigate to reach its destination (goal location). At each waypoint, a robot can either operate autonomously or be teleoperated by one of the human operators. We assume that all human operators are identical in the way they operate the robots and that a human operator can operate at most one robot at a time. We now provide a mathematical model for the different components of the system.

A. Model of the robots

It is assumed that when operating autonomously, each robot uses a pre-specified control algorithm to complete its task. For the delivery robot example, this could be, for instance, a SLAM-based local path planner that the robot uses for navigating between the waypoints. We will not model the details of this control algorithm but simply assume that this control is imperfect and occasionally causes the robot to enter a fault state while doing a task (e.g., a delivery robot getting stuck in a pothole or losing its localization). We model this behaviour by assuming that while completing each task, the robot may be in one of two internal states: a normal state (denoted by $s = 0$) or a fault state (denoted by $s = 1$). When a robot is being teleoperated, it may still be possible for it to enter a fault state.

The operating state of robot $k \in \mathcal{K}$ at time $t$, denoted by $x^k_t = (n^k_t, s^k_t)$, is a tuple of its current task and internal state. The state space for robot $k$ is given by

$$\mathcal{X}^k = \bigcup_{n=1}^{N^k} \{(n, 0), (n, 1)\} \cup \{(G, 0)\},$$

where the terminal state $(G, 0)$ indicates that all tasks have been completed. The state space for all robots is denoted by $\boldsymbol{\mathcal{X}} = \mathcal{X}^1 \times \cdots \times \mathcal{X}^K$.

Remark on notation: Throughout this paper, we use calligraphic font to denote sets and roman font to denote variables. Uppercase letters are used to represent random variables and the corresponding lowercase letters represent their realizations. Bold letters are used for variables pertaining to the multi-robot system while light letters represent the corresponding single-robot variables.
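To make the growth of the joint state space concrete, the following Python sketch enumerates the states of a single robot and counts the joint states of a fleet. The function names `single_robot_states` and `joint_state_count`, and the use of the string `"G"` for the terminal task, are illustrative choices and are not taken from the paper.

```python
def single_robot_states(num_tasks):
    """List the states (n, s) for n = 1..num_tasks and s in {0, 1},
    followed by the terminal state ("G", 0)."""
    states = [(n, s) for n in range(1, num_tasks + 1) for s in (0, 1)]
    states.append(("G", 0))
    return states


def joint_state_count(tasks_per_robot):
    """Size of the product state space for a fleet, given the number
    of tasks assigned to each robot."""
    count = 1
    for num_tasks in tasks_per_robot:
        count *= len(single_robot_states(num_tasks))
    return count


if __name__ == "__main__":
    # A robot with 3 tasks has 2 * 3 + 1 = 7 states.
    print(len(single_robot_states(3)))
    # Ten robots with 5 tasks each: 11 ** 10 joint states.
    print(joint_state_count([5] * 10))
```

Each robot contributes only $2N^k + 1$ states, but the joint count is the product over robots, which is the exponential growth that motivates a per-robot decomposition.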

The state of a robot evolves differently depending on whether it is operating autonomously (denoted by mode $a^k = 0$) or teleoperated (denoted by $a^k = 1$). Given robot $k \in \mathcal{K}$ in state $(n, s) \in \mathcal{X}^k$ operating in mode $a \in \{0, 1\}$, let $p^k_a(n, s)$ denote the probability of successfully completing the current task at the current time step and let $q^k_a(n, s)$ denote the probability of toggling the current internal state (i.e., going from the normal to the fault state and vice versa). A diagram describing these transitions is shown in Fig. 2. Note that the terminal state $(G, 0)$ is an absorbing state, so $p^k_a(G, 0) = 0$ and $q^k_a(G, 0) = 0$.

Fig. 2. State-transition diagram for robot $k$ working on task $n$, where in $(n, 0)$ the robot is in the normal state $s = 0$, and in $(n, 1)$ the robot is in the fault state $s = 1$. Transitions can occur between $(n, 0)$, $(n, 1)$, and $(n + 1, 0)$, and the probabilities change with operating mode $a$.
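The single-robot dynamics described above can be sketched as a one-step sampler in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that task completion and state toggling are mutually exclusive events in a time step (so the two probabilities sum to at most one), that completing a task moves the robot to the normal state of the next task, and that completing the last task leads to the terminal state. The names `step` and `GOAL` are hypothetical.

```python
import random

GOAL = ("G", 0)  # terminal (absorbing) state


def step(state, p, q, num_tasks, rng=random):
    """Sample the next state of one robot.

    state:     (n, s) with task index n and internal state s, or GOAL
    p:         probability of completing the current task this step
    q:         probability of toggling the internal state this step
    num_tasks: number of tasks the robot must complete
    Assumes completion and toggling are mutually exclusive (p + q <= 1).
    """
    if p < 0 or q < 0 or p + q > 1:
        raise ValueError("need p >= 0, q >= 0 and p + q <= 1")
    if state == GOAL:
        return GOAL  # absorbing: nothing changes after the goal
    n, s = state
    u = rng.random()  # uniform draw in [0, 1)
    if u < p:
        # Task completed: next task in the normal state, or the goal.
        return (n + 1, 0) if n < num_tasks else GOAL
    if u < p + q:
        # Internal state toggles: normal <-> fault.
        return (n, 1 - s)
    return (n, s)  # no change this step


if __name__ == "__main__":
    state = (1, 0)
    while state != GOAL:
        state = step(state, p=0.6, q=0.2, num_tasks=3)
        print(state)
```

In a full model, `p` and `q` would be looked up from the robot's state and operating mode before each call; they are passed in directly here to keep the sketch small.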

There is a per-step cost $C^k : \mathcal{X}^k \times \{0, 1\} \to \mathbb{R}_{\ge 0}$, where $C^k((n^k, s^k), a^k)$ denotes the cost of operating robot $k \in \mathcal{K}$ in mode $a^k$ when the robot is in state $(n^k, s^k)$. Note that the per-step cost is zero in the terminal state, i.e., $C^k((G, 0), a) = 0$.

B. Model of the decision support system (DSS)

There is a decision support system that helps to allocate operators to the robots. At each time $t$, the decision support system observes the operating state $\mathbf{X}_t = (X^1_t, \dots, X^K_t)$ of all robots and picks at most $M$ robots to teleoperate. We capture this by the allocation $\mathbf{A}_t = (A^1_t, \dots, A^K_t) \in \mathcal{A}$, where

$$\mathcal{A} = \Big\{ \mathbf{a} = (a^1, \dots, a^K) \in \{0, 1\}^K : \sum_{k=1}^{K} a^k \le M \Big\}. \tag{1}$$
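As an illustration of the constraint in (1), the following Python sketch enumerates every feasible allocation for a small fleet and compares the count with the closed form $\sum_{j=0}^{M} \binom{K}{j}$. The helper names `feasible_allocations` and `allocation_count` are hypothetical and chosen only for this example.

```python
from itertools import combinations
from math import comb


def feasible_allocations(num_robots, num_operators):
    """Return all binary vectors of length num_robots with at most
    num_operators entries equal to 1 (1 = teleoperated)."""
    allocations = []
    max_assigned = min(num_operators, num_robots)
    for size in range(max_assigned + 1):
        for chosen in combinations(range(num_robots), size):
            picked = set(chosen)
            allocations.append(
                tuple(1 if k in picked else 0 for k in range(num_robots))
            )
    return allocations


def allocation_count(num_robots, num_operators):
    """Closed-form number of feasible allocations."""
    max_assigned = min(num_operators, num_robots)
    return sum(comb(num_robots, j) for j in range(max_assigned + 1))


if __name__ == "__main__":
    for a in feasible_allocations(3, 1):
        print(a)
    # The action set grows quickly with the fleet size.
    print(allocation_count(20, 5))
```

Explicit enumeration is practical only for small fleets, which is one reason a joint MDP formulation becomes expensive as the number of robots and operators grows.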

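The allocation is selected according to a time-homogeneous Markov policy $\pi : \boldsymbol{\mathcal{X}} \to \mathcal{A}$. The expected total cost incurred by any policy $\pi$ is given by

$$J(\pi) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t \sum_{k=1}^{K} C^k(X^k_t, A^k_t) \,\Big|\, \mathbf{X}_0 = \mathbf{x}_0 \Big], \tag{2}$$

where $\gamma \in (0, 1)$ is the discount factor and $\mathbf{x}_0 = (x^1_0, \dots, x^K_0)$ is the initial state with $x^k_0 = (1, 0)$ for every $k \in \mathcal{K}$.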
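For a single sample path, the quantity inside the expectation in (2) is a discounted sum of per-step costs over all robots. The Python sketch below computes that sum for a recorded trajectory. The function name `discounted_total_cost` and the input layout (one list of per-robot costs per time step) are assumptions made for illustration. Estimating $J(\pi)$ itself would additionally require averaging over many simulated paths.

```python
def discounted_total_cost(step_costs, gamma):
    """Discounted cost of one trajectory.

    step_costs: list over time steps t = 0, 1, ...; each entry is a
                list with the cost incurred by every robot at step t
    gamma:      discount factor in (0, 1)
    """
    if not 0 < gamma < 1:
        raise ValueError("gamma must lie in (0, 1)")
    total = 0.0
    discount = 1.0  # equals gamma ** t at step t
    for costs_at_t in step_costs:
        total += discount * sum(costs_at_t)
        discount *= gamma
    return total


if __name__ == "__main__":
    # Two robots, three steps; both robots have reached the goal by t = 2.
    path = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
    print(discounted_total_cost(path, gamma=0.5))
```

Because the per-step cost is zero in the terminal state, truncating the sum once every robot has reached its goal loses nothing for that path.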

C. Problem objective

We impose the following assumptions on the model:

(A1) Given an allocation $\mathbf{a} = (a^1, \dots, a^K)$ by the DSS, the operating states of the robots evolve independently of each other.
