Abstract—In this paper, a mutual information (MI) and Extreme Learning Machine (ELM) based inverse reinforcement learning (IRL) algorithm, termed MEIRL, is proposed to construct nonlinear reward functions. The basic idea of MEIRL is that, as in GPIRL, the reward function is learned with a Gaussian process and the importance of each feature is obtained through automatic relevance determination (ARD). Mutual information is then employed to evaluate the impact of each feature on the reward function, and on this basis an extreme learning machine is introduced, together with an adaptive model construction procedure, to choose an optimal subset of features and thereby improve the performance of the original GPIRL algorithm. Furthermore, to demonstrate the effectiveness of MEIRL, a highway driving simulation is constructed. The simulation results show that MEIRL is comparable with state-of-the-art IRL algorithms in terms of generalization capability, but more efficient when the number of features is large.
I. INTRODUCTION
Reinforcement learning (RL) techniques solve problems through an agent that acquires experience by interacting with a dynamic environment [1-3]. The result is a policy that can accomplish complex tasks without specific instructions on how the tasks are to be achieved. However, RL requires the reward function to be specified in advance, and the difficulty of designing it motivated the introduction of inverse reinforcement learning (IRL), in which the reward function is derived from an expert's demonstrations [4-6].
Although the original IRL algorithms can largely achieve their objectives and their effectiveness has been demonstrated in several applications, they still rest on restrictive assumptions and constraints [7]. For example, the expert's demonstrations are assumed to be optimal, which is usually not the case when the demonstrations are noisy and imperfect. Moreover, the original IRL problem is ill-posed, so it is difficult to obtain a reasonable solution. These drawbacks restrict the range of applications of IRL.
To address the problems mentioned above, many researchers have contributed further refinements of IRL, some of which are summarized below.
Maximum margin planning (MMP) uses ideas similar to those of the original IRL algorithm [8]: the solver attempts to
make the demonstrations look better than any other solution by a margin, and it addresses the ill-posed problem by introducing proper loss functions. In addition, Ramachandran pointed out that the uncertainty in the recovered reward function is a main cause of the ill-posedness of IRL. Bayesian IRL was therefore proposed to model this uncertainty; it approaches the problem from a probability-distribution perspective and is able to deal with noisy and incomplete observations. Similar to Bayesian IRL, maximum entropy IRL also resolves the IRL problem probabilistically [10]. Differently from the former, however, it employs the principle of maximum entropy, which gives the least biased estimate consistent with the given information. The algorithm was applied to a set of GPS data collected from taxi drivers and recovered a reward function that can be used for route recommendation and for predicting drivers' behavior. Although these IRL algorithms mitigate the ill-posed problem to a certain extent, they still suffer from drawbacks: in particular, they generally learn the reward as a linear combination of features, which can be wrong when the true reward function is nonlinear, as illustrated below.
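As a concrete illustration of this limitation (the notation below is introduced here for exposition and does not appear in [10]), maximum entropy IRL assigns each demonstrated trajectory \zeta a probability that is exponential in its cumulative reward, with the reward assumed linear in the state features f_s:

P(\zeta \mid \theta) \propto \exp\Big(\sum_{s \in \zeta} \theta^{\mathrm{T}} f_s\Big), \qquad r(s) = \theta^{\mathrm{T}} f_s.

If the expert's true reward depends on the features nonlinearly, for example through products or thresholds of features, no weight vector \theta can represent it under this linear model.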
Levine presented a Gaussian process based algorithm for nonlinear inverse reinforcement learning (GPIRL) [11]. The goal of the algorithm is to learn the reward function of a Markov decision process from expert demonstrations, using Gaussian processes to represent the reward as a nonlinear function of the features while also determining the relevance of each feature to the expert's policy.
However, this is a costly procedure, as the running time of the optimization increases dramatically with the number of features M. Hence, GPIRL becomes computationally expensive when M is large, as the kernel sketch below suggests.
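The source of this cost can be sketched with a standard automatic relevance determination (ARD) kernel; the exact kernel of [11] may differ in its details, so the form below is illustrative only:

k(x, x') = \beta \exp\Big(-\tfrac{1}{2}\sum_{m=1}^{M} \lambda_m (x_m - x'_m)^2\Big).

Since one length-scale \lambda_m is attached to every feature, the hyperparameter vector grows linearly with M, and every evaluation of the Gaussian process likelihood and its gradient during hyperparameter optimization involves all M length-scales.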
In this paper, an inverse reinforcement learning algorithm named MEIRL is proposed by combining mutual information, the Extreme Learning Machine, and Gaussian processes. As in GPIRL, the reward function is constructed with a Gaussian process. A mutual information based criterion, Max-Relevance and Min-Redundancy (mRMR), is then employed to evaluate the importance of each feature to the reward function; on this basis the number of selected features is reduced, and with it the computational complexity of GPIRL. Furthermore, a computationally efficient neural network named the extreme learning machine is introduced, together with an adaptive model construction procedure, to refine the selection process and obtain a more compact subset of features; a sketch of this selection step is given below. Finally, a highway driving simulation is constructed to demonstrate the effectiveness of MEIRL.
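To make the selection step concrete, the following is a minimal sketch, not the authors' implementation: it assumes mutual information is estimated with scikit-learn's mutual_info_regression, and it uses a basic tanh ELM whose output weights are solved by pseudoinverse; the function names (mrmr_rank, elm_fit, elm_predict) are introduced here for illustration only.

# Minimal sketch of the two selection stages described above.
# Assumptions (not from the paper): mutual information is estimated with
# scikit-learn's nearest-neighbor estimator, and the ELM uses a tanh hidden
# layer whose output weights are solved by least squares (pseudoinverse).
import numpy as np
from sklearn.feature_selection import mutual_info_regression


def mrmr_rank(X, r, k):
    """Rank k features by Max-Relevance Min-Redundancy against reward values r."""
    n_features = X.shape[1]
    relevance = mutual_info_regression(X, r)           # I(x_j; r)
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        scores = []
        for j in remaining:
            if selected:
                redundancy = mutual_info_regression(
                    X[:, selected], X[:, j]).mean()    # mean I(x_j; x_i)
            else:
                redundancy = 0.0
            scores.append(relevance[j] - redundancy)   # mRMR criterion
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected


def elm_fit(X, r, n_hidden=50, seed=None):
    """Fit a basic ELM: random hidden weights, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))        # fixed random input weights
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                             # hidden-layer activations
    beta = np.linalg.pinv(H) @ r                       # output weights via pseudoinverse
    return W, b, beta


def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta


if __name__ == "__main__":
    # Toy usage: rank features with mRMR, then score increasingly large subsets
    # by ELM fit error, mimicking the adaptive model construction step.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    r = np.sin(X[:, 0]) + X[:, 1] ** 2                 # synthetic nonlinear "reward"
    order = mrmr_rank(X, r, k=5)
    for m in range(1, len(order) + 1):
        cols = order[:m]
        W, b, beta = elm_fit(X[:, cols], r, seed=0)
        err = np.mean((elm_predict(X[:, cols], W, b, beta) - r) ** 2)
        print(m, cols, round(float(err), 4))

In this toy usage, candidate subsets taken from the mRMR ranking are scored by the ELM's fit error, and the smallest subset beyond which the error stops improving would be retained; the real adaptive construction procedure is described later in the paper.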