cremental. In the online setting the behavior policy is either identical to the learning policy or is updated once every few transitions. In the offline setting, by contrast, the agent has no control over how the data are generated and is simply provided with a fixed data set of experiences; there the behavior policy is usually stochastic and might be unknown to the agent. Data processing during learning may use a batch or an incremental algorithm. A batch algorithm processing the collected observations can freely access any element at any time. An incremental algorithm continues to learn whenever a new data sample becomes available; in principle its computation need not depend on the whole data set of observations but may rely only on the last sample. One possibility is to alternate between phases of exploration, where a set of training examples is grown by interacting with the system, and phases of learning, where the whole batch of observations is used; this is called the growing batch learning problem. In practice, the growing batch approach is the modeling of choice when applying batch reinforcement learning algorithms to real systems, as sketched below.
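As an illustration, a minimal sketch of this growing batch scheme might look as follows (the `collect_episode` and `fit_batch` callables are hypothetical placeholders, not part of the paper):

```python
def growing_batch_learning(collect_episode, fit_batch, initial_policy,
                           n_phases=10, episodes_per_phase=5):
    """Growing batch RL: alternate exploration phases that grow the
    data set with learning phases that re-fit on the whole batch."""
    data = []                      # all transitions collected so far
    policy = initial_policy
    for _ in range(n_phases):
        # Exploration phase: grow the set of training examples
        # by interacting with the system under the current policy.
        for _ in range(episodes_per_phase):
            data.extend(collect_episode(policy))
        # Learning phase: a batch algorithm can freely access
        # any element of the whole batch at any time.
        policy = fit_batch(data)
    return policy
```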
A central goal of RL is to develop algorithms that learn online, in which case performance should improve once every few transition samples. In this paper we propose and empirically study an online algorithm that evaluates policies with BRM using SVR, called online API-BRM$_\varepsilon$. A crucial difference from the offline case is that policy improvements must be performed once every few samples, before an accurate evaluation of the current policy can be completed, as in the following sketch.
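Schematically, this interleaving of sample collection, evaluation, and improvement can be sketched as follows (all names are illustrative placeholders; this shows only the control flow, not the actual online API-BRM$_\varepsilon$ implementation):

```python
def online_api(env_step, evaluate_brm, improve, policy, s0,
               n_steps=10_000, improve_every=50, epsilon=0.1):
    """Schematic online API loop: collect samples with the current
    policy and improve it once every `improve_every` transitions,
    before the policy has been evaluated to high accuracy."""
    samples, s = [], s0
    for t in range(1, n_steps + 1):
        a = policy(s)                      # behavior = current (soft) policy
        r, s_next = env_step(s, a)         # (r_t, s_{t+1}) ~ P(., . | s_t, a_t)
        samples.append((s, a, r, s_next))
        if t % improve_every == 0:
            q = evaluate_brm(samples, policy)  # placeholder for the SVR-based BRM step
            policy = improve(q, epsilon)       # e.g. epsilon-greedy improvement
        s = s_next
    return policy
```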
Online API-BRM$_\varepsilon$ collects its own samples, which makes exploration necessary, and it does not suffer from sub-optimality: it always finds the solution to the approximation problem. Being a non-parametric learning method, it can automatically adapt to the complexity of the problem through the choice of an appropriate kernel. After describing some theoretical background, we introduce online API-BRM$_\varepsilon$ and provide an experimental evaluation on the inverted pendulum and bicycle balancing problems.
2. Background and notation
A finite-action discounted MDP can be defined as a tuple $(S, A, P, R, \gamma)$, with $S$ a measurable state space and $A$ a finite set of available actions. $P$ is a mapping giving, for each state-action pair, a distribution over $\mathbb{R} \times S$ with marginals $P(\cdot|s,a)$ (the transition probability) and $R(\cdot|s,a)$, which determines the expected immediate reward when the agent makes a transition. At stage $t$ an action $a_t \in A$ is selected by the agent controlling the process and, in response, the pair $(r_t, s_{t+1})$ is drawn from the distribution $P(r, s \mid s_t, a_t)$, i.e. $(r_t, s_{t+1}) \sim P(r, s \mid s_t, a_t)$, where $r_t$ is the reward the agent receives and $s_{t+1}$ the next MDP state.
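As a toy illustration (hypothetical names, not from the paper), a transition can be drawn from a small generative model of $P$ like this:

```python
import random

# Toy generative model: for each (s, a), a list of
# (probability, reward, next_state) outcomes defining P(r, s' | s, a).
P = {
    (0, 'left'):  [(1.0, 0.0, 0)],
    (0, 'right'): [(0.8, 1.0, 1), (0.2, 0.0, 0)],
    (1, 'left'):  [(1.0, 0.0, 0)],
    (1, 'right'): [(1.0, 5.0, 1)],
}

def sample_transition(s, a):
    """Draw (r_t, s_{t+1}) ~ P(r, s | s_t, a_t)."""
    outcomes = P[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, r, s_next = random.choices(outcomes, weights=probs)[0]
    return r, s_next
```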
An agent in RL is usually assumed to be very simple, consisting mainly of an action selection policy such that $a_t = \pi(s_t)$. More generally, a stationary stochastic policy maps states to distributions over the action space, with $\pi_t(a|s)$ denoting the probability that the agent selects action $a$ in state $s$ at time $t$. Stochastic policies are also called soft when they do not commit to a single action per state. An $\varepsilon$-greedy policy is a soft policy which, for some $0 \le \varepsilon \le 1$, deterministically picks a particular action with probability $1 - \varepsilon$ and a uniformly random action with probability $\varepsilon$. We will then use $a \sim \pi(\cdot|s)$ to indicate that action $a$ is chosen according to this probability function in state $s$.
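A minimal sketch of this $\varepsilon$-greedy selection rule (assuming some action value function `q` is given):

```python
import random

def epsilon_greedy(q, s, actions, epsilon):
    """Pick a fixed (greedy) action w.r.t. q with probability
    1 - epsilon, and a uniformly random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)           # explore
    return max(actions, key=lambda a: q(s, a))  # exploit

# Usage: a = epsilon_greedy(q, s, actions=[0, 1, 2], epsilon=0.1)
```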
For an agent following the policy $\pi$, consider the sequence of rewards $\{r_t : t \ge 1\}$ obtained when the MDP is started from a state-action pair $(s_1, a_1) \sim \nu \in M(S \times A)$. The action value function $Q^\pi$ is defined as $Q^\pi(s,a) = \mathbb{E}\left\{\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s, a_1 = a, \pi\right\}$, where $\gamma \in [0,1)$ is a discount factor.
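This definition can be read directly as a Monte Carlo estimate: roll out $\pi$ from $(s,a)$ and average the truncated discounted return (a sketch reusing the hypothetical `sample_transition` from above):

```python
def mc_action_value(sample_transition, pi, s, a, gamma=0.95,
                    n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of Q^pi(s, a) = E[sum_t gamma^(t-1) r_t]."""
    total = 0.0
    for _ in range(n_rollouts):
        st, at, discount = s, a, 1.0
        for _ in range(horizon):       # truncate the infinite sum
            r, st = sample_transition(st, at)
            total += discount * r
            discount *= gamma
            at = pi(st)                # follow pi thereafter
    return total / n_rollouts
```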
A policy $\pi = \hat{\pi}(\cdot, Q)$ is greedy w.r.t. an action value function