Zap Q-Learning: An Optimized Q-Learning Algorithm with Fast Convergence, and a Tutorial


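The paper "Fastest Convergence for Q-Learning" addresses the optimization of the Q-learning algorithm in reinforcement learning. Q-learning is a classical reinforcement learning method that guides an agent's decisions by estimating the action-value function. The original Watkins algorithm is practical, but it converges slowly and its variance is hard to optimize.

The authors propose a new algorithm, Zap Q-learning, that uses a matrix gain designed to optimize the asymptotic variance. By comparison with the deterministic Newton-Raphson method, the paper argues that Zap Q-learning also has near-optimal transient behavior, which it owes to its two time-scale update equations; these play a key role in the mathematical analysis. The analysis also covers non-ideal parameterizations, as may arise in complex practical settings, and indicates that Zap Q-learning remains numerically stable and converges quickly in that case as well. Figure 9 of the paper, a comparison plot taken from the literature, illustrates the acceleration in convergence.

The paper is also tutorial in nature. Its first half reviews the development of reinforcement learning algorithms, with particular attention to minimum-variance algorithms, so that readers learn both how Zap Q-learning is constructed and the theory behind it. The claims for the algorithm are supported by theoretical analysis and numerical experiments.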

convergence of the stochastic recursion (as we will see in the case of Q-learning). The SNR

algorithm is essentially the same as (15):

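$$
\begin{aligned}
\theta_{n+1} &= \theta_n - \alpha_{n+1} \widehat{A}_{n+1}^{-1} f(\theta_n, \phi_{n+1}) \\
\widehat{A}_{n+1} &= \widehat{A}_n + \alpha_{n+1} \bigl[ \nabla f(\theta_n, \phi_{n+1}) - \widehat{A}_n \bigr], \qquad \alpha_{n+1} = \frac{1}{n+1}
\end{aligned}
\tag{20}
$$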

Note that the function $\nabla f(\theta_n, \phi_{n+1})$ may or may not be readily accessible; this is application specific. In the case of Q-learning with linear function approximation, though the function $f$ is itself non-linear in $\theta$, $\nabla f$ is readily computable.

The ODE for the pair of recursions (20) once again will be similar to (16):

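$$
\begin{aligned}
\tfrac{d}{dt}\theta_t &= -\mathcal{A}_t^{-1} \bar f(\theta_t) \\
\tfrac{d}{dt}\mathcal{A}_t &= -\mathcal{A}_t + \nabla \bar f(\theta_t)
\end{aligned}
\tag{21}
$$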

The Zap-SNR algorithm is a generalization of (17):

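$$
\begin{aligned}
\theta_{n+1} &= \theta_n - \alpha_{n+1} \widehat{A}_{n+1}^{-1} f(\theta_n, \phi_{n+1}) \\
\widehat{A}_{n+1} &= \widehat{A}_n + \gamma_{n+1} \bigl[ \nabla f(\theta_n, \phi_{n+1}) - \widehat{A}_n \bigr]
\end{aligned}
\tag{22}
$$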

where once again the step-size sequence $\{\gamma_n\}$ satisfies (5) and (18). Similar to (19), the ODE of this algorithm is identical to the deterministic Newton-Raphson dynamics:

$$
\tfrac{d}{dt} x_t = -\bigl(\nabla \bar f(x_t)\bigr)^{-1} \bar f(x_t). \tag{23}
$$

The general convergence and stability analysis of both (20) and (22) is open. In Section 3 we show that when applied to Q-learning, the algorithms do converge under certain technical conditions. However, the assumptions under which the single time-scale algorithm (20) converges are far more restrictive than the assumptions under which the two-time-scale algorithm (22) converges.
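As an illustration of how the two gains in (22) interact, the following is a minimal scalar ($d = 1$) sketch. The step-size choices $\alpha_n = 1/n$ and $\gamma_n = n^{-\rho}$, the function name `zap_snr_scalar`, and the cubic test problem are assumptions made for this example only; they are not taken from the paper.

```python
import random


def zap_snr_scalar(f, grad_f, sample, theta0, a_hat0, n_iter, rho=0.85):
    """Scalar (d = 1) sketch of the two-time-scale recursion (22).

    The step-sizes alpha_n = 1/n and gamma_n = n**(-rho) are assumptions made
    for this example; with rho = 1 both gains coincide, as in (20).
    """
    theta, a_hat = theta0, a_hat0
    for n in range(1, n_iter + 1):
        alpha = 1.0 / n
        gamma = n ** (-rho)
        phi = sample()
        # Gain estimate: A_{n+1} = A_n + gamma_{n+1} * (grad f(theta_n, phi_{n+1}) - A_n)
        a_hat += gamma * (grad_f(theta, phi) - a_hat)
        # Parameter: theta_{n+1} = theta_n - alpha_{n+1} * f(theta_n, phi_{n+1}) / A_{n+1}
        theta -= alpha * f(theta, phi) / a_hat
    return theta, a_hat


# Hypothetical test problem: E[f(theta, phi)] = 2 - theta - theta**3, root at theta = 1.
def toy_f(theta, phi):
    return phi - theta - theta ** 3


def toy_grad(theta, phi):
    return -1.0 - 3.0 * theta ** 2


rng = random.Random(0)
theta_hat, a_hat = zap_snr_scalar(
    toy_f, toy_grad, lambda: rng.gauss(2.0, 1.0), theta0=0.0, a_hat0=-1.0, n_iter=20000
)
print(theta_hat, a_hat)  # expected to settle near the root 1 and the slope -4
```

Because $\gamma_1 = 1$ in this sketch, the first gain update discards the initial value `a_hat0`, and the first parameter update is an exact Newton-Raphson step on the sampled function.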

2.3.2 Dealing with complexity: An O(d) Zap-SNR algorithm

It is common to discard the idea of second-order methods because of their computational complexity. Before we move on to the specific applications in Reinforcement Learning, we propose an enhancement of the SNR algorithms that results in complexity comparable to first-order methods.

We believe that we have convinced the reader that the two-time-scale Zap-SNR algorithm (22) is of more interest to us (we will make this more precise in Section 3), and hence we restrict to extensions of this algorithm here.

It is assumed that there is no complexity in “calculating” the gradient function $\nabla f(\cdot, \cdot)$, and that it is readily available. This is not true in all applications, but it holds in the applications of interest in this paper. Under these assumptions, computational complexity arises from the operations that are performed in manipulating these quantities.

The per-iteration complexity of the first-order algorithm (1) is $O(d)$, since $\theta \in \mathbb{R}^d$. If the algorithm is run for $T$ iterations (assuming we have a data sequence of length $T$), the total complexity is $O(dT)$. The per-iteration complexity in the case of the Zap-SNR algorithm (22) is $O(d^2)$, because it involves the product of a matrix inverse (of dimension $d \times d$) and a vector (of dimension $d \times 1$). The total complexity of the algorithm after running for $T$ iterations is $O(T d^2)$.

The essential idea behind the $O(d)$ Zap-SNR algorithm is to perform the $O(d^2)$ complexity steps only once every $N \ge d$ iterations, so that the total computational complexity for a data sequence of length $T$ is $O(T d^2 / N)$; this essentially results in the complexity of the first-order method if $N = d$. This is done by “batching” the data sequence into mini-sequences of length $N$, and applying the recursions (22) to each batch as follows: for $i \ge 0$,

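$$
\begin{aligned}
\theta_{(i+1)N} &= \theta_{iN} - \alpha_{i+1} \widehat{A}_{(i+1)N}^{-1} \widehat{f}(\theta_{iN}) \\
\widehat{A}_{(i+1)N} &= \widehat{A}_{iN} + \hat\gamma_{i+1} \bigl[ \widehat{\nabla f}(\theta_{iN}) - \widehat{A}_{iN} \bigr]
\end{aligned}
\tag{24}
$$

where

$$
\begin{aligned}
\widehat{f}(\theta_{iN}) &= N^{-1} \sum_{j=iN+1}^{(i+1)N} f(\theta_{iN}, \phi_j) \\
\widehat{\nabla f}(\theta_{iN}) &= N^{-1} \sum_{j=iN+1}^{(i+1)N} \nabla f(\theta_{iN}, \phi_j) \\
\hat\gamma_{i+1} &= 1 - \prod_{j=iN+1}^{(i+1)N} (1 - \gamma_j)
\end{aligned}
\tag{25}
$$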

The first two definitions in (25) are straightforward; the expression for $\hat\gamma_{i+1}$ is obtained in such a way that the recursions in (24) very closely resemble the recursions in (22).
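One way to see where the product form of the batch gain comes from is the following short check. It rests on a simplifying assumption that is not stated in the text: the gradient sample is frozen at a single value $G$ throughout batch $i$. Applying the gain update of (22) once per sample then gives, for $iN < j \le (i+1)N$,

$$
\widehat{A}_j - G = (1 - \gamma_j)\bigl(\widehat{A}_{j-1} - G\bigr)
\quad\Longrightarrow\quad
\widehat{A}_{(i+1)N} = \widehat{A}_{iN} + \Bigl[ 1 - \prod_{j=iN+1}^{(i+1)N} (1 - \gamma_j) \Bigr] \bigl( G - \widehat{A}_{iN} \bigr).
$$

The bracketed factor is exactly the batch gain in the third expression of (25), so under this assumption one batched gain update reproduces $N$ per-sample gain updates.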

A remarkable (but almost obvious) property of the $O(d)$ Zap-SNR algorithm (24) is that it has the same asymptotic properties (specifically, the asymptotic covariance) as the original Zap-SNR algorithm (22). This will be made more precise in a future version of the paper.

The specific application of this algorithm to Q-learning is discussed in Section 3.7.
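The batched recursion (24)-(25) can be sketched in the scalar case as follows. The function name `batched_zap_snr_scalar`, the step-sizes $\gamma_j = j^{-\rho}$ and $\alpha_{i+1} = 1/(i+1)$, and the cubic test problem are assumptions made for this example. In one dimension there is no computational saving; the point is only to show where the single gain update and the single division per batch occur. For $d > 1$ that division becomes the one $O(d^2)$ matrix operation per batch.

```python
import random


def batched_zap_snr_scalar(f, grad_f, sample, theta0, a_hat0,
                           n_batches, batch_len, rho=0.85):
    """Scalar sketch of (24)-(25): one gain update and one division per batch."""
    theta, a_hat = theta0, a_hat0
    j = 0  # running sample index across batches
    for i in range(n_batches):
        f_sum, grad_sum, keep = 0.0, 0.0, 1.0
        for _ in range(batch_len):
            j += 1
            phi = sample()
            f_sum += f(theta, phi)        # theta is frozen within the batch
            grad_sum += grad_f(theta, phi)
            keep *= 1.0 - j ** (-rho)     # product of (1 - gamma_j); gamma_j = j**(-rho) assumed
        gamma_hat = 1.0 - keep            # third expression in (25)
        alpha = 1.0 / (i + 1)             # batch step-size alpha_{i+1} (assumed)
        a_hat += gamma_hat * (grad_sum / batch_len - a_hat)
        theta -= alpha * (f_sum / batch_len) / a_hat
    return theta, a_hat


# Hypothetical test problem: E[f(theta, phi)] = 2 - theta - theta**3, root at theta = 1.
def toy_f(theta, phi):
    return phi - theta - theta ** 3


def toy_grad(theta, phi):
    return -1.0 - 3.0 * theta ** 2


rng = random.Random(1)
theta_hat, a_hat = batched_zap_snr_scalar(
    toy_f, toy_grad, lambda: rng.gauss(2.0, 1.0),
    theta0=0.0, a_hat0=-1.0, n_batches=2000, batch_len=10,
)
print(theta_hat, a_hat)  # expected to settle near 1 and -4
```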

2.4 Application to temporal-difference algorithms

The general theory is illustrated here, through application to TD(λ)-learning algorithms.

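Let $\{P^n\}$ denote the transition semigroup for the Markov chain $\boldsymbol{X}$: for each $n \ge 0$, $x \in \mathsf{X}$, and $A \in \mathcal{B}(\mathsf{X})$,

$$
P^n(x, A) := \mathsf{P}_x\{X_n \in A\} := \Pr\{X_n \in A \mid X_0 = x\}.
$$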

The standard operator-theoretic notation is used for conditional expectation: for any measurable function $f\colon \mathsf{X} \to \mathbb{R}$,

$$
P^n f(x) = \mathsf{E}_x[f(X_n)] := \mathsf{E}[f(X_n) \mid X_0 = x].
$$

In a finite state space setting, $P^n$ is the $n$-step transition probability matrix of the Markov chain, and the conditional expectation appears as matrix-vector multiplication:

$$
P^n f(x) = \sum_{x' \in \mathsf{X}} P^n(x, x') f(x'), \qquad x \in \mathsf{X}.
$$

Let $c\colon \mathsf{X} \to \mathbb{R}$ denote a cost function, and $\beta \in (0, 1)$ a discount factor. The discounted-cost value function is defined as $h = \sum_{n=0}^{\infty} \beta^n P^n c$, which is the unique solution to the Bellman equation

$$
c + \beta P h = h. \tag{26}
$$

This deserves more explanation and we plan to provide one in a future version of the paper.
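The link between the series definition of $h$ and the Bellman equation (26) can be checked numerically on a small example. In the sketch below the chain `P`, the cost `c`, and the discount factor are hypothetical. Iterating $h \leftarrow c + \beta P h$ from zero produces the partial sums of $\sum_n \beta^n P^n c$, so the limit is both the series and the fixed point of (26).

```python
def apply_kernel(P, f):
    """Return the vector Pf for a finite transition matrix P."""
    return [sum(p * v for p, v in zip(row, f)) for row in P]


def discounted_value(P, c, beta, n_iter=1000):
    """Iterate h <- c + beta * P h; after k steps h is the k-term partial sum."""
    h = [0.0] * len(c)
    for _ in range(n_iter):
        Ph = apply_kernel(P, h)
        h = [ci + beta * v for ci, v in zip(c, Ph)]
    return h


# Hypothetical two-state example.
P = [[0.9, 0.1],
     [0.5, 0.5]]
c = [1.0, 2.0]
beta = 0.9
h = discounted_value(P, c, beta)

# Bellman residual c + beta * P h - h, which should be numerically zero.
residual = [ci + beta * v - hi for ci, v, hi in zip(c, apply_kernel(P, h), h)]
print(h, residual)
```

Solving the $2 \times 2$ linear system $(I - \beta P) h = c$ by hand for these values gives $h = (11.40625,\ 12.96875)$, which the iteration should reproduce to numerical precision.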

TD-learning algorithms are designed to obtain approximations of $h$ within a finite-dimensional parameterized class.

Consider the case of a $d$-dimensional linear parameterization. A function $\psi\colon \mathsf{X} \to \mathbb{R}^d$ is chosen, which is viewed as a collection of $d$ basis functions. Each vector $\theta \in \mathbb{R}^d$ is associated with the approximate value function $h^\theta = \theta^{\mathsf{T}} \psi$. There are two standard criteria for defining optimality of the parameter. Most natural is the minimum norm approach:

$$
\theta^* = \arg\min_\theta \| h^\theta - h \| \tag{27}
$$

in which the choice of norm is part of the design of the algorithm. Most common is

$$
\| h^\theta - h \|^2 = \mathsf{E}\bigl[ (h^\theta(X_n) - h(X_n))^2 \bigr] \tag{28}
$$

where the expectation is in steady state.

In the Galerkin approach, a $d$-dimensional stationary stochastic process $\zeta$ is constructed that is adapted to a stationary realization of $\boldsymbol{X}$. An algorithm is designed to obtain the vector $\theta^* \in \mathbb{R}^d$ that satisfies

$$
0 = \mathsf{E}\Bigl[ \bigl( -h^{\theta^*}(X_n) + c(X_n) + \beta h^{\theta^*}(X_{n+1}) \bigr)\, \zeta_n(i) \Bigr], \qquad 1 \le i \le d, \tag{29}
$$

in which the expectation is again in steady state. The $d$-dimensional stochastic process $\zeta$ is called the sequence of eligibility vectors.

The motivation for the first criterion (27) is clear, but algorithms that solve this problem often suffer from high variance. The Galerkin approach is used because it is simple and generally applicable. Also, if the basis functions are chosen such that $h = h^{\theta^\bullet}$ for some $\theta^\bullet \in \mathbb{R}^d$, and if the solution to (29) is unique, then the Galerkin approach will yield the exact solution $h$.

The goal of the TD($\lambda$) learning algorithm is to solve the Galerkin relaxation (29) in which the eligibility vectors are obtained by passing $\{\psi(X_n)\}$ through the corresponding first-order low-pass filter: $\zeta_{n+1} = \lambda\beta\zeta_n + \psi(X_{n+1})$, $n \ge 0$. It is always assumed that $\lambda \in [0, 1]$. It is shown in [33] that the solutions to the Galerkin fixed point equation (29) and the minimum norm problem (27) coincide if $\lambda = 1$, with the norm defined by (28).

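TD($\lambda$) algorithm: For initialization $\theta_0, \zeta_0 \in \mathbb{R}^d$, the sequence of estimates is defined recursively:

$$
\begin{aligned}
\theta_{n+1} &= \theta_n + \alpha_{n+1} \zeta_n d_{n+1} \\
d_{n+1} &= c(X_n) + \bigl[ \beta\psi(X_{n+1}) - \psi(X_n) \bigr]^{\mathsf{T}} \theta_n \\
\zeta_{n+1} &= \lambda\beta\zeta_n + \psi(X_{n+1}).
\end{aligned}
\tag{30}
$$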
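A minimal sketch of the recursion (30) along an observed state path is given below. The function name `td_lambda`, the default step-size $\alpha_n = 1/n$, the one-hot basis, and the three-step path are all assumptions chosen so that the updates can be followed by hand; the temporal-difference term is written with the linear form $[\beta\psi(X_{n+1}) - \psi(X_n)]^{\mathsf{T}}\theta_n$.

```python
def td_lambda(states, cost, psi, beta, lam, theta0, zeta0,
              step_size=lambda n: 1.0 / n):
    """Run the TD(lambda) recursion (30) along the path X_0, X_1, ..., X_T."""
    theta, zeta = list(theta0), list(zeta0)
    for n in range(len(states) - 1):
        psi_now, psi_next = psi(states[n]), psi(states[n + 1])
        # d_{n+1} = c(X_n) + [beta * psi(X_{n+1}) - psi(X_n)]^T theta_n
        td = cost(states[n]) + sum(
            (beta * pn - pc) * th for pn, pc, th in zip(psi_next, psi_now, theta)
        )
        alpha = step_size(n + 1)
        # theta_{n+1} = theta_n + alpha_{n+1} * zeta_n * d_{n+1}
        theta = [th + alpha * z * td for th, z in zip(theta, zeta)]
        # zeta_{n+1} = lam * beta * zeta_n + psi(X_{n+1})
        zeta = [lam * beta * z + pn for z, pn in zip(zeta, psi_next)]
    return theta, zeta


def one_hot(x):
    """Hypothetical tabular basis for a two-state chain."""
    return [1.0 if x == k else 0.0 for k in range(2)]


costs = [1.0, 2.0]  # hypothetical cost function c
theta, zeta = td_lambda(
    [0, 1, 0], lambda x: costs[x], one_hot,
    beta=0.9, lam=0.5, theta0=[0.0, 0.0], zeta0=one_hot(0),
)
print(theta, zeta)
```

Tracing the two updates by hand: the first temporal difference is $1$ and gives $\theta_1 = (1, 0)$, $\zeta_1 = (0.45, 1)$; the second is $2 + 0.9 = 2.9$ with $\alpha_2 = 1/2$, giving $\theta_2 = (1.6525, 1.45)$.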

The recursion (30) can be placed in the form (6) in which $\Phi_n = (X_n, X_{n-1}, \zeta_{n-1})$, and

$$
A_{n+1} = \zeta_n \bigl[ \beta\psi(X_{n+1}) - \psi(X_n) \bigr]^{\mathsf{T}}, \qquad b_{n+1} = -\zeta_n c(X_n). \tag{31}
$$

Based on this representation, it can be shown that the TD($\lambda$) algorithm is consistent provided the basis vectors are linearly independent, in the sense that $\mathsf{E}[\psi(X_n)\psi(X_n)^{\mathsf{T}}] > 0$.

It is also easy to construct an example for which the asymptotic covariance is infinite: take any consistent example, and scale the basis vectors by a small constant $\varepsilon$. Using the basis $\varepsilon\psi$, the resulting matrix $A$ is scaled by $\varepsilon^2$. Hence, for sufficiently small $\varepsilon > 0$, each eigenvalue of $A$ will have real part that is strictly greater than $-1/2$.
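The scaling claim can be checked directly from (31). The check assumes that $A$ denotes the steady-state mean of $A_{n+1}$ (notation defined outside this section) and that the eligibility vectors come from the linear filter in (30), so that replacing $\psi$ by $\varepsilon\psi$ also replaces $\zeta_n$ by $\varepsilon\zeta_n$. Writing $A^\varepsilon$ for the matrix obtained with the scaled basis:

$$
A^\varepsilon = \mathsf{E}\Bigl[ \varepsilon\zeta_n \bigl[ \beta\varepsilon\psi(X_{n+1}) - \varepsilon\psi(X_n) \bigr]^{\mathsf{T}} \Bigr] = \varepsilon^2 A,
\qquad
\operatorname{Re}\lambda_i(A^\varepsilon) = \varepsilon^2 \operatorname{Re}\lambda_i(A) > -\tfrac{1}{2}
\quad \text{whenever} \quad
\varepsilon^2 < \frac{1}{2 \max_i |\operatorname{Re}\lambda_i(A)|}.
$$

The threshold on $\varepsilon$ therefore depends only on the eigenvalue of the unscaled matrix with the largest real part in magnitude.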
