transition $(s, a, r, s')$, the sample matrices are defined as
$$\tilde{\Phi} \equiv \begin{bmatrix} \phi_1^T \\ \vdots \\ \phi_m^T \end{bmatrix}, \quad \tilde{\Phi}' \equiv \begin{bmatrix} \phi_1'^T \\ \vdots \\ \phi_m'^T \end{bmatrix} \quad \text{and} \quad \tilde{R} \equiv \begin{bmatrix} r_1 \\ \vdots \\ r_m \end{bmatrix} \tag{4}$$
and we obtain an empirical version of (3) as follows:
$$\theta = u^* = \arg\min_{u \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} u - \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|^2. \tag{5}$$
The derivation from (3) to (5) can be found in [20] and [22]. It has been proved that as $m \to \infty$, the fixed point of (5) approaches that of (3) with probability one.
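As an illustration, the construction in (4) amounts to stacking per-transition feature vectors row by row. The sketch below shows one way to do this in NumPy; the radial-basis feature map and the toy batch are hypothetical assumptions for illustration, not definitions from this paper.

```python
import numpy as np

# Hypothetical radial-basis feature map (an assumption for illustration):
centers = np.linspace(0.0, 1.0, 5)                 # n = 5 basis functions
phi = lambda s: np.exp(-(s - centers) ** 2 / 0.1)  # phi(s) is an n-vector

def build_sample_matrices(transitions):
    """Stack m transitions (s, a, r, s') into Phi~ (m x n), Phi~' (m x n), R~ (m,)."""
    Phi  = np.stack([phi(s)      for (s, a, r, s_next) in transitions])
    PhiP = np.stack([phi(s_next) for (s, a, r, s_next) in transitions])
    R    = np.array([r           for (s, a, r, s_next) in transitions])
    return Phi, PhiP, R

# Example: three fake transitions on a one-dimensional state space.
batch = [(0.1, 0, 1.0, 0.2), (0.2, 0, 0.0, 0.4), (0.4, 0, -1.0, 0.9)]
Phi, PhiP, R = build_sample_matrices(batch)        # shapes (3, 5), (3, 5), (3,)
```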
A least-squares problem orthogonally projects $\theta$ back onto $F$ as a new fixed point $\theta^*$. We can solve for the fixed point $\theta^*$ via the following nested optimization problem [28], [33], [34]:
$$\begin{cases} u^* = \arg\min_{u \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} u - \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 \\[4pt] \theta^* = \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} \theta - \tilde{\Phi} u^* \right\|_2^2. \end{cases} \tag{6}$$
The first step of (6) minimizes the OPE and the second minimizes the FPE. In this paper, we refer to these two steps as the projection step and the fixed-point step, as in [28].
Given a fixed $\theta$, the closed-form solution for $u^*$ in the projection step of (6) is
$$u^* = \arg\min_{u \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} u - \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 = \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right). \tag{7}$$
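Numerically, (7) is an ordinary least-squares fit of $\tilde{\Phi} u$ to the empirical Bellman target. A minimal sketch, assuming the sample matrices from (4) are available as NumPy arrays:

```python
import numpy as np

def projection_step(Phi, PhiP, R, theta, gamma=0.95):
    """Closed-form u* of (7): least-squares fit to the Bellman target."""
    target = R + gamma * PhiP @ theta
    # Equivalent to (Phi^T Phi)^{-1} Phi^T target, but numerically safer.
    u_star, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    return u_star
```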
By substituting (7) into the fixed-point step of (6) and setting $\theta = u$, the nested optimization problem in (6) becomes the following problem, which minimizes the mean-square projected Bellman error (MSPBE):
$$u^* = \arg\min_{u \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} u - \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' u \right) \right\|_2^2. \tag{8}$$
Equation (8) is an empirical version; it can be derived by substituting (4) into the original definition of the MSPBE in [8], $\mathrm{MSPBE}(\theta) = \frac{1}{2} \left\| V_\theta - \Pi T V_\theta \right\|_D^2$, where $\Pi = \Phi \left( \Phi^T D \Phi \right)^{-1} \Phi^T D$ is the Bellman projection operator.
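For concreteness, the empirical MSPBE objective inside (8) can be evaluated directly from the sample matrices. A sketch follows; the use of the pseudoinverse is an implementation choice for numerical safety, not something prescribed by the paper.

```python
import numpy as np

def empirical_mspbe(Phi, PhiP, R, u, gamma=0.95):
    """Evaluate the empirical MSPBE objective of (8) at u (1/2 convention of [8])."""
    Pi_hat = Phi @ np.linalg.pinv(Phi.T @ Phi) @ Phi.T  # empirical projection
    residual = Phi @ u - Pi_hat @ (R + gamma * PhiP @ u)
    return 0.5 * float(residual @ residual)
```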
In [28], two schemes for the $\ell_1$-regularization problem are proposed, termed L1 and L21; the difference lies in where the $\ell_1$ penalty is placed. The L21 scheme places the $\ell_1$ penalty in the fixed-point step, forming a Lasso problem solvable by standard $\ell_1$ solvers; its $\ell_2$ penalty in the projection step keeps the projection matrix nonsingular, but the effect of this penalty on the algorithm is not entirely clear. For these reasons, we choose the nested optimization scheme with an $\ell_1$ penalty in the fixed-point step. Unlike [28], however, we use iterative refinement [36] instead of the $\ell_2$ penalty to keep the projection matrix nonsingular and to ensure that the result of the projection step converges to the unbiased solution. The nested optimization problem in this paper is proposed as
$$\begin{cases} u^* = \arg\min_{u \in \mathbb{R}^{n \times 1}} \frac{1}{2} \left\| \tilde{\Phi} u - \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 \\[4pt] \theta^* = \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \frac{1}{2} \left\| \tilde{\Phi} \theta - \tilde{\Phi} u^* \right\|_2^2 + \beta_1 \| \theta \|_1 \end{cases} \tag{9}$$
where $\beta_1 \in \mathbb{R}^+$ is an $\ell_1$-regularization parameter.

Fig. 1. Graphical illustration of the proposed nested optimization problem.

An equivalent nested optimization problem to (9), containing the MSPBE of (8), can be obtained as follows:
$$\begin{cases} u^* = \arg\min_{u \in \mathbb{R}^{n \times 1}} \frac{1}{2} \left\| \tilde{\Phi} u - \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 \\[4pt] \theta^* = \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \frac{1}{2} \left\| \tilde{\Phi} \theta - \tilde{\Phi} u^* \right\|_2^2 + \beta_1 \| \theta \|_1. \end{cases} \tag{10}$$
We see that the nested optimization in (10) contains one subproblem that minimizes the MSPBE and another that is a Lasso problem; both are easier to solve than the MSPBE with $\ell_1$-regularization directly. The equivalence of (9) and (10) can be proved as follows. By substituting (7) into the fixed-point step of (9) and setting $\theta = u$, the fixed point of (9) is obtained as
$$\theta^* = \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} \theta - \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 + \beta_1 \| \theta \|_1. \tag{11}$$
Likewise, we can substitute the closed-form solution of the projection step in (10) into the corresponding fixed-point step:
$$\begin{aligned} u^* &= \arg\min_{u \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} u - \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 \\ &= \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \\ \theta^* &= \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} \theta - \tilde{\Phi} u^* \right\|_2^2 + \beta_1 \| \theta \|_1 \\ &= \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} \theta - \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 + \beta_1 \| \theta \|_1 \\ &= \arg\min_{\theta \in \mathbb{R}^{n \times 1}} \left\| \tilde{\Phi} \theta - \tilde{\Phi} \left( \tilde{\Phi}^T \tilde{\Phi} \right)^{-1} \tilde{\Phi}^T \left( \tilde{R} + \gamma \tilde{\Phi}' \theta \right) \right\|_2^2 + \beta_1 \| \theta \|_1. \end{aligned} \tag{12}$$
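The last step of (12) relies on $(\tilde{\Phi}^T \tilde{\Phi})^{-1} \tilde{\Phi}^T \tilde{\Phi} = I$, so the double projection collapses. A quick numerical sanity check of this collapse, on arbitrary synthetic features:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 5))            # m = 50 samples, n = 5 features
pinv = np.linalg.inv(Phi.T @ Phi) @ Phi.T     # (Phi^T Phi)^{-1} Phi^T
# pinv @ Phi is the n x n identity, so applying pinv twice changes nothing:
assert np.allclose(pinv @ Phi @ pinv, pinv)
```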
We choose the nested optimization problem in (10) as the regularization scheme in this paper. The graphical illustration of this problem is shown in Fig. 1.
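To make the two steps of (9) concrete, the following sketch alternates a least-squares projection step with a Lasso fixed-point step solved by ISTA (proximal gradient with soft-thresholding). The ISTA solver, step size, and hyperparameter values are illustrative assumptions, not this paper's algorithm, which instead uses iterative refinement and dedicated $\ell_1$ machinery.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def nested_step(Phi, PhiP, R, theta, gamma=0.95, beta1=0.01, ista_iters=200):
    # Projection step of (9): plain least squares against the Bellman target.
    u_star, *_ = np.linalg.lstsq(Phi, R + gamma * PhiP @ theta, rcond=None)
    # Fixed-point step of (9): Lasso on
    #   0.5 * ||Phi theta - Phi u*||_2^2 + beta1 * ||theta||_1, solved by ISTA.
    A, b = Phi.T @ Phi, Phi.T @ (Phi @ u_star)
    L = np.linalg.norm(A, 2)                  # Lipschitz constant of the gradient
    th = theta.copy()
    for _ in range(ista_iters):
        th = soft_threshold(th - (A @ th - b) / L, beta1 / L)
    return th

# Repeating nested_step from theta = 0 until theta stabilizes yields a sparse
# fixed point theta* of (9).
```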
More generally, an elastic net penalty [37] can be used:
$$p_{en}(\theta) \equiv \beta_1 \| \theta \|_1 + \beta_2 \| \theta \|_2^2 = \beta_1 \| \theta \|_1 + \frac{\mu}{\beta_1} \| \theta \|_2^2 \tag{13}$$
where $\beta_2 \in \mathbb{R}^+$ is an $\ell_2$-regularization parameter and $\mu \in \mathbb{R}^+$ is a tradeoff weight between the $\ell_2$ and $\ell_1$ penalties. The value of