Algorithm 1: The online LyDROO algorithm for solving (P1).
input : Parameters V, {γi, ci}_{i=1}^N, K, training interval δT, Mt update interval δM;
output: Control actions {x_t, y_t}_{t=1}^K;
1  Initialize the DNN with random parameters θ1 and empty replay memory; M1 ← 2N;
2  Empty initial data queue Qi(1) = 0 and energy queue Yi(1) = 0, for i = 1, ..., N;
3  for t = 1, 2, ..., K do
4      Observe the input ξt = {h_t, Q_i(t), Y_i(t)}_{i=1}^N and update Mt using (8) if mod(t, δM) = 0;
5      Generate a relaxed offloading action x̂t = Πθt(ξt) with the DNN;
6      Quantize x̂t into Mt binary actions {x_ti | i = 1, ..., Mt} using the NOP method;
7      Compute G(x_ti, ξt) by optimizing the resource allocation y_ti in (P2) for each x_ti;
8      Select the best solution x_t = arg max_{x_ti} G(x_ti, ξt) and execute the joint action (x_t, y_t);
9      Update the replay memory by adding (ξt, x_t);
10     if mod(t, δT) = 0 then
11         Uniformly sample a batch of data {(ξτ, xτ) | τ ∈ St} from the memory;
12         Train the DNN with {(ξτ, xτ) | τ ∈ St} and update θt using the Adam algorithm;
13     end
14     t ← t + 1;
15     Update {Q_i(t), Y_i(t)}_{i=1}^N based on (x_{t-1}, y_{t-1}) and the data arrival observations {A_i^{t-1}}_{i=1}^N using (5) and (7);
16 end

How are the model-based optimization, the optimization-free (model-free) DRL, and the DNN deep learning reflected in this algorithm?
In this algorithm, step 7 is the model-based optimization: for each candidate binary offloading vector $x_{ti}$, the per-slot utility $G(x_{ti}, \xi_t)$ is obtained by solving problem (P2), i.e., by optimizing the resource allocation $y_{ti}$ for that fixed offloading decision. Once the binary variables are fixed, (P2) is a convex resource-allocation problem and can be solved with standard convex-optimization techniques.
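As a concrete illustration of this step, here is a minimal sketch of how one candidate $x_{ti}$ could be scored. The real objective and constraints of (P2) are defined in the paper; this stand-in assumes a toy concave utility (a weighted log-throughput of the offloading users sharing one resource), so the function name `solve_p2`, the weights `w`, and the utility form are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize

def solve_p2(x, w, eps=1e-9):
    """Score one binary offloading vector x by optimizing the resource shares
    of the offloading users (a simplified stand-in for (P2))."""
    x = np.asarray(x)
    w = np.asarray(w, dtype=float)            # per-user weights (hypothetical)
    idx = np.flatnonzero(x)                   # users that offload under this candidate
    if idx.size == 0:
        return 0.0, np.zeros_like(w)

    def neg_utility(tau):
        # weighted log-throughput of the offloading users; concave in tau
        return -np.sum(w[idx] * np.log1p(tau / (w[idx] + eps)))

    cons = ({'type': 'ineq', 'fun': lambda t: 1.0 - np.sum(t)},)  # shares sum to <= 1
    bnds = [(0.0, 1.0)] * idx.size
    t0 = np.full(idx.size, 1.0 / idx.size)
    res = minimize(neg_utility, t0, bounds=bnds, constraints=cons)

    y = np.zeros_like(w)
    y[idx] = res.x                            # optimized resource allocation
    return float(-res.fun), y                 # (G value, y) for this candidate
```

Because the relaxed problem is convex once $x_{ti}$ is fixed, any off-the-shelf convex solver would do here; SLSQP via `scipy.optimize.minimize` is simply the easiest choice for a sketch.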
Steps 5 and 6 are the learning-based (model-free DRL) part: a deep neural network implements the policy $\Pi_{\theta_t}$, which maps the observed state $\xi_t$ to a relaxed (continuous) offloading decision $\hat{x}_t \in [0, 1]^N$. The NOP (noisy order-preserving) quantization method then converts $\hat{x}_t$ into $M_t$ candidate binary offloading vectors $\{x_{ti}\}$, essentially by order-preserving quantization of $\hat{x}_t$ and of noise-perturbed copies of it, which adds exploration diversity. Each candidate is then scored with the model-based step above, i.e., $G(x_{ti}, \xi_t)$ is computed by solving (P2), and the candidate with the largest utility is selected as $x_t$ and executed together with its resource allocation.
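A minimal sketch of steps 5–6 follows, assuming a small PyTorch MLP with made-up layer sizes for the policy $\Pi_{\theta_t}$; the flip-based quantizer only mimics the order-preserving idea plus a noise-perturbed copy, and the exact NOP rule in the paper differs in its details.

```python
import numpy as np
import torch
import torch.nn as nn

class PolicyDNN(nn.Module):
    """Pi_theta: maps the state xi_t to a relaxed offloading action in (0, 1)^N."""
    def __init__(self, state_dim, n_users, hidden=(120, 80)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], n_users), nn.Sigmoid(),
        )

    def forward(self, xi):
        return self.net(xi)

def op_quantize(x_hat, m):
    """Simplified order-preserving quantization: candidate 1 rounds x_hat;
    candidates 2..m flip one entry each, most 'uncertain' entries first."""
    x_hat = np.asarray(x_hat, dtype=float)
    base = (x_hat > 0.5).astype(int)
    order = np.argsort(np.abs(x_hat - 0.5))        # distance to the 0.5 threshold
    cands = [base.copy()]
    for k in range(min(m - 1, x_hat.size)):
        c = base.copy()
        c[order[k]] = 1 - c[order[k]]
        cands.append(c)
    return cands[:m]

def nop_quantize(x_hat, m, noise_std=0.1, rng=np.random.default_rng(0)):
    """Noisy variant: part of the candidates come from x_hat itself, the rest
    from a noise-perturbed copy, to widen exploration."""
    x_hat = np.asarray(x_hat, dtype=float)
    noisy = np.clip(x_hat + rng.normal(0.0, noise_std, size=x_hat.size), 0.0, 1.0)
    return op_quantize(x_hat, m - m // 2) + op_quantize(noisy, m // 2)
```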
Steps 10–12 are where the deep learning (the DRL training loop) updates the policy parameters $\theta_t$: every $\delta_T$ time slots, a batch of past state–action pairs $(\xi_\tau, x_\tau)$ is sampled uniformly from the replay memory and used as training data, and the DNN is trained on this batch with the Adam optimizer to update $\theta_t$. In this way the DNN gradually learns a better mapping from states to (near-)optimal offloading decisions, improving the quality of the actions it proposes over time.
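A minimal sketch of this training step, assuming a DROO-style supervised update in which the stored best action $x_\tau$ serves as a label for the state $\xi_\tau$ and a binary cross-entropy loss is used; the batch size, learning rate, and the `PolicyDNN` class from the previous sketch are illustrative choices, not values from the paper.

```python
import random
import torch
import torch.nn as nn

def train_step(policy, optimizer, replay_memory, batch_size=128):
    """One DNN update: uniformly sample (xi_tau, x_tau) pairs and run Adam."""
    if len(replay_memory) < batch_size:
        return None                                           # not enough samples yet
    batch = random.sample(replay_memory, batch_size)          # uniform sampling (step 11)
    xi = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _ in batch])
    x_star = torch.stack([torch.as_tensor(a, dtype=torch.float32) for _, a in batch])

    x_hat = policy(xi)                                        # relaxed actions in (0, 1)
    loss = nn.functional.binary_cross_entropy(x_hat, x_star)  # match the stored best actions

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # Adam update of theta (step 12)
    return loss.item()

# Usage with hypothetical sizes:
# policy = PolicyDNN(state_dim=30, n_users=10)
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# loss = train_step(policy, optimizer, replay_memory)
```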
Related questions
How is the optimization-free (model-free) DRL reflected in this algorithm?
In optimization-free (model-free) DRL, the control action is produced directly by the DNN rather than refined by an optimization solver: at each time step the DNN maps the current state (and, implicitly, its past experience) to an action, which is then executed as-is. Because no per-slot optimization is involved, such a purely learned policy typically needs many more samples to converge, can get trapped in poor local optima during training, and may never reach the optimal solution. This is why the hybrid design above, which combines the DNN with model-based optimization of (P2), generally performs better on this control problem.
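For contrast, a purely optimization-free decision could be sketched as below: the DNN output is simply rounded and executed, with no (P2) solve in the loop. The function name is hypothetical.

```python
import numpy as np

def act_pure_model_free(x_hat):
    """Optimization-free decision: execute the rounded DNN output as-is."""
    return (np.asarray(x_hat) > 0.5).astype(int)

# The hybrid algorithm above never executes x_hat directly: it quantizes
# x_hat into M_t candidates (step 6) and lets the model-based solver pick
# the best one (steps 7-8), which is what stabilizes its performance.
```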
What is used to make the optimal action decision in this algorithm?
The optimal action is decided by evaluating each candidate binary action with the model-based solver: in step 7, for every candidate $x_{ti}$ the algorithm optimizes the resource allocation $y_{ti}$ in (P2) and obtains the utility value $G(x_{ti}, \xi_t)$; in step 8 it selects the candidate with the largest $G$ value as $x_t$, executes the joint action $(x_t, y_t)$, and then stores $(\xi_t, x_t)$ in the replay memory (step 9).
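Putting the pieces together, a minimal sketch of the decision rule in steps 7–8 could look as follows, reusing `nop_quantize` and `solve_p2` from the earlier sketches (both are simplified stand-ins, not the paper's exact formulations).

```python
import numpy as np

def select_best_action(x_hat, m, w):
    """Score every binary candidate with the model-based solver and return
    the joint action (x_t, y_t) achieving the largest G (steps 7-8)."""
    best_G, best_x, best_y = -np.inf, None, None
    for x in nop_quantize(x_hat, m):          # candidates from step 6
        G, y = solve_p2(x, w)                 # model-based evaluation (step 7)
        if G > best_G:
            best_G, best_x, best_y = G, x, y
    return best_x, best_y, best_G             # arg max over candidates (step 8)
```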