What exactly does `df[['discounted_price']] = df[['discounted_price']].replace({'\₹':''},regex=True)` do?

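This line removes the "₹" symbol from the values in the DataFrame's 'discounted_price' column. `df[['discounted_price']]` (note the double brackets) selects that column as a one-column DataFrame rather than a Series. Calling `.replace({'\₹':''},regex=True)` on it invokes the DataFrame's replace() method: the dict `{'\₹':''}` maps a pattern to its replacement, and `regex=True` tells pandas to treat the key as a regular expression. As a result, every occurrence of "₹" inside a string is deleted and the rest of the string is kept, so "₹399" becomes "399". Without `regex=True`, only cells whose entire value equals the key would be replaced. Finally, the result is assigned back to the 'discounted_price' column of the original DataFrame.

Two caveats. The values are still strings after the replacement, so they need a further conversion before any arithmetic. The backslash is also unnecessary: "₹" has no special meaning in a regular expression, and `'\₹'` is an invalid escape sequence in an ordinary Python string literal, which recent Python versions warn about. Writing `'₹'` gives the same result without the warning.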
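The behaviour described above can be checked on a small example. This is a minimal sketch: the sample prices are made up for illustration, and the final `pd.to_numeric` step is an assumed follow-up that is not part of the original line.

```python
import pandas as pd

# Hypothetical sample data; the real column contents are not shown in the question.
df = pd.DataFrame({"discounted_price": ["₹399", "₹1299", "₹59"]})

# Same operation as in the question, with the unnecessary backslash dropped.
df[["discounted_price"]] = df[["discounted_price"]].replace({"₹": ""}, regex=True)

# The symbol is gone, but the values are still strings.
stripped = df["discounted_price"].tolist()

# Assumed follow-up step: convert the cleaned strings to numbers.
prices = pd.to_numeric(df["discounted_price"])

print(stripped)
print(prices.sum())
```

If the real data also contains thousands separators such as "1,299", those would have to be removed as well before the numeric conversion succeeds.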

    while not ep_done:
        num_steps += 1
        if train_params.RENDER:
            self.env_wrapper.render()
        # Add batch dimension to single state input, and remove batch dimension from single action output
        action = self.sess.run(self.actor_net.output, {self.state_ph: np.expand_dims(state, 0)})[0]
        action += (gaussian_noise() * train_params.NOISE_DECAY**num_eps)
        next_state, reward, terminal = self.env_wrapper.step(action)

        episode_reward += reward

        next_state = self.env_wrapper.normalise_state(next_state)
        reward = self.env_wrapper.normalise_reward(reward)

        self.exp_buffer.append((state, action, reward))

        if len(self.exp_buffer) >= train_params.N_STEP_RETURNS:
            state_0, action_0, reward_0 = self.exp_buffer.popleft()
            discounted_reward = reward_0
            gamma = train_params.DISCOUNT_RATE
            for (_, _, r_i) in self.exp_buffer:
                discounted_reward += r_i * gamma
                gamma *= train_params.DISCOUNT_RATE

            run_agent_event.wait()
            PER_memory.add(state_0, action_0, discounted_reward, next_state, terminal, gamma)

        state = next_state

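This code is part of the agent's main loop. It selects an action, steps the environment, and updates the experience buffers. On each iteration it does the following:

1. Increments the step counter.
2. Renders the environment if rendering is enabled.
3. Runs the actor network on the current state to get an action.
4. Adds Gaussian noise to the action, scaled by `NOISE_DECAY**num_eps`, so the noise shrinks as more episodes are completed (assuming `NOISE_DECAY` is below 1).
5. Executes the action and observes the next state, the reward, and the terminal flag.
6. Adds the raw reward to the episode total.
7. Normalises the next state.
8. Normalises the reward.
9. Appends the `(state, action, reward)` tuple to the short experience buffer.
10. Once the buffer holds at least `N_STEP_RETURNS` entries, pops the oldest entry, computes its N-step discounted return, and adds the resulting transition to the prioritised experience replay (PER) memory.
11. Sets the current state to the next state.

The actor network produces the action for the current state, and the Gaussian noise adds exploration. The short buffer exists only to assemble N-step transitions: the return for the oldest entry is its own reward plus the discounted rewards of the entries that follow it, and the final value of `gamma` (the discount rate raised to the number of rewards summed) is stored with the transition. `run_agent_event.wait()` blocks until that event is set, which lets another thread pause the agent before it writes to the PER memory. The fragment does not show where `ep_done` is updated, so that presumably happens in the part of the loop that was not pasted.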
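The N-step return computation in step 10 can be isolated as a small function. This is a sketch under assumptions: `pop_n_step_transition` is a hypothetical helper name, the buffer is assumed to be a `collections.deque` of `(state, action, reward)` tuples as in the fragment, and the states, actions, and rewards are placeholder values.

```python
from collections import deque


def pop_n_step_transition(exp_buffer, discount_rate):
    """Pop the oldest entry and return it with its N-step discounted return.

    Mirrors the inner block of the loop above: the oldest reward is taken
    undiscounted, and each later reward is weighted by a growing power of
    the discount rate.
    """
    state_0, action_0, reward_0 = exp_buffer.popleft()
    discounted_reward = reward_0
    gamma = discount_rate
    for (_, _, r_i) in exp_buffer:
        discounted_reward += r_i * gamma
        gamma *= discount_rate
    # gamma is now discount_rate ** (number of rewards summed)
    return state_0, action_0, discounted_reward, gamma


# Placeholder transitions with rewards 1, 2 and 4 (N = 3).
buffer = deque([("s0", "a0", 1.0), ("s1", "a1", 2.0), ("s2", "a2", 4.0)])
state_0, action_0, n_step_return, final_gamma = pop_n_step_transition(buffer, 0.5)

print(n_step_return)  # 1 + 0.5*2 + 0.25*4 = 3.0
print(final_gamma)    # 0.5 ** 3 = 0.125
```

Storing the final `gamma` alongside the transition lets the learner discount the bootstrapped value of `next_state` by the right power without needing to know N.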

    class PPO(object):

        def __init__(self):
            self.sess = tf.Session()
            self.tfs = tf.placeholder(tf.float32, [None, S_DIM], 'state')

            # critic
            with tf.variable_scope('critic'):
                l1 = tf.layers.dense(self.tfs, 100, tf.nn.relu)
                self.v = tf.layers.dense(l1, 1)
                self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')
                self.advantage = self.tfdc_r - self.v
                self.closs = tf.reduce_mean(tf.square(self.advantage))
                self.ctrain_op = tf.train.AdamOptimizer(C_LR).minimize(self.closs)

            # actor
            pi, pi_params = self._build_anet('pi', trainable=True)
            oldpi, oldpi_params = self._build_anet('oldpi', trainable=False)
            with tf.variable_scope('sample_action'):
                self.sample_op = tf.squeeze(pi.sample(1), axis=0)  # choosing action
            with tf.variable_scope('update_oldpi'):
                self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]

            self.tfa = tf.placeholder(tf.float32, [None, A_DIM], 'action')
            self.tfadv = tf.placeholder(tf.float32, [None, 1], 'advantage')
            with tf.variable_scope('loss'):
                with tf.variable_scope('surrogate'):
                    # ratio = tf.exp(pi.log_prob(self.tfa) - oldpi.log_prob(self.tfa))
                    ratio = pi.prob(self.tfa) / (oldpi.prob(self.tfa) + 1e-5)
                    surr = ratio * self.tfadv
                if METHOD['name'] == 'kl_pen':
                    self.tflam = tf.placeholder(tf.float32, None, 'lambda')
                    kl = tf.distributions.kl_divergence(oldpi, pi)
                    self.kl_mean = tf.reduce_mean(kl)
                    self.aloss = -(tf.reduce_mean(surr - self.tflam * kl))
                else:  # clipping method, find this is better
                    self.aloss = -tf.reduce_mean(tf.minimum(
                        surr,
                        tf.clip_by_value(ratio, 1.-METHOD['epsilon'], 1.+METHOD['epsilon'])*self.tfadv))

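This code builds an actor-critic model trained with PPO (Proximal Policy Optimization), written against the TensorFlow 1.x API. The critic estimates the value of the current state and is trained by minimising the squared difference between the discounted return and its prediction; that difference is also the advantage, which measures how much better an action turned out than the critic expected. The actor defines the policy `pi` from which actions are sampled, and a frozen copy `oldpi` holds the policy from before the update. The actor is trained on a surrogate loss: the probability ratio between `pi` and `oldpi` multiplied by the advantage. To keep training stable, the code limits how far the new policy can move from the old one, either by subtracting a KL-divergence penalty (`kl_pen`) or by clipping the probability ratio to the range [1 - epsilon, 1 + epsilon]. Note that both methods constrain the policy update, not the size of the advantage.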
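The clipped surrogate loss in the `else` branch can be reproduced with plain NumPy to see what the clipping does. This is an illustrative sketch, not the original TensorFlow graph: the function name `clipped_surrogate_loss` and the sample ratios and advantages are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np


def clipped_surrogate_loss(ratio, advantage, epsilon):
    """NumPy version of the clipped PPO actor loss from the code above."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    surr = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Take the more pessimistic of the two estimates, then negate so that
    # minimising the loss maximises the surrogate objective.
    return -np.mean(np.minimum(surr, clipped))


# Sample 1: ratio 1.5 with positive advantage, clipped down to 1.25.
# Sample 2: ratio 0.5 with negative advantage, clipped up to 0.75.
loss = clipped_surrogate_loss(ratio=[1.5, 0.5], advantage=[1.0, -1.0], epsilon=0.25)
print(loss)  # -(1.25 + (-0.75)) / 2 = -0.25
```

In the first sample the clip caps the gain from raising the action's probability, and in the second the minimum selects the clipped, more negative term. In both cases the selected term no longer depends on the ratio, so the gradient for that sample is zero once the ratio leaves the allowed range in the direction the advantage favours.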
