Give a reinforcement learning Python code example
Sure. Below is a simple reinforcement learning example in Python that trains a tabular Q-learning agent on the CartPole-v1 environment from Gym. Since CartPole's observations are continuous, we discretize them into bins so they can index a Q-table.
First, import the required libraries:
```python
import gym
import numpy as np
```
Set the hyperparameters:
```python
gamma = 0.99        # discount factor
alpha = 0.5         # learning rate
episodes = 500      # number of training episodes
max_steps = 200     # maximum steps per episode
epsilon = 1.0       # initial exploration rate
min_epsilon = 0.01  # minimum exploration rate
decay_rate = 0.01   # exploration-rate decay
```
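For intuition, the training loop below decays epsilon as `min_epsilon + (1 - min_epsilon) * exp(-decay_rate * episode)`; a quick illustrative check of what that schedule produces with these values:
```python
import numpy as np

for ep in (0, 100, 500):
    eps = 0.01 + (1 - 0.01) * np.exp(-0.01 * ep)
    print(ep, round(float(eps), 3))  # 0 -> 1.0, 100 -> ~0.374, 500 -> ~0.017
```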
Define the Q-table and initialize it to zero. Note that CartPole's observation space is continuous (a `Box` of four floats), so `env.observation_space.n` does not exist; instead we discretize each state variable into a fixed number of bins and build the table over those bins:
```python
env = gym.make('CartPole-v1')
n_bins = (6, 12, 6, 12)  # bins per state dimension (illustrative choice):
                         # cart position, cart velocity, pole angle, pole angular velocity
Q = np.zeros(n_bins + (env.action_space.n,))
```
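The helper below maps a continuous observation to a tuple of bin indices so it can index the Q-table. This is a minimal sketch: the clipping bounds chosen for the two unbounded velocity dimensions (like the bin counts above) are illustrative assumptions, not part of the original answer:
```python
state_bounds = list(zip(env.observation_space.low, env.observation_space.high))
state_bounds[1] = (-3.0, 3.0)  # cart velocity is unbounded in Gym; clip it (assumed range)
state_bounds[3] = (-3.5, 3.5)  # pole angular velocity is also unbounded (assumed range)

def discretize(obs):
    """Map a continuous CartPole observation to a tuple of bin indices."""
    indices = []
    for i, val in enumerate(obs):
        low, high = state_bounds[i]
        val = min(max(val, low), high)     # clip into the bounded range
        frac = (val - low) / (high - low)  # normalize to [0, 1]
        indices.append(min(int(frac * n_bins[i]), n_bins[i] - 1))
    return tuple(indices)
```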
Define a function that selects an action with an epsilon-greedy policy:
```python
def choose_action(state, epsilon):
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()  # explore: random action
    else:
        action = int(np.argmax(Q[state]))   # exploit: best known action
    return action
```
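As a quick sanity check (illustrative; it assumes the env, Q-table, and `discretize()` defined above, and the classic Gym API used throughout):
```python
state = discretize(env.reset())
print(choose_action(state, 1.0))  # epsilon = 1.0: always samples a random action
print(choose_action(state, 0.0))  # epsilon = 0.0: greedy; returns 0 while Q is still all zeros
```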
Run the training loop:
```python
rewards = []
for episode in range(episodes):
    # classic Gym API (gym < 0.26): reset() returns the observation directly;
    # newer Gym/Gymnasium returns (obs, info) here and five values from step()
    state = discretize(env.reset())
    episode_reward = 0
    for step in range(max_steps):
        # choose an action (epsilon-greedy)
        action = choose_action(state, epsilon)
        # apply the action
        obs, r, done, info = env.step(action)
        next_state = discretize(obs)
        # Q-learning update
        best_next = np.max(Q[next_state])
        Q[state + (action,)] = (1 - alpha) * Q[state + (action,)] + alpha * (r + gamma * best_next)
        # advance the state and accumulate reward
        state = next_state
        episode_reward += r
        if done:
            break
    # decay the exploration rate
    epsilon = min_epsilon + (1 - min_epsilon) * np.exp(-decay_rate * episode)
    # record the episode reward
    rewards.append(episode_reward)
    print("Episode {}: Reward = {}".format(episode, episode_reward))
```
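To visualize training progress, you can plot the recorded rewards; a minimal sketch, assuming matplotlib is installed (it was not part of the original answer):
```python
import matplotlib.pyplot as plt

plt.plot(rewards)  # total reward obtained in each episode
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("CartPole Q-learning training curve")
plt.show()
```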
Finally, we can evaluate the performance of the trained agent:
```python
total_reward = 0
for _ in range(10):
    state = discretize(env.reset())
    for step in range(max_steps):
        action = int(np.argmax(Q[state]))  # greedy action, no exploration
        obs, r, done, info = env.step(action)
        state = discretize(obs)
        total_reward += r
        if done:
            break
print("Average reward:", total_reward / 10)
```
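Once evaluation is done, release the environment's resources:
```python
env.close()
```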
Hopefully this simple example helps.