Reinforcement Learning: Maze Navigation Code
### Solving a Maze with Q-Learning
To show how Python and the Q-learning algorithm can solve a maze problem, a complete code example is given below. The code builds a simple maze environment and, through repeated training episodes, lets an agent learn the best path from the start cell to the goal cell.
#### The maze environment class `MazeEnv`
This part defines the maze layout and its basic operations:
```python
import numpy as np


class MazeEnv(object):
    """A simple grid maze: the agent starts in the top-left corner and must
    reach the bottom-right corner."""

    def __init__(self, maze_size=5):
        self.maze = np.zeros((maze_size, maze_size))
        self.start_position = (0, 0)
        self.goal_position = (maze_size - 1, maze_size - 1)
        self.current_state = None

    def reset(self):
        """Reset the environment to its initial state."""
        self.current_state = self.start_position
        return self._state_to_index(self.current_state)

    def _state_to_index(self, state):
        """Convert a 2-D grid coordinate into a flat 1-D index."""
        row, col = state
        return row * self.maze.shape[1] + col

    def step(self, action):
        """
        Take one move and return the new state, the immediate reward,
        and a done flag.
        Parameters:
            action : action id {0: 'up', 1: 'down', 2: 'left', 3: 'right'}
        Returns:
            next_state: index of the new state
            reward: immediate reward
            done: whether a terminal condition was reached (True/False)
        """
        actions_dict = ['up', 'down', 'left', 'right']
        move_direction = actions_dict[action]
        current_row, current_col = self.current_state

        # Moves that would leave the grid are clamped to the border.
        if move_direction == "up":
            new_pos = max(current_row - 1, 0), current_col
        elif move_direction == "down":
            new_pos = min(current_row + 1, self.maze.shape[0] - 1), current_col
        elif move_direction == "left":
            new_pos = current_row, max(0, current_col - 1)
        else:  # right
            new_pos = current_row, min(self.maze.shape[1] - 1, current_col + 1)

        if new_pos != self.current_state:
            self.current_state = new_pos

        reached_goal = (new_pos == self.goal_position)
        # A small step penalty encourages short paths; reaching the goal gives +1.
        reward = 1.0 if reached_goal else -0.01
        # Positions are clamped to the grid, so the only terminal state is the goal.
        done = reached_goal
        return self._state_to_index(self.current_state), reward, done
```
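Before training, it can help to sanity-check the environment by hand. The short snippet below is a minimal sketch and not part of the original example; it only assumes the `MazeEnv` class defined above, resets a 5×5 maze, and takes a few hard-coded moves:
```python
env = MazeEnv(maze_size=5)
state = env.reset()
print("start state index:", state)  # 0, the flat index of the top-left corner

# Action ids: 0 = up, 1 = down, 2 = left, 3 = right.
# Move right twice, then down once; each step returns (state, reward, done).
for action in [3, 3, 1]:
    state, reward, done = env.step(action)
    print(f"state={state}, reward={reward}, done={done}")
```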
#### The Q-learning training loop
The next part shows how to run the Q-learning training procedure on the environment defined above:
```python
def q_learning(env, num_episodes=500, alpha=0.8, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    n_states = env.maze.size
    n_actions = 4
    q_table = np.zeros([n_states, n_actions])

    for i_episode in range(num_episodes):
        observation = env.reset()
        while True:
            # Epsilon-greedy action selection.
            if np.random.uniform() < epsilon:
                action = np.random.choice(n_actions)      # Explore
            else:
                action = np.argmax(q_table[observation])  # Exploit learned values

            next_observation, reward, done = env.step(action)

            # Q-learning update:
            # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
            old_value = q_table[observation][action]
            next_max = np.max(q_table[next_observation])
            new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
            q_table[observation][action] = new_value

            observation = next_observation
            if done:
                break
    return q_table
```
The above is a simple Python implementation of Q-learning for maze navigation[^1]. During training, the `q_learning()` function gradually adjusts the values in the Q-table until they converge to a stable policy, allowing the agent to reach the goal in the fewest possible steps.
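As a rough illustration of how the learned table might be used, the sketch below is not from the original post; the helper name `rollout_greedy` is introduced here purely for this example. It trains on a 5×5 maze and then follows the greedy policy implied by the Q-table from the start cell:
```python
def rollout_greedy(env, q_table, max_steps=50):
    """Follow the greedy policy implied by q_table; return the visited state indices."""
    state = env.reset()
    path = [state]
    for _ in range(max_steps):
        action = np.argmax(q_table[state])
        state, _, done = env.step(action)
        path.append(state)
        if done:
            break
    return path


env = MazeEnv(maze_size=5)
q_table = q_learning(env, num_episodes=500)
# With a converged table, the printed path should end at index 24 (the goal of a 5x5 maze).
print(rollout_greedy(env, q_table))
```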