of these models. The general architecture, which is shared by many of these models, consists of
two RNN networks called the encoder and decoder. An encoder network reads through the input
sequence and stores the knowledge in a fixed-size vector representation (or a sequence of vectors);
then, a decoder converts the encoded information back to an output sequence.
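For concreteness, the following minimal sketch illustrates the encoding half of this idea, assuming a plain tanh recurrence in place of the LSTM/GRU cells typically used; all function names and dimensions are illustrative and not taken from any particular implementation.

```python
import numpy as np

def rnn_step(W_in, W_h, x, h):
    """One recurrent update: mix the current input with the previous hidden state."""
    return np.tanh(W_in @ x + W_h @ h)

def encode(W_in, W_h, inputs):
    """Read the whole source sequence and return the final hidden state,
    i.e. the fixed-size vector the decoder is conditioned on."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = rnn_step(W_in, W_h, x, h)
    return h

# toy usage: a 3-step source sequence of 2-dimensional inputs, 4 hidden units
rng = np.random.default_rng(0)
W_in, W_h = rng.normal(size=(4, 2)), rng.normal(size=(4, 4))
h_final = encode(W_in, W_h, [rng.normal(size=2) for _ in range(3)])
```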
In the vanilla Sequence-to-Sequence architecture [32], the source sequence appears only once in the
encoder and the entire output sequence is generated based on one vector (i.e., the last hidden state
of the encoder RNN). Other extensions, for example Bahdanau et al. [3], illustrate that the source
information can be used more explicitly to increase the amount of information during the decoding
steps. In addition to the encoder and decoder networks, they employ another neural network, namely
an attention mechanism that attends to the entire encoder RNN states. This mechanism allows the
decoder to focus on the important locations of the source sequence and use the relevant information
during decoding steps for producing “better” output sequences. Recently, the concept of attention has
been a popular research idea due to its capability to align different objects, e.g., in computer vision
[6, 18, 39, 40] and neural machine translation [3, 19, 25]. In this study, we also employ a special
attention structure for the policy parameterization. See Section 3.3 for a detailed discussion of the
attention mechanism.
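As a generic illustration of content-based (additive) attention in the spirit of Bahdanau et al. [3], the sketch below scores every encoder state against the current decoder state and forms a weighted context vector; the parameters W_e, W_d, and v are illustrative and do not correspond to the exact parameterization described in Section 3.3.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(enc_states, dec_state, W_e, W_d, v):
    """Score each encoder state against the current decoder state, normalize the
    scores into alignment weights, and return the weighted context vector."""
    enc_states = np.asarray(enc_states)                                  # (T, d_enc)
    scores = np.array([v @ np.tanh(W_e @ e + W_d @ dec_state) for e in enc_states])
    a = softmax(scores)                                                  # one weight per source position
    context = a @ enc_states                                             # (d_enc,)
    return a, context
```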
Neural Combinatorial Optimization
Over the last several years, multiple methods have been
developed to tackle combinatorial optimization problems by using recent advances in artificial
intelligence. The first attempt was proposed by Vinyals et al. [34], who introduce the concept of
a Pointer Network, a model originally inspired by sequence-to-sequence models. Because it is
invariant to the length of the encoder sequence, the Pointer Network can be applied to
combinatorial optimization problems, where the output sequence length is determined by the source
sequence. They use the Pointer Network architecture in a supervised fashion to find near-optimal
Traveling Salesman Problem (TSP) tours from ground truth optimal (or heuristic) solutions. This
dependence on supervision prohibits the Pointer Network from finding better solutions than the ones
provided during training.
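The pointing idea itself can be sketched as follows; for brevity we assume simple dot-product scores rather than the additive scoring of the original Pointer Network, and the point is only that the output distribution is defined directly over the input positions, so it adapts to any number of inputs.

```python
import numpy as np

def point(enc_states, dec_state):
    """Turn compatibility scores between the decoder state and every encoder
    state into a probability distribution over the input positions themselves."""
    scores = np.array([e @ dec_state for e in enc_states])
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

# the distribution has one entry per input, regardless of how many inputs there are
enc_states = [np.array([0.1, 0.4]), np.array([0.7, 0.2]), np.array([0.3, 0.9])]
probs = point(enc_states, dec_state=np.array([0.5, 0.5]))
next_node = int(np.argmax(probs))   # greedy choice of the next element to output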
Closest to our approach, Bello et al. [4]
address this issue by developing a neural combinatorial
optimization framework that uses RL to optimize a policy modeled by a Pointer Network. Using
several classical combinatorial optimization problems such as TSP and the knapsack problem, they
show the effectiveness and generality of their architecture.
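A minimal sketch of such a policy-gradient (REINFORCE-style) update is given below, assuming the gradient of the sampled tour's log-probability with respect to the parameters is available; the baseline and the numbers are illustrative and do not reproduce Bello et al.'s exact training setup.

```python
import numpy as np

def reinforce_update(log_prob_grads, tour_length, baseline, theta, lr=1e-3):
    """Policy-gradient step: the gradient of the sampled tour's log-probability is
    scaled by the baseline-corrected tour length, so tours shorter than the
    baseline have their probability increased."""
    advantage = tour_length - baseline
    return theta - lr * advantage * log_prob_grads

# toy usage with made-up numbers: 5 parameters, one sampled tour
theta = np.zeros(5)
log_prob_grads = np.array([0.2, -0.1, 0.05, 0.0, 0.3])   # d log p(tour) / d theta
theta = reinforce_update(log_prob_grads, tour_length=7.4, baseline=8.0, theta=theta)
```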
On a related topic, Dai et al. [11] solve optimization problems over graphs using a graph embedding structure [10] and a deep Q-learning (DQN) algorithm [26]. Even though VRP can be represented
by a graph with weighted nodes and edges, their proposed approach does not directly apply, since in the VRP a particular node (e.g., the depot) might be visited multiple times.
Next, we introduce our model, which is a simplified version of the Pointer Network.
3 The Model
In this section, we formally define the problem and our proposed framework for a generic combinatorial optimization problem with a given set of inputs $X \doteq \{x^i,\ i = 1, \cdots, M\}$. We allow some of
the elements of each input to change between the decoding steps, which is, in fact, the case in many
problems such as the VRP. The dynamic elements might be an artifact of the decoding procedure
itself, or they can be imposed by the environment. For example, in the VRP, the remaining customer
demands change over time as the vehicle visits the customer nodes; or we might consider a variant
in which new customers arrive or adjust their demand values over time, independent of the vehicle
decisions. Formally, we represent each input $x^i$ by a sequence of tuples $\{x^i_t \doteq (s^i, d^i_t),\ t = 0, 1, \cdots\}$,
where $s^i$ and $d^i_t$ are the static and dynamic elements of the input, respectively, and can themselves be tuples. One can view $x^i_t$ as a vector of features that describes the state of input $i$ at time $t$. For
instance, in the VRP, $x^i_t$ gives a snapshot of customer $i$, where $s^i$ corresponds to the 2-dimensional coordinate of customer $i$'s location and $d^i_t$ is its demand at time $t$. We will denote the set of all input states at a fixed time $t$ with $X_t$.
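A minimal sketch of this input representation for a VRP-like instance is given below; the class name VRPInput and the numbers are illustrative and not part of the formulation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VRPInput:
    static: Tuple[float, float]   # s^i: fixed 2-D coordinates of customer i
    demand: float                 # d^i_t: remaining demand, may change over time

# X_t: the set of all input states at a fixed decoding time t
X_t: List[VRPInput] = [
    VRPInput(static=(0.2, 0.7), demand=4.0),
    VRPInput(static=(0.5, 0.1), demand=7.0),
    VRPInput(static=(0.0, 0.0), demand=0.0),   # e.g., the depot carries no demand
]

# serving part of customer 0's demand only touches its dynamic element;
# the static coordinates never change between decoding steps
X_t[0].demand = max(0.0, X_t[0].demand - 3.0)
```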
We start from an arbitrary input in $X_0$, where we use the pointer $y_0$ to refer to that input. At every decoding time $t$ ($t = 0, 1, \cdots$), $y_{t+1}$ points to one of the available inputs $X_t$, which determines
the input of the next decoder step; this process continues until a termination condition is satisfied.
The termination condition is problem-specific, showing that the generated sequence satisfies the