of these models. The general architecture, which is shared by many of these models, consists of
two RNN networks called the encoder and decoder. An encoder network reads through the input
sequence and stores the knowledge in a fixed-size vector representation (or a sequence of vectors);
then, a decoder converts the encoded information back to an output sequence.
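For concreteness, the following minimal sketch illustrates the encoding half of this idea, assuming a plain tanh recurrence in place of the LSTM/GRU cells typically used; all function names and dimensions are illustrative and not taken from any particular implementation.

```python
import numpy as np

def rnn_step(W_in, W_h, x, h):
    """One recurrent update: mix the current input with the previous hidden state."""
    return np.tanh(W_in @ x + W_h @ h)

def encode(W_in, W_h, inputs):
    """Read the whole source sequence and return the final hidden state,
    i.e. the fixed-size vector the decoder is conditioned on."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = rnn_step(W_in, W_h, x, h)
    return h

# toy usage: a 3-step source sequence of 2-dimensional inputs, 4 hidden units
rng = np.random.default_rng(0)
W_in, W_h = rng.normal(size=(4, 2)), rng.normal(size=(4, 4))
h_final = encode(W_in, W_h, [rng.normal(size=2) for _ in range(3)])
```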
In the vanilla Sequence-to-Sequence architecture [32], the source sequence appears only once in the
encoder and the entire output sequence is generated based on one vector (i.e., the last hidden state
of the encoder RNN). Other extensions, for example Bahdanau et al. [3], illustrate that the source
information can be used more explicitly to increase the amount of information during the decoding
steps. In addition to the encoder and decoder networks, they employ another neural network, namely
an attention mechanism that attends to the entire encoder RNN states. This mechanism allows the
decoder to focus on the important locations of the source sequence and use the relevant information
during decoding steps for producing “better” output sequences. Recently, the concept of attention has
been a popular research idea due to its capability to align different objects, e.g., in computer vision
[6, 18, 39, 40] and neural machine translation [3, 19, 25]. In this study, we also employ a special
attention structure for the policy parameterization. See Section 3.3 for a detailed discussion of the
attention mechanism.
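As a generic illustration of content-based (additive) attention in the spirit of Bahdanau et al. [3], the sketch below scores every encoder state against the current decoder state and forms a weighted context vector; the parameters W_e, W_d, and v are illustrative and do not correspond to the exact parameterization described in Section 3.3.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(enc_states, dec_state, W_e, W_d, v):
    """Score each encoder state against the current decoder state, normalize the
    scores into alignment weights, and return the weighted context vector."""
    enc_states = np.asarray(enc_states)                                  # (T, d_enc)
    scores = np.array([v @ np.tanh(W_e @ e + W_d @ dec_state) for e in enc_states])
    a = softmax(scores)                                                  # one weight per source position
    context = a @ enc_states                                             # (d_enc,)
    return a, context
```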
Neural Combinatorial Optimization
Over the last several years, multiple methods have been
developed to tackle combinatorial optimization problems by using recent advances in artificial
intelligence. The first attempt was proposed by Vinyals et al. [34], who introduce the concept of
a Pointer Network, a model originally inspired by sequence-to-sequence models. Because it is
invariant to the length of the encoder sequence, the Pointer Network can be applied to
combinatorial optimization problems, where the output sequence length is determined by the source
sequence. They use the Pointer Network architecture in a supervised fashion to find near-optimal
Traveling Salesman Problem (TSP) tours from ground truth optimal (or heuristic) solutions. This
dependence on supervision prohibits the Pointer Network from finding better solutions than the ones
provided during training.
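The pointing idea itself can be sketched as follows; for brevity we assume simple dot-product scores rather than the additive scoring of the original Pointer Network, and the point is only that the output distribution is defined directly over the input positions, so it adapts to any number of inputs.

```python
import numpy as np

def point(enc_states, dec_state):
    """Turn compatibility scores between the decoder state and every encoder
    state into a probability distribution over the input positions themselves."""
    scores = np.array([e @ dec_state for e in enc_states])
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

# the distribution has one entry per input, regardless of how many inputs there are
enc_states = [np.array([0.1, 0.4]), np.array([0.7, 0.2]), np.array([0.3, 0.9])]
probs = point(enc_states, dec_state=np.array([0.5, 0.5]))
next_node = int(np.argmax(probs))   # greedy choice of the next element to output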
Closest to our approach, Bello et al. [4]
address this issue by developing a neural combinatorial
optimization framework that uses RL to optimize a policy modeled by a Pointer Network. Using
several classical combinatorial optimization problems such as TSP and the knapsack problem, they
show the effectiveness and generality of their architecture.
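A minimal sketch of such a policy-gradient (REINFORCE-style) update is given below, assuming the gradient of the sampled tour's log-probability with respect to the parameters is available; the baseline and the numbers are illustrative and do not reproduce Bello et al.'s exact training setup.

```python
import numpy as np

def reinforce_update(log_prob_grads, tour_length, baseline, theta, lr=1e-3):
    """Policy-gradient step: the gradient of the sampled tour's log-probability is
    scaled by the baseline-corrected tour length, so tours shorter than the
    baseline have their probability increased."""
    advantage = tour_length - baseline
    return theta - lr * advantage * log_prob_grads

# toy usage with made-up numbers: 5 parameters, one sampled tour
theta = np.zeros(5)
log_prob_grads = np.array([0.2, -0.1, 0.05, 0.0, 0.3])   # d log p(tour) / d theta
theta = reinforce_update(log_prob_grads, tour_length=7.4, baseline=8.0, theta=theta)
```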
On a related topic, Dai et al. [11] solve optimization problems over graphs using a graph embedding structure [10] and a deep Q-learning (DQN) algorithm [26]. Even though VRP can be represented
by a graph with weighted nodes and edges, their proposed approach does not directly apply, since in the VRP a particular node (e.g., the depot) might be visited multiple times.
Next, we introduce our model, which is a simplified version of the Pointer Network.
3 The Model
In this section, we formally define the problem and our proposed framework for a generic combinatorial optimization problem with a given set of inputs $X \doteq \{x^i,\ i = 1, \cdots, M\}$. We allow some of
the elements of each input to change between the decoding steps, which is, in fact, the case in many
problems such as the VRP. The dynamic elements might be an artifact of the decoding procedure
itself, or they can be imposed by the environment. For example, in the VRP, the remaining customer
demands change over time as the vehicle visits the customer nodes; or we might consider a variant
in which new customers arrive or adjust their demand values over time, independent of the vehicle
decisions. Formally, we represent each input $x^i$ by a sequence of tuples $\{x^i_t \doteq (s^i, d^i_t),\ t = 0, 1, \cdots\}$,
where $s^i$ and $d^i_t$ are the static and dynamic elements of the input, respectively, and can themselves be tuples. One can view $x^i_t$ as a vector of features that describes the state of input $i$ at time $t$. For
instance, in the VRP, $x^i_t$ gives a snapshot of customer $i$, where $s^i$ corresponds to the 2-dimensional coordinate of customer $i$'s location and $d^i_t$ is its demand at time $t$. We will denote the set of all input states at a fixed time $t$ with $X_t$.
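A minimal sketch of this input representation for a VRP-like instance is given below; the class name VRPInput and the numbers are illustrative and not part of the formulation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VRPInput:
    static: Tuple[float, float]   # s^i: fixed 2-D coordinates of customer i
    demand: float                 # d^i_t: remaining demand, may change over time

# X_t: the set of all input states at a fixed decoding time t
X_t: List[VRPInput] = [
    VRPInput(static=(0.2, 0.7), demand=4.0),
    VRPInput(static=(0.5, 0.1), demand=7.0),
    VRPInput(static=(0.0, 0.0), demand=0.0),   # e.g., the depot carries no demand
]

# serving part of customer 0's demand only touches its dynamic element;
# the static coordinates never change between decoding steps
X_t[0].demand = max(0.0, X_t[0].demand - 3.0)
```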
We start from an arbitrary input in $X_0$, where we use the pointer $y_0$ to refer to that input. At every decoding time $t$ ($t = 0, 1, \cdots$), $y_{t+1}$ points to one of the available inputs $X_t$, which determines
the input of the next decoder step; this process continues until a termination condition is satisfied.
The termination condition is problem-specific, showing that the generated sequence satisfies the