Combinatorial optimization problems(COP) are problems that
involve finding the “best” solution from a finite (but potentially
large) set of candidate solutions. Traveling salesman, vehicle
routing, bin packing are some of the realworld examples of
such problems. As it turns out, these classes of problems form
the core of the supply chain industry — and can mean the
difference between market leadership and obscurity for some of
the world’s fastestgrowing brands — and that makes it our
business at Pando!
In this blog, we are going to discuss one of these problems —
“Capacitated Vehicle Routing Problem(CVRP)” and our
approach to solving it. Before we get into the details of how we
can solve it by Deep Reinforcement Learning(DRL), let us
understand the problem and motivation behind the need to
solve it in the first place.
Capacitated Vehicle Routing Problem — CVRP
Any organization’s Supply Chain is directly linked by one or
more of the upstream and downstream flows of its products
from the manufacturer to the customer. As there are multiple
flows from one point to another, having an optimized way of
shipping the materials can significantly reduce the overall cost, increase customer satisfaction and drive ecologically sensitive
decisions — all of which are boardlevel priorities today. Let’s
consider Company A needs to ship some materials from a
depot to multiple locations. Each location has a demand for
specific materials that are needed to be shipped. Following are
the questions the company may be curious to answer :
 What should we ship?
 How should we ship? i.e., what combination of vehicles should we
choose to deliver material?

Who should we ship with? Which transporter shall we choose for
each truck?

Which location’s materials can be clubbed in the same truck?
The goal is to minimize the overall cost that the company would
end up paying to the transporters. This can be visualized in the
below diagram.
Fig. 1 (source) — An instance of a VRP (left) and its solution (right)
From Fig. 1, we can see that this is a capacitated vehicle
routing problem with multiple variables at play: material weight
and volume, truck types, transporters cost for each truck type, and delivery locations. We need to select the best combinations
of each to find the minimum cost. As these are sameday
deliveries, it is not practical to combine more than 2–3
customers in the same truck due to loading and unloading time
for each customer.
Computational Complexity of CVRP
Every computational problem can be categorized based on its
inherent difficulty, as mentioned below.
 Complexity Class P: Problems can be solved in polynomial time.
 Complexity Class NP: Problems that cannot be solved in
polynomial time, but once solved, its solution can be verified in
polynomial time.

Complexity Class NPHard: Problems that can neither be solved
nor its solution can be verified in polynomial time.
CVRP comes under the class of NPHard problems. These
problems are typically exponential in terms of time complexity
and may require exploring all possible permutations in the worst
case. Heuristics and Approximations are generally used to find
nearoptimal solutions. Each combinatorial optimization problem
has its own heuristics that need to be revised once the problem
statement changes slightly. This process can be long and
arduous and might require the work of domain experts to detect
some structure in the combinatorial search space of the specific
problem. In contrast, machine learning methods like deep
neural networks are applicable across many tasks and can
discover their own heuristics, thus requiring less hand engineering than solvers that are optimized for one particular
task only.
Why Deep Neural Networks?
Neural Network is a branch of Artificial Intelligence, which acts
as a function approximator to solve any kind of problem. These
functions are themselves a combination of nonlinear functions
inside the hidden layers. This makes it quite powerful and
general, as a solution to any problem can be approximated by a
mathematical function. The harder the problem, the more
complex will be the function. In our case, we try to use Deep
Learning (Deep Neural Networks) so that it can learn an
appropriate function to solve our specific combinatorial problem.
As this is an NPHard problem, we believe that it can be
approximated by a complex function that might perform better
than humandesigned heuristics.
This being a sequential task, we use a sequence to sequence
model variation where both encoder and decoder are
LSTM/GRU cells(Fig 2). We treat CVRP as a multiclass
classification model, where inputs are the features of the
materials and output is the vehicle it is placed in. Each material
is fed in one at a time for each time step to the encoder, and we
get an output corresponding to that time step in the
decoder.
We use the Dynamic RNN module to train the model
as that would allow us to train our model for a variable number
of input materials.
Fig 2 (source)— Seq2Seq LSTM Model Architecture
Supervised vs. SemiSupervised Learning
But how does a Deep Learning network really learn such a
complex function? It is fed with numerous examples with correct
solutions, also commonly known as “training data”. It takes the
input(X1, X2…Xt), and runs that through the model, and
compares the output generated(Y1, Y2…Yt) with the optimal
solution of the input.
Fig. 3 shows the input features that we feed to train our model.
Fig. 3 — Input features
Initially, our model behaves similarly to a random function.
During training, As we update the weights of the model (Back
propagation algorithm) in such a way so that it behaves closely
to a function that would output the optimal solution. This
process is done iteratively using different examples from the
training data. Our model needs to be fed with as many
examples as possible, as it takes feedback from the data it is
provided with, to learn a generic approximate function.
So in our case, training data would be a large number of CVRP
examples in the order of millions along with their optimal
solutions. But sometimes, finding an optimal solution for even a
single large NPHard example can take even days, if not
months. Therefore, taking the supervised approach, in this
case, is not feasible. Here is where Reinforcement Learning
(RL) comes into the picture.
So what is Reinforcement Learning?
Reinforcement Learning(RL) is a type of semisupervised
machine learning technique that enables an agent to learn in an
interactive environment by trial and error using feedback from
its actions and experiences.
Though both supervised and reinforcement learning use
mapping between input and output, unlike supervised learning,
where feedback provided to the agent is the correct set of
actions for performing a task, reinforcement learning uses
rewards and penalties as signals indicating positive and
negative behaviour, respectively. This property of RL is
extremely important to avoid having the optimal solution for
every example during training.
CVRP Environment
We must convert the above problem to RL terminology:
 State (s) : At any time t, the state of every material(whether it is
packed or not), and the state of every vehicle(which material is
packed in it).
 Action(a) : All the possible vehicles available for the next
material.

Reward(r) : Total cost of the fitment. We need to minimise our
reward, unlike standard RL problems. We define reward = ∑
(cost of each filled truck).
 Episode (τ): One episode is defined as a full sequence of “M”
materials to be packed.
 Policy Network(π) : A stochastic policy that outputs the
possibility of an action for every time step t.
 Parameters (θ) : θ is a set of parameters that define the policy.
At every time step t, we need to put a material inside a vehicle
until all materials are packed. The packing of every material at
time t should follow an algorithm (policy in RL terminology). We
need to find such a policy that would minimise the overall cost
of the fitment — optimal policy.
During training, the model explores different
possibilities by taking different actions and check
for the final reward(cost) led by those actions. It
then modifies its weights in such a way that will
reduce the probabilities of those actions for
which our reward is high and increase the
probabilities of those actions for which the
reward is low. This is how we train our model
without the use of any labelled data.
Practical constraints to be addressed by the solution
Before we train our model, we need to address the cases for
which we might get invalid output. We have the following
practical constraints :
 A truck cannot carry materials more than its specified capacity.
 A truck cannot have materials for more than 2 customers.

Some customers do not want their materials to be combined with
any other customer.

Not all truck types are available for every customer location.

Some materials require to be carried in a specific truck type only,
e.g., refrigerated trucks.
It is common to add a penalty reward for an invalid solution and
let the model learn what is allowed and what is not. In our case,
if the model’s output does not follow any of our constraints, we
set a very high cost for that episode.
Training Process
Let us train the model and verify our approach. For ease of
visualization, we have taken a small representation from the
real data, which is described as follows :
 3 delivery locations
 Total number of materials can vary from 10–30
 If the truck visits more than 1 location, an extra dropoff charge of
700 will be added to the cost.
 Following truck types with their respective contractual cost from
depot to each location.
As we can see in Fig. 5, our model starts to converge after 4000
iterations.
Though we can simply sum the weights of all the materials in a
truck and check the validity of the solution, it is not that
straightforward with volume as there might be space wastage
while doing the actual fitment of the materials. To achieve this,
the model needs to take into account the dimensions of both
materials and trucks and learn the 3D fitment optimization —
which is the subject of our future article. One simple
workaround can be to fit the volume capacity until a given
percentage(let say 90%) only, though this could add
approximations in the final results and possibly affect the
optimality.
Summary
Our approach of Deep Reinforcement Learning with Policy
Gradient is able to learn a pretty good function for CVRP. You
may wonder why the optimal solution can’t be found by simple
backtracking. For ease of visualization, we have compared the
results for a lower number of material and vehicle counts. In the
real world, we generally have the material count in the order of
hundreds, sometimes thousands, and vehicle count in the range
of 100–200, for which full backtracking is not feasible. To give
an idea, the search space for just 100 materials and 100 trucks
is 10 ²⁰⁰.
At PandoLabs, we are building cuttingedge machine
learning algorithms to solve different problems like route
optimization, 3D bin packing, Network optimization,
demand forecasting, and ETA prediction. The intention of
openly sharing our work is to benefit a larger ecosystem of
customers, data scientists, engineers & product
companies. If any of this work interests you, feel free to
reach out to us.