Combinatorial optimization problems (COPs) involve finding the “best” solution from a finite (but potentially enormous) set of candidate solutions. The traveling salesman problem, vehicle routing, and bin packing are real-world examples. As it turns out, these classes of problems form the core of the supply chain industry, where they can mean the difference between market leadership and obscurity for some of the world’s fastest-growing brands, and that makes them our business at Pando!

In this blog, we discuss one of these problems, the Capacitated Vehicle Routing Problem (CVRP), and our approach to solving it. Before we get into the details of solving it with Deep Reinforcement Learning (DRL), let us understand the problem and the motivation for solving it in the first place.

Capacitated Vehicle Routing Problem — CVRP

Any organization’s supply chain is linked together by upstream and downstream flows of its products from manufacturer to customer. With multiple flows from one point to another, an optimized way of shipping materials can significantly reduce overall cost, increase customer satisfaction, and drive ecologically sensitive decisions, all of which are board-level priorities today. Consider a Company A that needs to ship materials from a depot to multiple locations, each with a demand for specific materials. The company may be curious to answer the following questions:

  • What should we ship?
  • How should we ship? That is, what combination of vehicles should we choose to deliver the material?
  • Who should we ship with? Which transporter should we choose for each truck?
  • Which locations’ materials can be combined in the same truck?
The goal is to minimize the overall cost that the company ends up paying the transporters. This can be visualized in the diagram below.

Fig. 1 (source) — An instance of a VRP (left) and its solution (right)

From Fig. 1, we can see that this is a capacitated vehicle routing problem with multiple variables at play: material weight and volume, truck types, transporter costs for each truck type, and delivery locations. We need to select the best combination of each to find the minimum cost. As these are same-day deliveries, it is not practical to combine more than 2–3 customers in the same truck, due to the loading and unloading time for each customer.

Computational Complexity of CVRP

Every computational problem can be categorized by its inherent difficulty, as follows.
  • Complexity Class P: Problems that can be solved in polynomial time.
  • Complexity Class NP: Problems whose solutions can be verified in polynomial time, even though no polynomial-time algorithm for solving them is known.
  • Complexity Class NP-Hard: Problems at least as hard as every problem in NP; in general, neither solving them nor verifying a solution is known to be possible in polynomial time.
CVRP falls in the class of NP-Hard problems. These problems typically have exponential time complexity and may require exploring all possible permutations in the worst case. Heuristics and approximations are generally used to find near-optimal solutions. Each combinatorial optimization problem has its own heuristics, which need to be revised whenever the problem statement changes even slightly. This process can be long and arduous and may require domain experts to detect structure in the combinatorial search space of the specific problem. In contrast, machine learning methods like deep neural networks are applicable across many tasks and can discover their own heuristics, thus requiring less hand-engineering than solvers optimized for one particular task.

Why Deep Neural Networks?

Neural networks are a branch of artificial intelligence that act as function approximators for a wide range of problems. These networks are compositions of non-linear functions inside the hidden layers, which makes them powerful and general, as the solution to many problems can be approximated by a mathematical function. The harder the problem, the more complex the function. In our case, we use deep learning (deep neural networks) to learn an appropriate function for our specific combinatorial problem. As this is an NP-Hard problem, we believe it can be approximated by a complex function that may perform better than human-designed heuristics.

Since this is a sequential task, we use a sequence-to-sequence model variation where both the encoder and decoder are LSTM/GRU cells (Fig. 2). We treat CVRP as a multi-class classification problem, where the inputs are the features of the materials and the output is the vehicle each one is placed in. Each material is fed to the encoder one at a time, one per time step, and we get a corresponding output at that time step from the decoder. We use the Dynamic RNN module to train the model, as that allows us to train on a variable number of input materials.

Fig 2 (source)— Seq2Seq LSTM Model Architecture
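To make this concrete, here is a minimal sketch of such an encoder-decoder in PyTorch. The class name, feature count, hidden size, and vehicle count are illustrative placeholders rather than our production architecture, and variable-length batches would additionally need padding or packed sequences.

```python
import torch
import torch.nn as nn

class Seq2SeqVehicleAssigner(nn.Module):
    """Encoder-decoder over a variable-length sequence of materials.

    Hypothetical sketch: each decoder time step emits a distribution over
    candidate vehicles for the corresponding material. Sizes are placeholders.
    """

    def __init__(self, n_features=8, hidden=128, n_vehicles=100):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_vehicles)  # one logit per vehicle

    def forward(self, materials):
        # materials: (batch, seq_len, n_features), one row per material
        _, state = self.encoder(materials)       # summarize the whole shipment
        out, _ = self.decoder(materials, state)  # one hidden vector per material
        return self.head(out)                    # (batch, seq_len, n_vehicles)

# Usage: softmax over the last dimension gives per-material vehicle probabilities.
model = Seq2SeqVehicleAssigner()
logits = model(torch.randn(4, 20, 8))            # batch of 4 orders, 20 materials each
probs = logits.softmax(dim=-1)
```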

Supervised vs. Semi-Supervised Learning

But how does a deep learning network actually learn such a complex function? It is fed numerous examples with correct solutions, commonly known as “training data”. It takes the input (X1, X2, …, Xt), runs it through the model, and compares the generated output (Y1, Y2, …, Yt) with the optimal solution for that input.

Fig. 3 shows the input features that we feed to train our model.

Fig. 3 — Input features

Initially, our model behaves like a random function. During training, we update the weights of the model (via the backpropagation algorithm) so that it behaves more and more like a function that outputs the optimal solution. This process is repeated iteratively over different examples from the training data. The model needs to be fed as many examples as possible, since it learns a general approximate function from the feedback that the data provides.
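If we did have labelled data, one training iteration would look roughly like the sketch below, reusing the hypothetical Seq2SeqVehicleAssigner from earlier; the optimiser and learning rate are illustrative. Each step nudges the weights so the predicted vehicle assignments move closer to the optimal ones.

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

model = Seq2SeqVehicleAssigner()                 # hypothetical model from above
opt = optim.Adam(model.parameters(), lr=1e-3)

def supervised_step(materials, optimal_vehicles):
    # materials: (batch, seq_len, n_features); optimal_vehicles: (batch, seq_len)
    logits = model(materials)
    # Compare the generated assignments with the optimal solution.
    loss = F.cross_entropy(logits.flatten(0, 1), optimal_vehicles.flatten())
    opt.zero_grad()
    loss.backward()                              # backpropagation
    opt.step()                                   # move weights toward the labels
    return loss.item()
```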

In our case, the training data would be a large number of CVRP examples, on the order of millions, along with their optimal solutions. But finding an optimal solution for even a single large NP-Hard example can take days, if not months. Taking the supervised approach is therefore not feasible here. This is where Reinforcement Learning (RL) comes into the picture.

So what is Reinforcement Learning?

Reinforcement Learning (RL) is a semi-supervised machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its actions and experiences.

Both supervised and reinforcement learning map inputs to outputs. But unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and penalties as signals of positive and negative behaviour. This property of RL is what lets us avoid needing the optimal solution for every example during training.

CVRP Environment

We must convert the above problem into RL terminology:
  • State (s): At any time t, the state of every material (whether it is packed or not) and the state of every vehicle (which materials are packed in it).
  • Action (a): All the possible vehicles available for the next material.
  • Reward (r): Total cost of the fitment. Unlike standard RL problems, we need to minimise our reward. We define reward = ∑ (cost of each filled truck).
  • Episode (τ): One episode is a full sequence of M materials to be packed.
  • Policy network (π): A stochastic policy that outputs the probability of each action at every time step t.
  • Parameters (θ): The set of parameters that define the policy.
At every time step t, we put a material inside a vehicle until all materials are packed. The packing of each material at time t should follow an algorithm (a policy, in RL terminology). We need to find a policy that minimises the overall cost of the fitment: the optimal policy.
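The sketch below shows what a minimal version of such an environment could look like, tracking only material weights, a single truck capacity, and a flat per-truck cost. The real environment also tracks locations, truck types, and transporter contracts; all names here are illustrative.

```python
class CVRPEnv:
    """Minimal illustrative CVRP environment matching the terminology above."""

    def __init__(self, material_weights, truck_capacity, truck_cost):
        self.weights = material_weights     # one weight per material to pack
        self.capacity = truck_capacity
        self.truck_cost = truck_cost        # flat cost per truck used
        self.reset()

    def reset(self):
        self.t = 0                          # index of the next material to pack
        self.loads = {}                     # truck id -> packed weight so far
        return self._state()

    def _state(self):
        return {"next_material": self.t, "truck_loads": dict(self.loads)}

    def step(self, truck_id):
        # Action: place material t into truck_id; cost arrives at episode end.
        self.loads[truck_id] = self.loads.get(truck_id, 0.0) + self.weights[self.t]
        self.t += 1
        done = self.t == len(self.weights)
        cost = self.truck_cost * len(self.loads) if done else 0.0
        return self._state(), cost, done
```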

During training, the model explores different possibilities by taking different actions and observing the final reward (cost) those actions lead to. It then modifies its weights to reduce the probabilities of actions for which the reward is high and increase the probabilities of actions for which the reward is low. This is how we train our model without any labelled data.
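This is the REINFORCE policy-gradient idea. A minimal sketch, with a stand-in linear policy in place of the full seq2seq model and a user-supplied cost function, could look like this:

```python
import torch
import torch.nn as nn
import torch.optim as optim

n_trucks, n_features = 4, 8                      # illustrative sizes
policy = nn.Linear(n_features, n_trucks)         # stand-in for the seq2seq model
opt = optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_step(material_feats, episode_cost_fn, baseline=0.0):
    # material_feats: (seq_len, n_features); episode_cost_fn maps the full
    # assignment to the total cost of the fitment.
    logits = policy(material_feats)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                      # explore: sample a truck per material
    cost = episode_cost_fn(actions.tolist())
    # Weight log-probs by (cost - baseline): episodes costlier than the
    # baseline get pushed down, cheaper ones get pushed up.
    loss = (cost - baseline) * dist.log_prob(actions).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return cost
```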

Practical constraints to be addressed by the solution

Before we train our model, we need to handle cases where we might get invalid output. We have the following practical constraints:
  • A truck cannot carry more material than its specified capacity.
  • A truck cannot carry materials for more than 2 customers.
  • Some customers do not want their materials combined with any other customer’s.
  • Not all truck types are available for every customer location.
  • Some materials must be carried in a specific truck type only, e.g., refrigerated trucks.
It is common to add a penalty reward for an invalid solution and let the model learn what is allowed and what is not. In our case, if the model’s output violates any of our constraints, we set a very high cost for that episode.
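A sketch of such a validity check might look like the following, using a hypothetical per-truck record schema; any episode that violates a constraint simply gets a prohibitively high cost.

```python
INVALID_COST = 1e9   # dwarfs any real cost, so the policy learns to avoid it

def episode_cost_with_penalty(trucks, true_cost):
    # trucks: list of dicts with a hypothetical schema describing each used truck.
    for t in trucks:
        if t["load"] > t["capacity"]:                      # over capacity
            return INVALID_COST
        if len(t["customers"]) > 2:                        # max 2 customers per truck
            return INVALID_COST
        if t["has_exclusive_customer"] and len(t["customers"]) > 1:
            return INVALID_COST                            # no-combination customers
        if t["truck_type"] not in t["allowed_truck_types"]:
            return INVALID_COST                            # location/material limits
    return true_cost
```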

Training Process

Let us train the model and verify our approach. For ease of visualization, we have taken a small representation of the real data, described as follows:
  • 3 delivery locations
  • The total number of materials can vary from 10–30
  • If a truck visits more than 1 location, an extra drop-off charge of 700 is added to its cost (see the cost sketch below).
  • The following truck types, with their respective contractual costs from the depot to each location.
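As an illustration, the episode cost for this toy setup could be computed as below. We assume the 700 drop-off charge applies to each location beyond the first; the contractual cost values are placeholders, as the cost table itself is not reproduced here.

```python
DROP_OFF_CHARGE = 700   # from the setup above; assumed per additional location

def episode_cost(trucks):
    # trucks: list of dicts like {"contract_cost": 5000, "locations": {1, 3}};
    # contract_cost values are placeholders for the table in the original post.
    total = 0
    for t in trucks:
        total += t["contract_cost"]
        total += DROP_OFF_CHARGE * (len(t["locations"]) - 1)
    return total
```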

As we can see in Fig. 5, our model starts to converge after 4000 iterations.

Though we can simply sum the weights of all the materials in a truck to check the validity of a solution, it is not that straightforward with volume, as there may be wasted space during the actual fitment of the materials. To handle this, the model needs to take into account the dimensions of both materials and trucks and learn 3D fitment optimization, which is the subject of a future article. One simple workaround is to fill the volume capacity only up to a given percentage (say, 90%), though this adds approximation to the final results and can affect optimality.

Our approach of deep reinforcement learning with policy gradients is able to learn a very good function for CVRP. You may wonder why the optimal solution can’t be found by simple backtracking. For ease of visualization, we have compared results for small material and vehicle counts. In the real world, we generally have material counts in the hundreds, sometimes thousands, and vehicle counts in the range of 100–200, for which full backtracking is not feasible. To give an idea, the search space for just 100 materials and 100 trucks is 100¹⁰⁰ = 10²⁰⁰, since each material can independently go into any of the 100 trucks.

At PandoLabs, we are building cutting-edge machine learning algorithms to solve problems like route optimization, 3D bin packing, network optimization, demand forecasting, and ETA prediction. The intention of openly sharing our work is to benefit a larger ecosystem of customers, data scientists, engineers & product companies. If any of this work interests you, feel free to reach out to us.
