Predictive Sales Analytics: Use Machine Learning to Predict and Optimize Product Backorders
Written by Matt Dancho on October 16, 2017
Sales, customer service, supply chain and logistics, manufacturing… no matter which department you’re in, you more than likely care about backorders. Backorders are products that are temporarily out of stock but that a customer is permitted to order against future inventory. Backorders are both good and bad: strong demand can drive them, but so can suboptimal planning. The problem is that when a product is not immediately available, customers may not have the luxury or patience to wait. This translates into lost sales and low customer satisfaction. The good news is that machine learning (ML) can be used to identify products at risk of backorder. In this article we use the new H2O automated ML algorithm to produce Kaggle-quality predictions on the Kaggle dataset, “Can You Predict Product Backorders?”. This is an advanced tutorial, which can be difficult for learners. We have good news: see our announcement below if you are interested in a machine learning course from Business Science. If you love this tutorial, please connect with us on social media to stay up on the latest Business Science news, events and information! Good luck and happy analyzing!
Challenges and Benefits
This tutorial covers two challenges. The first is dataset imbalance, in which the majority class significantly outweighs the minority class. We implement a special technique for dealing with unbalanced data sets called SMOTE (synthetic minority over-sampling technique), which improves modeling accuracy and efficiency (win-win). The second challenge is optimizing for the business case. To do so, we explore cutoff (threshold) optimization, which can be used to find the cutoff that maximizes expected profit.
One of the important visualizations (benefits to your understanding) is the effect of precision and recall on inventory strategy. A lever the analyst controls is the cutoff (threshold). The threshold has an effect on precision and recall. By selecting the optimal threshold we can maximize expected profit.
This is a longer analysis, and certain sections may be more interesting based on your role. Are you an Executive or a Data Scientist?
- Executive: Read these:
- Data Scientist: Read these:
Others should just read the whole thing. ;)
Interested in Learning Data Science For Business?
Sign up for our free "5 Topic Friday" Newsletter. Every week, I'll send you the five coolest topics in data science for business that I've found that week. These could be new R packages, free books, or just some fun to end the week on.
What is a backorder?
According to Investopedia, a backorder is…
A customer order that has not been fulfilled. A backorder generally indicates that customer demand for a product or service exceeds a company’s capacity to supply it. Total backorders, also known as backlog, may be expressed in terms of units or dollar amount.
Why do we care?
A backorder can be both a good thing and a bad thing. Consider a company like Apple that recently rolled out several new iPhone models including iPhone 8 and iPhone X. Because the initial supply cannot possibly keep up with the magnitude of demand, Apple is immediately in a “backorder” situation that requires customer service management.
On the positive side, backorders indicate healthy demand for a product. For Apple, this is a credit to strong brand recognition. Customers are likely to wait several months to get the latest model because of the new and innovative technology and incredible branding. Conversely, Apple’s competitors don’t have this luxury. If new models cannot be provided immediately, their customers cancel orders and go elsewhere.
Companies constantly strive for a balance in managing backorders. It’s a fine line: Too much supply increases inventory costs while too little supply increases the risk that customers may cancel orders. Why not just keep 100,000 of everything on the shelves at all time? For most retailers and manufacturers, this strategy will drive inventory costs through the roof considering they likely have a large number of SKUs (unique product IDs).
A predictive analytics program can identify which products are most likely to experience backorders, giving the organization information and time to adjust. Machine learning can identify patterns related to backorders before customers order. Production can then adjust to minimize delays, while customer service can provide accurate dates to keep customers informed and happy. The predictive analytics approach enables the maximum amount of product to get into the hands of customers at the lowest cost to the organization. The result is a win-win: organizations increase sales while customers get to enjoy the products they demand.
One of the first steps in developing a backorder strategy is to understand which products are at risk of being backordered. Sounds easy right? Wrong. A number of factors can make this a challenging business problem.
Demand for existing products is constantly changing, and it often depends on non-business data (holidays, weather, and other events). Consider demand for snow blowers at Home Depot. Geographically speaking, in the Northern USA demand rises in the winter and falls in the spring. However, this demand is highly dependent on the level of snowfall. If snowfall is minimal, demand will be low regardless of it being winter. Therefore, Home Depot needs to know the likelihood of snowfall to best predict demand surges and shortfalls in order to optimize inventory.
Small Numbers (Anomalies)
If backorders are very infrequent but highly important, it can be very difficult to predict the minority class accurately because of the imbalance between backorders and non-backorders within the data set. Accuracy for the model will look great, but the actual predictive quality may be very poor. Consider the NULL error rate, or the rate of just picking the majority class: just picking “non-backorder” may be as accurate as, or more accurate than, the model. Special strategies exist for dealing with unbalanced data, and we’ll implement SMOTE.
Often this problem is viewed as a cross-sectional problem rather than as a temporal (or time-based) problem. The problem is that time can play a critical role. Just because a product sold out this time last year, will it sell out again this year? Maybe. Maybe not. Time-based demand forecasting may be necessary to augment cross-sectional binary classification.
A hypothetical manufacturer has a data set that identifies whether or not a backorder has occurred. The challenge is to accurately predict future backorder risk using predictive analytics and machine learning and then to identify the optimal strategy for inventorying products with high backorder risk.
The data comes from Kaggle’s Can You Predict Product Backorders? dataset. If you have a Kaggle account, you can download the data, which includes both a training and a test set. We’ll use the training set for developing our model and the test set for determining the final accuracy of the best model.
The data file contains the historical data for the 8 weeks prior to the week we are trying to predict. The data were taken as weekly snapshots at the start of each week. The target (or response) is the went_on_backorder variable. To model and predict the target, we’ll use the other features, which include:
- sku - Random ID for the product
- national_inv - Current inventory level for the part
- lead_time - Transit time for product (if available)
- in_transit_qty - Amount of product in transit from source
- forecast_3_month - Forecast sales for the next 3 months
- forecast_6_month - Forecast sales for the next 6 months
- forecast_9_month - Forecast sales for the next 9 months
- sales_1_month - Sales quantity for the prior 1 month time period
- sales_3_month - Sales quantity for the prior 3 month time period
- sales_6_month - Sales quantity for the prior 6 month time period
- sales_9_month - Sales quantity for the prior 9 month time period
- min_bank - Minimum recommend amount to stock
- potential_issue - Source issue for part identified
- pieces_past_due - Parts overdue from source
- perf_6_month_avg - Source performance for prior 6 month period
- perf_12_month_avg - Source performance for prior 12 month period
- local_bo_qty - Amount of stock orders overdue
- deck_risk - Part risk flag
- oe_constraint - Part risk flag
- ppap_risk - Part risk flag
- stop_auto_buy - Part risk flag
- rev_stop - Part risk flag
- went_on_backorder - Product actually went on backorder. This is the target value.
We’ll use the following libraries:
- tidyquant: Used to quickly load the “tidyverse” (ggplot2, etc.) along with custom, business-report-friendly ggplot2 themes. Also great for time series analysis (not featured).
- unbalanced: Contains various methods for working with unbalanced data. We’ll use the ubSMOTE() function.
- h2o: Go-to package for implementing professional-grade machine learning.
Note on H2O: You’ll want to install the latest stable release of H2O for R from H2O.ai. If you have issues in this post, you probably did not follow these steps:
- Go to H2O.ai’s download page
- Under H2O, select “Latest Stable Release”
- Click on the “Install in R” tab
- Follow instructions exactly.
Load the libraries.
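As a minimal sketch (package names as listed above):

```r
# Load the core packages used throughout the analysis
library(tidyquant)   # loads the tidyverse and business-friendly ggplot2 themes
library(unbalanced)  # ubSMOTE() for balancing the training data
library(h2o)         # automated machine learning
```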
We use read_csv() to load the training and test data.
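A sketch of the loading step; the file names are assumptions based on the Kaggle download, and test_raw_df is an assumed object name (train_raw_df is used later in this post):

```r
# File names are assumptions -- adjust to match your Kaggle download
train_raw_df <- read_csv("Kaggle_Training_Dataset_v2.csv")
test_raw_df  <- read_csv("Kaggle_Test_Dataset_v2.csv")
```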
Inspecting the head and tail, we can get an idea of the data set. There are some sporadic NA values in the “lead_time” column along with -99 values in the two supplier performance columns. The tail shows another issue: the last row contains missing values and will need to be handled in pre-processing.
If we take a look at the distribution of the target, train_raw_df$went_on_backorder, we see that the data set is severely imbalanced. We’ll need a strategy to balance the data set if we want to get maximum model performance and efficiency.
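One way to check this (a sketch using dplyr):

```r
# Class balance of the target: the "Yes" class is a tiny fraction of the data
train_raw_df %>%
    count(went_on_backorder) %>%
    mutate(pct = n / sum(n))
```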
We can also inspect missing values. Because both the train and test sets have missing values, we’ll need to come up with a strategy to deal with them.
The test set has a similar distribution.
Finally, it’s also worth taking a glimpse of the data. We can quickly see:
- We have some character columns with Yes/No values.
- The “perf_” columns have -99 values, which reflect missing data.
We have a training and a test set, but we’ll need a validation set to assist with modeling. We can perform a random 85/15 split of the training data.
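A simple way to do the split (a sketch; the seed value and the train_df / valid_df names are assumptions):

```r
# Random 85/15 split of the raw training data
set.seed(123)  # for reproducibility (value is arbitrary)
idx      <- sample.int(nrow(train_raw_df), size = floor(0.85 * nrow(train_raw_df)))
train_df <- train_raw_df[idx, ]
valid_df <- train_raw_df[-idx, ]
```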
Pre-processing includes a number of steps to get the raw data ready for modeling. We can make a pre-process function that drops unnecessary columns, deals with NA values, converts Yes/No data to 1/0, and converts the target to a factor. These are all standard pre-processing steps that will need to be applied to each of the data sets.
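A sketch of what such a function might look like (the exact steps in the original analysis may differ; column names follow the data dictionary above):

```r
# Pre-process: drop the ID column, handle NAs, recode Yes/No, factor the target
preprocess_raw_data <- function(data) {
    data %>%
        select(-sku) %>%                                            # drop the random product ID
        drop_na(lead_time) %>%                                      # simple NA handling: drop rows
        mutate_if(is.character, ~ if_else(. == "Yes", 1, 0)) %>%    # Yes/No -> 1/0
        mutate(went_on_backorder = as.factor(went_on_backorder))    # target as factor
}

train_df <- preprocess_raw_data(train_df)
valid_df <- preprocess_raw_data(valid_df)
test_df  <- preprocess_raw_data(test_raw_df)
```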
A step may be needed to transform (normalize, center, scale, apply PCA, etc.) the data, especially if using deep learning as a classification method. In our case, overall performance actually decreased when transformation was performed, so we skip this step. Interested readers may wish to explore the recipes package for creating and applying ML pre-processing recipes, which has great documentation.
Balance Dataset with SMOTE
As we saw previously, the data is severely unbalanced, with only 0.7% of cases being positive (went_on_backorder = Yes). To deal with this class imbalance, we’ll implement a technique called SMOTE (synthetic minority over-sampling technique), which oversamples the minority class by generating synthetic minority examples in the neighborhood of observed ones. In other words, it shrinks the prevalence of the majority class (under-sampling) while simultaneously synthetically increasing the prevalence of the minority class using a k-nearest neighbors approach (over-sampling via knn). The great thing is that SMOTE can improve classifier performance (due to better classifier balance) and improve efficiency (due to a smaller but focused training set) at the same time (win-win)!
The ubSMOTE() function from the unbalanced package implements SMOTE (along with a number of other sampling methods such as over-sampling, under-sampling, and several advanced methods). We’ll set up the following arguments, then make the call shown in the sketch after this list:
perc.over = 200: This is the percent of new instances generated for each rare instance. For example, if there were only 5 rare cases, 200% over-sampling would synthetically generate an additional 10 rare cases.
perc.under = 200: The percentage of majority-class cases selected for each SMOTE observation. If 10 additional observations were created through the SMOTE process, 20 majority cases would be sampled.
k = 5: The number of nearest neighbors to select when synthetically generating new observations.
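Here’s roughly what the call looks like (a sketch; the input/output object names are assumptions, and the target is assumed to be a 0/1 factor with 1 = backorder):

```r
# Split predictors and target, then apply SMOTE
input  <- train_df %>% select(-went_on_backorder)
output <- train_df$went_on_backorder

train_smote <- ubSMOTE(
    X = as.data.frame(input),
    Y = output,
    perc.over  = 200,   # 200% synthetic minority cases
    perc.under = 200,   # 200% majority cases sampled per synthetic case
    k          = 5      # nearest neighbors used to synthesize
)
```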
We need to recombine the results into a tibble. Notice there are now only 68K observations versus 1.4M previously. This is a good indication that SMOTE worked correctly. As an added benefit, the training set size has shrunk, which will make model training significantly faster.
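A sketch of the recombination (train_df_smote is an assumed name):

```r
# ubSMOTE() returns a list with predictors ($X) and the target ($Y)
train_df_smote <- bind_cols(
    as.tibble(train_smote$X),
    tibble(went_on_backorder = train_smote$Y)
)

nrow(train_df_smote)   # roughly 68K rows versus 1.4M before SMOTE
```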
We can also check the new balance of Yes/No: It’s now 43% Yes to 57% No, which in theory should enable the classifier to better detect relationships with the positive (Yes) class.
Modeling with h2o
Now we are ready to model. Let’s fire up h2o. The h2o package works with H2OFrames (instead of data frames), so let’s convert.
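A sketch of the conversion (object names follow the earlier sketches):

```r
# Start a local H2O cluster and convert the data frames to H2OFrames
h2o.init()
h2o.no_progress()   # optional: suppress progress bars

train_h2o <- as.h2o(train_df_smote)
valid_h2o <- as.h2o(valid_df)
test_h2o  <- as.h2o(test_df)
```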
We use the h2o.automl() function to automatically try a wide range of classification models. The most important arguments are listed below, followed by a sketch of the call:
training_frame: Supply our training data set to build models.
validation_frame: Supply our validation data set to prevent models from overfitting
leaderboard_frame: Supply our test data set, and models will be ranked based on performance on the test set.
max_runtime_secs: Controls the amount of runtime for each model. Setting to 45 seconds will keep things moving but give each model a fair shot. If you want even higher accuracy you can try increasing the run time. Just be aware that there tends to be diminishing returns.
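A sketch of the call using the arguments above (the automl_models_h2o name is an assumption):

```r
# Define the response and predictors, then run automated machine learning
y <- "went_on_backorder"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 45
)
```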
Extract our leader model.
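H2O AutoML stores the top-ranked model in the leader slot:

```r
automl_leader <- automl_models_h2o@leader
```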
We use h2o.predict() to make our predictions on the test set. The data is still stored as an h2o object, but we can easily convert it to a data frame with as.tibble(). Note that the left column (“predict”) is the class prediction, and columns “p0” and “p1” are the class probabilities.
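A sketch of the prediction step (pred_df is an assumed name):

```r
# Predict on the test set and convert the H2O output to a tibble
pred_df <- h2o.predict(automl_leader, newdata = test_h2o) %>%
    as.tibble()   # columns: predict, p0, p1
```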
We use h2o.performance() to get performance-related information by passing the function the leader model and the test set, in the same fashion as we did for the prediction. We’ll use the performance output to visualize ROC, AUC, precision and recall.
From the performance output we can get a number of key classifier metrics by threshold using h2o.metric().
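A sketch of both steps (perf_h2o and metrics_tbl are assumed names):

```r
# Performance of the leader model on the test set
perf_h2o <- h2o.performance(automl_leader, newdata = test_h2o)

# Classifier metrics (precision, recall, tpr, fpr, ...) for every threshold
metrics_tbl <- h2o.metric(perf_h2o) %>%
    as.tibble()
```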
The Receiver Operating Characteristic (ROC) curve is a graphical method that pits the true positive rate (y-axis) against the false positive rate (x-axis). The benefit to the ROC curve is two-fold:
- We can visualize how the binary classification model compares to randomly guessing
- We can calculate AUC (Area Under the Curve), which is a method to compare models (perfect classification = 1).
Let’s review the ROC curve for our model using h2o. The red dotted line is what you could theoretically achieve by randomly guessing.
We can also get the AUC of the test set using h2o.auc(). To give you an idea, the best Kaggle data scientists are getting AUC = 0.95. At AUC = 0.92, our automated machine learning model is in the same ball park as the Kaggle competitors, which is quite impressive considering the minimal effort needed to get to this point.
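For reference, the AUC can be pulled straight from the performance object:

```r
h2o.auc(perf_h2o)   # approximately 0.92 in the article's run
```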
The Cutoff (Threshold)
The cutoff (also known as threshold) is the value that divides the predictions. If we recall, the prediction output has three columns: “predict”, “p0”, and “p1”. The first column, “predict” (model predictions), uses an automatically determined threshold value as the cutoff. Everything above the cutoff is classified as “Yes” and everything below as “No”.
How is the Cutoff Determined?
The algorithm uses the threshold at the best F1 score to determine the optimal value for the cutoff.
We could use a different threshold if we like, which can change depending on your goal (e.g. maximize recall, maximize precision, etc.). For a full list of thresholds, we can inspect the “metrics” slot inside the performance object.
Is cutoff at max F1 what we want? Probably not as we’ll see next.
In the business context we need to decide what cutoff (threshold of probability to assign yes/no) to use: is the “p1” cutoff >= 0.63 (the probability above which we predict “Yes”) adequate? The answer lies in the balance between the cost of inventorying incorrect product (low precision) and the cost of the lost customer (low recall):
Precision: When the model predicts yes, how often is the actual value yes? If we implement a high precision (low recall) strategy, we accept the tradeoff that the model will misclassify some actual yes values in order to decrease the number of incorrectly predicted yes values.
Recall: When the actual value is yes, how often does the model predict yes? If we implement a high recall (low precision) strategy, we accept the tradeoff that the model will incorrectly predict more no values as yes values.
Effect of Precision and Recall on Business Strategy
By shifting the cutoff, we can control precision and recall, and this has a major effect on the business strategy. As the cutoff increases, the model becomes more conservative, being very picky about what it classifies as went_on_backorder = Yes.
Expected Value Framework
We need another function to help us determine which values of cutoff to use, and this comes from the Expected Value Framework described in Data Science for Business.
Source: Data Science for Business
To implement this framework we need two things:
- Expected Rates (matrix of probabilities): Needed for each threshold
- Cost/Benefit Information: Needed for each observation
We have the class probabilities and rates from the confusion matrix, and we can retrieve these using the h2o.confusionMatrix() function.
The problems are:
- The confusion matrix is for just one specific cutoff. We want to evaluate for all of the potential cutoffs to determine which strategy is best.
- We need to convert to expected rates, as shown in the Figure 7-2 diagram.
Luckily, we can retrieve the rates by cutoff conveniently using h2o.metric().
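A sketch of building an expected_rates table from the metrics we retrieved earlier:

```r
# Keep only the rates needed for the expected value calculation, by threshold
expected_rates <- metrics_tbl %>%
    select(threshold, tpr, fpr)
```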
The cost-benefit matrix is a business assessment of the cost and benefit for each of four potential outcomes:
True Negative (benefit): This is the benefit from a SKU that is correctly predicted as not backordered. The benefit is zero because a customer simply does not buy this item (lack of demand).
True Positive (benefit): This is the benefit from a SKU that is correctly predicted as backorder and the customer needs this now. The benefit is the profit from revenue. For example, if the item price is $1000 and the cost is $600, the unit profit is $400.
False Positive (cost): This is the cost associated with incorrectly classifying a SKU as backordered when demand is not present. This is the warehousing and inventory-related cost: for example, $10/unit of inventory expense that would not otherwise have occurred.
False Negative (cost): This is the cost associated with incorrectly classifying a SKU as not backordered when demand is present. In our case, no money was spent and nothing was gained.
The cost-benefit information is needed for each decision pair. If we take one SKU, say the first prediction from the test set, we have an item that was predicted NOT to be backordered (went_on_backorder == "No"). If, hypothetically, the value for True Positive (benefit) is $400/unit in profit from correctly predicting a backorder, and the False Positive (cost) of accidentally inventorying an item that was not backordered is $10/unit, then a data frame can be structured like so.
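For example, a hypothetical cost/benefit tibble for one SKU might look like this (values per the example above; the object name is an assumption):

```r
# Hypothetical cost/benefit values for a single SKU
first_item_cb <- tibble(
    cb_tp = 400,   # benefit: $400/unit profit for a correctly predicted backorder
    cb_fp = -10    # cost: $10/unit inventory expense for a false positive
)
```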
Optimization: Single Item
The expected value equation generalizes to

$EV = \sum_i p_i \cdot v_i$

where:
- $p_i$ is the probability of the outcome for one observation
- $v_i$ is the value of the outcome for one observation
The general form isn’t very useful, but from it we can create an Expected Profit equation using a basic rule of probability p(x,y) = p(y)*p(x|y) that combines both the Expected Rates (2x2 matrix of probabilities after normalization of Confusion Matrix) and a Cost/Benefit Matrix (2x2 matrix with expected costs and benefits). Refer to Data Science for Business for the proof.
- p(p) is the confusion matrix positive class prior (probability of actual yes / total)
- p(n) is the confusion matrix negative class prior (probability of actual no / total = 1 - positive class prior)
- p(Y|p) is the True Positive Rate (tpr)
- p(N|p) is the False Negative Rate (fnr)
- p(N|n) is the True Negative Rate (tnr)
- p(Y|n) is the False Positive Rate (fpr)
- b(Y,p) is the benefit from a true positive (cb_tp)
- c(N,p) is the cost from a false negative (cb_fn)
- b(N,n) is the benefit from a true negative (cb_tn)
- c(Y,n) is the cost from a false positive (cb_fp)
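Putting the definitions together, the expected profit equation from Data Science for Business can be written as follows (reconstructed here from the symbols above, not quoted from the book):

$$
E[\text{Profit}] = p(\mathbf{p}) \big[\, p(\mathbf{Y}|\mathbf{p})\, b(\mathbf{Y},\mathbf{p}) + p(\mathbf{N}|\mathbf{p})\, c(\mathbf{N},\mathbf{p}) \,\big] + p(\mathbf{n}) \big[\, p(\mathbf{N}|\mathbf{n})\, b(\mathbf{N},\mathbf{n}) + p(\mathbf{Y}|\mathbf{n})\, c(\mathbf{Y},\mathbf{n}) \,\big]
$$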
This equation simplifies even further since, in our case, cb_tn and cb_fn are both zero.
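With those two terms removed, the equation reduces to (again a reconstruction, using the rate and cost/benefit names from above):

$$
E[\text{Profit}] = p(\mathbf{p}) \cdot tpr \cdot cb_{tp} + p(\mathbf{n}) \cdot fpr \cdot cb_{fp}
$$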
We create a function, calc_expected_profit(), to calculate the expected profit using the probability of a positive case (the positive prior, p1), the cost/benefit of a true positive (cb_tp), and the cost/benefit of a false positive (cb_fp). We’ll take advantage of the expected_rates data frame we previously created, which contains the true positive rate and false positive rate for each threshold (400 thresholds in the range of 0 and 1).
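A sketch of such a function (the exact implementation in the original analysis may differ; expected_rates is assumed to contain threshold, tpr and fpr as built above):

```r
# Expected profit per unit at every threshold, given a positive prior and costs/benefits
calc_expected_profit <- function(p1, cb_tp, cb_fp) {
    expected_rates %>%
        mutate(
            expected_profit = p1 * (tpr * cb_tp) + (1 - p1) * (fpr * cb_fp)
        ) %>%
        select(threshold, expected_profit)
}
```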
We can test the function for a hypothetical prediction that is unlikely to have a backorder (the most common class). Setting p1 = 0.01 indicates the hypothetical SKU has a low probability of backorder. Continuing with the example case of $400/unit profit and $10/unit inventory cost, we can see that the optimal threshold is approximately 0.4. Note that an inventory-all-items strategy (threshold = 0) would cause the company to lose money on low-probability-of-backorder items (-$6/unit), and an inventory-nothing strategy would result in no benefit but no loss ($0/unit).
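For example:

```r
# Hypothetical low-probability item: $400/unit profit, $10/unit inventory cost
calc_expected_profit(p1 = 0.01, cb_tp = 400, cb_fp = -10)
```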
Conversely if we investigate a hypothetical item with high probability of backorder, we can see that it’s much more advantageous to have a loose strategy with respect to inventory conservatism. Notice the profit per-unit is 80% of the theoretical maximum profit (80% of $400/unit = $320/unit if “p1” = 0.8). The profit decreases to zero as the inventory strategy becomes more conservative.
Let’s take a minute to digest what’s going on in both the high and low expected profit curves. Units with low probability of backorder (the majority class) will tend to increase the cutoff while units with high probability will tend to lower the cutoff. It’s this balance or tradeoff that we need to scale to understand the full picture.
Now for the fun part: scaling the optimization to multiple products. Let’s analyze a simplified case: 10 items with varying backorder probabilities, benefits, costs, and safety stock levels. In addition, we’ll include a backorder purchase quantity with logic of 100% safety stock (meaning items believed to be backordered will have an additional quantity purchased equal to the safety stock level). We’ll investigate the optimal stocking level for this subset of items to illustrate scaling the analysis to find the globally optimized cutoff (threshold).
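A sketch of such a hypothetical portfolio (all values below are illustrative assumptions, not data from the Kaggle set):

```r
# Ten hypothetical items with varying backorder probability, profit, cost and safety stock
set.seed(100)
ten_items_tbl <- tibble(
    item         = 1:10,
    p1           = c(0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 0.90),
    cb_tp        = round(runif(10, 300, 500)),   # unit profit if a backorder is correctly predicted
    cb_fp        = -round(runif(10, 5, 25)),     # unit inventory cost of a false positive
    safety_stock = sample(5:50, 10)              # backorder-prevention purchase quantity
)
```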
We use purrr to map calc_expected_profit() to each item, thus returning a data frame of expected profits per unit by threshold value. We use unnest() to expand the data frame to one level, enabling us to work with the expected profits and quantities. We then “extend” (multiply the unit expected profit by the backorder-prevention purchase quantity, which is 100% of the safety stock level per our logic) to get the total expected profit per unit.
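A sketch of the mapping and extension step, using the hypothetical ten_items_tbl above (object and column names are assumptions):

```r
# Map the profit function over each item, unnest, then extend by purchase quantity
extended_expected_profit_tbl <- ten_items_tbl %>%
    mutate(expected_profit = pmap(list(p1, cb_tp, cb_fp), calc_expected_profit)) %>%
    unnest(expected_profit) %>%
    mutate(total_expected_profit = expected_profit * safety_stock)
```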
We can visualize the expected profit curves for each item extended for backorder-prevention quantity to be purchased and sold (note that selling 100% is a simplifying assumption).
Finally, we can aggregate the expected profit using a few dplyr operations. Visualizing the final curve exposes the optimal threshold.
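A sketch of the aggregation (names are assumptions):

```r
# Total expected profit across all ten items, by threshold
total_expected_profit_tbl <- extended_expected_profit_tbl %>%
    group_by(threshold) %>%
    summarise(total_expected_profit = sum(total_expected_profit)) %>%
    ungroup()

# The threshold that maximizes aggregate expected profit
total_expected_profit_tbl %>%
    filter(total_expected_profit == max(total_expected_profit))
```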
This was a very technical and detailed post, and if you made it through congratulations! We covered automated machine learning with H2O, an efficient and high accuracy tool for prediction. We worked with an extremely unbalanced data set, showing how to use SMOTE to synthetically improve dataset balance and ultimately model performance. We spent a considerable amount of effort optimizing the cutoff (threshold) selection to maximize expected profit, which ultimately matters most to the bottom line. Hopefully you can see how data science and machine learning can be very beneficial to the business, enabling better decisions and ROI.
Enjoy data science for business? We do too. This is why we created Business Science University, where we teach you how to do Data Science For Business (#DS4B) just like us!
Our first DS4B course (HR 201) is now available!
Who is this course for?
Anyone who is interested in applying data science in a business context (we call this DS4B). All you need is basic ggplot2 experience. If you understood this article, you are qualified.
What do you get out of it?
You learn everything you need to know about how to apply data science in a business context:
Using ROI-driven data science taught from consulting experience!
Solve high-impact problems (e.g. $15M Employee Attrition Problem)
Use advanced, bleeding-edge machine learning algorithms (e.g. H2O, LIME)
Apply systematic data science frameworks (e.g. Business Science Problem Framework)
“If you’ve been looking for a program like this, I’m happy to say it’s finally here! This is what I needed when I first began data science years ago. It’s why I created Business Science University.”
Matt Dancho, Founder of Business Science
Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.
Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301
Our first Data Science For Business (HR 201) Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:
- HR 201: Predicting Employee Attrition with H2O and LIME
- HR 301: Building A Shiny Web Application
- HR 302: Data Story Telling With RMarkdown Reports and Presentations
- HR 303: Building An R Package For Your Organization
The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.
Interested? Enroll in Business Science University today!
Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!