Predictive Sales Analytics: Use Machine Learning to Predict and Optimize Product Backorders
Written by Matt Dancho
Sales, customer service, supply chain and logistics, manufacturingâ€¦ no matter which department youâ€™re in, you more than likely care about backorders. Backorders are products that are temporarily out of stock, but a customer is permitted to place an order against future inventory. Back orders are both good and bad: Strong demand can drive back orders, but so can suboptimal planning. The problem is when a product is not immediately available, customers may not have the luxury or patience to wait. This translates into lost sales and low customer satisfaction. The good news is that machine learning (ML) can be used to identify products at risk of backorders. In this article we use the new H2O automated ML algorithm to implement Kagglequality predictions on the Kaggle dataset, â€śCan You Predict Product Backorders?â€ť. This is an advanced tutorial, which can be difficult for learners. We have good news, see our announcement below if you are interested in a machine learning course from Business Science. If you love this tutorial, please connect with us on social media to stay up on the latest Business Science news, events and information! Good luck and happy analyzing!
Challenges and Benefits
This tutorial covers two challenges. One challenge with this problem is dataset imbalance, when the majority class significantly outweighs the minority class. We implement a special technique for dealing with unbalanced data sets called SMOTE (synthetic minority oversampling technique) that improves modeling accuracy and efficiency (winwin). The second challenge is optimizing for the business case. To do so, we explore cutoff (threshold) optimization which can be used to find the cutoff that maximizes expected profit.
One of the important visualizations (benefits to your understanding) is the effect of precision and recall on inventory strategy. A lever the analyst controls is the cutoff (threshold). The threshold has an effect on precision and recall. By selecting the optimal threshold we can maximize expected profit.
Quick Navigation
This is a longer analysis, and certain sections may be more interesting based on your role. Are you an Executive or a Data Scientist?
 Executive: Read these:
 Data Scientist: Read these:
Others should just read the whole thing. ;)
Background on Backorder Prediction
What is a backorder?
According to Investopedia, a backorder isâ€¦
A customer order that has not been fulfilled. A backorder generally indicates that customer demand for a product or service exceeds a companyâ€™s capacity to supply it. Total backorders, also known as backlog, may be expressed in terms of units or dollar amount.
Why do we care?
Source: www.trustedreviews.com
A backorder can be both a good thing and a bad thing. Consider a company like Apple that recently rolled out several new iPhone models including iPhone 8 and iPhone X. Because the initial supply cannot possibly keep up with the magnitude of demand, Apple is immediately in a â€śbackorderâ€ť situation that requires customer service management.
On the positive side, backorders indicate healthy demand for a product. For Apple, this is a credit to strong brand recognition. Customers are likely to wait several months to get the latest model because of the new and innovative technology and incredible branding. Conversely, Appleâ€™s competitors donâ€™t have this luxury. If new models cannot be provided immediately, their customers cancel orders and go elsewhere.
Companies constantly strive for a balance in managing backorders. Itâ€™s a fine line: Too much supply increases inventory costs while too little supply increases the risk that customers may cancel orders. Why not just keep 100,000 of everything on the shelves at all time? For most retailers and manufacturers, this strategy will drive inventory costs through the roof considering they likely have a large number of SKUs (unique product IDs).
A predictive analytics program can identify which products are most likely to experience backorders giving the organization information and time to adjust. Machine learning can identify patterns related to backorders before customers order. Production can then adjust to minimize delays while customer service can provide accurate dates to keep customers informed and happy. The predictive analytics approach enables the maximum product to get in the hands of customers at the lowest cost to the organization. The result is a winwin: organizations increase sales while customers get to enjoy the products they demand.
Challenges with Predicting Backorders
One of the first steps in developing a backorder strategy is to understand which products are at risk of being backordered. Sounds easy right? Wrong. A number of factors can make this a challenging business problem.
DemandEvent Dependencies
Demand for existing products are constantly changing, often dependent on nonbusiness data (holidays, weather, and other events). Consider demand for snow blowers at Home Depot. We know that geographicallyspeaking in Northern USA the winter is when demand will rise and the spring is when demand will fall. However, this demand is highly dependent on the level of snowfall. If snowfall is minimal, demand will be low regardless of it being winter. Therefore, Home Depot will need to know the likelihood of snowfall to best predict demand surges and shortfalls in order to optimize inventory.
Small Numbers (Anomalies)
If backorders are very infrequent but highly important, it can be very difficult to predict the minority class accurately because of the imbalance between backorders to nonbackorders within the data set. Accuracy for the model will look great, but the actual predictive quality may be very poor. Consider the NULL error rate, or the rate of just picking the majority class. Just picking â€śnonbackorderâ€ť may be the same or more accurate than the model. Special strategies exist for dealing with unbalanced data, and weâ€™ll implement SMOTE.
Time Dependency
Often this problem is viewed as a crosssectional problem rather than as a temporal (or timebased) problem. The problem is time can play a critical role. Just because a product sold out this time last year, will it sell out again this year? Maybe. Maybe not. Timebased demand forecasting can be necessary to augment crosssectional binary classification.
Case Study: Predicting Backorder Risk and Optimizing Profit
The Problem
A hypothetical manufacturer has a data set that identifies whether or not a backorder has occurred. The challenge is to accurately predict future backorder risk using predictive analytics and machine learning and then to identify the optimal strategy for inventorying products with high backorder risk.
The Data
The data comes from Kaggleâ€™s Can You Predict Product Backorders? dataset. If you have a Kaggle account, you can download the data, which includes both a training and a test set. Weâ€™ll use the training set for developing our model and the test set for determining the final accuracy of the best model.
The data file contains the historical data for the 8 weeks prior to the week we are trying to predict. The data were taken as weekly snapshots at the start of each week. The target (or response) is the went_on_backorder
variable. To model and predict the target, weâ€™ll use the other features, which include:
 sku  Random ID for the product
 national_inv  Current inventory level for the part
 lead_time  Transit time for product (if available)
 in_transit_qty  Amount of product in transit from source
 forecast_3_month  Forecast sales for the next 3 months
 forecast_6_month  Forecast sales for the next 6 months
 forecast_9_month  Forecast sales for the next 9 months
 sales_1_month  Sales quantity for the prior 1 month time period
 sales_3_month  Sales quantity for the prior 3 month time period
 sales_6_month  Sales quantity for the prior 6 month time period
 sales_9_month  Sales quantity for the prior 9 month time period
 min_bank  Minimum recommend amount to stock
 potential_issue  Source issue for part identified
 pieces_past_due  Parts overdue from source
 perf_6_month_avg  Source performance for prior 6 month period
 perf_12_month_avg  Source performance for prior 12 month period
 local_bo_qty  Amount of stock orders overdue
 deck_risk  Part risk flag
 oe_constraint  Part risk flag
 ppap_risk  Part risk flag
 stop_auto_buy  Part risk flag
 rev_stop  Part risk flag
 went_on_backorder  Product actually went on backorder. This is the target value.
Using Machine Learning to Predict Backorders
Libraries
Weâ€™ll use the following libraries:
tidyquant
: Used to quickly load the â€śtidyverseâ€ť (dplyr
,tidyr
,ggplot
, etc) along with custom, businessreportfriendlyggplot
themes. Also great for time series analysis (not featured)unbalanced
: Contains various methods for working with unbalanced data. Weâ€™ll use theubSMOTE()
function.h2o
: Goto package for implementing professional grade machine learning.
Note on H2O: Youâ€™ll want to install the latest stable R version from H2O.ai. If you have issues in this post, you probably did not follow these steps:
 Go to H2O.aiâ€™s download page
 Under H2O, select â€śLatest Stable Releaseâ€ť
 Click on the â€śInstall in Râ€ť tab
 Follow instructions exactly.
Load the libraries.
Load Data
Use read_csv()
to load the training and test data.
Inspect Data
Inspecting the head and tail, we can get an idea of the data set. Thereâ€™s some sporadic NA
values in the â€ślead_timeâ€ť column along with 99 values in the two supplier performance columns.
sku  national_inv  lead_time  in_transit_qty  forecast_3_month  forecast_6_month  forecast_9_month  sales_1_month  sales_3_month  sales_6_month  sales_9_month  min_bank  potential_issue  pieces_past_due  perf_6_month_avg  perf_12_month_avg  local_bo_qty  deck_risk  oe_constraint  ppap_risk  stop_auto_buy  rev_stop  went_on_backorder 

1026827  0  NA  0  0  0  0  0  0  0  0  0  No  0  99.00  99.00  0  No  No  No  Yes  No  No 
1043384  2  9  0  0  0  0  0  0  0  0  0  No  0  0.99  0.99  0  No  No  No  Yes  No  No 
1043696  2  NA  0  0  0  0  0  0  0  0  0  No  0  99.00  99.00  0  Yes  No  No  Yes  No  No 
1043852  7  8  0  0  0  0  0  0  0  0  1  No  0  0.10  0.13  0  No  No  No  Yes  No  No 
1044048  8  NA  0  0  0  0  0  0  0  4  2  No  0  99.00  99.00  0  Yes  No  No  Yes  No  No 
1044198  13  8  0  0  0  0  0  0  0  0  0  No  0  0.82  0.87  0  No  No  No  Yes  No  No 
The tail shows another issue, the last row is NA
values.
sku  national_inv  lead_time  in_transit_qty  forecast_3_month  forecast_6_month  forecast_9_month  sales_1_month  sales_3_month  sales_6_month  sales_9_month  min_bank  potential_issue  pieces_past_due  perf_6_month_avg  perf_12_month_avg  local_bo_qty  deck_risk  oe_constraint  ppap_risk  stop_auto_buy  rev_stop  went_on_backorder 

1407754  0  2  0  10  10  10  0  5  7  7  0  No  0  0.69  0.69  5  Yes  No  No  Yes  No  No 
1373987  1  NA  0  5  7  9  1  3  3  8  0  No  0  99.00  99.00  1  No  No  No  Yes  No  No 
1524346  1  9  0  7  9  11  0  8  11  12  0  No  0  0.86  0.84  1  Yes  No  No  No  No  Yes 
1439563  62  9  16  39  87  126  35  63  153  205  12  No  0  0.86  0.84  6  No  No  No  Yes  No  No 
1502009  19  4  0  0  0  0  2  7  12  20  1  No  0  0.73  0.78  1  No  No  No  Yes  No  No 
NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
If we take a look at the distribution of the target, train_raw_df$went_on_backorder
, we see that the data set is severely imbalanced. Weâ€™ll need a strategy to balance the data set if we want to get maximum model performance and efficiency.
We can also inspect missing values. Because both the train and test sets have missing values, weâ€™ll need to come up with a strategy to deal with them.
The test set has a similar distribution.
Finally, itâ€™s also worth taking a glimpse
of the data. We can quickly see:
 We have some character columns with Yes/No values.
 The â€śperf_â€ť columns have 99 values, which reflect missing data.
Data Setup
Validation Set
We have a training and a test set, but weâ€™ll need a validation set to assist with modeling. We can perform a random 85/15 split of the training data.
Data PreProcessing
Preprocessing includes a number of steps to get the raw data ready for modeling. We can make a preprocess function that drops unnecessary columns, deals with NA
values, converts Yes/No data to 1/0, and converts the target to a factor. These are all standard preprocessing steps that will need to be applied to each of the data sets.
Data Transformation
A step may be needed to transform (normalize, center, scale, apply PCA, etc) the data especially if using deep learning as a classification method. In our case, overall performance actually decreased when transformation was performed. We skip this step because of this. Interested readers may wish to explore the recipes
package for creating and applying ML preprocessing recipes, which has great documentation.
Balance Dataset with SMOTE
As we saw previously, the data is severely unbalanced with only 0.7% of cases being positive (went_on_backorder = Yes
). To deal with this class imbalance, weâ€™ll implement a technique called SMOTE (synthetic minority oversampling technique), which oversamples the minority class by generating synthetic minority examples in the neighborhood of observed ones. In other words, it shrinks the prevalence of the majority class (under sampling) while simultaneously synthetically increasing the prevalence of the minority class using a knearest neighbors approach (over sampling via knn). The great thing is that SMOTE can improve classifier performance (due to better classifier balance) and improve efficiency (due to smaller but focused training set) at the same time (winwin)!
The ubSMOTE()
function from the unbalanced
package implements SMOTE (along with a number of other sampling methods such as over, under, and several advanced methods). Weâ€™ll setup the following arguments:
perc.over = 200
: This is the percent of new instances generated for each rare instance. For example, if there were only 5 rare cases, the 200% percent over will synthetically generate and additional 10 rare cases.perc.under = 200
: The percentage of majority classes selected for each SMOTE observations. If 10 additional observations were created through the SMOTE process, 20 majority cases would be sampled.k = 5
: The number of nearest neighbors to select when synthetically generating new observations.
We need to recombine the results into a tibble
. Notice there are now only 68K observations versus 1.4M previously. This is a good indication that SMOTE worked correctly. As an added benefit, the training set size has shrunk which will make the model training significantly faster.
We can also check the new balance of Yes/No: Itâ€™s now 43% Yes to 57% No, which in theory should enable the classifier to better detect relationships with the positive (Yes) class.
Modeling with h2o
Now we are ready to model. Letâ€™s fire up h2o
.
The h2o
package deals with H2OFrame
(instead of data frame). Letâ€™s convert.
We use the h2o.automl()
function to automatically try a wide range of classification models. The most important arguments are:
training_frame
: Supply our training data set to build models.validation_frame
: Supply our validation data set to prevent models from overfittingleaderboard_frame
: Supply our test data set, and models will be ranked based on performance on the test set.max_runtime_secs
: Controls the amount of runtime for each model. Setting to 45 seconds will keep things moving but give each model a fair shot. If you want even higher accuracy you can try increasing the run time. Just be aware that there tends to be diminishing returns.
Extract our leader model.
Making Predictions
We use h2o.predict()
to make our predictions on the test set. The data is still stored as an h2o
object, but we can easily convert to a data frame with as.tibble()
. Note that the left column (â€śpredictâ€ť) is the class prediction, and columns â€śp0â€ť and â€śp1â€ť are the probabilities.
Assessing Performance
We use h2o.performance()
to get performancerelated information by passing the function the leader model and the test set in the same fashion as we did for the prediction. Weâ€™ll use the performance output to visualize ROC, AUC, precision and recall.
From the performance output we can get a number of key classifier metrics by threshold using h2o.metric()
.
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical method that pits the true positive rate (yaxis) against the false positive rate (xaxis). The benefit to the ROC curve is twofold:
 We can visualize how the binary classification model compares to randomly guessing
 We can calculate AUC (Area Under the Curve), which is a method to compare models (perfect classification = 1).
Letâ€™s review the ROC curve for our model using h2o
. The red dotted line is what you could theoretically achieve by randomly guessing.
We can also get the AUC of the test set using h2o.auc()
. To give you an idea, the best Kaggle data scientists are getting AUC = 0.95. At AUC = 0.92, our automatic machine learning model is in the same ball park as the Kaggle competitors, which is quite impressive considering the minimal effort to get to this point.
The Cutoff (Threshold)
The cutoff (also known as threshold) is the value that divides the predictions. If we recall, the prediction output has three columns: â€śpredictâ€ť, â€śp0â€ť, and â€śp1â€ť. The first column, â€śpredictâ€ť (model predictions) uses an automatically determined threshold value as the cutoff. Everything above is classified as â€śYesâ€ť and below as â€śNoâ€ť.
How is the Cutoff Determined?
The algorithm uses the threshold at the best F1 score to determine the optimal value for the cutoff.
We could use a different threshold if we like, which can change depending on your goal (e.g. maximize recall, maximize precision, etc). For a full list of thresholds, we can inspect the â€śmetricsâ€ť slot that is inside the perf_h2o
object.
Is cutoff at max F1 what we want? Probably not as weâ€™ll see next.
Optimizing the Model for Expected Profit
In the business context we need to decide what cutoff (threshold of probability to assign yes/no) to use: Is the â€śp1â€ť cutoff >= 0.63 (probability above which predict â€śYesâ€ť) adequate? The answer lies in the balance between the cost of inventorying incorrect product (low precision) versus the cost of the lost customer (low recall):

Precision: When model predicts yes, how often is the actual value yes. If we implement a high precision (low recall) strategy then we need to accept a tradeoff of letting the model misclassify actual yes values to decrease the number of incorrectly predicted yes values.

Recall: When actual is yes, how often does the model predict yes. If we implement a high recall (low precision) strategy then we need to accept a tradeoff of letting the model increase the no values incorrectly predicted as yes values.
Effect of Precision and Recall on Business Strategy
By shifting the cutoff, we can control the precision and recall and this has major effect on the business strategy. As the cutoff increases, the model becomes more conservative being very picky about what it classifies as went_on_backorder = Yes
.
Expected Value Framework
We need another function to help us determine which values of cutoff to use, and this comes from the Expected Value Framework described in Data Science for Business.
Source: Data Science for Business
To implement this framework we need two things:
 Expected Rates (matrix of probabilities): Needed for each threshold
 Cost/Benefit Information: Needed for each observation
Expected Rates
We have the class probabilities and rates from the confusion matrix, and we can retrieve this using the h2o.confusionMatrix()
function.
The problems are:
 The confusion matrix is for just one specific cutoff. We want to evaluate for all of the potential cutoffs to determine which strategy is best.
 We need to convert to expected rates, as shown in the Figure 72 diagram.
Luckily, we can retrieve the rates by cutoff conveniently using h2o.metric()
. Huzzah!
Cost/Benefit Information
The costbenefit matrix is a business assessment of the cost and benefit for each of four potential outcomes:

True Negative (benefit): This is the benefit from a SKU that is correctly predicted as not backordered. The benefit is zero because a customer simply does not buy this item (lack of demand).

True Positive (benefit): This is the benefit from a SKU that is correctly predicted as backorder and the customer needs this now. The benefit is the profit from revenue. For example, if the item price is $1000 and the cost is $600, the unit profit is $400.

False Positive (cost): This is the cost associated with incorrectly classifying a SKU as backordered when demand is not present. The is the warehousing and inventory related cost. For example, if the item costs $10/unit to inventory, which would have otherwise not have occurred.

False Negative (cost): This is the cost associated with incorrectly classifying a SKU when demand is present. In our case, no money was spent and nothing was gained.
The costbenefit information is needed for each decision pair. If we take one SKU, say the first prediction from the test set, we have an item that was predicted to NOT be backordered (went_on_backorder == "No"
). If hypothetically the value for True Positive (benefit) is $400/unit in profit from correctly predict a backorder and the False Positive (cost) of accidentally inventorying and item that was not backordered is $10/unit then a data frame can be be structured like so.
Optimization: Single Item
The expected value equation generalizes to:
\[EV = \sum_{i=1}^{N} p_i*v_i\]Where,
 p_{i} is the probability of the outcome for one observation
 v_{i} is the value of the outcome for one observation
The general form isnâ€™t very useful, but from it we can create an Expected Profit equation using a basic rule of probability p(x,y) = p(y)*p(xy) that combines both the Expected Rates (2x2 matrix of probabilities after normalization of Confusion Matrix) and a Cost/Benefit Matrix (2x2 matrix with expected costs and benefits). Refer to Data Science for Business for the proof.
\[EP = p(p)*[p(Yp)*b(Y,p)+p(Np)*c(N,p)]\\ + p(n)*[p(Nn)*b(N,n)+p(Yn)*c(Y,n)]\]Where,
 p(p) is the confusion matrix positive class prior (probability of actual yes / total)
 p(n) is the confusion matrix positive class prior (probability of actual no / total = 1  positive class prior)
 p(Yp) is the True Positive Rate (
tpr
)  p(Np) is the False Negative Rate (
fnr
)  p(Nn) is the True Negative Rate (
tnr
)  p(Yn) is the False Positive Rate (
fpr
)  b(Y,p) is the benefit from true positive (
cb_tp
)  c(N,p) is the cost from false negative (
cb_fn
)  b(N,n) is the benefit from true negative (
cb_tn
)  c(Y,n) is the cost from false positive (
cb_fp
)
This equation simplifies even further since we have a case where cb_tn
and cb_fn
are both zero.
We create a function to calculate the expected profit using the probability of a positive case (positive prior, p1
), the cost/benefit of a true positive (cb_tp
), and the cost/benefit of a false positive (cb_fp
). Weâ€™ll take advantage of the expected_rates
data frame we previously created, which contains the true positive rate and false positive rate for each threshold (400 thresholds in the range of 0 and 1).
We can test the function for a hypothetical prediction that is unlikely to have a backorder (most common class). Setting p1 = 0.01
indicates the hypothetical SKU has a low probability. Continuing with the example case of $400/unit profit and $10/unit inventory cost, we can see that optimal threshold is approximately 0.4. Note that an inventory all items strategy (threshold = 0) would cause the company to lose money on low probability of backorder items ($6/unit) and an inventory nothing strategy would result in no benefit but no loss ($0/unit).
Conversely if we investigate a hypothetical item with high probability of backorder, we can see that itâ€™s much more advantageous to have a loose strategy with respect to inventory conservatism. Notice the profit perunit is 80% of the theoretical maximum profit (80% of $400/unit = $320/unit if â€śp1â€ť = 0.8). The profit decreases to zero as the inventory strategy becomes more conservative.
Letâ€™s take a minute to digest whatâ€™s going on in both the high and low expected profit curves. Units with low probability of backorder (the majority class) will tend to increase the cutoff while units with high probability will tend to lower the cutoff. Itâ€™s this balance or tradeoff that we need to scale to understand the full picture.
Optimization: MultiItem
Now the fun part, scaling the optimization to multiple products. Letâ€™s analyze a simplified case: 10 items with varying backorder probabilities, benefits, costs, and safety stock levels. In addition, weâ€™ll include a backorder purchase quantity with logic of 100% safety stock (meaning items believed to be backordered will have an additional quantity purchased equal to that of the safety stock level). Weâ€™ll investigate optimal stocking level for this subset of items to illustrate scaling the analysis to find the global optimized cutoff (threshold).
We use purrr
to map the calc_expected_profit()
to each item, thus returning a data frame of expected profits per unit by threshold value. We unnest()
to expand the data frame to one level enabling us to work with the expected profits and quantities. We then â€śextendâ€ť (multiply the unit expected profit by the backorderprevention purchase quantity, which is 100% of safety stock level per our logic) to get total expected profit per unit.
We can visualize the expected profit curves for each item extended for backorderprevention quantity to be purchased and sold (note that selling 100% is a simplifying assumption).
Finally, we can aggregate the expected profit using a few dplyr
operations. Visualizing the final curve exposes the optimal threshold.
Conclusions
This was a very technical and detailed post, and if you made it through congratulations! We covered automated machine learning with H2O, an efficient and high accuracy tool for prediction. We worked with an extremely unbalanced data set, showing how to use SMOTE to synthetically improve dataset balance and ultimately model performance. We spent a considerable amount of effort optimizing the cutoff (threshold) selection to maximize expected profit, which ultimately matters most to the bottom line. Hopefully you can see how data science and machine learning can be very beneficial to the business, enabling better decisions and ROI.
Business Science University
Enjoy data science for business? We do too. This is why we created Business Science University where we teach you how to do Data Science For Busines (#DS4B) just like us!
Our first DS4B course (HR 201) is now available!
Who is this course for?
Anyone that is interested in applying data science in a business context (we call this DS4B). All you need is basic R
, dplyr
, and ggplot2
experience. If you understood this article, you are qualified.
What do you get it out of it?
You learn everything you need to know about how to apply data science in a business context:

Using ROIdriven data science taught from consulting experience!

Solve highimpact problems (e.g. $15M Employee Attrition Problem)

Use advanced, bleedingedge machine learning algorithms (e.g. H2O, LIME)

Apply systematic data science frameworks (e.g. Business Science Problem Framework)
About Business Science
Business Science specializes in â€śROIdriven data scienceâ€ť. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!
Follow Business Science on Social Media
 @bizScienc is on twitter!
 Check us out on Facebook page!
 Check us out on LinkedIn!
 Sign up for our insights blog to stay updated!
 If you like our software, star our GitHub packages!