orderSimulatoR: Simulate Orders for Business Analytics
Written
In this post, we will be discussing orderSimulatoR
, which enables fast and easy R
order simulation for customer and product learning. The basic premise is to simulate data that you’d retrieve from a SQL
query of an ERP system. The data can then be merged with products and customers tables to data mine. I’ll go through the basic steps to create an order data set that combines customers and products, and I’ll wrap up with some visualizations to show how you can use order data to expose trends. You can get the scripts and the Cannondale bikes data set
at the orderSimulatoR
GitHub repository. In case you are wondering what simulated orders look like, click here to scroll to the end result.
About the Photo:
The bicycle in the photo is one of Cannondale’s top-of-the-line cross-country mountain bikes, the Scalpel. The orderSimulatoR
scripts were used to create the Cannondale bikes data set
, which simulates orders for the bicycle manufacturer, Cannondale. The data set uses their 2016 models. For more information, visit their website at www.cannondale.com.
Table of Contents
Why orderSimulatoR?
It’s very difficult to create order data with trends for data mining and visualization. I’ve searched for good data sets, and I’ve come to the conclusion that I’m better off creating my own orders data for analyzing on my blog. In the process, I made a collection of R
scripts that became known as orderSimulatoR
. The scripts are publicly available so others can simulate orders for customers and products in any sales/marketing situation. The process to create orders (shown below) is fast and easy, and the result is orders with customized trends depending on the inputs you create and the parameters you select.
Getting Started
Prior to starting, you’ll need three data frames:
- Customers
- Products
- Customer-Product Interactions
I’ll briefly go through each using the Cannondale bikes data set
. If you’re following along, you can access the excel files here. Alternatively, you can generate your own data frames of customers, products and customer-production interactions. Note that the orders.xlsx
file is what we will be generating in data frame form. The code below loads the three data frames.
Customers
The customers for the example case are bike shops scattered throughout the United States. I’ve included id
, name
, city
state
, latitude
, and longitude
. The id
is important and must be in the first column for the orderSimulatoR
scripts to work. The other features are up to you. Feel free to get creative when designing your customers data frame by adding features such as region, size, type, etc that would normally be found in a CRM / ERP system. While features beyond the id
are not required for the scripts, it’s a good idea to add them now for data mining later. The first six customer records are shown below.
bikeshop.id |
bikeshop.name |
bikeshop.city |
bikeshop.state |
latitude |
longitude |
1 |
Pittsburgh Mountain Machines |
Pittsburgh |
PA |
40.44062 |
-79.99589 |
2 |
Ithaca Mountain Climbers |
Ithaca |
NY |
42.44396 |
-76.50188 |
3 |
Columbus Race Equipment |
Columbus |
OH |
39.96118 |
-82.99879 |
4 |
Detroit Cycles |
Detroit |
MI |
42.33143 |
-83.04575 |
5 |
Cincinnati Speed |
Cincinnati |
OH |
39.10312 |
-84.51202 |
6 |
Louisville Race Equipment |
Louisville |
KY |
38.25267 |
-85.75846 |
Products
The products for the example case are road and mountain bikes made by Cannondale, a premier bicycle manufacturer. I’ve included the id
and several other features I found on the website including model
, category1
(Road, Mountain), category2
(Elite Road, Endurance Road, etc), frame
(Carbon, Aluminum), and price. Again, id
is the most important feature and must be in the first column for the scripts to work. As with customers, get creative with the features. We’ll use the features when developing the product-customer interactions, which creates the trends that we’ll data mine. The first six products are shown below.
bike.id |
model |
category1 |
category2 |
frame |
price |
1 |
Supersix Evo Black Inc. |
Road |
Elite Road |
Carbon |
12790 |
2 |
Supersix Evo Hi-Mod Team |
Road |
Elite Road |
Carbon |
10660 |
3 |
Supersix Evo Hi-Mod Dura Ace 1 |
Road |
Elite Road |
Carbon |
7990 |
4 |
Supersix Evo Hi-Mod Dura Ace 2 |
Road |
Elite Road |
Carbon |
5330 |
5 |
Supersix Evo Hi-Mod Utegra |
Road |
Elite Road |
Carbon |
4260 |
6 |
Supersix Evo Red |
Road |
Elite Road |
Carbon |
3940 |
Customer-Product Interactions
The customer-products interactions is a matrix that links the probability of each customer to purchasing each product (i.e. it is a model for each customer’s purchasing preference). The scripts use this matrix to assign products to orders. The probabilities are critical as these will be what create the customer trends in the data. It’s a good idea to check out the excel spreadsheet, customer_product_interactions.xlsx
, which has detailed notes on how to add trends into the customer-product interactions. For the bikes data set, the customers each had a preference for bike style (Road, Mountain, or AnyStyle) and a price range (High and Low). The excel functions assign various probabilities based on how the customer preferences match the product features. A few important points:
- Product id must be the first column
- All subsequent columns should be the customers in order of customer id from 1 to the last customer id.
- Each customer column should sum to 1.0, as this is the product probability density for each customer.
The customer product interactions are shown for the first six customers and products. The way to interpret the matrix is that customer.id.1 has a 0.986% probability of seleting product.id.1 (bike.id = 1).
bike.id |
customer.id.1 |
customer.id.2 |
customer.id.3 |
customer.id.4 |
customer.id.5 |
1 |
0.0099 |
0.0099 |
0.0196 |
0.0057 |
0.0196 |
2 |
0.0099 |
0.0099 |
0.0196 |
0.0057 |
0.0196 |
3 |
0.0099 |
0.0099 |
0.0196 |
0.0057 |
0.0196 |
4 |
0.0099 |
0.0099 |
0.0196 |
0.0057 |
0.0196 |
5 |
0.0099 |
0.0099 |
0.0196 |
0.0057 |
0.0196 |
6 |
0.0099 |
0.0099 |
0.0196 |
0.0057 |
0.0196 |
Creating Customer Orders
Once the customer, product and customer-product interaction data frames are finished, you are ready to create orders. The first thing you’ll need is to load the scripts. The scripts can be found here.
Step 1: Create Orders and Lines
First, we’ll create the orders and lines using the createOrdersAndLines()
function. For those unfamiliar with lines, each order typically has several products purchased which are denoted “lines”. The function parameters are number of orders n
, max number of lines maxLines
, and rate
, which affects the distribution of lines on an order.
Quick Digression on the rate
Parameter:
This section gets a little technical, so feel free to skip and you won’t miss much.
Several other orderSimulatoR
functions use a similar rate
parameter, and I recommend playing around with the rates to see how the distribution is affected. The following code generates the probabilities of lines to an order:
The equation iterates through the number of lines to an order, \(i\), from 1
to maxLines
in the equation:
\[lineProb_i = \frac{1}{i^{rate}}\]
Where \(i\) is the number of lines to an order, and \(rate\) is the adjustment parameter. As the number of lines to an order, \(i\), increases, the probability naturally decreases. Because \(i\) is raised to the \(rate\) parameter, the \(rate\) determines how fast or slow the distribution decreases. Larger values cause the distribution to decrease faster, and values close to zero cause the distribution to become more uniform.
The effect of rate
adjustment can be conveyed by plotting the probability distribution of orders with maxLines
= 10 while adjusting the rate
parameter from 0 to 1.5. At rate
= 0, the probability of a one-line order is 10%, while at a rate
of 1.5 the probability is 50%.
Back to Creating Orders and Lines:
With inputs of n
= 2000, maxLines
= 30, and rate
= 1, the output is a data frame with 2000 orders with 15,644 total rows for each product purchase. Some orders have more lines (more product purchases) and others have less (as dictated by the rate
parameter). One note worth mentioning is that the maxLines
cannot exceed the number of products (otherwise an order would have more lines than products available which does not make sense).
order.id |
order.line |
1 |
1 |
1 |
2 |
2 |
1 |
2 |
2 |
3 |
1 |
3 |
2 |
3 |
3 |
3 |
4 |
3 |
5 |
Step 2: Add Dates to the Orders
Orders typically have a date recorded so the sales can be tracked by time period (e.g. for assessment of time-based metrics like sales by month, quarter, year, etc). As such, we’ll add dates according the distribution of orders in a given year using the createDatesFromOrders()
function, which takes four parameters: the orders
data frame from Step 1, a startYear
, a yearlyOrderDist
, and a monthlyOrderDist
. The startYear
identifies the beginning of the time span, and the length of the yearlyOrderDist
(length = 5) creates orders spanning five years. Yearly growth/contraction in sales is simulated using the yearlyOrderDist
distribution vector, while seasonality is simulated using the monthlyOrderDist
distribution vector. As shown below, we now have dates added according to our desired yearly and monthly distributions.
order.id |
order.line |
order.date |
1 |
1 |
2011-01-07 |
1 |
2 |
2011-01-07 |
2 |
1 |
2011-01-10 |
2 |
2 |
2011-01-10 |
3 |
1 |
2011-01-10 |
3 |
2 |
2011-01-10 |
The script takes into account no sales on weekends, but does not account for holidays (e.g. Christmas, Thanksgiving, etc). The last six rows of the orders
data frame are shown below.
|
order.id |
order.line |
order.date |
15639 |
2000 |
3 |
2015-12-25 |
15640 |
2000 |
4 |
2015-12-25 |
15641 |
2000 |
5 |
2015-12-25 |
15642 |
2000 |
6 |
2015-12-25 |
15643 |
2000 |
7 |
2015-12-25 |
15644 |
2000 |
8 |
2015-12-25 |
We can use a heatmap to graphically view the order distribution by year and month to get an idea for how the order dates were assigned. As expected, the seasonality of orders follows the monthlyOrderDist
distribution, with the majority of orders coming in March through July. The orders increase following the yearlyOrderDist
distribution.
Step 3: Assign Customers to Orders
Next, we’ll assign customers to orders using the assignCustomersToOrders()
function. The parameters needed are the orders
data frame from Step 2, the customers
data frame, and the rate
parameter, which controls the distribution of orders assigned to customers. Similar to the rate
parameter in Step 1, a larger rate increases the number of orders assigned to the largest customers, and as the rate decreases to zero the distribution approaches a uniform distribution. As shown below, orders now have customer id’s.
order.id |
order.line |
order.date |
customer.id |
1 |
1 |
2011-01-07 |
2 |
1 |
2 |
2011-01-07 |
2 |
2 |
1 |
2011-01-10 |
10 |
2 |
2 |
2011-01-10 |
10 |
3 |
1 |
2011-01-10 |
6 |
3 |
2 |
2011-01-10 |
6 |
3 |
3 |
2011-01-10 |
6 |
3 |
4 |
2011-01-10 |
6 |
3 |
5 |
2011-01-10 |
6 |
Using a rate
= 0.8 still assigns a wide range of orders to customers, and we can see the distribution of order rows by customer in the plot below. Customer 10 has the most order rows at 2731, while Customer 21 has the least order rows at 108.
Step 4: Assign Products to Orders Lines
In step 4, we’ll assign products to the orders using the customer-product interaction table. Keep in mind that this is the critical step where a well-thought-out customer-product interaction table will allow us to uncover deep trends and insights into customer behavior. The assignProductsToCustomerOrders()
function takes two arguments: the orders
data frame from Step 3 and the customerProductProbs
table, which contains the matrix of probabilities linking product.id to customer.id.
order.id |
order.line |
order.date |
customer.id |
product.id |
1 |
1 |
2011-01-07 |
2 |
48 |
1 |
2 |
2011-01-07 |
2 |
52 |
2 |
1 |
2011-01-10 |
10 |
76 |
2 |
2 |
2011-01-10 |
10 |
52 |
3 |
1 |
2011-01-10 |
6 |
2 |
3 |
2 |
2011-01-10 |
6 |
50 |
3 |
3 |
2011-01-10 |
6 |
1 |
3 |
4 |
2011-01-10 |
6 |
4 |
3 |
5 |
2011-01-10 |
6 |
34 |
Step 5: Assign Quantities to Order Lines
In the last step, we assign quantities to the order lines using the createProductQuantities()
function. The function takes three parameters: the orders
data frame from Step 4, the maxQty
, and the rate
. The maxQty
parameter limits the quantity of products on a order line while the rate
parameter controls the distribution. A maxQty
= 10 and a rate
= 3 places an 83.5% probability on a line quantity of 1, whereas a line quantity of 10 (maximum quantity) has a 0.08% probability.
order.id |
order.line |
order.date |
customer.id |
product.id |
quantity |
1 |
1 |
2011-01-07 |
2 |
48 |
1 |
1 |
2 |
2011-01-07 |
2 |
52 |
1 |
2 |
1 |
2011-01-10 |
10 |
76 |
1 |
2 |
2 |
2011-01-10 |
10 |
52 |
1 |
3 |
1 |
2011-01-10 |
6 |
2 |
1 |
3 |
2 |
2011-01-10 |
6 |
50 |
1 |
3 |
3 |
2011-01-10 |
6 |
1 |
1 |
3 |
4 |
2011-01-10 |
6 |
4 |
1 |
3 |
5 |
2011-01-10 |
6 |
34 |
1 |
Joining Orders with Customers and Products
The orders table alone does not give us much information. For instance, we don’t know the names of the customers, the products purchased, or the sales values. Let’s combine some of the tables to make the information more useful.
order.date |
order.id |
order.line |
bikeshop.name |
model |
quantity |
price |
price.extended |
category1 |
category2 |
frame |
2011-01-07 |
1 |
1 |
Ithaca Mountain Climbers |
Jekyll Carbon 2 |
1 |
6070 |
6070 |
Mountain |
Over Mountain |
Carbon |
2011-01-07 |
1 |
2 |
Ithaca Mountain Climbers |
Trigger Carbon 2 |
1 |
5970 |
5970 |
Mountain |
Over Mountain |
Carbon |
2011-01-10 |
2 |
1 |
Kansas City 29ers |
Beast of the East 1 |
1 |
2770 |
2770 |
Mountain |
Trail |
Aluminum |
2011-01-10 |
2 |
2 |
Kansas City 29ers |
Trigger Carbon 2 |
1 |
5970 |
5970 |
Mountain |
Over Mountain |
Carbon |
2011-01-10 |
3 |
1 |
Louisville Race Equipment |
Supersix Evo Hi-Mod Team |
1 |
10660 |
10660 |
Road |
Elite Road |
Carbon |
2011-01-10 |
3 |
2 |
Louisville Race Equipment |
Jekyll Carbon 4 |
1 |
3200 |
3200 |
Mountain |
Over Mountain |
Carbon |
2011-01-10 |
3 |
3 |
Louisville Race Equipment |
Supersix Evo Black Inc. |
1 |
12790 |
12790 |
Road |
Elite Road |
Carbon |
2011-01-10 |
3 |
4 |
Louisville Race Equipment |
Supersix Evo Hi-Mod Dura Ace 2 |
1 |
5330 |
5330 |
Road |
Elite Road |
Carbon |
2011-01-10 |
3 |
5 |
Louisville Race Equipment |
Synapse Disc 105 |
1 |
1570 |
1570 |
Road |
Endurance Road |
Aluminum |
Great. Now this finally looks like orders data that could be retrieved from an ERP system. We are ready to start visualizing and data mining. The last part of this post will go into some high level visualizations for data exploration. Future blog posts will dig much deeper.
Exploring the Orders
If you’ve followed along to this point, we have just created the orders from the bikes data set
. Now, we can have some fun! Let’s examine the data set.
Sales Over Time
We’ll first take a look at sales over time using the ggplot2
package.
Nice. Cannondale has a general growth trend. It looks like 2015 was Cannondale’s best year with $17,171,510 in total sales!
Top 10 Products
Next, let’s explore some products to get an idea of top sellers.
It looks like the best model is the Cannondale Scalpel-Si Black Inc. with 168 units sold representing sales of $2,148,720 or 3.02% of total sales.
Geographic Customer Trends
Finally, we’ll take a look at a map of our customer-base using the leaflet
package. The map below charts the customer locations and exposes the sales trends using the circle radius. Larger circles relate to higher sales, and smaller circles relate to lower sales. You can click on the markers to see the bike shop name and sales.
This is elightening. Not only is it easy to see who are largest customers are and their locations, but we can see geographic trends. It looks like the southeast US and the midwest are realatively unsaturated geographies. This could be an opportunity to partner with bikeshops in those areas!
Recap
Ok, you’ve made it through a long post, but one that hopefully helps you show off your analytic talents. We discussed how to simulate orders using the orderSimulatoR
scripts, which can be used to simulate orders in any situation that has customers and products. All you need is a customer table, products table, and a well-thought-out matrix of probabilities that create the customer-product interaction trends. The simulated orders can then be used for data mining and visualizations. We went through a few high level data visualizations on the bikes data set
. We used the ggplot2
and leaflet
packages to help understand the overall sales trends, best products, and best customers. With that said, we just scratched the surface of the data mining and visualization process. In future posts, we’ll take a look at more advanced techniques for uncover trends. Until then, onward and upward.
Further Resources
For more R
resources, visit R-Bloggers, a site dedicated to R
. There’s a wealth of R
information, and you may even see this post. Happy analyzing.