Product Price Prediction: A Tidy Hyperparameter Tuning and Cross Validation Tutorial
Written by Matt Dancho
Product price estimation and prediction is one of the skills I teach frequently - It's a great way to analyze competitor product information, your own company's product data, and develop key insights into which product features influence product prices. Learn how to model product car prices and calculate depreciation curves using the brand new
tune package for Hyperparameter Tuning Machine Learning Models. This is an Advanced Machine Learning Tutorial.
R Packages Covered
tune- Hyperparameter tuning framework
parsnip- Machine learning framework
dials- Grid search development framework
rsample- Cross-validation and sampling framework
recipes- Preprocessing pipeline framework
yardstick- Accuracy measurements
correlationfunnel- Detect relationships using binary correlation analysis
DataExplorer- Investigate data cleanliness
skimr- Summarize data by data type
Summary (Why read this?)
Hyperparameter tuning and cross-validation have previously been quite difficult using
parsnip, the new machine learning framework that is part of the
tidymodels ecosystem. Everything has changed with the introduction of the
tune package - the
tidymodels hyperparameter tuning framework that integrates:
parsnipfor machine learning
dialsfor grid search
This is a machine learning tutorial where we model auto prices (MSRP) and estimate depreciation curves.
Estimate Depreciation Curves using Machine Learning
To implement the Depreciation Curve Estimation, you develop a machine learning model that is hyperparameter tuned using a 3-Stage Nested Hyper Parameter Tuning Process with 5-Fold Cross-Validation.
3-Stage Nested Hyperparameter Tuning Process
3 Stage Hyperparameter Tuning Process:
Find Parameters: Use Hyper Parameter Tuning on a “Training Dataset” that sections your training data into 5-Folds. The output at Stage 1 is the parameter set.
Compare and Select Best Model: Evaluate the performance on a hidden “Test Dataset”. The ouput at Stage 2 is that we determine best model.
Train Final Model: Once we have selected the best model, we train on the full dataset. This model goes into production.
Need to learn Data Science for Business? This is an advanced tutorial, but you can get the foundational skills, advanced machine learning, business consulting, and web application development using
AWS (Cloud), and
tidyverse (data manipulation and visualization). I recommend my 4-Course R-Track for Business Bundle.
Why Product Pricing is Important
Product price prediction is an important tool for businesses. There are key pricing actions that machine learning and algorithmic price modeling can be used for.
Which brands will customers pay more for?
Defend against inconsistent pricing
There is nothing more confusing to customers than pricing products inconsistently. Price too high, and customers fail to see the difference between competitor products and yours. Price too low and profit can suffer. Use machine learning to price products consistently based on the key product features that your customers care about.
Learn which features drive pricing and customer purchase decisions
In competitive markets, pricing is based on supply and demand economics. Sellers adjust prices to maximize profitability given market conditions. Use machine learning and explainable ML techniques to interpret the value of features such as brand (luxury vs economy), performance (engine horsepower), age (vehicle year), and more.
Develop price profiles (Appreciation / Depreciation Curves)
Another important concept to products like homes and automobiles is the ability to monitor the effect of time. Homes tend to appreciate and machinery (including automobiles) tend to depreciate in value over time. Use machine learning to develop price curves. We’ll do just that in this tutorial examining the MSRP of vehicles that were manufactured across time.
Depreciation Curve for Dodge Ram 1500 Pickup
Read on to learn how to make this plot
Product Price Tutorial
Onward - To the Product Price Prediction and Hyperparameter Tuning Tutorial.
1.0 Libraries and Data
Load the following libraries.
Next, get the data used for this tutorial. This data set containing Car Features and MSRP was scraped from "Edmunds and Twitter".
Download the Dataset
Car Features and MSRP Dataset
2.0 Data Understanding
Read the data using
read_csv() and use
clean_names() from the
janitor package to clean up the column names.
|46135||BMW||1 Series M||2011||premium unleaded (required)||335||6||MANUAL||rear wheel drive||2||Factory Tuner,Luxury,High-Performance||Compact||Coupe||26||19||3916|
|40650||BMW||1 Series||2011||premium unleaded (required)||300||6||MANUAL||rear wheel drive||2||Luxury,Performance||Compact||Convertible||28||19||3916|
|36350||BMW||1 Series||2011||premium unleaded (required)||300||6||MANUAL||rear wheel drive||2||Luxury,High-Performance||Compact||Coupe||28||20||3916|
|29450||BMW||1 Series||2011||premium unleaded (required)||230||6||MANUAL||rear wheel drive||2||Luxury,Performance||Compact||Coupe||28||18||3916|
|34500||BMW||1 Series||2011||premium unleaded (required)||230||6||MANUAL||rear wheel drive||2||Luxury||Compact||Convertible||28||18||3916|
We can get a sense using some
ggplot2 visualizations and correlation analysis to detect key features in the dataset.
2.1 Engine Horsepower vs MSRP
First, let’s take a look at two interesting factors to me:
- MSRP (vehicle price) - Our target, what customers on average pay for the vehicle, and
- Engine horsepower - A measure of product performance
We can see that there are two distinct groups in the plots. We can inspect a bit closer with
patchwork. I specifically focus on the high end of both groups:
- Vehicles with MSRP greater than $650,000
- Vehicles with Engine HP greater than 350 and MSRP less than $10,000
- The first group makes total sense - Lamborghini, Bugatti and Maybach. These are all super-luxury vehicles.
- The second group is more interesting - It’s BMW 8 Series, Mercedes-Benz 600-Class. This is odd to me because these vehicles normally have a higher starting price.
2.2 Correlation Analysis - correlationfunnel
Next, I’ll use my
correlationfunnel package to hone in on the low price vehicles. I want to find which features correlate most with low prices.
Ah hah! The reduction in price is related to vehicle age. We can see that Vehicle Year less than 2004 is highly correlated with Vehicle Price (MSRP) less than $18,372. This explains why some of the 600 Series Mercedes-Benz vehicles (luxury brand) are in the low price group. Good - I’m not going crazy.
2.3 Engine HP, MSRP by Vehicle Age
Let’s explain this story by redo-ing the visualization from 2.1, this time segmenting by Vehicle Year. I’ll segment by year older than 2000 and newer than 2000.
As Joshua Starmer would say, “Double Bam!” Vehicle Year is the culprit.
3.0 Exploratory Data Analysis
Ok, now that I have a sense of what is going on with the data, I need to figure out what’s needed to prepare the data for machine learning. The data set was webscraped. Datasets like this commonly have issues with missing data, unformatted data, lack of cleanliness, and need a lot of preprocessing to get into the format needed for modeling. We’ll fix that up with a preprocessing pipeline.
Our goal in this section is to identify data issues that need to be corrected. We will then use this information to develop a preprocessing pipeline. We will make heavy use of
DataExplorer, two EDA packages I highly recommend.
3.1 Data Summary - skimr
I’m using the
skim() function to breakdown the data by data type so I can assess missing values, number of uniue categories, numeric value distributions, etc.
- We have missing values in a few features
- We have several categories with low and high unique values
- We have pretty high skew in several categories
Let’s go deeper with
3.2 Missing Values - DataExplorer
Let’s use the
plot_missing() function to identify which columns have missing values. We’ll take care of these missing values using imputation in section 4.
Key Point: We’ll zap missing values and replace with estimated values using imputation.
3.3 Categorical Data - DataExplorer
Let’s use the
plot_bar() function to identify the distribution of categories. We have several columns with a lot of categories that have few values. We can lump these into an “Other” Category.
One category that doesn’t show up in the plot is “market_category”. This has 72 unique values - too many to plot. Let’s take a look.
Hmmm. This is actually a “tag”-style category (where one vehicle can have multiple tags). We can clean this up using term-frequency feature engineering.
- We need to lump categories with few observations
- We can use text-based feature engineering on the market categories that have multiple categories.
3.4 Numeric Data - DataExplorer
plot_histogram() to investigate the numeric data. Some of these features are skewed and others are actually discrete (and best analyzed as categorical data). Skew won’t be much of an issue with tree-based algorithms so we’ll leave those alone (better from an explainability perspective). Discrete features like engine-cylinders and number of doors are better represented as factors.
Here’s a closer look at engine cylinders vs MSRP. Definitely need to encode this one as a factor, which will be better because of the non-linear relationship to price. Note that cars with zero engine cylinders are electric.
- We’ll encode engine cylinders and number of doors as categorical data to better represent non-linearity
- Tree-Based algorithms can handle skew. I’ll leave alone for explainability purposes. Note that you can use transformations like Box Cox to reduce skew for linear algorithms. But this transformation makes it more difficult to explain the results to non-technical stakeholders.
4.0 Machine Learning Strategy
You’re probably thinking, “Wow - That’s a lot of work just to get to this step.”
Yes, you’re right. That’s why good Data Scientists are in high demand. Data Scientists that understand business, know data wrangling, visualization techniques, can identify how to treat data prior to Machine Learning, and communicate what’s going on to the organization - These data scientists get good jobs.
At this point, it makes sense to re-iterate that I have a 4-Course R-Track Curriculum that will turn you into a good Data Scientist in 6-months or less. Yeahhhh!
ML Game Plan
We have a strategy that we are going to use to do what’s called “Nested Cross Validation”. It involves 3 Stages.
Stage 1: Find Parameters
We need to create several machine learning models and try them out. To accomplish we do:
- Initial Splitting - Separate into random training and test data sets
- Preprocessing - Make a pipeline to turn raw data into a dataset ready for ML
- Cross Validation Specification - Sample the training data into 5-splits
- Model Specification - Select model algorithms and identify key tuning parameters
- Grid Specification - Set up a grid using wise parameter choices
- Hyperparameter Tuning - Implement the tuning process
Stage 2: Select Best Model
Once we have the optimal algorithm parameters for each machine learning algorithm, we can move into stage 2. Our goal here is to compare each of the models on “Test Data”, data that were not used during the parameter tuning process. We re-train on the “Training Dataset” then evaluate against the “Test Dataset”. The best model has the best accuracy on this unseen data.
Stage 3: Retrain on the Full Dataset
Once we have the best model identified from Stage 2, we retrain the model using the best parameters from Stage 1 on the entire dataset. This gives us the best model to go into production with.
Ok, let’s get going.
5.0 Stage 1 - Preprocessing, Cross Validation, and Tuning
Time for machine learning! Just a few more steps and we’ll make and tune high-accuracy models.
5.1 Initial Train-Test Split - rsample
The first step is to split the data into Training and Testing sets. We use an 80/20 random split with the
initial_split() function from
set.seed() function is used for reproducibility. Note that we have 11,914 cars (observations) of which 9,532 are randomly assigned to training and 2,382 are randomly assigned to testing.
5.2 Preprocessing Pipeline - recipes
What the heck is a preprocessing pipeline? A “preprocessing pipeline” (aka a “recipe”) is a set of preprocessing steps that transform raw data into data formatted for machine learning. The key advantage to a preprocessing pipeline is that it can be re-used on new data. So when you go into production, you can use the recipe to process new incoming data.
Remember in Section 3.0 when we used EDA to identify issues with our data? Now it’s time to fix those data issues using the Training Data Set. We use the
recipe() function from the
recipes package then progressively add
step_* functions to transform the data.
The recipe we implement applies the following transformations:
- Encoding Character and Discrete Numeric data to Categorical Data.
- Text-Based Term Frequency Feature Engineering for the Market Category column
- Consolidate low-frequency categories
- Impute missing data using K-Nearest Neighbors with 5-neighbors (kNN is a fast and accurate imputation method)
- Remove unnecessary columns (e.g. model)
The preprocessing recipe hasn’t yet changed the data. We’ve just come up with the
recipe. To transform the data, we use
bake(). I create a new variable to hold the preprocessed training dataset.
We can use
DataExplorer to verify that the dataset has been processed. First, let’s inspect the Categorical Features.
- The category distributions have been fixed - now “other” category present lumping together infrequent categories.
- The text feature processing has added several new columns beginning with “tf_market_category_”.
- Number of doors and engine cylinders are now categorical.
We can review the Numeric Features. The remaining numeric features have been left alone to preserve explainability.
5.3 Cross Validation Specification - rsample
Now that we have a preprocessing recipe and the initial training / testing split, we can develop the Cross Validation Specification. Standard practice is to use either a 5-Fold or 10-Fold cross validation:
- 5-Fold: I prefer 5-fold cross validation to speed up results by using 5 folds and an 80/20 split in each fold
- 10-Fold: Others prefer a 10-fold cross validation to use more training data with a 90/10 split in each fold. The downside is that this calculation requires twice as many models as 5-fold, which is already an expensive (time consuming) operation.
To implement 5-Fold Cross Validation, we use
vfold_cv() from the
rsample package. Make sure to use your training dataset (
training()) and then apply the preprocessing recipe using
bake() before the
vfold_cv() cross validation sampling. You now have specified the 5-Fold Cross Validation specification for your training dataset.
5.4 Model Specifications - parnsip
We’ll specify two competing models:
glmnet- Uses an Elastic Net, that combines the LASSO and Ridge Regression techniques. This is a linear algorithm, which can have difficulty with skewed numeric data, which is present in our numeric features.
xgboost- A tree-based algorithm that uses gradient boosted trees to develop high-performance models. The tree-based algorithms are not sensitive to skewed numeric data, which can easily be sectioned by the tree-splitting process.
5.4.1 glmnet - Model Spec
We use the
linear_reg() function from
parsnip to set up the initial specification. We use the
tune() function from
tune to identify the tuning parameters. We use
set_engine() from parsnip to specify the engine as the
5.4.2 xgboost - Model Spec
A similar process is used for the XGBoost model. We specify
boost_tree() and identify the tuning parameters. We set the engine to
xgboost library. Note that an update to the
xgboost library has changed the default objective from
reg:squarederror. I’m specifying this by adding a
objective argument in
set_engine() that get’s passed to the underlying
xgb.train(params = list([goes here])).
5.5 Grid Specification - dials
Next, we need to set up the grid that we plan to use for Grid Search. Grid Search is the process of specifying a variety of parameter values to be used with your model. The goal is to to find which combination of parameters yields the best accuracy (lowest prediction error) for each model.
We use the
dials package to setup the hyperparameters. Key functions:
parameters()- Used to specify ranges for the tuning parameters
grid_***- Grid functions including max entropy, hypercube, etc
5.5.1 glmnet - Grid Spec
glmnet model, we specify a parameter set,
parameters(), that includes
Next, we use the
grid_max_entropy() function to make a grid of 20 values using the
parameters. I use
set.seed() to make this random process reproducible.
Because this is a 2-Dimensional Hyper Paramater Space (only 2 tuning parameters), we can visualize what the
grid_max_entropy() function did. The grid selections were evenly spaced out to create uniformly distributed hyperparameter selections. Note that the
penalty parameter is on the Log Base 10-scale by default (refer to the
dials::penalty() function documentation). This functionality results in smarter choices for critical parameters, a big benefit of using the
5.5.2 xgboost - Grid Spec
We can follow the same process for the
xgboost model, specifying a parameter set using
parameters(). The tuning parameters we select for are grid are made super easy using the
learn_rate() functions from
Next, we set up the grid space. Because this is now a 3-Dimensional Hyperparameter Space, I up the size of the grid to 30 points. Note that this will drastically increase the time it takes to tune the models because the
xgboost algorithm must be run 30 x 5 = 150 times. Each time it runs with 1000 trees, so we are talking 150,000 tree calculations. My point is that it will take a bit to run the algorithm once we get to Section 5.6 Hyperparameter Tuning.
5.6 Hyper Parameter Tuning - tune
Now that we have specified the recipe, models, cross validation spec, and grid spec, we can use
tune() to bring them all together to implement the Hyperparameter Tuning with 5-Fold Cross Validation.
5.6.1 glmnet - Hyperparameter Tuning
Tuning the model using 5-fold Cross Validation is straight-forward with the
tune_grid() function. We specify the
The only piece that I haven’t explained is the
metrics. These come from the
yardstick package, which has functions including
rsq() for calculating regression accuracy. We can specify any number of these using the
metric_set(). Just make sure to use only regression metrics since this is a regression problem. For classification, you can use all of the normal measures like AUC, Precision, Recall, F1, etc.
show_best() function to quickly identify the best hyperparameter values.
A key observation is that the Mean Absolute Error (MAE) is $16,801, meaning the model is performing poorly. This is partly because we left the numeric features untransformed. Try updating the recipe with
step_boxcox() and see if you can do better. Note that your MSRP will be transformed so you need to invert the MAE to the correct scale by finding the power for the box-cox transformation. But I digress.
5.6.2 xgboost - Hyperparameter Tuning
Follow the same tuning process for the
xgboost model using
Warning: This takes approximately 20 minutes to run with 6-core parallel backend. I have recommendations to speed up this at the end of the article.
We can see the best
xgboost tuning parameters with
A key observation is that the Mean Absolute Error (MAE) is $3,784, meaning the
xgboost model is performing about 5X better than the
glmnet. However, we won’t know for sure until we move to Stage 2, Evaluation.
6.0 Stage 2 - Compare and Select Best Model
Feels like a whirlwind to get to this point. You’re doing great. Just a little bit more to go. Now, let’s compare our models.
The “proper” way to perform model selection is not to use the cross validation results because in theory we’ve optimized the results to the data the training data. This is why we left out the testing set by doing initial splitting at the beginning.
Now, do I agree with this? Let me just say that normally the Cross Validation Winner is the true winner.
But for the sake of showing you the correct way, I will continue with the model comparison.
6.1 Select the Best Parameters
select_best() function from the
tune package to get the best parameters. We’ll do this for both the
6.2 Finalize the Models
Next, we can use the
finalize_model() function to apply the best parameters to each of the models. Do this for both the
6.3 Calculate Performance on Test Data
We can create a helper function,
calc_test_metrics(), to calculate the performance on the test set.
6.3.1 glment - Test Performance
calc_test_metrics() function to calculate the test performance on the
6.3.2 xgboost - Test Performance
calc_test_metrics() function to calculate the test performance on the
6.4 Model Winner!
xgboost model had an MAE of $3,209 on the test data set. We select this model with the following parameters to move forward.
7.0 Stage 3 - Final Model (Full Dataset)
Now that we have a winner from Stage 2, we move forward with the
xgboost model and train on the full data set. This gives us the best possible model to move into production with.
fit() function from
parsnip to train the final xgboost model on the full data set. Make sure to apply the preprocessing recipe to the original
8.0 Making Predictions on New Data
We’ve went through the full analysis and have a “production-ready” model. Now for some fun - let’s use it on some new data. To avoid repetitive code, I created the helper function,
predict_msrp(), to quickly apply my final model and preprocessing recipe to new data.
8.1 What does a
I’ll simulate a 2008 Dodge Ram 1500 Pickup and calculate the predicted price.
8.2 What’s the “Luxury” Effect?
Let’s play with the model a bit. Next, let’s change the
market_category from “N/A” to “Luxury”. The price just jumped from $30K to $50K.
8.3 How Do Prices Vary By Model Year?
Let’s see how our XGBoost model views the effect of changing the model year. First, we’ll create a dataframe of Dodge Ram 1500 Pickups with the only feature that changes is the model year.
Next, we can use
ggplot2 to visualize this “Year” effect, which is essentially a Depreciation Curve. We can see that there’s a huge depreciation going from models built in 1990’s vs 2000’s and earlier.
8.4 Which Features Are Most Important?
Finally, we can check which features have the most impact on the MSRP using the
vip() function from the
vip (variable importance) package.
9.0 Conclusion and Next Steps
tune package presents a much needed tool for Hyperparameter Tuning in the
tidymodels ecosystem. We saw how useful it was to perform 5-Fold Cross Validation, a standard in improving machine learning performance.
An advanced machine learning package that I’m a huge fan of is
h2o. H2O provides a automatic machine learning, which takes a similar approach (minus the Stage 2 - Comparison Step) by automating the cross validation and hyperparameter tuning process. Because H2O is written in Java, H2O is much faster and more scalable, which is great for large-scale machine learning projects. If you are interested in learning
h2o, I recommend my 4-Course R-Track for Business Program - The 201 Course teaches Advanced ML with H2O.
The last step is to productionalize the model inside a
Shiny Web Application. I teach two courses on production with
ShinyDashboards - Build 2 Predictive Web Dashboards
AWS- Build Full-Stack Web Applications in the Cloud
Shiny courses are part of the 4-Course R-Track for Business Program.