We’re on the final day of Business Science Demo Week. Today we are demoing the h2o package for machine learning on time series data. What’s Demo Week? Every day this week we are demoing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today you’ll see how we can use timetk + h2o to get really accurate time series forecasts. Here we go!
The h2o package is a product offered by H2O.ai that contains a number of cutting edge machine learning algorithms, performance metrics, and auxiliary functions to make machine learning both powerful and easy. One of the main benefits of H2O is that it can be deployed on a cluster (this will not be discussed today). From the R perspective, there are four main uses:
Data Manipulation: Merging, grouping, pivoting, imputing, splitting into training/test/validation sets, etc.
Machine Learning Algorithms: Very sophisticated algorithms in both supervised and unsupervised categories. Supervised include deep learning (neural networks), random forest, generalized linear model, gradient boosting machine, naive Bayes, stacked ensembles, and xgboost. Unsupervised include generalized low rank models, k-means and PCA. There’s also Word2vec for text analysis. The latest stable release also has AutoML: automatic machine learning, which is really cool as we’ll see in this post!
Auxiliary ML Functionality: Performance analysis and grid hyperparameter search
Production, Map/Reduce and Cloud: Capabilities for productionizing into Java environments, cluster deployment with Hadoop / Spark (Sparkling Water), deploying in cloud environments (Azure, AWS, Databricks, etc)
Sticking with the theme for the week, we’ll go over how h2o can be used for time series machine learning as an advanced algorithm. We’ll use h2o locally to develop a high accuracy time series model on the same data set (beer_sales_tbl) from the timetk and sweep posts. This is a supervised regression problem.
We’ll need three libraries today:
h2o: Awesome machine learning library
tidyquant: For getting data and loading the tidyverse behind the scenes
timetk: Toolkit for working with time series in R
IMPORTANT FOR INSTALLING H2O
For h2o, you must install the latest stable release. Select H2O » Latest Stable Release » Install in R. Then follow the instructions exactly.
Installing Other Packages
If you haven’t done so already, install the timetk and tidyquant packages from CRAN using install.packages().
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, and it also helps us identify where to split the data into training, validation and test sets.
Now that you have a feel for the time series we’ll be working with today, let’s move onto the demo!
DEMO: h2o + timetk, Time Series Machine Learning
We’ll follow a similar workflow for time series machine learning from the timetk + linear regression post on Tuesday. However, this time we’ll swap out the lm() function for h2o.automl() to get superior accuracy!
Time Series Machine Learning
Time series machine learning is a great way to forecast time series data, but before we get started here are a couple pointers for this demo:
Key Insight: The time series signature (timestamp information expanded column-wise into a feature set) is used to perform machine learning.
Objective: We’ll predict the next 8 months of data for 2017 using the time series signature. We’ll then compare the results to the two prior demos that predicted the same data using different methods: timetk + lm() (linear regression) and sweep + auto.arima() (ARIMA).
We’ll go through a workflow that can be used to perform time series machine learning.
Step 0: Review data
Just to show our starting point, let’s print out our beer_sales_tbl. We use glimpse() to take a quick peek at the data.
Step 1: Augment Time Series Signature
The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame. We’ll again use glimpse() for quick inspection. See how there are now 30 features. Not all will be important, but some will.
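To make the idea concrete, here is a minimal base R sketch that expands a date column into a handful of calendar features by hand. It only illustrates the concept: it is not how tk_augment_timeseries_signature() is implemented, and it produces far fewer columns than timetk does. The data frame `sales_tbl`, its `sales` column and all values are made up for the example.

```r
# Hypothetical monthly series (made-up values), standing in for beer_sales_tbl
sales_tbl <- data.frame(
  date  = seq(as.Date("2016-01-01"), by = "month", length.out = 6),
  sales = c(100, 110, 125, 130, 140, 150)
)

# Expand the timestamp column-wise into simple calendar features
lt <- as.POSIXlt(sales_tbl$date)
signature_tbl <- data.frame(
  sales_tbl,
  index.num = as.numeric(sales_tbl$date) * 86400,  # seconds since 1970-01-01
  year      = lt$year + 1900,
  quarter   = lt$mon %/% 3 + 1,
  month     = lt$mon + 1,
  month.lbl = factor(month.name[lt$mon + 1], levels = month.name, ordered = TRUE),
  wday      = lt$wday + 1                          # 1 = Sunday
)

str(signature_tbl)
```

Every new column is derived from the timestamp alone, which is why the same features can also be generated for future dates when forecasting.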
Step 2: Prep the Data for H2O
We need to prepare the data in a format for H2O. First, let’s remove any unnecessary columns such as dates or those with missing values, and change the ordered classes to plain factors. We prefer dplyr operations for these steps.
Let’s split the data into training, validation and test sets following the time ranges in the visualization above.
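The cleaning and splitting in this step can be sketched as follows. We prefer dplyr for this, but the sketch uses base R so that it runs without extra packages. The data frame `augmented_tbl`, its columns and its values are made up; only the split years (before 2016, 2016, and 2017) follow the post.

```r
# Hypothetical augmented data (made-up values): one row per month, 2010-01 to 2017-08
dates <- seq(as.Date("2010-01-01"), as.Date("2017-08-01"), by = "month")
lt <- as.POSIXlt(dates)
augmented_tbl <- data.frame(
  date      = dates,
  sales     = 1000 + seq_along(dates),
  year      = lt$year + 1900,
  month.lbl = factor(month.name[lt$mon + 1], levels = month.name, ordered = TRUE),
  diff      = c(NA, diff(as.numeric(dates)))  # first value is NA, like a lagged difference
)

# 1. Drop date columns and any column containing missing values
is_date   <- vapply(augmented_tbl, inherits, logical(1), what = "Date")
has_na    <- vapply(augmented_tbl, anyNA, logical(1))
clean_tbl <- augmented_tbl[, !is_date & !has_na, drop = FALSE]

# 2. Convert ordered factors to plain (unordered) factors
is_ord <- vapply(clean_tbl, is.ordered, logical(1))
clean_tbl[is_ord] <- lapply(clean_tbl[is_ord], function(x) factor(as.character(x)))

# 3. Split by time, never at random: older data trains, newer data validates and tests
train_tbl <- clean_tbl[clean_tbl$year <  2016, ]
valid_tbl <- clean_tbl[clean_tbl$year == 2016, ]
test_tbl  <- clean_tbl[clean_tbl$year == 2017, ]
```

Splitting on the calendar rather than sampling rows keeps future observations out of the training set, which is what we want when the goal is forecasting.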
Step 3: Model with H2O
First, fire up h2o. This will initialize the Java Virtual Machine (JVM) that H2O uses locally.
We change our data to an H2OFrame object that can be interpreted by the h2o package.
Set the names that h2o will use as the target and predictor variables.
Apply any regression model to the data. We’ll use h2o.automl() with the following arguments:
x = x: The names of our feature columns.
y = y: The name of our target column.
training_frame = train_h2o: Our training set consisting of data from 2010 to the start of 2016.
validation_frame = valid_h2o: Our validation set consisting of data in the year 2016. H2O uses this to ensure the model does not overfit the data.
leaderboard_frame = test_h2o: The models get ranked based on MAE performance against this set.
max_runtime_secs = 60: We supply this to speed up H2O’s modeling. The algorithm has a large number of complex models so we want to keep things moving at the expense of some accuracy.
stopping_metric = "deviance": Use deviance as the stopping metric, which provides very good results for MAPE.
Next we extract the leader model.
Step 4: Predict
Generate predictions using h2o.predict() on the test data.
Step 5: Evaluate Performance
There are a few ways to evaluate performance. We’ll go through the easy way, which is h2o.performance(). This yields a set of preset values that are commonly used to compare regression models, including root mean squared error (RMSE) and mean absolute error (MAE).
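Steps 3 through 5 up to this point can be sketched end to end as shown here. This is a minimal sketch on synthetic data, not the exact code behind the results in this post: the helper `make_tbl()`, the `sales` column and all values are made up, and it assumes h2o and a Java runtime are installed. Depending on your h2o version, validation_frame may only be used when cross-validation is turned off (nfolds = 0), so check the h2o.automl() documentation for your release.

```r
library(h2o)

# Hypothetical data (made-up values); in the post these tables come from Step 2
set.seed(123)
make_tbl <- function(years) {
  grid <- expand.grid(month = 1:12, year = years)
  data.frame(
    year      = grid$year,
    month.lbl = factor(month.name[grid$month], levels = month.name),
    sales     = 1000 + 50 * (grid$year - 2010) +
                100 * sin(2 * pi * grid$month / 12) +
                rnorm(nrow(grid), sd = 10)
  )
}
train_tbl <- make_tbl(2010:2015)
valid_tbl <- make_tbl(2016)
test_tbl  <- make_tbl(2017)[1:8, ]   # first 8 months of 2017

h2o.init()          # start (or connect to) a local H2O instance on the JVM
h2o.no_progress()   # silence progress bars

# Convert the data frames to H2OFrame objects
train_h2o <- as.h2o(train_tbl)
valid_h2o <- as.h2o(valid_tbl)
test_h2o  <- as.h2o(test_tbl)

# Names of the target and predictor columns
y <- "sales"
x <- setdiff(names(train_h2o), y)

# Step 3: let AutoML search over models for up to 60 seconds
automl_models_h2o <- h2o.automl(
  x = x,
  y = y,
  training_frame    = train_h2o,
  validation_frame  = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs  = 60,
  stopping_metric   = "deviance"
)
automl_leader <- automl_models_h2o@leader   # best model on the leaderboard

# Step 4: predict on the test set (returns an H2OFrame with a "predict" column)
pred_h2o <- h2o.predict(automl_leader, newdata = test_h2o)

# Step 5: preset regression metrics on the test set
perf_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
h2o.rmse(perf_h2o)
h2o.mae(perf_h2o)
```

Because AutoML is time-boxed, the leader model and its metrics can differ from run to run and from machine to machine.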
Our preference for this assessment is mean absolute percentage error (MAPE), which is not included above. However, we can easily calculate it. We can investigate the error on our test set (actuals vs predictions).
For comparison’s sake, we can calculate a few residual metrics.
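As a minimal sketch of those calculations, here are MAPE and a few other residual metrics computed in base R. The `actual` and `pred` vectors are made-up numbers chosen to keep the arithmetic easy to follow; they are not the results from this post.

```r
# Hypothetical actual and predicted values (made-up numbers, not the post's results)
actual <- c(100, 200, 400, 500)
pred   <- c(110, 190, 380, 500)

error     <- actual - pred     # residuals
error_pct <- error / actual    # residuals as a fraction of the actual value

metrics <- c(
  me   = mean(error),          # mean error (bias)
  rmse = sqrt(mean(error^2)),  # root mean squared error
  mae  = mean(abs(error)),     # mean absolute error
  mape = mean(abs(error_pct)), # mean absolute percentage error
  mpe  = mean(error_pct)       # mean percentage error
)
round(metrics, 4)
```

With real forecasts, `actual` would be the test set target and `pred` the model predictions converted back to a regular R vector.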
And The Winner of Demo Week Is…
The MAPE for the combination of h2o + timetk is superior to the two prior demos:
A question for the interested reader to figure out: What happens to the accuracy when you average the predictions of all three different methods? Try it to find out.
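If you want a starting point, the mechanics of averaging are simple. The sketch below uses three made-up prediction vectors (the names `pred_h2o`, `pred_lm` and `pred_arima` and all numbers are hypothetical), so it shows only how to combine forecasts, not what the answer is for the beer sales data.

```r
# Hypothetical forecasts from three methods for the same four periods (made-up numbers)
pred_h2o   <- c(100, 110, 120, 130)
pred_lm    <- c(104, 108, 123, 127)
pred_arima <- c( 96, 112, 117, 133)

# Element-wise mean across methods: one averaged forecast per period
pred_avg <- rowMeans(cbind(pred_h2o, pred_lm, pred_arima))
pred_avg
```

The averaged vector can then be scored against the actual values with the same MAPE calculation used for the individual models.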
Note that the accuracy of time series machine learning may not always be superior to ARIMA and other forecast techniques including those implemented by prophet and GARCH methods. The data scientist has a responsibility to test different methods and to select the right tool for the job.
HaLLowEen TRick oR TrEat BoNuS!
We are going to visualize the forecast compared to the actual values, but this time taking a cue from @lenkiefer’s theme_spooky described in one of his recent posts, Mortgage Rates are Low!
We’re going to need to load a few libraries to get set up. The biggest challenge is the fonts, but there’s a really cool package called extrafont that we can use. We’ll use extrafont to load the Chiller font. Load the bonus library.
We have a busy couple of weeks. In addition to Demo Week, we have:
Facebook LIVE DataTalk
Matt was recently hosted on Experian DataLabs live webcast, #DataTalk, where he spoke about Machine Learning in Human Resources. The talk already has 80K+ views and is growing!! Check it out if you are interested in #rstats, #hranalytics and #MachineLearning.
The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.
About Business Science
Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!