Demo Week: Time Series Machine Learning with h2o and timetk
Written by Matt Dancho
We’re at the final day of Business Science Demo Week. Today we are demo-ing the h2o package for machine learning on time series data. What’s Demo Week? Every day this week we have demoed an R package, wrapping up with tibbletime (Thursday) and h2o (Friday). That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today you’ll see how we can use h2o to get really accurate time series forecasts. Here we go!
h2o: What’s It Used For?
The h2o package is a product offered by H2O.ai that contains a number of cutting-edge machine learning algorithms, performance metrics, and auxiliary functions to make machine learning both powerful and easy. One of the main benefits of H2O is that it can be deployed on a cluster (this will not be discussed today). From the R perspective, there are four main uses:
Data Manipulation: Merging, grouping, pivoting, imputing, splitting into training/test/validation sets, etc.
Machine Learning Algorithms: Very sophisticated algorithms in both supervised and unsupervised categories. Supervised include deep learning (neural networks), random forest, generalized linear model, gradient boosting machine, naive bayes, stacked ensembles, and xgboost. Unsupervised include generalized low rank models, k-means and PCA. There’s also Word2vec for text analysis. The latest stable release also has AutoML: automatic machine learning, which is really cool as we’ll see in this post!
Auxiliary ML Functionality: Performance analysis and grid hyperparameter search
Production, Map/Reduce and Cloud: Capabilities for productionizing into Java environments, cluster deployment with Hadoop / Spark (Sparkling Water), deploying in cloud environments (Azure, AWS, Databricks, etc)
Sticking with the theme for the week, we’ll go over how h2o can be used for time series machine learning as an advanced algorithm. We’ll use h2o locally to develop a high accuracy time series model on the same data set (beer_sales_tbl) from the sweep posts. This is a supervised regression problem.
We’ll need three libraries today:
h2o: Awesome machine learning library
tidyquant: For getting data and loading the tidyverse behind the scenes
timetk: Toolkit for working with time series in R
IMPORTANT FOR INSTALLING H2O
For h2o, you must install the latest stable release. Select H2O » Latest Stable Release » Install in R. Then follow the instructions exactly.
Installing Other Packages
If you haven’t done so already, install the other packages, tidyquant and timetk.
Load the libraries.
We’ll get data using the tq_get() function from tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.
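The download step could look roughly like the sketch below. The helper name get_beer_sales(), the FRED series ID and the end date are assumptions that are not stated in the text; only the 2010 start is implied by the training range used later.

```r
# Hypothetical helper: download the FRED beer, wine and spirits sales series.
# ASSUMPTION: "S4248SM144NCEN" is the FRED series ID for this data set,
# and the default end date is illustrative only.
get_beer_sales <- function(from = "2010-01-01", to = "2017-10-27") {
  tidyquant::tq_get(
    "S4248SM144NCEN",
    get  = "economic.data",  # tells tq_get() to query FRED
    from = from,
    to   = to
  )
}

# Usage (needs an internet connection):
# beer_sales_tbl <- get_beer_sales()
```

The result is expected to be a tibble with a date column and a value column; recent tidyquant versions name the value column price.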
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting, and it’s a good idea to identify spots where we will split the data into training, test and validation sets.
Now that you have a feel for the time series we’ll be working with today, let’s move onto the demo!
DEMO: h2o + timetk, Time Series Machine Learning
We’ll follow a similar workflow for time series machine learning from the timetk + linear regression post on Tuesday. However, this time we’ll swap out the lm() function for h2o.automl() to get superior accuracy!
Time Series Machine Learning
Time series machine learning is a great way to forecast time series data, but before we get started here are a couple pointers for this demo:
Key Insight: The time series signature ~ timestamp information expanded column-wise into a feature set ~ is used to perform machine learning.
Objective: We’ll predict the next 8 months of data for 2017 using the time series signature. We’ll then compare the results to the two prior demos that predicted the same data using different methods: lm() (linear regression) and ARIMA.
We’ll go through a workflow that can be used to perform time series machine learning.
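To make the idea of a time series signature concrete, here is a minimal base R sketch that expands a handful of timestamps into feature columns by hand. The dates and column names are made up for illustration; timetk produces a much richer set of features.

```r
# Hypothetical monthly timestamps (not the beer sales data)
dates <- seq(as.Date("2017-01-01"), by = "month", length.out = 4)

# Expand each timestamp column-wise into simple numeric features
month_num <- as.integer(format(dates, "%m"))
signature <- data.frame(
  date             = dates,
  days_since_epoch = as.numeric(dates),           # a numeric trend feature
  year             = as.integer(format(dates, "%Y")),
  quarter          = (month_num - 1) %/% 3 + 1,   # 1 to 4
  month            = month_num                    # 1 to 12
)

print(signature)
```

Each row keeps its original timestamp, and the new columns give a regression model something numeric to learn trend and seasonality from.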
Step 0: Review data
Just to show our starting point, let’s print out our beer_sales_tbl. We use glimpse() to take a quick peek at the data.
Step 1: Augment Time Series Signature
The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame. We’ll again use glimpse() for quick inspection. See how there are now 30 features. Not all will be important, but some will.
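A sketch of this step, assuming beer_sales_tbl is a data frame with a date column. The wrapper name augment_signature() is hypothetical; the timetk and dplyr calls inside it are the real functions named above.

```r
# Hypothetical wrapper: add the time series signature and inspect the result.
augment_signature <- function(beer_sales_tbl) {
  # Adds columns such as year, quarter, month and week next to the originals
  augmented_tbl <- timetk::tk_augment_timeseries_signature(beer_sales_tbl)
  dplyr::glimpse(augmented_tbl)  # quick column-by-column peek
  augmented_tbl
}

# Usage:
# beer_sales_tbl_aug <- augment_signature(beer_sales_tbl)
```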
Step 2: Prep the Data for H2O
We need to prepare the data in a format for H2O. First, let’s remove any unnecessary columns such as dates or those with missing values, and change the ordered classes to plain factors. We prefer dplyr operations for these steps.
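One possible way to write the clean-up with dplyr is sketched below. The function name clean_for_h2o() is hypothetical, and the scoped verbs select_if() and mutate_if() are used for brevity (newer dplyr code would typically use across() and where()).

```r
# Hypothetical helper: drop date columns and columns with missing values,
# then convert ordered factors to plain (unordered) factors.
clean_for_h2o <- function(tbl) {
  tbl <- dplyr::select_if(tbl, function(col) !inherits(col, "Date"))
  tbl <- dplyr::select_if(tbl, function(col) !anyNA(col))
  dplyr::mutate_if(tbl, is.ordered, function(col) factor(as.character(col)))
}

# Small made-up example
toy_tbl <- data.frame(
  date      = as.Date("2017-01-01") + 0:2,
  price     = c(1, 2, 3),
  gap       = c(NA, 1, 2),
  month.lbl = factor(c("January", "January", "January"), ordered = TRUE)
)
toy_clean <- clean_for_h2o(toy_tbl)
names(toy_clean)  # the date and gap columns are gone
```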
Let’s split the data into training, validation and test sets following the time ranges in the visualization above.
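A minimal sketch of the split, assuming the cleaned data still carries the numeric year column from the time series signature. The cut-offs follow the ranges described in this post (training before 2016, validation in 2016, test in 2017); the helper name split_by_year() is hypothetical.

```r
# Hypothetical helper: split on the year column of the signature
split_by_year <- function(tbl) {
  list(
    train = tbl[tbl$year < 2016, ],
    valid = tbl[tbl$year == 2016, ],
    test  = tbl[tbl$year == 2017, ]
  )
}

# Small made-up example
toy_tbl <- data.frame(year = c(2015, 2016, 2016, 2017), price = 1:4)
splits  <- split_by_year(toy_tbl)
sapply(splits, nrow)
```

Splitting by time rather than at random matters here: the model should only ever be evaluated on observations that come after the ones it was trained on.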
Step 3: Model with H2O
First, fire up h2o. This will initialize the Java Virtual Machine (JVM) that H2O uses locally.
We change our data to an H2OFrame object that can be interpreted by the h2o package.
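Starting H2O and converting the three data sets might look like the following sketch. The wrapper name start_h2o_and_convert() and the names of its arguments are assumptions; it needs a working Java installation to run.

```r
# Hypothetical wrapper: start a local H2O instance and convert the splits.
start_h2o_and_convert <- function(train_tbl, valid_tbl, test_tbl) {
  h2o::h2o.init()         # starts (or connects to) a local H2O instance on the JVM
  h2o::h2o.no_progress()  # optional: turn off progress bars
  list(
    train = h2o::as.h2o(train_tbl),
    valid = h2o::as.h2o(valid_tbl),
    test  = h2o::as.h2o(test_tbl)
  )
}

# Usage:
# frames <- start_h2o_and_convert(train_tbl, valid_tbl, test_tbl)
```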
Set the names that h2o will use as the target and predictor variables.
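Setting the names is plain R. The sketch below uses a small made-up data frame and assumes the target column is called price; the same two lines should work on an H2OFrame, since names() is supported there as well.

```r
# Made-up stand-in for the training data
train_tbl <- data.frame(price = c(1, 2), year = c(2010, 2010), month = c(1, 2))

y <- "price"                        # target column (assumed name)
x <- setdiff(names(train_tbl), y)   # every other column is a predictor
x
```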
Apply any regression model to the data. We’ll use h2o.automl() with the following arguments:
x = x: The names of our feature columns.
y = y: The name of our target column.
training_frame = train_h2o: Our training set consisting of data from 2010 to the start of 2016.
validation_frame = valid_h2o: Our validation set consisting of data in the year 2016. H2O uses this to ensure the model does not overfit the data.
leaderboard_frame = test_h2o: The models get ranked based on MAE performance against this set.
max_runtime_secs = 60: We supply this to speed up H2O’s modeling. The algorithm has a large number of complex models so we want to keep things moving at the expense of some accuracy.
stopping_metric = "deviance": Use deviance as the stopping metric, which provides very good results for MAPE.
Next we extract the leader model.
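Putting the arguments above together, the modeling step could be sketched as follows. The wrapper name run_automl() is hypothetical; the argument names passed to h2o.automl() are the ones listed above. Behavior may differ between H2O releases, so treat the call as a starting point rather than a recipe.

```r
# Hypothetical wrapper: run AutoML and return the best model.
run_automl <- function(x, y, train_h2o, valid_h2o, test_h2o) {
  automl_models_h2o <- h2o::h2o.automl(
    x                 = x,
    y                 = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 60,
    stopping_metric   = "deviance"
  )
  automl_models_h2o@leader  # the top-ranked model on the leaderboard
}

# Usage:
# automl_leader <- run_automl(x, y, train_h2o, valid_h2o, test_h2o)
```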
Step 4: Predict
Generate predictions using h2o.predict() on the test data.
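A sketch of the prediction step, assuming automl_leader is the leader model and test_h2o is the test H2OFrame. The wrapper name predict_test() is hypothetical.

```r
# Hypothetical wrapper: predict on the test set and return a numeric vector.
predict_test <- function(automl_leader, test_h2o) {
  pred_h2o <- h2o::h2o.predict(automl_leader, newdata = test_h2o)
  # For regression, the H2OFrame holds a single column named "predict"
  as.data.frame(pred_h2o)$predict
}

# Usage:
# pred <- predict_test(automl_leader, test_h2o)
```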
Step 5: Evaluate Performance
There are a few ways to evaluate performance. We’ll go through the easy way, which is h2o.performance(). This yields a preset list of values that are commonly used to compare regression models, including root mean squared error (RMSE) and mean absolute error (MAE).
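The evaluation call might look like this sketch. The wrapper name evaluate_on_test() is hypothetical; h2o.rmse() and h2o.mae() are accessor functions for pulling individual metrics out of the performance object.

```r
# Hypothetical wrapper: compute test-set metrics for the leader model.
evaluate_on_test <- function(automl_leader, test_h2o) {
  perf <- h2o::h2o.performance(automl_leader, newdata = test_h2o)
  c(rmse = h2o::h2o.rmse(perf), mae = h2o::h2o.mae(perf))
}

# Usage:
# evaluate_on_test(automl_leader, test_h2o)
```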
Our preference for this assessment is mean absolute percentage error (MAPE), which is not included above. However, we can easily calculate it by investigating the error on our test set (actuals vs predictions).
For comparison’s sake, we can also calculate a few other residual metrics.
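The residual metrics are simple enough to compute in base R. The sketch below uses made-up numbers, not the beer sales results, and the function name forecast_errors() is hypothetical.

```r
# Hypothetical helper: summary metrics from actuals and predictions.
forecast_errors <- function(actual, pred) {
  error     <- actual - pred
  error_pct <- error / actual
  c(
    me   = mean(error),             # mean error (bias)
    rmse = sqrt(mean(error^2)),     # root mean squared error
    mae  = mean(abs(error)),        # mean absolute error
    mape = mean(abs(error_pct)),    # mean absolute percentage error
    mpe  = mean(error_pct)          # mean percentage error
  )
}

# Made-up example: two actual values and two predictions
metrics <- forecast_errors(actual = c(100, 200), pred = c(110, 190))
metrics
```

In this toy case the two errors are -10 and +10, so the MAE and RMSE are both 10 and the MAPE is 0.075 (7.5%).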
And The Winner of Demo Week Is…
The MAPE for the combination of h2o + timetk is superior to the two prior demos:
- timetk + h2o: MAPE = 3.9% (This demo)
- timetk + linear regression: MAPE = 4.3% (timetk demo)
- sweep + ARIMA: MAPE = 4.3% (sweep demo)
A question for the interested reader to figure out: What happens to the accuracy when you average the predictions of all three different methods? Try it to find out.
Note that the accuracy of time series machine learning may not always be superior to ARIMA and other forecast techniques including those implemented by prophet and GARCH methods. The data scientist has a responsibility to test different methods and to select the right tool for the job.
HaLLowEen TRick oR TrEat BoNuS!
We are going to visualize the forecast compared to the actual values, but this time taking a cue from @lenkiefer’s theme_spooky described in one of his recent posts, Mortgage Rates are Low!
We’re going to need to load a few libraries to get set up. The biggest challenge is the fonts, but there’s a really cool package called extrafont that we can use. We’ll use extrafont to load the Chiller fontset. Load the bonus library.
Next, you’ll need to set up the Chiller font. Revolution Analytics has a great article, How to Use Your Favorite Fonts in R Charts, which will get you up and running with extrafont. IMPORTANT: Make sure you go through the process of importing your system fonts first.
Once the fonts are imported, you can load them into your R session.
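With extrafont, the import and load steps might be sketched as below. The wrapper name setup_chiller_font() is hypothetical, and device = "win" assumes a Windows machine, which is where the Chiller font normally ships.

```r
# Hypothetical wrapper: import system fonts once, then load them.
setup_chiller_font <- function() {
  extrafont::font_import()              # one-time and slow; asks for confirmation
  extrafont::loadfonts(device = "win")  # register fonts with the Windows device
  "Chiller" %in% extrafont::fonts()     # TRUE if the font is now available
}

# Usage:
# setup_chiller_font()
```

The import only needs to happen once per machine; later sessions only need the loadfonts() call.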
We’ll use Len’s script for theme_spooky(). I highly encourage you to use theme_spooky() all month of October around the office. Very spooky, and surprisingly engaging. :)
Now let’s create the final visualization so we can see our spooky forecast… Conclusion from the plot: It’s scary how accurate h2o is!
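A plain ggplot2 sketch of the forecast plot is shown below. It does not reproduce Len’s theme_spooky(); the function name plot_forecast() and the column names date, price and pred are assumptions about how the actuals and predictions are stored.

```r
# Hypothetical helper: actual values as a line, predictions as points.
# ASSUMPTION: actual_tbl has columns date and price;
#             pred_tbl has columns date and pred.
plot_forecast <- function(actual_tbl, pred_tbl) {
  ggplot2::ggplot(actual_tbl, ggplot2::aes(x = date, y = price)) +
    ggplot2::geom_line() +
    ggplot2::geom_point(
      data    = pred_tbl,
      mapping = ggplot2::aes(y = pred),
      colour  = "orange"
    ) +
    ggplot2::labs(
      title = "Beer Sales Forecast: h2o + timetk",
      x     = "",
      y     = "Sales"
    )
}

# Usage:
# plot_forecast(beer_sales_tbl, pred_tbl)  # add theme_spooky() once it is defined
```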
We’ve only scratched the surface of h2o. There’s more to learn, including working with classifiers and unsupervised learning.
We have a busy couple of weeks. In addition to Demo Week, we have:
Facebook LIVE DataTalk
Matt was recently hosted on Experian DataLabs live webcast, #DataTalk, where he spoke about Machine Learning in Human Resources. The talk already has 80K+ views and is growing!! Check it out if you are interested in #rstats, #hranalytics and #MachineLearning.
On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.
Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:
The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.
About Business Science
Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!