Demo Week: Time Series Machine Learning with timetk
Written by Matt Dancho
We’re into the second day of Business Science Demo Week. What’s demo week? Every day this week we are demoing an R package:
tibbletime (Thursday) and
h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Second up is
timetk, your toolkit for time series in R. Here we go!
timetk: What’s It Used For?
There are three main uses:
Time series machine learning: Using regression algorithms to forecast
Making future time series indices: Extract, explore, and extend a time series index using patterns in the time-base
Coercing (converting) between time classes (e.g. between
ts): Consistent coercion makes working in the various time classes much easier!
We’ll go over time series ML and coercion today. The second use (extracting and extending a time series index) is covered within the time series ML workflow, because a correct future index is critical to prediction accuracy.
We’ll need two libraries today:
tidyquant: For getting data and loading the tidyverse behind the scenes
timetk: Toolkit for working with time series in R
If you haven’t done so already, install the packages:
Load the libraries.
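A minimal sketch of the install-and-load step, assuming both packages are available from CRAN. The guarded install is a convenience added here, not something the post prescribes.

```r
# Install whichever of the two packages is not yet available (assumes CRAN access)
pkgs <- c("tidyquant", "timetk")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load the libraries
library(tidyquant)
library(timetk)
```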
We’ll get data using the
tq_get() function from
tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.
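As a sketch of the download step, the call might look like the following. The FRED series code S4248SM144NCEN and the 2010 to 2016 date range are assumptions (neither is stated in the text above), and get_beer_sales() is a hypothetical helper name used only to keep the network call in one place.

```r
library(tidyquant)
library(dplyr)

# Hypothetical helper: download monthly sales from FRED via tidyquant.
# The series code and the default dates are assumptions, not taken from the text.
get_beer_sales <- function(from = "2010-01-01", to = "2016-12-31") {
  tq_get("S4248SM144NCEN", get = "economic.data", from = from, to = to) %>%
    select(date, price)  # keep only the date index and the value column
}

# Requires internet access:
# beer_sales_tbl <- get_beer_sales()
```

Selecting only date and price keeps the result a plain two-column tibble, which is the shape the later steps expect.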
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting (as we’ll see during time series machine learning). We’ll use
tidyquant charting tools: mainly
geom_ma(ma_fun = SMA, n = 12) to add a 12-period simple moving average to get an idea of the trend. We can also see there appears to be both trend (moving average is increasing in a relatively linear pattern) and some seasonality (peaks and troughs tend to occur at specific months).
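The chart described above could be sketched roughly as follows. To stay self-contained, the example uses a synthetic monthly series (hypothetical values, not the FRED data) stored in an assumed object named sales_tbl; the styling choices are illustrative only.

```r
library(tidyquant)
library(ggplot2)
library(dplyr)

# Synthetic stand-in for the sales data: linear trend plus a 12-month cycle
set.seed(123)
n <- 84
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 8000 + 40 * seq_len(n) + 1500 * sin(2 * pi * seq_len(n) / 12) +
    rnorm(n, sd = 200)
)

# Line chart with a 12-period simple moving average to expose the trend
p <- ggplot(sales_tbl, aes(x = date, y = price)) +
  geom_line(color = "grey40") +
  geom_point(size = 1) +
  geom_ma(ma_fun = SMA, n = 12) +
  labs(title = "Monthly sales with 12-month SMA", x = NULL, y = "Sales")
p
```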
Now that you have a feel for the time series we’ll be working with today, let’s move on to the demo!
We’ve split this demo into two parts. First, we’ll follow a workflow for time series machine learning. Second, we’ll check out coercion tools.
Part 1: Time Series Machine Learning
Time series machine learning is a great way to forecast time series data, but before we get started here are a couple of pointers for this demo:
Key Insight: The time series signature (timestamp information expanded column-wise into a feature set) is used to perform machine learning.
Objective: We’ll predict the next 12 months of data for the time series using the time series signature.
We’ll go through a workflow that can be used to perform time series machine learning. You’ll see how several
timetk functions can help with this process. We’ll do machine learning with a simple
lm() linear regression, and you will see how powerful and accurate this can be when a time series signature is used. Further, you should think about what other more powerful machine learning algorithms can be used such as
glmnet (LASSO), and others.
Step 0: Review data
Just to show our starting point, let’s print out our
We can quickly get a feel for the time series using
tk_index() to extract the index and
tk_get_timeseries_summary() to retrieve summary information of the index. We use
glimpse() to output in a nice format for review.
We can see important features like start, end, units, etc. We also have the quantiles of the time-diffs (difference in seconds between observations), which is useful for assessing the degree of regularity. Because the scale is monthly and months vary in length (28 to 31 days), the number of seconds between observations is not constant.
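A short sketch of this review step, using a synthetic monthly tibble (hypothetical values; the object name sales_tbl is an assumption) in place of the downloaded data:

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

# Extract the index, then summarize it
idx         <- tk_index(sales_tbl)
idx_summary <- tk_get_timeseries_summary(idx)
glimpse(idx_summary)

# The gaps between monthly observations, in days, are not all equal
table(as.numeric(diff(idx)))
```

The last line is an extra check added here: tabulating the day gaps makes the irregularity of a monthly index visible directly.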
Step 1: Augment Time Series Signature
The tk_augment_timeseries_signature() function expands the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame.
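As an illustration, augmenting a synthetic monthly tibble (hypothetical values; sales_tbl and sales_tbl_aug are assumed names) might look like this:

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

# Add the time series signature columns to the original data frame
sales_tbl_aug <- tk_augment_timeseries_signature(sales_tbl)

# Inspect the new feature columns (year, month, month.lbl, wday.lbl, ...)
glimpse(sales_tbl_aug)
```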
Step 2: Model
Apply any regression model to the data. We’ll use
lm(). Note that we drop the date and diff columns. Most algorithms do not work with dates, and the diff column is not useful for machine learning (it’s more useful for finding time gaps in the data).
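A minimal sketch of the modeling step on synthetic data (hypothetical values, not the FRED series; object names are assumptions). The formula price ~ . regresses the value on every remaining signature column.

```r
library(timetk)
library(dplyr)

# Synthetic stand-in: linear trend plus a 12-month cycle plus noise
set.seed(123)
n <- 84
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 8000 + 40 * seq_len(n) + 1500 * sin(2 * pi * seq_len(n) / 12) +
    rnorm(n, sd = 200)
)

sales_tbl_aug <- tk_augment_timeseries_signature(sales_tbl)

# Drop the date and diff columns, then fit a linear regression on the rest
fit_lm <- lm(price ~ ., data = select(sales_tbl_aug, -c(date, diff)))
summary(fit_lm)
```

For monthly data several signature columns are constant or collinear (day, hour, minute, and so on), so some coefficients may be reported as NA; lm() simply leaves them out of the fit.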
Step 3: Build Future (New) Data
Use tk_index() to extract the index.
Make a future index from the existing index with
tk_make_future_timeseries(). The function internally checks the periodicity and returns the correct sequence. Note that we have a whole vignette on how to make future time series, which is helpful because the topic is complex.
From the future index, use
tk_get_timeseries_signature() to turn the index into a time series signature data frame.
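The three steps above can be sketched as follows, again on a synthetic monthly index (hypothetical data; object names are assumptions). The example uses the length_out argument from current timetk releases; older versions named this argument n_future.

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

# 1. Extract the existing index
idx <- tk_index(sales_tbl)

# 2. Extend it 12 periods into the future
future_idx <- tk_make_future_timeseries(idx, length_out = 12)

# 3. Expand the future index into a signature data frame
new_data_tbl <- tk_get_timeseries_signature(future_idx)
glimpse(new_data_tbl)
```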
Step 4: Predict the New Data
Use the predict() function for your regression model. Note that we drop the index and diff columns, the same as before when using the
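Putting the earlier steps together, a prediction run might be sketched like this. The data are synthetic (hypothetical values, not the FRED series) and all object names are assumptions.

```r
library(timetk)
library(dplyr)

# Synthetic stand-in: linear trend plus a 12-month cycle plus noise
set.seed(123)
n <- 84
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 8000 + 40 * seq_len(n) + 1500 * sin(2 * pi * seq_len(n) / 12) +
    rnorm(n, sd = 200)
)

# Augment with the signature and fit the model
sales_tbl_aug <- tk_augment_timeseries_signature(sales_tbl)
fit_lm <- lm(price ~ ., data = select(sales_tbl_aug, -c(date, diff)))

# Build the future index and its signature
future_idx   <- tk_make_future_timeseries(tk_index(sales_tbl), length_out = 12)
new_data_tbl <- tk_get_timeseries_signature(future_idx)

# Predict, dropping the same non-feature columns as in training.
# predict() may warn that the fit is rank-deficient for monthly data.
pred <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))
predictions_tbl <- tibble(date = future_idx, value = unname(pred))
predictions_tbl
```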
Step 5: Compare Actual vs Predictions
We can use
tq_get() to retrieve the actual data. Note that we don’t have all of the data for comparison, but we can at least compare the first several months of actual values.
Visualize our forecast.
We can investigate the error on our test set (actuals vs predictions).
And we can calculate a few residual metrics. The mean absolute percentage error (MAPE) is approximately 4.5%, which is pretty good for a simple multiple linear regression. A more complex algorithm could produce more accurate results.
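The residual metrics can be sketched in base R. The helper name forecast_accuracy() and the three-point example vectors are hypothetical; they only illustrate how each metric is computed from actuals and predictions.

```r
# Hypothetical helper: common residual metrics for a forecast
forecast_accuracy <- function(actual, predicted) {
  error     <- actual - predicted
  error_pct <- error / actual * 100
  c(
    me   = mean(error, na.rm = TRUE),           # mean error (bias)
    rmse = sqrt(mean(error^2, na.rm = TRUE)),   # root mean squared error
    mae  = mean(abs(error), na.rm = TRUE),      # mean absolute error
    mape = mean(abs(error_pct), na.rm = TRUE),  # mean absolute percentage error
    mpe  = mean(error_pct, na.rm = TRUE)        # mean percentage error
  )
}

# Illustrative values only, not results from the beer sales data
metrics <- forecast_accuracy(
  actual    = c(100, 200, 400),
  predicted = c(110, 190, 400)
)
round(metrics, 2)
```

MAPE is convenient for comparing forecasts across series with different scales, but it is undefined when an actual value is zero.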
Time series machine learning can produce exceptional forecasts. For those interested in learning more, we have a whole vignette dedicated to time series forecasting using timetk.
Part 2: Coercion
Problem: Switching between various time classes in R is painful and inconsistent.
We are starting with a
tbl object. A disadvantage is that sometimes we would like to convert to an xts object to use xts-based functions from the numerous packages that deal with xts objects (
We can easily convert to an
xts object using
tk_xts(). Notice that
tk_xts() auto-detects the time-based column and uses its values as the index for the xts object.
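A minimal sketch of the conversion, using a synthetic monthly tibble (hypothetical values; sales_tbl and sales_xts are assumed names). Passing silent = TRUE suppresses the message about the dropped date column.

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

# Coerce to xts; the date column is detected and used as the index
sales_xts <- tk_xts(sales_tbl, silent = TRUE)
head(sales_xts)
```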
We can also go from
xts back to
tbl. We tack on
rename_index = "date" to have the index name match what we started with. This used to be very difficult. Notice that
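The round trip from tbl to xts and back might be sketched as follows, on synthetic data (hypothetical values; object names are assumptions):

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

sales_xts <- tk_xts(sales_tbl, silent = TRUE)

# Back to a tibble, naming the index column "date" as in the original data
back_tbl <- tk_tbl(sales_xts, rename_index = "date")
back_tbl
```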
A number of packages use a different time class called
ts. Probably the most popular is the
forecast package. The advantage of using the
tk_ts() function is two-fold:
- It’s consistent with the other
tk_ coercion functions so coercing back and forth is straightforward and easy.
- IMPORTANT: When
tk_ts() is used, the ts-object carries the original irregular time index (usually dates) as an index attribute. This makes keeping date and datetime information possible.
Here’s an example. We can use
tk_ts() to convert to a
ts object. Because the ts-based system only works with regular time series, we need to add the arguments
start = 2010 and
freq = 12.
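A sketch of the ts conversion on synthetic monthly data (hypothetical values; object names are assumptions). The example spells out the full argument name frequency; the shorter freq relies on R's partial argument matching.

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

# Coerce to a regular ts object starting in 2010 with 12 periods per year
sales_ts <- tk_ts(sales_tbl, start = 2010, frequency = 12, silent = TRUE)
sales_ts

# The original date index travels along as an attribute
has_timetk_idx(sales_ts)
```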
There are two ways we can go back to
- Just coerce back using
tk_tbl() and we get the “regular” index as YEARMON data type from
- If the object was created with
tk_ts() and has a
timetk_idx, we can coerce back using
tk_tbl(timetk_idx = TRUE) and we get the original “irregular” index as Date data type.
Method 1: We go back to
tbl. Note that the date column is YEARMON class.
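Method 1 could be sketched like this on synthetic data (hypothetical values; object names are assumptions). With the default timetk_idx = FALSE, the index is rebuilt from the regular ts time base.

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

sales_ts <- tk_ts(sales_tbl, start = 2010, frequency = 12, silent = TRUE)

# Plain coercion back: the index comes from the regular ts time base
regular_tbl <- tk_tbl(sales_ts, rename_index = "date")
class(regular_tbl$date)
```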
Method 2: We go back to
tbl but specify
timetk_idx = TRUE to return original DATE or DATETIME information.
First, you can check to see if the ts-object has a timetk index with
TRUE, then specify
timetk_idx = TRUE during the
tk_tbl() coercion. See that we now have “date” data type. This was previously very difficult to do.
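Method 2 could be sketched as follows on synthetic data (hypothetical values; object names are assumptions). The check with has_timetk_idx() guards the coercion so that timetk_idx = TRUE is only requested when the attribute is present.

```r
library(timetk)
library(dplyr)

# Synthetic monthly data standing in for the sales series
sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = 84),
  price = as.numeric(seq_len(84))
)

sales_ts <- tk_ts(sales_tbl, start = 2010, frequency = 12, silent = TRUE)

# Only ask for the original index if the ts object actually carries one
if (has_timetk_idx(sales_ts)) {
  original_tbl <- tk_tbl(sales_ts, timetk_idx = TRUE, rename_index = "date")
} else {
  original_tbl <- tk_tbl(sales_ts, rename_index = "date")
}
class(original_tbl$date)
```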
We’ve only scratched the surface of
timetk. There’s more to learn including working with time series indices and making future indices. Here are a few resources to help you along the way:
We have a busy couple of weeks. In addition to Demo Week, we have:
On Thursday, October 26 at 7PM EST, Matt will be giving a FREE LIVE #DataTalk on Machine Learning for Recruitment and Reducing Employee Attrition. You can sign up for a reminder at the Experian Data Lab website.
On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.
Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:
The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.