Demo Week: Tidy Forecasting with sweep
Written by Matt Dancho
We’re into the third day of Business Science Demo Week. Hopefully by now you’re getting a taste of some interesting and useful packages. For those that may have missed it, every day this week we are demo-ing an R package, with tibbletime (Thursday) and h2o (Friday) still to come. That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today is sweep, which has broom-style tidiers for forecasting. Let’s get going!
sweep: What’s It Used For?
sweep is used for tidying the forecast package workflow. What broom is to the stats package, sweep is to the forecast package. It has useful functions including sw_tidy(), sw_glance(), sw_augment(), and sw_sweep(). We’ll check out each in this demo.
An added benefit of combining sweep with timetk is that if the ts-objects are created from time-based tibbles (tibbles with a date or datetime index), the date or datetime information is carried through the forecasting process as a timetk index attribute. Bottom Line: This means we can finally use dates when forecasting as opposed to the regularly spaced numeric dates that the ts-system uses!
We’ll need four libraries today:
- sweep: For tidying the forecast package (what broom is to stats, sweep is to forecast)
- forecast: Package that includes ARIMA, ETS, and other popular forecasting algorithms
- tidyquant: For getting data and loading the tidyverse behind the scenes
- timetk: Toolkit for working with time series in R. We’ll use it to coerce from tibble to ts
If you don’t already have these installed, you can install them with install.packages(). Then load the four libraries with library().
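A minimal sketch of the loading step, assuming all four packages are already installed from CRAN (the commented-out install line is only a reminder):

```r
# One-time install, if needed:
# install.packages(c("sweep", "forecast", "tidyquant", "timetk"))

library(sweep)      # broom-style tidiers for the forecast package
library(forecast)   # ARIMA, ETS, and other forecasting algorithms
library(tidyquant)  # tq_get() for data; also loads the tidyverse
library(timetk)     # tk_ts() and other time series coercion tools
```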
We’ll use the same data as in the previous post where we used timetk to forecast with time series machine learning. We get data using the tq_get() function from tidyquant. The data comes from FRED: Beer, Wine, and Distilled Alcoholic Beverages Sales.
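The retrieval might look like the sketch below. The FRED series ID and the date range are assumptions on our part (they are not stated in the text above), and the call needs an internet connection.

```r
library(tidyquant)

# Assumption: "S4248SM144NCEN" is the FRED series ID for Beer, Wine, and
# Distilled Alcoholic Beverages Sales; the 2010-2016 window is illustrative.
beer_sales_tbl <- tq_get(
  "S4248SM144NCEN",
  get  = "economic.data",  # pull from FRED
  from = "2010-01-01",
  to   = "2016-12-31"
)

beer_sales_tbl
```

The result is a tibble with a date column and a price column holding the monthly sales values.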
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting (as we see during time series machine learning). We’ll use tidyquant charting tools: mainly geom_ma(ma_fun = SMA, n = 12) to add a 12-period simple moving average to get an idea of the trend. We can also see there appears to be both trend (moving average is increasing in a relatively linear pattern) and some seasonality (peaks and troughs tend to occur at specific months).
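Here is a rough sketch of such a chart. To keep it runnable offline, it uses a made-up monthly series (beer_sales_tbl below is a hypothetical stand-in, not the FRED data), so the shape of the plot is illustrative only.

```r
library(tidyquant)  # geom_ma(), theme_tq(); also loads ggplot2 and tibble

# Hypothetical stand-in data: linear trend + yearly seasonality + noise
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 9000 + 30 * (1:n) + 1500 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 200)
)

p <- ggplot(beer_sales_tbl, aes(x = date, y = price)) +
  geom_line() +
  geom_point() +
  geom_ma(ma_fun = SMA, n = 12) +  # 12-period simple moving average
  labs(title = "Monthly sales with 12-month SMA", x = "", y = "Sales") +
  theme_tq()

p
```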
Now that you have a feel for the time series we’ll be working with today, let’s move onto the demo!
DEMO: Tidy forecasting with forecast + sweep
We’ll use the combination of forecast and sweep to perform tidy forecasting.
Forecasting using the forecast package is a non-tidy process that involves ts class objects. We have seen this system before where we can “tidy” these objects. For the stats library, we have broom, which tidies models and predictions. For the forecast package we now have sweep, which tidies models and forecasts.
Objective: We’ll work through an ARIMA analysis to forecast the next 12 months of time series data.
Step 1: Create ts object
We’ll use timetk::tk_ts() to convert from a tibble to ts. From the previous post, we learned that this has two benefits:
- It’s a consistent method to convert to and from ts.
- The ts-object contains a timetk_idx (timetk index) as an attribute, which is the original time-based index.
Here’s how to convert. Remember that ts-objects are regular time series, so we need to specify a start and a frequency. We can then check that the ts-object has a timetk_idx.
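A minimal sketch of the conversion and the check, using a hypothetical stand-in tibble (made-up values) so it runs without downloading anything:

```r
library(timetk)
library(tibble)

# Hypothetical stand-in for the beer sales data
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 9000 + 30 * (1:n) + 1500 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 200)
)

# Monthly data starting in 2010; the non-numeric date column is dropped
# from the values but retained as the timetk index attribute
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12, silent = TRUE)

# Check for the timetk index
has_timetk_idx(beer_sales_ts)
```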
Great. This will be important when we use sw_sweep() later. Next, we’ll model using ARIMA.
Step 2A: Model using ARIMA
We can use the auto.arima() function from the forecast package to model the time series.
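A sketch of the modeling step on a hypothetical stand-in series (made-up values); the model that auto.arima() selects for the real data will differ:

```r
library(forecast)
library(timetk)
library(tibble)

# Hypothetical stand-in for the beer sales data
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 9000 + 30 * (1:n) + 1500 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 200)
)
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12, silent = TRUE)

# Let auto.arima() search for a suitable (seasonal) ARIMA specification
fit_arima <- auto.arima(beer_sales_ts)

fit_arima
```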
Step 2B: Tidy the Model
Just as broom tidies the stats package, we can use sweep functions to tidy the ARIMA model. Let’s examine three tidiers, which enable tidy model evaluation:
- sw_tidy(): Used to retrieve the model coefficients
- sw_glance(): Used to retrieve model description and training set accuracy metrics
- sw_augment(): Used to get model residuals
The sw_tidy() function returns the model coefficients in a tibble (tidy data frame).
The sw_glance() function returns the training set accuracy measures in a tibble (tidy data frame). We use glimpse() to aid in quickly reviewing the model metrics.
The sw_augment() function helps with model evaluation. We get the “.actual”, “.fitted” and “.resid” columns, which are useful in evaluating the model against the training data. Note that we can pass timetk_idx = TRUE to return the original date index.
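The three tidiers can be sketched as follows, again on a hypothetical stand-in series (made-up values), so the coefficients and metrics shown are not those of the real data:

```r
library(sweep)
library(forecast)
library(timetk)
library(dplyr)
library(tibble)

# Hypothetical stand-in for the beer sales data
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 9000 + 30 * (1:n) + 1500 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 200)
)
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12, silent = TRUE)
fit_arima     <- auto.arima(beer_sales_ts)

# Model coefficients as a tibble
coef_tbl <- sw_tidy(fit_arima)

# Model description and training set accuracy metrics (one row)
glance_tbl <- sw_glance(fit_arima)
glimpse(glance_tbl)

# Actual, fitted, and residual values, indexed by the original dates
augment_tbl <- sw_augment(fit_arima, timetk_idx = TRUE)
augment_tbl
```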
We can visualize the residual diagnostics for the training data to make sure there is no pattern leftover.
Step 3: Make a Forecast
Make a forecast using the forecast() function.
One problem is that the forecast output is not “tidy”. We need it in a data frame if we want to work with it using tidyverse functionality. The class is “forecast”, which is a ts-based object (its contents are ts-objects).
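A sketch of the forecast step on a hypothetical stand-in series (made-up values), with h = 12 for the next 12 months:

```r
library(forecast)
library(timetk)
library(tibble)

# Hypothetical stand-in for the beer sales data
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 9000 + 30 * (1:n) + 1500 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 200)
)
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12, silent = TRUE)
fit_arima     <- auto.arima(beer_sales_ts)

# Forecast the next 12 periods (months)
fcast_arima <- forecast(fit_arima, h = 12)

# The result is a "forecast" object, not a data frame
class(fcast_arima)
```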
Step 4: Tidy the Forecast with sweep
We can use sw_sweep() to tidy the forecast output. As an added benefit, if the forecast-object has a timetk index, we can use it to return a date/datetime index as opposed to the regular index from the ts-based object.
First, let’s check if the forecast-object has a timetk index. Great. We can use the timetk_idx argument when we apply sw_sweep() to tidy the forecast output. Internally it projects a future time series index based on the “timetk_idx” attribute (this all happens because we created the ts-object originally with tk_ts() in Step 1). Bottom Line: This means we can finally use dates with the forecast package (as opposed to the regularly spaced numeric index that the ts-system uses)!
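The check and the tidying step can be sketched as below, using a hypothetical stand-in series (made-up values):

```r
library(sweep)
library(forecast)
library(timetk)
library(tibble)

# Hypothetical stand-in for the beer sales data
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 9000 + 30 * (1:n) + 1500 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 200)
)
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12, silent = TRUE)
fcast_arima   <- forecast(auto.arima(beer_sales_ts), h = 12)

# Does the forecast object carry the timetk index?
has_timetk_idx(fcast_arima)

# Tidy the forecast: actuals and forecasts stacked in one tibble,
# with dates in the index column and prediction interval columns
fcast_tbl <- sw_sweep(fcast_arima, timetk_idx = TRUE)

tail(fcast_tbl, 12)
```

The key column distinguishes the "actual" rows from the "forecast" rows, which makes it straightforward to color or filter them in a plot.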
Step 5: Compare Actuals vs Predictions
We can use tq_get() to retrieve the actual data. Note that we don’t have all of the data for comparison, but we can at least compare the first several months of actual values.
Notice that we have the entire forecast in a tibble. We can now more easily visualize the forecast.
We can investigate the error on our test set (actuals vs predictions).
And we can calculate a few residual metrics. The MAPE is approximately 4.3%, which is slightly better than the simple linear regression from the timetk demo. Note that the RMSE is slightly worse.
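For reference, here is one way such metrics could be computed. The actual and pred values below are made up for illustration and are not the results from this demo; the column and metric names are our own choices.

```r
library(dplyr)
library(tibble)

# Hypothetical actuals vs predictions (illustrative numbers only)
error_tbl <- tibble(
  actual = c(100, 200, 400),
  pred   = c(110, 190, 400)
) %>%
  mutate(
    error     = actual - pred,
    error_pct = error / actual
  )

# Common test set error metrics
metrics_tbl <- error_tbl %>%
  summarise(
    me   = mean(error),            # mean error
    rmse = sqrt(mean(error^2)),    # root mean squared error
    mae  = mean(abs(error)),       # mean absolute error
    mape = mean(abs(error_pct)),   # mean absolute percentage error
    mpe  = mean(error_pct)         # mean percentage error
  )

metrics_tbl
```

MAPE is scale-free, which makes it easier to compare across series, while RMSE is in the units of the data and penalizes large misses more heavily. That difference is why one model can win on MAPE and lose on RMSE.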
The sweep package is very useful for tidying the forecast package output. This demo showed some of the basics. Interested readers should check out the documentation, which goes into expanded detail on scaling analysis by groups and using multiple forecast models.
We have a busy couple of weeks. In addition to Demo Week, we have:
On Thursday, October 26 at 7PM EST, Matt will be giving a FREE LIVE #DataTalk on Machine Learning for Recruitment and Reducing Employee Attrition. You can sign up for a reminder at the Experian Data Lab website.
On Friday, November 3rd, Matt will be presenting at the EARL Conference on HR Analytics: Using Machine Learning to Predict Employee Turnover.
Based on recent demand, we are considering offering application-specific machine learning courses for Data Scientists. The content will be business problems similar to our popular articles:
The student will learn from Business Science how to implement cutting edge data science to solve business problems. Please let us know if you are interested. You can leave comments as to what you would like to see at the bottom of the post in Disqus.
About Business Science
Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science works with clients primarily in small to medium size businesses, guiding these organizations in expanding predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!