sweep: Extending broom for time series forecasting
Written by Matt Dancho
We’re pleased to introduce a new package, sweep, now on CRAN! Think of it like broom for the forecast package. The forecast package is the most popular package for forecasting, and for good reason: it has a number of sophisticated forecast modeling functions. There’s one problem: forecast is based on the ts system, which makes it difficult to work within the tidyverse. This is where sweep fits in! The sweep package has tidiers that convert the output from forecast modeling and forecasting functions to “tidy” data frames. We’ll go through a quick introduction to show how the tidiers can be used, and then show a fun example of forecasting GDP trends of US states. If you’re familiar with broom, it will feel like second nature. If you like what you read, don’t forget to follow us on social media to stay up on the latest Business Science news, events and information!
An example of the visualization we can create using sw_sweep() for tidying a forecast:
Benefits
The sweep package makes it easy to transition from the forecast package to the tidyverse. The main benefits are:
- Converting forecasts to data frames: The forecast package uses ts objects under the hood, thus making it difficult to use in the “tidyverse”. With sw_sweep, we can now easily convert forecasts to tidy data frames.
- Dates are carried through to the end: ts objects traditionally lose date information. The sweep package uses timekit under the hood to maintain the original time series index the whole way through the process. The result is the ability to forecast in the original date or date-time time base by setting timekit_idx = TRUE. Future dates are computed using tk_make_future_timeseries() from timekit.
- Intermediate modeling tidiers: The sweep package uses broom-style tidiers, sw_tidy, sw_glance, and sw_augment, to extract important model information into tidy data frames.
Libraries Needed
You can quickly install the packages used with the following script:
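The original install script isn’t shown here; a minimal sketch, assuming all five packages are pulled from CRAN, would be:

```r
# Install the packages used in this post
install.packages(c("forecast", "sweep", "timekit", "tidyquant", "geofacet"))
```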
Load the following packages:
- forecast: Has excellent modeling functions such as auto.arima(), ets(), and bats(), and the forecast() function for predicting future observations.
- sweep: Tidies the output of forecast functions using a similar strategy as the broom package.
- timekit: Coercion function tk_ts() for converting a tibble to ts while maintaining time-based data.
- tidyquant: Used to get FRED data and for its ggplot2 theme.
- geofacet: Really useful facet_geo() function to visualize facets organized by geography.
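Loading the packages listed above might look like:

```r
library(forecast)   # auto.arima(), ets(), bats(), forecast()
library(sweep)      # sw_tidy(), sw_glance(), sw_augment(), sw_sweep()
library(timekit)    # tk_ts() coercion, tk_make_future_timeseries()
library(tidyquant)  # tq_get() for FRED data, ggplot2 theme
library(geofacet)   # facet_geo()
```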
Data
We’ll be working with Annual Gross Domestic Product (GDP) time series data for each of the US States from the FRED database.
One State: Nebraska
We can get the data for one of the states by using tq_get() from the tidyquant package. The FRED code we will use is “NENGSP”, for Nebraska’s annual GDP. Set get = "economic.data" and supply a date range. By default, the returned values are named “price”. Rename the column “gdp”.
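A sketch of the call; the exact date range is an assumption (chosen to match the start = 2007 used later in the post):

```r
library(tidyquant)

# "NENGSP" = Nebraska annual GDP on FRED; date range is an assumption
ne_gdp <- tq_get("NENGSP",
                 get  = "economic.data",
                 from = "2007-01-01",
                 to   = "2017-06-01") %>%
    rename(gdp = price)
```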
We’ll need the GDP data for all states to create the GDP by State forecast visualization. Here’s how to get it by scaling with tq_get().
Scaling to All 50 States
The structure of the FRED code begins with the state abbreviation, “NE” for Nebraska, followed by “NGSP”. This means we can pull the data for all states very easily by changing the first two characters.
We start by getting a data frame of state FRED codes and abbreviations. Conveniently, R ships with the state abbreviations stored in state.abb. The mutation just adds “NGSP” to the end of the abbreviation to get the FRED code. It’s really important that the code is in the first column so tq_get can scale the “getter”. The output is stored as states.
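A sketch of building the states table described above:

```r
library(tidyverse)

# state.abb ships with R; paste0() appends "NGSP" to form the FRED code
states <- tibble(abbreviation = state.abb) %>%
    mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
    select(fred_code, abbreviation)  # FRED code first so tq_get() can scale
```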
Next, we scale to pull the FRED data for all of the states by simply passing the states data frame to tq_get(). We format the output dropping the “fred_code” column, grouping on “abbreviation”, and renaming the “price” column to “gdp”. The result is stored in states_gdp.
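This scaling step might be sketched as follows, assuming the states table from the previous step and the same (assumed) date range:

```r
library(tidyquant)

# tq_get() iterates over the codes in the first column of `states`
states_gdp <- states %>%
    tq_get(get = "economic.data", from = "2007-01-01", to = "2017-06-01") %>%
    select(-fred_code) %>%
    group_by(abbreviation) %>%
    rename(gdp = price)
```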
We have two data frames now:
Quick Start
We’ll go through the process to show how sweep can help with tidying in the forecast workflow using the Nebraska GDP data, ne_gdp.
Convert to ts
The forecast package works with ts objects, so we’ll need to convert from a tibble (tidy data frame). Here’s how using the timekit function, tk_ts(). Supply a start date start = 2007 and frequency freq = 1 for 1 year to set up the ts object. Add silent = TRUE to skip the messages and warnings that the “date” column is being dropped (non-numeric columns are automatically dropped and the user is alerted by default).
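The conversion is a one-liner, assuming the ne_gdp tibble from earlier:

```r
library(timekit)

# Drops the non-numeric "date" column (silently) and returns an annual ts
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, freq = 1, silent = TRUE)
```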
Model with auto.arima
Now we can model. Let’s use the auto.arima() function from the forecast package. This function is really cool because it pre-selects model parameters internally, making it easier to get forecasts, especially at scale (discussed later). ;)
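Fitting the model (the object name fit_arima is an assumption):

```r
library(forecast)

# auto.arima() searches over ARIMA orders and returns the best fit by AICc
fit_arima <- auto.arima(ne_gdp_ts)
```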
Optional: Apply a modeling tidier
Once we have a model, we can use the sweep tidiers: sw_tidy(), sw_glance(), and sw_augment(). We’ll check out sw_glance() to get the model accuracy metrics.
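For example, applied to the fitted model from the previous step:

```r
library(sweep)

# Returns a one-row tibble: model description plus fit and accuracy measures
sw_glance(fit_arima)
```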
Forecast
Next, we create the forecast using the forecast() function from the forecast package. We’ll perform a three-year forecast, so set h = 3 for 3 periods.
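A sketch of the forecasting step (the object name fcast_arima is an assumption):

```r
library(forecast)

# h = 3 predicts three annual periods ahead
fcast_arima <- forecast(fit_arima, h = 3)
```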
Tidy the forecast with sw_sweep
Finally, the beauty of sweep: we can convert the forecast to a tidy data frame.
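The conversion is a single call; the output is a long tibble with an index column, a “key” column distinguishing actual from forecast values, the measured variable, and prediction-interval bounds:

```r
library(sweep)

# Tidy tibble: index, key ("actual"/"forecast"), gdp, lo.80/lo.95/hi.80/hi.95
sw_sweep(fcast_arima)
```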
And we can easily visualize using ggplot2.
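A sketch of the plot; the ribbon colors and labels are my choices, and timekit_idx = TRUE is used so the x-axis carries real dates:

```r
library(ggplot2)
library(tidyquant)  # for theme_tq()

sw_sweep(fcast_arima, timekit_idx = TRUE, rename_index = "date") %>%
    ggplot(aes(x = date, y = gdp, color = key)) +
    # 95% and 80% prediction intervals as layered ribbons
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), fill = "#D5DBFF", color = NA) +
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80), fill = "#596DD5", color = NA, alpha = 0.8) +
    geom_line(size = 1) +
    labs(title = "Nebraska GDP: 3-Year Forecast",
         x = "", y = "GDP, Millions of Dollars") +
    theme_tq()
```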
Now, onto a more sophisticated example.
State GDP Forecasting
Rather than one state, say we wanted to visualize the forecast of the annual GDP for all states so we can get a better understanding of trends. This is now much easier. The general steps are the same, but instead of individually managing each analysis we’ll use purrr to iterate through the 50 states keeping everything “tidy” in the process.
Start with states_gdp, which contains our data for all 50 states. Use nest() to create a nested data frame with the “date” and “gdp” inside a list column.
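Since states_gdp is already grouped by “abbreviation”, nest() collapses the remaining columns into a list column (the intermediate object names in this and the following sketches are my own):

```r
library(tidyverse)

# One row per state; "data" holds a tibble of date + gdp for that state
states_gdp_nested <- states_gdp %>%
    nest()
```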
Next, use map() to iteratively apply the tk_ts() function. Add the additional arguments freq = 1, start = 2007, and silent = TRUE. The new column, “data_ts”, contains the data converted to ts.
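This step might be sketched as:

```r
library(tidyverse)
library(timekit)

# Extra arguments after the function name are passed through to tk_ts()
states_gdp_ts <- states_gdp_nested %>%
    mutate(data_ts = map(data, tk_ts, start = 2007, freq = 1, silent = TRUE))
```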
Third, use map() again, this time applying the auto.arima() function. We can see that a new column is added called “fit”.
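Sketched as:

```r
library(tidyverse)
library(forecast)

# Fit one ARIMA model per state, stored in the "fit" list column
states_gdp_fit <- states_gdp_ts %>%
    mutate(fit = map(data_ts, auto.arima))
```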
Optionally, we can map sw_glance() over the fits to get the model accuracies.
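A sketch of that optional check:

```r
library(tidyverse)
library(sweep)

# One row of fit/accuracy metrics per state
states_gdp_fit %>%
    mutate(glance = map(fit, sw_glance)) %>%
    select(abbreviation, glance) %>%
    unnest()
```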
Fourth, use map() to apply the forecast function, passing h = 3 as an additional argument.
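Sketched as:

```r
library(tidyverse)
library(forecast)

# Three-period-ahead forecast for each state's fitted model
states_gdp_fcast <- states_gdp_fit %>%
    mutate(fcast = map(fit, forecast, h = 3))
```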
Finally, use map() to apply the sw_sweep() function, passing timekit_idx = TRUE (this gets dates instead of numbers) and rename_index = "date". We no longer need the other columns, so select the “abbreviation” and “sweep” columns and unnest(). Voilà, we have a nice tidy data frame of all of the state forecasts!
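The whole tidying step might look like this (the final object name is my own):

```r
library(tidyverse)
library(sweep)

states_gdp_fcast_tidy <- states_gdp_fcast %>%
    mutate(sweep = map(fcast, sw_sweep,
                       timekit_idx = TRUE, rename_index = "date")) %>%
    select(abbreviation, sweep) %>%
    unnest()  # one long tidy tibble of actuals + forecasts for all states
```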
As an added bonus, we can use the facet_geo() function from the geofacet package to visualize the trend and forecast for each state. From the output it looks like most of the states are increasing, but there are a few with more volatile trends. It might be interesting to investigate what’s causing the deviations in the Midwest and South. Possibly related to the recent recession in oil and gas?
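A sketch of the geographic facet plot; the axis formatting and free y-scales are my choices:

```r
library(tidyverse)
library(geofacet)
library(tidyquant)  # theme_tq()

states_gdp_fcast_tidy %>%
    ggplot(aes(x = date, y = gdp, color = key)) +
    geom_line() +
    # Arranges one panel per state in an approximate US-map layout
    facet_geo(~ abbreviation, scales = "free_y") +
    labs(title = "State GDP: 3-Year Forecasts", x = "", y = "") +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
```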
Conclusions
The sweep package is a great way to “tidy” the forecast package. It has several functions that tidy model output (sw_tidy, sw_glance, and sw_augment) and forecast output (sw_sweep). A big advantage is that the dates can be kept through the entire process since sweep uses timekit under the hood. If you use the forecast package and love the tidyverse, give sweep a try!
About Business Science
We have a full suite of data science services to supercharge your financial and business performance. How do we do it? Using our network of data science consultants, we pull together the right team to get custom projects done on time, within budget, and of the highest quality. Find out more about our data science services or contact us!
We are growing! Let us know if you are interested in joining our network of data scientist consultants. If you have expertise in Marketing Analytics, Data Science for Business, Financial Analytics, or Data Science in general, we’d love to talk. Contact us!