sweep: Extending broom for time series forecasting
Written by Matt Dancho
We’re pleased to introduce a new package, sweep, now on CRAN! Think of it like broom for the forecast package. The forecast package is the most popular package for forecasting, and for good reason: it has a number of sophisticated forecast modeling functions. There’s one problem: forecast is based on the ts system, which makes it difficult to work within the tidyverse. This is where sweep fits in! The sweep package has tidiers that convert the output from forecast modeling and forecasting functions to “tidy” data frames. We’ll go through a quick introduction to show how the tidiers can be used, and then show a fun example of forecasting GDP trends of US states. If you’re familiar with broom it will feel like second nature. If you like what you read, don’t forget to follow us on social media to stay up on the latest Business Science news, events and information!
An example of the visualization we can create using sw_sweep() for tidying a forecast:
The sweep package makes it easy to transition from the forecast package to the tidyverse. The main benefits are:
Converting forecasts to data frames: The forecast package uses ts objects under the hood, which makes it difficult to use in the “tidyverse”. With sw_sweep, we can now easily convert forecasts to tidy data frames.
Dates are carried through to the end: The ts objects traditionally lose date information. The sweep package uses timekit under the hood to maintain the original time series index the whole way through the process. The result is the ability to forecast in the original date or date-time time-base by setting timekit_idx = TRUE. Future dates are computed using
Intermediate modeling tidiers: The sweep package uses sw_augment to extract important model information into tidy data frames.
You can quickly install the packages used with the following script:
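The original install script is not reproduced here, so the following is only a minimal sketch of what it might look like. One assumption to flag: to my knowledge the timekit package was later renamed timetk, so the sketch installs timetk; on package versions from the time of this post, substitute timekit. The tidyverse entry is also an assumption, added because later examples rely on dplyr, tidyr, purrr and ggplot2.

```r
# Install the packages used in this post (run once).
# Assumption: "timetk" is the current name of the package this post calls "timekit".
# Assumption: "tidyverse" is included for dplyr, tidyr, purrr and ggplot2.
pkgs <- c("forecast", "sweep", "timetk", "tidyquant", "geofacet", "tidyverse")

# Only install what is missing so the script is safe to re-run.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```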
Load the following packages:
forecast: Has excellent modeling functions such as bats() and the forecast() function for predicting future observations.
sweep: Tidies the output of forecast functions using a similar strategy as the broom package.
timekit: Coercion function tk_ts() for converting a tibble to ts while maintaining time-based data.
tidyquant: Used to get FRED data and for its
geofacet: Really useful facet_geo() function to visualize facets organized by geography.
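A minimal sketch of loading the packages listed above. As an assumption, it loads timetk in place of timekit (my understanding is that timekit was renamed timetk after this post was written), and it loads the individual tidyverse packages that the later steps rely on.

```r
# Modeling and tidying
library(forecast)   # auto.arima(), forecast()
library(sweep)      # sw_glance(), sw_augment(), sw_sweep()
library(timetk)     # tk_ts(); assumption: successor of the timekit package named in the post

# Data access and plotting
library(tidyquant)  # tq_get() for FRED data
library(geofacet)   # facet_geo()

# Tidyverse helpers used in the workflow (assumption: loaded individually)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
```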
We’ll be working with Annual Gross Domestic Product (GDP) time series data for each of the US States from the FRED database.
One State: Nebraska
We can get the data for one of the states by using tq_get() from the tidyquant package. The FRED code we will use is “NENGSP”, for Nebraska’s annual GDP. Set get = "economic.data" and supply a date range. By default, the returned values are named “price”. Rename this column to “gdp”.
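A sketch of the Nebraska download described above. The post does not state the date range, so the from and to values here are assumptions chosen for illustration, and the object name ne_gdp is my own label. The select() call both renames “price” to “gdp” and drops any extra identifier column that newer tidyquant versions may return.

```r
library(tidyquant)  # tq_get()
library(dplyr)      # select(), %>%

# Nebraska annual GDP from FRED.
# Assumption: the date range below is illustrative, not taken from the post.
ne_gdp <- tq_get("NENGSP",
                 get  = "economic.data",
                 from = "2007-01-01",
                 to   = "2017-06-01") %>%
  select(date, gdp = price)  # keep the date, rename "price" to "gdp"

ne_gdp
```

This needs an internet connection, since tq_get() queries FRED at run time.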
We’ll need the GDP data for all states to create the GDP by State forecast visualization. Here’s how to get it by scaling with
Scaling to All 50 States
The structure of the FRED code begins with the state abbreviation, “NE” for Nebraska, followed by “NGSP”. This means we can pull the data for all states very easily by changing the first two characters.
We start by getting a data frame of state FRED codes and abbreviations. Conveniently, R ships with the state abbreviations stored in state.abb. The mutation just adds “NGSP” to the end of the abbreviation to get the FRED code. It’s really important that the code is in the first column so tq_get can scale the “getter”. The output is stored as states.
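The lookup table described above can be sketched as follows. The column names fred_code and abbreviation follow the names the post uses in the next step; the exact original code is not shown, so treat this as one reasonable way to build it.

```r
library(dplyr)

# One row per state: FRED code first, abbreviation second.
# The FRED code is the state abbreviation followed by "NGSP".
states <- tibble(abbreviation = state.abb) %>%
  mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
  select(fred_code, abbreviation)  # code must be the first column for tq_get()

states
```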
Next, we scale to pull the FRED data for all of the states by simply passing the states data frame to tq_get(). We format the output by dropping the “fred_code” column, grouping on “abbreviation”, and renaming the “price” column to “gdp”. The result is stored in states_gdp.
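A sketch of the scaled download, under the same assumed date range as before. Instead of dropping “fred_code” by name, it selects the three columns that are needed, which has the same effect and is also robust to any extra identifier column a newer tidyquant version might add.

```r
library(tidyquant)  # tq_get()
library(dplyr)

# Lookup table: FRED code in the first column, abbreviation in the second.
states <- tibble(abbreviation = state.abb) %>%
  mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
  select(fred_code, abbreviation)

# Passing the data frame lets tq_get() loop over the first column.
# Assumption: the date range is illustrative, not taken from the post.
states_gdp <- tq_get(states,
                     get  = "economic.data",
                     from = "2007-01-01",
                     to   = "2017-06-01") %>%
  select(abbreviation, date, gdp = price) %>%  # drop the code, rename "price"
  group_by(abbreviation)

states_gdp
```

This issues one FRED request per state, so it can take a little while to finish.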
We have two data frames now:
We’ll go through the process to show how sweep can help with tidying in the forecast workflow using the Nebraska GDP data.
Convert to ts
The forecast package works with ts objects, so we’ll need to convert from a tibble (tidy data frame). Here’s how, using the timekit function tk_ts(). Supply a start date start = 2007 and frequency freq = 1 for 1 year to set up the ts object. Add silent = TRUE to skip the messages and warnings that the “date” column is being dropped (non-numeric columns are automatically dropped and the user is alerted by default).
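A sketch of the conversion step. Several things here are assumptions: the data is fetched with an illustrative date range starting in 2007 (hence start = 2007, matching the first observation), tk_ts() is loaded from timetk (to my knowledge the later name of timekit), and the frequency argument is spelled out in full where the post abbreviates it as freq.

```r
library(tidyquant)  # tq_get()
library(dplyr)
library(timetk)     # tk_ts(); assumption: successor of timekit

# Nebraska annual GDP (assumed date range, for illustration only).
ne_gdp <- tq_get("NENGSP", get = "economic.data",
                 from = "2007-01-01", to = "2017-06-01") %>%
  select(date, gdp = price)

# Convert the tibble to a ts object. The date column is dropped from the
# values but kept as an index attribute so it can be recovered later.
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)

ne_gdp_ts
```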
Model with auto.arima
Now we can model. Let’s use the auto.arima() function from the forecast package. This function is really cool because it selects the ARIMA model parameters automatically, which makes it easier to get forecasts, especially at scale (discussed later). ;)
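A sketch of the modeling step, rebuilt from scratch so it runs on its own. The object names (ne_gdp, ne_gdp_ts, fit_arima), the date range, and the use of timetk in place of timekit are assumptions for illustration.

```r
library(tidyquant)  # tq_get()
library(dplyr)
library(timetk)     # tk_ts(); assumption: successor of timekit
library(forecast)   # auto.arima()

# Data and ts conversion (assumed date range).
ne_gdp <- tq_get("NENGSP", get = "economic.data",
                 from = "2007-01-01", to = "2017-06-01") %>%
  select(date, gdp = price)
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)

# auto.arima() searches over candidate ARIMA orders and returns the
# fitted model it selects.
fit_arima <- auto.arima(ne_gdp_ts)

fit_arima
```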
Optional: Apply a modeling tidier
Once we have a model, we can use the sweep modeling tidiers, such as sw_augment. We’ll check out sw_glance to get the model accuracy metrics.
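A sketch of applying the modeling tidiers to a fitted model. The setup repeats the earlier steps under the same assumptions (illustrative date range, timetk standing in for timekit, my own object names). Note that with timetk-era versions of sweep, my understanding is that the argument is spelled timetk_idx; on versions from the time of this post it is timekit_idx.

```r
library(tidyquant)  # tq_get()
library(dplyr)
library(timetk)     # tk_ts(); assumption: successor of timekit
library(forecast)   # auto.arima()
library(sweep)      # sw_glance(), sw_augment()

# Data, ts conversion and model fit (assumed date range).
ne_gdp <- tq_get("NENGSP", get = "economic.data",
                 from = "2007-01-01", to = "2017-06-01") %>%
  select(date, gdp = price)
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)
fit_arima <- auto.arima(ne_gdp_ts)

# One-row summary of the model, including accuracy metrics.
model_summary <- sw_glance(fit_arima)

# Actual, fitted and residual values as a tidy data frame.
# Assumption: timetk_idx is the current spelling of the post's timekit_idx.
model_augmented <- sw_augment(fit_arima, timetk_idx = TRUE)

model_summary
model_augmented
```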
Next, we create the forecast using the forecast() function from the forecast package. We’ll perform a three-year forecast, so set h = 3 for 3 periods.
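A sketch of the forecasting step, again self-contained. The setup carries the same assumptions as the earlier examples (illustrative date range, timetk in place of timekit, my own object names).

```r
library(tidyquant)  # tq_get()
library(dplyr)
library(timetk)     # tk_ts(); assumption: successor of timekit
library(forecast)   # auto.arima(), forecast()

# Data, ts conversion and model fit (assumed date range).
ne_gdp <- tq_get("NENGSP", get = "economic.data",
                 from = "2007-01-01", to = "2017-06-01") %>%
  select(date, gdp = price)
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)
fit_arima <- auto.arima(ne_gdp_ts)

# Forecast 3 periods ahead; with annual data that is 3 years.
fcast_arima <- forecast(fit_arima, h = 3)

fcast_arima
```

The result is a forecast object, not a data frame, which is the gap sw_sweep() is meant to close.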
Tidy the forecast with sw_sweep
Finally, the beauty of sweep: we can convert the forecast to a tidy data frame.
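A sketch of tidying the forecast. The setup repeats the earlier steps under the same assumptions. The post's timekit_idx argument is written here as timetk_idx, which is my understanding of the current spelling in sweep; use timekit_idx on versions from the time of this post.

```r
library(tidyquant)  # tq_get()
library(dplyr)
library(timetk)     # tk_ts(); assumption: successor of timekit
library(forecast)   # auto.arima(), forecast()
library(sweep)      # sw_sweep()

# Data, ts conversion, model fit and forecast (assumed date range).
ne_gdp <- tq_get("NENGSP", get = "economic.data",
                 from = "2007-01-01", to = "2017-06-01") %>%
  select(date, gdp = price)
ne_gdp_ts   <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)
fit_arima   <- auto.arima(ne_gdp_ts)
fcast_arima <- forecast(fit_arima, h = 3)

# Tidy the forecast: actuals and forecasts are stacked in one data frame,
# with a "key" column marking which is which and columns for the
# prediction intervals. timetk_idx = TRUE returns dates instead of numbers.
ne_sweep <- sw_sweep(fcast_arima, timetk_idx = TRUE, rename_index = "date")

ne_sweep
```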
And, we can easily visualize using
Now, onto a more sophisticated example.
State GDP Forecasting
Rather than one state, say we wanted to visualize the forecast of the annual GDP for all states so we can get a better understanding of trends. This is now much easier. The general steps are the same, but instead of individually managing each analysis we’ll use purrr to iterate through the 50 states, keeping everything “tidy” in the process.
Start with states_gdp, which contains our data for all 50 states. Use nest() to create a nested data frame with the “date” and “gdp” inside a list column.
Use map() to iteratively apply the tk_ts() function. Add the additional arguments freq = 1, start = 2007 and silent = TRUE. The new column, “data_ts”, contains the data converted to ts.
Use map() again, this time applying the auto.arima function. We can see that a new column is added called fit.
Optionally, we can run sw_glance to get the model accuracies.
Use map() to apply the forecast function, passing h = 3 as an additional argument.
Use map() to apply the sw_sweep function, passing timekit_idx = TRUE (this gets dates instead of numbers) and rename_index = "date". We no longer need the other columns, so select the “abbreviation” and “sweep” columns and unnest(). Voilà, we have a nice tidy data frame of all of the state forecasts!
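The steps above can be sketched as a single pipeline. This is a reconstruction under stated assumptions, not the post's original code: the date range is illustrative, timetk stands in for timekit (so the argument is timetk_idx), the intermediate column name fcast and the result name states_forecast are my own, and unnest(cols = ...) assumes tidyr 1.0 or later.

```r
library(tidyquant)  # tq_get()
library(dplyr)
library(tidyr)      # nest(), unnest()
library(purrr)      # map()
library(timetk)     # tk_ts(); assumption: successor of timekit
library(forecast)   # auto.arima(), forecast()
library(sweep)      # sw_sweep()

# FRED codes for all 50 states, code in the first column.
states <- tibble(abbreviation = state.abb) %>%
  mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
  select(fred_code, abbreviation)

# Annual GDP for every state (assumed date range).
states_gdp <- tq_get(states, get = "economic.data",
                     from = "2007-01-01", to = "2017-06-01") %>%
  select(abbreviation, date, gdp = price)

states_forecast <- states_gdp %>%
  group_by(abbreviation) %>%
  nest() %>%      # "date" and "gdp" go into a list column named "data"
  ungroup() %>%
  mutate(
    # Convert each state's tibble to ts.
    data_ts = map(data, tk_ts, start = 2007, frequency = 1, silent = TRUE),
    # Fit one ARIMA model per state.
    fit     = map(data_ts, auto.arima),
    # Forecast 3 years ahead for each model.
    fcast   = map(fit, forecast, h = 3),
    # Tidy each forecast, keeping real dates in a "date" column.
    sweep   = map(fcast, sw_sweep, timetk_idx = TRUE, rename_index = "date")
  ) %>%
  select(abbreviation, sweep) %>%
  unnest(cols = sweep)

states_forecast
```

Keeping every intermediate object in a list column means each state's data, model and forecast stay on the same row, so nothing has to be tracked by hand.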
As an added bonus, we can use the facet_geo() function from the geofacet package to visualize the trend and forecast for each state. From the output it looks like most of the states are increasing, but there are a few with more volatile trends. It might be interesting to investigate what’s causing the deviations in the Midwest and South. Possibly related to the recent recession in oil and gas?
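One way the geographic plot might be drawn is sketched below. The styling is entirely my own assumption (the original figure's code is not shown), and the data preparation repeats the state pipeline under the same assumptions as before: illustrative date range, timetk in place of timekit, tidyr 1.0 or later. It assumes the swept value column is named gdp and the 95% interval columns are lo.95 and hi.95.

```r
library(tidyquant); library(dplyr); library(tidyr); library(purrr)
library(timetk)     # assumption: successor of timekit
library(forecast); library(sweep)
library(ggplot2)
library(geofacet)   # facet_geo()

# Rebuild the tidy state forecasts (assumed date range).
states <- tibble(abbreviation = state.abb) %>%
  mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
  select(fred_code, abbreviation)

states_forecast <- tq_get(states, get = "economic.data",
                          from = "2007-01-01", to = "2017-06-01") %>%
  select(abbreviation, date, gdp = price) %>%
  group_by(abbreviation) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    data_ts = map(data, tk_ts, start = 2007, frequency = 1, silent = TRUE),
    fit     = map(data_ts, auto.arima),
    fcast   = map(fit, forecast, h = 3),
    sweep   = map(fcast, sw_sweep, timetk_idx = TRUE, rename_index = "date")
  ) %>%
  select(abbreviation, sweep) %>%
  unnest(cols = sweep)

# One panel per state, arranged roughly like a US map.
# The ribbon shows the 95% prediction interval for the forecast rows.
p <- ggplot(states_forecast, aes(x = date, y = gdp, color = key)) +
  geom_ribbon(aes(ymin = lo.95, ymax = hi.95),
              fill = "grey80", color = NA, na.rm = TRUE) +
  geom_line() +
  facet_geo(~ abbreviation, scales = "free_y") +
  labs(title = "State GDP: actuals and 3-year forecast", x = NULL, y = "GDP")

p
```

Free y scales let each state's trend be read on its own terms, at the cost of making levels not comparable across panels.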
The sweep package is a great way to “tidy” the forecast package. It has several functions that tidy model output (sw_augment) and forecast output (sw_sweep). A big advantage is that the dates can be kept through the entire process since sweep uses timekit under the hood. If you use the forecast package and love the tidyverse, give sweep a try!