sweep: Extending broom for time series forecasting
Written by Matt Dancho on July 9, 2017
We’re pleased to introduce a new package, sweep, now on CRAN! Think of it like broom for the forecast package. The forecast package is the most popular package for forecasting, and for good reason: it has a number of sophisticated forecast modeling functions. There’s one problem: forecast is based on the ts system, which makes it difficult to work within the tidyverse. This is where sweep fits in! The sweep package has tidiers that convert the output from forecast modeling and forecasting functions to “tidy” data frames. We’ll go through a quick introduction to show how the tidiers can be used, and then show a fun example of forecasting GDP trends of US states. If you’re familiar with broom it will feel like second nature. If you like what you read, don’t forget to follow us on social media to stay up on the latest Business Science news, events and information!
An example of the visualization we can create using sw_sweep() for tidying a forecast:
The sweep package makes it easy to transition from the forecast package to the tidyverse. The main benefits are:
Converting forecasts to data frames: The forecast package uses ts objects under the hood, thus making it difficult to use in the “tidyverse”. With sw_sweep, we can now easily convert forecasts to tidy data frames.
Dates are carried through to the end: The ts objects traditionally lose date information. The sweep package uses timekit under the hood to maintain the original time series index the whole way through the process. The result is the ability to forecast in the original date or date-time time-base by setting timekit_idx = TRUE. Future dates are computed using tk_make_future_timeseries() from timekit.
Intermediate modeling tidiers: The sweep package uses broom-style tidiers, sw_tidy, sw_glance, and sw_augment to extract important model information into tidy data frames.
You can quickly install the packages used with the following script:
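A minimal install script, assuming all of the packages are available from your CRAN mirror. One caveat: timekit has since been renamed timetk, so the sketch installs it under that name; on a 2017-era setup the entry would be "timekit" instead.

```r
# Packages used in this post; run once per machine
# Note: "timetk" is the current name of the timekit package (an assumption
# about your setup; swap in "timekit" if that is what your library provides)
pkgs <- c("forecast", "sweep", "timetk", "tidyquant", "geofacet")
install.packages(pkgs)
```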
Load the following packages:
forecast: Has excellent modeling functions such as auto.arima(), ets() and bats() and the forecast() function for predicting future observations.
sweep: Tidies the output of forecast functions using a similar strategy as the broom package.
timekit: Coercion function tk_ts() for converting a tibble to ts while maintaining time-based data.
tidyquant: Used to get FRED data and for its ggplot2 theme.
geofacet: Really useful facet_geo() function to visualize facets organized by geography.
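The list above translates into the library calls below. This is a sketch that assumes a current installation, where timekit is available as timetk. The tidyverse packages are attached explicitly here rather than relying on tidyquant to bring them in, which is an assumption about how your setup behaves.

```r
library(forecast)   # auto.arima(), ets(), bats(), forecast()
library(sweep)      # sw_tidy(), sw_glance(), sw_augment(), sw_sweep()
library(timetk)     # tk_ts(); released as timekit when this post was written
library(tidyquant)  # tq_get() and the ggplot2 theme theme_tq()
library(geofacet)   # facet_geo()

# Tidyverse pieces used in the examples (attached explicitly to be safe)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
```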
We’ll be working with Annual Gross Domestic Product (GDP) time series data for each of the US States from the FRED database.
One State: Nebraska
We can get the data for one of the states by using tq_get() from the tidyquant package. The FRED code we will use is “NENGSP”, for Nebraska’s annual GDP. Set get = "economic.data" and supply a date range. By default, the returned values are named “price”, so we rename the column to “gdp”.
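Here is a sketch of that call. The from and to dates are assumptions (the text only says to supply a date range), and the request needs network access to FRED.

```r
library(tidyquant)
library(dplyr)

# Nebraska annual GDP from FRED; the date range below is an assumed example
ne_gdp <- tq_get("NENGSP",
                 get  = "economic.data",
                 from = "2007-01-01",
                 to   = "2017-06-01") %>%
    rename(gdp = price)   # tq_get() names the value column "price" by default

ne_gdp
```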
We’ll need the GDP data for all states to create the GDP by State forecast visualization. Here’s how to get it by scaling with tq_get().
Scaling to All 50 States
The structure of the FRED code begins with the state abbreviation, “NE” for Nebraska, followed by “NGSP”. This means we can pull the data for all states very easily by changing the first two characters.
We start by getting a data frame of state FRED codes and abbreviations. Conveniently, R ships with the state abbreviations stored in state.abb. The mutation just adds “NGSP” to the end of the abbreviation to get the FRED code. It’s really important that the code is in the first column so tq_get can scale the “getter”. The output is stored as states.
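A minimal sketch of that step, using only dplyr and the built-in state.abb vector. The column names fred_code and abbreviation follow the names used in the text.

```r
library(dplyr)

# One row per state; the FRED code goes in the first column so that
# tq_get() can treat it as the symbol to fetch
states <- tibble(abbreviation = state.abb) %>%
    mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
    select(fred_code, abbreviation)

states
```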
Next, we scale to pull the FRED data for all of the states by simply passing the states data frame to tq_get(). We format the output dropping the “fred_code” column, grouping on “abbreviation”, and renaming the “price” column to “gdp”. The result is stored in states_gdp.
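The scaled version might look like the sketch below. It rebuilds the states table so it can run on its own, and it assumes the same example date range as before (not stated in the text). It makes one FRED request per state, so it needs network access and takes a little while.

```r
library(tidyquant)
library(dplyr)

# FRED code first, abbreviation second
states <- tibble(abbreviation = state.abb) %>%
    mutate(fred_code = paste0(abbreviation, "NGSP")) %>%
    select(fred_code, abbreviation)

# Passing the data frame to tq_get() fetches every code in the first column
states_gdp <- states %>%
    tq_get(get = "economic.data", from = "2007-01-01", to = "2017-06-01") %>%
    select(-fred_code) %>%
    group_by(abbreviation) %>%
    rename(gdp = price)

states_gdp
```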
We’ll go through the process to show how sweep can help with tidying in the forecast workflow using the Nebraska GDP data, ne_gdp.
Convert to ts
The forecast package works with ts objects, so we’ll need to convert from a tibble (tidy data frame). Here’s how, using the timekit function tk_ts(). Supply a start date start = 2007 and frequency freq = 1 for 1 year to set up the ts object. Add silent = TRUE to skip the messages and warnings that the “date” column is being dropped (non-numeric columns are automatically dropped and the user is alerted by default).
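A sketch of the conversion. To keep it runnable without a FRED download, it uses a synthetic stand-in for ne_gdp (made-up values, not real GDP figures) that starts in 2007, the first year of the assumed data range. It also assumes a current installation where timekit is available as timetk, and it spells the frequency argument out in full.

```r
library(timetk)   # current name of the timekit package
library(tibble)

# Stand-in for ne_gdp: synthetic values with the same shape, NOT real FRED data
ne_gdp <- tibble(
    date = seq(as.Date("2007-01-01"), by = "year", length.out = 10),
    gdp  = 80000 + 2500 * (0:9) + 1500 * sin(0:9)
)

# Annual series starting in 2007; silent = TRUE suppresses the notice that
# the non-numeric "date" column is dropped
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)

ne_gdp_ts
has_timetk_idx(ne_gdp_ts)   # reports whether the original dates were kept
```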
Model with auto.arima
Now we can model. Let’s use the auto.arima() function from the forecast package. This function is really cool because it searches over candidate ARIMA models and selects the parameters for us, which makes it easier to get forecasts, especially at scale (discussed later). ;)
Optional: Apply a modeling tidier
Once we have a model, we can use the sweep tidiers: sw_tidy(), sw_glance() and sw_augment(). We’ll check out sw_glance() to get the model accuracy metrics.
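A sketch of the modeling and tidying steps, again on a synthetic stand-in series rather than the real Nebraska data. Which ARIMA model auto.arima() selects depends on the data, so no particular output is implied here.

```r
library(forecast)
library(sweep)

# Stand-in annual series (synthetic, NOT real GDP figures)
ne_gdp_ts <- ts(80000 + 2500 * (0:9) + 1500 * sin(0:9),
                start = 2007, frequency = 1)

# Let auto.arima() choose the model
fit_arima <- auto.arima(ne_gdp_ts)

# One-row summary: model description plus fit and accuracy metrics
sw_glance(fit_arima)

# One row per period with actual, fitted and residual values
sw_augment(fit_arima)
```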
Next, we create the forecast using the forecast() function from the forecast package. We’ll perform a three year forecast so set h = 3 for 3 periods.
Tidy the forecast with sw_sweep
Finally, the beauty of sweep: we can convert the forecast to a tidy data frame.
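The forecast and tidy steps can be sketched as follows, on a synthetic stand-in for ne_gdp. Two assumptions: the code targets a current installation, where timekit is named timetk and the sw_sweep() argument is spelled timetk_idx (the post’s timekit_idx is the older spelling), and the stand-in values are made up.

```r
library(forecast)
library(sweep)
library(timetk)   # current name of the timekit package
library(tibble)

# Stand-in for ne_gdp: synthetic values, NOT real FRED data
ne_gdp <- tibble(
    date = seq(as.Date("2007-01-01"), by = "year", length.out = 10),
    gdp  = 80000 + 2500 * (0:9) + 1500 * sin(0:9)
)
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)

# Three-period (three-year) forecast from the automatically selected model
fcast_arima <- forecast(auto.arima(ne_gdp_ts), h = 3)

# Tidy the forecast: actuals and forecasts stacked in one data frame,
# with a "key" column and prediction-interval columns
fcast_tbl <- sw_sweep(fcast_arima, timetk_idx = TRUE)

fcast_tbl
```

The “key” column distinguishes the observed rows from the forecast rows, which is what makes the result easy to plot or filter.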
And, we can easily visualize using ggplot2.
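One way the plot could be built is sketched below. It rebuilds a tidy forecast from a synthetic stand-in series so it runs on its own; the colors, labels and geoms are illustrative choices, not the styling of the original figure. It assumes the swept value column is named “gdp” after the input column and that timekit is installed under its current name, timetk.

```r
library(forecast)
library(sweep)
library(timetk)   # current name of the timekit package
library(tibble)
library(ggplot2)

# Stand-in for ne_gdp: synthetic values, NOT real FRED data
ne_gdp <- tibble(
    date = seq(as.Date("2007-01-01"), by = "year", length.out = 10),
    gdp  = 80000 + 2500 * (0:9) + 1500 * sin(0:9)
)
ne_gdp_ts <- tk_ts(ne_gdp, start = 2007, frequency = 1, silent = TRUE)
fcast_tbl <- sw_sweep(forecast(auto.arima(ne_gdp_ts), h = 3), timetk_idx = TRUE)

# Actuals and forecast as lines, 95% interval as a shaded band
# (interval columns are NA for the actual rows, so ggplot2 may warn about them)
p <- ggplot(fcast_tbl, aes(x = index, y = gdp, color = key)) +
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), fill = "grey80", color = NA) +
    geom_line() +
    geom_point() +
    labs(title = "GDP forecast (synthetic stand-in data)", x = "", y = "GDP")

p
```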
Now, onto a more sophisticated example.
State GDP Forecasting
Rather than one state, say we wanted to visualize the forecast of the annual GDP for all states so we can get a better understanding of trends. This is now much easier. The general steps are the same, but instead of individually managing each analysis we’ll use purrr to iterate through the 50 states keeping everything “tidy” in the process.
Start with states_gdp, which contains our data for all 50 states. Use nest() to create a nested data frame with the “date” and “gdp” inside a list column.
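A sketch of the nesting step. It uses a small synthetic stand-in for states_gdp (three states, made-up values) so that it runs without a FRED download; with the real data the result would have one row per state.

```r
library(dplyr)
library(tidyr)

# Stand-in for states_gdp: three states with synthetic values, NOT real FRED data
states_gdp <- tibble(
    abbreviation = rep(c("NE", "TX", "CA"), each = 10),
    date = rep(seq(as.Date("2007-01-01"), by = "year", length.out = 10), times = 3),
    gdp  = rep(c(80000, 1200000, 1900000), each = 10) *
        (1 + 0.03 * rep(0:9, times = 3) + 0.01 * sin(seq_len(30)))
) %>%
    group_by(abbreviation)

# nest() on a grouped data frame packs the non-grouping columns
# ("date" and "gdp") into a list column named "data"
states_gdp_nest <- states_gdp %>%
    nest()

states_gdp_nest
```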
Next, use map() to iteratively apply the tk_ts() function. Add the additional arguments freq = 1, start = 2007 and silent = TRUE. The new column, “data_ts”, contains the data converted to ts.
Third, use map() again, this time applying the auto.arima function. We can see that a new column is added called fit.
Optionally, we can run sw_glance() on each model to get the model accuracies.
Fourth, use map() to apply the forecast function, passing h = 3 as an additional argument.
Finally, use map() to apply the sw_sweep function, passing timekit_idx = TRUE (this gets dates instead of numbers) and rename_index = "date". We no longer need the other columns, so select the “abbreviation” and “sweep” columns and unnest(). Voilà, we have a nice tidy data frame of all of the state forecasts!
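Put together, the nested workflow can be sketched as below. It runs on a synthetic three-state stand-in for states_gdp (made-up values), and it assumes a current installation where timekit is named timetk and the sw_sweep() argument is spelled timetk_idx. The intermediate object names are illustrative.

```r
library(forecast)
library(sweep)
library(timetk)   # current name of the timekit package
library(dplyr)
library(tidyr)
library(purrr)

# Stand-in for states_gdp: three states with synthetic values, NOT real FRED data
states_gdp <- tibble(
    abbreviation = rep(c("NE", "TX", "CA"), each = 10),
    date = rep(seq(as.Date("2007-01-01"), by = "year", length.out = 10), times = 3),
    gdp  = rep(c(80000, 1200000, 1900000), each = 10) *
        (1 + 0.03 * rep(0:9, times = 3) + 0.01 * sin(seq_len(30)))
) %>%
    group_by(abbreviation)

# Nest, convert each state's data to ts, then fit one model per state
states_fit <- states_gdp %>%
    nest() %>%
    ungroup() %>%
    mutate(
        data_ts = map(data, tk_ts, start = 2007, frequency = 1, silent = TRUE),
        fit     = map(data_ts, auto.arima)
    )

# Optional: one row of accuracy metrics per state
states_glance <- states_fit %>%
    mutate(glance = map(fit, sw_glance)) %>%
    select(abbreviation, glance) %>%
    unnest(glance)

# Forecast three periods ahead, tidy each forecast, and unnest
states_gdp_sweep <- states_fit %>%
    mutate(
        fcast = map(fit, forecast, h = 3),
        sweep = map(fcast, sw_sweep, timetk_idx = TRUE, rename_index = "date")
    ) %>%
    select(abbreviation, sweep) %>%
    unnest(sweep)

states_gdp_sweep
```

Keeping every intermediate result in a list column means each state’s data, model and forecast stay on the same row, so nothing has to be matched up by hand afterwards.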
As an added bonus, we can use the facet_geo() function from the geofacet package to visualize the trend and forecast for each state. From the output it looks like most of the states are increasing, but there are a few with more volatile trends. It might be interesting to investigate what’s causing the deviations in the Midwest and South. Possibly related to the recent recession in oil and gas?
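A sketch of how such a geographic facet plot could be assembled. It generates synthetic series for all 50 states (made-up values, so the shapes say nothing about real state GDP) and runs the same forecast pipeline before plotting. The styling is illustrative, and the code assumes timekit is installed under its current name, timetk.

```r
library(forecast)
library(sweep)
library(timetk)   # current name of the timekit package
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(geofacet)

# Stand-in for states_gdp: synthetic series for all 50 states, NOT real FRED data
states_gdp <- tibble(
    abbreviation = rep(state.abb, each = 10),
    date = rep(seq(as.Date("2007-01-01"), by = "year", length.out = 10), times = 50),
    gdp  = 1000 * (1 + 0.03 * rep(0:9, times = 50) + 0.02 * sin(seq_len(500)))
)

# Same workflow as for a single state, applied per state
states_gdp_sweep <- states_gdp %>%
    group_by(abbreviation) %>%
    nest() %>%
    ungroup() %>%
    mutate(
        data_ts = map(data, tk_ts, start = 2007, frequency = 1, silent = TRUE),
        fcast   = map(data_ts, ~ forecast(auto.arima(.x), h = 3)),
        sweep   = map(fcast, sw_sweep, timetk_idx = TRUE, rename_index = "date")
    ) %>%
    select(abbreviation, sweep) %>%
    unnest(sweep)

# One panel per state, arranged roughly like a US map
p <- ggplot(states_gdp_sweep, aes(x = date, y = gdp, color = key)) +
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), fill = "grey80", color = NA) +
    geom_line() +
    facet_geo(~ abbreviation, scales = "free_y") +
    labs(title = "GDP forecast by state (synthetic stand-in data)", x = "", y = "GDP")

p
```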
The sweep package is a great way to “tidy” the forecast package. It has several functions that tidy model output (sw_tidy, sw_glance, and sw_augment) and forecast output (sw_sweep). A big advantage is that the dates can be kept through the entire process since sweep uses timekit under the hood. If you use the forecast package and love the tidyverse, give sweep a try!