Tidy Time Series Forecasting in R with Spark
Written by Matt Dancho
I’m SUPER EXCITED to show fellow time-series enthusiasts a new way that we can scale time series analysis using an amazing technology called
Without Spark, large-scale forecasting projects of 10,000 time series can take days to run because of long-running for-loops and the need to test many models on each time series.
Spark has been widely accepted as a “big data” solution, and we’ll use it to scale-out (distribute) our time series analysis to Spark Clusters, and run our analysis in parallel.
Spark is an amazing technology for processing large-scale data science workloads.
Modeltime is a state-of-the-art forecasting library that I personally developed for “Tidy Forecasting” in R.
Modeltime now integrates a
Spark Backend with capability of forecasting 10,000+ time series using distributed Spark Clusters.
I show an introductory tutorial to get you started. Readers can sign up for a Free Live Training using Modeltime and Spark where we’ll cover more detail that we couldn’t cover during this introductory tutorial including:
- Forecasting Ensembles (New Capability)
- More Advanced Time Series Algorithms
- Feature Engineering at Scale
- Special Offers on Courses
Free Training: 3-Tips to Scale Forecasting
As mentioned above, we are hosting a Free Training: 3-Tips to Scale Forecasting. This will be a 1-hour full-code training with 3 tips to scale your forecasting.
Click image for Free Time Series Training
Register Here for Free Training
What is Spark?
I feel like I’ve unlocked infinite power.
If you’re like me, you have heard this term, “Spark”, but maybe you haven’t tried it yet. That was me about 2-months ago… Boy, am I glad I tried it. I feel like I’ve unlocked infinite power.
Spark’s Best Qualities
According to Spark’s website, Spark is a “unified analytics engine for large-scale data processing.”
This means that Spark has the following amazing qualities:
Spark is Unified: We can use different languages like Python, R, SQL and Java to interact with Spark.
Spark is made for Large-Scale: We can run workloads 100X faster versus Hadoop (according to Spark).
How Spark Works
Here’s a picture showing how Spark works.
How Spark Works
Spark runs on Java Virtual Machines (JVMs), which allow Spark to run and distribute workloads very fast.
But most of us aren’t Java Programmers (we’re R and Python programmers), so Spark has an interface to R and Python.
This means we can communicate with Spark via R, and send the work to Spark Executors all from the comfort of R. Boom. 💥
What is Modeltime?
Modeltime is a time series forecasting framework for “tidy time series forecasting”.
Tidymodels is like Scikit Learn, but better.
The “Tidy” in “Tidy time series forecasting” is because
modeltime builds on top of
tidymodels, a collection of packages for modeling and machine learning using
I equate Tidymodels in R to Scikit Learn in Python… Tidymodels is like Scikit Learn, but better. It’s simply easier to use especially when it comes to feature engineering (very important for time series), and I become more productive because of it.
Tidy Time Series Forecasting
modeltime builds on top of Tidymodels, any user that learns
tidymodels can use
The Modeltime Spark Backend
Now for the best part - the highlight of your day!
Modeltime integrates Spark.
YES! I said it people.
I’ve just upgraded Modeltime’s parallel processing backend so now you can swap out your local machine for Spark Clusters. This means you can:
Run Spark Locally - Ok, this is cool but doesn’t get me much beyond normal parallel processing. What else ya got, Matt?
Run Spark in the Cloud - Ahhh, this is where Matt was going. Now we’re talking.
Run Spark on Databricks - Bingo! Many enterprises are adopting Databricks as their data engineering solution. So now Matt’s saying I can run Modeltime in the cloud using databricks! Suh. Weet!
Yes! And you are now no longer limited by your CPU cores for parallel processing. You can easily scale modeltime to as many Spark Clusters as your company can afford.
Ok, on to the forecasting tutorial!
Spark Forecasting Tutorial
We’ll run through a short forecasting example to get you started. Refer to the FREE Time Series with Spark Training for how to perform iterative forecasting with larger datasets using
modeltime and 3-tips for scalable forecasting (beyond what we could cover in this tutorial).
One of the most common situations that parallel computation is required is when doing iterative forecasting where the data scientist needs to experiment with 10+ models across 10,000+ time series. It’s common for this large-scale, high-performance forecasting exercise to take days.
Iterative (Nested) Forecasting with Modeltime
We’ll show you how we can combine Modeltime “Nested Forecasting” and it’s Parallel Spark Backend to scale this computation with distributed parallel Spark execution.
Matt said “Nested”… What’s Nested?
A term that you’ll hear me use frequently is “nested”.
What I’m referring to is the “nested” data structure, which come from
tidyr package that allows us to organize data frames as lists inside data frames. This is a powerful feature of the
tidyverse that allows us for modeling at scale.
The book, “R for Data Science”, has a FULL Chapter on Many Models (and Nested Data). This is a great resource for those that want more info on Nested Data and Modeling (bookmark it ✅).
OK, let’s go!
This tutorial requires:
sparklyr, modeltime, and tidyverse: Make sure you have these R libraries installed (use
install.packages(c("sparklyr", "modeltime", "tidyverse"), dependencies = TRUE)). If this is a fresh install, it may take a bit. Just be patient. ☕
Java: Spark installation depends on Java being installed. Download Java here.
Spark Installation: Can be accomplished via
sparklyr::spark_install() provided the user has
sparklyr and Java installed (see Steps 1 and 2).
Load the following libraries.
Next, we set up a Spark connection via
sparklyr. For this tutorial, we use the “local” connection. But many users will use Databricks to scale the forecasting workload.
To run Spark locally:
If using Databricks, you can use:
Setup the Spark Backend
Next, we register the Spark Backend using
parallel_start(sc, .method = "spark"). This is a helper to set up the
registerDoSpark() foreach adaptor. In layman’s terms, this just means that we can now run parallel using Spark.
Data Preparation (for Nested Forecasting)
The dataset we’ll be forecasting is the
walmart_sales_weekly, which we modify to just include 3 columns: “id”, “date”, “value”.
- The id feature is the grouping variable.
- The date feature contains timestamps.
- The value feature is the sales value for the Walmart store-department combination.
We prepare as nested data using the Nested Forecasting preparation functions.
extend_timeseries(): This extends each one of our time series into the future by 52 timestamps (this is one year for our weekly data set).
nest_timeseries(): This converts our data to the nested data format indicating that our future data will be the last 52 timestamps (that we just extended).
split_nested_timeseries(): This adds indicies for the train / test splitting so we can develop accuracy metrics and determine which model to use for which time series.
Key Concept: Nested Data
You’ll notice our data frame (tibble) is only 7 rows and 4 columns. This is because we’ve nested our data.
If you examine each of the rows, you’ll notice each row is an ID. And we have “tibbles” in each of the other columns.
That is nested data!
We’ll create two unfitted models: XGBoost and Prophet. Then we’ll use
modeltime_nested_fit() to iteratively fit the models to each of the time series using the Spark Backend.
Model 1: XGBoost
We create the XGBoost model on features derived from the date column. This gets a bit complicated because we are adding
recipes to process the data. Basically, we are creating a bunch of features from the date column in each of the time series.
First, we use
extract_nested_train_split(nested_data_tbl, 1) to extract the first time series, so we can begin to create a “recipe”
Once we develop a recipe, we add “steps” that build features. We start by creating timeseries signature features from the date column. Then we remove “date” and further process the signature features.
Then we create an XGBoost model by developing a tidymodels “workflow”. The workflow combines a model (
boost_tree() in this case) with a recipe that we previously created.
The output is an unfitted workflow that will be applied to all 7 of our timeseries.
Next, we create a prophet workflow. The process is actually simpler than XGBoost because Prophet doesn’t require all of the preprocessing recipe steps that XGBoost does.
Nested Forecasting with Spark
Now, the beauty is that everything is set up for us to perform the nested forecasting with Spark. We simply use
modeltime_nested_fit() and make sure it uses the Spark Backend by setting
control_nested_fit(allow_par = TRUE).
Note that this will take about 20-seconds because we have a one-time cost to move data, libraries, and environment variables to the Spark clusters. But the good news is that when we scale up to 10,000+ time series, that the one-time cost is minimal compared to the speed up from distributed computation.
The nested modeltime object has now fit the models using Spark. You’ll see a new column added to our nested data with the name “.modeltime_tables”. This contains 2 fitted models (one XGBoost and one Prophet) for each time series.
Model Test Accuracy
We can observe the results. First, we can check the accuracy for each model.
Next, we can examine the test forecast for each of the models.
We can use
extract_nested_test_forecast() to extract the logged forecast and visualize how each did.
group_by(id) then pipe (
plot_modeltime_forecast() to make the visualization (great for reports and shiny apps!)
More we didn’t cover
There’s a lot more we didn’t cover. We actually didn’t:
- Select the best models for each of the 7 time series
- Improve performance with ensembles
- Forecast the future
I’ll cover these and much more in the Free Live Training on Forecasting with Spark.
Make sure to sign up and you’ll get 3 immediately actionable tips and a full code tutorial that goes well beyond this introduction.
Plus you’ll get exclusive offers on our courses and get to ask me (Matt, the creator of Modeltime) questions!
Close Clusters and Shutdown Spark
We can close the Spark adapter and shut down the Spark session when we are finished.
We’ve now successfully completed an Forecast with Spark. You may find this challenging, especially if you are not familiar with the Modeltime Workflow, terminology, or tidymodeling in R. If this is the case, we have a solution. Take our high-performance forecasting course.
Take the High-Performance Forecasting Course
Become the forecasting expert for your organization
High-Performance Time Series Course
Time Series is Changing
Time series is changing. Businesses now need 10,000+ time series forecasts every day. This is what I call a High-Performance Time Series Forecasting System (HPTSF) - Accurate, Robust, and Scalable Forecasting.
High-Performance Forecasting Systems will save companies by improving accuracy and scalability. Imagine what will happen to your career if you can provide your organization a “High-Performance Time Series Forecasting System” (HPTSF System).
How to Learn High-Performance Time Series Forecasting
I teach how to build a HPTFS System in my High-Performance Time Series Forecasting Course. You will learn:
- Time Series Machine Learning (cutting-edge) with
Modeltime - 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
- Deep Learning with
GluonTS (Competition Winners)
- Time Series Preprocessing, Noise Reduction, & Anomaly Detection
- Feature engineering using lagged variables & external regressors
- Hyperparameter Tuning
- Time series cross-validation
- Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner)
- Scalable Forecasting - Forecast 1000+ time series in parallel
- and more.
Become the Time Series Expert for your organization.
Take the High-Performance Time Series Forecasting Course