Tidy Time Series Forecasting in R with Spark
Written by Matt Dancho
Iβm SUPER EXCITED to show fellow time-series enthusiasts a new way that we can scale time series analysis using an amazing technology called Spark
!
Without Spark, large-scale forecasting projects of 10,000 time series can take days to run because of long-running for-loops and the need to test many models on each time series.
Spark has been widely accepted as a βbig dataβ solution, and weβll use it to scale-out (distribute) our time series analysis to Spark Clusters, and run our analysis in parallel.
TLDR
-
Spark
is an amazing technology for processing large-scale data science workloads.
-
Modeltime
is a state-of-the-art forecasting library that I personally developed for βTidy Forecastingβ in R. Modeltime
now integrates a Spark
Backend with capability of forecasting 10,000+ time series using distributed Spark Clusters.
-
I show an introductory tutorial to get you started. Readers can sign up for a Free Live Training using Modeltime and Spark where weβll cover more detail that we couldnβt cover during this introductory tutorial including:
- Forecasting Ensembles (New Capability)
- More Advanced Time Series Algorithms
- Feature Engineering at Scale
- Special Offers on Courses
Free Training: 3-Tips to Scale Forecasting
As mentioned above, we are hosting a Free Training: 3-Tips to Scale Forecasting. This will be a 1-hour full-code training with 3 tips to scale your forecasting.
Click image for Free Time Series Training
Register Here for Free Training
What is Spark?
I feel like Iβve unlocked infinite power.
If youβre like me, you have heard this term, βSparkβ, but maybe you havenβt tried it yet. That was me about 2-months agoβ¦ Boy, am I glad I tried it. I feel like Iβve unlocked infinite power.
Sparkβs Best Qualities
According to Sparkβs website, Spark is a βunified analytics engine for large-scale data processing.β
This means that Spark has the following amazing qualities:
-
Spark is Unified: We can use different languages like Python, R, SQL and Java to interact with Spark.
-
Spark is made for Large-Scale: We can run workloads 100X faster versus Hadoop (according to Spark).
How Spark Works
Hereβs a picture showing how Spark works.
How Spark Works
-
Spark runs on Java Virtual Machines (JVMs), which allow Spark to run and distribute workloads very fast.
-
But most of us arenβt Java Programmers (weβre R and Python programmers), so Spark has an interface to R and Python.
-
This means we can communicate with Spark via R, and send the work to Spark Executors all from the comfort of R. Boom. π₯
What is Modeltime?
Modeltime is a time series forecasting framework for βtidy time series forecastingβ.
Tidy Modeling
Tidymodels is like Scikit Learn, but better.
The βTidyβ in βTidy time series forecastingβ is because modeltime
builds on top of tidymodels
, a collection of packages for modeling and machine learning using tidyverse
principles.
I equate Tidymodels in R to Scikit Learn in Pythonβ¦ Tidymodels is like Scikit Learn, but better. Itβs simply easier to use especially when it comes to feature engineering (very important for time series), and I become more productive because of it.
tidymodels.org
Tidy Time Series Forecasting
Because modeltime
builds on top of Tidymodels, any user that learns tidymodels
can use modeltime
.
The Modeltime Spark Backend
Now for the best part - the highlight of your day!
Modeltime integrates Spark.
YES! I said it people.
Iβve just upgraded Modeltimeβs parallel processing backend so now you can swap out your local machine for Spark Clusters. This means you can:
-
Run Spark Locally - Ok, this is cool but doesnβt get me much beyond normal parallel processing. What else ya got, Matt?
-
Run Spark in the Cloud - Ahhh, this is where Matt was going. Now weβre talking.
-
Run Spark on Databricks - Bingo! Many enterprises are adopting Databricks as their data engineering solution. So now Mattβs saying I can run Modeltime in the cloud using databricks! Suh. Weet!
Yes! And you are now no longer limited by your CPU cores for parallel processing. You can easily scale modeltime to as many Spark Clusters as your company can afford.
Ok, on to the forecasting tutorial!
Spark Forecasting Tutorial
Weβll run through a short forecasting example to get you started. Refer to the FREE Time Series with Spark Training for how to perform iterative forecasting with larger datasets using modeltime
and 3-tips for scalable forecasting (beyond what we could cover in this tutorial).
Iterative Forecasting
One of the most common situations that parallel computation is required is when doing iterative forecasting where the data scientist needs to experiment with 10+ models across 10,000+ time series. Itβs common for this large-scale, high-performance forecasting exercise to take days.
Iterative (Nested) Forecasting with Modeltime
Weβll show you how we can combine Modeltime βNested Forecastingβ and itβs Parallel Spark Backend to scale this computation with distributed parallel Spark execution.
Matt said βNestedββ¦ Whatβs Nested?
A term that youβll hear me use frequently is βnestedβ.
What Iβm referring to is the βnestedβ data structure, which come from tidyr
package that allows us to organize data frames as lists inside data frames. This is a powerful feature of the tidyverse
that allows us for modeling at scale.
The book, βR for Data Scienceβ, has a FULL Chapter on Many Models (and Nested Data). This is a great resource for those that want more info on Nested Data and Modeling (bookmark it β
).
OK, letβs go!
System Requirements
This tutorial requires:
-
sparklyr, modeltime, and tidyverse: Make sure you have these R libraries installed (use install.packages(c("sparklyr", "modeltime", "tidyverse"), dependencies = TRUE)
). If this is a fresh install, it may take a bit. Just be patient. β
-
Java: Spark installation depends on Java being installed. Download Java here.
-
Spark Installation: Can be accomplished via sparklyr::spark_install()
provided the user has sparklyr
and Java installed (see Steps 1 and 2).
Libraries
Load the following libraries.
Spark Connection
Next, we set up a Spark connection via sparklyr
. For this tutorial, we use the βlocalβ connection. But many users will use Databricks to scale the forecasting workload.
To run Spark locally:
If using Databricks, you can use:
Setup the Spark Backend
Next, we register the Spark Backend using parallel_start(sc, .method = "spark")
. This is a helper to set up the registerDoSpark()
foreach adaptor. In laymanβs terms, this just means that we can now run parallel using Spark.
Data Preparation (for Nested Forecasting)
The dataset weβll be forecasting is the walmart_sales_weekly
, which we modify to just include 3 columns: βidβ, βdateβ, βvalueβ.
- The id feature is the grouping variable.
- The date feature contains timestamps.
- The value feature is the sales value for the Walmart store-department combination.
We prepare as nested data using the Nested Forecasting preparation functions.
-
extend_timeseries()
: This extends each one of our time series into the future by 52 timestamps (this is one year for our weekly data set).
-
nest_timeseries()
: This converts our data to the nested data format indicating that our future data will be the last 52 timestamps (that we just extended).
-
split_nested_timeseries()
: This adds indicies for the train / test splitting so we can develop accuracy metrics and determine which model to use for which time series.
Key Concept: Nested Data
Youβll notice our data frame (tibble) is only 7 rows and 4 columns. This is because weβve nested our data.
If you examine each of the rows, youβll notice each row is an ID. And we have βtibblesβ in each of the other columns.
That is nested data!
Modeling
Weβll create two unfitted models: XGBoost and Prophet. Then weβll use modeltime_nested_fit()
to iteratively fit the models to each of the time series using the Spark Backend.
Model 1: XGBoost
We create the XGBoost model on features derived from the date column. This gets a bit complicated because we are adding recipes
to process the data. Basically, we are creating a bunch of features from the date column in each of the time series.
-
First, we use extract_nested_train_split(nested_data_tbl, 1)
to extract the first time series, so we can begin to create a βrecipeβ
-
Once we develop a recipe, we add βstepsβ that build features. We start by creating timeseries signature features from the date column. Then we remove βdateβ and further process the signature features.
-
Then we create an XGBoost model by developing a tidymodels βworkflowβ. The workflow combines a model (boost_tree()
in this case) with a recipe that we previously created.
-
The output is an unfitted workflow that will be applied to all 7 of our timeseries.
Prophet
Next, we create a prophet workflow. The process is actually simpler than XGBoost because Prophet doesnβt require all of the preprocessing recipe steps that XGBoost does.
Nested Forecasting with Spark
Now, the beauty is that everything is set up for us to perform the nested forecasting with Spark. We simply use modeltime_nested_fit()
and make sure it uses the Spark Backend by setting control_nested_fit(allow_par = TRUE)
.
Note that this will take about 20-seconds because we have a one-time cost to move data, libraries, and environment variables to the Spark clusters. But the good news is that when we scale up to 10,000+ time series, that the one-time cost is minimal compared to the speed up from distributed computation.
The nested modeltime object has now fit the models using Spark. Youβll see a new column added to our nested data with the name β.modeltime_tablesβ. This contains 2 fitted models (one XGBoost and one Prophet) for each time series.
Model Test Accuracy
We can observe the results. First, we can check the accuracy for each model.
id |
.model_id |
.model_desc |
.type |
mae |
mape |
mase |
smape |
rmse |
rsq |
1_1 |
1 |
XGBOOST |
Test |
4552.91 |
18.14 |
0.90 |
17.04 |
7905.80 |
0.39 |
1_1 |
2 |
PROPHET |
Test |
4078.01 |
17.53 |
0.80 |
17.58 |
5984.31 |
0.65 |
1_3 |
1 |
XGBOOST |
Test |
1556.36 |
10.51 |
0.60 |
11.12 |
2361.35 |
0.94 |
1_3 |
2 |
PROPHET |
Test |
2126.56 |
13.15 |
0.83 |
13.91 |
3806.99 |
0.82 |
1_8 |
1 |
XGBOOST |
Test |
3573.49 |
9.31 |
1.52 |
9.87 |
4026.58 |
0.13 |
1_8 |
2 |
PROPHET |
Test |
3068.24 |
8.00 |
1.31 |
8.38 |
3639.60 |
0.00 |
1_13 |
1 |
XGBOOST |
Test |
2626.57 |
6.53 |
0.97 |
6.78 |
3079.24 |
0.43 |
1_13 |
2 |
PROPHET |
Test |
3367.92 |
8.26 |
1.24 |
8.73 |
4006.58 |
0.11 |
1_38 |
1 |
XGBOOST |
Test |
7865.35 |
9.62 |
0.67 |
10.14 |
10496.79 |
0.24 |
1_38 |
2 |
PROPHET |
Test |
16120.88 |
19.73 |
1.38 |
22.49 |
18856.72 |
0.05 |
1_93 |
1 |
XGBOOST |
Test |
6838.18 |
8.49 |
0.69 |
9.05 |
8860.59 |
0.48 |
1_93 |
2 |
PROPHET |
Test |
7304.08 |
10.00 |
0.74 |
9.57 |
9024.07 |
0.03 |
1_95 |
1 |
XGBOOST |
Test |
5678.51 |
4.59 |
0.68 |
4.72 |
7821.85 |
0.57 |
1_95 |
2 |
PROPHET |
Test |
5856.69 |
4.87 |
0.71 |
4.81 |
7540.48 |
0.49 |
Test Forecast
Next, we can examine the test forecast for each of the models.
-
We can use extract_nested_test_forecast()
to extract the logged forecast and visualize how each did.
-
We group_by(id)
then pipe (%>%
) into plot_modeltime_forecast()
to make the visualization (great for reports and shiny apps!)
More we didnβt cover
Thereβs a lot more we didnβt cover. We actually didnβt:
- Select the best models for each of the 7 time series
- Improve performance with ensembles
- Forecast the future
Iβll cover these and much more in the Free Live Training on Forecasting with Spark.
Make sure to sign up and youβll get 3 immediately actionable tips and a full code tutorial that goes well beyond this introduction.
Plus youβll get exclusive offers on our courses and get to ask me (Matt, the creator of Modeltime) questions!
Close Clusters and Shutdown Spark
We can close the Spark adapter and shut down the Spark session when we are finished.
Summary
Weβve now successfully completed an Forecast with Spark. You may find this challenging, especially if you are not familiar with the Modeltime Workflow, terminology, or tidymodeling in R. If this is the case, we have a solution. Take our high-performance forecasting course.
Become the forecasting expert for your organization
High-Performance Time Series Course
Time Series is Changing
Time series is changing. Businesses now need 10,000+ time series forecasts every day. This is what I call a High-Performance Time Series Forecasting System (HPTSF) - Accurate, Robust, and Scalable Forecasting.
High-Performance Forecasting Systems will save companies by improving accuracy and scalability. Imagine what will happen to your career if you can provide your organization a βHigh-Performance Time Series Forecasting Systemβ (HPTSF System).
I teach how to build a HPTFS System in my High-Performance Time Series Forecasting Course. You will learn:
- Time Series Machine Learning (cutting-edge) with
Modeltime
- 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more)
- Deep Learning with
GluonTS
(Competition Winners)
- Time Series Preprocessing, Noise Reduction, & Anomaly Detection
- Feature engineering using lagged variables & external regressors
- Hyperparameter Tuning
- Time series cross-validation
- Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner)
- Scalable Forecasting - Forecast 1000+ time series in parallel
- and more.
Become the Time Series Expert for your organization.
Take the High-Performance Time Series Forecasting Course