Time Series Analysis: KERAS LSTM Deep Learning  Part 2
Written by Sigrid Keydana, Matt Dancho on July 1, 2018
One of the ways Deep Learning can be used in business is to improve the accuracy of time series forecasts (prediction). We recently showed how a Long Short Term Memory (LSTM) Models developed with the Keras library in R could be used to take advantage of autocorrelation to predict the next 10 years of monthly Sunspots (a solar phenomenon that’s tracked by NASA). In this article, we teamed up with RStudio to take another look at the Sunspots data set, this time implementing some really advanced Deep Learning functionality available with TensorFlow for R. Sigrid Keydana, TF Developer Advocate at RStudio, put together an amazing Deep Learning tutorial using keras
for implementing Keras in R and tfruns
, a suite of tools for trackingtracking, visualizing, and managing TensorFlow training runs and experiments from R. Sounds amazing, right? It is! Let’s get started with this KERAS LSTM Deep Learning Tutorial!
Articles In This Series

Time Series Deep Learning, Part 1: Forecasting Sunspots With Keras Stateful LSTM In R  Shows the a number of powerful time series deep learning techniques such as how to use autocorrelation with an LSTM, how to backtest time series, and more!

Time Series Deep Learning, Part 2: Predicting Sunspot Frequency with Keras LSTM In R  Matt teamed up with Sigrid Keydana (TF Dev Advocate at RStudio) to develop a stateoftheart TensorFlow model using
keras
andtfruns
. You can also find this article on RStudio’s TensorFlow Blog.
Learning Trajectory
In this DEEP LEARNING TUTORIAL, you will learn:
In fact, one of the coolest things you’ll develop is this plot of backtested LSTM forecasts.
Backtested LSTM Forecasts
Interested in Learning Data Science For Business?
Sign up for our free "5 Topic Friday" Newsletter. Every week, I'll send you the five coolest topics in data science for business that I've found that week. These could be new R packages, free books, or just some fun to end the week on.
Sign Up For FiveTopicFriday!
Time Series Deep Learning In Business
Introduction by Matt Dancho, Founder of Business Science
Time Series Forecasting is a key area that can lead to Return On Investment (ROI) in a business. Think about this: A 10% improvement in forecast accuracy can save an organization millions of dollars. How is this possible? Let’s find out.
We’ll take NVIDIA, a semiconductor manufacturer that manufactures stateoftheart chips for Artificial Intelligence (AI) and Deep Learning (DL), as an example. NVIDIA builds Graphics Processing Unitis or GPUs, which are necessary for the computational intensitity resulting from the massive number of numerical calculations required in highperformance Deep Learning. The chips look like this.
Source: NVIDIA USA
Like all manufacturers, NVIDIA needs to forecast demand for their products. Why? So they can build the right amount of chips to supply their customers. This forecast is critical and it takes a lot of skill and some luck to get this right.
What we are talking about is the Sales Forecast, which drives every manufacturing decision that NVIDIA makes. This includes how much raw material to purchase, how many people to have on staff to build the chips, and how much machining and assembly operations to budget. The more error in the sales forecast, the more unnecessary cost NVIDIA experiences because all of these activities (supply chain, inventory management, financial planning, etc) get thrown off!
Time Series Deep Learning For Business
Time Series Deep Learning is amazingly accurate on data that has a high presence of autocorrelation because algorithms including LSTMs and GRUs can learn information from sequences regardless of when the pattern occurred. These special RNNs are designed to have a long memory meaning that they excel at learning patterns from both recent observations and observeration that occurred long ago. This makes them perfect for time series! But are they great for sales data? Maybe. Let’s discuss.
Sales data comes in all flavors, but often it has seasonal patterns and trend. The trend can be flat, linear, exponential, etc. This is typically not where the LSTM will have success, but other traditional forecasting methods can detect trend. However, seasonality is different. Seasonality in sales data is a pattern that can happen at multiple frequencies (annual, quarterly, monthly, weekly, and even daily). The LSTM is great at detecting seasonality because it often has autocorrelation. Therefore, LSTM’s and GRU’s can be great options to help improve seasonality detection, and therefore reduce overall forecast error in the Sales Forecast.
About The Authors
This DEEP LEARNING TUTORIAL was a combined effort: Sigrid Keydana and Matt Dancho.
Sigrid is the TensorFlow Developer Advocate at RStudio, where she develops amazing deep learning using the R TensorFlow API. She’s passionate about deep learning, machine learning and statistics, R, Linux and functional programming. In a short period of time, I’ve developed a tremendous respect for Sigrid. You’ll see why when you walk through the R TensorFlow code in this tutorial
You know me (Matt). I’m the Founder of Business Science where I strive for one mission: To empower Data scientists interested in Business and Finance.
LSTM Deep Learning for Time Series Forecasting: Predicting Sunspot Frequency with Keras
By Sigrid Keydana, TensorFlow Developer Advocate at RStudio, And Matt Dancho, Founder of Business Science
Forecasting sunspots with deep learning
In this post we will examine making time series predictions using the sunspots dataset that ships with base R. Sunspots are dark spots on the sun, associated with lower temperature. Here’s an image from NASA showing the solar phenomenon.
Source: NASA
We’re using the monthly version of the dataset, sunspots.month
(there is a yearly version, too).
It contains 265 years worth of data (from 1749 through 2013) on the number of sunspots per month.
Forecasting this dataset is challenging because of high short term variability as well as longterm irregularities evident in the cycles. For example, maximum amplitudes reached by the low frequency cycle differ a lot, as does the number of high frequency cycle steps needed to reach that maximum low frequency cycle height.
Our post will focus on two dominant aspects: how to apply deep learning to time series forecasting, and how to properly apply cross validation in this domain. For the latter, we will use the rsample package that allows to do resampling on time series data. As to the former, our goal is not to reach utmost performance but to show the general course of action when using recurrent neural networks to model this kind of data.
Recurrent neural networks
When our data has a sequential structure, it is recurrent neural networks (RNNs) we use to model it.
As of today, among RNNs, the best established architectures are the GRU (Gated Recurrent Unit) and the LSTM (Long Short Term Memory). For today, let’s not zoom in on what makes them special, but on what they have in common with the most strippeddown RNN: the basic recurrence structure.
In contrast to the prototype of a neural network, often called Multilayer Perceptron (MLP), the RNN has a state that is carried on over time. This is nicely seen in this diagram from Goodfellow et al., a.k.a. the “bible of deep learning”:
At each time, the state is a combination of the current input and the previous hidden state. This is reminiscent of autoregressive models, but with neural networks, there has to be some point where we halt the dependence.
That’s because in order to determine the weights, we keep calculating how our loss changes as the input changes. Now if the input we have to consider, at an arbitrary timestep, ranges back indefinitely  then we will not be able to calculate all those gradients. In practice, then, our hidden state will, at every iteration, be carried forward through a fixed number of steps.
We’ll come back to that as soon as we’ve loaded and preprocessed the data.
Setup, preprocessing, and exploration
Libraries
Here, first, are the libraries needed for this tutorial.
If you have not previously run Keras in R, you will need to install Keras using the install_keras()
function.
Data
sunspot.month
is a ts
class (not tidy), so we’ll convert to a tidy data set using the tk_tbl()
function from timetk
. We use this instead of as.tibble()
from tibble
to automatically preserve the time series index as a zoo
yearmon
index. Last, we’ll convert the zoo
index to date using lubridate::as_date()
(loaded with tidyquant
) and then change to a tbl_time
object to make time series operations easier.
# A time tibble: 3,177 x 2
# Index: index
index value
<date> <dbl>
1 17490101 58
2 17490201 62.6
3 17490301 70
4 17490401 55.7
5 17490501 85
6 17490601 83.5
7 17490701 94.8
8 17490801 66.3
9 17490901 75.9
10 17491001 75.5
# ... with 3,167 more rows
Exploratory data analysis
The time series is long (265 years!). We can visualize the time series both in full, and zoomed in on the first 10 years to get a feel for the series.
Visualizing sunspot data with cowplot
We’ll make two ggplot
s and combine them using cowplot::plot_grid()
. Note that for the zoomed in plot, we make use of tibbletime::time_filter()
, which is an easy way to perform timebased filtering.
Backtesting: time series cross validation
When doing cross validation on sequential data, the time dependencies on preceding samples must be preserved. We can create a cross validation sampling plan by offsetting the window used to select sequential subsamples. In essence, we’re creatively dealing with the fact that there’s no future test data available by creating multiple synthetic “futures”  a process often, esp. in finance, called “backtesting”.
As mentioned in the introduction, the rsample package includes facitlities for backtesting on time series. The vignette, “Time Series Analysis Example”, describes a procedure that uses the rolling_origin()
function to create samples designed for time series cross validation. We’ll use this approach.
Developing a backtesting strategy
The sampling plan we create uses 50 years (initial
= 12 x 50 samples) for the training set and ten years (assess
= 12 x 10) for the testing (validation) set. We select a skip
span of about twenty years (skip
= 12 x 20  1) to approximately evenly distribute the samples into 6 sets that span the entire 265 years of sunspots history. Last, we select cumulative = FALSE
to allow the origin to shift which ensures that models on more recent data are not given an unfair advantage (more observations) over those operating on less recent data. The tibble return contains the rolling_origin_resamples
.
# Rolling origin forecast resampling
# A tibble: 6 x 2
splits id
<list> <chr>
1 <S3: rsplit> Slice1
2 <S3: rsplit> Slice2
3 <S3: rsplit> Slice3
4 <S3: rsplit> Slice4
5 <S3: rsplit> Slice5
6 <S3: rsplit> Slice6
Visualizing the backtesting strategy
We can visualize the resamples with two custom functions. The first, plot_split()
, plots one of the resampling splits using ggplot2
. Note that an expand_y_axis
argument is added to expand the date range to the full sun_spots
dataset date range. This will become useful when we visualize all plots together.
The plot_split()
function takes one split (in this case Slice01), and returns a visual of the sampling strategy. We expand the axis to the range for the full dataset using expand_y_axis = TRUE
.
The second function, plot_sampling_plan()
, scales the plot_split()
function to all of the samples using purrr
and cowplot
.
We can now visualize the entire backtesting strategy with plot_sampling_plan()
. We can see how the sampling plan shifts the sampling window with each progressive slice of the train/test splits.
And, we can set expand_y_axis = FALSE
to zoom in on the samples.
We’ll use this backtesting strategy (6 samples from one time series each with 50/10 split in years and a ~20 year offset) when testing the veracity of the LSTM model on the sunspots dataset.
The LSTM model
To begin, we’ll develop an LSTM model on a single sample from the backtesting strategy, namely, the most recent slice. We’ll then apply the model to all samples to investigate modeling performance.
We can reuse the plot_split()
function to visualize the split. Set expand_y_axis = FALSE
to zoom in on the subsample.
Data setup
To aid hyperparameter tuning, besides the training set we also need a validation set.
For example, we will use a callback, callback_early_stopping
, that stops training when no significant performance is seen on the validation set (what’s considered significant is up to you).
We will dedicate 2 thirds of the analysis set to training, and 1 third to validation.
First, let’s combine the training and testing data sets into a single data set with a column key
that specifies where they came from (either “training” or “testing)”. Note that the tbl_time
object will need to have the index respecified during the bind_rows()
step, but this issue should be corrected in dplyr
soon.
# A time tibble: 1,800 x 3
# Index: index
index value key
<date> <dbl> <chr>
1 18490601 81.1 training
2 18490701 78 training
3 18490801 67.7 training
4 18490901 93.7 training
5 18491001 71.5 training
6 18491101 99 training
7 18491201 97 training
8 18500101 78 training
9 18500201 89.4 training
10 18500301 82.6 training
# ... with 1,790 more rows
Preprocessing with recipes
The LSTM algorithm will usually work better if the input data has been centered and scaled. We can conveniently accomplish this using the recipes
package. In addition to step_center
and step_scale
, we’re using step_sqrt
to reduce variance and remov outliers. The actual transformations are executed when we bake
the data according to the recipe:
# A tibble: 1,800 x 3
index value key
<date> <dbl> <fct>
1 18490601 0.714 training
2 18490701 0.660 training
3 18490801 0.473 training
4 18490901 0.922 training
5 18491001 0.544 training
6 18491101 1.01 training
7 18491201 0.974 training
8 18500101 0.660 training
9 18500201 0.852 training
10 18500301 0.739 training
# ... with 1,790 more rows
Next, let’s capture the original center and scale so we can invert the steps after modeling. The square root step can then simply be undone by squaring the backtransformed data.
center.value scale.value
6.694468 3.238935
Reshaping the data
Keras LSTM expects the input as well as the target data to be in a specific shape.
The input has to be a 3d array of size num_samples, num_timesteps, num_features
.
Here, num_samples
is the number of observations in the set. This will get fed to the model in portions of batch_size
. The second dimension, num_timesteps
, is the length of the hidden state we were talking about above. Finally, the third dimension is the number of predictors we’re using. For univariate time series, this is 1.
How long should we choose the hidden state to be? This generally depends on the dataset and our goal. If we did onestepahead forecasts  thus, forecasting the following month only  our main concern would be choosing a state length that allows to learn any patterns present in the data.
Now say we wanted to forecast 12 months instead, as does SILSO, the World Data Center for the production, preservation and dissemination of the international sunspot number. The way we can do this, with Keras, is by wiring the LSTM hidden states to sets of consecutive outputs of the same length. Thus, if we want to produce predictions for 12 months, our LSTM should have a hidden state length of 12.
These 12 time steps will then get wired to 12 linear predictor units using a time_distributed()
wrapper.
That wrapper’s task is to apply the same calculation (i.e., the same weight matrix) to every state input it receives.
Now, what’s the target array’s format supposed to be? As we’re forecasting several timesteps here, the target data again needs to be 3dimensional. Dimension 1 again is the batch dimension, dimension 2 again corresponds to the number of timesteps (the forecasted ones), and dimension 3 is the size of the wrapped layer.
In our case, the wrapped layer is a layer_dense()
of a single unit, as we want exactly one prediction per point in time.
So, let’s reshape the data. The main action here is creating the sliding windows of 12 steps of input, followed by 12 steps of output each. This is easiest to understand with a shorter and simpler example. Say our input were the numbers from 1 to 10, and our chosen sequence length (state size) were 4. Tthis is how we would want our training input to look:
1,2,3,4
2,3,4,5
3,4,5,6
And our target data, correspondingly:
5,6,7,8
6,7,8,9
7,8,9,10
We’ll define a short function that does this reshaping on a given dataset. Then finally, we add the third axis that is formally needed (even though that axis is of size 1 in our case).
Building the LSTM model
Now that we have our data in the required form, let’s finally build the model. As always in deep learning, an important, and often timeconsuming, part of the job is tuning hyperparameters. To keep this post selfcontained, and considering this is primarily a tutorial on how to use LSTM in R, let’s assume the following settings were found after extensive experimentation (in reality experimentation did take place, but not to a degree that performance couldn’t possibly be improved).
Instead of hard coding the hyperparameters, we’ll use tfruns to set up an environment where we could easily perform grid search.
We’ll quickly comment on what these parameters do but mainly leave those topics to further posts.
After all these preparations, the code for constructing and training the model is rather short! Let’s first quickly view the “long version”, that would allow you to test stacking several LSTMs or use a stateful LSTM, then go through the final short version (that does neither) and comment on it.
This, just for reference, is the complete code.
Now let’s step through the simpler, yet better (or equally) performing configuration below.
As we see, training was stopped after ~55 epochs as validation loss did not decrease any more. We also see that performance on the validation set is way worse than performance on the training set  normally indicating overfitting.
This topic too, we’ll leave to a separate discussion another time, but interestingly regularization using higher values of dropout
and recurrent_dropout
(combined with increasing model capacity) did not yield better generalization performance. This is probably related to the characteristics of this specific time series we mentioned in the introduction.
Now let’s see how well the model was able to capture the characteristics of the training set.
We compute the average RSME over all sequences of predictions.
21.01495
How do these predictions really look? As a visualization of all predicted sequences would look pretty crowded, we arbitrarily pick start points at regular intervals.
This looks pretty good. From the validation loss, we don’t quite expect the same from the test set, though.
Let’s see.
31.31616
That’s not as good as on the training set, but not bad either, given this time series is quite challenging.
Having defined and run our model on a manually chosen example split, let’s now revert to our overall resampling frame.
Backtesting the model on all splits
To obtain predictions on all splits, we move the above code into a function and apply it to all splits. First, here’s the function. It returns a list of two dataframes, one for the training and test sets each, that contain the model’s predictions together with the actual values.
Mapping the function over all splits yields a list of predictions.
Calculate RMSE on all splits:
How does it look? Here’s RMSE on the training set for the 6 splits.
# A tibble: 6 x 2
id rmse
<chr> <dbl>
1 Slice1 22.2
2 Slice2 20.9
3 Slice3 18.8
4 Slice4 23.5
5 Slice5 22.1
6 Slice6 21.1
# A tibble: 6 x 2
id rmse
<chr> <dbl>
1 Slice1 21.6
2 Slice2 20.6
3 Slice3 21.3
4 Slice4 31.4
5 Slice5 35.2
6 Slice6 31.4
Looking at these numbers, we see something interesting: Generalization performance is much better for the first three slices of the time series than for the latter ones. This confirms our impression, stated above, that there seems to be some hidden development going on, rendering forecasting more difficult.
And here are visualizations of the predictions on the respective training and test sets.
First, the training sets:
And the test sets:
This has been a long post, and necessarily will have left a lot of questions open, first and foremost: How do we obtain good settings for the hyperparameters (learning rate, number of epochs, dropout)? How do we choose the length of the hidden state? Or even, can we have an intuition how well LSTM will perform on a given dataset (with its specific characteristics)? We will tackle questions like the above in upcoming posts.
Learning More
Check out our other articles on Time Series!
 Time Series Deep Learning: Forecasting Sunspots With Keras Stateful LSTM In R
 Tidy Time Series Analysis, Part 1: Tidy Period Apply
 Tidy Time Series Analysis, Part 2: Tidy Rolling Functions
 Tidy Time Series Analysis, Part 3: Tidy Rolling Correlations
 Tidy Time Series Analysis, Part 4: Lags and Autocorrelations
Business Science University
Business Science University is a revolutionary new online platform that get’s you results fast.
Why learn from Business Science University? You could spend years trying to learn all of the skills required to confidently apply Data Science For Business (DS4B). Or you can take the first course in our Virtual Workshop, Data Science For Business (DS4B 201). In 10 weeks, you’ll learn:

A 100% ROIdriven Methodology  Everything we teach is to maximize ROI.

A clear, systematic plan that we’ve successfully used with clients

Critical thinking skills necessary to solve problems

Advanced technology: _H2O Automated Machine Learning__

How to do 95% of the skills you will need to use when wrangling data, investigating data, building highperformance models, explaining the models, evaluating the models, and building tools with the models
You can spend years learning this information or in 10 weeks (one chapter per week pace). Get started today!
DS4B Virtual Workshop: Predicting Employee Attrition
Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.
Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301
Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:
 DS4B 201: Predicting Employee Attrition with
h2o
andlime
 DS4B 301 (Coming Soon): Building A
Shiny
Web Application  DS4B 302 (EST Q4): Data Communication With
RMarkdown
Reports and Presentations  DS4B 303 (EST Q4): Building An R Package For Your Organization,
tidyattrition
The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISPDM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.
Don’t Miss A Beat
 Sign up for the Business Science blog to stay updated
 Get started with Business Science University to learn how to solve realworld data science problems from Business Science
 Check out our Open Source Software
Connect With Business Science
If you like our software (anomalize
, tidyquant
, tibbletime
, timetk
, and sweep
), our courses, and our company, you can connect with us:
 businessscience on GitHub
 Business Science, LLC on LinkedIn
 bizScienc on twitter
 Business Science, LLC on Facebook