Tidy Time Series Analysis, Part 4: Lags and Autocorrelation
Written by Matt Dancho on August 30, 2017
In the fourth part in a series on Tidy Time Series Analysis, we’ll investigate lags and autocorrelation, which are useful in understanding seasonality and form the basis for autoregressive forecast models such as AR, ARMA, ARIMA, SARIMA (basically any forecast model with “AR” in the acronym). We’ll use the
tidyquant package along with our tidyverse downloads data obtained from
cranlogs. The focus of this post is using
lag.xts(), a function capable of returning multiple lags from a xts object, to investigate autocorrelation in lags among the daily tidyverse package downloads. When using
tq_mutate() we can scale to multiple groups (different tidyverse packages in our case). If you like what you read, please follow us on social media to stay up on the latest Business Science news, events and information! As always, we are interested in both expanding our network of data scientists and seeking new clients interested in applying data science to business and finance. If interested, contact us.
If you haven’t checked out the previous tidy time series posts, you may want to review them to get up to speed.
- Part 1: Tidy Period Apply
- Part 2: Tidy Rolling Functions
- Part 3: Tidy Rolling Correlations
- Part 4: Lags and Autocorrelations
Here’s an example of the autocorrelation plot we investigate as part of this post:
We’ll need to load several libraries today.
CRAN tidyverse Downloads
We’ll be using the same “tidyverse” dataset as the last several posts. The script below gets the package downloads for the first half of 2017.
We can visualize daily downloads, but detecting the trend is quite difficult due to noise in the data.
Lags (Lag Operator)
The lag operator (also known as backshift operator) is a function that shifts (offsets) a time series such that the “lagged” values are aligned with the actual time series. The lags can be shifted any number of units, which simply controls the length of the backshift. The picture below illustrates the lag operation for lags 1 and 2.
Lags are very useful in time series analysis because of a phenomenon called autocorrelation, which is a tendency for the values within a time series to be correlated with previous copies of itself. One benefit to autocorrelation is that we can identify patterns within the time series, which helps in determining seasonality, the tendency for patterns to repeat at periodic frequencies. Understanding how to calculate lags and analyze autocorrelation will be the focus of this post.
Finally, lags and autocorrelation are central to numerous forecasting models that incorporate autoregression, regressing a time series using previous values of itself. Autoregression is the basis for one of the most widely used forecasting techniques, the autoregressive integrated moving average model or ARIMA for short. Possibly the most widely used tool for forecasting, the
forecast package by Rob Hyndman, implements ARIMA (and a number of other forecast modeling techniques). We’ll save autoregression and ARIMA for another day as the subject is truly fascinating and deserves its own focus.
Background on Functions Used
TTR packages have some great functions that enable working with time series. The
tidyquant package enables a “tidy” implementation of these functions. You can see which functions are integrated into
tidyquant package using
tq_mutate_fun_options(). We use
glimpse() to shorten the output.
Today, we’ll take a look at the
lag.xts() function from the
xts package, which is a really great function for getting multiple lags. Before we dive into an analysis, let’s see how the function works. Say we have a time series of ten values beginning in 2017.
lag.xts() function generates a sequence of lags (t-1, t-2, t-3, …, t-k) using the argument
k. However, it only works on xts objects (or other matrix, vector-based objects). In other words, it fails on our “tidy” tibble. We get an “unsupported type” error.
Now, watch what happens when converted to an xts object. We’ll use
tk_xts() from the
timetk package to coerce from a time-based tibble (tibble with a date or time component) to and xts object.
timetkpackage is a toolkit for working with time series. It has functions that simplify and make consistent the process of coercion (converting to and from different time series classes). In addition, it has functions to aid the process of time series machine learning and data mining. Visit the docs to learn more.
We get our lags! However, we still have one problem: We need our original values so we can analyze the counts against the lags. If we want to get the original values too, we can do something like this.
That’s a lot of work for a simple operation. Fortunately we have
tq_mutate() to the rescue!
tq_mutate() function from
tidyquant enables “tidy” application of the xts-based functions. The
tq_mutate() function works similarly to
dplyr in the sense that it adds columns to the data frame.
tidyquantpackage enables a “tidy” implementation of the xts-based functions from packages such as xts, zoo, quantmod, TTR and PerformanceAnalytics. Visit the docs to learn more.
Here’s a quick example. We use the
select = value to send the “value” column to the mutation function. In this case our
mutate_fun = lag.xts. We supply
k = 5 as an additional argument.
That’s much easier. We get the value column returned in addition to the lags, which is the benefit of using
tq_mutate(). If you use
tq_transmute() instead, the result would be the lags only, which is what
Analyzing tidyverse Downloads: Lag and Autocorrelation Analysis
Now that we understand a little more about lags and the
tq_mutate() functions, let’s put this information to use with a lag and autocorrelation analysis of the tidyverse package downloads. We’ll analyze all tidyverse packages together, showing off the scalability of
Scaling the Lag and Autocorrelation Calculation
First, let’s get lags 1 through 28 (4 weeks of lags). The process is quite simple: we take the
tidyverse_downloads data frame, which is grouped by package, and apply
tq_mutate() using the
lag.xts function. We can provide column names for the new columns by prefixing “lag_” to the lag numbers,
k, which the sequence from 1 to 28. The output is all of the lags for each package.
Next, we need to correlate each of the lags to the “count” column. This involves a few steps that can be strung together in a
dplyr pipe (
The goal is to get count and each lag side-by-side so we can do a correlation. To do this we use
gather()to pivot each of the lagged columns into a “tidy” (long format) data frame, and we exclude “package”, “date”, and “count” columns from the pivot.
Next, we convert the new “lag” column from a character string (e.g. “lag_1”) to numeric (e.g. 1) using
mutate(), which will make ordering the lags much easier.
Next, we group the long data frame by package and lag. This allows us to calculate using subsets of package and lag.
Finally, we apply the correlation to each group of lags. The
summarize()function can be used to implement
cor(), which takes
x = countand
y = lag_value. Make sure to pass
use = "pairwise.complete.obs", which is almost always desired. Additionally, the 95% upper and lower cutoff can be approximated by:
- N = number of observations.
Putting it all together:
Visualizing Autocorrelation: ACF Plot
Now that we have the correlations calculated by package and lag number in a nice “tidy” format, we can visualize the autocorrelations with
ggplot to check for patterns. The plot shown below is known as an ACF plot, which is simply the autocorrelations at various lags. Initial examination of the ACF plots indicate a weekly frequency.
Which Lags Consistently Stand Out?
We see that there appears to be a weekly pattern, but we want to be sure. We can verify the weekly pattern assessment by reviewing the absolute value of the correlations independent of package. We take the absolute autocorrelation because we use the magnitude as a proxy for how much explanatory value the lag provides. We’ll use
dplyr functions to manipulate the data for visualization:
We drop the package group constraint using
We calculate the absolute correlation using
mutate(). We also convert the lag to a factor, which helps with reordering the plot later.
select()only the “lag” and “cor_abs” columns.
We group by “lag” to lump all of the lags together. This enables us to determine the trend independent of package.
We can now visualize the absolute correlations using a box plot that lumps each of the lags together. We can add a line to indicate the presence of outliers at values above . If the values are consistently above this limit, the lag can be considered an outlier. Note that we use the
fct_reorder() function from
forcats to organize the boxplot in order of decending magnitude.
Lags in multiples of seven have the highest autocorrelation and are consistently above the outlier break point indicating the presence of a strong weekly pattern. The autocorrelation with the seven-day lag is the highest, with a median of approximately 0.75. Lags 14, 21, and 28 are also outliers with median autocorrelations in excess of our outlier break point of 0.471.
Note that the median of Lag 1 is essentially at the break point indicating that half of the packages have a presence of “abnormal” autocorrelation. However, this is not part of a seasonal pattern since a periodic frequency is not present.
Lag and autocorrelation analysis is a good way to detect seasonality. We used the autocorrelation of the lagged values to detect “abnormal” seasonal patterns. In this case, the tidyverse packages exhibit a strong weekly pattern. We saw how the
tq_mutate() function was used to apply
lag.xts() to the daily download counts to efficiently get lags 1 through 28. Once the lags were retrieved, we used other
dplyr functions such as
gather() to pivot the data and
summarize() to calculate the autocorrelations. Finally, we saw the power of visual analysis of the autocorrelations. We created an ACF plot that showed a visual trend. Then we used a boxplot to detect which lags had consistent outliers. Ultimately a weekly pattern was confirmed.
Enjoy data science for business? We do too. This is why we created Business Science University where we teach you how to do Data Science For Busines (#DS4B) just like us!
Our first DS4B course (HR 201) is now available!
Who is this course for?
Anyone that is interested in applying data science in a business context (we call this DS4B). All you need is basic
ggplot2 experience. If you understood this article, you are qualified.
What do you get it out of it?
You learn everything you need to know about how to apply data science in a business context:
Using ROI-driven data science taught from consulting experience!
Solve high-impact problems (e.g. $15M Employee Attrition Problem)
Use advanced, bleeding-edge machine learning algorithms (e.g. H2O, LIME)
Apply systematic data science frameworks (e.g. Business Science Problem Framework)
“If you’ve been looking for a program like this, I’m happy to say it’s finally here! This is what I needed when I first began data science years ago. It’s why I created Business Science University.”
Matt Dancho, Founder of Business Science