# timekit: Time Series Forecast Applications Using Data Mining

*Written by Matt Dancho on May 2, 2017*

The `timekit` package contains a collection of tools for working with time series in R. There are a number of benefits. One of the biggest is the ability to use a time series signature to predict future values (forecast) through data mining techniques. While this post is geared toward exposing the user to the `timekit` package, there are examples showing the power of data mining a time series as well as how to work with time series in general. A number of `timekit` functions will be discussed and implemented in the post. The first group of functions works with the time series index, and these include `tk_index()`, `tk_get_timeseries_signature()`, `tk_augment_timeseries_signature()` and `tk_get_timeseries_summary()`. We’ll spend the bulk of this post introducing you to these. The next function, `tk_make_future_timeseries()`, creates a future time series from an existing index. The last set of functions deals with coercion to and from the major time series classes in R: `tk_tbl()`, `tk_xts()`, `tk_zoo()` (and `tk_zooreg()`), and `tk_ts()`.

# Benefits

So, why another time series package? The short answer is that it helps with data mining, with communication between time series objects, and with generating accurate future time series. The long answer is slightly more complicated, and I will attempt to explain.

##### Time Series Signature

The first and arguably most important reason is that a large amount of information is stored inside a seemingly simple **time index**, and that information is very useful for modeling and data mining. The time index is the collection of time-based values that define *when* each observation occurred. Consider the timestamp “2016-01-01 00:00:00”. This contains a wealth of information related to the observation including year, month, day, hour, minute and second. We can extract even more information including half, quarter, week of year, day of year, and so on with little effort. Next is the concept of the **frequency** (or periodicity or scale), which is the amount of time between observations. From this time difference we can get even more information such as the periodicity of the data, whether the observations are regularly or irregularly spaced, and even which observations are frequently missing. By my count, at least 20 features can be retrieved from a simple timestamp. The important concept is that these features can be exploded (or broken out) into what I’m calling the **time series signature**, which is nothing more than a decomposition of the unique features related to the time index values. This data is very useful as it can be summarized, modeled, mined, sliced and diced, etc. As an example of the power of the signature, we can generate a prediction using data mining techniques (see the alcohol sales example later).
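To make the idea of a signature concrete, here is a minimal base R sketch that breaks a date index into a handful of features. It is only an illustration of the concept: the helper name `signature_sketch()` and its column names are assumptions made for this example and are not the `timekit` implementation, which returns many more columns.

```r
# Base R sketch of a "time series signature": break a date index into features.
# The helper name `signature_sketch` is hypothetical and is not part of timekit.
signature_sketch <- function(idx) {
  idx   <- as.Date(idx)
  lt    <- as.POSIXlt(idx)                    # list-like view of date components
  month <- lt$mon + 1L                        # POSIXlt months are 0-based
  data.frame(
    index   = idx,
    year    = lt$year + 1900L,                # POSIXlt years count from 1900
    half    = ifelse(month <= 6L, 1L, 2L),
    quarter = (month - 1L) %/% 3L + 1L,
    month   = month,
    day     = lt$mday,
    wday    = lt$wday + 1L,                   # 1 = Sunday ... 7 = Saturday
    yday    = lt$yday + 1L,                   # day of year, 1-based
    week    = as.integer(format(idx, "%V")),  # ISO 8601 week number
    diff    = c(NA, diff(as.numeric(idx)))    # days since previous observation
  )
}

idx <- as.Date(c("2016-01-01", "2016-01-04", "2016-07-15"))
sig <- signature_sketch(idx)
sig
```

Each row describes one timestamp, and the `diff` column carries the frequency information: a gap of 3 days between the first two observations hints at a skipped weekend.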

##### Prediction and Forecast Accuracy

The second reason is that often we want to make predictions into the future. There are a number of packages such as `forecast` and `prophet` that already specialize in this. For `forecast`, the future dates can be incorrect, especially for daily data. A regular numeric system doesn’t contain true dates, and a sequential system results in inaccuracy with respect to irregular dates. For `prophet`, the mechanism to compute holidays and missing days is internal to the `predict()` method, and therefore a method specific to creating future dates is needed. Two types of days cause problems: those that are regularly skipped and those that are irregularly skipped. The regularly skipped days (such as weekends, or every other Friday at companies that give that day off) need to be factored into the future date sequence. The irregularly skipped days (think holidays) cause issues as well, and these suffer the additional problem that they can be difficult (but not impossible) to predict.
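The two kinds of skipped days can be illustrated with a short base R sketch. The helper `make_future_sketch()` and its arguments are hypothetical names chosen for this example; it is not how `tk_make_future_timeseries()` is implemented, and here `n_future` simply counts calendar days to scan. Regularly skipped days are inferred from the weekdays present in the history, and irregularly skipped days are supplied by the caller.

```r
# Base R sketch: extend a daily index into the future, dropping regularly
# skipped days (weekdays never seen in the history) and irregularly skipped
# days (holidays supplied by the caller).
# `make_future_sketch` is a hypothetical helper, not timekit's implementation.
make_future_sketch <- function(idx, n_future,
                               skip_values = as.Date(character(0))) {
  idx <- as.Date(idx)
  candidates     <- max(idx) + seq_len(n_future)     # next n_future calendar days
  observed_wdays <- unique(as.POSIXlt(idx)$wday)     # weekdays seen in the history
  keep <- as.POSIXlt(candidates)$wday %in% observed_wdays &
    !(candidates %in% skip_values)
  candidates[keep]
}

# Synthetic history: every Monday-to-Friday date in December 2015.
all_days <- seq(as.Date("2015-12-01"), as.Date("2015-12-31"), by = "day")
history  <- all_days[!(as.POSIXlt(all_days)$wday %in% c(0, 6))]

# Scan the next 7 calendar days, skipping New Year's Day.
future <- make_future_sketch(history, n_future = 7,
                             skip_values = as.Date("2016-01-01"))
future
```

The weekend of January 2-3 is dropped because no Saturday or Sunday appears in the history, while January 1 is dropped only because it was passed in explicitly.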

##### Communication and Coercion Between Time-Based Object Classes

The third reason is that the R object structures that contain time-based information are difficult to use together. My first attempt was born in `tidyquant`, where I created the `as_tibble()` and `as_xts()` functions to coerce (convert) back and forth. I was naive in this attempt because the problem is larger: we have `zoo`, `ts` and many other packages that work with time-based information. The `xts` and `zoo` packages solved part of the problem, but two issues persist. First, the time-based tibble (“tidy” data frame with class `tbl`) does not communicate well with the rest of the group. Coercing to `xts`, `zoo` and `ts` objects can result in a lot of issues, especially when coercion rules for homogeneous object classes take over. Weird things can happen, such as turning numeric data into character or converting dates to numeric without warning. Further, each coercion method (`as_tibble`, `as.xts`, `as.zoo`, `as.ts`) has its own nuances that are inconsistent. Second, some classes like `ts` do not use a time-based index, but rather use a regularized numeric-based index. Without maintaining the time-based index, we can never go back to the original data structure, whether it is `tbl`, `xts`, `zoo`, etc.
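The pitfalls described above can be reproduced with base R alone. The following sketch uses a small made-up data frame (the values are illustrative, not real volume data) to show a numeric column turning into character, a date turning into a number, and a `ts` object carrying only a regularized index.

```r
# Base R sketch of the coercion pitfalls on a small made-up data frame.
df <- data.frame(
  date   = as.Date("2016-01-01") + 0:2,
  volume = c(100, 200, 300)
)

# Matrices are homogeneous: the Date column forces everything to character.
m_chr <- as.matrix(df)
typeof(m_chr)                    # "character"

# data.matrix() converts the Date column to days since 1970-01-01, no warning.
m_num <- data.matrix(df)
as.numeric(m_num[1, "date"])     # 16801

# A ts object keeps only a regularized numeric index; the dates are gone.
volume_ts <- ts(df$volume, start = 1, frequency = 1)
time(volume_ts)
```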

##### Enter timekit

The `timekit` package solves each of these issues. It includes functions to create a time series signature and a time series summary from a sequence of dates. It includes methods to accurately generate future time series index values, which is especially important for daily data. It provides consistent coercion methods that prevent inadvertent class coercion issues resulting from homogeneous object structures and that maximize time-based index retention for regularized data structures.

# Test Driving timekit

Let’s take `timekit` for a test drive. We’ll be using a few other packages in the process to help with the examples. First, install `timekit`.

Next, load these packages:

## Example 1: Predicting Daily Volume for FB

This example is intended to expose potential users to several functions in `timekit`. We’ll develop a prediction algorithm to predict daily volume using the **time series signature**. First, start with the `FANG` data set from the `tidyquant` package. Filter to get just the FB stock prices, and select the “date” and “volume” columns. This is a typical time series data structure: a time-based tibble with a “date” column and a features column (“volume” in this case).

Next, split the data into two sets, one for training and one for comparing the actual output to our predictions.

Then, augment the training set with the time series signature using `tk_augment_timeseries_signature()`. This function adds the time series signature as additional columns to the data frame. The signature will be used next for the data mining process.

Next, model the data using a regression. We are going to use the `lm()` function to model volume using the time series signature.

Now we need to build the future data to model. We already have the index in the `actual_future` data. However, in practice we don’t normally have the future index. Let’s build it using the existing index following three steps:

- Extract the index from the training set with `tk_index()`
- Make a future index factoring in holidays and regularly skipped weekdays using `tk_make_future_timeseries()`
- Create a time series signature from the future index using `tk_get_timeseries_signature()`

Now use the `predict()` function to run a regression prediction on the new data.
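The fit-then-predict workflow can be sketched end to end in base R. This is an illustration on synthetic data rather than the FB volume data, and it does not call `timekit`: the helper `signature_features()`, the `wday_effect` values, and the 14-day horizon are all assumptions made for the example. The signature here is reduced to a trend and a day-of-week factor.

```r
# Base R sketch of the modeling workflow on synthetic data (not the FB data).
set.seed(1)

# Hypothetical helper: a reduced signature used as regression inputs.
# Factor levels are fixed so training and future data share one encoding.
signature_features <- function(idx) {
  lt <- as.POSIXlt(idx)
  data.frame(
    trend = as.numeric(idx),                   # days since 1970-01-01
    wday  = factor(lt$wday, levels = 0:6)      # 0 = Sunday ... 6 = Saturday
  )
}

# Synthetic daily series with a day-of-week effect plus noise.
train_idx   <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
wday_effect <- c(5, 20, 22, 21, 23, 25, 8)     # made-up levels, Sunday first
train <- signature_features(train_idx)
train$volume <- wday_effect[as.POSIXlt(train_idx)$wday + 1L] +
  rnorm(length(train_idx), sd = 1)

# Fit a linear regression on the signature columns.
fit_lm <- lm(volume ~ ., data = train)

# Build the future index, compute its signature, then predict.
future_idx <- max(train_idx) + 1:14
new_data   <- signature_features(future_idx)
pred_lm    <- predict(fit_lm, newdata = new_data)

data.frame(date = future_idx, prediction = round(pred_lm, 1))
```

Because the future signature is built by the same function as the training signature, `predict()` sees identical column names and factor levels, which is the property the three steps above are designed to guarantee.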

Let’s compare the prediction to the actual daily FB volume in 2016. Using the `add_column()` function, we can add the predictions to the actual data, `actual_future`. We can then plot the prediction using `ggplot()`.

The predictions are a bit off compared to the actuals, and in some months the values are actually negative, which is impossible. While the result is not necessarily earth shattering, let’s see how a regression algorithm performs on data with a more prevalent pattern. Note that we did a performance comparison, and the `prophet` package with default settings did a much better job of identifying the volume pattern. With different modeling methods and tuning, the data mining approach can be significantly improved, but it’s difficult to tell if the performance would be better than `prophet`.

## Example 2: Forecasting Alcohol Sales

In this example, we’ll evaluate a time series with a more prevalent pattern. The beauty of this example is that you will see the power of data mining the time series signature with just a simple linear regression. We’ll be using a linear regression model again to model the time series signature, but you should be thinking about what other better modeling methods could be implemented. The example is truncated for brevity since the major steps are the same as Example 1.

When a pattern is present, data mining using the time series signature can provide exceptional results. Further, the analyst has the flexibility to implement other data mining techniques and methods. We implemented a linear regression, but possibly other regression methods would work better.

## Example 3: Simplified and Extensible Coercion

In the final example, we’ll briefly examine the various coercion functions that enable simplified coercion back and forth. We’ll start with the `FB_tbl` data.

We use the various `timekit` coercion methods to go back and forth without data loss. See how the original tibble is returned. Note the argument `silent = TRUE` removes the warning that the “date” column is being dropped. This is desirable since `xts` and the other matrix-based time classes should only use numeric data. No need to specify “order.by” arguments or worry about non-numeric data types being passed inadvertently. In addition, the `ts` object maintains a time-based index in addition to a regularized index.

One caveat is going from `ts` to `tbl`. The default argument is `timekit_idx = FALSE`, which returns a regularized index. If the time-based index is needed, just set `timekit_idx = TRUE`.
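The difference between the two kinds of index can be sketched in base R. This is not `timekit` code: the attribute name `date_index` is a hypothetical choice for this example and says nothing about how `timekit` stores its index internally. The point is only that a `ts` object can carry the original dates alongside its regularized index, so either one can be recovered.

```r
# Base R sketch: keep the original dates alongside a ts object so they can be
# restored later. The attribute name "date_index" is hypothetical.
dates  <- as.Date("2016-01-04") + 0:4
volume <- c(10, 12, 9, 14, 11)

volume_ts <- ts(volume, start = 1, frequency = 1)
attr(volume_ts, "date_index") <- dates         # stash the time-based index

# Regularized round trip: only the numeric index survives.
regular_tbl <- data.frame(index  = as.numeric(time(volume_ts)),
                          volume = as.numeric(volume_ts))

# Index-preserving round trip: recover the dates from the stored attribute.
dated_tbl <- data.frame(date   = attr(volume_ts, "date_index"),
                        volume = as.numeric(volume_ts))

regular_tbl
dated_tbl
```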

# Recap

Hopefully you can now see how `timekit` benefits time series analysis. We reviewed several of the functions related to extracting an index, getting a time series signature from an index or augmenting a data frame with it, making a future time series that accounts for weekends and holidays, and coercing between various time series object classes. We also saw how the time series signature can be used in predictive analytics and data mining. The goal was to introduce you to `timekit`. Hopefully you now have a baseline to assist with future time series analysis.

# Announcements

If you’re interested in meeting with the members of *Business Science*, we’ll be speaking at the following upcoming conferences:

- R/Finance: Chicago, May 19-20
- Enterprise Applications of the R Language (EARL): San Francisco, June 5-7

# Important Links

If you are interested in learning more about `timekit`:

- Visit our `pkgdown` site for detailed documentation
- Visit our GitHub site for code updates
- Visit our website for news and announcements

# Further Reading

I find the R Data Mining Website and Reference Card to be an invaluable tool when researching (and trying to remember) the various data mining techniques. Many of these techniques can be implemented in time series analysis with `timekit`.