# Cleaning Anomalies to Reduce Forecast Error by 9% with anomalize

Written by Matt Dancho on September 30, 2019

In this tutorial, we’ll show how we used clean_anomalies() from the anomalize package to reduce forecast error by 9%.

R Packages Covered:

• anomalize - Time series anomaly detection

## Cleaning Anomalies to Reduce Forecast Error by 9%

We can often improve forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function from anomalize into your forecast workflow.

## Forecast Workflow

We’ll use the following workflow to remove time series anomalies prior to forecasting.

1. Identify the anomalies - Decompose the time series with time_decompose() and anomalize() the remainder (residuals)

2. Clean the anomalies - Use the new clean_anomalies() function to reconstruct the time series, replacing anomalies with the trend and seasonal components

3. Forecast - Use a forecasting algorithm to predict new observations from a training set, then compare to test set with and without anomalies cleaned

## Step 2 - Get the Data

This tutorial uses the tidyverse_cran_downloads dataset that comes with anomalize. These are the historical downloads of several “tidy” R packages from 2017-01-01 to 2018-03-01.

Let’s take one package with some extreme events. We’ll home in on lubridate (but you could pick any).

We’ll filter() downloads of the lubridate R package.
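A minimal sketch of this step, assuming the dplyr and anomalize packages are installed. The object name lubridate_tbl is our own choice for illustration.

```r
library(dplyr)
library(anomalize)

# tidyverse_cran_downloads ships with anomalize as a tibble grouped by package.
# We ungroup() first so later steps treat lubridate as a single series.
lubridate_tbl <- tidyverse_cran_downloads %>%
  ungroup() %>%
  filter(package == "lubridate")

lubridate_tbl
```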

Here’s a visual representation of the forecast experiment setup. Training data will be any data before “2018-01-01”.

## Step 3 - Workflow for Cleaning Anomalies

The workflow to clean anomalies:

1. We decompose the “count” column using time_decompose() - This returns a Seasonal-Trend-Loess (STL) decomposition in the form of “observed”, “season”, “trend” and “remainder”.

2. We fix any negative values - If present, they can throw off forecasting transformations (e.g. log and power transformations)

3. We identify anomalies with anomalize() on the “remainder” column - Returns “remainder_l1” (lower limit), “remainder_l2” (upper limit), and “anomaly” (Yes/No).

4. We use clean_anomalies() to add a new column called “observed_cleaned” that repairs the anomalous data by replacing all anomalies with the trend + seasonal components from the decomposition.
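The four steps above can be sketched as a single pipeline. This assumes dplyr and anomalize are installed and uses the default arguments of each anomalize function; the object name lubridate_anomalized_tbl is illustrative.

```r
library(dplyr)
library(anomalize)

lubridate_anomalized_tbl <- tidyverse_cran_downloads %>%
  ungroup() %>%
  filter(package == "lubridate") %>%
  # 1. Decompose the download counts (STL by default)
  time_decompose(count) %>%
  # 2. Fix negative values in observed, if any
  mutate(observed = ifelse(observed < 0, 0, observed)) %>%
  # 3. Flag anomalies in the remainder
  anomalize(remainder) %>%
  # 4. Add observed_cleaned, with anomalies repaired
  clean_anomalies()

# Inspect only the rows flagged as anomalies
lubridate_anomalized_tbl %>%
  filter(anomaly == "Yes") %>%
  select(date, anomaly, observed, observed_cleaned)
```

Filtering to the rows where anomaly is “Yes” makes it easy to compare the original and repaired values side by side.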

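| date       | anomaly | observed | observed_cleaned |
|------------|---------|----------|------------------|
| 2017-01-12 | Yes     | 0        | 3522.194         |
| 2017-04-19 | Yes     | 8549     | 5201.716         |
| 2017-09-01 | Yes     | 0        | 4136.721         |
| 2017-09-07 | Yes     | 9491     | 4871.176         |
| 2017-10-30 | Yes     | 11970    | 6412.571         |
| 2017-11-13 | Yes     | 10267    | 6640.871         |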

Here’s a visual of the “observed” (uncleaned) vs the “observed_cleaned” (cleaned) training sets. We’ll see what influence these anomalies have on a forecast regression (next).

First, we’ll make a function, forecast_downloads(), that can take either the cleaned or the uncleaned data as input and return the forecasted downloads versus actual downloads. The modeling function is described in the Appendix - Forecast Downloads Function.

### Step 4.1 - Before Cleaning with anomalize

We’ll first perform a forecast without cleaning anomalies (high leverage points).

• The forecast_downloads() function trains on the “observed” (uncleaned) data and returns predictions versus actual.
• Internally, a power transformation (square-root) is applied to improve the forecast due to the multiplicative properties.
• The model uses a linear regression of the form sqrt(observed) ~ numeric index + year + quarter + month + day of week.

#### Forecast vs Actual Values

The forecast is overplotted against the actual values.

We can see that the forecast is shifted vertically, an effect of the high leverage points.

#### Forecast Error Calculation

The mean absolute error (MAE) is 1570, meaning on average the forecast is off by 1570 downloads each day.
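As a reminder of how MAE is computed, here is a small base R sketch. The mae() helper and the numbers are hypothetical and are not taken from the lubridate data.

```r
# Mean absolute error: the average of |actual - predicted|
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Hypothetical values for illustration only
actual    <- c(5000, 6200, 4800, 7100)
predicted <- c(4500, 6500, 5300, 6900)

mae(actual, predicted)
# 375
```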

### Step 4.2 - After Cleaning with anomalize

We’ll next perform a forecast this time using the repaired data from clean_anomalies().

• The forecast_downloads() function trains on the “observed_cleaned” (cleaned) data and returns predictions versus actual.
• Internally, a power transformation (square-root) is applied to improve the forecast due to the multiplicative properties.
• The model uses a linear regression of the form sqrt(observed_cleaned) ~ numeric index + year + quarter + month + day of week.

#### Forecast vs Actual Values

The forecast is overplotted against the actual values. The cleaned data is shown in yellow.

Zooming in on the forecast region, we can see that the forecast does a better job following the trend in the test data.

#### Forecast Error Calculation

The mean absolute error (MAE) is 1435, meaning on average the forecast is off by 1435 downloads each day.

## 8.6% Reduction in Forecast Error

Using the new anomalize function, clean_anomalies(), prior to forecasting results in an 8.6% reduction in forecast error as measured by Mean Absolute Error (MAE).
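The 8.6% figure follows directly from the two MAE values reported above. The variable names below are our own.

```r
mae_before <- 1570  # MAE when training on "observed"
mae_after  <- 1435  # MAE when training on "observed_cleaned"

# Relative reduction in forecast error
pct_reduction <- (mae_before - mae_after) / mae_before

round(100 * pct_reduction, 1)
# 8.6
```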

# Conclusion

Cleaning anomalies before forecasting is a good practice that can substantially improve forecast accuracy by removing high leverage points. The new clean_anomalies() function in the anomalize package provides an easy workflow for removing anomalies prior to forecasting. Learn more in the anomalize documentation.

# Appendix - Forecast Downloads Function

The forecast_downloads() function uses the following procedure:

• Split the data into training and testing sets at the date specified by the sep argument.
• Apply a statistical transformation: none, log-1-plus (log1p()), or power (sqrt()).
• Model the daily time series of the training set using the column specified by the col_train argument: either observed (demonstrates no cleaning) or observed_cleaned (demonstrates the improvement from cleaning).
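The procedure above can be sketched in base R as follows. This is a simplified stand-in and not the original function: the name forecast_downloads_sketch(), the calendar features built with format(), passing col_train as a string, and taking the actual values from the same column used for training are all assumptions made for illustration.

```r
forecast_downloads_sketch <- function(data, col_train, sep = "2018-01-01",
                                      trans = c("none", "log1p", "sqrt")) {
  trans <- match.arg(trans)

  # Calendar features (a simple stand-in for a richer time series signature)
  feats <- data.frame(
    y         = data[[col_train]],
    date      = data$date,
    index_num = as.numeric(data$date),
    year      = as.numeric(format(data$date, "%Y")),
    quarter   = quarters(data$date),
    month     = format(data$date, "%m"),
    wday      = format(data$date, "%u")
  )

  # Split into training and testing sets at the sep date
  train <- feats[feats$date <  as.Date(sep), ]
  test  <- feats[feats$date >= as.Date(sep), ]

  # Apply the chosen transformation to the training target
  fwd <- switch(trans, none = identity, log1p = log1p, sqrt = sqrt)
  train$y_trans <- fwd(train$y)

  fit <- lm(y_trans ~ index_num + year + quarter + month + wday, data = train)

  # Some terms can be collinear (e.g. quarter is determined by month), so
  # predict() may warn about a rank-deficient fit; we suppress that warning
  pred <- suppressWarnings(predict(fit, newdata = test))

  # Back-transform predictions to the original scale
  pred <- switch(trans,
    none  = pred,
    log1p = expm1(pred),
    sqrt  = pmax(pred, 0)^2
  )

  data.frame(date = test$date, prediction = unname(pred), actual = test$y)
}

# Usage on synthetic data (hypothetical counts, not the lubridate downloads)
set.seed(1)
dates  <- seq(as.Date("2017-01-01"), as.Date("2018-03-01"), by = "day")
toy    <- data.frame(date = dates, observed = rpois(length(dates), lambda = 5000))
result <- forecast_downloads_sketch(toy, "observed", trans = "sqrt")

# Forecast error on the test period
mean(abs(result$actual - result$prediction))
```

Calling the sketch once with "observed" and once with "observed_cleaned" on the anomalized data would mirror the before and after comparison in Steps 4.1 and 4.2.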