Anomaly Detection Using Tidy and Anomalize
Written by Matt Dancho
We recently had an awesome opportunity to work with a great client that asked Business Science to build an open source anomaly detection algorithm that suited their needs. The business goal was to accurately detect anomalies for various marketing data consisting of website actions and marketing feedback spanning thousands of time series across multiple customers and web sources. Enter anomalize: a tidy anomaly detection algorithm that’s time-based (built on top of tibbletime) and scalable from one to many time series! We are really excited to present this open source R package for others to benefit from. In this post, we’ll go through an overview of what anomalize does and how it works.
Case Study: When Open Source Interests Align
We work with many clients teaching data science and using our expertise to accelerate their business. However, it’s rare to have a client’s needs and their willingness to let others benefit align with our interests of pushing the boundaries of data science. This was an exception.
Our client had a challenging problem: detecting anomalies in time series on daily or weekly data at scale. Anomalies indicate exceptional events, which could be increased web traffic in the marketing domain or a malfunctioning server in the IT domain. Regardless, it’s important to flag these unusual occurrences to ensure the business is running smoothly. One of the challenges was that the client deals with not one time series but thousands that need to be analyzed for these extreme events.
An opportunity presented itself to develop an open source package that aligned with our interest in building a scalable adaptation of the Twitter AnomalyDetection package and our client’s desire for a package that would benefit from the open source data science community’s ability to improve it over time. The result is anomalize.
2 Minutes To Anomalize
We’ve made a short introductory video that’s part of our new Business Science Software Intro Series on YouTube. This will get you up and running in under 2 minutes.
For those of us who prefer to read, here’s the gist of how
anomalize works in four simple steps.
Step 1: Install Anomalize
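A minimal sketch of the install step, assuming the package is available on CRAN under the name anomalize. The commented GitHub line assumes the devtools package is installed and that the repository lives at business-science/anomalize.

```r
# Install the released version (assumes anomalize is available on CRAN)
install.packages("anomalize")

# Alternative: development version from GitHub (assumes devtools is installed)
# devtools::install_github("business-science/anomalize")
```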
Step 2: Load Tidyverse and Anomalize
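Loading the two libraries might look like this, assuming both tidyverse and anomalize are already installed:

```r
# tidyverse supplies the pipe (%>%), dplyr verbs, and ggplot2
library(tidyverse)
# anomalize supplies the decomposition, detection, and plotting functions
library(anomalize)
```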
Step 3: Collect Time Series Data
We’ve provided a dataset,
tidyverse_cran_downloads, to get you up and running. The dataset consists of daily download counts of 15 “tidyverse” packages.
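A quick way to inspect the bundled data is sketched below. The column names checked here (date, count, package) are an assumption based on the “count” column and package grouping described in this post.

```r
library(dplyr)
library(anomalize)

# Print the first rows of the bundled dataset
head(tidyverse_cran_downloads)

# Which packages are included, and how many are there?
package_names <- unique(as.character(tidyverse_cran_downloads$package))
n_packages <- length(package_names)
print(package_names)
```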
Step 4: Anomalize
Use the three tidy functions time_decompose(), anomalize(), and time_recompose() to detect anomalies. Tack on a fourth, plot_anomalies(), to visualize.
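Put together, the four functions can be chained in one pipeline. This is a sketch of that workflow. The ncol and alpha_dots arguments are optional styling choices picked here for illustration.

```r
library(tidyverse)
library(anomalize)

anomaly_plot <- tidyverse_cran_downloads %>%
  time_decompose(count) %>%   # split "count" into season, trend, and remainder
  anomalize(remainder) %>%    # flag outliers in the remainder
  time_recompose() %>%        # map the bounds back onto the observed scale
  plot_anomalies(time_recomposed = TRUE, ncol = 3, alpha_dots = 0.5)

anomaly_plot
```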
Well that was easy… but, what did we just do???
You just implemented the “anomalize” (anomaly detection) workflow, which consists of:
- Time series decomposition with time_decompose()
- Anomaly detection of remainder with anomalize()
- Anomaly lower and upper bound transformation with time_recompose()
Time Series Decomposition
The first step is time series decomposition using time_decompose(). The “count” column is decomposed into “observed”, “season”, “trend”, and “remainder” columns. The default decomposition method is method = "stl", which is seasonal decomposition using a Loess smoother (refer to stats::stl()). The frequency and trend parameters are automatically set based on the time scale (or periodicity) of the time series, using tibbletime-based functions under the hood.
A nice aspect is that the frequency and trend are automatically selected for you. If you want to see what was selected, set message = TRUE. You can also change the selection by inputting a time-based period such as “1 week” or “2 quarters”, which is typically more intuitive than figuring out how many observations fall into a time span. Under the hood, time_frequency() and time_trend() convert these time-based periods to numeric values using tibbletime.
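As a sketch, the automatic selection and a manual override might look like the following. The specific spans ("1 week", "2 quarters") are taken from the examples in the text and are not tuned recommendations.

```r
library(dplyr)
library(anomalize)

# Automatic selection; message = TRUE reports the chosen frequency and trend
decomposed_auto <- tidyverse_cran_downloads %>%
  time_decompose(count, method = "stl",
                 frequency = "auto", trend = "auto", message = TRUE)

# Manual override using time-based spans instead of observation counts
decomposed_custom <- tidyverse_cran_downloads %>%
  time_decompose(count, method = "stl",
                 frequency = "1 week", trend = "2 quarters")

# Inspect the numeric frequency a time-based period maps to (single series)
lubridate_freq <- tidyverse_cran_downloads %>%
  filter(package == "lubridate") %>%
  ungroup() %>%
  time_frequency(period = "1 week")
```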
Anomaly Detection Of Remainder
The next step is to perform anomaly detection on the decomposed data, specifically the “remainder” column. We did this using anomalize(), which produces three new columns: “remainder_l1” (lower limit), “remainder_l2” (upper limit), and “anomaly” (Yes/No flag). The default method is method = "iqr", which is fast and relatively accurate at detecting anomalies. The alpha parameter is set to alpha = 0.05 by default, but can be adjusted to increase or decrease the height of the anomaly bands, making it more or less difficult for data to be anomalous. The max_anoms parameter is the second adjustable knob: by default max_anoms = 0.2, meaning at most 20% of the data can be flagged as anomalous. Finally, verbose = FALSE by default, which returns a data frame. Try setting verbose = TRUE to get an outlier report as a list.
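The parameters above can be sketched as follows. The second call uses alpha = 0.025 purely as an illustrative value to show the bands widening, so fewer points should be flagged.

```r
library(dplyr)
library(anomalize)

decomposed <- tidyverse_cran_downloads %>%
  time_decompose(count)

# Defaults written out explicitly
anoms_default <- decomposed %>%
  anomalize(remainder, method = "iqr", alpha = 0.05, max_anoms = 0.2)

# Smaller alpha (illustrative value): wider bands, harder to be anomalous
anoms_strict <- decomposed %>%
  anomalize(remainder, method = "iqr", alpha = 0.025, max_anoms = 0.2)

n_default <- sum(anoms_default$anomaly == "Yes")
n_strict  <- sum(anoms_strict$anomaly == "Yes")
```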
If you want to visualize what’s happening, now’s a good point to try out another plotting function, plot_anomaly_decomposition(). It only works on a single time series, so we’ll need to select just one to review. The “season” component removes the weekly cyclic seasonality. The trend is smooth, which is desirable because it removes the central tendency without overfitting. Finally, the remainder is analyzed for anomalies, detecting the most significant outliers.
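A sketch of the decomposition plot for one series follows. The choice of “lubridate” and the plot title are illustrative.

```r
library(dplyr)
library(ggplot2)
library(anomalize)

decomposition_plot <- tidyverse_cran_downloads %>%
  filter(package == "lubridate") %>%  # keep a single time series
  ungroup() %>%                       # drop the grouping before plotting
  time_decompose(count) %>%
  anomalize(remainder) %>%
  plot_anomaly_decomposition() +
  ggtitle("lubridate downloads: anomaly decomposition")

decomposition_plot
```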
Anomaly Lower and Upper Bounds
The last step is to create the lower and upper bounds around the “observed” values. This is the work of
time_recompose(), which recomposes the lower and upper bounds of the anomalies around the observed values. Two new columns were created: “recomposed_l1” (lower limit) and “recomposed_l2” (upper limit).
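To see the two new columns, the recomposed result can be inspected directly, as in this short sketch:

```r
library(dplyr)
library(anomalize)

recomposed <- tidyverse_cran_downloads %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  time_recompose()

# Column names now include recomposed_l1 and recomposed_l2
names(recomposed)
```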
Let’s visualize on just the “lubridate” data. We can do so using
plot_anomalies() and setting
time_recomposed = TRUE. This function works on both single and grouped data.
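A sketch of that plot for the “lubridate” series follows. The title is an illustrative addition.

```r
library(dplyr)
library(ggplot2)
library(anomalize)

lubridate_plot <- tidyverse_cran_downloads %>%
  filter(package == "lubridate") %>%
  ungroup() %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  plot_anomalies(time_recomposed = TRUE) +  # draw the recomposed bands
  ggtitle("lubridate downloads: anomalies detected")

lubridate_plot
```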
That’s it. Once you have the “anomalize workflow” down, you’re ready to detect anomalies!
Packages That Helped In Development
The first thing we did after getting this request was to investigate what methods are currently available. The last thing we wanted to do was solve a problem that’s old news. We were aware of three excellent open source tools:
- Twitter’s AnomalyDetection package: Available on GitHub.
- Rob Hyndman’s forecast::tsoutliers() function, available through the forecast package on CRAN.
- Javier Lopez-de-Lacalle’s package, tsoutliers, on CRAN.
We have worked with all of these R packages and functions before, and each presented learning opportunities that could be integrated into a scalable workflow.
What we liked about Twitter’s
AnomalyDetection was that it used two methods in tandem that work extremely well for time series. The “Twitter” method uses time series decomposition (i.e.
stats::stl()) but instead of subtracting the Loess trend, it uses the piece-wise median of the data (one or several medians split at specified intervals). The other method that
AnomalyDetection employs is the use of Generalized Extreme Studentized Deviate (GESD) as a way of detecting outliers. GESD is nice because it is resistant to the high leverage points that tend to pull a mean or even median in the direction of the most significant outliers. The package works very well with stationary data or even data with trend. However, the package was not built with a tidy interface making it difficult to scale.
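Both ideas are available in anomalize as method options. The sketch below pairs method = "twitter" for decomposition with method = "gesd" for detection on a single series. It illustrates the anomalize interface and is not a reproduction of the AnomalyDetection package itself.

```r
library(dplyr)
library(anomalize)

twitter_gesd <- tidyverse_cran_downloads %>%
  filter(package == "lubridate") %>%
  ungroup() %>%
  time_decompose(count, method = "twitter") %>%  # median spans instead of a Loess trend
  anomalize(remainder, method = "gesd") %>%      # GESD outlier detection
  time_recompose()

# How many points were flagged?
table(twitter_gesd$anomaly)
```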
Forecast tsoutliers() Function
The tsoutliers() function from the forecast package is a great way to efficiently collect outliers for cleaning prior to performing forecasts. It uses an outlier detection method based on STL with a 3X interquartile range (IQR) around the remainder from time series decomposition. It’s very fast because there are a maximum of two iterations to determine the outlier bands. However, it’s not set up for a tidy workflow, nor does it allow adjustment of the 3X factor. Some time series may need more or less depending on the magnitude of the variance of the remainders in relation to the magnitude of the outliers.
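To make the 3X IQR idea concrete, here is a base R sketch that applies it to an STL remainder. It is a simplified illustration and not the actual implementation in forecast::tsoutliers(). The built-in AirPassengers series, the log transform, and the variable names are all choices made for this example.

```r
# Decompose a built-in monthly series with STL (base R only)
fit <- stats::stl(log(AirPassengers), s.window = "periodic")
remainder <- as.vector(fit$time.series[, "remainder"])

# Bands: quartiles extended by a multiple of the interquartile range
iqr_factor <- 3
quartiles  <- stats::quantile(remainder, probs = c(0.25, 0.75))
iq_range   <- quartiles[[2]] - quartiles[[1]]
lower <- quartiles[[1]] - iqr_factor * iq_range
upper <- quartiles[[2]] + iqr_factor * iq_range

# Flag remainder values that fall outside the bands
is_outlier <- remainder < lower | remainder > upper
sum(is_outlier)
```

Changing iqr_factor is the adjustment that a fixed 3X rule does not allow: a smaller factor narrows the bands and flags more points.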
The tsoutliers package works very effectively for detecting anomalies on a number of traditional forecast time series. However, speed was an issue, especially when attempting to scale to multiple time series or to minute or second timestamp data.
Anomalize: Incorporating The Best Of All
In reviewing the available packages, we learned from them all, incorporating the best of each:
Decomposition Methods: We include two time series decomposition methods:
- "stl" (using traditional seasonal decomposition by Loess)
- "twitter" (using seasonal decomposition with median spans)
Anomaly Detection Methods: We include two anomaly detection methods:
- "iqr" (using an approach similar to the 3X IQR of forecast::tsoutliers())
- "gesd" (using the GESD method employed by Twitter’s AnomalyDetection)
In addition, we’ve made some improvements of our own:
Anomalize Scales Well: The workflow is tidy and scales with dplyr groups. The functions operate as expected on grouped time series, meaning you can anomalize 500 time series data sets just as easily as a single data set.
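As a sketch of what scaling with groups means in practice, the same pipeline can be run on the grouped dataset and the flagged rows tallied per package. The tally step uses base R and is just one way to summarize the result.

```r
library(dplyr)
library(anomalize)

# The data is already grouped by package, so each series is handled separately
flagged <- tidyverse_cran_downloads %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%
  time_recompose() %>%
  filter(anomaly == "Yes")

# Number of flagged observations per package
anomalies_per_package <- table(as.character(flagged$package))
anomalies_per_package
```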
Visuals For Analyzing Anomalies:
We include a way to get bands around the “normal” data separating the outliers. People are visual, and bands are really useful in determining how the methods are working or if we need to make adjustments.
We include two plotting functions, making it easy to see what’s going on during the “anomalize workflow” and providing a way to assess the effect of “adjusting the knobs” that drive the decomposition and anomaly detection.
The entire workflow works with tibbletime data set up with a time-based index. This is good because in our experience almost all time data comes with a date or datetime timestamp that’s really important to the characteristics of the data.
There’s no need to calculate how many observations fall within a frequency span or trend span. We set up frequency and trend using time-based spans such as “1 week” or “2 quarters” (powered by tibbletime).
We hope that the open source community can benefit from
anomalize. Our client is very happy with it, and it’s exciting to see that we can continue to build in new features and functionality that everyone can enjoy.
Business Science specializes in “ROI-driven data science”. We offer training, education, coding expertise, and data science consulting related to business and finance. Our latest creation is Business Science University, which is coming soon! In addition, we deliver about 80% of our effort into the open source data science community in the form of software and our Business Science blog. Visit Business Science on the web or contact us to learn more!