Speed Up Your Code Part 2: Parallel Processing Financial Data with multidplyr + tidyquant
Since my initial post on parallel processing with
multidplyr, there have been some recent changes in the
tidy eco-system: namely the package
tidyquant, which brings financial analysis to the
tidyquant package drastically increase the amount of tidy financial data we have access to and reduces the amount of code needed to get financial data into the tidy format. The
multidplyr package adds parallel processing capability to improve the speed at which analysis can be scaled. I seriously think these two packages were made for each other. I’ll go through the same example used previously, updated with the new
Table of Contents
Parallel Processing Applications in Financial Analysis
Collecting financial data in tidy format was overly difficult. Getting from
tibble was a pain, and there’s some amazing
tidyverse functionality that cannot be used without the
tibble (tidy data frame) format. That all changed with
tidyquant. There’s a wide range of free data sources, and the
tidyquant package makes it super simple to get financial and economic data in tidy format (more on this in a minute).
There’s one caveat to collecting data at scale: it takes time. Getting data from the internet, historical stock prices or financial statements or real-time statistics, for 500+ stocks can take anywhere from several minutes to 20+ minutes depending on the data size and number of stocks.
tidyquant makes it easier to get and analyze data in the correct format, but we need a new tool to speed up the process. Enter
multidplyr package makes it super simple to parallelize code. It works perfectly in the
tidyverse, and, by the associative propery, works perfectly with
The example in this post shows off the new
tq_get() function for getting stock prices at scale. However, we can get and scale much more than just stock prices.
tq_get has the following data retrieval options:
get = "stock.index": This retrieves the entire list of stocks in an index. 18 indexes are available. Use
tq_get_stock_index_options() to see the full list.
get = "stock.prices": This retrieves historical stock prices over a time period specified by
from. This is the default option, and the one we use in this post.
get = "key.ratios": This retrieves the key ratios from Morningstar, which are historical values over the past 10-years. There are 89 key ratios ranging from valuation, to growth and profitability, to efficiency. This is a great place to chart business performance over time and to compare the key ratios by company. Great for financial analysis!
get = "key.stats": This retrieves real-time key stats from Yahoo Finance, which consist of bid, ask, day’s high, day’s low, change, current P/E valuation, current Market Cap, and many more up-to-the-minute stats on a stock. This is a great place for the day trader to work since all of the data is accurate as of the second you download it.
get = "financials": This retrieves the annual and quarterly financial statement data from Google Finance. Great for financial analysis!
get = "economic.data": This retrieves economic data from the FRED database by FRED code. As of the blog post date, there are 429,000 US and international time-series data sets from 80 sources. All you need is the FRED symbol such as “CPIAUCSL” for CPI. Vist the FRED website to learn more.
Other get options:
"splits". There’s lots of data you can get using
The point I want to make is that ANY OF THESE GET OPTIONS CAN BE SCALED USING THE PROCESS I USE NEXT!!!
multidplyr package is not available on CRAN, but you can install it using
devtools. Also, install the development version of
tidyquant which has added functionality that will be available on CRAN soon with v0.3.0.
For those following along in R, you’ll need to load the following packages:
I also recommend the open-source RStudio IDE, which makes R Programming easy and efficient.
multidplyr workflow can be broken down into five basic steps shown in Figure 1. The five steps are implemented in Processing in Parallel.
Figure 1: multidplyr Workflow
Essentially, you start with some data set that you need to do things to multiple times. Your situation generally falls into one of two types:
- It could be a really large data set that you want to split up into several small data sets and perform the same thing on each.
- It could be one data set that you want to perform multiple things on (e.g. apply many models).
The good news is both situations follow the same basic workflow. The toughest part is getting your data in the format needed to process using the workflow. Don’t worry, we’ll go through a real world example shortly so you can see how this is accomplished.
Real World Example
We’ll go through the
multidplyr workflow using a real world example that I routinely use: collecting stock prices from the inter-web. Other uses include using modeling functions over grouped data sets, using many models on the same data set, and processing text (e.g. getting n-grams on large corpora). Basically anything with a loop.
In preparation for collecting stock prices, we need two things:
- A list of stocks
- A function to get stock prices from a stock symbol
Let’s see how tidyquant makes this easy. First, getting a stock index used to be a pain:
Before tidyquant (Don’t use this):
Now with tidyquant (Use this!):
Second, getting stock prices in tidy format used to be a pain:
Before tidyquant (Don’t use this):
Now with tidyquant (Use this!):
Note that you can replace
"financials", etc to get other financial data for a stock symbol. These options can be scaled as well!
Processing In Series
The next computation is the routine that we wish to parallelize, but first we’ll time the script running on one processor, looping in series. We are collecting ten years of historical daily stock prices for each of the 500+ stocks. This is now a simple chaining operation with
tidyquant, which accepts a tibble of stocks with the symbols in the first column.
sp_500_processed_in_series is a
tibble (tidy data frame) with the stock prices for the 500+ stocks.
And, let’s see how long it took when processing in series. The processing time is the time elapsed in seconds. Converted to minutes this is approximately 3.71 minutes.
Processing in Parallel
We just collected ten years of daily stock prices for over 500 stocks in about 3.71 minutes. Let’s parallelize the computation to get an improvement. We will follow the six steps shown in Figure 1.
Step 0: Get Number of Cores (Optional)
Prior to starting, you may want to determine how many cores your machine has. An easy way to do this is using
parallel::detectCores(). This will be used to determine the number of groups to split the data into in the next set.
Step 1: Add Groups
Let’s add groups to
sp_500. The groups are needed to divide the data across your
cl number cores. For me, this is 8 cores. We create a
group vector, which is a sequential vector of
1:cl (1 to 8) repeated the length of the number of rows in
sp_500. We then add the
group vector to the
sp_500 tibble using the
Step 2: Create Clusters
create_cluster() function from the
multidplyr package. Think of a cluster as a work environment on a core. Therefore, the code below establishes a work environment on each of the 8 cores.
Step 3: Partition by Group
Next is partitioning. Think of partitioning as sending a subset of the initial
tibble to each of the clusters. The result is a partitioned data frame (
party_df), which we explore next. Use the
partition() function from the
multidplyr package to split the
sp_500 list by group and send each group to a different cluster.
by_group, looks similar to our original
tibble, but it is a
party_df, which is very different. The key is to notice that the there are 8
Shard has between 63 and 64 rows, which evenly splits our data among each shard. Now that our
tibble has been partitioned into a
party_df, we are ready to move onto setting up the clusters.
Step 4: Setup Clusters
The clusters have a local, bare-bones R work environment, which doesn’t work for the vast majority of cases. Code typically depends on libraries, functions, expressions, variables, and/or data that are not available in base R. Fortunately, there is a way to add these items to the clusters. Let’s see how.
For our computation, we are going to need to add the
tidyquant library and our variables
from. We do this by using the
cluster_assign_value() functions, respectively.
We can verify that the libraries are loaded using the
Step 5: Run Parallelized Code
Now that we have our clusters and partitions set up and everything looks good, we can run the parallelized code. The code chunk is a little bit different than before because we need to use
purrr to map
tq_get() to each stock symbol:
- Instead of starting with the
sp_500 tibble, we start with the
- This is how we scale with
purrr: We use a combination of
purrr::map() to map our
tq_get() function to the stocks.
- We combine the results at the end using the
multidplyr::collect() function. The result is a nested tibble.
- We unnest with
tidyr::unnest() to return the same single-level tibble as before.
- Finally, we use
dplyr::arrange() to arrange the stocks in the same order as previous. The
collect() function returns the shards (cluster data groups) binded in whatever group is done first. Typically you’ll want to re-arrange.
Let’s check out the results
And, let’s see how long it took when processing in parallel.
The processing time is approximately 0.66 minutes, which is 5.6X faster! Note that it’s not a full 8X faster because of transmission time as data is sent to and from the nodes. With that said, the speed will approach 8X improvement as calculations become longer since the transmission time is fixed whereas the computation time is variable.
Parallelizing code can drastically improve speed on multi-core machines. It makes the most sense in situations involving many iterative computations. On an 8 core machine, processing time significantly improves. It will not be quite 8X faster, but the longer the computation the closer the speed gets to the full 8X improvement. For a computation that takes two minutes under normal conditions, we improved the processing speed by over 5X through parallel processing!
The focus of this post was to show how you can implement
tidyquant with the
multidplyr parallel-processing package for parallel processing financial applications. We worked through the five main steps in the
multidplyr workflow using the new
tidyquant package, which makes it much easier to get financial data in tidy format. Keep in mind that you can use any
tq_get “get” option at scale.
Tidyquant Vignettes: This tutorial just scratches the surface of
tidyquant. The vignettes explain much, much more!
Multidplyr on GitHub: The vignette explains the
multidplyr workflow using the
flights data set from the