tidyquant Integrates Quandl: Getting Data Just Got Easier
Written by Matt Dancho
Today I’m very pleased to introduce the new Quandl API integration that is available in the development version of
tidyquant. Normally I’d introduce this feature during the next CRAN release (v0.5.0 coming soon), but it’s really useful and honestly I just couldn’t wait. If you’re unfamiliar with Quandl, it’s amazing: it’s a web service that has partnered with top-tier data publishers to enable users to retrieve a wide range of financial and economic data sets, many of which are FREE! Quandl has it’s own R package (aptly named
Quandl) that is overall very good but has one minor inconvenience: it doesn’t return multiple data sets in a “tidy” format. This slight inconvenience has been addressed in the integration that comes packaged in the latest development version of
tidyquant. Now users can use the Quandl API from within
tidyquant with three functions:
quandl_search(), and the core function
tq_get(get = "quandl"). In this post, we’ll go through a user-contributed example, How To Perform a Fama French 3 Factor Analysis, that showcases how the Quandl integration fits into the “Collect, Modify, Analyze” financial analysis workflow. Interested readers can download the development version using
devtools::install_github("business-science/tidyquant"). More information is available on the tidyquant GitHub page including the updated development vignettes.
Table of Contents
- Example: How to Perform a Fama French 3 Factor Analysis
- Further Reading
tidyquant: Bringing financial analysis to the tidyverse
The topic for today is the Quandl integration today. Quandl enables access to a wide range of financial and economic data. It has it’s own R library appropriately named
Quandl. Users can sign up for a FREE account, and in return users get an API key that enables access to numerous free and paid data sets. The
Quandl package is very good: it enables searching the Quandl databases from the R console. Once a data set is found, the data set “code” can be used to retrieve the data in various formats. The one downside is that, although you can get multiple data sets (e.g. for multiple stocks, FRED codes, etc), the data returned is not “tidy”. This is where the
tidyquant integration fits in. The integration makes it even more convenient to get data, and when multiple data sets are retrieved they are returned in one “tidy” data frame (aka “long” format which is perfect for grouping and scaling analysis)! In addition, you only need to load one package,
tidyquant, to get the full capabilities of the Quandl API. The figure below shows how Quandl fits into the “Collect, Modify, Analyze”
tidyquant financial analysis workflow.
If you are new to
tidyquant, there’s a few core functions that you need to be aware of. I’ve broken them down by step in the CMA process.
tq_get()is a one-stop shop for getting web-based financial data in a “tidy” data frame format. Get data for daily stock prices (historical), key statistics (real-time), key ratios (historical), financial statements, economic data from the FRED, FOREX rates from Oanda, and now Quandl!
- Midstream (Modify):
tq_mutate()manipulate financial data.
tq_mutate()is used to add a column to the data frame.
tq_transmute()is used to return a new data frame which is necessary for periodicity changes.
tq_portfolio()aggregates a group (or multiple groups) of asset returns into one or more portfolios.
- Downstream (Analyze):
PerformanceAnalyticsfunctions that turn investment returns into performance metrics.
To learn more about the functions, browse the Development Vignettes on GitHub.
To use the Quandl integration and other new
tidyquant features, you’ll need to install the development version available on the Business Science GitHub Site. You can download with
corrr packages. The
corrr packages will help in our analysis at the end of the financial analysis workflow.
I also recommend the open-source RStudio IDE, which makes R Programming easy and efficient especially for financial analysis. Now, onto a really neat example showing off why Quandl is such a great tool.
Before we get started, I’d like to thank Bob Rietveld for the usage case. He’s been doing a lot of work with Fama French three and five factor models. You can find an example of his FF analyses here. In this example, we’ll perform a Fama French three factor regression on a portfolio of the following stocks: 20% AAPL, 20% F, 40% GE, and 20% MSFT. According to Investopedia:
The Fama and French Three Factor Model is an asset pricing model that expands on the capital asset pricing model (CAPM) by adding size and value factors to the market risk factor in CAPM. This model considers the fact that value and small-cap stocks outperform markets on a regular basis. By including these two additional factors, the model adjusts for the out-performance tendency, which is thought to make it a better tool for evaluating manager performance.
The CMA process steps we’ll implement are as follows:
- Collect Data: We’ll use the new Quandl integration to get both stock prices and Fama French data sets.
- Modify Data: This is a portfolio analysis so we’ll need to aggregate stock returns into a weighted portfolio
- Analyze Data: We’ll perform a regression analysis, and we need the
broompackage for the
In this step, we will collect two data frames. The first is the historical stock returns for individual stocks. The second is the Fama French three factor data set. We are going to use the Quandl API integration so first set your API key using
quandl_api_key(). If you don’t have an API key yet, you can sign up with Quandl.
Collecting Historical Stock Returns
Next, let’s create a table of stocks. We will use the “WIKI” database which returns open, high low, close, volume, dividends, splits, and adjusted prices. The Quandl data sets use the following code format: “Database” / “Data Set”. For “AAPL”, this would be “WIKI/AAPL” indicating the WIKI database and AAPL data set. The code in the first column will allow us to pipe (
%>%) the stock list to the
tq_get() function next.
Once we have the stocks, we can very easily use
tq_get(get = "quandl") to get stock prices and even stock returns depending on the options we use. The following time series options are available to be passed to the underlying
order= “asc”, “desc”
from) = “yyyy-mm-dd”
to) = “yyyy-mm-dd”
column_index= numeric column number (e.g. 1)
rows= numeric row number indicating first n rows (e.g. 100)
collapse= “none”, “daily”, “weekly”, “monthly”, “quarterly”, “annual”
transform= “none”, “diff”, “rdiff”, “cumul”, “normalize”
to to select a ten year time period from the beginning of 2007 through the end of 2016,
transform = "rdiff" to get percentage returns,
collapse = "monthly" to get monthly data, and
column_index = 11 to get the eleventh column, “adj.close”. We’ll
rename the column from “adj.close” to “monthly.returns” to accurately describe the values.
Collecting Fama French 3-Factor Monthly Data
Next, we need to get the Fama French data. Suppose we don’t know exactly what we are looking for. We’ll use the function,
quandl_search(), to query the Quandl API (a wrapper for
Quandl.search()). We can search within the R console by setting
query to a descriptive value. We’ll set
per_page = 5 to get the top 5 results. We’ll set
silent = TRUE to turn off the meta data output (in practice it may be beneficial to leave this easy-to-read option on). The results returned contain the “id”, dataset_code, “database_code”, “name”, “description”, etc, which gives us both insight into the data set contents and the information needed to retrieve. I’ve removed “description” to make it easier to view the information.
|30216128||MOMENTUM_A||KFRENCH||Fama/French Factors (Annual)||2017-03-18T21:08:12.712Z||2016-12-31||1927-12-31||Date, Momentum||annual||Time Series||FALSE||389|
|30579533||FACTORS_A||KFRENCH||Fama/French Factors (Annual)||2017-03-18T21:06:21.885Z||2016-12-31||1927-12-31||Date, Mkt-RF, SMB, HML, RF||annual||Time Series||FALSE||389|
|2292156||FACTORS_M||KFRENCH||Fama/French Factors (Monthly)||2017-03-18T21:06:21.953Z||2017-01-31||1926-07-31||Date, Mkt-RF, SMB, HML, RF||monthly||Time Series||FALSE||389|
|2292158||FACTORS_W||KFRENCH||Fama/French Factors (Weekly)||2017-03-18T21:06:25.103Z||2017-01-27||1926-07-02||Date, Mkt-RF, SMB, HML, RF||weekly||Time Series||FALSE||389|
|2676225||MOMENTUM_M||KFRENCH||Fama/French Factors (Monthly)||2017-03-18T21:08:12.746Z||2017-01-31||1927-01-31||Date, Momentum||monthly||Time Series||FALSE||389|
The third result, “FACTORS_M”, is what we need. We can retrieve with
tq_get(get = "quandl") by piping (
"KFRENCH/FACTORS_M". (Remember that the format is always database code / dataset code). We’ll tack on
collapse = "monthly" to ensure the dates match up with the returns,
Now we have all of the data needed. We are ready to move on to modifying the data.
There’s two parts to this step. First, we will aggregate the portfolio in the weights specified in the beginning of the example:
- 20% AAPL, 20% F, 40% GE, and 20% MSFT.
Second, we will join the aggregated portfolio returns with the Fama French data.
Portfolio aggregation is performed using
tq_portfolio() as follows. We create a tibble (“tidy” data frame) of weights that can be mapped using the first column, “stocks”.
Then we pass the individual stock returns,
stock_returns_quandl, to the
tq_portfolio() function specifying the assets column “symbol” and the returns column “monthly.returns”. The
weights_tib tibble is also passed to the
weights argument. Note that there is also an argument,
rebalance_on = c(NA, "years", "quarters", "months", "weeks", "days") if rebalancing is a consideration to factor into the model. Last, the output column is renamed to “monthly.returns” using the
Join Portfolio Returns and Fama French Data
We can join the two data sets by the “date” column in each using
left_join from the
Now we are ready to analyze.
In the final step we will analyze two ways. First, we will perform the three factor regression, which yields model parameters. Second, we will review visually by plotting a correlation matrix.
Three Factor Regression Model
The article, “Rolling Your Own: Three-Factor Analysis”, by William J. Bernstein with Efficient Frontier goes through an excellent step-by-step explaining the method. We are concerned with the following variables:
- Return of the Total Market minus the T-Bill return (mkt.rf): The return of the total market (CRSP 1-10) minus the T-bill return (Mkt)
- Small Minus Big (smb): The return of small company stocks minus that of big company stocks
- High Minus Low (hml): The return of the cheapest third of stocks sorted by price/book minus the most expensive third
Three factor regression is performed with the
lm() function by analyzing the relationship between the portfolio returns and the three FF factors.
glance from the
broom package, we can review the regression metrics. Note
kable() from the
knitr package is used to create aesthetically pleasing tables. We can see from the “r.squared” value that 67% of the variance of the portfolio returns is explained by the model.
tidy from the
broom package, we can review the model coefficients: these are the most interesting. The intercept is the alpha, and at 0.005 the portfolio is outperforming the model by approximately 0.005% per month or roughly 0.055% per year (although the p-value indicates this is not statistically significant). Next are the “loadings” for the three factors. The “mkt.rf” is the beta for the portfolio, which indicates very low volatility compared to the market (anything less than 1.0 means lower volatility than the market). The “smb” value of essentially zero signifies large-cap (anything below 0 is large cap, above 0.5 is small cap). The “hml” value of essentially zero signifies a growth fund (a zero value defines a growth portfolio, a value of more than 0.3, a value fund).
We can also visualize the results using the new
corrr package to get a sense of the relationship the portfolio returns to each of the factors. The
correlate() function creates a correlation matrix. The
shave() function removes the upper diagonal.
Visualizing is just one more step with
rplot() (similar to
ggplot() + correlation geom). We can see that the market reference is highly correlated to the monthly portfolio returns, but this is the only value that has a significant correlation.
The new Quandl integration opens up a lot of doors with regards to financial and economic data. The API is now integrated into the “tidyverse” workflow enabling scaling and easy data manipulations following the “Collect, Modify, Analyze” financial analysis workflow we use with
tidyquant. The Fama French analysis is just one example of new and interesting analyses that are now easily performed. This is just the beginning. Feel free to email us at Business Science with new and interesting ways you are using
We covered a lot of ground today. We exposed you to the new Quandl integration and how it fits within the “Collect, Modify, Analyze” financial analysis workflow. We used
quandl_api_key() to set an API key, enabling access to the Quandl API. We used
quandl_search() to search the Quandl API for Fama French data. We used
tq_get(get = "quandl") to retrieve data from Quandl, passing various options to conveniently get monthly returns. We aggregated a portfolio using
tq_portfolio and joined the portfolio returns with the Fama French data. We then performed a basic Fama French Three Factor analysis. The entire analysis from beginning to end was easy, efficient, and “tidy”! =)
TQ01-Core Functions in tidyquant, Development Version: The Core Functions in tidyquant vignette (development version) has a nice section on the features related to the Quandl integration!
R for Data Science: A free book that thoroughly covers the “tidyverse”. A prerequisite for maximizing your abilities with