Quantitative Stock Analysis Tutorial: Screening the Returns for Every S&P500 Stock in Less than 5 Minutes
Written
Quantitative trading strategies are easy to develop in R if you can manage the data workflow. In this post, I analyze every stock in the S&P500 to screen in terms of risk versus reward. I’ll show you how to use quantmod
to collect daily stock prices and calculate log returns, rvest
to web scrape the S&P500 list of stocks from Wikipedia, purrr
to map functions and perform calculations on nested tibbles (tidyverse
data frames), and plotly
to visualize risk versus reward and extract actionable information for use in your trading strategies. At the end, you will have a visualization that compares the entire set of S&P500 stocks, and we’ll screen them to find those with the best future prospects on a quantitative basis. As a bonus we’ll investigate correlations to add diversification to your portfolio. Finally, the code that generates the plotly
and corrplot
visualizations is made available on GitHub for future stock screening. Whether you are a veteran trader or a newbie stock enthusiast, you’ll learn a useful workflow for modeling and managing massive data sets using the tidyverse
packages. And, the entire script runs in less than five minutes so you can begin screening stocks quickly.
Here’s a sneak peek at the interactive S&P500 stock screening visualization that compares stocks based on growth (reward), variability (risk), and number of samples (risk). The tool can be used to visualize stocks with good characteristics: High growth (large mean log returns), low variability (low standard deviation), and high number of samples (lots of days traded). Make sure you zoom in on sections of interest and hover over the dots to gain insights.
Here’s a sneak peek at the correlation analysis, which is useful in determining which pairs of stocks to select to minimize risk. These are the top 30 stocks according to high growth and low risk. The key here is to select a portfolio that minimizes the correlations between stocks, thus further minimizing risk. Select stocks with low correlation (i.e. small light blue dots indicate pairs have low correlation (good); large dark blue combinations indicate high correlation (bad)).
Table of Contents
Update
- 2016-10-29: An issue was uncovered due to stock splits. Stock splits cause a large decline in the stock performance, which penalizes the stocks that split by increasing the standard deviation of the daily returns. The fix is to use the adjusted stock prices. In addition to the fix, I expanded on the correlation section to include the top 30 stocks with high growth, low variability and high samples. Based on feedback, I also expanded the introduction to better explain the goals of the analysis and to include the key visualizations.
Prerequisites
For those following along in R, you’ll need to load the following packages:
If you don’t have these installed, run install.packages(pkg_names)
with the package names as a character vector (pkg_names <- c("quantmod", "xts", ...)
).
Stock Analysis: An Individual Stock
Before we extrapolate to a full-blown S&P 500 Analysis, we need to understand how to analyze a single stock. We’ll walk through three steps to quickly get you up to speed on quantitatively analyzing stock returns:
- Exploring the Quantmod Package
- Simulating Stock Prices
- Understanding Drivers of the Simulation
Exploring the Quantmod Package
We’ll use the quantmod
package to retrieve and manipulate stock information. There’s a lot of details behind quantmod
and extensible timeseries (xts
) objects, much of which is beyond the scope of this post. We’ll skim the surface to get you to a proficient level. If you want to get to expert, a good resource is the University of Washington’s R Programming for Quantitative Finance. We’ll use MasterCard, ticker symbol “MA”, for the example.
Quantmod Fundamentals
First, load the quantmod
package and use the getSymbols()
function to retrieve stock prices. I use the optional from
and to
arguments to limit the range of prices.
The getSymbols()
function adds a variable to R’s Global Environment with the stock ticker as the variable name. If you are using RStudio, check your Environment tab, and you should see MA
. What is MA
? Informally, it’s the daily stock prices and volumes. We need to understand it more formally. Send MA
to the class()
function using the pipe operator (%>%
):
xts Objects for Time Series
It turns out that MA
is an xts
object, which is a format conducive to manipulating timeseries. It plays nicely with and is an extension of the zoo
package. The xts vignette elaborates on the details, and another nice resource is the R/Finance Workshop Presentation: Working with xts and quantmod. Our priority is to understand the structure of the data. Pipe MA
to the str()
and head()
functions to inspect.
A couple observations:
- The index is a column of dates. This is different from data frames, which generally do not have row names and indicies are a sequence from
1:nrow
.
- The values are prices and volumes. We’ll be working primarily the adjusted price when graphing, since this removes the effect of stock splits.
quantmod + xts
The quantmod
package plays nicely with xts
formatted objects. We can plot using the quantmod::chartSeries()
function. Because of stock splits, I changed to use the adjusted prices with the Ad()
function. For workflow purposes, I use the pipe (%>%
) to get the adjusted prices first, and then send the adjusted prices to the chart function. I do this because it is easier to read and it makes complex actions simple to understand. You can also use chartSeries(Ad(MA))
, but this is more difficult to understand.
The chartSeries()
function has a number of options help visualize the stock performance. A few that I’ve added are:
subset
: Isolated the year to 2015. This takes dates and ranges in xts
subset format. Here’s some tricks from StackOverflow.
TA
: Added Bollinger Bands, Volume, and Moving Average Convergence Divergence (MACD) plots. These can also be added after the fact with add___() functions (e.g addMACD()
).
theme
: Changed to a white theme.
The quantmod
package includes functions to get daily, weekly, monthly, quarterly, and annual returns. We’ll use dailyReturn()
to get a xts
object of daily returns (see ?periodReturn
, which is the base function). The default is an arithmetic return. Specify type = "log"
if a logarithmic return is needed. We’ll be dealing with log returns for structural reasons.
Although we just skim the surface of quantmod
in this post, the package has a lot to offer for stock analysis (visit www.quantmod.com for more information). We’ll primarily use it to retrieve stock prices (getSymbols()
) and calculate log returns (periodReturn(period = 'daily')
). Just keep in mind you can do a lot more with it.
Simulating Stock Prices
The approach I use is similar to rfortraders.com’s Lecture 6 - Stochastic Processes and Monte Carlo. The fundamental idea is that stock returns, measured daily, weekly, monthly, …, are approximately normally distributed and uncorrelated. As a result, we can theoretically model the behavior of stock prices within a confidence interval based on the stock’s prior returns.
Returns are Normally Distributed
Applying the log-transformation, we can visually see that the daily returns are approximately normally distributed:
We can examine the distribution of log returns by applying the quantile()
function.
The median daily log return is 0.001 and the 95% confidence interval is between -0.0447 and 0.0475.
The mean and standard deviation of the daily log returns are:
Finally, to get the actual returns, we need to re-transform the log returns. Pipe the mean of the log returns (mean_log_returns
) to exp()
:
On average, the mean daily return is 0.0976% more than the previous day’s price. Doesn’t sound like much, but it compounds daily at an exponential rate.
Random Walk
Since we have a mean and a standard deviation, we can simulate using a process called a random walk. We’ll simulate prices for 1000 trading days. Keep in mind that every year has approximately 252 trading days, so this simulation spans just under four years.
The script below does the following: we start by specifying the number of random walks (N
), the mean (mu
), and the standard deviation (sigma
). The script then simulates prices by progressively calculating a new price using a random return from the normal distribution characterized by mu
and sigma
, and multiplying it to the previous day’s price. We then visualize the simulation.
Bummer! It looks like we are going to lose money with this investment. Can we trust this simulation?
No. The single random walk is just one of the many probabilistic outcomes. We need to simulate many iterations to build confidence intervals.
Monte Carlo Simulation
Monte-Carlo simulation does just that: we repeatedly perform the random walk simulation process hundreds or thousands of times. We’ll perform 250 Monte Carlo simulations (M = 250
) for one year of trading days simulations (N = 252
).
We can get confidence intervals for the stock price at the end of the simulation using the quantile()
function.
The 95% confidence interval is between $69.37 and $227.69, with a median (“most likely”) estimated price of $129.99.
Is this realistic? It looks close considering the simulated growth rate is very close to the historical compound annual growth rate (CAGR):
Understanding Drivers of the Simulation
Most importantly, the simulation gives us a baseline to compare stocks using the key drivers. The drivers of the simulation are the mean and the standard deviation of the log returns: The mean characterizes the average growth or return while the standard deviation characterizes the volatility or risk.
We now know how to analyze an individual stock using log returns, but wouldn’t it be better if we could apply this analysis to many stocks so we can review and screen them based on risk versus reward? Uh… yes! This is easy to do. Let’s get going!
Stock Analysis: Expanded to All S&P500 Stocks
We’ll expand to a full blown S&P500 Analysis following the modeling process workflow from R For Data Science, Chapter 25: Many Models. Once we have a modeling process down for an individual stock, we can map it to many stocks using a modeling workflow. Our workflow involves four steps:
- Get the list of S&P500 stocks.
- Create functions that retrieve needed data from a single stock. We’ll need to get stock prices, and compute log-returns.
- Utilize
purrr
to map()
the functions across all stocks.
- Visualize the results.
Web Scraping: Get the List of S&P500 Stocks
We’ll use the rvest
package to collect S&P500 stocks from Wikipedia. We use read_html()
to collect the HTML from Wikipedia. Then, we parse the HTML to a data frame using html_node()
and html_table()
. At the end, I return the data as a tibble
, and cleanup the column names.
The sp_500
data frame has four columns:
ticker.symbol
: The stock ticker
security
: The company name
gics.sector
: The primary sector according to the GICS
gics.sub.industry
: The industry sub sector according to the GICS
It’s a good idea to inspect the categorical data before we start the analysis. The lapply()
function below loops through each column of the data set counting the length of the unique items. The result is a count of distinct values for each column.
We have a minor problem:
Notice that we have 505 ticker symbols but only 504 securities. There must be a duplicate entry in securities. Let’s inspect any securities that are duplicated. We’ll use the dplyr
functions group_by()
, summarize()
, and filter()
to find any security
that appears more than once.
The culprit is Under Armour. Let’s check out why.
Under Armour has two ticker symbols: UA and UA.C. We’ll remove the UA.C from the data set.
Back to inspecting the S&P500 data:
We know from the lapply()
function that there are 504 securities that are in 126 sub industries and in 10 sectors. However, we might want to understand the distribution of securities by sector. I again use group_by()
and summarise()
to get counts. In the visualization, I use forcats::fct_reorder()
to organize the gics.sector
by greatest frequency (count
). The forcats
package makes organizing factor
categories a breeze in R, which previously was a huge pain! Check out R for Data Science, Chapter 15: Factors for help working with factors.
Separating the distribution of securities into industry sectors shows us our options if we were to select stocks using a diversification strategy. While diversification and portfolio optimization is outside the scope of this tutorial, it’s worth mentioning that a risk mitigation technique is to select a basket (or portfolio) of stocks that have low return correlation. Typically, selecting from different industries and sectors helps to reduce this correlation and diversify the portfolio. While I don’t go into portfolio optimization, as a bonus I’ll show how it’s possible to visualize the correlations of the best performing stocks.
Creating Functions to Map
Technical Details Ahead:
We are going to dive into some structural aspects of xts
and tibble
objects. Feel free to skip the details and copy the functions in this section. You will still be able to follow along.
We have a tidy data frame (tibble
) of S&P500 securities stored in the variable sp_500
. We want to perform some functions on the data frame such as:
- Getting the stock prices for each security using
getSymbols()
.
- Getting the log returns of the stock prices
periodReturn()
.
We need to come through an issue between xts
and tibble
objects before we jump into mapping the functions. We are going to create nested data frames, which basically means that the data is stored as a list in the cells of sp_500
(Don’t worry if this doesn’t make sense yet). A structural issue occurs with nesting xts
objects within tibble
data frames. The xts
objects won’t unnest. To get around this structural issue, we need to convert these objects to tibble
format.
Get Stock Prices Function
First, let’s create a function to return the stock prices. get_stock_prices()
is a wrapper for quantmod::getSymbols()
that takes a ticker, a return format, and any other getSymbols()
arguments and returns the prices in either xts
or tibble
format. In this post, we’ll use the tibble
format for unnesting. We set auto.assign = FALSE
to return the object to the variable rather than automatically creating the variable in the Global Environment.
Let’s see what happens when we try get_stock_prices()
on MasterCard using ticker, “MA”.
A few key actions that happened:
-
Instead of creating an object MA
in the Global Environment, an object was created as a local variable that we immediately output to the screen. This is important when using map()
, which needs to return a local object rather than auto assign an object in the Global Environment.
-
The object returned has a consistent column name structure. This is needed if we plan to unnest()
the object. More on this later.
-
We specified return_format = "tibble"
, which can be nested (tidyr::nest()
) and unnested (tidyr::unnest()
).
Get Log Returns Function
Next, we need a function to get log returns. get_log_returns()
is a wrapper for quantmod::periodReturns()
that takes a stock prices in xts
or tibble
format, a return format, and any other periodReturns()
arguments and returns the log returns in either xts
or tibble
format.
We’ll test the function to see what it returns:
Same as before, the return format is a tibble
, which is exactly what we need for mapping into nested data frames next.
Mapping the Functions
Now the fun part! We are going to use purrr::map()
to build a nested data frame. If you’re not familiar with nested data frames, check out tidyr::nest()
and R For Data Science, Chapter 25: Many Models. Data frames can store lists in columns, which means we can store data frames in the cells of data frames (aka nesting). The map()
function allows us to map functions to the nested list-columns. The result is the ability to extend complex calculations to loop through entire data frames.
It’s OK if the script below doesn’t make sense. We’ll get a feel for what happened by exploring the results.
Warning: The following script stores the stock prices and log returns for the entire list of 504 stocks in the S&P500. It takes my laptop a minute or two to run the script.
Let’s take a look at sp_500
. Notice that the stock.prices
and log.returns
columns include tibble
s within the data frame.
Let’s peek into one of the nested tibbles:
It’s the stock prices for the first stock, 3M: all 2470 observations.
Key point on nested tibbles: The sub-data (e.g. stock prices, log returns) are all stored as data frames within the upper-level data frame (e.g. sp_500
)! This means we can sort, filter, and perform calculations on all the sub-data at once with the nested tibbles, and the sub-data for each observation goes along for the ride.
The final output included mean.log.returns
, sd.log.returns
, and n.trade.days
, which were mapped using the map_dbl()
function. The map()
function always returns a list
, but sometimes we want a value. For scalars we can return the value using map_dbl()
, map_chr()
, …, depending on the return type. Let’s check out the final output.
To recap:
Essentially, we collected the stock prices for every S&P500 stock ticker using get_stock_prices()
, then passed each set of stock prices to get_log_returns()
, then sent the log returns for each stock to the mean()
and sd()
functions to return the values we want to compare on. Pretty cool. ;)
You may also see that I added a variable, n.trade.days
, for the number of trade days (observations) for the stock. We want stocks with a large number of observations because this gives us more samples (and a longer time span) to trend the stock, thus increasing our confidence in the statistics. This will come in handy when we perform our visualization next.
Visualizing the Results with Plotly
Back to the main purpose: we want to compare all of the S&P500 stocks based on risk versus reward. As stated before, we consider:
- Rewarding stocks have a higher mean of the log returns (measure of growth).
- Riskier stocks have a higher standard deviation of the log returns (measure of volatility).
What we need is a way to visualize a scatter plot of all of the stocks, showing the stock name and it’s related information in one concise plot. The plotly
library is perfect for this task. Using hover fields and color/size characteristics we can quickly get the essence of the data to screen the stocks.
The script below uses plot_ly()
to create a scatter plot showing the standard deviation of log-returns (x-axis), the mean log-returns (y-axis), and the number of trade days collected (size and color aesthetics). Using the text
argument, we can add information to the plot upon hover. Data exploration can be performed using the hover, zoom, and pan fetures. Try it out and develop your intuition about which stocks are the best investments.
From the plot we can see that a number of stocks have a unique combination of high mean and low standard deviation log returns. We can isolate them:
Upon inspection PCLN sports a unique combination of low volatility, high average returns, and a lot of samples. Here’s the quantmod
chart:
Questions About the Analysis
-
What can you say about the relationship between the standard deviation and mean of the log returns? Does there appear to be one? As volatility (standard deviation of returns) increases, what tends to happen to growth (mean increase of returns)?
-
The stock with the ninth highest mean is HPE. Would you use it in as an investment? Why or why not?
-
How can the investment analysis be improved to form a well-rounded strategy? (Hints: What about qualitative analysis in addition to quantitative analysis? What about dividends? What about back testing for performance?)
Bonus: Computing Correlations
Selecting a portfolio of stocks can be challenging because on the one hand you want to select those with the best performance, but on the other hand you want to diversify your investment so you don’t have all your eggs in one basket. The quantitative analysis done previously presented a method to screen stocks with high potential. To balance this, a correlation assessment can be performed to visualize which high potential stocks are least correlated.
Let’s first cut the list down to the top 30 stocks with more than four years of samples (approx 1000 trading days). We’ll add a rank
column using the min_rank()
function on mean.log.returns
, and filter the data on the top 30 ranked means.
Next, we need to unnest()
the tibble: unnest()
effectively ungroups the nested tibbles within the top level of the tibble (You’ll see what I mean below). We only need to keep the ticker.symbol
and the unnested log.returns
data.
The unnested tibble now contains the daily log returns for the 30 stocks. Notice that there are now 64,269 rows.
The last data manipulation task is to get the unnested tibble into the format for correlation analysis. We need to tidyr::spread()
the ticker.symbols
to get them into the columns. Because not all the stocks have data for the full range of dates, we need to use na.omit()
to remove rows with NA
values.
Finally, we are ready to get and visualize correlations. We’ll use the corrplot
package for visualizing the correlations. We remove the Date
column so only values remain. Then, we send to the cor()
function, which calculates the correlations.
We send the correlations to corrplot()
to visualize the correlations. I ordered using hclust
, which groups the stocks by similarity. The addrect
argument adds boxes around the highly correlated groups, and by inspecting the circle colors I settled on 11 boxes.
The key is to choose stocks that have low correlation. For example, you might select the combination of AMZN and FB because their returns have low correlations. However, you might select only one from APPL, AVGO and SWKS because the stocks are highly correlated. The high correlations make sense because Avago (Broadcom) and Skyworks are semiconductor manufacturers that supply Apple iPhones.
Download the .R File
The .R file file is available on GitHub. The .R
file has everything needed to generate the plotly
and corrplot
visualizations. Depending on your PC processing power, it should take just a minute or two to run.
Conclusions
The quantitative analysis is a powerful tool. Thanks to purrr
and other tidyverse
packages the analysis can easily be extended to many stocks to compare and screen potential investments. However, before we jump into making investment decisions, we need to recognize the strengths and weaknesses of the analysis:
-
Computing quantitative statistics (mean and standard deviation) of stock returns should be a go-to resource for evaluating stocks. It provides useful performance metrics that quantitatively characterize risk and reward.
-
Correlation analysis identifies the extent to which assets are correlated. When building a portfolio, the general rule is to select assets with a balance of high performance (i.e. quantitative statistics) and low correlation.
-
The analysis discussed herein is purely quantitative, taking into account the historical performance only. Selecting investments on statistics alone is never a good idea. Evaluation of the fundamentals (e.g. asset valuation, EPS growth, future industry prospects, level of competition, industry diversification, etc) should be investigated as a compliment to the quantitative analysis to form a well-rounded investment strategy.
Recap
We covered a lot of ground in this post: starting from a basic quantitative stock analysis, and ending on a full fleged S&P500 stock screening workflow! If you’ve read and understood the full post, congratulations! This post required an understanding of quantitative stock analysis, R programming, and many powerful R packages including:
quantmod
: Retrieving stock prices (getSymbols()
) and returns (periodReturns()
), and visualizing stock charts (chartSeries()
)
xts
(extensible timeseries) objects: A key structure in R for timeseries data
rvest
: Web scraping tools in R
tidyverse
: A compilation of the following packages:
tibble
: The data frame format for working with tidy data
purrr
: For mapping functions (map()
) to tibbles (tidy data frames), and scaling the modeling workflow to many models
tidyr
: For manipulating tibble form using spread()
, gather()
, nest()
and unnest()
dplyr
: For manipulating tibble variables (columns) using mutate()
, group_by()
+ summarize()
, select()
, and filter()
ggplot2
: For static visualizations using the grammar of graphics
plotly
: For interactive visualizations
corrplot
: For visualizing correlation plots
Probably the most important concept is that you now should have a basic understanding of a modeling workflow: Developing the analysis for a single observation (e.g. a stock ticker), and then expanding it to many observations (e.g. the entire S&P500 list of stock tickers).
Again, great job for making it this far! Happy investing. :)
Further Reading
-
R Programming for Quantitative Finance: A resource for quantmod
and xts
packages that is good for those looking to get up to speed quickly.
-
R/Finance 2009 Workshop: Working with xts and quantmod: A second resource for quantmod
and xts
packages that is good for those looking to dive into the details.
-
rfortraders.com: A useful website covering the basics of quantitative financial strategy.
-
R For Data Science, Chapter 25: Many Models: Chapter 25 covers the workflow for modeling many models using tidyverse
, purrr
, and modelr
packages. This a very powerful resource that helps you extend a single model to many models. The entire R for Data Science book is free and online.
-
Portfolio Visualizer: This is a free web application for building and modeling portfolios. The Portfolio Visualizer provides online portfolio analysis tools for backtesting, Monte Carlo simulation, tactical asset allocation and optimization, and investment analysis tools for exploring factor regressions, correlations and efficient frontiers.