Quantitative Stock Analysis Tutorial: Screening the Returns for Every S&P500 Stock in Less than 5 Minutes
Quantitative trading strategies are easy to develop in R if you can manage the data workflow. In this post, I analyze every stock in the S&P500 to screen in terms of risk versus reward. I’ll show you how to use
quantmod to collect daily stock prices and calculate log returns,
rvest to web scrape the S&P500 list of stocks from Wikipedia,
purrr to map functions and perform calculations on nested tibbles (
tidyverse data frames), and
plotly to visualize risk versus reward and extract actionable information for use in your trading strategies. At the end, you will have a visualization that compares the entire set of S&P500 stocks, and we’ll screen them to find those with the best future prospects on a quantitative basis. As a bonus we’ll investigate correlations to add diversification to your portfolio. Finally, the code that generates the
plotly and corrplot visualizations is made available on GitHub for future stock screening. Whether you are a veteran trader or a newbie stock enthusiast, you’ll learn a useful workflow for modeling and managing massive data sets using the
tidyverse packages. And, the entire script runs in less than five minutes so you can begin screening stocks quickly.
Here’s a sneak peek at the interactive S&P500 stock screening visualization that compares stocks based on growth (reward), variability (risk), and number of samples (risk). The tool can be used to visualize stocks with good characteristics: High growth (large mean log returns), low variability (low standard deviation), and high number of samples (lots of days traded). Make sure you zoom in on sections of interest and hover over the dots to gain insights.
Here’s a sneak peek at the correlation analysis, which is useful in determining which pairs of stocks to select to minimize risk. These are the top 30 stocks according to high growth and low risk. The key here is to select a portfolio that minimizes the correlations between stocks, thus further minimizing risk. Select stocks with low correlation: small, light blue dots indicate pairs with low correlation (good), while large, dark blue dots indicate high correlation (bad).
Table of Contents
- Stock Analysis: An Individual Stock
- Stock Analysis: Expanded to All S&P500 Stocks
- Bonus: Computing Correlations
- Download the .R File
- Further Reading
- 2016-10-29: An issue was uncovered due to stock splits. A stock split causes a large one-day drop in the unadjusted stock price, which penalizes the stocks that split by increasing the standard deviation of the daily returns. The fix is to use the adjusted stock prices. In addition to the fix, I expanded on the correlation section to include the top 30 stocks with high growth, low variability and high samples. Based on feedback, I also expanded the introduction to better explain the goals of the analysis and to include the key visualizations.
For those following along in R, you’ll need to load the following packages:
If you don’t have these installed, run
install.packages(pkg_names) with the package names as a character vector (
pkg_names <- c("quantmod", "xts", ...)).
Before we extrapolate to a full-blown S&P 500 Analysis, we need to understand how to analyze a single stock. We’ll walk through three steps to quickly get you up to speed on quantitatively analyzing stock returns:
We’ll use the
quantmod package to retrieve and manipulate stock information. There are a lot of details behind
quantmod and extensible timeseries (
xts) objects, much of which is beyond the scope of this post. We’ll skim the surface to get you to a proficient level. If you want to get to expert, a good resource is the University of Washington’s R Programming for Quantitative Finance. We’ll use MasterCard, ticker symbol “MA”, for the example.
First, load the
quantmod package and use the
getSymbols() function to retrieve stock prices. I use the optional
from and to arguments to limit the range of prices.
The getSymbols() function adds a variable to R’s Global Environment with the stock ticker as the variable name. If you are using RStudio, check your Environment tab, and you should see
MA. What is
MA? Informally, it’s the daily stock prices and volumes. We need to understand it more formally. Send
MA to the
class() function using the pipe operator (%>%).
xts Objects for Time Series
It turns out that
MA is an
xts object, which is a format conducive to manipulating timeseries. It plays nicely with and is an extension of the
zoo package. The xts vignette elaborates on the details, and another nice resource is the R/Finance Workshop Presentation: Working with xts and quantmod. Our priority is to understand the structure of the data. Pipe
MA to the
head() function to inspect.
A couple observations:
- The index is a column of dates. This is different from data frames, which generally do not have row names and whose indices are a sequence from 1 to the number of rows.
- The values are prices and volumes. We’ll be working primarily with the adjusted price when graphing, since this removes the effect of stock splits.
quantmod + xts
The quantmod package plays nicely with
xts formatted objects. We can plot using the
quantmod::chartSeries() function. Because of stock splits, I changed to use the adjusted prices with the
Ad() function. For workflow purposes, I use the pipe (
%>%) to get the adjusted prices first, and then send the adjusted prices to the chart function. I do this because it is easier to read and it makes complex actions simple to understand. You can also use
chartSeries(Ad(MA)), but this is more difficult to understand.
The chartSeries() function has a number of options to help visualize the stock performance. A few that I’ve added are:
- subset: Isolated the year to 2015. This takes dates and ranges in xts subset format. Here are some tricks from StackOverflow.
- TA: Added Bollinger Bands, Volume, and Moving Average Convergence Divergence (MACD) plots. These can also be added after the fact with the add___() functions.
- theme: Changed to a white theme.
The quantmod package includes functions to get daily, weekly, monthly, quarterly, and annual returns. We’ll use
dailyReturn() to get an
xts object of daily returns (see
?periodReturn, which is the base function). The default is an arithmetic return. Specify
type = "log" if a logarithmic return is needed. We’ll be dealing with log returns for structural reasons.
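To make the difference between arithmetic and log returns concrete, here is a minimal base R sketch that computes both by hand. The price vector is made up for illustration (it is not MasterCard data); with real prices, the dailyReturn() call with type = "log" plays the same role as the manual diff(log()) step.

```r
# Made-up adjusted closing prices for five trading days (illustrative only)
prices <- c(100, 102, 101, 105, 104)

# Arithmetic return: (P_t - P_{t-1}) / P_{t-1}
arith_returns <- diff(prices) / head(prices, -1)

# Log return: log(P_t / P_{t-1}) = log(P_t) - log(P_{t-1})
log_returns <- diff(log(prices))

# Log returns add up across days: exponentiating their sum gives the
# total growth factor, i.e. last price divided by first price
total_growth <- exp(sum(log_returns))

print(round(cbind(arith_returns, log_returns), 5))
print(total_growth)
```

The additive property is one practical reason to prefer log returns: means and standard deviations of daily values translate directly into multi-day growth.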
Although we just skim the surface of
quantmod in this post, the package has a lot to offer for stock analysis (visit www.quantmod.com for more information). We’ll primarily use it to retrieve stock prices (
getSymbols()) and calculate log returns (
periodReturn(period = 'daily')). Just keep in mind you can do a lot more with it.
The approach I use is similar to rfortraders.com’s Lecture 6 - Stochastic Processes and Monte Carlo. The fundamental idea is that stock returns, measured daily, weekly, monthly, …, are approximately normally distributed and uncorrelated. As a result, we can theoretically model the behavior of stock prices within a confidence interval based on the stock’s prior returns.
Returns are Normally Distributed
Applying the log-transformation, we can visually see that the daily returns are approximately normally distributed:
We can examine the distribution of log returns by applying the quantile() function.
The median daily log return is 0.001, and 95% of the daily log returns fall between -0.0447 and 0.0475.
The mean and standard deviation of the daily log returns are:
Finally, to get the actual returns, we need to re-transform the log returns by exponentiating the mean of the log returns with exp().
On average, the price rises 0.0976% per day relative to the previous day’s price. Doesn’t sound like much, but it compounds daily at an exponential rate.
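The summary statistics described above can be sketched in a few lines of base R. The log returns below are simulated stand-ins with an assumed mean and standard deviation, so the numbers will not match the MasterCard figures quoted in the text; the variable names are illustrative.

```r
set.seed(123)

# Stand-in for real daily log returns (assumed mean and sd, not MasterCard data)
log_returns <- rnorm(2500, mean = 0.001, sd = 0.02)

# Median and the central 95% range of the daily log returns
dist_log_returns <- quantile(log_returns, probs = c(0.025, 0.5, 0.975))

# The two drivers used throughout the analysis
mean_log_returns <- mean(log_returns)
sd_log_returns <- sd(log_returns)

# Re-transform: a mean log return m corresponds to a daily growth factor exp(m)
mean_daily_return <- exp(mean_log_returns) - 1

# The same mean compounded over roughly 252 trading days
annualized_return <- exp(252 * mean_log_returns) - 1

print(dist_log_returns)
print(c(mean = mean_log_returns, sd = sd_log_returns))
print(c(daily = mean_daily_return, annualized = annualized_return))
```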
Since we have a mean and a standard deviation, we can simulate using a process called a random walk. We’ll simulate prices for 1000 trading days. Keep in mind that every year has approximately 252 trading days, so this simulation spans just under four years.
The script below does the following: we start by specifying the number of trading days to simulate (
N), the mean (
mu), and the standard deviation (
sigma). The script then simulates prices by progressively calculating a new price: it draws a random log return from the normal distribution characterized by
mu and sigma, exponentiates it to get a growth factor, and multiplies the previous day’s price by that factor. We then visualize the simulation.
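A minimal base R sketch of such a random walk follows. The values of mu, sigma, and the starting price are assumptions chosen for illustration; in practice they would come from the stock's historical log returns and its latest price. The loop described above is replaced here by an equivalent vectorized cumprod().

```r
set.seed(386)

N <- 1000          # number of trading days (steps) to simulate
mu <- 0.001        # assumed mean of the daily log returns
sigma <- 0.02      # assumed standard deviation of the daily log returns
price_init <- 100  # assumed starting price

# Draw one log return per day, then compound: P_t = P_{t-1} * exp(r_t)
log_returns <- rnorm(N, mean = mu, sd = sigma)
price <- price_init * cumprod(exp(log_returns))

# Visualize the single simulated path
plot(price, type = "l", xlab = "Trading day", ylab = "Simulated price",
     main = "Single random walk")
```

Changing the seed produces a different path each time, which is exactly why a single walk says little on its own.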
Bummer! It looks like we are going to lose money with this investment. Can we trust this simulation?
No. The single random walk is just one of the many probabilistic outcomes. We need to simulate many iterations to build confidence intervals.
Monte Carlo Simulation
Monte-Carlo simulation does just that: we repeatedly perform the random walk simulation process hundreds or thousands of times. We’ll perform 250 Monte Carlo simulations (
M = 250), each spanning one year of trading days (
N = 252).
We can get confidence intervals for the stock price at the end of the simulation using the quantile() function.
The 95% confidence interval is between $69.37 and $227.69, with a median (“most likely”) estimated price of $129.99.
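The Monte Carlo step can be sketched in base R by simulating all paths at once in a matrix. As before, mu, sigma, and the starting price are assumed values, so the resulting interval will differ from the figures quoted above.

```r
set.seed(123)

N <- 252           # trading days per simulated path (about one year)
M <- 250           # number of Monte Carlo simulations
mu <- 0.001        # assumed mean of the daily log returns
sigma <- 0.02      # assumed standard deviation of the daily log returns
price_init <- 100  # assumed starting price

# N x M matrix of random log returns: one column per simulated path
log_returns <- matrix(rnorm(N * M, mean = mu, sd = sigma), nrow = N, ncol = M)

# Compound each column into a price path
price_paths <- price_init * apply(exp(log_returns), 2, cumprod)

# Distribution of the simulated prices on the final day
end_prices <- price_paths[N, ]
end_dist <- quantile(end_prices, probs = c(0.025, 0.5, 0.975))
print(end_dist)

# Plot all paths to see the spread of outcomes
matplot(price_paths, type = "l", lty = 1, col = "grey",
        xlab = "Trading day", ylab = "Simulated price")
```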
Is this realistic? It looks close considering the simulated growth rate is very close to the historical compound annual growth rate (CAGR):
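For reference, CAGR is (end price / start price)^(1 / years) - 1. The sketch below uses a hypothetical helper named cagr() and made-up prices, and also shows the growth rate implied by a mean daily log return, which is the quantity the simulation effectively uses.

```r
# Hypothetical helper: compound annual growth rate between two prices
cagr <- function(price_start, price_end, n_years) {
  (price_end / price_start)^(1 / n_years) - 1
}

# Made-up example: a price that grows from 100 to 121 over two years
historical_cagr <- cagr(price_start = 100, price_end = 121, n_years = 2)

# Growth rate implied by an assumed mean daily log return over 252 trading days
mu <- 0.001
simulated_growth <- exp(252 * mu) - 1

print(c(historical_cagr = historical_cagr, simulated_growth = simulated_growth))
```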
Most importantly, the simulation gives us a baseline to compare stocks using the key drivers. The drivers of the simulation are the mean and the standard deviation of the log returns: The mean characterizes the average growth or return while the standard deviation characterizes the volatility or risk.
We now know how to analyze an individual stock using log returns, but wouldn’t it be better if we could apply this analysis to many stocks so we can review and screen them based on risk versus reward? Uh… yes! This is easy to do. Let’s get going!
We’ll expand to a full blown S&P500 Analysis following the modeling process workflow from R For Data Science, Chapter 25: Many Models. Once we have a modeling process down for an individual stock, we can map it to many stocks using a modeling workflow. Our workflow involves four steps:
- Get the list of S&P500 stocks.
- Create functions that retrieve needed data from a single stock. We’ll need to get stock prices, and compute log-returns.
- map() the functions across all stocks.
- Visualize the results.
We’ll use the
rvest package to collect S&P500 stocks from Wikipedia. We use
read_html() to collect the HTML from Wikipedia. Then, we parse the HTML to a data frame using
html_table(). At the end, I return the data as a
tibble, and clean up the column names.
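The scraping steps can be tried offline with a small sketch. The inline HTML below is a made-up stand-in for the Wikipedia page so the example does not depend on the network or on the page's current layout; the tickers, the header names, and the column clean-up rule are assumptions for illustration.

```r
library(rvest)

# Inline HTML standing in for the Wikipedia page so the sketch runs offline.
# For the live page, pass the URL to read_html() instead, e.g.
# "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
# (the live table's headers may differ from this toy table).
toy_page_html <- "
<html><body><table>
  <tr><th>Ticker symbol</th><th>Security</th><th>GICS Sector</th><th>GICS Sub Industry</th></tr>
  <tr><td>AAA</td><td>Alpha Corp</td><td>Industrials</td><td>Machinery</td></tr>
  <tr><td>BBB</td><td>Beta Inc</td><td>Financials</td><td>Banks</td></tr>
</table></body></html>"

# Parse the HTML and convert the first table on the page to a data frame
page <- read_html(toy_page_html)
sp_toy <- html_table(page)[[1]]

# Clean up the column names: lower case, with dots instead of spaces
names(sp_toy) <- tolower(gsub("[^A-Za-z]+", ".", names(sp_toy)))

print(sp_toy)
```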
The sp_500 data frame has four columns:
- ticker.symbol: The stock ticker
- security: The company name
- gics.sector: The primary sector according to the GICS
- gics.sub.industry: The industry sub sector according to the GICS
It’s a good idea to inspect the categorical data before we start the analysis. The
lapply() function below loops through each column of the data set counting the length of the unique items. The result is a count of distinct values for each column.
We have a minor problem:
Notice that we have 505 ticker symbols but only 504 securities. There must be a duplicate entry in securities. Let’s inspect any securities that are duplicated. We’ll use the
filter() function to find any
security that appears more than once.
The culprit is Under Armour. Let’s check out why.
Under Armour has two ticker symbols: UA and UA.C. We’ll remove the UA.C from the data set.
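The inspection steps above (count distinct values, find duplicated securities, drop the extra ticker) can be sketched on a toy data frame. The companies and tickers are made up, and the duplicate check uses base R rather than filter() so the example runs without extra packages.

```r
# Toy stand-in for sp_500 with one company listed under two tickers (made up)
sp_toy <- data.frame(
  ticker.symbol = c("AAA", "BBB", "BBB.C", "CCC"),
  security      = c("Alpha Corp", "Beta Inc", "Beta Inc", "Gamma Ltd"),
  gics.sector   = c("Industrials", "Consumer Discretionary",
                    "Consumer Discretionary", "Financials"),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column
distinct_counts <- lapply(sp_toy, function(x) length(unique(x)))
print(distinct_counts)

# Show every row whose security name appears more than once
dup_names <- sp_toy$security[duplicated(sp_toy$security)]
duplicates <- sp_toy[sp_toy$security %in% dup_names, ]
print(duplicates)

# Drop the redundant ticker so each security appears once
sp_clean <- sp_toy[sp_toy$ticker.symbol != "BBB.C", ]
print(sp_clean)
```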
Back to inspecting the S&P500 data:
We know from the
lapply() function that there are 504 securities that are in 126 sub industries and in 10 sectors. However, we might want to understand the distribution of securities by sector. I again use
summarise() to get counts. In the visualization, I use
forcats::fct_reorder() to organize the
gics.sector by greatest frequency. The
forcats package makes organizing
factor categories a breeze in R, which previously was a huge pain! Check out R for Data Science, Chapter 15: Factors for help working with factors.
Separating the distribution of securities into industry sectors shows us our options if we were to select stocks using a diversification strategy. While diversification and portfolio optimization are outside the scope of this tutorial, it’s worth mentioning that a risk mitigation technique is to select a basket (or portfolio) of stocks that have low return correlation. Typically, selecting from different industries and sectors helps to reduce this correlation and diversify the portfolio. While I don’t go into portfolio optimization, as a bonus I’ll show how it’s possible to visualize the correlations of the best performing stocks.
Technical Details Ahead:
We are going to dive into some structural aspects of
tibble objects. Feel free to skip the details and copy the functions in this section. You will still be able to follow along.
We have a tidy data frame (
tibble) of S&P500 securities stored in the variable
sp_500. We want to perform some functions on the data frame such as:
- Getting the stock prices for each security using getSymbols()
- Getting the log returns of the stock prices
We need to work through an issue between xts and
tibble objects before we jump into mapping the functions. We are going to create nested data frames, which basically means that the data is stored as a list in the cells of
sp_500 (Don’t worry if this doesn’t make sense yet). A structural issue occurs with nesting
xts objects within
tibble data frames. The
xts objects won’t unnest. To get around this structural issue, we need to convert these objects to tibble format.
Get Stock Prices Function
First, let’s create a function to return the stock prices.
get_stock_prices() is a wrapper for
quantmod::getSymbols() that takes a ticker, a return format, and any other
getSymbols() arguments and returns the prices in either xts or
tibble format. In this post, we’ll use the
tibble format for unnesting. We set
auto.assign = FALSE to return the object to the variable rather than automatically creating the variable in the Global Environment.
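A wrapper along these lines might look like the sketch below. This is one possible implementation under stated assumptions, not necessarily the exact function from the downloadable script: it assumes the default data source returns six columns in the order Open, High, Low, Close, Volume, Adjusted, and the column names chosen here are illustrative.

```r
get_stock_prices <- function(ticker, return_format = "tibble", ...) {
  # auto.assign = FALSE returns the xts object instead of creating a
  # variable named after the ticker in the Global Environment
  prices_xts <- quantmod::getSymbols(Symbols = ticker, auto.assign = FALSE, ...)

  # Assumption: six columns in this order. Renaming gives every stock the
  # same column names, which unnesting later depends on.
  colnames(prices_xts) <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")

  if (return_format == "tibble") {
    # Move the date index into a regular column so the result can be unnested
    prices <- data.frame(Date = zoo::index(prices_xts), zoo::coredata(prices_xts))
    return(tibble::as_tibble(prices))
  }

  prices_xts
}

# Example call (needs an internet connection; the dates are placeholders):
# get_stock_prices("MA", from = "2007-01-01", to = "2016-10-23")
```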
Let’s see what happens when we try
get_stock_prices() on MasterCard using ticker, “MA”.
A few key actions that happened:
- Instead of creating an object MA in the Global Environment, an object was created as a local variable that we immediately output to the screen. This is important when using map(), which needs to return a local object rather than auto assign an object in the Global Environment.
- The object returned has a consistent column name structure. This is needed if we plan to unnest() the object. More on this later.
- The object was returned as a tibble because we specified return_format = "tibble". A tibble can be nested (tidyr::nest()) and unnested (tidyr::unnest()).
Get Log Returns Function
Next, we need a function to get log returns.
get_log_returns() is a wrapper for
quantmod::periodReturn() that takes stock prices in
tibble format, a return format, and any other
periodReturn() arguments and returns the log returns in either xts or tibble format.
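A sketch of such a wrapper is shown below. It is an illustration under assumptions rather than the exact function from the downloadable script: it expects a data frame with Date and Adjusted columns (names assumed here), computes returns on the adjusted price only, and names the output column Log.Returns.

```r
get_log_returns <- function(prices, return_format = "tibble", period = "daily", ...) {
  # Rebuild an xts series from the adjusted prices (assumed column names)
  prices_xts <- xts::xts(prices$Adjusted, order.by = prices$Date)

  # type = "log" requests logarithmic rather than arithmetic returns
  returns_xts <- quantmod::periodReturn(prices_xts, period = period,
                                        type = "log", ...)

  if (return_format == "tibble") {
    # Move the date index into a regular column so the result can be unnested
    returns <- data.frame(
      Date = zoo::index(returns_xts),
      Log.Returns = as.numeric(zoo::coredata(returns_xts))
    )
    return(tibble::as_tibble(returns))
  }

  returns_xts
}

# Example call, given a prices tibble with Date and Adjusted columns:
# get_log_returns(prices)
```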
We’ll test the function to see what it returns:
Same as before, the return format is a
tibble, which is exactly what we need for mapping into nested data frames next.
Now the fun part! We are going to use
purrr::map() to build a nested data frame. If you’re not familiar with nested data frames, check out
tidyr::nest() and R For Data Science, Chapter 25: Many Models. Data frames can store lists in columns, which means we can store data frames in the cells of data frames (aka nesting). The
map() function allows us to map functions to the nested list-columns. The result is the ability to extend complex calculations to loop through entire data frames.
It’s OK if the script below doesn’t make sense. We’ll get a feel for what happened by exploring the results.
Warning: The following script stores the stock prices and log returns for the entire list of 504 stocks in the S&P500. It takes my laptop a minute or two to run the script.
Let’s take a look at
sp_500. Notice that the stock prices and
log.returns columns include
tibbles within the data frame.
Let’s peek into one of the nested tibbles:
It’s the stock prices for the first stock, 3M: all 2470 observations.
Key point on nested tibbles: The sub-data (e.g. stock prices, log returns) are all stored as data frames within the upper-level data frame (e.g.
sp_500)! This means we can sort, filter, and perform calculations on all the sub-data at once with the nested tibbles, and the sub-data for each observation goes along for the ride.
The final output included the mean and standard deviation of the log returns, along with
n.trade.days, which were mapped using the
map_dbl() function. The
map() function always returns a
list, but sometimes we want a value. For scalars we can return the value using
map_chr(), …, depending on the return type. Let’s check out the final output.
Essentially, we collected the stock prices for every S&P500 stock ticker using
get_stock_prices(), then passed each set of stock prices to
get_log_returns(), then sent the log returns for each stock to the mean() and
sd() functions to return the values we want to compare on. Pretty cool. ;)
You may also see that I added a variable,
n.trade.days, for the number of trade days (observations) for the stock. We want stocks with a large number of observations because this gives us more samples (and a longer time span) to trend the stock, thus increasing our confidence in the statistics. This will come in handy when we perform our visualization next.
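The nesting pattern described above can be tried on made-up data without downloading anything. In this sketch a hypothetical helper, fake_log_returns(), stands in for the price download and log return steps, and the tickers are invented; only the map() and map_dbl() mechanics mirror the workflow.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(1)

# Hypothetical stand-in for downloading prices and computing log returns
fake_log_returns <- function(n) {
  tibble(Log.Returns = rnorm(n, mean = 0.0005, sd = 0.02))
}

stocks <- tibble(
  ticker.symbol = c("AAA", "BBB", "CCC"),   # made-up tickers
  n.days        = c(300, 500, 1200)
) %>%
  mutate(
    # map() returns a list, so each cell of log.returns holds a whole tibble
    log.returns      = map(n.days, fake_log_returns),
    # map_dbl() returns one number per stock
    mean.log.returns = map_dbl(log.returns, ~ mean(.x$Log.Returns)),
    sd.log.returns   = map_dbl(log.returns, ~ sd(.x$Log.Returns)),
    n.trade.days     = map_dbl(log.returns, nrow)
  )

print(stocks)
```

Because the summary columns sit next to the nested data, sorting or filtering the outer tibble keeps each stock's log returns attached to its row.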
Back to the main purpose: we want to compare all of the S&P500 stocks based on risk versus reward. As stated before, we consider:
- Rewarding stocks have a higher mean of the log returns (measure of growth).
- Riskier stocks have a higher standard deviation of the log returns (measure of volatility).
What we need is a way to visualize a scatter plot of all of the stocks, showing the stock name and its related information in one concise plot. The
plotly library is perfect for this task. Using hover fields and color/size characteristics we can quickly get the essence of the data to screen the stocks.
The script below uses
plot_ly() to create a scatter plot showing the standard deviation of log-returns (x-axis), the mean log-returns (y-axis), and the number of trade days collected (size and color aesthetics). Using the
text argument, we can add information to the plot upon hover. Data exploration can be performed using the hover, zoom, and pan features. Try it out and develop your intuition about which stocks are the best investments.
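A pared-down version of such a plot is sketched here on a handful of made-up stocks. The data values, hover text, and axis titles are assumptions for illustration; the mapping of standard deviation to x, mean to y, and trade days to size and color follows the description above.

```r
library(plotly)

# Made-up summary statistics for five hypothetical tickers
stats <- data.frame(
  ticker.symbol    = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  mean.log.returns = c(0.0012, 0.0009, 0.0015, 0.0004, 0.0011),
  sd.log.returns   = c(0.018, 0.025, 0.040, 0.012, 0.022),
  n.trade.days     = c(2400, 1500, 600, 2470, 1100),
  stringsAsFactors = FALSE
)

p <- plot_ly(
  data  = stats,
  type  = "scatter",
  mode  = "markers",
  x     = ~sd.log.returns,     # risk
  y     = ~mean.log.returns,   # reward
  color = ~n.trade.days,       # more samples, more confidence
  size  = ~n.trade.days,
  text  = ~paste0("Ticker: ", ticker.symbol, "<br>Trade days: ", n.trade.days)
) %>%
  layout(
    title = "Risk versus reward (toy data)",
    xaxis = list(title = "Risk: standard deviation of log returns"),
    yaxis = list(title = "Reward: mean of log returns")
  )

# Print p in an interactive session to open the plot
```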
From the plot we can see that a number of stocks have a unique combination of high mean and low standard deviation log returns. We can isolate them:
Upon inspection, PCLN sports a unique combination of low volatility, high average returns, and a lot of samples.
Questions About the Analysis
What can you say about the relationship between the standard deviation and mean of the log returns? Does there appear to be one? As volatility (standard deviation of returns) increases, what tends to happen to growth (mean increase of returns)?
The stock with the ninth highest mean is HPE. Would you use it as an investment? Why or why not?
How can the investment analysis be improved to form a well-rounded strategy? (Hints: What about qualitative analysis in addition to quantitative analysis? What about dividends? What about back testing for performance?)
Selecting a portfolio of stocks can be challenging because on the one hand you want to select those with the best performance, but on the other hand you want to diversify your investment so you don’t have all your eggs in one basket. The quantitative analysis done previously presented a method to screen stocks with high potential. To balance this, a correlation assessment can be performed to visualize which high potential stocks are least correlated.
Let’s first cut the list down to the top 30 stocks with more than four years of samples (approx 1000 trading days). We’ll add a
rank column using the
min_rank() function on
mean.log.returns, and filter the data on the top 30 ranked means.
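The filter-then-rank step can be sketched with dplyr on made-up data. Here the cut-off is the top 3 rather than the top 30 so the toy table stays small; the tickers and values are invented.

```r
library(dplyr)

# Made-up summary statistics for five hypothetical tickers
stats <- data.frame(
  ticker.symbol    = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  mean.log.returns = c(0.0012, 0.0009, 0.0015, 0.0004, 0.0011),
  n.trade.days     = c(2400, 1500, 600, 2470, 1100),
  stringsAsFactors = FALSE
)

top_n_stocks <- 3  # the post uses 30; 3 keeps the toy example small

screened <- stats %>%
  # Keep only stocks with more than roughly four years of samples
  filter(n.trade.days > 1000) %>%
  # Rank 1 = highest mean log return
  mutate(rank = min_rank(desc(mean.log.returns))) %>%
  filter(rank <= top_n_stocks) %>%
  arrange(rank)

print(screened)
```

Note that CCC has the highest mean in the toy data but is dropped before ranking because it has too few trading days.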
Next, we need to
unnest() the tibble:
unnest() effectively ungroups the nested tibbles within the top level of the tibble (You’ll see what I mean below). We only need to keep the
ticker.symbol and the unnested log.returns data.
The unnested tibble now contains the daily log returns for the 30 stocks. Notice that there are now 64,269 rows.
The last data manipulation task is to get the unnested tibble into the format for correlation analysis. We need to reshape the data from long to wide so that the ticker.symbol values become the columns. Because not all the stocks have data for the full range of dates, we need to use
na.omit() to remove rows with NA values.
Finally, we are ready to get and visualize correlations. We’ll use the
corrplot package for visualizing the correlations. We remove the
Date column so only values remain. Then, we send to the
cor() function, which calculates the correlations.
We send the correlations to
corrplot() to visualize the correlations. I ordered using
hclust, which groups the stocks by similarity. The
addrect argument adds boxes around the highly correlated groups, and by inspecting the circle colors I settled on 11 boxes.
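The reshape, na.omit(), and cor() steps can be sketched in base R on simulated returns. Everything here is made up: two tickers share a common component so they come out highly correlated, and a third starts trading later so it produces NA rows. The corrplot call is guarded so the sketch still runs if the package is not installed, and addrect = 2 is chosen only because the toy data has three stocks.

```r
set.seed(42)

dates <- seq(as.Date("2016-01-04"), by = "day", length.out = 60)
common <- rnorm(60, mean = 0, sd = 0.01)  # shared driver for AAA and BBB

# Long format: one row per date and ticker; CCC has a shorter history
long <- rbind(
  data.frame(Date = dates, ticker.symbol = "AAA",
             Log.Returns = common + rnorm(60, 0, 0.002)),
  data.frame(Date = dates, ticker.symbol = "BBB",
             Log.Returns = common + rnorm(60, 0, 0.002)),
  data.frame(Date = dates[11:60], ticker.symbol = "CCC",
             Log.Returns = rnorm(50, 0, 0.01))
)

# Long to wide: one column per ticker, aligned on the full set of dates
all_dates <- sort(unique(long$Date))
wide <- sapply(split(long, long$ticker.symbol),
               function(d) d$Log.Returns[match(all_dates, d$Date)])

# Drop dates where any stock is missing, then compute correlations
wide_complete <- na.omit(wide)
corr <- cor(wide_complete)
print(round(corr, 2))

# Visualize, grouping similar stocks with hierarchical clustering
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr, order = "hclust", addrect = 2)
}
```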
The key is to choose stocks that have low correlation. For example, you might select the combination of AMZN and FB because their returns have low correlations. However, you might select only one from AAPL, AVGO and SWKS because the stocks are highly correlated. The high correlations make sense because Avago (Broadcom) and Skyworks are semiconductor manufacturers that supply components for Apple iPhones.
The .R file is available on GitHub. The
.R file has everything needed to generate the plotly and
corrplot visualizations. Depending on your PC processing power, it should take just a minute or two to run.
The quantitative analysis is a powerful tool. Thanks to
purrr and other
tidyverse packages the analysis can easily be extended to many stocks to compare and screen potential investments. However, before we jump into making investment decisions, we need to recognize the strengths and weaknesses of the analysis:
Computing quantitative statistics (mean and standard deviation) of stock returns should be a go-to resource for evaluating stocks. It provides useful performance metrics that quantitatively characterize risk and reward.
Correlation analysis identifies the extent to which assets are correlated. When building a portfolio, the general rule is to select assets with a balance of high performance (i.e. quantitative statistics) and low correlation.
The analysis discussed herein is purely quantitative, taking into account the historical performance only. Selecting investments on statistics alone is never a good idea. Evaluation of the fundamentals (e.g. asset valuation, EPS growth, future industry prospects, level of competition, industry diversification, etc.) should be investigated as a complement to the quantitative analysis to form a well-rounded investment strategy.
We covered a lot of ground in this post: starting from a basic quantitative stock analysis, and ending on a full-fledged S&P500 stock screening workflow! If you’ve read and understood the full post, congratulations! This post required an understanding of quantitative stock analysis, R programming, and many powerful R packages including:
- quantmod: Retrieving stock prices (getSymbols()) and returns (periodReturn()), and visualizing stock charts (chartSeries())
- xts (extensible timeseries) objects: A key structure in R for timeseries data
- rvest: Web scraping tools in R
- tidyverse: A compilation of the following packages:
  - tibble: The data frame format for working with tidy data
  - purrr: For mapping functions (map()) to tibbles (tidy data frames), and scaling the modeling workflow to many models
  - tidyr: For manipulating tibble form using functions such as nest() and unnest()
  - dplyr: For manipulating tibble variables (columns) using functions such as filter() and summarise()
  - ggplot2: For static visualizations using the grammar of graphics
- plotly: For interactive visualizations
- corrplot: For visualizing correlation plots
Probably the most important concept is that you now should have a basic understanding of a modeling workflow: Developing the analysis for a single observation (e.g. a stock ticker), and then expanding it to many observations (e.g. the entire S&P500 list of stock tickers).
Again, great job for making it this far! Happy investing. :)
R Programming for Quantitative Finance: A resource for the quantmod and
xts packages that is good for those looking to get up to speed quickly.
R/Finance 2009 Workshop: Working with xts and quantmod: A second resource for the quantmod and
xts packages that is good for those looking to dive into the details.
rfortraders.com: A useful website covering the basics of quantitative financial strategy.
R For Data Science, Chapter 25: Many Models: Chapter 25 covers the workflow for modeling many models using the tidyverse and
modelr packages. This is a very powerful resource that helps you extend a single model to many models. The entire R for Data Science book is free and online.
Portfolio Visualizer: This is a free web application for building and modeling portfolios. The Portfolio Visualizer provides online portfolio analysis tools for backtesting, Monte Carlo simulation, tactical asset allocation and optimization, and investment analysis tools for exploring factor regressions, correlations and efficient frontiers.