How To Learn R, Part 1: Learn From A Master Data Scientist's Code
Written by Matt Dancho on March 3, 2018
The R programming language is a powerful tool used in data science for business (DS4B), but R can be unnecessarily challenging to learn. We believe you can learn R quickly by taking an 80/20 approach to learning the most in-demand functions and packages. In this article, we seek to understand which techniques are most critical to a beginner’s success by analyzing a master data scientist’s code base. Half of this article covers the web scraping procedure (using purrr) we used to collect our data (if new to R, you can skip this). The second half covers the insights gained from analyzing a master’s code base. In the next article in our series, we’ll develop a strategic learning plan built on our knowledge of the master. Last, there’s a bonus at the end of the article that shows how you can analyze your own code base using the new fs package. Enjoy.
Analyzing A Leader Through Their Code
It’s no secret that the R Programming Community has a number of leaders. It’s one of the draws that separates R from the rest of the pack! The leaders have earned their leadership position by making an impact through high-end data science and unselfishly giving back to the community. We can all learn from what they do. The key is dissecting their code bases to understand the tools and techniques they use.
We’ve taken a first step in figuring out why these individuals are successful by analyzing their most frequently used tools. We started with one standout, a true master of data science…
David Robinson (Our Master Data Scientist)
We examined David Robinson, Chief Data Scientist at DataCamp (previously StackOverflow) (aka DRob), analyzing his code base by dissecting the functions and packages he uses regularly!
Variance Explained Blog (Where We Got The Data)
DRob actively maintains an excellent blog called Variance Explained. He has 58 articles currently, most with code. We used the rvest package to collect the code contained in each post. We started with one blog post involving mixture models of baseball statistics. We then extended it to all 58 articles to increase our confidence in what tools he frequently uses.
80/20 Analysis For Learning R (What We Did With The Data)
We made several graphs that tell us interesting things about what R functions and packages DRob regularly uses. We then performed an 80/20 analysis on his code base. We output all of the high-usage functions and packages DRob regularly uses at the end of the article. This will be used in our next article developing a strategy to learn R efficiently.
Our first graph helps us answer which functions DRob routinely uses. Here are his top 20 functions.
Our second graph helps us understand which packages are most frequently used based on the number of times functions within those packages appear in his code.
We noticed a theme that DRob is frequently using packages in the “tidyverse” (e.g. ggplot2). The third chart shows how “tidy” his code is by measuring the percentage of tidyverse functions vs non-tidyverse functions.
And, finally, we listed out the top 88 functions in our 80/20 analysis. These are the functions he used 80% of the time. Check out the 80/20 list of most frequently used functions below!
Bonus: How to Learn R by Analyzing Your Code
As a bonus, we show how you can apply the custom scripts developed in this article to analyze your own code with the new fs package (short for file system). We briefly analyze our code base for the HR 201: Predicting Employee Attrition course that is under development for our Business Science University Virtual Workshop on predicting employee turnover and developing Shiny web applications. The Virtual Workshop is coming soon!
Learning R can be unnecessarily challenging if one focuses on learning everything immediately rather than applying a strategic approach. Our hypothesis is composed of two theories:
- You do not need to learn everything to become effective (refer to 80/20 Rule)
- You can learn a lot by analyzing others’ code bases (refer to Learning from a Master Data Scientist)
Our quest is to prove these theories.
Trying to solve every aspect of a challenge is overwhelming and often not the best use of your time. The 80/20 Rule can often help in these situations. Generally speaking, the 80/20 Rule posits that roughly 20% of activities or tasks produce 80% of the results you want. The challenge of learning the R language falls perfectly into this model. The key is figuring out which areas are the highest value for your time.
To make an optimal strategy, we need data on what tools are being used to perform high-end data science. But, where can we get data? From a data scientist that regularly performs high-end data science publicly via the internet. A true master at the dark art of data science!
Why we chose David Robinson (aka DRob):
I met DRob at the EARL Boston conference last November. He had just given a stellar presentation on StackOverflow trends concluding that R was growing rapidly. What stuck out about his presentation was his ability to analyze a question or problem - This is a key skill in DS4B!
Many data scientists, such as top Kaggle competitors, focus on how to create high-end predictions (which is important), but that is not representative of most real-world situations. It’s rare to see a data scientist effectively using problem solving and critical thinking in combination with data science to learn about the problem. DRob did this very well.
I had checked out his blog before, but never picked up on the fact that he’s really doing data science that can be applied to business (although he’s mainly applying it to other areas such as sports and politics). DRob has written a number of articles on his blog, Variance Explained. Some of my favorites are his articles on statistics applications in baseball. His approaches are novel, leaning on Bayesian A/B testing, hierarchical modeling, mixture models, and many other tools that are very useful in business analysis. Further, he employs problem solving and critical thinking, which are the same skills that are needed in DS4B.
Analysis: Learning R From A Master
If we treat DRob’s code on his Variance Explained blog as a text analysis, we can find the most frequently used packages and functions that cover the majority of code he produces. We can then use an 80/20 approach to determine which functions and packages are most used and therefore most important to master. We’ll split this analysis into two parts:
Part 1: Web Scraping The Variance Explained Blog - Note this is a technical section showing how we retrieved the data. Novice learners may wish to skip this part.
Part 2: Learning From DRob’s Code - The analytical work is done here!
If you wish to follow along, please load the following libraries.
CAUTION: Technical details ahead. Those new to R may wish to skip this section and get right to the results.
This part of the analysis is a how-to in web scraping. We expose the process used to collect the data, which uses the rvest package and a number of custom functions to parse the text. We split this part into two steps:
This is a text analysis. As such, we are going to need to parse some text to extract function names, to determine which packages functions belong to, and to analyze the text with counts and percents. To do so, we create a few helper functions:
- count_to_pct(): Utility function to quickly convert counts to percentages. Works well with dplyr::count().
- parse_function_names(): Takes in text and returns function names.
- find_functions_in_package(): Takes in a library or package name and returns all functions in the library or package.
- find_loaded_packages(): Detects which packages are loaded in the R system.
- map_loaded_package_functions(): Maps the function names to the respective package.
Function: Utility function to quickly convert counts to percentages. Works well with dplyr::count() to retrieve counts by grouping variables. Then use count_to_pct() to quickly get percentages within groups. Exclude groups if overall percentages are desired. Note this example uses the mpg data set from the ggplot2 package.
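The helper described above could look roughly like the following. This is a minimal sketch, not the original implementation: the signature (`data`, `...` for grouping variables, and a `col` argument defaulting to the `n` column that `dplyr::count()` produces) is our assumption.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg data set

# Hypothetical re-implementation: adds a `pct` column computed within the
# groups passed through `...` (no groups means overall percentages).
count_to_pct <- function(data, ..., col = n) {
    data %>%
        group_by(...) %>%
        mutate(pct = {{ col }} / sum({{ col }})) %>%
        ungroup()
}

# Share of each vehicle class within each manufacturer
class_by_manufacturer <- mpg %>%
    count(manufacturer, class) %>%
    count_to_pct(manufacturer)

# Overall share of each vehicle class: leave the grouping variables out
class_overall <- mpg %>%
    count(class) %>%
    count_to_pct()
```

With a grouping variable the percentages sum to 1 within each group; without one they sum to 1 across the whole table.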
Function: Takes in text and returns function names.
Usage: Parses the function name preceding "(". Returns a tibble.
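One simple way to approximate this parser is a regular expression that captures an identifier immediately followed by an opening parenthesis. The sketch below is an assumption about how such a helper might be written, not the article's original code.

```r
library(stringr)
library(tibble)

# Hypothetical parser: returns every identifier that is directly followed by "("
parse_function_names <- function(text) {
    # Collapse multi-line input into a single string
    code <- paste(text, collapse = " ")
    # An R-style name (letters, digits, dots, underscores) followed by "("
    matches <- str_extract_all(code, "[A-Za-z.][A-Za-z0-9._]*(?=\\()")[[1]]
    tibble(function_name = matches)
}

parsed_tbl <- parse_function_names("mpg %>% count(class) %>% mutate(pct = n / sum(n))")
parsed_tbl
```

A regex like this is deliberately crude: it also picks up names inside comments or strings, and it treats `if(` as a function call, so the results are best read as approximate counts.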
Function: Takes in a library or package name and returns all functions in the library or package.
Usage: Takes in a package that is loaded via library(package_name) (the package must be loaded for this to work properly). Returns a tibble of all functions in a package.
Function: Detects which packages are loaded in the R system.
Usage: Returns a tibble of loaded packages. Used in conjunction with map_loaded_package_functions() to build a corpus of functions and associated packages, which is needed to determine which package the target function comes from.
Function: Maps the function names to the respective package.
Usage: Used in conjunction with find_loaded_packages(). Builds a corpus of package and function combinations for every package that is loaded. Returns a tibble of packages and associated functions.
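The three corpus-building helpers fit together as sketched below. The function bodies and the single-argument signatures are our assumptions; the idea is simply to walk the search path and list the functions each attached package exports.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical: all functions exported by one attached package
find_functions_in_package <- function(package) {
    pkg_env <- as.environment(paste0("package:", package))
    object_names <- ls(pkg_env)
    # Keep only objects that are functions (drops data sets and constants)
    is_fn <- map_lgl(
        object_names,
        ~ exists(.x, envir = pkg_env, mode = "function", inherits = FALSE)
    )
    tibble(package = package, function_name = object_names[is_fn])
}

# Hypothetical: packages currently attached to the search path
find_loaded_packages <- function() {
    attached <- grep("^package:", search(), value = TRUE)
    tibble(package = sub("^package:", "", attached))
}

# Hypothetical: one row per (package, function) pair for every attached package
map_loaded_package_functions <- function(loaded_packages_tbl) {
    loaded_packages_tbl$package %>%
        map(find_functions_in_package) %>%
        bind_rows()
}

loaded_functions_tbl <- find_loaded_packages() %>%
    map_loaded_package_functions()
```

Because the corpus only covers attached packages, any function from a package that is not loaded cannot be matched. Also note that the same function name can appear under several packages (for example `filter` in both stats and dplyr), so a lookup by name alone can return more than one row.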
Now that we have the parsing and utility functions set up, we can begin the web scraping process to return the code on the Variance Explained blog. The first action is to set up our process for pulling the functions and packages for a single post. Once that process is defined, we can scale our web scraping to ALL posts!
DRob has a number of posts that include code-throughs (walkthroughs using code). The code is contained within the HTML in nodes named <code>. This makes it very easy to extract using the rvest package.
We can get the functions and packages used using the following code. It takes four steps:
- Get a path for one of the articles. In our case, we chose DRob’s article on analyzing baseball stats with mixture models.
- Create a corpus of all packages that are loaded in our R session. We will use this to determine which package the function that DRob uses comes from.
- Use read_html() to read the HTML from the page. Then collect all nodes containing <code>, and extract the text within those nodes.
- Run the text through our custom parse_function_names() function. This returns parsed function names. We still need the packages, which we can get by matching the function names against the corpus of loaded packages.
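The reading and parsing steps above can be sketched as follows. To keep the example runnable offline, an inline HTML string stands in for the downloaded post (a hypothetical placeholder, not DRob's actual markup), and the function names are pulled out with a simple regex rather than the article's own parser.

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for a post; a real run would pass the post's URL to read_html()
post_html <- read_html('
<html><body>
  <p>Some prose about baseball.</p>
  <pre><code>library(dplyr)
mutate(career, average = H / AB)</code></pre>
  <pre><code>ggplot(career, aes(average)) + geom_histogram()</code></pre>
</body></html>')

# Collect every <code> node and extract its text
code_text <- post_html %>%
    html_nodes("code") %>%
    html_text()

# Identifiers directly followed by "(" are treated as function calls
function_names <- str_extract_all(
    paste(code_text, collapse = " "),
    "[A-Za-z.][A-Za-z0-9._]*(?=\\()"
)[[1]]

function_names
```

Each extracted name can then be looked up in the corpus of loaded packages to attach a package name.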
The final output is all of the functions and most of the package names for the functions that are used in this article! We use the glimpse() function to keep the output minimal.
From the output, we see that DRob used 131 functions in this particular article.
Next, we can do a quick analysis to see what functions DRob used most frequently in this article. We can see that dplyr comes up quite frequently. The most popular function used is mutate(), which is used in this article about 11% of the time.
This is just one sample. We need more data to increase our confidence in which packages and functions are important. Next, let’s scale the web scraping to ALL of DRob’s blog posts on Variance Explained.
Scaling to all blog posts is fairly easy with the purrr package. We need to do two things:
- Web scrape all of the titles, dates, and paths for each of DRob’s articles using rvest.
- Scale the analysis using purrr in combination with a custom function, build_function_names_tbl_from_url_path(), that we will create.
Web scraping the titles, dates, and paths (href) is again easy with the rvest package. On the Variance Explained Posts page, we can again examine the HTML to find that the structure contains:
- Titles are stored in the article a nodes as text
- Dates are stored in the article p.datetime nodes as text
- Paths (href) are stored in the article a nodes as the href attribute
Source: All Posts, Variance Explained
We can extract this information using three web scrapings (one each for titles, dates, and hrefs), and then binding the results together using bind_cols(). We output the first six posts in a table.

| Title | Date | Path |
|---|---|---|
| What digits should you bet on in Super Bowl squares? | 2018-02-04 | http://varianceexplained.org/r/super-bowl-squares/ |
| Exploring handwritten digit classification: a tidy analysis of the MNIST dataset | 2018-01-22 | http://varianceexplained.org/r/digit-eda/ |
| What’s the difference between data science, machine learning, and artificial intelligence? | 2018-01-09 | http://varianceexplained.org/r/ds-ml-ai/ |
| Advice to aspiring data scientists: start a blog | 2017-11-14 | http://varianceexplained.org/r/start-blog/ |
| Announcing “Introduction to the Tidyverse”, my new DataCamp course | 2017-11-09 | http://varianceexplained.org/r/intro-tidyverse/ |
| Don’t teach students the hard way first | 2017-09-21 | http://varianceexplained.org/r/teach-hard-way/ |
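The three scrapings and the bind_cols() step might look like the sketch below. The inline HTML is a hypothetical, simplified stand-in for the Posts page (it only mirrors the `article a` and `article p.datetime` structure described above), and the column names are our choice.

```r
library(rvest)
library(dplyr)
library(tibble)

# Hypothetical, simplified copy of the Posts page structure
posts_html <- read_html('
<html><body>
  <article><a href="/r/post-one/">Post one</a><p class="datetime">Feb 4, 2018</p></article>
  <article><a href="/r/post-two/">Post two</a><p class="datetime">Jan 22, 2018</p></article>
</body></html>')

# One scraping per field
titles_tbl <- tibble(title = posts_html %>% html_nodes("article a") %>% html_text())
dates_tbl  <- tibble(date  = posts_html %>% html_nodes("article p.datetime") %>% html_text())
paths_tbl  <- tibble(path  = posts_html %>% html_nodes("article a") %>% html_attr("href"))

# Bind the three single-column tibbles side by side
posts_tbl <- bind_cols(titles_tbl, dates_tbl, paths_tbl)
posts_tbl
```

bind_cols() matches rows by position, so this only works when every article has exactly one link and one date; a missing date on any post would misalign the columns.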
We now have all 58 of DRob’s posts (title, date, and href) and are now ready to scale! To simplify the process, we’ll create a custom function, build_function_names_tbl_from_url_path(), that combines several rvest operations from the previous section, Web Scrape A Single Blog Post.
We can test the function to see how it takes a path and a tibble of loaded functions, loaded_functions_tbl, and returns the functions and packages.
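A function with this shape could be sketched as below. The body is our assumption of how the single-post steps combine; the temporary HTML file and the two-row corpus are hypothetical stand-ins for a real post URL and the full corpus of loaded packages.

```r
library(rvest)
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical sketch: read a page, extract <code> text, parse names, attach packages
build_function_names_tbl_from_url_path <- function(path, loaded_functions_tbl) {
    code_text <- read_html(path) %>%
        html_nodes("code") %>%
        html_text()

    # Posts without code return NA so they can be filtered out later
    if (length(code_text) == 0) return(NA)

    function_names <- str_extract_all(
        paste(code_text, collapse = " "),
        "[A-Za-z.][A-Za-z0-9._]*(?=\\()"
    )[[1]]

    tibble(function_name = function_names) %>%
        left_join(loaded_functions_tbl, by = "function_name")
}

# Hypothetical inputs: a local file in place of a URL, and a tiny hand-made corpus
post_path <- tempfile(fileext = ".html")
writeLines(
    "<html><body><pre><code>mutate(df, x = mean(y))</code></pre></body></html>",
    post_path
)
loaded_functions_tbl <- tibble(
    package       = c("dplyr", "base"),
    function_name = c("mutate", "mean")
)

single_post_tbl <- build_function_names_tbl_from_url_path(post_path, loaded_functions_tbl)
single_post_tbl
```

Functions that are missing from the corpus come back with an NA package, which is why only most (not all) package names are recovered.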
Next, we can scale this to all posts using the map() functions from the purrr package. Several of the posts have no code and therefore return nested NA values. We filter these out, then expand the function_name column to reveal the nested function names and packages.
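The map, filter, and expand pattern can be illustrated without any network access. In this sketch `scrape_post()` is a hypothetical stand-in for the real per-post scraper, and the list column is named `functions_tbl` for clarity (our choice).

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the scraper: a tibble of names, or NA when a post has no code
scrape_post <- function(path) {
    code_by_path <- list(
        "/r/post-one/" = c("mutate", "filter"),
        "/r/post-two/" = character(0)
    )
    function_names <- code_by_path[[path]]
    if (length(function_names) == 0) return(NA)
    tibble(function_name = function_names)
}

posts_tbl <- tibble(
    title = c("Post one", "Post two"),
    path  = c("/r/post-one/", "/r/post-two/")
)

all_functions_tbl <- posts_tbl %>%
    # One nested result per post
    mutate(functions_tbl = map(path, scrape_post)) %>%
    # is.na() on a list column is TRUE only for the bare NA entries
    filter(!is.na(functions_tbl)) %>%
    # Expand the nested tibbles into regular rows
    unnest(functions_tbl)

all_functions_tbl
```

The result has one row per function call, with the post's title and path repeated alongside it.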
Awesome - We now have all of the function names and most of the packages that DRob used in ALL of his code on Variance Explained! Notice that the sample size has increased to 2314 functions extracted. This is a much larger sample size than before with the single Mixture Models post, which had 131 functions extracted.
The question we need to answer is “What Code Does An R Master Use To Perform Data Science?” We can break this down into separate questions of interest:
- Which Functions Are Most Frequently Used by DRob?
- Which Packages Are Most Frequently Used by DRob?
- How “Tidy” Is DRob’s Code? - Note that “tidy” means using the “tidyverse” packages, which use a common API for data science
- Million Dollar Question: Which Functions and Packages Should We Focus On To Learn R?
We can answer this question by counting our package and function name frequencies, sorting, and taking the top 20, which gives us a subset of the most frequently used functions.
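As a minimal sketch of the count, sort, and top-20 step, assuming the scraped data has one row per function call with `package` and `function_name` columns (the toy rows below are made up for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical extract: one row per function call
functions_tbl <- tibble(
    package       = c("dplyr", "ggplot2", "dplyr", "base", "dplyr", "ggplot2", "dplyr"),
    function_name = c("mutate", "aes", "mutate", "library", "mutate", "aes", "filter")
)

top_functions_tbl <- functions_tbl %>%
    count(package, function_name, sort = TRUE) %>%  # frequency, largest first
    mutate(pct = n / sum(n)) %>%                    # share of all function calls
    head(20)                                        # keep the top 20

top_functions_tbl
```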
We can visualize this data using ggplot2. We chose a lollipop style chart that extends lengthwise for the top 20, which shows off the number and percentage of total for each of the top 20 functions. We can see that dplyr::mutate() is among the functions DRob uses most frequently. In fact, the top four functions comprise 23.3% of his total functions. Unfortunately, aes() can’t be used alone (see below for how it’s used with the ggplot() function). However, with knowledge of library() in combination with dplyr, a learner can understand 18% of DRob’s code!
We can answer this question a number of ways, and we elect to make a time-based analysis to expose underlying trends within packages over time. The idea is that some packages may be used more frequently for specific reasons, and we aim to uncover the true trend of the packages, which is not constant. We’ll use the tibbletime package to help out with the time-based analysis by aggregating (or grouping) the data by six-month intervals. Note that we lump (using fct_lump()) all packages into six categories: the top 5 packages plus an extra category called “Other”. A label is made by pasting “H” with semester(date) to indicate which half of the year the data is aggregated over.
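The lumping and half-year labeling can be sketched as follows. This version swaps tibbletime for a plain dplyr grouping on a year-plus-half label, which is our substitution rather than the article's approach; the data are made up, and we keep the top 2 packages instead of the top 5 to keep the toy example small.

```r
library(dplyr)
library(forcats)
library(lubridate)
library(tibble)

# Hypothetical extract: one row per function call, with the post date
functions_tbl <- tibble(
    date    = as.Date(c("2017-02-10", "2017-03-05", "2017-09-21", "2017-11-14",
                        "2018-01-09", "2018-02-04", "2018-02-04")),
    package = c("dplyr", "dplyr", "ggplot2", "broom", "dplyr", "ggplot2", "tidyr")
)

package_frequency_tbl <- functions_tbl %>%
    mutate(
        # Keep the most frequent packages and lump the rest into "Other"
        package = fct_lump(package, n = 2, other_level = "Other"),
        # semester() returns 1 or 2; paste "H" in front to label the half year
        half    = paste0(year(date), "-H", semester(date))
    ) %>%
    count(half, package) %>%
    group_by(half) %>%
    mutate(pct = n / sum(n)) %>%  # share of functions within each half year
    ungroup()

package_frequency_tbl
```

Computing `pct` within each half year is what makes periods with few posts comparable to periods with many.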
Next, we can visualize with ggplot2. The total function counts in ve_package_frequency_tbl are misleading since in some half years DRob posts less than in others. We can normalize by switching to percentage of total functions by half year.
We switch to a percentage of total functions (the pct column in ve_package_frequency_tbl) to get a better perspective on what trends are happening within posts over time. We see that DRob is trending in the direction of more ggplot2 and using fewer “Unknown” packages, which are packages that I do not have currently loaded on my machine (i.e. not “tidyverse” or “base”). It’s clear that ggplot2 is one of DRob’s toolkits of choice.
Finally, we can get the overall percentage of package usage by uncounting and recounting by package. We add a cumulative percentage column and see that we can almost get to 80% with just three packages:
We saw in the package analysis that DRob is using quite a few “tidy” packages. We can extend the analysis to see how frequently he’s using “tidyverse” functions.
The tidyverse is a very popular set of packages that are developed specifically to do data science in an integrated and easy to understand way. Currently, the “tidyverse” consists of the following packages:
We can flag functions from the tidyverse packages in DRob’s code base using the tidyverse_packages() function. If functions are in a tidyverse package, they are flagged as “Yes” and otherwise “No”.
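A minimal sketch of the flagging step, assuming the tidyverse package is installed and the data has a `package` column; the four rows and the `is_tidy` column name are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical extract of functions and their packages
functions_tbl <- tibble(
    package       = c("dplyr", "ggplot2", "stats", "base"),
    function_name = c("mutate", "aes", "sd", "library")
)

tidy_flag_tbl <- functions_tbl %>%
    # tidyverse_packages() returns the names of all packages in the tidyverse
    mutate(is_tidy = ifelse(package %in% tidyverse::tidyverse_packages(), "Yes", "No"))

# Share of tidyverse vs non-tidyverse functions
tidy_share_tbl <- tidy_flag_tbl %>%
    count(is_tidy) %>%
    mutate(pct = n / sum(n))
```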
Here’s how easy it is to quickly see how tidy DRob is. About 60% of his functions are “tidyverse” functions.
How has DRob’s “tidiness” changed over time? We’ll again call upon tibbletime to help transform the data.
Here’s a fun fact… According to this graph, DRob is over twice as “tidy” now as when he started blogging in 2015. This should tell us that we really need to give the “tidyverse” a shot if we aren’t using it now.
Now the million dollar question: What should we focus on if we are just starting out in R? We’ll use the 80/20 Rule, which boils down to which top functions build 80% of DRob’s code. Ideally this should be around 20% according to the rule. The question is actually really easy to answer using the cumsum() function from base. We can flag any cumulative percentages that are less than or equal to 80% as “high usage”.
Next, we just count our high usage flags and turn the count to percent. We can see that 28.2% of functions create 80% of DRob’s code.
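The cumulative-percentage flag and the follow-up count can be sketched with a handful of made-up frequencies (the function names and counts below are illustrative only, not DRob's actual numbers):

```r
library(dplyr)
library(tibble)

# Hypothetical function frequencies
function_counts_tbl <- tibble(
    function_name = c("mutate", "aes", "ggplot", "filter", "library", "tidy"),
    n             = c(45, 25, 12, 8, 6, 4)
)

high_usage_tbl <- function_counts_tbl %>%
    arrange(desc(n)) %>%
    mutate(
        pct        = n / sum(n),
        cum_pct    = cumsum(pct),                       # running share of all calls
        high_usage = ifelse(cum_pct <= 0.80, "Yes", "No")
    )

# What share of distinct functions carries the "high usage" flag?
high_usage_share_tbl <- high_usage_tbl %>%
    count(high_usage) %>%
    mutate(pct = n / sum(n))
```

In this toy data two of the six functions are flagged, since the third function pushes the cumulative share past 80%. Using `<=` means the function that crosses the threshold is left out, so the flagged set covers slightly less than 80% of calls.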
Finally, here are the functions by package that we should focus on if we are just starting out. Keep in mind this is just DRob and we may want to expand to other masters of data science to get an even better picture of the high usage functions.
Takeaways From DRob’s Code
DRob is using quite a bit of ggplot2 code. In fact, his top three libraries account for 77.6% of his code on Variance Explained.
DRob’s code is getting… tidier! DRob is using approximately 80% tidyverse code in the most recent half-year of blogging. This trend is increasing, although it will eventually top out. This compares to around 37% tidy code when he began blogging in 2015.
If DRob is getting tidier, which area is getting impacted the most? It’s the packages I’ve categorized as “Unknown”. These are packages that are neither tidyverse nor pre-loaded. In other words, these are uncommonly used packages that may serve a specialized need. I do not currently have these loaded, which is why they are considered “Unknown”. It’s worth mentioning that stats libraries are declining slightly, but not to the extent that specialized packages are declining. The bottom line - DRob is using fewer specialized packages and more tidyverse.
One point I have not discussed is that DRob is just one really good data scientist. His code is clearly representative of the tidyverse-style, which resonates with many future data scientists coming into the industry. If one wishes to emulate DRob, this is probably a good analysis to take and run with. However, it may make sense to also view other “masters” that exist as part of a future endeavor.
Another point is that we got 2,314 functions out of 58 posts. While this is by no means a small sample, we certainly may wish to increase the sample size to get more confidence in the highest-usage functions. Personally, I’d like to see a 100X ratio between top functions and total observations, meaning the top 100 functions would be drawn from a minimum of 10,000 functions. With that said, the analysis was performed across a large sample of projects (58 posts less those that do not contain code) and multiple years, which is another factor that improves confidence.
We spoke a lot about analyzing DRob’s code, but with a few modifications you can apply this analysis to your own code stored in .R or .Rmd files! Here’s how with the fs package.
We’ll begin with a relatively large code base from a project I’m working on, which is a new course called HR 201: Predicting Employee Attrition. It’s part of our brand new Business Science University Virtual Workshop, which will be released soon! This code base (directory containing R code) is not available to you, but you can follow along if you have a code base of .R or .Rmd files stored in a directory of your own. Just change the directory path to your own.
Part 1: Extracting Your Functions From Your Code Base
First, load the fs package. This is a great package for working with the file system on your computer.
Next, collect the path for YOUR code base directory. I will use my R Project directory for the HR 201 Course.
Use a function called dir_info() to retrieve the contents of the directory. Add the argument, recursive = TRUE (named recurse in newer releases of fs), to collect all the files from the sub-directories. Use head() to return the first six rows only.
| path | type | size | permissions | modification_time | user | group | device_id | hard_links | special_device_id | inode | block_size | blocks | flags | generation | access_time | change_time | birth_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ../../../Business Science/Courses/Teachable/HR201_Employee_Turnover_H2O/HR201_Employee_Turnover_Project/00_Data | directory | 0 | r-- | 2018-02-18 08:25:04 | NA | NA | 2586258886 | 1 | 0 | 1.322932e+16 | 4096 | 8 | 0 | 0 | 2018-02-18 08:25:04 | 2018-02-18 08:25:04 | 2018-02-02 16:43:35 |
| ../../../Business Science/Courses/Teachable/HR201_Employee_Turnover_H2O/HR201_Employee_Turnover_Project/00_Data/desktop.ini | file | 142 | rw- | 2018-02-02 16:43:49 | NA | NA | 2586258886 | 1 | 0 | 2.589570e+16 | 4096 | 0 | 0 | 0 | 2018-02-02 16:43:49 | 2018-02-02 16:43:49 | 2018-02-02 16:43:49 |
| ../../../Business Science/Courses/Teachable/HR201_Employee_Turnover_H2O/HR201_Employee_Turnover_Project/00_Data/WA_Fn-UseC-HR-Employee-Attrition.xlsx | file | 255.7K | rw- | 2017-11-26 06:37:44 | NA | NA | 2586258886 | 1 | 0 | 1.857735e+16 | 4096 | 512 | 0 | 0 | 2018-02-02 16:43:35 | 2018-02-08 15:43:31 | 2018-02-02 16:43:35 |
| ../../../Business Science/Courses/Teachable/HR201_Employee_Turnover_H2O/HR201_Employee_Turnover_Project/00_Scripts | directory | 0 | r-- | 2018-02-26 09:28:32 | NA | NA | 2586258886 | 1 | 0 | 4.053240e+16 | 4096 | 8 | 0 | 0 | 2018-02-26 09:28:32 | 2018-02-26 09:28:32 | 2018-02-02 16:43:35 |
| ../../../Business Science/Courses/Teachable/HR201_Employee_Turnover_H2O/HR201_Employee_Turnover_Project/00_Scripts/assess_attrition.R | file | 4.01K | rw- | 2018-02-27 21:22:11 | NA | NA | 2586258886 | 1 | 0 | 9.007199e+15 | 4096 | 16 | 0 | 0 | 2018-02-27 21:22:11 | 2018-02-27 21:22:24 | 2018-02-05 16:36:02 |
| ../../../Business Science/Courses/Teachable/HR201_Employee_Turnover_H2O/HR201_Employee_Turnover_Project/00_Scripts/desktop.ini | file | 142 | rw- | 2018-02-02 16:43:49 | NA | NA | 2586258886 | 1 | 0 | 4.503600e+15 | 4096 | 0 | 0 | 0 | 2018-02-02 16:43:49 | 2018-02-02 16:43:49 | 2018-02-02 16:43:49 |
Now that we see how dir_info() works, we can use one more function called path_file() to retrieve just the file portion of the path. We can then use the file name with str_detect() to detect only files with “.R” or “.Rmd” at the end. We’ll create a tibble of the file names and paths.
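To try the file-listing steps without touching a real project, the sketch below builds a small throwaway code base in a temporary directory (the directory and file names are hypothetical). It uses the `recurse` argument, which assumes a reasonably recent release of fs.

```r
library(fs)
library(dplyr)
library(stringr)

# Hypothetical code base created in a temporary directory
code_dir <- path(tempdir(), "example_code_base")
dir_create(path(code_dir, "00_Scripts"))
file_create(path(code_dir, c("00_Scripts/assess_attrition.R", "report.Rmd", "notes.txt")))

r_files_tbl <- dir_info(code_dir, recurse = TRUE) %>%
    # Keep just the file portion of each path
    mutate(file_name = as.character(path_file(path))) %>%
    # Only files ending in .R or .Rmd
    filter(str_detect(file_name, "\\.(R|Rmd)$")) %>%
    select(file_name, path)

r_files_tbl
```

Anchoring the pattern with `$` keeps files such as `.Rproj` or `.RData` out of the result.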
Next, we can create a custom function called build_function_names_tbl_from_file_path(), which is very similar to the URL builder from before. The main difference is that the HTML extraction code is replaced with code that reads the contents of a local file.
We can test it with one of the file paths.
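A file-based version of the builder might look like the sketch below. It assumes base R's `readLines()` for reading the file (the original's choice of reader is not shown), and the temporary script and two-row corpus are hypothetical stand-ins.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical sketch: read a local script, parse function names, attach packages
build_function_names_tbl_from_file_path <- function(path, loaded_functions_tbl) {
    code_text <- readLines(path, warn = FALSE)

    function_names <- str_extract_all(
        paste(code_text, collapse = " "),
        "[A-Za-z.][A-Za-z0-9._]*(?=\\()"
    )[[1]]

    tibble(function_name = function_names) %>%
        left_join(loaded_functions_tbl, by = "function_name")
}

# Hypothetical inputs: a throwaway script and a tiny hand-made corpus
script_path <- tempfile(fileext = ".R")
writeLines(c("library(dplyr)", "mtcars %>% mutate(kpl = mpg * 0.425)"), script_path)

loaded_functions_tbl <- tibble(
    package       = c("base", "dplyr"),
    function_name = c("library", "mutate")
)

file_functions_tbl <- build_function_names_tbl_from_file_path(script_path, loaded_functions_tbl)
file_functions_tbl
```

For .Rmd files this simple reader also scans the prose between code chunks, so any text of the form `word(` in the narrative would be counted as a function call.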
Let’s see what it returns.
Great, it works identically to the web scraping version but with local file paths. We have 57 functions just in the first file.
We can scale it to all code in the code base using the file paths. The process is almost identical to the web scraping process.
Part 2: Analyzing Your Code
You can run through the same process with your code. Here are my top 20 functions.
You can assess the similarities and differences between you and DRob. For example, DRob and I both use quite a bit of dplyr for data manipulation and ggplot2 for visualization.
I have a few differences related to my coding techniques. I do quite a bit of programming, so base::function() is in fourth place and dplyr::enquo() (part of the new tidy eval framework) is in 15th place. I also have tidyquant::theme_tq(), related to my preference for tidyquant ggplot2 themes.
And, we can also see how DRob’s top 20 differs from mine. Most of these functions are ones I use frequently, just not in my top 20. And, this is likely the case for DRob with the dissimilar functions in the table above.
We are halfway through our quest to develop an optimal strategy for learning R. We picked a great candidate in DRob to learn from. He’s a tidyverse aficionado, a master data scientist, and he has a large sample of blog posts spanning multiple years to aggregate and analyze.
We learned a bunch of cool things related to our hypothesis. To recap, we hypothesized that (1) you don’t need to learn everything to become proficient at R, and (2) we can develop a strategic plan by learning from a master data scientist. We have not proven the second point yet, but the first we can confirm with confidence given that 88 functions created 80% of the output on DRob’s blog.
In the next post we’ll dive deeper into the list of top functions generated to see if we can develop a program to go from zero experience in R to intermediate status quickly! If our 80/20 theory is right, we should be able to go from zero to intermediate in just a couple weeks by focusing on the most critical functions.
Business Science specializes in “ROI-driven data science”. Our focus is helping organizations apply data science to business through projects that generate a financial benefit. Visit Business Science on the web or contact us to learn more!