How to Automate Exploratory Analysis Plots
Written by Luciano Oliveira Batista on October 8, 2020
When plotting different charts during your exploratory data analysis (EDA), you sometimes end up doing a lot of repetitive coding. What we’ll show here is a better way to do your EDA, with less unnecessary coding and more flexibility. So, let me introduce you to a powerful package combo: ggplot2 and purrr.
ggplot2 is an awesome package for data visualization, very well known in the Data Science community, and probably the library you already use to build your charts during EDA. The purrr package enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
Here we’ll use an approach similar to loops, but written in a more compact way that is easier to read: the map functions from purrr.
The dataset is a fictional HR dataset created by IBM data scientists. In our analysis, we’ll look only at the categorical variables and plot the proportion of each class within each of them.
The dataset has 35 features in total, 9 of which are categorical. Those 9 are the ones we will use.
Before we start building our plot, we need to specify which variables will be used in the analysis. We’ll look only at the categorical features, in order to see the proportion between their different classes, so we’ll write a named vector holding the names of those columns.
The set_names() function is super handy for naming character vectors, since by default it uses the values of the vector as its names.
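As a minimal sketch of that step, the snippet below builds the named vector from the character columns of a data frame. The `hr_tbl` tibble is a small hypothetical stand-in for the HR data (its values are made up), and the object name `categorical_vars` is an assumption used only for illustration.

```r
library(dplyr)
library(purrr)

# Toy stand-in for the HR data (hypothetical values, not the IBM dataset)
hr_tbl <- tibble::tibble(
  Age        = c(41, 49, 37, 33),
  Attrition  = c("Yes", "No", "Yes", "No"),
  Department = c("Sales", "R&D", "R&D", "Sales")
)

# Keep only the character columns, take their names,
# and let set_names() name the vector by its own values
categorical_vars <- hr_tbl %>%
  select(where(is.character)) %>%
  names() %>%
  set_names()

categorical_vars
```

Naming the vector pays off later: when it is passed to a map function, the resulting list inherits those names, so each plot can be looked up by its variable.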
Create a plot
My approach to this problem is to first build the chart you want for a single variable, and then replace that variable with a function input. This first part is where you need to put the most effort into coding.
In the code chunk below you’ll find a plot for a specific variable, “Attrition”.
I like to use this kind of plot instead of bar charts when many different plots are shown together. The columns of a bar chart can throw too much information at the reader when only the end of the bar matters.
One tip to really grasp the steps for building this kind of chart (a lollipop chart) is to think of plots as layers (the grammar of graphics) and put one on top of the other.
There are three core layers that need to be built in sequence:
The rest of the plot is the same as in any ggplot chart you have already built.
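A minimal sketch of a lollipop chart for “Attrition” could look like the following. Here the three layers are taken to be a segment, a point, and a text label, which is an assumption about the intended layers. The `hr_tbl` data is a hypothetical toy sample, and the styling choices (sizes, colors, theme) are illustrative only.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the HR data (hypothetical values)
hr_tbl <- tibble::tibble(
  Attrition = c("Yes", "No", "No", "No", "Yes", "No", "No", "No")
)

# Proportion of each class within the variable
attrition_prop <- hr_tbl %>%
  count(Attrition) %>%
  mutate(prop = n / sum(n))

p <- ggplot(attrition_prop, aes(x = prop, y = Attrition)) +
  # Layer 1 (assumed): the stick, drawn from zero to the proportion
  geom_segment(aes(x = 0, xend = prop, yend = Attrition), color = "grey50") +
  # Layer 2 (assumed): the dot at the end of the stick
  geom_point(size = 4) +
  # Layer 3 (assumed): the percentage label next to the dot
  geom_text(aes(label = scales::percent(prop, accuracy = 1)), hjust = -0.5) +
  # The rest is ordinary ggplot2 polish
  scale_x_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(title = "Attrition", x = "Proportion", y = NULL) +
  theme_minimal()

p
```

The order matters because later layers are drawn on top of earlier ones: the point covers the end of the segment, and the label sits beside the point.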
Now we need to replace the hard-coded variable used before with an input of a function.
Creating our plotting function
Using this strategy we will have the following function:
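One way such a function might look is sketched below. The name `plot_frequency`, its arguments, and the toy `hr_tbl` data are hypothetical. The column is passed as a string and looked up with the `.data[[var]]` pronoun, which is one of several ways to program with dplyr.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: lollipop chart of class proportions for one column.
# `var` is the column name as a string.
plot_frequency <- function(data, var) {
  prop_tbl <- data %>%
    count(category = .data[[var]]) %>%  # look the column up by name
    mutate(prop = n / sum(n))

  ggplot(prop_tbl, aes(x = prop, y = category)) +
    geom_segment(aes(x = 0, xend = prop, yend = category), color = "grey50") +
    geom_point(size = 4) +
    geom_text(aes(label = scales::percent(prop, accuracy = 1)), hjust = -0.5) +
    scale_x_continuous(labels = scales::percent, limits = c(0, 1)) +
    labs(title = var, x = "Proportion", y = NULL) +
    theme_minimal()
}

# Usage on a toy stand-in for the HR data (hypothetical values)
hr_tbl <- tibble::tibble(
  Attrition  = c("Yes", "No", "Yes", "No"),
  Department = c("Sales", "R&D", "R&D", "Sales")
)

p_dept <- plot_frequency(hr_tbl, "Department")
p_dept
```

Taking the column name as a string keeps the function easy to call from a map function over a character vector of names.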
Here is the important step, where we apply the function we created to all character features in the dataset. We’ll also use
cowplot::plot_grid(), which puts all the ggplot2 objects together in a single grid.
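Put together, the step can be sketched as follows. This assumes a plotting helper that takes the data and a column name; the compact `plot_frequency` defined here, the toy `hr_tbl` data, and the two-column layout are all hypothetical choices made so the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)
library(purrr)
library(cowplot)

# Toy stand-in for the HR data (hypothetical values)
hr_tbl <- tibble::tibble(
  Age        = c(41, 49, 37, 33),
  Attrition  = c("Yes", "No", "Yes", "No"),
  Department = c("Sales", "R&D", "R&D", "Sales")
)

# Compact hypothetical plotting helper (segment + point only)
plot_frequency <- function(data, var) {
  data %>%
    count(category = .data[[var]]) %>%
    mutate(prop = n / sum(n)) %>%
    ggplot(aes(x = prop, y = category)) +
    geom_segment(aes(x = 0, xend = prop, yend = category)) +
    geom_point(size = 4) +
    labs(title = var, x = "Proportion", y = NULL)
}

# Named vector of the character columns
categorical_vars <- hr_tbl %>%
  select(where(is.character)) %>%
  names() %>%
  set_names()

# One plot per categorical variable; the list keeps the vector's names
all_plots <- map(categorical_vars, ~ plot_frequency(hr_tbl, .x))

# Arrange every plot in a single grid
plots_grid <- plot_grid(plotlist = all_plots, ncol = 2)
plots_grid
```

Because the list is named, a single chart can still be pulled out afterwards, for example with `all_plots$Attrition`.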
In this tutorial, you learned how to save time when you need to build the same chart many times. I hope it was useful for you.
Author: Luciano Oliveira Batista
Luciano is a chemical engineer and data scientist in training. Learn more on his blog at lobdata.com.