Automatic Exploratory Data Analysis in R with DataExplorer

Written by Matt Dancho on March 2, 2021



Data scientists are often said to spend 80% of their time just trying to understand and prepare data for analysis! The "understanding" part of that process is called Exploratory Data Analysis (EDA). R has an insane EDA productivity-enhancer. It’s called DataExplorer. And I’m going to get you up and running with DataExplorer in under 5 minutes:

  1. How to make an Automatic EDA Report in seconds with DataExplorer.
  2. BONUS: How to use the 7 most important EDA Plots to get exploratory insights.

R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.

Here are the links to get set up. 👇

Video Tutorial

Learn how to use the DataExplorer package in my 5-minute YouTube video tutorial.

What you make in this R-Tip

By the end of this R-Tip, you’ll make this exploratory data analysis report. Perfect for impressing your boss and coworkers! (Nice EDA skills)

Get the code.

Thank you developers.

Before we dive into DataExplorer, I want to take a moment to thank the developer, Boxuan Cui. He’s currently working as a Senior Data Science Manager at Tripadvisor. In his spare time, Boxuan has built and maintains one of the most useful R packages on the planet: DataExplorer. Thank you!

Automatic Exploratory Data Analysis with DataExplorer

One of the coolest features of DataExplorer is the ability to create an EDA Report in 1 line of code. This automates:

  • Basic Statistics
  • Data Structure
  • Missing Data Profiling
  • Continuous and Categorical Distribution Profiling (Histograms, Bar Charts)
  • Relationships (Correlation)

Ultimately, this saves the analyst/data scientist SO MUCH TIME.

Step 1: Load the libraries and data

To get set up, all we need to do is load the following libraries and data.

Get the code.

We’ll use the gss_cat dataset, which has income levels for people by various factors including marital status, age, race, religion, and more.
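If you don't have the code handy, here's a minimal sketch of the setup. I'm assuming the tidyverse (which attaches forcats, the package that ships gss_cat) and DataExplorer are both installed; the exact libraries used in the video may differ.

```r
# Assumed setup: tidyverse attaches forcats, which provides gss_cat
library(tidyverse)
library(DataExplorer)

# Quick look at the columns and their types before profiling
gss_cat %>% glimpse()
```

If either package is missing, install.packages(c("tidyverse", "DataExplorer")) should take care of it.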

With data in hand, we are ready to create the automatic EDA report. Let’s go!

Step 2: Create the Automatic EDA Report

Next, use create_report() to make our EDA report. Be sure to specify the output file, output directory, target variable (y), and give it a report title.

Get the code.

This produces an automatic EDA report that covers all of the important aspects that we need to analyze in our data! It’s that simple, folks.
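Here's a rough sketch of what that call can look like. The output folder, file name, and report title are placeholders I made up for illustration, so swap in your own. Rendering also needs rmarkdown and pandoc (both come with RStudio).

```r
library(tidyverse)
library(DataExplorer)

# Hypothetical output location and file name; change these for your project
out_dir  <- file.path(tempdir(), "eda_report")
out_file <- "gss_survey_data_profile_report.html"
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# One call builds the full report; y marks rincome as the target variable
gss_cat %>%
    create_report(
        output_file  = out_file,
        output_dir   = out_dir,
        y            = "rincome",
        report_title = "EDA Report - GSS Demographic Survey"
    )
```

Setting y tells DataExplorer which column is the response, so the bivariate sections of the report are broken out by rincome.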

BONUS: How to use the 7 Most Important DataExplorer Plots

As an extra special bonus, I figured I’d teach you not only how to make the report BUT how to use the report too. I know, I’m too kind.

Here’s how to get the most out of your automatic EDA report. If you’d like the code to produce the individual plots, simply sign up for our free R-tips codebase.

1. Basic Statistics

Get the code.

The basic statistics are where we first start understanding our data. We can see information about our columns including:

  • Discrete columns: Columns with categorical data
  • Continuous columns: Columns with numeric data
  • All missing columns: Columns that have 100% missing data
  • Complete rows: Percentage of rows that are complete (no missing data)
  • Missing observations: Percentage of all values in the data that are missing
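If you want these numbers outside of the report, a minimal sketch (assuming the same tidyverse + DataExplorer setup) is to call introduce() for the raw counts and plot_intro() for the percentage chart:

```r
library(tidyverse)
library(DataExplorer)

# One-row table of counts: rows, columns, discrete/continuous columns,
# all-missing columns, missing values, complete rows, and more
intro_tbl <- gss_cat %>% introduce()
intro_tbl

# The same information as a percentage bar chart
gss_cat %>% plot_intro()
```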

2. Data Structure

Get the code.

Next, we can examine each of the columns individually, learning which data types each one contains.

  • This is really important when we need to know a bit more detail about our data.
  • We can begin to hypothesize what we should do to get it into the correct structure for analysis.
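A minimal sketch for looking at the structure on its own: plot_str() draws the interactive structure diagram (it relies on the networkD3 package, which I'm assuming is installed), and a one-line base R summary of column classes gives you the same information as plain text.

```r
library(tidyverse)
library(DataExplorer)

# Interactive network diagram of the data structure (uses networkD3)
gss_cat %>% plot_str()

# Plain-text alternative: the first class of each column
col_types <- sapply(gss_cat, function(x) class(x)[1])
col_types
```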

3. Missing Data Profile

Get the code.

The missing data profile report helps us understand which columns have missing data.

  • We can start to think about missing data treatment - imputation strategies or if we will need to remove columns
  • We can see if columns have hardly any or no missing data - which will be easier to use
  • We can see if columns have a lot of missing data - which may need to be removed or heavily treated
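To pull the missing data profile by itself, here's a short sketch using profile_missing() for the table and plot_missing() for the chart. Sorting by pct_missing is just my own convenience step.

```r
library(tidyverse)
library(DataExplorer)

# Table with one row per column: feature, num_missing, pct_missing
missing_tbl <- gss_cat %>% profile_missing()

# Worst offenders first
missing_tbl %>% arrange(desc(pct_missing))

# Bar chart of missing percentage per column
gss_cat %>% plot_missing()
```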

4. Univariate Distributions

We have a bunch of options here, which can be used to dive into the columns. I’ll focus on the 2 most important:

  • Continuous Features: Histogram
  • Categorical Features: Bar Plot

4A. Continuous Features (Histogram)

We can check out the distribution within the numeric data to quickly see what we are dealing with.

Get the code.

We can get a sense of the distribution of the numeric data.

  • Skewed data: TV Hours is very skewed with a few outliers (e.g. watching 24 hours of TV per day)
  • Non-normal data: Age is not normally distributed; respondents aged 25-50 are more common than those over 50
  • Data Range: We can see the survey results go from year 2000 to 2014. It looks like the survey runs every 2 years.
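A minimal sketch for the histograms, assuming the same setup as before. plot_histogram() picks up every numeric column on its own; the survey_years line is an extra check I added to confirm the year range by hand.

```r
library(tidyverse)
library(DataExplorer)

# One histogram per continuous column (year, age, tvhours)
gss_cat %>% plot_histogram()

# Sanity check on the data range: which survey years are present?
survey_years <- sort(unique(gss_cat$year))
survey_years
```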

4B. Categorical Features (Bar Plots)

Get the code.

The categorical feature distributions are frequency counts by category shown as a bar plot. This helps us:

  • Understand categorical distribution: Categories tend to have some levels that are highly present and others that are much more rare.
  • Start thinking about categorical treatment: We may need to lump some categories together before modeling.
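Here's a sketch of both ideas: plot_bar() for the frequency counts, plus one possible lumping treatment using fct_lump_n() from forcats. Keeping the 5 most common religions is an arbitrary choice on my part, purely for illustration.

```r
library(tidyverse)
library(DataExplorer)

# One bar chart per categorical column
gss_cat %>% plot_bar()

# Illustrative treatment: keep the 5 most common religions, lump the rest
gss_lumped <- gss_cat %>%
    mutate(relig = fct_lump_n(relig, n = 5))

# Re-check the distribution after lumping
gss_lumped %>% select(relig) %>% plot_bar()
```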

5. Correlation Analysis

Get the code.

  • Correlation helps us tell whether we should move on to modeling.
  • Warning! The correlation can be a bit misleading. It only measures linear association, and many relationships between numeric variables are non-linear. For example, middle-aged people may be more likely to watch less TV. But young and older people may be more likely to watch more TV. The correlation could be low because of the nonlinear relationship.
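A sketch of the correlation plot, followed by a toy example of the warning above. The maxcat value and the pairwise-complete setting are my assumptions (age and tvhours contain missing values, so cor() needs to be told how to handle them). The U-shape numbers are made up and are not GSS data.

```r
library(tidyverse)
library(DataExplorer)

# Correlation heatmap; discrete features with more than 15 categories are skipped
gss_cat %>%
    plot_correlation(
        maxcat   = 15,
        cor_args = list(use = "pairwise.complete.obs")
    )

# Toy illustration with made-up numbers: a perfect U-shaped relationship
age_centered <- seq(-30, 30, by = 1)        # hypothetical "age minus midpoint"
tv_hours     <- 2 + age_centered^2 / 300    # hypothetical TV hours
u_shape_cor  <- cor(age_centered, tv_hours)
u_shape_cor  # essentially zero, even though tv_hours is fully determined by age
```

So a near-zero correlation tells you there's no linear relationship, not that there's no relationship at all.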

6. Principal Components

Plotting principal components can help you determine if the data can be compressed. I’ll explain what I mean by this.

Get the code.

Data that is very wide (many columns) can be computationally expensive to model.

  • By applying PCA (Principal Component Analysis), we can determine if compressing using an algorithm like PCA or UMAP is appropriate.
  • Here I can see that the first 20 principal components explain only about 37% of the variance.
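A minimal sketch of the PCA plot, under a couple of assumptions: I drop rows with missing values first (PCA can't handle NAs) and drop unused factor levels so no all-zero dummy columns sneak in. The variance_cap value is also my choice, so the numbers you see may not match the figure quoted above.

```r
library(tidyverse)
library(DataExplorer)

# PCA needs complete rows; droplevels() removes factor levels left empty
gss_complete <- gss_cat %>%
    na.omit() %>%
    droplevels()

# Categorical columns are converted to dummy variables before PCA;
# components are shown until 80% of the variance is explained
gss_complete %>% plot_prcomp(variance_cap = 0.8)
```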

7. Bivariate Distributions

Now we are going to focus on how each feature varies with the target (rincome - how much each person/household makes in annual income).

  • Box Plot: For analyzing numeric vs categorical
  • Scatter Plot: For analyzing numeric vs numeric (not shown)

Box Plot: Numeric vs Categorical

Get the code.

With the box plot, we can:

  • Begin to visualize relationships.
  • See how each numeric feature (age, tv hours, year) has a relationship with rincome
  • Respondents in the top income bracket ($25,000 or more) tend to be in their early 40s, while low-income earners are in their late 20s
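Here's a sketch of the bivariate plots, assuming the same setup. plot_boxplot() splits every numeric column by the target. The median-age table is an extra step I added so you can check the age pattern by the numbers, and the scatter plot line covers the numeric vs numeric case that isn't shown above.

```r
library(tidyverse)
library(DataExplorer)

# Box plots of each numeric feature (year, age, tvhours) by income bracket
gss_cat %>% plot_boxplot(by = "rincome")

# Numeric check on the age pattern: median age per income bracket
age_by_income <- gss_cat %>%
    group_by(rincome) %>%
    summarise(median_age = median(age, na.rm = TRUE))
age_by_income

# Numeric vs numeric: TV hours against age
gss_cat %>%
    select(age, tvhours) %>%
    plot_scatterplot(by = "age")
```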

Recap

We learned how to use the DataExplorer library to automatically create an exploratory data analysis report. Great work! But, there’s a lot more to becoming a data scientist.

If you’d like to become a data scientist (and have an awesome career, improve your quality of life, enjoy your job, and all the fun that comes along), then I can help with that.

Step 1: Watch my Free 40-Minute Webinar

Learning data science on your own is hard. I know because IT TOOK ME 5-YEARS to feel confident.

AND, I don’t want it to take that long for you.

So, I put together a FREE 40-minute webinar (a masterclass) that provides a roadmap for what worked for me.

Literally 5 years of learning, consolidated into 40 minutes. It’s jam-packed with value. I wish I had seen this when I was starting… It would have made a huge difference.

Step 2: Take action

For my action-takers, if you are ready to take your skills to the next level and DON’T want to wait 5 years to learn data science for business, AND you want a career you love that earns you $100,000+ salary (plus bonuses), and you’d like someone to help you do this in UNDER 6 MONTHS…

Then I can help with that too. There’s a link in the FREE 40-minute webinar for a special price (because you are special!) and taking that action will kickstart your journey with me in your corner.

Get ready. The ride is wild. And the destination is AMAZING!


👇 Top R-Tips Tutorials you might like:

  1. mmtable2: ggplot2 for tables
  2. ggdist: Make a Raincloud Plot to Visualize Distribution in ggplot2
  3. ggside: Plot linear regression with marginal distributions
  4. DataEditR: Interactive Data Editing in R
  5. openxlsx: How to Automate Excel in R
  6. officer: How to Automate PowerPoint in R
  7. DataExplorer: Fast EDA in R
  8. esquisse: Interactive ggplot2 builder
  9. gghalves: Half-plots with ggplot2
  10. rmarkdown: How to Automate PDF Reporting
  11. patchwork: How to combine multiple ggplots
  12. Geospatial Map Visualizations in R

Want these tips every week? Join R-Tips Weekly.