Automatic Exploratory Data Analysis in R with DataExplorer
Written by Matt Dancho on March 2, 2021
Data Scientists spend 80% of their time just trying to understand and prepare data for analysis! This process is called Exploratory Data Analysis (EDA). R has an insane EDA productivity-enhancer. It’s called DataExplorer. And I’m going to get you up and running with DataExplorer in under 5 minutes:
- How to make an Automatic EDA Report in seconds with DataExplorer.
- BONUS: How to use the 7 most important EDA Plots to get exploratory insights.
This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.
Here are the links to get set up. 👇
Learn how to use the DataExplorer package in my 5-minute YouTube video tutorial.
What you make in this R-Tip
By the end of this R-Tip, you’ll make this exploratory data analysis report. Perfect for impressing your boss and coworkers! (Nice EDA skills)
Thank you developers.
Before we dive into DataExplorer, I want to take a moment to thank the developer, Boxuan Cui. He’s currently working as a Senior Data Science Manager at Tripadvisor. In his spare time, Boxuan has built and maintains one of the most useful R packages on the planet: DataExplorer. Thank you!
Automatic Exploratory Data Analysis with DataExplorer
One of the coolest features of DataExplorer is the ability to create an EDA Report in 1 line of code. This automates:
- Basic Statistics
- Data Structure
- Missing Data Profiling
- Continuous and Categorical Distribution Profiling (Histograms, Bar Charts)
- Relationships (Correlation)
Ultimately, this saves the analyst/data scientist SO MUCH TIME.
Step 1: Load the libraries and data
To get set up, all we need to do is load the following libraries and data.
We’ll use the gss_cat dataset, which has income levels for people by various factors including marital status, age, race, religion, and more.
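Here is a minimal setup sketch. The text above doesn't name the libraries, so treat the choice of tidyverse (which attaches forcats, the package that ships gss_cat) as an assumption:

```r
# Assumed setup: the library choice is a guess, not taken from the article text
library(tidyverse)     # attaches forcats, which ships the gss_cat dataset
library(DataExplorer)  # automatic EDA functions

# Quick look at the columns and their types before profiling
glimpse(gss_cat)
```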
With data in hand, we are ready to create the automatic EDA report. Let’s go!
Step 2: Create the Automatic EDA Report
Use create_report() to make our EDA report. Be sure to specify the output file, output directory, target variable (y), and report title.
This produces an automatic EDA report that covers all of the important aspects that we need to analyze in our data! It’s that simple folks.
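A sketch of what that call can look like. The output file name, output directory, and report title below are hypothetical placeholders for illustration. Rendering the report needs rmarkdown and pandoc to be available:

```r
library(DataExplorer)

# gss_cat ships with the forcats package
data("gss_cat", package = "forcats")

# Hypothetical output location and file name, chosen only for this sketch
report_dir  <- tempdir()
report_file <- "gss_eda_report.html"

create_report(
  data         = gss_cat,
  output_file  = report_file,
  output_dir   = report_dir,
  y            = "rincome",                      # target variable
  report_title = "EDA Report: GSS Survey Data"   # hypothetical title
)

# Full path to the rendered HTML report
report_path <- file.path(report_dir, report_file)
```

Swap tempdir() for a project folder if you want to keep the report around.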
BONUS: How to use the 7 Most Important DataExplorer Plots
As an extra special bonus, I figured I’d teach you not only how to make the report BUT how to use the report too. I know, I’m too kind.
Here’s how to get the most out of your automatic EDA report. If you’d like the code to produce the individual plots, simply sign up for our free R-tips codebase.
1. Basic Statistics
The basic statistics section is where we first start understanding our data. We can see information about our columns including:
- Discrete columns: Columns with categorical data
- Continuous columns: Columns with numeric data
- All missing columns: Columns that have 100% missing data
- Complete rows: Percentage of rows that are complete (no missing data)
- Missing observations: Of all of the values in the data, this is the percentage that are missing
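If you want these numbers outside the report, a sketch like this should work. It uses DataExplorer's introduce() for the raw counts and plot_intro() for the percentage chart:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# One-row summary: rows, columns, discrete/continuous columns, missing values
intro <- introduce(gss_cat)
print(intro)

# The same information as a percentage bar chart
plot_intro(gss_cat)
```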
2. Data Structure
Next, we can examine each of the columns, specifically learning what data types are contained inside each one.
- This is really important when we need to know a bit more detail about our data.
- We can begin to hypothesize what we should do to get it into the correct structure for analysis.
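A small sketch for checking the structure yourself. The col_types helper variable is just for illustration. plot_str() draws the structure as an interactive network diagram:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# First class of every column (col_types is an illustrative name)
col_types <- vapply(gss_cat, function(col) class(col)[1], character(1))
print(col_types)

# Interactive diagram of the data structure
plot_str(gss_cat)
```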
3. Missing Data Profile
The missing data profile report helps us understand which columns have missing data.
- We can start to think about missing data treatment: imputation strategies, or whether we will need to remove columns
- We can see if columns have hardly any or no missing data, which will be easier to use
- We can see if columns have a lot of missing data, which may need to be removed or heavily treated
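The missing data profile can be pulled up on its own too. A minimal sketch using profile_missing() for the table and plot_missing() for the chart:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# Table with one row per column: feature, num_missing, pct_missing
missing_profile <- profile_missing(gss_cat)
print(missing_profile)

# Bar chart of missing percentage per column
plot_missing(gss_cat)
```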
4. Univariate Distributions
We have a bunch of options here, which can be used to dive into the columns. I’ll focus on the 2 most important:
- Continuous Features: Histogram
- Categorical Features: Bar Plot
4A. Continuous Features (Histogram)
We can check out the distribution of the numeric data to quickly see what we are dealing with.
- Skewed data: TV Hours is very skewed with a few outliers (e.g. watching 24 hours of TV per day)
- Non-normal data: Age skews toward 25-50 year old respondents, with fewer respondents over 50
- Data Range: We can see the survey results go from year 2000 to 2014. It looks like every 2 years.
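To reproduce the histograms outside the report, something like this should do it. The year_range check is an extra I added to confirm the survey years directly:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# One histogram per continuous column (year, age, tvhours)
plot_histogram(gss_cat)

# Confirm the first and last survey year directly
year_range <- range(gss_cat$year)
print(year_range)
```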
4B. Categorical Features (Bar Plot)
The categorical feature distributions are frequency counts by category shown as a bar plot. This helps us:
- Understand categorical distribution: Categories tend to have some levels that are highly present and others that are much more rare.
- Start thinking about categorical treatment: We may need to lump some categories together before modeling.
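Here is a sketch of both ideas. plot_bar() gives the frequency plots. The lumping step uses forcats::fct_lump_n(), which is my pick for illustration (the article doesn't prescribe a lumping function), and keeping 5 levels is an arbitrary choice:

```r
library(DataExplorer)
library(forcats)  # ships gss_cat and the fct_lump_n() helper

# Frequency bar plot for every categorical column
plot_bar(gss_cat)

# Illustration only: keep the 5 most common religions, lump the rest into "Other"
relig_lumped <- fct_lump_n(gss_cat$relig, n = 5)
plot_bar(data.frame(relig = relig_lumped))
```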
5. Correlation Analysis
- Correlation helps us tell whether we should move on to modeling.
- Warning! The correlation can be a bit misleading, because it only measures linear relationships. Many relationships between numeric variables are nonlinear. For example, middle-aged people may be more likely to watch less TV. But young and older people may be more likely to watch more TV. The correlation could be low because of the nonlinear relationship.
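A sketch for drawing the correlation heatmap yourself. gss_cat has missing values in age and tvhours, so I'm passing use = "pairwise.complete.obs" through cor_args. That setting is my assumption, not something the article specifies:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# Heatmap across numeric columns and dummy-encoded categorical columns
plot_correlation(gss_cat, cor_args = list(use = "pairwise.complete.obs"))

# A single pairwise correlation, for comparison with the heatmap cell
age_tv_cor <- cor(gss_cat$age, gss_cat$tvhours, use = "pairwise.complete.obs")
print(age_tv_cor)
```

Remember this is a linear measure, so a value near zero doesn't rule out a curved relationship.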
6. Principal Components
Plotting principal components can help you determine if the data can be compressed. I’ll explain what I mean by this.
Data that is very wide (many columns) can be computationally expensive to model.
- By applying PCA (Principal Component Analysis), we can determine if compressing the data with an algorithm like PCA or UMAP is appropriate.
- Here I can see that the first 20 principal components explain about 37% of the variance.
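A sketch of the PCA plots on their own. plot_prcomp() doesn't accept missing values, so this drops incomplete rows first. The droplevels() step is a precaution I added so that empty factor levels don't become constant dummy columns. Because of those choices, the numbers may not match the report exactly:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# Keep complete rows only, then drop factor levels that are no longer used
gss_complete <- droplevels(na.omit(gss_cat))

# Variance explained by the principal components, plus feature loadings
plot_prcomp(gss_complete)
```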
7. Bivariate Distributions
Now we are going to focus on how each feature varies with the target (rincome - how much each person/household makes in annual income).
- Box Plot: For analyzing numeric vs categorical
- Scatter Plot: For analyzing numeric vs numeric (not shown)
Box Plot: Numeric vs Categorical
With the box plot, we can:
- Begin to visualize relationships.
- See how each numeric feature (age, tv hours, year) has a relationship with rincome
- High income earners ($25000 or more) tend to be in their early 40s, while low income earners tend to be in their late 20s
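To get the box plots without the full report, here is a minimal sketch. The median_age table is an extra I added so you can read the age pattern as numbers:

```r
library(DataExplorer)
data("gss_cat", package = "forcats")

# Box plots of every numeric column, split by the target
plot_boxplot(gss_cat, by = "rincome")

# Median age per income bracket (illustrative helper table)
median_age <- tapply(gss_cat$age, gss_cat$rincome, median, na.rm = TRUE)
print(median_age)
```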
We learned how to use the DataExplorer library to automatically create an exploratory data analysis report. Great work! But, there’s a lot more to becoming a data scientist.
If you’d like to become a data scientist (and have an awesome career, improve your quality of life, enjoy your job, and all the fun that comes along), then I can help with that.
Step 1: Watch my Free 40-Minute Webinar
Learning data science on your own is hard. I know because IT TOOK ME 5 YEARS to feel confident.
AND, I don’t want it to take that long for you.
So, I put together a FREE 40-minute webinar (a masterclass) that provides a roadmap for what worked for me.
Literally 5 years of learning, consolidated into 40 minutes. It’s jam-packed with value. I wish I saw this when I was starting… It would have made a huge difference.
Step 2: Take action
For my action-takers, if you are ready to take your skills to the next level and DON’T want to wait 5 years to learn data science for business, AND you want a career you love that earns you a $100,000+ salary (plus bonuses), and you’d like someone to help you do this in UNDER 6 MONTHS…
Then I can help with that too. There’s a link in the FREE 40-minute webinar for a special price (because you are special!) and taking that action will kickstart your journey with me in your corner.
Get ready. The ride is wild. And the destination is AMAZING!