# Excel to R, Part 2 - Speed Up Exploratory Data Analysis 100X (R Code!)

Written by Matt Dancho

Youâ€™re a Business Analyst - well versed in tools like Tableau, PowerBI, and maybe even SQL, but you want to take your data analytics abilities to the next level, by improving productivity and making predictive business insights with data science (rather than just descriptive insights). Then R is the language for you. In this article, youâ€™ll learn how to perform Exploratory Data Analysis (EDA) 100X FASTER using R!!!

Donâ€™t believe me? This is one of the techniques that one of my Business Science University students used to place 3rd in a time-sensitive Tableau competitionâ€¦ without knowing Tableau! He used R and Correlation Analysis to understand his data and develop the best story delivering critical business insights fast. Youâ€™ll learn one of the tools he used in this article.

Bonus! - Iâ€™m demonstrating a new R package thatâ€™s under developement: correlationfunnel. Youâ€™ll see how I use it to get business insights fast!

## Article Series Overview

This is the 2nd Article in a series on why you should learn R if youâ€™re coming from a Business Analytics background.

## What You Learn Today: Correlation Funnel for 100X EDA Speedup

I am going to show you one of the simplest and fastest ways to analyze your data. In this article, Iâ€™m going to introduce you to a concept called Correlation Analysis using my new correlationfunnel R package. First, weâ€™ll see why Exploratory Data Analysis (EDA) in Excel (with Pivot Tables) is painfully slow and easy to miss important aspects of your data. Then, weâ€™ll show you how to create a Correlation Funnel Visualization that brings out the best features FAST!

Hereâ€™s what the Correlation Funnel looks like for the bank marketing campaign dataset we will use. High correlation features go to the top. From the data visualization, we can quickly see that DURATION (319_Inf) and POUTCOME (success) are the top positive correlations to TERM DEPOSIT (yes). Iâ€™ll explain what this means later. For now just realize its (1) really important business insights, and (2) we can do this FAST!

Correlation Funnel - Fast EDA

## Bank Marketing Campaign Data

This 3-Part series follows a real-world study that companies like Bank of America an others that provide Certificates of Deposit (CDs) (Term Deposits) would go through to analyze customer purchasing behavior.

The bank is interested in contacting customers that are likely to enroll in a Term Deposit (CD for a fixed term). The banks love these products because they can earn interest on the customers money. More enrollments means more revenue for the bank.

The dataset weâ€™ll use is a modified version of the â€śBank Marketing Data Setâ€ť provided by the UCI Machine Learning Repository.1 The version we will use is in an Excel file with multiple tabs covering the business process.

You can get the Bank Marketing Campaign data set here in Excel here.

## The Problem with Pivot Tables

The normal approach to Exploratory Data Analysis (EDA) is to investigate each feature, mining for relationships to some goal or target. In our case, enrollment in the TERM DEPOSIT (financial product).

In Excel, we use Pivot Tables to do this. The Marketing Campaign has a 16 Dependent Features (excluding the target and the ID field). This would involve making at least 16 Plots (one for each comparison to the target). There are 2 problems with this approach:

1. Pivot Tables are not systematic - Itâ€™s easy to miss one of the features or to get side tracked in the analysis project.

2. Pivot Tables are time consuming to make - Each Pivot Table can take 10+ minutes to make and format

Hereâ€™s an example of one pivot table. It took me about 10 minutes to make and format. It looks great, but itâ€™s going to be a long day if I have to do 15 more of these just to investigate relationships.

Pivot Table - Takes 10 Minutes Per Plot x 16 Plots = 160 Minutes (~ 3hrs of nonstop visualization = Long Day)

## The Solution - Use Correlation Analysis

Letâ€™s see how we can get straight to business insights.

### 1.0 Libraries

R packages weâ€™ll need:

• correlationfunnel - Install from GitHub
• tidyverse
• readxl
• tidyquant
• knitr

Youâ€™ll need to install my new R package, correlationfunnel, from GitHub (Iâ€™ll most likely be adding to CRAN in the near future).

The tidyverse, readxl, tidyquant, knitr packages can be installed from CRAN using install.packages().

We can quickly load the marketing data and join the appropriate excel sheets (4 through 7) using the code from the Previous Article in the Excel to R series.

Key Points:

• Data from the 4 tables in the Excel Sheets (4 - 7) were joined by â€śIDâ€ť column
• Target is TERM_DEPOSIT
• Data has 16 Predictors (AGE:POUTCOME Columns)
• ID is a unique identifier, and will not provide predictive value
• No â€śdateâ€ť or â€śdate-timeâ€ť features. DAY and MONTH are already â€śfeature engineeredâ€ť. We could add Day of Week if we new what year this data set was from, but weâ€™ll leave that for another day.
• Data is formatted as either numeric continuous features (e.g. AGE is double format) or Categorical features (e.g. JOB is chr format)
• We still need to check for missing values, which will throw a correlation analysis off

### 3.0 Check Data for Missing Values

Iâ€™m going to do a quick check for missing values. The following code does a quick count of NA values by column using the combo of map_df() and gather().

Key Points:

• We can see that no missing values are present, so we are OK to move onto Correlation Analysis.

### 4.0 Correlation Analysis

I made a 3-step process to make performing a Correlation Analysis super simple.

#### Step 1 - Convert to Binary Format

We will first convert the features into a binary format using a function called binarize(). The data is feature engineered:

• Numeric data are binned into 4 quantiles (categories) using n_bins.
• Then all data are categorical and the categories are converted to a one-hot encoded binary format. You can use the thresh_infreq parameter to prevent categories with 100s of infrequent categories from adding additional columns.

Donâ€™t worry if the â€śbinarizationâ€ť step sounds like gibberish. Itâ€™s just a step we need to do to tease out correlations.

AGE__-Inf_33 AGE__33_39 AGE__39_48 AGE__48_Inf JOB__admin. JOB__blue.collar JOB__entrepreneur JOB__housemaid JOB__management JOB__retired JOB__self.employed JOB__services JOB__student JOB__technician JOB__unemployed JOB__unknown MARITAL__divorced MARITAL__married MARITAL__single EDUCATION__primary EDUCATION__secondary EDUCATION__tertiary EDUCATION__unknown DEFAULT__no DEFAULT__yes BALANCE__-Inf_72 BALANCE__72_448 BALANCE__448_1428 BALANCE__1428_Inf HOUSING__no HOUSING__yes LOAN__no LOAN__yes CONTACT__cellular CONTACT__telephone CONTACT__unknown DAY__-Inf_8 DAY__8_16 DAY__16_21 DAY__21_Inf MONTH__apr MONTH__aug MONTH__dec MONTH__feb MONTH__jan MONTH__jul MONTH__jun MONTH__mar MONTH__may MONTH__nov MONTH__oct MONTH__sep DURATION__-Inf_103 DURATION__103_180 DURATION__180_319 DURATION__319_Inf CAMPAIGN__-Inf_2 CAMPAIGN__2_3 CAMPAIGN__3_Inf POUTCOME__failure POUTCOME__other POUTCOME__success POUTCOME__unknown TERM_DEPOSIT__no TERM_DEPOSIT__yes
0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0
1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0
0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0
0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0
0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0

Key Points:

• All data is binary
• Numeric data was made categorical. The continuous features were binned into ranges.
• All categorical data (including binned numeric data) were one-hot encoded. This just means they were converted into binary features (0â€™s and 1â€™s)

#### Step 2 - Correlate the Data to the Target

Next, we need to correlate() our â€śbinarizedâ€ť features to the target. Our target (our goal) is whether or not the customers enroll in the â€śTERM DEPOSITâ€ť. However, we need to be aware that our TERM_DEPOSIT column is now 2 columns (binarized):

1. TERM_DEPOSIT__yes
2. TERM_DEPOSIT__no (opposite of TERM_DEPOSIT__yes)

This is perfectly fine because we can just correlate all columns to â€śTERM_DEPOSIT__yesâ€ť, which contains 1â€™s if the customer enrolled.

feature bin correlation
TERM_DEPOSIT no -1.0000000
TERM_DEPOSIT yes 1.0000000
DURATION 319_Inf 0.3178767
POUTCOME success 0.3067882
DURATION -Inf_103 -0.1908996
POUTCOME unknown -0.1670507
CONTACT unknown -0.1509350
HOUSING no 0.1391727
HOUSING yes -0.1391727
CONTACT cellular 0.1358729

Key Points:

• The resulting data contains the feature bin correlation to the TERM_DEPOSIT__yes column. This is our measure of relationship to the target (goal).

• The format is a the format needed for the Correlation Funnel (next)

#### Step 3 - Make a Correlation Funnel!

The final step is to make a correlation funnel. We can correlation data from the previous step, piping (%>%) into the plot_correlation_funnel() function.

Insights are critical to business. Hereâ€™s how we can use the Correlation Funnel to develop a Customer Story.

#### Insights from the Correlation Funnel

From the Correlation Funnel, we can see that the following groups of customers have a much greater correlation with enrollment in the TERM DEPOSIT product.

• When the DURATION, the amount of time a prospect is engaged in marketing campaign material, is 319 seconds or longer.

• When POUTCOME, whether or not a prospect has previously enrolled in a product, is â€śsuccessâ€ť.

• When CONTACT, the medium used to contact the person, is â€ścellularâ€ť

• When HOUSING, whether or not the contact has a HOME LOAN is â€śnoâ€ť

We can then focus our time on exploring the high correlation features rather than plotting 16 different plots (or more if we have many features in our dataset).

#### Develop the Customer Story

Hereâ€™s an example of how we can quickly investigate DURATION, the top correlation feature, which is critical to us being able to develop a customer story. Weâ€™ll make a visualization with ggplot2 that shows the median engagement duration for the marketing campaign. Those enrolling in the TERM DEPOSIT product are spending 3X more time engaging with the campaign.

## Conclusion

R can dramatically speed up the exploratory data analysis process - All we need is a few new tactics. Rather than searching through features, we can use Correlation Analysis to point us in the right direction.

We saw how to implement Correlation Analysis using the new correlationfunnel R package. This pointed us right to the high correlation features in the data set.

The best part is that the Correlation Analysis is only 5 lines of code. We can quickly correlation analysis pipeline by piping (%>%) using our 3 functions: binarize(), correlate(), and plot_correlation_funnel().

This is a lot better than using Excel Pivot Tables. I estimated about 160 Minutes (almost 3 hours of non-stop Pivot Table visualizations, not including possibly missing information or getting sidetracked. Correlation analysis in 5 lines of code is an easy win.

## More Awesome EDA Packages

Want to go even further with Exploratory Data Analysis? Here are several great EDA packages that can help you understand data issues (cleanliness) and get data prepared for Correlation Analysis!

• correlationfunnel - My new correlation analysis R package

• Data Explorer - Automates Exploration and Data Treatment. Amazing for investigating features quickly and efficiently including by data type, missing data, feature engineering, and identifying relationships.

• naniar - For understanding missing data.

• UpSetR - For generating upset plots

• GGally - The ggpairs() function is one of my all-time favorites for visualizing many features quickly.

## Correlation Analysis Success Story

Stephen Lung is my student at Business Science University. Stephen recently placed 3rd in a Tableau competition. Hereâ€™s the AMAZING part - Stephen did NOT know Tableau.

How did Stephen win 3rd place without knowing Tableau?

Stephen beat out peers with 2+ years of Tableau skills. How is this possible? Stephen used R and Correlation Analysis to rapidly iterate through the business dataset, finding features that were critical to the story. The final data product was in Tableau, but R was the real workhorse.

How did Stephen learn R?

Stephen learned R through Business Science University. Stephen is taking the following courses:

1. Business Analysis with R (DS4B 101-R) - A 7-Week course where you learn the foundations of data science for business using R, the tidyverse, parsnip (Machine Learning), dplyr (data wrangling), ggplot2 (visualization), rmarkdown (reporting), and more!

2. Advanced Machine Learning & Business Consulting (DS4B 201-R) - A 10-Week program where you learn a framework for solving business problems with data science, and you apply cutting-edge tools including H2O Automatic Machine Learning, LIME Local Feature Explanation, Correlation Analysis, and ROI Analysis.

3. Learning Labs PRO - A continuous learning system where students receive new 1-hour courses every 2-weeks on advanced topics like wrangling large data sets, creating APIâ€™s, and more!

## References

1. [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014Â