# Big Data: Wrangling 4.6M Rows with dtplyr (the NEW data.table backend for dplyr)

Written by Matt Dancho on August 15, 2019

Wrangling Big Data is one of the best features of the R programming language, which boasts a Big Data Ecosystem that contains fast in-memory tools (e.g. data.table) and distributed computational tools (sparklyr). With the NEW dtplyr package, data scientists with dplyr experience gain the benefits of a data.table backend. We saw a 3X speed boost for dplyr!

We’ll go over the pros and cons and what you need to know to get up and running, using a real-world example of Fannie Mae Loan Performance data that, when combined, is 4.6M rows by 55 columns - not super huge, but enough to show off the new and improved dtplyr interface to the data.table package. We’ll end with a Time Study showing a 3X Speed Boost and Learning Recommendations to get you expertise fast.

If you like this article, we have more just like it in our Machine Learning Section of the Business Science Learning Hub.

## 1.0 The 30-Second Summary

We reviewed the latest advance in big data - The NEW dtplyr package, which is an interface to the high performance data.table library.

### Pros

• A 3X speed boost on the data joining and wrangling operations on a 4.6M row data set. The data wrangling operations were performed in 6 seconds with dtplyr vs 18 seconds with dplyr.

• Performs inplace operations (:=), which vastly accelerate big data computations (see the grouped time series lead() operation in the Section 4.7 tutorial)

• Shows the data.table translation (this is really cool!)

### Cons

• For pure speed, you will need to learn all of data.table’s features including managing keys for fast lookups.

• In most cases, data.table will be faster than dtplyr because of overhead in the dtplyr translation process. However, we saw the difference to be very minimal.

• dtplyr is currently in experimental status - testers wanted! File issues and requests here

### What Should You Learn?

Just starting out? Our recommendation is to learn dplyr first, then learn data.table, using dtplyr to bridge the gap.

• Begin with dplyr, which has easy-to-learn syntax and works well for datasets of 1M Rows+.

• Learn data.table as you become comfortable in R. data.table is great for pure speed on data sets 50M Rows+. It has a different “bracketed” syntax that is streamlined but more complex for beginners. However, it has features like fast keyed subsetting and optimization for rolling joins that are out of the scope of this article.

• Use dtplyr as a translation tool to help bridge the gap between dplyr and data.table.

At a bare minimum - Learning dplyr is essential. Learn more about a system for learning dplyr in the Conclusions and Recommendations.

## 2.0 Big Data Ecosystem

R has an amazing ecosystem of tools designed for wrangling Big Data. The 3 most popular tools are dplyr, data.table, and sparklyr. We’ve trained hundreds of students on big data, and our students’ most common Big Data question is, “Which tool should I use, and when?”

Big Data: Data Wrangling Tools By Dataset Size

The “Big Data: Data Wrangling Tools by Dataset Size” graphic comes from Business Science’s Learning Lab 13: Wrangling 4.6M Rows (375 MB) of Financial Data with data.table where we taught students how to use data.table using Fannie Mae’s Financial Data Set. The graphic provides rough guidelines on when to use which tools by dataset row size.

1. dplyr (website) - Used for in-memory calculations. Syntax design and execution emphasize readability over performance. Very good in most situations.

2. data.table (website) - Used for higher in-memory performance. Modifies data inplace for huge speed gains. Easily wrangles data in the range of 10M-50M+ rows.

3. sparklyr (website) - Distributes work across nodes (clusters) and performs work in parallel. Best used on big data (100M+ Rows).

## 3.0 Enter dtplyr: Boost dplyr with a data.table backend

We now have a 4th tool that boosts dplyr using data.table as its backend. The good news is that if you are already familiar with dplyr, you don’t need to learn much to get the gains of data.table!

dtplyr: Bridging the Big Data Gap

The dtplyr package is a new front-end that wraps the High Performance data.table R package. I say new, but dtplyr has actually been around for over 2 years. However, the implementation recently underwent a complete overhaul, vastly improving the functionality. Let’s check out the goals of the package from the dtplyr website: https://dtplyr.tidyverse.org/.

dtplyr for Big Data

Here’s what you need to know:

• Goal: Increase speed of working with big data when using dplyr syntax

• Implementation: The dtplyr package enables the user to write dplyr code. Internally the package translates the code to data.table syntax. When run, the user gains the faster performance of data.table while being able to write the more readable dplyr code.

• Dev Status: The package is still experimental. This means that developers are still in the process of testing the package out, reporting bugs, and improving via feature requests.

## 4.0 Case Study - Wrangling 4.6M Rows (375MB) of Financial Data

Let’s try out the new and improved dtplyr + data.table combination on a large-ish data set.

### 4.1 Bad Loans Cost Millions (and Data Sets are MASSIVE)

Loan defaults cost organizations millions. Further, the datasets are massive. This is a task where data.table and dtplyr will be needed as part of the preprocessing steps prior to building a Machine Learning Model.

### 4.2 Fannie Mae Data Set

The data used in the tutorial can be downloaded from Fannie Mae’s website. We will just be using the 2018 Q1 Acquisition and Performance data set.

A few quick points:

• The 2018 Q1 Performance Data Set we will use is 4.6M rows, enough to send Excel to a grinding halt, crashing your computer in the process.

• For dplyr, it’s actually doable at 4.6M rows. However, if we were to do the full 25GB, we’d definitely want to use data.table to speed things up.

• We’ll do a series of common data manipulation operations, including joins and a grouped time series calculation, to determine which loans become delinquent in the next 3 months.

### 4.3 Install and Load Libraries

In this tutorial, we’ll use the latest Development Version of dtplyr, installed using devtools. All other packages can be installed with install.packages().
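Here’s a minimal sketch of the installation step, assuming the development version lives in the tidyverse/dtplyr GitHub repository:

```r
# Install the development version of dtplyr from GitHub
devtools::install_github("tidyverse/dtplyr")

# Everything else installs from CRAN, e.g.:
# install.packages(c("data.table", "tidyverse", "vroom", "tictoc", "knitr"))
```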

Next, we’ll load the following libraries with library():

• data.table: High-performance data wrangling
• dtplyr: Interface between dplyr and data.table
• tidyverse: Loads dplyr and several other useful R packages
• vroom: Fast reading of delimited files (e.g. csv) with vroom()
• tictoc: Simple timing operations
• knitr: Use the kable() function for nice HTML tables
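
In code, the setup looks like this:

```r
library(data.table) # High-performance data wrangling
library(dtplyr)     # Interface between dplyr and data.table
library(tidyverse)  # Loads dplyr and several other useful R packages
library(vroom)      # Fast reading of delimited files (e.g. csv) with vroom()
library(tictoc)     # Simple timing operations
library(knitr)      # kable() for nice HTML tables
```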

### 4.4 Read the Data

We’ll read the data. The column types are going to be pre-specified to assist in the loading process. The vroom() function does the heavy lifting.

First, I’ll set up the paths to the two files I’ll be reading:

1. Acquisitions_2018Q1.txt - Meta-data about each loan
2. Performance_2018Q1.txt - Time series data set with loan performance characteristics over time

For me, the files are stored in a folder called 2019-08-15-dtplyr. Your paths may be different depending on where the files are stored.
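
Here’s one way to set that up (the variable names are my own):

```r
# Adjust these paths to wherever you stored the downloaded files
path_acq  <- "2019-08-15-dtplyr/Acquisitions_2018Q1.txt"
path_perf <- "2019-08-15-dtplyr/Performance_2018Q1.txt"
```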

#### Read the Loan Acquisition Data

Note that we specify the column names and types to improve reading speed.

The loan acquisition data contains information about the owner of the loan.
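
Here’s a sketch of the read, assuming the raw file is pipe-delimited with no header row. The column names come from the data preview below; for brevity, this version forces the join key to character and lets vroom() guess the remaining types instead of spelling out the full type spec:

```r
col_names_acq <- c(
  "loan_id", "original_channel", "seller_name", "original_interest_rate",
  "original_upb", "original_loan_term", "original_date", "first_pay_date",
  "original_ltv", "original_cltv", "number_of_borrowers", "original_dti",
  "original_borrower_credit_score", "first_time_home_buyer", "loan_purpose",
  "property_type", "number_of_units", "occupancy_status", "property_state",
  "zip", "primary_mortgage_insurance_percent", "product_type",
  "original_coborrower_credit_score", "mortgage_insurance_type",
  "relocation_mortgage_indicator"
)

acquisitions_tbl <- vroom(
  path_acq,
  delim     = "|",
  col_names = col_names_acq,
  na        = c("", "NA"),
  col_types = cols(loan_id = col_character(), .default = col_guess())
)
```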

```
loan_id original_channel seller_name original_interest_rate original_upb original_loan_term original_date first_pay_date original_ltv original_cltv number_of_borrowers original_dti original_borrower_credit_score first_time_home_buyer loan_purpose property_type number_of_units occupancy_status property_state zip primary_mortgage_insurance_percent product_type original_coborrower_credit_score mortgage_insurance_type relocation_mortgage_indicator
100001040173 R QUICKEN LOANS INC. 4.250 453000 360 2018-01-01 2018-03-01 65 65 1 28 791 N C PU 1 P OH 430 NA FRM NA NA N
100002370993 C WELLS FARGO BANK, N.A. 4.250 266000 360 2018-01-01 2018-03-01 80 80 2 41 736 N R PU 1 P IN 467 NA FRM 793 NA N
100005405807 R PMTT4 3.990 233000 360 2017-12-01 2018-01-01 79 79 2 48 696 N R SF 1 P CA 936 NA FRM 665 NA N
100008071646 R OTHER 4.250 184000 360 2018-01-01 2018-03-01 80 80 1 48 767 Y P PU 1 P FL 336 NA FRM NA NA N
100010739040 R OTHER 4.250 242000 360 2018-02-01 2018-04-01 49 49 1 22 727 N R SF 1 P CA 906 NA FRM NA NA N
100012691523 R OTHER 5.375 180000 360 2018-01-01 2018-03-01 80 80 1 14 690 N C PU 1 P OK 730 NA FRM NA NA N
```

Get the size of the acquisitions data set: 426K rows by 25 columns. Not that bad, but this is meta-data for the loan. The dataset we are worried about is the next one.
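
In code:

```r
dim(acquisitions_tbl)
# ~426K rows x 25 columns
```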

#### Read the Loan Performance Data

We read the performance file the same way, then inspect the data. We can see that this is a time series where each “Loan ID” and “Monthly Reporting Period” go together.
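
A sketch of the read, reusing the pattern from the acquisitions file (the 31 column names come from the preview below):

```r
col_names_perf <- c(
  "loan_id", "monthly_reporting_period", "servicer_name",
  "current_interest_rate", "current_upb", "loan_age",
  "remaining_months_to_legal_maturity", "adj_remaining_months_to_maturity",
  "maturity_date", "msa", "current_loan_delinquency_status",
  "modification_flag", "zero_balance_code", "zero_balance_effective_date",
  "last_paid_installment_date", "foreclosed_after", "disposition_date",
  "foreclosure_costs", "prop_preservation_and_repair_costs",
  "asset_recovery_costs", "misc_holding_expenses", "holding_taxes",
  "net_sale_proceeds", "credit_enhancement_proceeds",
  "repurchase_make_whole_proceeds", "other_foreclosure_proceeds",
  "non_interest_bearing_upb", "principal_forgiveness_upb",
  "repurchase_make_whole_proceeds_flag",
  "foreclosure_principal_write_off_amount", "servicing_activity_indicator"
)

performance_tbl <- vroom(
  path_perf,
  delim     = "|",
  col_names = col_names_perf,
  na        = c("", "NA"),
  col_types = cols(loan_id = col_character(), .default = col_guess())
)
```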

```
loan_id monthly_reporting_period servicer_name current_interest_rate current_upb loan_age remaining_months_to_legal_maturity adj_remaining_months_to_maturity maturity_date msa current_loan_delinquency_status modification_flag zero_balance_code zero_balance_effective_date last_paid_installment_date foreclosed_after disposition_date foreclosure_costs prop_preservation_and_repair_costs asset_recovery_costs misc_holding_expenses holding_taxes net_sale_proceeds credit_enhancement_proceeds repurchase_make_whole_proceeds other_foreclosure_proceeds non_interest_bearing_upb principal_forgiveness_upb repurchase_make_whole_proceeds_flag foreclosure_principal_write_off_amount servicing_activity_indicator
100001040173 2018-02-01 QUICKEN LOANS INC. 4.25 NA 0 360 360 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-03-01   4.25 NA 1 359 359 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-04-01   4.25 NA 2 358 358 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-05-01   4.25 NA 3 357 357 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-06-01   4.25 NA 4 356 356 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
100001040173 2018-07-01   4.25 NA 5 355 355 2048-02-01 18140 0 N   NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA   NA N
```

Let’s check out the data size. We can see it’s 4.6M rows by 31 columns! Just a typical financial time series (seriously).
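
In code:

```r
dim(performance_tbl)
# ~4.6M rows x 31 columns
```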

### 4.5 Convert the Tibbles to dtplyr Steps

Next, we’ll use the lazy_dt() function to convert the tibbles to dtplyr steps.

We can check the class() to see what we are working with.

The returned object is the first step in a dtplyr sequence.
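
A sketch of the conversion (the object names are my own):

```r
# Convert the tibbles into lazy dtplyr steps - no computation happens yet
acquisitions_dt <- lazy_dt(acquisitions_tbl)
performance_dt  <- lazy_dt(performance_tbl)

class(acquisitions_dt)
# "dtplyr_step_first" "dtplyr_step"
```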

Key Point:

• We are going to set up operations using a sequence of steps.
• The operations will not be fully evaluated until we convert to a data.table or tibble depending on our desired output.

### 4.6 Join the Data Sets

Our first data manipulation operation is a join. We are going to use the left_join() function from dplyr. Let’s see what happens.
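
A sketch of the join, with the acquisitions steps on the left so every performance record picks up its loan meta-data:

```r
# Still lazy - this only adds a step to the sequence
combined_dt <- acquisitions_dt %>%
  left_join(performance_dt, by = "loan_id")

class(combined_dt)
# "dtplyr_step_subset" "dtplyr_step"
```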

The output of the joining operation is a new step sequence, this time a dtplyr_step_subset.

Next, let’s examine what happens when we print combined_dt to the console.
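
In code:

```r
combined_dt
# The print-out previews the result and shows the data.table
# translation, e.g. Call: _DT2[_DT1, on = .(loan_id)]
```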

Key Points:

• The important piece is the data.table translation code, which we can see in the output: Call: `_DT2[_DT1, on = .(loan_id)]`

• Note that we haven’t executed the data manipulation operation. dtplyr smartly gives us a glimpse of what the operation will look like, though, which is really cool.

### 4.7 Wrangle the Data

We’ll do a sequence of data wrangling operations (sketched in code just after this list):

• Select specific columns we want to keep
• Arrange by loan_id and monthly_reporting_period. This is needed to keep groups together and in the right time-stamp order.
• Group by loan_id and mutate to calculate whether or not loans become delinquent in the next 3 months.
• Filter rows with NA values from the newly created column (these aren’t needed)
• Reorder the columns to put the new calculated column first.
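
Here’s a sketch of that pipeline. The columns kept in the select() are illustrative; the new-column name and the lead() call match the translation quoted below:

```r
final_output_dt <- combined_dt %>%
  # Keep a subset of columns (illustrative selection)
  select(loan_id, monthly_reporting_period, seller_name,
         current_interest_rate, current_upb,
         current_loan_delinquency_status) %>%
  # Keep each loan's time series together and in timestamp order
  arrange(loan_id, monthly_reporting_period) %>%
  # Flag loans that become delinquent within the next 3 months
  group_by(loan_id) %>%
  mutate(gt_1mo_behind_in_3mo =
           lead(current_loan_delinquency_status, n = 3) >= 1) %>%
  # Drop trailing rows where the 3-month lead is undefined
  filter(!is.na(gt_1mo_behind_in_3mo)) %>%
  # Put the newly calculated column first
  select(gt_1mo_behind_in_3mo, everything())
```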

The final output is a dtplyr_step_group, which is just a sequence of steps.

If we print the final_output_dt object, we can see the data.table translation is pretty intense.
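
In code:

```r
final_output_dt
# The print-out shows the full data.table translation, ending in the
# inplace := mutation quoted below
```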

Key Point:

• The most important piece is that dtplyr correctly converted the grouped mutation to an inplace calculation, which is data.table speak for a super-fast calculation that makes no copies of the data. Here’s the inplace calculation code from the dtplyr translation: `[, :=(gt_1mo_behind_in_3mo = lead(current_loan_delinquency_status, n = 3) >= 1), keyby = .(loan_id)]`

### 4.8 Collecting The Data

Note that up until now, nothing has been done to process the data - we’ve just created a recipe for data wrangling. We still need to tell dtplyr to execute the data wrangling operations.

To implement all of the steps and convert the dtplyr sequence to a tibble, we just call as_tibble().
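
In code (final_tbl is my name for the collected result):

```r
# Forces evaluation of every step in the sequence and returns a tibble
final_tbl <- final_output_dt %>%
  as_tibble()
```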

Key Point:

• Calling the as_tibble() function tells dtplyr to execute the data.table wrangling operations.

## 5.0 The 3X Speedup - Time Comparisons

Finally, let’s check the performance of dplyr vs dtplyr vs data.table. We can see a nice 3X speed boost!
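
Here’s a minimal sketch of how the timing comparison can be set up with tictoc. The run_pipeline() helper is hypothetical - it wraps the Section 4 join-and-wrangle steps so the identical dplyr code can run against plain tibbles (pure dplyr) or against lazy_dt() steps (dtplyr). A pure data.table version would be written directly in data.table syntax and is omitted here:

```r
# Hypothetical helper wrapping the Section 4 steps
run_pipeline <- function(acq, perf) {
  acq %>%
    left_join(perf, by = "loan_id") %>%
    arrange(loan_id, monthly_reporting_period) %>%
    group_by(loan_id) %>%
    mutate(gt_1mo_behind_in_3mo =
             lead(current_loan_delinquency_status, n = 3) >= 1) %>%
    filter(!is.na(gt_1mo_behind_in_3mo))
}

tic("dplyr")
result_dplyr <- run_pipeline(acquisitions_tbl, performance_tbl)
toc()
# ~18 sec in our tests

tic("dtplyr")
result_dtplyr <- run_pipeline(acquisitions_dt, performance_dt) %>%
  as_tibble() # forces execution of the lazy steps
toc()
# ~6 sec in our tests - roughly a 3X speedup
```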

## 6.0 Conclusions and Learning Recommendations

For Big Data wrangling, the dtplyr package represents a huge opportunity for data scientists to leverage the speed of data.table with the readability of dplyr. We saw an impressive 3X Speedup going from dplyr to using dtplyr for wrangling a 4.6M row data set. This just scratches the surface of the potential, and I’m looking forward to seeing dtplyr mature, which will help bridge the gap between the two groups of data scientists using dplyr and data.table.

For new data scientists coming from other tools like Excel, my hope is that you see the awesome potential of learning R for data analysis and data science. The Big Data capabilities represent a massive opportunity for you to bring your organization data science at scale.

### You just need to learn how to go from normal data to Big Data.

My recommendation is to start by learning dplyr - The popular data manipulation library that makes reading and writing R code very easy to understand.

Once you get to an intermediate level, learn data.table. This is where you gain the benefits of scaling data science to Big Data. The data.table package has a steeper learning curve, but learning it will help you leverage its full performance and scalability.

If you need to learn dplyr as fast as possible - I recommend beginning with our Data Science Foundations DS4B 101-R Course. The 101 Course is available as part of the 3-Course R-Track Bundle, a complete learning system designed to transform you from beginner to advanced in under 6-months. You will learn everything you need to become an expert data scientist.

## 7.0 Additional Big Data Guidelines

I find that students have an easier time picking a tool based on dataset row size (e.g. I have 10M rows, what should I use?). With that said, there are 2 factors that will influence which tools you need to use:

1. Are you performing Grouped and Iterative Operations? Performance even on normal data sets can become an issue if you have a lot of groups or if the calculation is iterative. A particular source of pain in the financial realm is rolling (window) calculations, which are both grouped and iterative within groups. In these situations, use high-performance C++ functions (e.g. rolling functions from the roll package or the RcppRoll package - see the sketch after this list).

2. Do you have sufficient RAM? Once you begin working with gigabytes of data, you start to run out of memory (RAM). In these situations, you will need to work in chunks and parallelize operations. You can do this with sparklyr, which will perform some operations in parallel and distribute work across nodes.
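
For point 1, here’s a minimal sketch of a grouped rolling calculation using roll_mean() from the RcppRoll package. It assumes the collected final_tbl from Section 4.8, and the 3-month trailing window on current_upb is purely illustrative:

```r
library(dplyr)
library(RcppRoll)

final_tbl %>%
  group_by(loan_id) %>%
  mutate(
    # 3-month trailing average of the unpaid balance, computed in C++
    upb_roll_3mo = roll_mean(current_upb, n = 3, fill = NA, align = "right")
  ) %>%
  ungroup()
```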

## 8.0 Recognizing the Developers

I’d like to take a quick moment to thank the developers of data.table and dplyr. Without these two packages, Business Science probably would not exist. Thank you.

## 9.0 Coming Soon - Expert Shiny Apps Course!

I’m very excited to announce that Business Science has an Expert Shiny Course - Coming soon! Head over to Business Science University and create a free account. I will update you with the details shortly.