Big Data: Wrangling 4.6M Rows with dtplyr (the NEW data.table backend for dplyr)
Written by Matt Dancho on August 15, 2019
Wrangling Big Data is one of the best features of the
R programming language, which boasts a Big Data Ecosystem that contains fast in-memory tools (e.g.
data.table) and distributed computational tools (
sparklyr). With the NEW
dtplyr package, data scientists with
dplyr experience gain the benefits of
data.table backend. We saw a 3X speed boost for
We’ll go over the pros and cons and what you need to know to get up and running using a real world example of Fannie Mae Loan Performance that when combined is 4.6M Rows by 55 Columns - Not super huge, but enough to show off the new and improved
dtplyr interface to the
data.table package. We’ll end with a Time Study showing a 3X Speed Boost and Learning Recommendations to get you expertise fast.
Like this article? Here are more just like it!
Table of Contents
1.0 The 30-Second Summary
We reviewed the latest advance in big data - The NEW
dtplyr package, which is an interface to the high performance
A 3X speed boost on the data joining and wrangling operations on a 4.6M ROw data set. The data wrangling operatiosn were performed in 6 seconds with
dtplyrvs 18 seconds with
Performs inplace operations (
:=), which vastly accelerates big data computations (see grouped time series
lead()operation in Section 3.7 tutorial)
data.tabletranslation (this is really cool!)
For pure speed, you will need to learn all of
data.table’s features including managing keys for fast lookups.
In most cases,
data.tablewill be faster than
dtplyrbecause of overhead in the
dtplyrtranslation process. However, we saw the difference to be very minimal.
dtplyris in experimental status currently - Tester’s wanted, file issues and requests here
What Should You Learn?
Just starting out? Our recommendation is to learn
dplyr first, then learn
dtplyr to bridge the gap
dplyr, which has easy-to-learn syntax and works well for datasets of 1M Rows+.
data.tableas you become comfortable in R.
data.tableis great for pure speed on data sets 50M Rows+. It has a different “bracketed” syntax that is streamlined but more complex for beginners. However, it has features like fast keyed subsetting and optimization for rolling joins that are out of the scope of this article.
dtplyras a translation tool to help bridge the gap between
At a bare minimum - Learning
dplyr is essential. Learn more about a system for learning
dplyr in the Conclusions and Recommendations.
Before we get started, get the Cheat Sheet
The most powerful tool in my arsenal is NOT my knowledge of the key R packages, but it’s knowing where to find R packages and documentation.
The Ultimate R Cheat Sheet consolidates the documentation on every package I use frequently (including
If you tab through to page 3, you’ll see a section called “Speed and Scale”. You can quickly see options to help including
If you click the “CS” (stands for cheat sheet) next to
data.table, you’ll automagically get the DataTable Cheat Sheet. (And you can do this for every R package).
2.0 Big Data Ecosystem
R has an amazing ecosystem of tools designed for wrangling Big Data. The 3 most popular tools are
sparklyr. We’ve trained hundreds of students on big data, and our students most common Big Data question is, “Which tool to use and when?”
Big Data: Data Wrangling Tools By Dataset Size
The “Big Data: Data Wrangling Tools by Dataset Size” graphic comes from Business Science’s Learning Lab 13: Wrangling 4.6M Rows (375 MB) of Financial Data with data.table where we taught students how to use
data.table using Fannie Mae’s Financial Data Set. The graphic provides rough guidelines on when to use which tools by dataset row size.
dplyr (website) - Used for in-memory calculations. Syntax design and execution emphasizes readability over performance. Very good in most situations.
data.table (website) - Used for higher in-memory performance. Modifies data inplace for huge speed gains. Easily wrangles data in the range of 10M-50M+ rows.
sparklyr (website) - Distribute work across nodes (clusters) and performs work in parallel. Best used on big data (100M+ Rows).
3.0 Enter dtplyr: Boost dplyr with data.table backend
We now have a 4th tool that boosts
data.table as its backend. The good news is that if you are already familiar with
dplyr, you don’t need to learn much to get the gains of
dtplyr: Bridging the Big Data Gap
The dtplyr package is a new front-end that wraps the High Performance
data.table R package. I say new, but
dtplyr has actually been around for over 2 years. However, the implementation recently underwent a complete overhaul vastly improving the functionality. Let’s check out the goals the package from the
dtplyr website: https://dtplyr.tidyverse.org/.
dtplyr for Big Data
Here’s what you need to know:
Goal: Increase speed of working with big data when using
dtplyrpackage enables the user to write
dplyrcode. Internally the package translates the code to
data.tablesyntax. When run, the user gains the faster performance of
data.tablewhile being able to write the more readable
Dev Status: The package is still experimental. This means that developers are still in the process of testing the package out, reporting bugs, and improving via feature requests.
4.0 Case Study - Wrangling 4.6M Rows (375MB) of Financial Data
Let’s try out the new and improved
data.table combination on a large-ish data set.
4.1 Bad Loans Cost Millions (and Data Sets are MASSIVE)
Loan defaults cost organization millions. Further, the datasets are massive. This is a task where
dtplyr will be needed as part of the preprocessing steps prior to building a Machine Learning Model.
4.2 Fannie Mae Data Set
The data used in the tutorial can be downloaded from Fannie Mae’s website. We will just be using the 2018 Q1 Acquisition and Performance data set.
A few quick points:
The 2018 Q1 Performance Data Set we will use is 4.6M rows, enough to send
Excelto a grinding hault, crashing your computer in the process.
dplyr, it’s actually do-able at 4.6M rows. However, if we were to do the full 25GB, we’d definitely want to use
data.tableto speed things up.
We’ll do a series of common data manipulation operations including joins and grouped time series calculation to determine which loans become delinquent in the next 3 months.
4.3 Install and Load Libraries
In this tutorial, we’ll use the latest Development Version of
dtplyr installed using
devtools. All other packages used can be used by installing with
Next, we’ll load the the following libraries with
data.table: High-performance data wrangling
dtplyr: Interface between
dplyrand several other useful R packages
vroom: Fast reading of delimited files (e.g. csv) with
tictoc: Simple timing operations
knitr: Use the
kable()function for nice HTML tables
4.4 Read the Data
We’ll read the data. The column-types are going to be pre-specified to assist in the loading process. The
vroom() function does the heavy lifting.
First, I’ll setup the paths to the two files I’ll be reading:
- Acquisitions_2018Q1.txt - Meta-data about each loan
- Performance_2018Q1.txt - Time series data set with loan performance characteristics over time
For me, the files are stored in a folder called
2019-08-15-dtplyr. Your paths may be different depending on where the files are stored.
Read the Loan Acquisition Data
Note we specify the columns and types to improve the speed of reading the columns.
The loan acquisition data contains information about the owner of the loan.
|100001040173||R||QUICKEN LOANS INC.||4.250||453000||360||2018-01-01||2018-03-01||65||65||1||28||791||N||C||PU||1||P||OH||430||NA||FRM||NA||NA||N|
|100002370993||C||WELLS FARGO BANK, N.A.||4.250||266000||360||2018-01-01||2018-03-01||80||80||2||41||736||N||R||PU||1||P||IN||467||NA||FRM||793||NA||N|
Get the size of the acquisitions data set: 426K rows by 25 columns. Not that bad, but this is meta-data for the loan. The dataset we are worried about is the next one.
Read the Loan Performance Data
Let’s inspect the data. We can see that this is a time series where each “Loan ID” and “Monthly Reporting Period” go together.
|100001040173||2018-02-01||QUICKEN LOANS INC.||4.25||NA||0||360||360||2048-02-01||18140||0||N||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA||N|
Let’s check out the data size. We can see it’s 4.6M rows by 31 columns! Just a typical financial time series (seriously).
4.5 Convert to Tibbles to dtplyr Steps
Next, we’ll use the
lazy_dt() function to convert the
We can check the
class() to see what we are working with.
The returned object is the first step in a
- We are going to set up operations using a sequence of steps.
- The operations will not be fully evaluated until we convert to a
tibbledepending on our desired output.
4.6 Join the Data Sets
Our first data manipulation operation is a join. We are going to use the
left_join() function from
dplyr. Let’s see what happens.
The output of the joining operation is a new step sequence, this time a
Next, let’s examine what happens when we print
combined_dt to the console.
The important piece is the
data.tabletranslation code, which we can see in the ouput:
Call: _DT2[_DT1, on = .(loan_id)]
Note that we haven’t excecuted the data manipulation operation.
dtplyrsmartly gives us a glimpse of what the operation will look like though, which is really cool.
4.7 Wrangle the Data
We’ll do a sequence of data wrangling operations:
- Select specific columns we want to keep
- Arrange by
monthly_reporting_period. This is needed to keep groups together and in the right time-stamp order.
- Group by
loan_idand mutate to calculate whether or not loans become delinquent in the next 3 months.
- Filter rows with
NAvalues from the newly created column (these aren’t needed)
- Reorder the columns to put the new calculated column first.
The final output is a
dtplyr_step_group, which is just a sequence of steps.
If we print the
final_output_dt object, we can see the data.table translation is pretty intense.
- The most important piece is that
dtplyrcorrectly converted the grouped mutation to an inplace calculation, which is
data.tablespeak for a super-fast calculation that makes no copies of the data. Here’s inplace calculation code from the
[, :=(gt_1mo_behind_in_3mo = lead(current_loan_delinquency_status, n = 3) >= 1), keyby = .(loan_id)]
4.8 Collecting The Data
Note that up until now, nothing has been done to process the data - we’ve just created a recipe for data wrangling. We still need tell
dtplyr to execute the data wrangling operations.
To implement all of the steps and convert the
dtplyr sequence to a
tibble, we just call
- Calling the
dtplyrto execute the
5.0 The 3X Speedup - Time Comparisons
Finally, let’s check the performance of the
data.table. We can seed a nice 3X speed boost!
5.1 Time using dplyr
5.2 Time using dtplyr
5.3 Time using data.table
6.0 Conclusions and Learning Recommendations
For Big Data wrangling, the
dtplyr package represents a huge opportunity for data scientists to leverage the speed of
data.table with the readability of
dplyr. We saw an impressive 3X Speedup going from
dplyr to using
dtplyr for wrangling a 4.6M row data set. This just scratches the surface of the potential, and I’m looking forward to seeing
dtplyr mature, which will help bridge the gap between the two groups of data scientists using
For new data scientists coming from other tools like
Excel, my hope is that you see the awesome potential of learning
R for data analysis and data science. The Big Data capabilities represent a massive opportunity for you to bring your organization data science at scale.
You just need to learn how to go from normal data to Big Data.
My recommendation is to start by learning
dplyr - The popular data manipulation library that makes reading and writing R code very easy to understand.
Once you get to an intermediate level, learn
data.table. This is where you gain the benefits of scaling data science to Big Data. The
data.table package has a steeper learning curve, but learning it will help you leverage its full performance and scalability.
If you need to learn
dplyr as fast as possible - I recommend beginning with our Data Science Foundations DS4B 101-R Course. The 101 Course is available as part of the R-Track System, a complete learning system designed to transform you from beginner to advanced in under 6-months. You will learn everything you need to become an expert data scientist.
7.0 Additional Big Data Guidelines
I find that students have an easier time picking a tool based on dataset row size (e.g. I have 10M rows, what should I use?). With that said, there are 2 factors that will influence whhich tools you need to use:
Are you performing Grouped and Iterative Operations? Performance even on normal data sets can become an issue if you have a lot of groups or if the calculation is iterative. A particular source of pain in the financial realm are rolling (window) calculations, which are both grouped and iterative within groups. In these situation, use high-performance C++ functions (e.g. Rolling functions from the
Do you have sufficient RAM? Once you begin working with gig’s of data, then you start to run out of memory (RAM). In these situations, you will need to work in chunks and parellelizing operations. You can do this with distributed
sparklyr, which will perform some operations in parallel and distribute across nodes.
8.0 Recognizing the Developers
I’d like to take a quick moment to thank the developers of
dplyr. Without these two packages, Business Science probably would not exist. Thank you.