Tidy Parallel Processing in R with furrr
Written by Matt Dancho on September 14, 2021
Parallel processing in the tidyverse couldn’t be easier with the furrr package. If you are familiar with the purrr::map() function, then you’ll love furrr::future_map(), which we’ll use in this FREE R-Tip training to get a 2.6X speed-up in our code.
This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks.
Here are the links to get set up. 👇
Follow along with our Full YouTube Video Tutorial.
Learn how to use furrr in our 5-minute YouTube video tutorial.
Parallel Processing
This tutorial showcases the awesome power of furrr for parallel processing. We’ll get a 2.6X speed boost.
R Package Author Credits
Before we get started, get the R Cheat Sheet. furrr is great for parallel processing, but you’ll need to learn purrr to take full advantage. For these topics, I’ll use the Ultimate R Cheat Sheet to refer to purrr code in my workflow.
Step 1. Download the Ultimate R Cheat Sheet.
Step 2. Then click the “CS” hyperlink to “purrr”.
Step 3. Reference the purrr cheat sheet.
Onto the tutorial.
Load the Libraries
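If you’re following along, the setup likely looks something like this (the exact library calls aren’t shown in this excerpt; tictoc is my addition for the timings reported later):

```r
library(tidyverse)  # dplyr, tidyr, purrr, and friends for data wrangling
library(furrr)      # parallel drop-in replacements for purrr (also attaches future)
library(timetk)     # provides the walmart_sales_weekly dataset
library(tictoc)     # tic()/toc() timers for benchmarking the runs
```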
Get the Data
We’ll use the walmart_sales_weekly dataset from timetk. We will do a bit of data manipulation using:
select(): Used to select columns
set_names(): Used to update the column names
The output is a “tidy” dataset in long format where there are:
- 7 IDs: Each ID represents a Walmart store department
- Date and Value: The date and value combination represents the sales in a given week
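A minimal sketch of that data step (the column choices are inferred from the bullets above, and the tibble name walmart_sales_tbl is my own):

```r
library(tidyverse)
library(timetk)

walmart_sales_tbl <- walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales) %>%       # keep the ID, date, and sales columns
    set_names(c("id", "date", "value"))      # tidy, lowercase column names

walmart_sales_tbl
# A long-format tibble: 7 IDs, one row per (id, week) with that week's sales
```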
Purrr: Nest + Mutate + Map
Next, we’ll use a common sequence of operations to iteratively apply an “expensive” modeling function to each ID (store department) that models the sales data as a function of its trend and month of the year.
Pro-Tip 1: Use the R cheat sheet to refer to purrr functions.
Pro-Tip 2: If you need to master R, then I’ll talk about my 5-Course R-Track at the end of the tutorial. It’s a way to up-skill yourself with the data science skills that organizations demand. I teach purrr iteration and nested structures in the R-Track.
Purrr Model Nested Data
We’ll first perform our “expensive modeling” with purrr, which runs each operation sequentially:
nest(): Converts the data to a “nested” structure, where a list-column called “data” contains each group’s data as a tibble.
map(): We use the combination of mutate() to first add a column called “model” and purrr::map() to iteratively apply an expensive function.
“Expensive Function”: The function that we apply is a linear regression using the lm() function. We use Sys.sleep(1) to simulate an expensive calculation that takes 1 second during each iteration.
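A sketch of the sequential version, assuming the tidied data lives in a tibble I’ll call walmart_sales_tbl (columns id, date, value). The exact model formula isn’t shown in this excerpt, so the trend + month specification below is an illustration:

```r
library(tidyverse)
library(lubridate)
library(tictoc)

# "Expensive" function: a linear model of value ~ trend + month of year,
# padded with a 1-second sleep to simulate a costly calculation
expensive_model <- function(df) {
    Sys.sleep(1)
    lm(value ~ as.numeric(date) + month(date, label = TRUE), data = df)
}

tic()
model_tbl <- walmart_sales_tbl %>%
    group_by(id) %>%
    nest() %>%                                    # list-column "data": one tibble per ID
    mutate(model = map(data, expensive_model))    # sequential: ~1 second per ID
toc()
# Roughly 7 seconds total: 7 IDs x 1-second sleep, run one after another
```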
Purrr Nested Models and Timing
The output is our nested data, now with a column called “model” that contains the 7 linear regression models we just made.
Purrr operations can be expensive. In our case the operation took 7.14 seconds, mainly because we told the “Expensive Function” to sleep for 1 second before making each model.
Furrr: Nest + Mutate + Future Map
Now, we’ll redo the calculation, this time swapping purrr::map() out for furrr::future_map(), which will let us run each calculation in parallel for a speed boost.
Furrr Set Plan and Model Nested Data
The furrr code is the same as before using purrr, with two important changes:
plan(): This allows you to set the number of CPU cores to use when parallel processing. I have 6 cores available on my computer, so I’ll use all 6.
future_map(): We swap in furrr::future_map(), which lets the iterative process run in parallel.
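The parallel version only changes two lines. Here is a sketch under the same assumptions as the sequential run (6 workers per the article; multisession is one reasonable choice of plan, and expensive_model is the illustrative 1-second function, not the video’s exact code):

```r
library(tidyverse)
library(furrr)     # also attaches future, which supplies plan()
library(tictoc)

# Same 1-second "expensive" model as in the sequential run
expensive_model <- function(df) {
    Sys.sleep(1)
    lm(value ~ as.numeric(date) + lubridate::month(date, label = TRUE), data = df)
}

plan(multisession, workers = 6)                   # use 6 CPU cores

tic()
model_tbl <- walmart_sales_tbl %>%
    group_by(id) %>%
    nest() %>%
    mutate(model = future_map(data, expensive_model))  # iterations run in parallel
toc()
# ~2.6 seconds: 7 one-second tasks spread across 6 workers, plus scheduling overhead

plan(sequential)                                  # return to sequential when done
```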
Furrr Nested Models and Timing
The output is the same nested data structure as previously, but we got a 2.6X speed-up (2.57 seconds with furrr vs. 7.14 seconds with purrr).
We learned how to parallel process with furrr. But there’s a lot more to modeling and data science. And if you are just starting out, then this tutorial was probably difficult. That’s OK, because I have a solution.
If you’d like to learn data visualization, data wrangling, shiny apps, and data science for business with R, then read on. 👇
My Struggles with Learning Data Science
It took me a long time to learn data science. I made a lot of mistakes as I fumbled through learning R. I specifically had a tough time navigating the ever-increasing landscape of tools and packages, trying to pick between R and Python, and getting lost along the way.
If you feel like this, you’re not alone. Coding is tough, data science is tough, and connecting it all with the business is tough.
The good news is that, after years of learning, I was able to become a highly-rated business consultant working with Fortune 500 clients and my career advanced rapidly. More than that, I was able to help others in the community by developing open source software that has been downloaded over 1,000,000 times, and I found a real passion for coding.
In fact, that’s the driving reason that I created Business Science to help people like you and me that are struggling to learn data science for business (You can read about my personal journey here).
What I found out is that:
Data science does not have to be difficult; it just has to be taught smartly.
Anyone can learn data science fast, provided they are motivated.
How I can help
If you are interested in learning R and the ecosystem of tools at a deeper level, then I have a streamlined program that will get you past your struggles and improve your career in the process.
It’s called the 5-Course R-Track System. It’s an integrated system containing 5 courses that work together on a learning path. Through 5+ projects, you learn everything you need to help your organization: from data science foundations, to advanced machine learning, to web applications and deployment.
The result is that you break through previous struggles, learning from my experience and our community of 2,000+ data scientists who are ready to help you succeed.
Ready to take the next step? Then let’s get started.
👇 Top R-Tips Tutorials you might like:
- mmtable2: ggplot2 for tables
- ggdist: Make a Raincloud Plot to Visualize Distribution in ggplot2
- ggside: Plot linear regression with marginal distributions
- DataEditR: Interactive Data Editing in R
- openxlsx: How to Automate Excel in R
- officer: How to Automate PowerPoint in R
- DataExplorer: Fast EDA in R
- esquisse: Interactive ggplot2 builder
- gghalves: Half-plots with ggplot2
- rmarkdown: How to Automate PDF Reporting
- patchwork: How to combine multiple ggplots
- Geospatial Map Visualizations in R
Want these tips every week? Join R-Tips Weekly.