# PDF Scraping in R with tabulizer

Written by Jennifer Cooper

This article comes from Jennifer Cooper, a new student in Business Science University. Jennifer is 35% complete with the 101 course - and shows off her progress in this PDF Scraping tutorial. Jennifer has an interest in understanding the plight of wildlife across the world, and uses her new data science skills to perform a useful analysis - scraping PDF tables of a Report on Endangered Species with the tabulizer R package and visualizing alarming trends with ggplot2.

R Packages Covered:

• tabulizer - Scraping PDF tables
• dplyr - Wrangling unclean data & preparation for data visualization
• ggplot2 - Data visualization and understanding trends

# Scraping PDFs and Analyzing Endangered Species

by Jennifer Cooper, Marketing Data Scientist at Skopos Financial

Hey, everybody! Hope everyone has had a great weekend 😀!!

I’ve been “heads down” this weekend working on a special R project. This week I gave myself a challenge to start using R at work and also come up with a project on the side that I could use to help review what I’ve learned so far in Business Science University’s DS4B 101-R course.

In addition to being passionate about data science, I also love animals and am concerned about the plight of wildlife across the world, particularly with climate change. I decided to take a look at data on critically endangered species.

The only information on Endangered Species I could find was in a PDF format, so I spent a lot of time trying to figure out the nuances of tabulizer for scraping PDF. I finally got it done tonight!

Through this process, I discovered I still need a lot more practice, so I’m going to continue seeing what I can do to apply it at work (figure out how to connect to our SQL database this week), carve out more time to practice, and I may write up an article on working with tabulizer and PDFs.

Interested in learning R? Join me in the 101 course.

# My Workflow

Here’s a diagram of the workflow I used:

1. Start with PDF

2. Use tabulizer to extract tables

3. Clean up data into “tidy” format using tidyverse (mainly dplyr)

4. Visualize trends with ggplot2

# Get the PDF

I analyzed the Critically Endangered Species PDF Report.

# PDF Scrape and Exploratory Analysis

## Step 1 - Load Libraries

Load the following libraries to follow along.

Note that tabulizer depends on rJava, which may require some setup. Here are a few pointers:

• Mac Users: If you have issues connecting Java to R, you can try running sudo R CMD javareconf in the Terminal (per this post)

• Windows Users: This blog article provides a step-by-step process for installing rJava on Windows machines.

## Step 2 - Extracting the Tabular Data from PDF

The tabulizer package provides a suite of tools for extracting data from PDFs. The vignette, “Introduction to tabulizer” has a great overview of tabulizer’s features.

We’ll use the extract_tables() function to pull out each of the tables from the Endangered Species Report. This returns a list of data.frames.

The table I’m interested in is the first one - the Critically Endangered Species. I’ll extract it using the pluck() function and convert it to a tibble() (the tidy data frame format). I see that I’m going to need to do a bit of cleanup.

X X.1 X.2 X.3 Critically.Endangered..CR. X.4 X.5 X.6 X.7 X.8
Year Mammals Birds Reptiles Amphibians Fishes Insects Molluscs Other invertebrates NA Plants Fungi & protists
2019 203 224 303 575 549 311 658 263 NA 3,027 14
2018 201 224 287 550 486 300 633 252 NA 2,879 14
2017 202 222 266 552 468 273 625 243 NA 2,722 10
2016 204 225 237 546 461 226 586 211 NA 2,506 8
2015 209 218 180 528 446 176 576 209 NA 2,347 5

## Step 3 - Clean Up Column Names

Next, I want to start by cleaning up the names in my data - which are actually in the first row. I’ll use a trick using slice() to grab the first row, and the new pivot_longer() function to transpose and extract the column names that are in row 1. I can then set_names() and remove row 1.

Year Mammals Birds Reptiles Amphibians Fishes Insects Molluscs Other invertebrates Missing Plants Fungi & protists
2019 203 224 303 575 549 311 658 263 NA 3,027 14
2018 201 224 287 550 486 300 633 252 NA 2,879 14
2017 202 222 266 552 468 273 625 243 NA 2,722 10
2016 204 225 237 546 461 226 586 211 NA 2,506 8
2015 209 218 180 528 446 176 576 209 NA 2,347 5
2014 213 213 174 518 443 168 576 205 NA 2,119 2

## Step 4 - Tidy the Data

There are a few issues with the data:

1. Remove columns with NAs: Column labelled “Missing” is all NA’s - We can just drop this column

2. Fix columns that were combined: Three of the columns are combined - Amphibians, Fishes, and Insects - We can separate() these into 3 columns

3. Convert to (Tidy) Long Format for visualization: The data is in “wide” format, which isn’t tidy - We can use pivot_longer() to convert to “long” format with one observation for each row

4. Fix numeric data stored as character: The numeric data is stored as character and several of the numbers have commas - We’ll remove commas and convert to numeric

5. Convert Character Year & species to Factor: The year and species columns are character - We can convert to factor for easier adjusting of the order in the ggplot2 visualizations

6. Percents by year: The visualizations will have a percent (proportion) included so we can see which species have the most endangered - We can add proportions by each year

Year species number percent label
2019 Mammals 203 0.0331320 3.3%
2019 Birds 224 0.0365595 3.7%
2019 Reptiles 303 0.0494532 4.9%
2019 Amphibians 575 0.0938469 9.4%
2019 Fishes 549 0.0896034 9.0%
2019 Insects 311 0.0507589 5.1%

## Step 5 - Visualize the Data

### Summary Visualization

I made a summary visualization using stacked bar chart to show the alarming trends of critically endangered species over time.

I then faceted the species and visualized the trend over time using a smoother (geom_smooth). Again, we see that each of the species exhibit increasing trends.

# Parting Thoughts

It was really exciting to see my hard work pay off. It took a bit to get going, but I found that tabulizer made PDF extraction manageable. The most challenging part was getting the data into a format that can be easily visualized (the tidyverse really helped as shown in Step 4!). I was particularly excited to see results of my analysis. I want to share with others the alarming trends related to the plight of wildlife, while demonstrating the power of R!

If you’d like to join me, I’m currently learning Data Science for Business in Business Science’s 101 course (Data Science Foundations), and I’ve signed up for 201 Advanced Machine Learning and 102 Shiny Web Applications.

# See what our students are doing:

### Student Success Stories

Real stories of success from our students applying their skills learned at Business Science University to get jobs and help their organizations!

### Student Code Tutorials

Tutorials made by our students using their new skills learned at Business Science University!