How to Scrape PDF Documents and Search PDFs with OpenAI LLMs (in R)

Written by Matt Dancho



Hey guys, welcome back to my R-tips newsletter. Businesses are sitting on a mountain of unstructured data. The biggest culprit is PDF documents. Today, I’m going to share how to scrape text from PDFs and use OpenAI’s Large Language Models (LLMs) to summarize it in R.

Table of Contents

Here’s what you’re learning today:

  • How to scrape PDF documents. I’ll explain how to scrape the text from your business’s PDF documents using pdftools.
  • How I summarize PDFs using OpenAI’s LLMs in R. This will blow your mind.

Get the Code (In the R-Tip 078 Folder)


SPECIAL ANNOUNCEMENT: AI for Data Scientists Workshop on December 18th

Inside the workshop I’ll share how I built a SQL-Writing Business Intelligence Agent with Generative AI:

Generative AI for Data Scientists

What: GenAI for Data Scientists

When: Wednesday December 18th, 2pm EST

How It Will Help You: Whether you are new to data science or are an expert, Generative AI is changing the game. There’s a ton of hype. But how can Generative AI actually help you become a better data scientist and help you stand out in your career? I’ll show you inside my free Generative AI for Data Scientists workshop.

Price: Does Free sound good?

How To Join: 👉 Register Here


R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

Businesses are Sitting on Millions of Dollars of Unstructured Data (and they don’t know how to use it)

Fact: 90% of businesses are not using their unstructured data. It’s true. Many companies have no clue how to extract it. And once they extract it, they have no clue how to use it.

We’re going to solve both problems in this R-Tip.

The most common form of unstructured data is text located in PDF documents.

Businesses have 100,000s of PDF documents that contain valuable information.

PDF Data

OpenAI Document Summarization

One of the best use cases of LLMs is document summarization. But how do we get PDF data to OpenAI?

One easy way is in R!

R Tutorial: Scrape PDF Documents and Summarize with OpenAI

This is a simple 2-step process we’ll cover today:

  1. Extract PDF Text: We’ll use pdftools to extract text
  2. Summarize Text with OpenAI’s LLMs: We’ll use httr to connect to OpenAI’s API and summarize our PDF document

Business Objective:

I have set up a PDF document of Meta’s 2024 10K Financial Statement. We’ll use this document to analyze the risks that Meta reported in their filing (without even reading the document).

This is a massive speed up - and I can ask even more questions too beyond just the risks to really understand Meta’s business.

Good questions to ask for this financial case study:

  1. What are the top 3 risks to Meta’s business?
  2. Where does Meta gain most of its revenue?
  3. In which business line is Meta’s revenue growing the most?

Get the PDF and Code

You can get the PDF and Code by joining the R-Tips Newsletter here.

R-Tip 078 Folder

Get the PDF and Code (In the R-Tip 078 Folder)

Load the Libraries

Next, load the libraries. Here’s what we’re using today:

Load Libraries
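A minimal sketch of what this setup might look like. pdftools and httr are the packages named in this tutorial; jsonlite is an assumption on my part (httr relies on it when encoding a request body as JSON), so adjust the list to match the code in the R-Tip folder.

```r
# Packages used in this walkthrough.
# Assumption: jsonlite is not named in the article; it is included here
# because httr uses it for JSON request bodies.
required_pkgs <- c("pdftools", "httr", "jsonlite")

# Check which packages are installed without attaching them yet
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

if (all(installed)) {
  library(pdftools)  # PDF text extraction
  library(httr)      # HTTP requests to OpenAI's API
  library(jsonlite)  # JSON handling
} else {
  message(
    "Missing packages: ",
    paste(required_pkgs[!installed], collapse = ", "),
    ". Install them with install.packages() first."
  )
}
```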

Step 1: Extract PDF Text

With our project set up and libraries loaded, next I’m extracting the PDF text. It’s very easy to do in 1 line of code with pdftools::pdf_text().

Extract PDF Text
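Here is a sketch of the extraction step. The file path is hypothetical, so point it at wherever you saved the PDF. The guard around the call is only there so the snippet runs cleanly if the file or the package is missing.

```r
# Hypothetical path: replace with the location of your own PDF
pdf_path <- "pdf/meta_10k_filing.pdf"

if (file.exists(pdf_path) && requireNamespace("pdftools", quietly = TRUE)) {
  # The one-liner: pdf_text() reads every page of the PDF
  text <- pdftools::pdf_text(pdf_path)

  print(length(text))            # number of pages extracted
  cat(substr(text[1], 1, 500))   # preview the first 500 characters of page 1
} else {
  message("Skipping extraction: PDF file or pdftools package not found.")
}
```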

This returns a character vector with one element per page, 147 pages in total for Meta’s 10K Financial Statement. You can see the text on each page by cycling through text[1], text[2] and so on.

Step 2: Summarize the PDF Document with OpenAI LLMs

A common task: I want to know what risks Meta has identified in their 10K Financial Statement. The SEC requires companies to disclose these risks. But I don’t want to have to dig through the document.

The solution is to use OpenAI to summarize the document.

We will just summarize the first 30,000 characters in the document. There are more advanced approaches, such as building a vector store over the full document, but I’ll save that for a follow-up post.
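One way to take the first 30,000 characters is to collapse the pages into a single string and cut it with substr(). This is a sketch in base R; the helper name first_n_chars() and the short stand-in text vector are made up for illustration.

```r
# Hypothetical helper: join the pages, then keep the first n characters
first_n_chars <- function(pages, n = 30000) {
  full_text <- paste(pages, collapse = "\n")
  substr(full_text, 1, n)
}

# Stand-in for the pdf_text() output (one element per page)
text <- c("Page one text.", "Page two text.", "Page three text.")

text_chunk <- first_n_chars(text)
nchar(text_chunk)  # never more than 30,000
```

A hard character cutoff is crude, since it can stop mid-sentence, but it keeps the request small enough to send in a single call.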

Run this code to set up OpenAI and our prompt:

Note that I have my OpenAI API key set up. I’m not going to dive into all of that. OpenAI has great documentation to set it up.

OpenAI Prompt Set Up
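A sketch of how the prompt setup could look. Several things here are assumptions: the API key is read from an environment variable named OPENAI_API_KEY, the model is "gpt-4o-mini", and build_request_body() is a hypothetical helper. The messages structure follows OpenAI's Chat Completions format.

```r
# Assumption: your key is stored in the OPENAI_API_KEY environment variable
api_key <- Sys.getenv("OPENAI_API_KEY")

# Stand-in for the first 30,000 characters of the extracted PDF text
text_chunk <- "<first 30,000 characters of the PDF text>"

question <- "What are the top 3 risks to Meta's business?"

# Hypothetical helper: builds the request body for the Chat Completions API.
# Assumption: the model name is illustrative; use whichever model you have access to.
build_request_body <- function(question, context, model = "gpt-4o-mini") {
  list(
    model = model,
    messages = list(
      list(
        role = "system",
        content = "You answer questions using only the document text provided."
      ),
      list(
        role = "user",
        content = paste0("Document text:\n", context, "\n\nQuestion: ", question)
      )
    )
  )
}

body <- build_request_body(question, text_chunk)
str(body, max.level = 2)  # inspect the structure before sending it
```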

Run this code to send the text and get OpenAI’s response:

I’m using httr to send a POST request to OpenAI’s API. Then OpenAI provides a response with the answer to my question in the context of the text I provided it.

Connect to OpenAI API
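A sketch of the POST request with httr, assuming the Chat Completions endpoint and a key stored in OPENAI_API_KEY. The function name ask_openai() and the tiny stand-in body are hypothetical. The request is only sent if a key is actually set.

```r
# Hypothetical helper: sends a request body to OpenAI's Chat Completions endpoint
ask_openai <- function(body, api_key = Sys.getenv("OPENAI_API_KEY")) {
  response <- httr::POST(
    url = "https://api.openai.com/v1/chat/completions",
    httr::add_headers(Authorization = paste("Bearer", api_key)),
    body = body,
    encode = "json"  # httr serializes the list to JSON and sets the content type
  )
  httr::stop_for_status(response)  # raise an R error on a 4xx/5xx response
  response
}

# Stand-in body; in practice this holds your question plus the PDF text.
# Assumption: the model name is illustrative.
body <- list(
  model = "gpt-4o-mini",
  messages = list(list(role = "user", content = "Say hello."))
)

# Only call the API when a key is available
if (nzchar(Sys.getenv("OPENAI_API_KEY")) && requireNamespace("httr", quietly = TRUE)) {
  response <- ask_openai(body)
  print(httr::status_code(response))
} else {
  message("OPENAI_API_KEY not set (or httr missing); skipping the request.")
}
```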

Run this Code to Parse the OpenAI Response

In just a couple seconds, I have a response from OpenAI’s API. Run this code to parse the response.

Parse OpenAI API Response
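A sketch of the parsing step. In the Chat Completions response format, the answer text sits in the first element of choices, under message and then content. The helper extract_answer() is hypothetical, and the mock list below stands in for a real response so the snippet runs without an API call.

```r
# Hypothetical helper: pulls the answer text out of a parsed chat response
extract_answer <- function(parsed) {
  parsed$choices[[1]]$message$content
}

# Mock object with the same nesting as a parsed Chat Completions response
mock_parsed <- list(
  choices = list(
    list(message = list(role = "assistant", content = "Example answer text."))
  )
)

answer <- extract_answer(mock_parsed)
cat(answer)

# With a real httr response object, parse it first:
# parsed <- httr::content(response, as = "parsed")
# cat(extract_answer(parsed))
```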

Review the Response

Last, we can review the response from OpenAI’s Chat API. We can see that the top 3 risks are:

  1. Regulatory Compliance
  2. User Privacy and Trust Issues
  3. Competition and Innovation Risks

OpenAI Chat API Response

Conclusions:

You’ve learned my secret 2-step process for scraping PDF documents and using LLMs like OpenAI’s Chat API to summarize text data in R. But there’s a lot more to becoming an elite data scientist.

If you are struggling to become a Data Scientist for Business, then please read on…

Need to advance your business data science skills?

I’ve helped 6,107+ students learn data science for business from an elite business consultant’s perspective.

I’ve worked with Fortune 500 companies like S&P Global, Apple, MRM McCann, and more.

And I built a training program that gets my students life-changing data science careers (don’t believe me? see my testimonials here):

6-Figure Data Science Job at CVS Health ($125K)
Senior VP Of Analytics At JP Morgan ($200K)
50%+ Raises & Promotions ($150K)
Lead Data Scientist at Northwestern Mutual ($175K)
2X-ed Salary (From $60K to $120K)
2 Competing ML Job Offers ($150K)
Promotion to Lead Data Scientist ($175K)
Data Scientist Job at Verizon ($125K+)
Data Scientist Job at CitiBank ($100K + Bonus)

Whenever you are ready, here’s the system they are taking:

Here’s the system that has gotten aspiring data scientists, career transitioners, and lifelong learners data science jobs and promotions…

What They're Doing - 5 Course R-Track

Join My 5-Course R-Track Program Now!
(And Become The Data Scientist You Were Meant To Be...)

P.S. - Samantha landed her NEW Data Science R Developer job at CVS Health (Fortune 500). This could be you.

Success Samantha Got The Job