Create A Pandas Dataframe AI Agent With Generative AI, Python And OpenAI
Written by Matt Dancho
Hey guys, this is the first article in my NEW GenAI / ML Tips Newsletter. Today, we’re diving into the world of Generative AI and exploring how it can help companies automate common data science tasks. Specifically, we’ll learn how to create a Pandas dataframe agent that can answer questions about your dataset using Python, Pandas, LangChain, and OpenAI’s API. Let’s get started!
This is what you are making today
We’ll use this Generative AI Workflow to combine data (from CSVs or SQL databases) with a Pandas Data Frame Agent that helps us produce common analytics outputs like visualizations and reports.
Get the Code (In the AI-Tip 001 Folder)
SPECIAL ANNOUNCEMENT: AI for Data Scientists Workshop on December 18th
Inside the workshop I’ll share how I built a SQL-Writing Business Intelligence Agent with Generative AI:
What: GenAI for Data Scientists
When: Wednesday December 18th, 2pm EST
How It Will Help You: Whether you are new to data science or are an expert, Generative AI is changing the game. There’s a ton of hype. But how can Generative AI actually help you become a better data scientist and help you stand out in your career? I’ll show you inside my free Generative AI for Data Scientists workshop.
Price: Does Free sound good?
How To Join: 👉 Register Here
GenAI/ML-Tips Weekly
This article is part of GenAI/ML Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common Data Science and Generative AI coding tasks. Pretty cool, right?
Here is the link to get set up. 👇
Get the Code (In the GenAI/ML Tip 001 Folder)
This Tutorial is Available in Video (9-minutes)
I have a 9-minute video that walks you through setting up the Pandas Data Frame Agent and running data analysis with it. 👇
Generative AI, powered by models like OpenAI’s GPT series, is reshaping the data science landscape. These models can understand and generate human-like text, making it possible to interact with data in more intuitive ways. By integrating Generative AI into data science, you can:
- Automate Data Insights: Quickly generate summaries and insights from complex datasets.
- Enhance Decision Making: Obtain answers to specific questions without manually sifting through data.
- Improve Accessibility: Make data science more accessible to non-technical stakeholders.
Creating a Pandas dataframe agent combines the power of AI with data science, letting you explore and interpret your data using Natural Language prompts.
What is a Pandas Data Frame Agent?
A Pandas Data Frame Agent automates common Pandas operations from Natural Language inputs.
It can be used to perform:
- GroupBy + Aggregate
- Math calculations (that normal LLMs struggle with)
- Filters
- Pivots
- Window calculations
- Resampling (Time Series)
- Binning
- Log Transformations
- Summary Statistics (Mean, Median, IQR, Min/Max, Count (frequency), etc.)
All from Natural Language prompts.
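To make this concrete, here is a rough sketch of the kind of Pandas code an agent like this writes and runs behind the scenes for a few of these operations. The tiny data frame and its column names (geography, sales) are made up for illustration and are not the tutorial dataset.

```python
import pandas as pd

# Toy data for illustration only (not the tutorial's customer dataset)
df = pd.DataFrame({
    "geography": ["North", "South", "North", "West"],
    "sales": [100.0, 250.0, 50.0, 75.0],
})

# GroupBy + Aggregate: total sales per geography
totals = df.groupby("geography", as_index=False)["sales"].sum()

# Filter: keep only rows with sales above 60
big_orders = df[df["sales"] > 60]

# Binning: label each sale as "low" (0 to 100) or "high" (100 to 300)
df["sales_bin"] = pd.cut(df["sales"], bins=[0, 100, 300], labels=["low", "high"])

# Summary statistics: mean, median, min, max, and count in one call
summary = df["sales"].agg(["mean", "median", "min", "max", "count"])

print(totals)
print(summary)
```

The point of the agent is that you describe the result you want in plain English and it picks operations like these for you.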
Make A Pandas Data Frame Agent
Let’s walk through the steps to create a Pandas data frame agent that can answer questions about a dataset using Python, OpenAI’s API, Pandas, and LangChain.
Quick Reminder: You can get all of the code and datasets shown in a Python Script and Jupyter Notebook when you join my GenAI/ML Tips Newsletter.
Code Location: /001_pandas_dataframe_agent
Step 1: Setting Up the Python Environment
First, you’ll need to set up your Python environment and install the required libraries.
pip install openai langchain langchain_openai langchain_experimental pandas plotly pyyaml
Next, import the libraries.
Then run this to access our utility function, parse_json_to_dataframe().
The last part is to set up your OpenAI API Key. Make sure to get an API Key from OpenAI’s API website.
Note: Replace ‘credentials.yml’ with the path to your YAML file containing the OpenAI API key or set the ‘OPENAI_API_KEY’ environment variable directly.
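Here is a minimal sketch of one way to handle the key, assuming the YAML file stores it under a top-level entry named openai. That entry name and the load_openai_key() helper are assumptions for illustration, so adjust them to match your own file.

```python
import os


def load_openai_key(path="credentials.yml", yaml_key="openai"):
    """Return the OpenAI API key and make sure OPENAI_API_KEY is set.

    Prefers an existing environment variable; otherwise reads the YAML file.
    The `yaml_key` name is an assumption about how the file is laid out.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        import yaml  # from the pyyaml package; only needed for the file fallback

        with open(path) as f:
            key = yaml.safe_load(f)[yaml_key]
        # Export it so libraries that read the environment can find it
        os.environ["OPENAI_API_KEY"] = key
    return key
```

Checking the environment variable first means the same script works on your laptop (with a credentials file) and on a server (with the variable set), and keeps the key out of your code.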
Step 2: Loading and Exploring the Dataset
Load your dataset into a Pandas DataFrame. For this tutorial, we’ll use a sample customer data CSV file. But you could easily use any data that you can get into a Pandas Data Frame:
- SQL Database
- CSV
- Excel File
Run this code to load the customer dataset:
This dataset contains customer information, including sales and geography data.
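If you don’t have the tutorial files handy, here is a small sketch of the loading step. The load_customer_data() helper, the in-memory CSV, and its column names are stand-ins I made up so the snippet runs anywhere; swap in the path to your own CSV.

```python
import io

import pandas as pd


def load_customer_data(source):
    """Read a CSV (file path or file-like object) into a Pandas data frame."""
    return pd.read_csv(source)


# Stand-in for the real file: a tiny in-memory CSV with made-up columns
sample_csv = io.StringIO(
    "customer_id,geography,sales\n"
    "1,North,100.0\n"
    "2,South,250.0\n"
)
df = load_customer_data(sample_csv)

# A quick look at the data before handing it to an agent
print(df.head())
print(df.dtypes)
```

For an Excel file or a SQL database, pd.read_excel() and pd.read_sql() return the same kind of data frame, so the rest of the workflow does not change.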
Step 3: Create the Pandas Data Analysis Agent with LangChain
Initialize the language model and create the Pandas data analysis agent using LangChain.
This is what’s happening:
- ChatOpenAI: Initializes the OpenAI language model.
- create_pandas_dataframe_agent: Creates an agent that can interact with the Pandas DataFrame.
- agent_type: Specifies the type of agent (using OpenAI functions).
- suffix: Instructs the agent to return results in JSON format for easy parsing.
Pro-Tip: The secret sauce is to use the suffix parameter to specify the output format. Under the hood, this adds extra instructions to the agent’s default prompt template that describe how to return the information.
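Here is a minimal sketch of how these pieces can fit together. It is not necessarily the exact code from the newsletter folder: the model name, the suffix wording, and the build_agent() helper are assumptions for illustration, and the imports follow the LangChain package layout from around the time this article was written (newer releases may move them). Depending on your versions, you may also need the tabulate package, which Pandas uses to render the data frame preview for the prompt.

```python
# Illustrative instruction text; the tutorial's exact suffix wording is an assumption.
# Curly braces are avoided because the library may treat the suffix as a format template.
JSON_SUFFIX = (
    "Always return the final answer as a JSON dictionary that can be "
    "parsed into a data frame containing the requested information."
)


def build_agent(df, model="gpt-4o-mini"):
    """Create a Pandas data frame agent that is asked to answer in JSON."""
    # Imported inside the function so this file loads even where the
    # LangChain packages are not installed.
    from langchain.agents import AgentType
    from langchain_experimental.agents import create_pandas_dataframe_agent
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model=model)  # reads OPENAI_API_KEY from the environment
    return create_pandas_dataframe_agent(
        llm,
        df,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        suffix=JSON_SUFFIX,
        verbose=True,
        # The agent executes model-written Python; recent releases of
        # langchain_experimental require this explicit opt-in.
        allow_dangerous_code=True,
    )
```

Because the agent runs Python that the model writes, only point it at data and environments you are comfortable letting it execute code against.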
Step 4: Interacting with the Pandas Data Frame Agent
Now, you can ask the agent questions about your data. Try running this code with a Natural Language analysis question:
“What are the total sales by geography?”
The agent processes the query and returns a response.
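As a rough sketch, asking the question looks like the snippet below. The ask() wrapper is a hypothetical convenience of mine; it assumes the agent is a LangChain agent executor, whose invoke() method takes a dictionary with an "input" entry and returns a dictionary with the answer under "output".

```python
def ask(agent, question):
    """Send one Natural Language question to the agent and return its text answer."""
    response = agent.invoke({"input": question})
    # The executor returns a dict; the model's final answer sits under "output"
    return response["output"]
```

With an agent in hand, the call would be ask(agent, "What are the total sales by geography?"). What comes back is a plain string, which matters for the next step.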
This is where post-processing comes into play. Remember when I added the suffix parameter to return JSON? The agent actually buries the JSON in a string.
That’s OK, because I have created a handy little parsing tool that extracts the JSON from the string and converts it to a Pandas Data Frame for us.
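If you want to see how such a parser can work, here is a hypothetical stand-in. It is not the parse_json_to_dataframe() utility from the code folder (which may be implemented differently), and the function names and sample reply are mine. The idea is to scan the string for the first spot where valid JSON starts, decode it, and hand the result to Pandas.

```python
import json


def extract_json(text):
    """Return the first JSON object or array embedded in a string."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; keep scanning
    raise ValueError("No JSON found in the agent output")


def json_text_to_dataframe(text):
    """Convert JSON buried in a string into a Pandas data frame."""
    import pandas as pd  # imported here so extract_json works without pandas

    # Works for a dict of columns or a list of records; a flat dict of
    # scalar values would need wrapping in a list first.
    return pd.DataFrame(extract_json(text))


# Made-up example of an agent reply with JSON buried in the text
reply = (
    'Here are the totals: {"geography": ["North", "South"], '
    '"sales": [150.0, 250.0]} Hope that helps!'
)
print(extract_json(reply))
```

Scanning for the first decodable value, rather than searching for a code fence, keeps the helper working whether or not the model wraps its JSON in extra text.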
Step 5: Visualizing the Results
With the results in a Pandas data frame, we can now report them. I’ll do this manually with Plotly, but a great challenge is to extend the code to create an AI agent that writes the visualization code and executes it automatically.
This visualization provides a clear view of sales distribution across different geographical regions.
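For reference, a bar chart like this takes only a few lines with Plotly Express. This is a sketch under assumptions: the plot_sales_by_geography() helper is mine, and the geography and sales column names depend on what the agent actually returns, so check your parsed data frame first.

```python
def plot_sales_by_geography(df, x="geography", y="sales"):
    """Build a Plotly bar chart from the parsed agent results."""
    import plotly.express as px  # imported here so the file loads without Plotly

    # One bar per geography, bar height equal to total sales
    fig = px.bar(df, x=x, y=y, title="Total Sales by Geography")
    return fig
```

Call plot_sales_by_geography(results_df).show() to open the chart, where results_df is the data frame you parsed from the agent’s answer.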
Conclusion
By integrating Generative AI with data science, you’ve created a powerful tool that can interact with your data in natural language. This Pandas data analysis agent simplifies the process of extracting insights and lets non-technical stakeholders automate common data manipulations so they can make data-driven decisions.
But there’s so much more to learn in Generative AI and data science.
If you’re excited to become a Generative AI Data Scientist with Python, then keep reading…
Become A Generative AI Data Scientist
The future of data science is AI / ML.
I’ve helped 6,107+ students learn data science and now I’m helping them become Generative AI Data Scientists, skilled in combining Generative AI / ML. With this system they have:
- Landed Promotions to Manager of AI/ML Teams ($200,000+ Role)
- Made Proof-Of-Concepts for Clients ($25,000+ Consulting Projects)
- Grown their data science skills with Generative AI (Career Growth)
Here’s the system they are taking to become Generative AI Data Scientists:
This is a Live 8-Week Generative AI Bootcamp for Data Scientists that covers:
- Week 1: Live Kickoff Clinic + Local LLM Training + AI Fast Track
- Week 2: Retrieval Augmented Generation (RAG)
- Week 3: Business Intelligence AI Copilot (SQL + Pandas Tools)
- Week 4: Customer Analytics Team (Multi-Agent Workflows)
- Week 5: Time Series Forecasting Team (Multi-Agent Machine Learning Workflows)
- Week 6: LLM Model Deployment AWS Bedrock
- Week 7: Fine-Tuning LLM Models AWS Bedrock
- Week 8: AI App Deployment With AWS Cloud
Enroll In The Next Cohort Here
(And Become A Generative AI Data Scientist in 2025)