Data Science Workflow - The Process for Solving Data Problems
Written by Matt Dancho
Data Science is often misunderstood by students seeking to enter the field, business analysts seeking to add data science as a new skill, and executives seeking to implement a data science practice. This article aims to clear up the mystery behind data science by illustrating the sequence of steps that takes you from a business problem to business value using a data science workflow. Once data science is understood, we can learn the data science skills that generate the most value and make better strategic investments in building a data science practice.
In this article, you will:
- Understand what data science is
- Learn how data science generates value for an organization
- Learn how to go from business problem to business value
- Get a special offer to learn data science for business through Business Science University
The Mystery & Confusion
Data Science is a mysterious term to many, but why?
Students & Data Enthusiasts
People excited about data science often see it as 100% machine learning, which is drastically out of proportion with reality. In practice, Machine Learning (or Modeling) takes about 5% of your time. The rest of the time is spent:
- Understanding the Business Problem: Communicating with Domain Experts (20%)
- Working with Data: Cleaning, Manipulating, Visualizing, Processing, Transforming, and Understanding (60%)
- Communicating Results: Reporting, Slide Decking, and Building Distributed Applications (Predictive Decision-Making Tools) (15%)
How Students View Data Science
Executives & Business Professionals
Executives and business professionals see data science as a new technology that could benefit their organization, but the connection between business problem and business value is not well understood. Data Science is often viewed as Artificial Intelligence (AI), a complex, black-box technology that is very trendy. But, the question remains - “What Can AI Do?”
How Executives and Business Professionals View Data Science
Reality for Everyone
Fortunately, the reality is that large businesses:
- Have many customers - The customers churn, generate sales, and drive forecasts
- Make many products and/or services - The products are linked to quality, lead time, and inventory
- Have many suppliers - The suppliers affect lead times and serviceability
- Have data - The data provides a means to measure business drivers and is the fuel for data science
This combination of business drivers - customers, products, inventory, suppliers, and more - with a wide array of internal and external data makes data science a competitive advantage for organizations that can implement it effectively.
Making Better Decisions Generates Business Value
The goal for any Data Science Practice (Data Science Team) is to enable the rest of the organization to make better, data-driven decisions. Therefore, a Data Science Practice is a support role (similar to IT) that allows the organization to function better. The Data Science team can add a lot of value very quickly - through better decision making.
A simple example illustrates my point: an organization that does $500M in annual revenue but has a customer churn rate of 10% loses out on $50M in revenue per year. If a data science practice can identify the issue, predict which customers are going to churn, and implement strategies that enable the workforce to target those customers with retention strategies, the team could reduce the churn rate by 20% (from 10% to 8%).
In monetary terms, a reduction in churn of 20% equates to an annual savings of $10M. Over 5 years, this is $50M in savings generated from the Data Science Practice working with the decision makers (e.g. Sales, Marketing, Production).
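The arithmetic behind this example can be checked with a minimal Python sketch. It assumes, as the example does, that lost revenue is simply annual revenue times the churn rate, that revenue stays flat over the five years, and that the 20% reduction is relative (10% churn becomes 8%). The function name `churn_savings` and its parameters are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def churn_savings(annual_revenue, churn_rate, relative_reduction, years):
    """Return (baseline_loss, new_loss, annual_savings, total_savings).

    Simplifying assumptions (not a real churn model):
    - lost revenue = annual revenue * churn rate
    - revenue is constant across all years
    - the reduction is relative to the current churn rate
    """
    baseline_loss = annual_revenue * churn_rate
    new_churn_rate = churn_rate * (1 - relative_reduction)
    new_loss = annual_revenue * new_churn_rate
    annual_savings = baseline_loss - new_loss
    return baseline_loss, new_loss, annual_savings, annual_savings * years


# Figures from the example: $500M revenue, 10% churn, 20% reduction, 5 years.
# Fractions keep the percentages exact and avoid floating-point rounding.
baseline, reduced, per_year, total = churn_savings(
    500_000_000, Fraction(10, 100), Fraction(20, 100), 5
)

print(f"Revenue lost to churn today:   ${int(baseline):,}")
print(f"Revenue lost after reduction:  ${int(reduced):,}")
print(f"Annual savings:                ${int(per_year):,}")
print(f"Savings over 5 years:          ${int(total):,}")
```

Under these assumptions the sketch reproduces the figures in the text: $50M lost per year today, $10M saved per year, and $50M saved over 5 years.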
How Do We Go From Business Problem To Business Value?
Going from business problem to business value follows an iterative set of steps that we at Business Science University call the Data Science Workflow:
The Data Science Workflow has milestones (blue clouds), stages (dotted lines), and steps (gray shapes).
We begin with a Business Problem (milestone), where the team or organization identifies a problem that is worth solving. Typically this has a specific metric assigned to it that can be measured financially (e.g. 10% of our customers are not re-purchasing each year, which is costing the organization $50M annually).
The organization prioritizes this problem with the data science team, and they step into a project management workflow. Ideally, the organization follows a systematic and repeatable approach designed to integrate the business with data science, such as the Business Science Problem Framework (BSPF), which we teach in our Data Science for Business with R Course (DS4B 201-R).
3 Stages of the Data Science Workflow
There are 3 stages:
- Preparation - Data is collected and cleaned. This takes a significant amount of time because most data is unclean, meaning steps need to be taken to improve the quality and develop it into a format that machines can interpret and learn from.
- Experimentation - This is where hypotheses are generated, data is visualized, and models are generated. This takes significantly less time than Preparation.
- Distribution - Reports are generated documenting results, slide decks are created to present to management, and once management provides the go-ahead, apps are developed to implement decision making systems.
Data scientists call the end of the workflow “production” or “deployment”, and this is where Business Value (milestone) is generated.
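To make the three stages concrete, here is a toy Python sketch that chains them as plain functions. Everything in it is an assumption for illustration: the records are made up, the function names `prepare`, `experiment`, and `distribute` are hypothetical, and a per-segment churn rate stands in for real modeling. It is not how any particular team or course implements the workflow.

```python
# Toy records; every value here is invented for illustration only.
RAW_RECORDS = [
    {"customer": "A", "segment": "retail", "churned": True},
    {"customer": "B", "segment": "retail", "churned": False},
    {"customer": "C", "segment": "wholesale", "churned": False},
    {"customer": "D", "segment": None, "churned": True},         # unclean: missing segment
    {"customer": "E", "segment": "wholesale", "churned": None},  # unclean: missing label
]


def prepare(records):
    """Preparation: keep only records that have no missing fields."""
    return [r for r in records if all(v is not None for v in r.values())]


def experiment(clean_records):
    """Experimentation: estimate the churn rate per segment (a stand-in for modeling)."""
    totals, churned = {}, {}
    for r in clean_records:
        seg = r["segment"]
        totals[seg] = totals.get(seg, 0) + 1
        churned[seg] = churned.get(seg, 0) + int(r["churned"])
    return {seg: churned[seg] / totals[seg] for seg in totals}


def distribute(churn_by_segment):
    """Distribution: turn the results into a short text report for stakeholders."""
    lines = ["Churn rate by segment:"]
    for seg, rate in sorted(churn_by_segment.items()):
        lines.append(f"  {seg}: {rate:.0%}")
    return "\n".join(lines)


# Business problem in, report out: each stage feeds the next.
report = distribute(experiment(prepare(RAW_RECORDS)))
print(report)
```

Even in this toy version, the modeling step is the smallest part: most of the lines deal with getting the data into shape and presenting the result.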
Best Data Science Teams Focus on These Parts
The best data science teams can iterate through this process going from problem to value very efficiently, spending little time on modeling and maximum time at the ends of the spectrum:
- Beginning of Workflow: Business Understanding / Domain Expert Communication, Data Understanding, Data Quality, and Feature Engineering are Critical
- End of Workflow: Communication with Project Stakeholders and Product Delivery are Critical
Learn How to Implement the Data Science Workflow
Learning how to implement the Data Science Workflow requires knowing which tools to use and in what areas of the workflow they belong. Here’s exactly where the R Packages fit (we teach the Data Science Workflow in Business Science University’s 3-Course R-Track):
Primary R Packages Overlaid on the Data Science Workflow
For those who are interested in learning how to implement the data science workflow, I have excellent news. The data science skills can be learned in weeks through Business Science University.
Here is why Business Science University works:
- You learn how to solve business problems first - The Data Science tools are secondary to the problem-solving process. Therefore, a typical course focuses on solving the problem while integrating 10-20 tools, which are the mechanisms for how we arrive at the end product.
- You learn the entire (end-to-end) data science workflow - Data Import, Data Preparation, Data Cleaning, Data Manipulation, Data Visualization, Functional Programming, Advanced Machine Learning, and Web Application Development.
- You learn cutting-edge tools and resources - We teach what works: high performance, fast iteration, flexibility, and business value. These include tools like H2O Automated Machine Learning and LIME for explanations, frameworks like the BSPF Framework, and referenced resources like the Ultimate R Cheat Sheet.
- We provide a community for your support - We operate a private Slack Channel with over 300 active, like-minded members along with instructor support (yes, I am in there contributing and communicating every day).
Business Science University is different. You learn the entire data science tool chain while you solve business problems.
I am happy to give you a special offer of 15% OFF the R-Track 3-Course Bundle.
References to Data Science Methodologies
Several other data science workflow and project management methodologies exist. The two that we use at Business Science University are:
- CRISP-DM
- BSPF Framework
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a general approach to performing data science projects. It’s cross-industry, which means it is compatible with almost any data science problem. One issue is that CRISP-DM is very general, which is why Business Science created the BSPF Framework (discussed next).
The Business Science Problem Framework modifies the CRISP-DM process to provide specific strategies that improve project productivity and help data scientists get results in terms of Return-On-Investment (ROI).
We’ve written a lot about how to build data science teams, manage data science projects, and get results with frameworks. See a few of our best articles on Data Science in Business.
About The Author
Matt Dancho is the founder of Business Science and is an Instructor at Business Science University. He is committed to doing everything possible to help students successfully apply data science to business to generate value (ROI).
“I look forward to having you in my courses. I will do everything possible to help you succeed.”
-Matt Dancho, Founder of Business Science