H2O is the scalable, open-source ML library that features AutoML. Here's why it's an essential library for me (and you).
The enterprise-grade process for deploying, hosting, and maintaining Shiny web applications using AWS, Docker, and Git.
Moving into 2020, three things are clear - Organizations want Data Science, Cloud, and Apps. A key skill that companies need is Git for application development (I call this Full Stack Data Science). Here's what is driving Git's growth, and why you should learn Git for data science application development.
Moving into 2020, three things are clear - Organizations want Data Science, Cloud, and Apps. Here are the essential skills for Data Scientists that need to build and deploy applications in 2020 and beyond.
Getting a job in Data Science is difficult. Here's how one Business Science student aced his Data Science Interview and landed a job at a top Management Consulting Firm.
Moving into 2020, three things are clear - Organizations want Data Science, Cloud, and Apps. Here's how Docker plays a part in the essential skills of 2020.
Learn how to take a tidy approach to classification problems with the new parsnip R package for machine learning.
A data science team has many tools that all need to be integrated. And, this can be INTIMIDATING. Here are some tips to deal with the complexity of a data science tech stack.
Organizations depend on the Data Science team to build distributed applications that solve business needs. AWS provides an infrastructure to host data science products for stakeholders to access.
Apply Data Science to Business using the Business Science Problem Framework. Learn how one Business Science student created a data product that aims to help his organization improve the quality of care while reducing cost.
Learn how to web scrape HTML, wrangle JSON, and visualize product data from the Bicycle Manufacturer, Specialized Bicycles.
We can often improve forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the clean_anomalies() function from anomalize into your forecast workflow.
Learn how to scrape and wrangle PDF tables of a Report on Endangered Species with the tabulizer R package and visualize trends with ggplot2.
Wrangling Big Data is one of the best features of the R programming language - which boasts a Big Data Ecosystem that contains fast in-memory tools (e.g. data.table) and distributed computational tools (sparklyr). With the NEW dtplyr package, data scientists with dplyr experience gain the benefits of the data.table backend. We saw a 3X speed boost for dplyr!
I'm pleased to announce the introduction of correlationfunnel version 0.1.0, which officially hit CRAN yesterday. The correlationfunnel package is something I've been using for a while to efficiently explore data, understand relationships, and get to business insights as fast as possible.
This is a true story based on how I created my data science company from scratch. It's a detailed documentation of my personal journey along with the company I founded, Business Science.
Learn how going from Excel to R can speed up Exploratory Data Analysis, getting business insights 100X FASTER.
The Ultimate R Cheat Sheet now covers the Shinyverse - An Ecosystem of R Packages for Shiny Web Application Development, Deployment, and putting Machine Learning into Production. Download the Cheat Sheet for Free!
Becoming a data scientist in Finance can be a lofty challenge... unless you know how to streamline the path.
Student feedback led to a BRAND NEW CHEAT SHEET on customer segmentation and a major overhaul to our Week 6 Modeling Chapter in our Business Analysis with R Course.
The first article in a 3-part series on Excel to R, this article walks the reader through a Marketing Case Study exposing the 10X productivity boost from switching from Excel to R.
The ultimate R cheat sheet links to the documentation and cheat sheets for every major R package. It just got even better with a brand new second page containing special topics and R packages!
This article demonstrates a real-world case study for business forecasting with regression models, including artificial neural networks (ANNs) with Keras.
R and Python - learn how to integrate both R and Python into your data science workflow. Use the strengths of the two dominant data science languages.
Real world data science - Learn how to compete in a Kaggle Competition using Machine Learning with R.
I’m pleased to announce that we released brand new content for our flagship course, Data Science For Business (DS4B 201). Over the course of 10 weeks, the DS4B 201 course teaches students an end-to-end data science project solving Employee Churn with R, H2O, & LIME. The latest content is focused on transitioning from modeling Employee Churn with H2O and LIME to evaluating our binary classification model using Return-On-Investment (ROI), thus delivering business value. We do this through application of a special tool called the Expected Value Framework. Let’s learn about the new course content available now in DS4B 201, Chapter 7, which covers the Expected Value Framework for modeling churn with H2O!
We are pleased to announce that our Data Science For Business (#DS4B) Course (HR 201) is OFFICIALLY OPEN! This course is for intermediate to advanced data scientists looking to apply H2O and LIME to a real-world binary classification problem in an organization: Employee Attrition. If you are interested in applying data science for business in a real-world setting with advanced tools using a client-proven system that delivers ROI to the organization, then this is the course for you. For a limited time we are offering 15% off enrollment.
Last November, our data science team embarked on a journey to build the ultimate Data Science For Business (DS4B) learning platform. We saw a problem: A gap exists in organizations between the data science team and the business. To bridge this gap, we’ve created Business Science University, an online learning platform that teaches DS4B, using high-end machine learning algorithms, and organized in the fashion of an on-premise workshop but at a fraction of the price. I’m pleased to announce that, in 5 days, we will launch our first course, HR 201, as part of a 4-course Virtual Workshop. We crafted the Virtual Workshop after the data science program that we wished we had when we began data science (after we got through the basics of course!). Now, our data science process is being opened up to you. We guide you through our process for solving high impact business problems with data science!
The R programming language is a powerful tool used in data science for business (DS4B), but R can be unnecessarily challenging to learn. We believe you can learn R quickly by taking an 80/20 approach to learning the most in-demand functions and packages. In this article, we seek to ultimately understand what techniques are most critical to a beginner’s success through analyzing a master data scientist’s code base. Half of this article covers the web scraping procedure (using purrr) we used to collect our data (if new to R, you can skip this). The second half covers the insights gained from analyzing a master’s code base. In the next article in our series, we’ll develop a strategic learning plan built on our knowledge of the master. Last, there’s a bonus at the end of the article that shows how you can analyze your own code base using the new fs package. Enjoy.
We’re happy to announce the third release of the tibbletime package. This is a huge update, mainly due to a complete rewrite of the package. It contains a ton of new functionality and a number of breaking changes that existing users need to be aware of. All of the changes have been well documented in the NEWS file, but it’s worthwhile to touch on a few of them here and discuss the future of the package. We’re super excited so let’s check out the vision for tibbletime and its new functionality!
Learn R for business - Data science for business is the future of business analytics. Here are 6 reasons why R is the right choice.
We’re into the fourth day of Business Science Demo Week. We have a really cool one in store today: tibbletime, which uses a new tbl_time class that is time-aware!! For those that may have missed it, every day this week we are demo-ing an R package: tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Let’s take tibbletime for a spin!
Tonight at 7PM EST, we will be giving a LIVE #DataTalk on Using Machine Learning to Predict Employee Turnover. Employee turnover (attrition) is a major cost to an organization, and predicting turnover is at the forefront of the needs of Human Resources (HR) in many organizations. Until now the mainstream approach has been to use logistic regression or survival curves to model employee attrition. However, with advancements in machine learning (ML), we can now get both better predictive performance and better explanations of what critical features are linked to employee attrition. We used two cutting-edge techniques: the h2o package’s new FREE automatic machine learning algorithm, h2o.automl(), to develop a predictive model that is in the same ballpark as commercial products in terms of ML accuracy. Then we used the new lime package that enables breakdown of complex, black-box machine learning models into variable importance plots. The talk will cover HR Analytics and how we used R, H2O, and LIME to predict employee turnover.
We’re into the third day of Business Science Demo Week. Hopefully by now you’re getting a taste of some interesting and useful packages. For those that may have missed it, every day this week we are demo-ing an R package: tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Today is sweep, which has broom-style tidiers for forecasting. Let’s get going!
We’re into the second day of Business Science Demo Week. What’s demo week? Every day this week we are demoing an R package: tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Second up is timetk, your toolkit for time series in R. Here we go!
We’ve got an exciting week ahead of us at Business Science: we’re launching our first ever Business Science Demo Week. Every day this week we are demoing an R package: tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. First up is tidyquant, our flagship package that’s useful for financial and time series analysis. Here we go!
Today we are introducing tibbletime v0.0.2, and we’ve got a ton of new features in store for you. We have functions for converting to flexible time periods with the ~period formula~ and making/calculating custom rolling functions with rollify() (plus a bunch more new functionality!). We’ll take the new functionality for a spin with some weather data (from the weatherData package). However, the new tools make tibbletime useful in a number of broad applications such as forecasting, financial analysis, business analysis and more! We truly view tibbletime as the next phase of time series analysis in the tidyverse. If you like what we do, please connect with us on social media to stay up on the latest Business Science news, events and information!
We are very excited to announce the initial release of our newest R package, tibbletime. As evident from the name, tibbletime is built on top of the tibble package (and more generally on top of the tidyverse) with the main purpose of being able to create time-aware tibbles through a one-time specification of an “index” column (a column containing timestamp information). There are a ton of useful time functions that we can now use such as time_collapse(). We’ll walk through the basics in this post.
We’re excited to announce the alphavantager package, a lightweight R interface to the Alpha Vantage API! Alpha Vantage is a FREE API for retrieving real-time and historical financial data. It’s very easy to use, and, with the recent glitch with the Yahoo Finance API, Alpha Vantage is a solid alternative for retrieving financial data for FREE! It’s definitely worth checking out if you are interested in financial analysis. We’ll go through the alphavantager R interface in this post to show you how easy it is to get real-time and historical financial data. In the near future, we have plans to incorporate the tidyquant to enable scaling from one equity to many.
We have several announcements regarding Business Science R packages. First, as of this week the R package formerly known as timekit has changed to timetk for time series tool kit. There are a few “breaking” changes because of the name change, and this is discussed further below. Second, the tidyquant packages have several improvements, which are discussed in detail below. Finally, don’t miss a beat on future news, events and information by following us on social media.
We’re pleased to introduce a new package, sweep, now on CRAN! Think of it like broom for the forecast package. The forecast package is the most popular package for forecasting, and for good reason: it has a number of sophisticated forecast modeling functions. There’s one problem: forecast is based on the ts system, which makes it difficult to work within the tidyverse. This is where sweep fits in! The sweep package has tidiers that convert the output from forecast modeling and forecasting functions to “tidy” data frames. We’ll go through a quick introduction to show how the tidiers can be used, and then show a fun example of forecasting GDP trends of US states. If you’re familiar with broom it will feel like second nature. If you like what you read, don’t forget to follow us on social media to stay up on the latest Business Science news, events and information!
We’ve just released timekit v0.3.0 to CRAN. The package updates include changes that help with making an accurate future time series with tk_make_future_timeseries() and we’ve added a few features to tk_get_timeseries_signature(). Most important are the new vignettes that cover both the making of future time series task and forecasting using the timekit package. If you saw our last timekit post, you were probably surprised to learn that you can use machine learning to forecast using the time series signature as an engineered feature space. Now we are expanding on that concept by providing two new vignettes that teach you how to use ML and data mining for time series predictions. We’re really excited about the prospects of ML applications with time series. If you are too, I strongly encourage you to explore the timekit package important links below. Don’t forget to check out our announcements and to follow us on social media to stay up on the latest Business Science news, events and information! Here’s a summary of the updates.
In advance of upcoming Business Science talks on tidyquant at R/Finance and EARL San Francisco, we are releasing a technical paper entitled “New Tools For Performing Financial Analysis within the ‘Tidy’ Ecosystem”. The technical paper covers an overview of the current R financial package landscape, the independent development of the “tidyverse” data science tools, and the tidyquant package that bridges the gap between the two underlying systems. Several usage cases are discussed. We encourage anyone interested in financial analysis and financial data science to check out the technical paper. We will be giving talks related to the paper at R/Finance on May 19th in Chicago and EARL on June 7th in San Francisco. If you can’t make it, I encourage you to read the technical paper and to follow us on social media to stay up on the latest Business Science news, events and information.
The timekit package contains a collection of tools for working with time series in R. There are a number of benefits. One of the biggest is the ability to use a time series signature to predict future values (forecast) through data mining techniques. While this post is geared toward exposing the user to the timekit package, there are examples showing the power of data mining a time series as well as how to work with time series in general. A number of timekit functions will be discussed and implemented in the post. The first group of functions works with the time series index, and these include functions tk_get_timeseries_summary(). We’ll spend the bulk of this post introducing you to these. The next function deals with creating a future time series from an existing index, tk_make_future_timeseries(). The last set of functions deal with coercion to and from the major time series classes in R,
We’ve got some good stuff cooking over at Business Science. Yesterday, we had the fifth official release (0.5.0) of tidyquant to CRAN. The release includes some great new features. First, the Quandl integration is complete, which now enables getting Quandl data in “tidy” format. Second, we have a new mechanism to handle selecting which columns get sent to the mutation functions. The new argument name is… select, and it provides increased flexibility which we show off in a rollapply example. Finally, we have added several PerformanceAnalytics functions that deal with modifying returns to the mutation functions. In this post, we’ll go over a few of the new features in version 0.5.0.
Today I’m very pleased to introduce the new Quandl API integration that is available in the development version of tidyquant. Normally I’d introduce this feature during the next CRAN release (v0.5.0 coming soon), but it’s really useful and honestly I just couldn’t wait. If you’re unfamiliar with Quandl, it’s amazing: it’s a web service that has partnered with top-tier data publishers to enable users to retrieve a wide range of financial and economic data sets, many of which are FREE! Quandl has its own R package (aptly named Quandl) that is overall very good but has one minor inconvenience: it doesn’t return multiple data sets in a “tidy” format. This slight inconvenience has been addressed in the integration that comes packaged in the latest development version of tidyquant. Now users can use the Quandl API from within tidyquant with three functions: quandl_search(), and the core function tq_get(get = "quandl"). In this post, we’ll go through a user-contributed example, How To Perform a Fama French 3 Factor Analysis, that showcases how the Quandl integration fits into the “Collect, Modify, Analyze” financial analysis workflow. Interested readers can download the development version using devtools::install_github("business-science/tidyquant"). More information is available on the tidyquant GitHub page including the updated development vignettes.
I’m excited to announce the release of tidyquant version 0.4.0!!! The release is yet again sizable. It includes integration with the PerformanceAnalytics package, which now enables full financial analyses to be performed without ever leaving the “tidyverse” (i.e. with DATA FRAMES). The integration includes the ability to perform performance analysis and portfolio attribution at scale (i.e. with many stocks or many portfolios at once)! But wait there’s more… In addition to an introduction vignette, we created five (yes, five!) topic-specific vignettes designed to reduce the learning curve for financial data scientists. We also have new ggplot2 themes to assist with creating beautiful and meaningful financial charts. We included tq_get support for “compound getters” so multiple data sources can be brought into a nested data frame all at once. Last, we have added new tq_exchange() functions to make collecting stock data with tq_get even easier. I’ll briefly touch on several of the updates. The package is open source, and you can view the code on the tidyquant github page.
tidyquant, version 0.3.0, is a pretty sizable release that includes a little bit for everyone, including new financial charting and moving average geoms for use with ggplot2, a new tq_get get option called "key.stats" for retrieving real-time stock information, and several nice integrations that improve the ease of scaling your analyses. If you’re not already familiar with tidyquant, it integrates the best quantitative resources for collecting and analyzing quantitative data, TTR, with the tidyverse allowing for seamless interaction between each. I’ll briefly touch on some of the updates by going through some neat examples. The package is open source, and you can view the code on the tidyquant github page.
Since my initial post on parallel processing with multidplyr, there have been some recent changes in the tidy eco-system: namely the package tidyquant, which brings financial analysis to the tidyquant package drastically increase the amount of tidy financial data we have access to and reduces the amount of code needed to get financial data into the tidy format. The multidplyr package adds parallel processing capability to improve the speed at which analysis can be scaled. I seriously think these two packages were made for each other. I’ll go through the same example used previously, updated with the new
tidyquant, version 0.2.0, is now available on CRAN. If you’re not already familiar, tidyquant integrates the best quantitative resources for collecting and analyzing quantitative data, TTR, with the tidy data infrastructure of the tidyverse allowing for seamless interaction between each. I’ll briefly touch on some of the updates. The package is open source, and you can view the code on the tidyquant github page.
My new package, tidyquant, is now available on CRAN. tidyquant integrates the best quantitative resources for collecting and analyzing quantitative data, TTR, with the tidy data infrastructure of the tidyverse allowing for seamless interaction between each. While this post aims to introduce tidyquant to the R community, it just scratches the surface of the features and benefits. We’ll go through a simple stock visualization using ggplot2, which shows off the integration. The package is open source, and you can view the code on the tidyquant github page.