Python and R - Part 1: Exploring Data with Datatable
Written by David Lucey
Article Update
Interested in more Python and R tutorials?
π Register for our blog to get new articles as we release them.
Introduction
Pythonβs datatable
was launched by h2o
two years ago and is still in alpha stage with cautions that it may still be unstable and features may be missing or incomplete. We found that it feels very similar to the R version, with a few syntax differences and also some important pieces still to be added (as we will discuss). We could only find a handful of posts showing how to use datatable
, and many of the examples we were probably not written by regular users of R data.table
, and were often focused on its efficiency and ability to scale relative to pandas. We use R data.table
every day and love the speed and concise syntax, so this walk-through analysis of the EPAβs Big MT cars data set will be on the syntax of the most frequent actual data exploration operations. As for plotnine, it feels more seamless with ggplot2
with a few problems formatting plots in Rmarkdown.
EPAβs Big MT Dataset
To make it a little interesting, we will use the Tidy Tuesday Big MT Cars with 36 years of 42,230 new US car models. The data dictionary with 83 variables describing each annual new car model is found here. Everyone loves cars and remembering historical models, and we have naturally been curious about this data set. After closer analysis however, we discovered that there are some unfortunate missing pieces.
When we have modeled mtcars, weight (wt) and horsepower (hp), and their interaction, have been most informative for predicting mpg. It would have been interesting to look at the evolution of the mtcars coefficients over time, but these variables are not unfortunately not available. In addition, it is hard to get a sense of fleet mileage without the annual unit-volume of each new car model. Because of this, it is impossible to know the evolution of more fuel efficient electric vehicles relative to more fuel-hungry model sales.
It is difficult to understand why these variables are not included when that information must be available to the EPA, and it clearly says on page 6 of Fuel Economy Guide 2020 that an extra 100 lbs decreases fuel economy by 1%. While the data set is still of interest to practice for data cleaning, it doesnβt look likely that we will be able replicate mtcars over time unless we can find more variables.
Loading Data with fread
We tried to download both the origin zipped data directly from the EPA website (see link below), and the .csv from the Tidy Tuesday website, but were unsuccessful in both cases using Python and R versions of fread. We were able to download the Tidy Tuesday .csv link with fread in data.table
but not datatable
, and the error message didnβt give us enough information to figure it out. The documentation for data.table
fread is among the most extensive of any function we know, while still thin for datatableβs version so far. In the end, we manually downloaded and unzipped the file from the EPAβs website, and uploaded from our local drive.
The list of all 83 variables below, and we can see that there are several pertaining to fuel efficiency, emissions, fuel type, range, volume and some of the same attributes that we all know from mtcars (ie: cylinders, displacement, make, model and transmission). As mentioned, gross horsepower and weight are missing, but carburetors, acceleration and engine shape are also absent. We have all classes of vehicles sold, so get vehicle class information (VClass) not available in mtcars which is only cars. We will discuss further down, changes to the weight cutoffs on some of the categories over time make VClass of questionable use.
Set-up Thoughts from R Perspective
There were a couple of things about the set-up for datatable
, which werenβt apparent coming over from data.table
as an R user. The first was to use from dt import *
at the outset to avoid having to reference the package short name every time within the frame. From a Python perspective, this is considered bad practice, but we are only going to do it for that one package because it makes us feel more at home. The second was to use export_names()
in order to skip having to use the f operator or quotation marks to reference variables. In order to do this, we had to create a dictionary of names using the names list from above, and each of their f expressions extracted with export_names
in a second list. We then used update from the local environment to assign all of the dictionary values to their keys as variables. From then on, we can refer to those variable without quotation marks or the f operator (although any new variables created would still need f or quotation marks). We werenβt sure why this is not the default behavior, but it is easily worked around for our purposes. These two possibly not βPythonicβ steps brought the feel of datatable
a lot closer to the usual R data.table
(ie: without the package and expression short codes).
Basic Filter and Select Operations
A few lines of some key variables are shown in the code below, and it is clear that they need significant cleaning to be of use. One difference with R data.table
can be seen below with filtering. Using our year_filter
in i (the first slot), the 1204 2019 models are shown below. Unlike R data.table
, we refer to year outside of the frame in an expression, and then call it within i of the frame. The columns can be selected within ()
or []
in j (the second slot) as shown below, and new columns can be created within {}
.
We usually like to make a quick check if there are any duplicated rows across the whole our dataFrame, but there isnβt a duplicated()
function yet in datatable. According to How to find unique values for a field in Pydatatable Data Frame, the unique()
function also doesnβt apply to groups yet. In order to work around this, identifying variables would have to be grouped, counted and filtered for equal to 1, but we werenβt sure yet exactly which variables to group on. We decided to pipe over to pandas to verify with a simple line of code that there were no duplicates, but hope this function will be added in the future.
Aggregate New Variable and Sort
We can see that below that eng_dscr
is unfortunately blank 38% of the time, and high cardinality for the rest of the levels. A small percentage are marked βGUZZLERβ and βFLEX FUELSβ. in a few cases, potentially helpful information about engine like V-6 or V-8 are included with very low frequency, but not consistently enough to make sense try to extract. Another potentially informative variable, trans_dscr
is similarly blank more than 60% of the time. It seems unlikely that we could clean these up to make it useful in an analysis, so will probably have to drop them.
Separate and Assign New Variables
As shown above, trany
has both the transmission-type and gear-speed variables within it, so we extracted the variable from big_mt with to_list()
, drilled down one level, and used regex to extract the transmission and gear information needed out into trans and gear. Notice that we needed to convert the lists back into columns with dt.Frame before assigning as new variables in big_mt.
In the third line of code, we felt like we were using an R data.table
. The {}
is used group by trans and gear, and then to create the new percent variable in-line, without affecting the other variables in big_mt. We tried to round the decimals in percent, but couldnβt figure it out so far. Our understanding is that there is no round()
method yet for datatable
, so we multiplied by 100 and converted to integer. We again called export_names()
, to be consistent in using non-standard evaluation with the two new variables.
Set Key and Join
We wanted to create a Boolean variable to denote if a vehicle had an electric motor or not. We again used {}
to create the variable in the frame, but donβt think it is possible to update by reference so still had to assign to is_ev. In the table below, we show the number of electric vehicles rising from 3 in 1998 to 149 this year. Unfortunately,
Using Regular Expressions in Row Operations
Next, we hoped to extract wheel-drive (2WD, AWD, 4WD, etc) and engine type (ie: V4, V6, etc) from model. The re_match()
function is helpful in filtering rows in i. As shown below, we found almost 17k matches for wheel drive, but only 718 for the engine size. Given that we have over 42k rows, we will extract the wheels and give up on the engine data. It still may not be enough data for wheels to be a helpful variable.
We used regex to extract whether the model was 2WD, 4WD, etc as wheels from model, but most of the time, it was the same information as we already had in drive. It is possible that our weakness in Python is at play, but this would have been a lot simpler in R, because we wouldnβt have iterated over every row in order to extract part of the row with regex. We found that there were some cases where the 2WD and 4WD were recorded as 2wd and 4wd. The replace()
function was an efficient solution to this problem, replacing matches of βwdβ with βWDβ over the entire frame.
Reshaping
There was no such thing as an 4-wheel drive SUVs back in the 80βs, and we remember the big 8-cylinder Oldsmobiles and Cadillacs, so were curious how these models evolved over time. datatable doesnβt yet have dcast()
or melt()
, so we had to pipe these out to_pandas()
and then use pivot_table()
. Its likely that a lot of the the many models where wheel-drive was unspecified were 2WD, which is still the majority of models. We would have liked to show these as whole numbers, and there is a workaround in datatable to convert to integer, but once we pivoted in pandas
, it reverted to float. We can see the first AWD models starting in the late 80s, and the number of 8-cylinder cars fall by half. There are are a lot fewer annual new car models now than in the 80s, but were surprised how many fewer 4-cylinders.
Combining Levels of Variables with High Cardinality
With 35 distinct levels often referring to similar vehicles, VClass also needed to be cleaned up. Even in R data.table
, we have been keenly awaiting the implementation of fcase
, a data.table
version of the dplyr case_when()
function for nested control-flow statements. We made a separate 16-line function to lump factor levels (not shown). In the first line below, we created the vclasses list to drill down on the VClass tuple elements as strings. In the second line, we had to iterate over the resulting strings from the 0-index of the tuple to extract wheel-drive from a list-comprehension. We printed out the result of our much smaller list of lumped factors, but there are still problems with the result. The EPA changed the cutoff for a βSmall Pickup Truckβ from 4,500 to 6,000 lbs in 2008, and also used a higher cut-off for βsmallβ SUVβs starting in 2011. This will make it pretty hard to us VClass as a consistent variable for modeling, at least for Pickups and SUVs. As noted earlier, if we had the a weight field, we could have easily worked around this.
Selecting Multiple Columns with Regex
In the chunk (below), we show how to select columns from the big_mt names tuple by creating the measures selector using regex matches for the key identifier columns and for integer mileage columns matching β08β. This seemed complicated and we couldnβt do it in line within the frame as we would have with data.table
.SD = patterns(). We also wanted to reorder to move the identifier columns (year, make and model) to the left side of the table, but couldnβt find a equivalent setcolorder
function. There is documentation about multi-column selection, but we couldnβt figure out an efficient way to make it work. We show the frame with the year_filter
which we set up earlier.
Selecting Columns and Exploring Summary Data
We looked for a Python version of skimr
, but it doesnβt seem like there is an similar library (as is often the case). We tried out pandas profiling
, but that had a lot of dependencies and seemed like overkill for our purposes, so decided to use skim_tee
on the table in a separate R chunk (below). It was necessary to convert to pandas
in the Python chunk (above), because we couldnβt figure out how to translate a datatable
back to a data.frame
via reticulate
in the R chunk.
When we did convert, we discovered there were some problems mapping NAβs which we will show below. We suspect it isnβt possible to pass a datatable
to data.table
, and this might be the first functionality we would vote to add. There is a sizable community of data.table
users who are used to the syntax, and as we are, might be looking to port into Python (rather than learn pandas
directly). As reticulate
develops, opening this door seems to make so much sense. Below, we again run export_names()
in order to also prepare the newly generated variables for non-standard evaluation within the frame, and then filtered for the 21 columns we wanted to keep.
In the result above, we see a lot of challenges if we had hoped to have appropriate data to build a model to predict mpg over time. Many variables, such as evMotor, tCharger, sCharger and guzzler, are only available in a small number of rows. When we set out on this series, we hoped we would be able to experiment with modeling gas mileage for every year just like mtcars, but that seems unlikely based on the available variables.
Conclusion
It took us a couple of months to get up and running with R data.table
, and even with daily usage, we are still learning its nuance a year later. We think the up-front investment in learning the syntax, which can be a little confusing at first, has been worth it. It is also less well documented than dplyr or pandas. We learned so much about data.table
from a few blog posts such as Advanced tips and tricks with data.table and A data.table and dplyr tour. The goal of this post is to help to similarly fill the gap for datatable
.
Python datatable
is promising, and we are grateful for it as familiar territory as we learn Python. We canβt tell how much of our difficulty has been because the package is not as mature as data.table
or our just inexperience with Python. The need to manually set variables for non-standard evaluation, to revert to pandas to accomplish certain tasks (ie: reshaping) or the challenges extracting and filtering data from nested columns. It was still not easy to navigate the documentation and there were areas where the documentation was not Also, it would be appreciated to seamlessly translate between a datatable
and data.table
.
Author: David Lucey, Founder of Redwall Analytics
David spent 25 years working with institutional global equity research with several top investment banking firms.