Python and R - Part 2: Visualizing Data with Plotnine
Written by David Lucey
In this post, we start out where we left off in Python and R - Part 1: Exploring Data with Datatable. In the chunk below, we load our cleaned-up big MT Cars data set in a way that lets us refer directly to its variables without a short code or the f function from datatable. On the other hand, we load plotnine with the short code p9. We found this cumbersome relative to the R behavior, but given that we use so many different functions in ggplot when exploring a data set, it is hard to know in advance which functions to load into the namespace. Our experience with seaborn, and the discussions we have read by others, suggest that it is not very intuitive and probably not better than ggplot (given the mixed reviews that we have read). If we can port over with a familiar library and avoid a learning curve, it would be a win. As we mentioned in our previous post, plotnine feels very similar to ggplot, with a few exceptions. We will put the library through its paces below.
Consolidate make Into Parent manufacturer
In the previous post, we collapsed VClass from 35 overlapping categories down to 7. Here, we similarly consolidate many brands in make within their parent producers. Automotive brands often change hands, and there have been some large mergers over the years, such as Fiat and Chrysler in 2014 and the upcoming combination with Peugeot, making this a somewhat crude exercise. We used the standard that the brand is currently owned by the parent, but this may not have been the case over most of the period shown in the charts below. This can also affect the parent’s efficiency compared to peers. For example, Volkswagen bought a portfolio of luxury European gas guzzlers over the recent period, so its position is being pulled down from what would otherwise be one of the most efficient brands.
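The consolidation described above can be sketched as a simple lookup from brand to parent. This is only a minimal illustration: the dictionary name, the helper function, and the handful of entries are our own assumptions, not the full mapping behind the charts.

```python
# Illustrative brand-to-parent lookup. The entries are examples only,
# not the complete mapping used for the charts in this post.
PARENT_BY_MAKE = {
    "Audi": "Volkswagen",
    "Porsche": "Volkswagen",
    "Bentley": "Volkswagen",
    "Jeep": "Fiat Chrysler",
    "Dodge": "Fiat Chrysler",
    "Chevrolet": "GM",
    "Cadillac": "GM",
}


def parent_manufacturer(make: str) -> str:
    """Return the parent for a brand, or the brand itself if unmapped."""
    return PARENT_BY_MAKE.get(make, make)


makes = ["Audi", "Jeep", "Toyota", "Cadillac"]
manufacturers = [parent_manufacturer(make) for make in makes]
print(manufacturers)
```

Falling back to the brand itself keeps independent makes such as Toyota unchanged, so only brands with a known parent are collapsed.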
Imports Started Ahead and Improved Efficiency More
Here, we selected the largest-volume brands in two steps, first creating a numpy vector of makes which sold more than 1500 separate models over the full period, and then creating an expression to filter for the most popular. Then, we iterated over our vector and classified vehicles as ‘Cars’ or ‘Trucks’ based on regex matches to build a new vehicle_type variable. We would love to know a more streamlined way to accomplish these operations, because they would surely be easier for us using data.table. Excluding EVs, we found the combined mean mpg by year and make for both cars and trucks. It could be that we are missing something, but it also feels more verbose than it would have been in data.table, where we probably could have nested the filtering expressions within the frames, but again this could be our weakness in Python.
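To make the sequence of steps concrete, here is a minimal sketch in plain Python rather than datatable. The rows, the regex pattern, and the cutoff of 1 (standing in for the 1500-model threshold) are all made up for illustration, so this shows the shape of the logic rather than the code behind the charts.

```python
import re
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical rows: (year, make, VClass, mpg, is_ev). Values are made up.
rows = [
    (2000, "Toyota", "Compact Cars", 30.0, False),
    (2000, "Toyota", "Sport Utility Vehicle - 4WD", 18.0, False),
    (2000, "Ford", "Standard Pickup Trucks", 15.0, False),
    (2000, "Ford", "Midsize Cars", 22.0, False),
    (2000, "Ford", "Midsize Cars", 24.0, False),
    (2000, "Tesla", "Midsize Cars", 100.0, True),
]

MIN_MODELS = 1  # stand-in for the 1500-model cutoff described above
# Assumed pattern: any class mentioning these words is treated as a truck.
TRUCK_PATTERN = re.compile(r"pickup|truck|sport utility|van", re.IGNORECASE)


def vehicle_type(vclass: str) -> str:
    """Classify a vehicle class string as 'Trucks' or 'Cars'."""
    return "Trucks" if TRUCK_PATTERN.search(vclass) else "Cars"


# Step 1: keep only makes with more than MIN_MODELS rows.
counts = Counter(make for _, make, _, _, _ in rows)
popular = {make for make, n in counts.items() if n > MIN_MODELS}

# Step 2: exclude EVs and collect mpg by (year, make, vehicle_type).
groups = defaultdict(list)
for year, make, vclass, mpg, is_ev in rows:
    if make in popular and not is_ev:
        groups[(year, make, vehicle_type(vclass))].append(mpg)

# Step 3: take the mean of each group.
mean_mpg = {key: mean(values) for key, values in groups.items()}
for key, value in sorted(mean_mpg.items()):
    print(key, value)
```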
The plotnine code and graph below look very similar to what ggplot would generate, but we struggled with sizing the plot on the page and avoiding cutting off axis and legend labels. We tried to put the legend on the right, but the labels were partially cut off unless we squeezed the charts too much. When we put it at the bottom with horizontal labels, the x-axis for the ‘Cars’ facet was still partially blocked by the legend title. We couldn’t find much written on how to make the charts bigger or to change the aspect ratio or figure size parameters, so the size looks a bit smaller than we would like. We remember these struggles while learning ggplot, but it felt like we could figure it out more quickly.
It is also important to mention that confidence intervals are not implemented yet for lowess smoothing with plotnine. This probably isn’t such a big deal for our purposes in this graph, where there are a large number of models in each year. However, it detracts from the figure further below, where the uncertainty about the true mean efficiency of cars with batteries in the early years is high because there were so few models.
One thing to note is that it is difficult to tell which line maps to which make just by the colors. The original plan was to pipe this into plotly as we would do in R, but this functionality is not available. While the plotnine functionality is pretty close to ggplot, the lack of support for plotly is a pretty serious shortcoming.
We can see in the chart that “Other Asian” started out well at the beginning of the period and made remarkable progress, leaving Toyota behind to become the leader in cars and trucks. Our family has driven Highlanders over the last 20 years and seen the size of that model go from moderate to large, so it is not surprising to see Toyota trucks going from 2nd most to 2nd least efficient. BMW made the most progress of all producers in cars, and has also made gains since introducing trucks in 2000. As a general comment, relative efficiency seems more dispersed and stable for cars than for trucks.
When we look at the number of models by manufacturer, we can see that it declined steadily from 1984 through the late 1990s, but has been rising since. Although the number of truck models appears to be competitive with cars, note that the graphs have different scales, so there are about 2/3 as many in most years. In addition to becoming much more fuel efficient, BMW has increased the number of models to an astonishing degree over the period, even while most other European imports have started to tail off (except Mercedes). We would be interested to know the story behind such a big move by what is still a niche player in the US. GM had a very large number of car and truck models at the beginning of the period, but now has a much more streamlined range. It is important to remember that these numbers are not vehicles sold or market share, just models tested for fuel efficiency in a given year.
Electric Vehicles Unsurprisingly Get Drastically Better Mileage
After looking at the efficiency by manufacturer in the figure above, we did a double-take when we saw the chart below. While progress for gas-powered vehicles looked respectable above, in the context of cars with batteries, gas-only vehicles are about half as efficient on average. Though the mean improved, the mileage of the most efficient gas-powered vehicle in any given year steadily lost ground over the period.
Meanwhile, vehicles with batteries are not really comparable because plug-in vehicles don’t use any gas, so the EPA imputes an energy equivalence for those vehicles. The EPA website explains in Electric Vehicles: Learn More About the Label that the figure rests on a calculation of the electricity needed to travel 100 miles for plug-in vehicles. This seems like a crude comparison, as electricity prices vary around the country. Still, the most efficient battery-powered car (recently a Tesla) improved to an incredible degree.
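As a rough sketch of how such an equivalence works: the EPA treats 33.7 kWh of electricity as the energy equivalent of one gallon of gasoline, so consumption per 100 miles can be converted to miles per gallon equivalent. The function name and the 30 kWh example car are our own, and this is not necessarily how the mpg column in the data set was derived.

```python
# EPA's energy equivalence: 33.7 kWh of electricity per gallon of gasoline.
KWH_PER_GALLON_EQUIVALENT = 33.7


def mpge_from_kwh_per_100_miles(kwh_per_100_miles: float) -> float:
    """Convert electricity consumption to miles per gallon equivalent."""
    if kwh_per_100_miles <= 0:
        raise ValueError("consumption must be positive")
    miles_per_kwh = 100.0 / kwh_per_100_miles
    return miles_per_kwh * KWH_PER_GALLON_EQUIVALENT


# A hypothetical car that uses 30 kWh to travel 100 miles.
print(round(mpge_from_kwh_per_100_miles(30.0), 1))
```

Note that this is an energy conversion and says nothing about cost, which is why differing electricity prices make it a crude basis for comparison.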
Around 2000, there were only a handful of battery-powered cars, so the error bars would be wide if included, and we are counting all cars with any battery as one category even though it contains both hybrids and plug-ins. In any case, caution should be used in interpreting the trend, but there was a period where the average actually declined, and over 20 years it really hasn’t improved in the way the most efficient models have.
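The point about wide error bars follows from the standard error of a mean, which shrinks with the square root of the number of observations. The sketch below uses made-up mpg figures with the same spread for 4 models and for 64 models. It illustrates the idea only and is not a substitute for the lowess confidence band that plotnine lacks.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_standard_error(values):
    """Return (mean, standard error of the mean) for a sample."""
    return mean(values), stdev(values) / sqrt(len(values))


# Made-up mpg figures: the same values repeated, 4 models versus 64 models.
few_models = [40.0, 50.0, 60.0, 70.0]
many_models = few_models * 16

few_mean, few_se = mean_with_standard_error(few_models)
many_mean, many_se = mean_with_standard_error(many_models)
print(f"4 models:  mean={few_mean:.1f}, se={few_se:.2f}")
print(f"64 models: mean={many_mean:.1f}, se={many_se:.2f}")
```

Both samples have the same mean, but the standard error for the small sample is several times larger, which is why a trend based on a handful of early battery-powered models deserves caution.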
Efficiency of Most Vehicle Types Started Improving in 2005
We were surprised to see the fuel efficiency of mid-sized cars overtake even small cars as the most efficient around 2012. Small pickups and SUVs also made a lot of progress, as did standard pick-up trucks. Sport Utility Vehicles were left behind by the improvement most categories have seen since 2005, while vans steadily lost efficiency over the whole period. As mentioned earlier, we noticed that the same model SUV that we owned got about 20% larger over the period. It seems like most families in our area have at least one SUV, but they didn’t really exist before 2000.
Efficiency by Fuel Type
We can see that the fuel efficiency of electric vehicles almost doubled over the period, while we didn’t see the average efficiency of vehicles with batteries make the same improvement. We set our is_ev variable whenever the car had a battery, but didn’t specify whether it was a plug-in or a hybrid, so the discrepancy may have something to do with this. We can also see that the efficiency of diesel vehicles came down sharply during the 2000s. We know that Dieselgate broke in 2015 for vehicles sold from 2009, so it is interesting to see that the decline in listed efficiency started prior to that period. Natural gas vehicles seem to have been eliminated five years ago, which is surprising given the natural gas boom.
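One way to investigate that discrepancy would be to split the battery flag into plug-in and hybrid categories before averaging. The sketch below is hypothetical throughout: the label strings and the idea of a single battery-type column are assumptions for illustration, and the real data set may encode this differently.

```python
# Hypothetical labels for a battery-type column. The real data set may
# encode this differently, so treat these strings as placeholders.
PLUG_IN_LABELS = {"EV", "Plug-in Hybrid"}
HYBRID_LABELS = {"Hybrid"}


def battery_category(label):
    """Map a raw battery-type label to 'plug-in', 'hybrid', or 'none'."""
    if label in PLUG_IN_LABELS:
        return "plug-in"
    if label in HYBRID_LABELS:
        return "hybrid"
    return "none"


labels = ["EV", "Hybrid", None, "Plug-in Hybrid", "Diesel"]
categories = [battery_category(label) for label in labels]

# The single is_ev flag lumps the first two categories together.
is_ev = [category != "none" for category in categories]
print(list(zip(categories, is_ev)))
```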
We don’t know if fuelType1 refers to the recommended or required fuel, but we didn’t realize that there had been such a sharp increase in premium over the period. Our understanding was that premium gasoline had more to do with engine performance than gas efficiency. It is notable that despite all the talk about alternative fuels, they can still be used in only a small minority of new models.
Comments About Plotnine and Python Chunks in RStudio
In addition to the charts rendering smaller than we would have liked, we would have liked to have figure captions (as we generally do for our R chunks). Our cross-referencing links are also currently not working for the Python chunks as they would with R. There is a bug mentioned on the knitr news page which may be fixed when the 1.29 update becomes available.
There is a lot of complexity in this system and more going on than we are likely to comprehend in a short exploration. We know there is a regulatory response to the CAFE standards which tightened in 2005, and that at least one significant producer may not have had accurate efficiency numbers during the period. The oil price fluctuated widely during the period, but not enough to cause real change in behavior in the same way it did during the 1970s. We also don’t know how many vehicles of each brand were sold, so we don’t know how producers might jockey to sell more profitable models within the framework of overall fleet efficiency constraints. There can be a fine line between a light truck and a car, and the taxation differentials on the importation of cars vs light trucks are significant. Also, the weight cutoffs for trucks changed in 2008, so most truck categories do not cover a consistent weight range over the whole period. That is all for now, but a future post might involve scraping CAFE standards, where there is also long-term data available, to see if some of the blanks about volumes and weights could be filled in to support more than just exploratory analysis.
Author: David Lucey, Founder of Redwall Analytics
David spent 25 years working with institutional global equity research with several top investment banking firms.