Web Scraping and Analyzing Home Rentals (Python + R)
Written by David Lucey
This tutorial showcases how to use Python with R
When I started this mini-project, I hoped to use datatable as our main data frame, in conjunction with Python libraries like BeautifulSoup and data structures not available in R, like dictionaries. I soon learned that datatable doesn’t support dates yet. In Python and R - Part 1: Exploring Data with Datatable, we noted that datatable is still in alpha stage, and worked around the lack of reshaping capability and the inability to pipe our data directly into plots, but the missing date support is really a deal breaker for this project. As a result, we had to fall back on pandas, though we were able to keep using plotnine.
We found that vacationrentalslbi.com has an extensive list of rentals and doesn’t restrict itself to the particular agency maintaining the listing, so this is the perfect place to gather our data. Naturally, we would like to be polite, so we check that we are allowed to scrape. The R robotstxt library returns TRUE for our intended link, so we are good to go.
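For readers working only in Python, a similar politeness check can be sketched with the standard library's urllib.robotparser. This is not the robotstxt call used in the post. The robots.txt content and the example.com URLs below are made up so that the sketch runs offline; against a real site you would point the parser at the site's robots.txt with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one under an allowed path, one under a disallowed path.
listing_ok = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.example.com/listings/")
admin_ok = is_allowed(SAMPLE_ROBOTS_TXT, "https://www.example.com/admin/users")
```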
Scraping the website for listings is a two-step process:
- Go through and extract the links to all of the listings
- Navigate back to all of the links and extract the listing details
We set our requests to the website to vary around an average 5-second delay, and build a list of the ‘href’ links from the returned ‘a’ tags. We are not running our scraping code now for the blog post, but the results, loaded from disc, are shown below.
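The two ideas in this step, collecting ‘href’ values from ‘a’ tags and spacing requests around a 5-second average, can be sketched as follows. To keep the example dependency-free, it uses the standard library's html.parser rather than BeautifulSoup, and it parses a small made-up HTML string instead of fetching a page. The names LinkCollector, extract_links and polite_delay, the 2-second spread, and the /listing/… paths are all assumptions for illustration.

```python
import random
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list:
    """Return all href values found in the given HTML, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def polite_delay(mean_seconds: float = 5.0, spread: float = 2.0) -> float:
    """Draw a delay uniformly from [mean - spread, mean + spread] seconds."""
    return random.uniform(mean_seconds - spread, mean_seconds + spread)


# Made-up HTML standing in for a page of search results.
sample_html = (
    '<div><a href="/listing/464">A</a>'
    '<a name="anchor-only">no href</a>'
    '<a href="/listing/465">B</a></div>'
)
links = extract_links(sample_html)
```

In a real crawl, a call such as time.sleep(polite_delay()) between requests would give the varying pause described above.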
Get Listings from Home
Now, we build a second scraper to take the list of listings, extract the key elements of each and return a dictionary, which we store in a list. We won’t go into detail here, but the way to find the desired classes is to navigate to vacationrentalslbi.com in Google Chrome, press Ctrl+Shift+I (Cmd+Option+I on a Mac) to open the developer tools, choose the ‘Select Element’ option in the ‘Elements’ pane, and then click on the desired spot on the page.
We selected the title, content, description, location, and calendar sub-element tables for ‘booked’ and ‘available’ from the ‘month_box’. It took some work to get the calendar. We then returned a dictionary with all of these elements from our get_dict function.
When an element of a listing was not present, our code was breaking, so we put in exception handling for those cases. Although we think we have handled most of the likely errors in get_dict, the full scraping process takes a couple of hours, so we thought it best to save to disc after each request. It took us a while to figure out how to do this, because it turns out not to be so straightforward to save and append a json to disc. We were able to write to disc as txt instead.
Scrape All Listings
Using our get_dict function, we scrape each listing, create a dictionary entry and append it to disc.
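One way to append dictionaries to a text file after every request is to write one JSON object per line, so the file stays readable even if the scrape stops partway. This is a minimal sketch of that idea and not necessarily how the post's own helper works. The function names, the file name and the sample dictionary contents are assumptions.

```python
import json
import os
import tempfile


def append_listing(path: str, listing: dict) -> None:
    """Append one listing dictionary to a text file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(listing) + "\n")


def load_listings(path: str) -> list:
    """Read the text file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entries; the real dictionaries are much more deeply nested.
path = os.path.join(tempfile.mkdtemp(), "listings.txt")
append_listing(path, {"464": {"title": "Sample", "location": "Beach Haven"}})
append_listing(path, {"465": {"title": "Other", "location": "Surf City"}})
saved = load_listings(path)
```

Because each line is complete on its own, a crash after the hundredth request still leaves a hundred usable entries on disc.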
Again, because we wanted to avoid re-running the code, we are showing our saved data from disc. We load the saved data from our text file as a list of 1231 Python dictionaries. The dictionary for a sample listing, ‘464’, is shown in the chunk below. The attributes of the listing are deeply nested and not easy to filter and sort. However, we learned that it is easy to extract the desired elements using the dictionary keys, which we do in the get_calendar function below.
In get_calendar, we extract the dictionary key for the listing, and then the desired value elements including ‘rate’, ‘start_date’, ‘end_date’, ‘location’, ‘location_type’ and ‘beds’. We have to clean and transform the ‘rate’ variable to float and the date fields to datetime, and in our case, we are looking for the first two weeks of August, so we filter for just those two weeks. We also add the url back in so it is easy to take a look at an interesting listing in more detail.
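The cleaning steps described here can be sketched with the standard library alone. The nested layout of the listing dictionary, the ‘calendar’ key, the ‘$3,500’ rate format, the month/day/year date format and the function names are all assumptions made for illustration, since the real structure is not reproduced in the text.

```python
from datetime import date, datetime


def parse_rate(raw: str) -> float:
    """Turn a string such as '$3,500' into the float 3500.0."""
    return float(raw.replace("$", "").replace(",", "").strip())


def extract_weeks(listing, window_start, window_end, date_format="%m/%d/%Y"):
    """Flatten one listing into rows for weeks that fall inside the window."""
    listing_id, details = next(iter(listing.items()))
    rows = []
    for week in details.get("calendar", []):
        start = datetime.strptime(week["start_date"], date_format).date()
        end = datetime.strptime(week["end_date"], date_format).date()
        if start >= window_start and end <= window_end:
            rows.append({
                "id": listing_id,
                "rate": parse_rate(week["rate"]),
                "start_date": start,
                "end_date": end,
                "location": details.get("location"),
                "url": details.get("url"),
            })
    return rows


# Hypothetical listing with made-up rates and a placeholder url.
sample = {"464": {
    "location": "Beach Haven",
    "url": "https://www.example.com/listing/464",
    "calendar": [
        {"rate": "$3,000", "start_date": "07/25/2020", "end_date": "08/01/2020"},
        {"rate": "$3,500", "start_date": "08/01/2020", "end_date": "08/08/2020"},
        {"rate": "$3,500", "start_date": "08/08/2020", "end_date": "08/15/2020"},
    ],
}}
august_rows = extract_weeks(sample, date(2020, 8, 1), date(2020, 8, 15))
```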
We also manufactured some variables for our graphs below. For example, we generated a ‘month-year’ variable so we could aggregate weekly average prices and number of homes available. There were too many different sleep capacities, so we aggregated them into just four levels (sleeps 4 or under, 8 or under, 12 or under and more than 12). Beach Haven has 7-8 separate small sections, so we merged them into just one.
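These derived variables amount to three small mappings, sketched below. The function names and label strings are assumptions, and the rule of treating any location containing ‘Beach Haven’ as one town is a guess at how the sections might be merged, not the post's actual rule.

```python
from datetime import date


def sleeps_bucket(sleeps: int) -> str:
    """Map a sleep capacity onto one of four levels."""
    if sleeps <= 4:
        return "4 or under"
    if sleeps <= 8:
        return "8 or under"
    if sleeps <= 12:
        return "12 or under"
    return "more than 12"


def normalize_town(location: str) -> str:
    """Collapse any location containing 'Beach Haven' into a single town."""
    return "Beach Haven" if "Beach Haven" in location else location


def month_year(d: date) -> str:
    """Label a date by year and month, e.g. 2020-08, for aggregation."""
    return d.strftime("%Y-%m")
```

For example, sleeps_bucket(10) falls in the ‘12 or under’ level, and a section name such as ‘Beach Haven Crest’ is mapped to ‘Beach Haven’.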
We loop through our dictionary and use our get_calendar function to extract all of our listings.
In the table below, we can see the mean rental rate and number of units available by month. July has the fewest available among the months of the peak period, and also the highest rates. We can also see that the average size of houses rented is higher outside the peak period.
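A monthly summary of this kind is a group-by followed by a mean and a count. The post does this with pandas; the sketch below shows the same idea with only the standard library. The rows are made-up numbers for illustration and are not taken from the scraped data, and the function and key names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_month(rows):
    """Group rows by 'month_year' and return mean rate and unit count."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["month_year"]].append(row["rate"])
    return {
        month: {"mean_rate": mean(rates), "units": len(rates)}
        for month, rates in sorted(grouped.items())
    }


# Made-up rows, one per available week of a listing.
rows = [
    {"month_year": "2020-07", "rate": 4000.0},
    {"month_year": "2020-08", "rate": 3000.0},
    {"month_year": "2020-08", "rate": 3500.0},
]
summary = summarize_by_month(rows)
```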
We had hoped to segment and consider the prices for Oceanside, Ocean block, Bayside block and Bayfront, but landlords interpret the meaning of “Oceanside” liberally. We tend to think of that term as looking at the water from your deck, but ~60% of rentals are designated in this category, when true “Oceanside” can’t be more than 10%. In most cases, landlords probably mean Ocean block, but there is not a lot we can do to pick this apart. We also don’t have the exact addresses, so we are probably out of luck to find anything useful in this regard.
Biggest Rental Towns by Volume
By far the most rental action is in the Beach Haven sections in July and August (shown in orange), but those sections also have more year-round availability than the other towns. If the plan is to go with fewer than 8 people, there are not a lot of options. In fact, it was surprising to see that more was available in the largest category (sleeps more than 12) than in the family-of-four category. As mentioned in our previous post about plotnine, the lack of support for plotly hovering is a bit of a drawback here, because it can be hard to tell which color denotes which city. This might mean we have to learn seaborn in the future, just as we have been forced to learn pandas for this post.
Availability vs Booked by City
Beach Haven has more B&B’s and some of the only hotels on the Island, so it has smaller properties on average and somewhat less consistent visitors. More rentals outside of Beach Haven are probably renewed annually, so Beach Haven might be more impacted by plans delayed due to COVID-19 than other towns. Coupled with it being about as big as all the other towns put together, this may help explain why it also shows relatively more red at this stage.
Prices for Booked Properties Peaking in July
2020 might not be a typical year with the uncertainty around COVID-19, but the fall off in prices starting in August, when there appears to be more supply, is shown here. Landlords may have pulled supply for July when things looked uncertain and then put it back on at the last minute. It also looks like the available properties might be in that category, because they are priced higher than comparable properties. At least for the bigger properties, the posted prices of available properties are clearly higher than for the booked ones. Let’s face it, if you haven’t booked your property sleeping more than 8 by now, it might be tough for most groups of that size to organize at this late stage.
Homogeneous Prices Across Cities
For anyone who has been to LBI, it is pretty much nice everywhere. Except for maybe Loveladies, there aren’t really premium towns in the sense of the NYC suburbs. Loveladies, shown in light blue, can be seen towards the higher end, but still among the pack. The main distinction is whether the house is beachfront or not, but unfortunately, we don’t have a good source of that data at this stage. The rents for the largest homes do show quite a bit more divergence among towns than the other three categories.
Most families are constrained to July and early August, but for those with the freedom to go at other times, there is a lot of opportunity to have a great vacation at an affordable price! We also know that vacationrentalslbi.com operates sites for Wildwood, North Wildwood, Wildwood Crest and Diamond Beach, so our scraper would probably work the same for all of those. Now that we have the code, we can parse listings whenever considering a vacation at the Jersey Shore.
Author: David Lucey, Founder of Redwall Analytics
David spent 25 years working with institutional global equity research with several top investment banking firms.