Data Analysis – Airbnb in London

Let’s for a minute imagine a wonderful parallel universe – where a decision I need to make is where should I buy a property in London to rent out on Airbnb (for those that don’t know me this is about as far fetched as it gets).

How could we do this? We would need some data about Airbnb prices and occupancy and have an idea about how much the property would cost. I have found two data sources online which can help here (I have no idea about their accuracy or completeness – this is purely a theoretical example please don’t use this as investment advice!).

We would also need some way of crunching the numbers, for the number crunching I decided to use Pandas – a Python data analysis library.

Step One – Import the Airbnb Usage Figures

Pandas offers exceptionally easy imports of csv files with just one line of code importing the listings.csv file into a DataFrame.

Step Two – Clean up the Data

Most of the time data is dirty, there are mistakes, missing data, spelling mistakes, duplicates and a whole host of other issues in every dataset of any size. It is simply the nature of the beast. While I could have spent a long time looking through the dataset obtained – I really only cared about a very small subset of fields.  These are listed below;

Field NameDescription
IDThe unique ID of the property
NeighbourhoodThe borough of London the Airbnb is in
PriceThe cost of the Airbnb
Availability_365The Availability of the Airbnb in the next 365 days

I filtered the DataFrame to just these fields and then removed all records with a price <=19 GBP (I refuse to believe you can rent a room in London for less!  In total 451 records were identified;

I considered these to be data quality issues and removed them.

Step Three – Basic Queries

Some simple queries were written showing the cheapest room in each borough and the average price in each borough. I also calculated the Average Revenues for the next year per borough (that is the average of (price * days not available) for the next year).  

Step Four – Additional Data Source

At this point I know which borough has the highest average forecasted revenue (Kensington and Chelsea) but this doesn’t take into the fact the cost of purchasing a property there. I want the most bang for my buck!

I imported another file averageHousePrice.csv which contains the average house price per borough. I then want to add this into my analysis.

Again as expected this dataset needed some cleaning up and I had to remove comma separators from the numeric values and cast the average price as a number.

Step Five – Average Revenue / Average Price ratio

Firstly, I joined the listings to the average house prices by borough.

It seems on this analysis anyway – that Newham is the most profitable place to have an Airbnb based on its average revenues for a year vs the average cost of housing.

Step Six– Graphs

Just to demonstrate that graphs are easily produced – I coded a couple of graphs using the matplotlib showing the figures stated above.

Conclusion

As mentioned at the start of this post – Its not designed to be taken seriously as investment advice! There are obviously some massive, short fallings in this analysis that would need to be addressed in order to make it accurate chiefly these would involve significantly better data areas of improvement could be.

  1. Using a smaller catchment area that borough – within any London Borough there will be some expensive areas and some ‘cheaper’ ones
  2. Evaluating the Airbnb’s based on their size or building type (houses are more expensive than flats) – is it the whole building or a room within a building etc.
  3. Establishing a more accurate occupancy figure. Perhaps someone puts their house on Airbnb in the summer when they are away but not for the rest of the year – leaving it with a very low occupancy rating

Despite these criticisms of my approach, one must not lose sight of the purpose of this post. This was simply an exercise in using Pandas and showing how you can quickly perform analysis with very little resources – these datasets were free and publicly available as is the Pandas library/Python, its also worth stating that no database was required as well (a major advantage of using Pandas). 

When you consider what was achieved here with such little effort it is remarkably impressive. Data was obtained, cleaned, joined to external sources with key figures calculated and plotted onto a graph all with less than 80 lines of code, while this example is just touching the surface of Pandas, I hope it does show its power somewhat. I mean who would have thought in a parallel universe I’d be buying a house in Newham.

The data used and the python code is available at AirbnbLondonDataAnalysis (github.com)