Machine Learning – Predicting Fatalities on the Titanic

Introduction

This is a sequel to a post I wrote on the data analysis of deaths on the Titanic and the correlations that can be drawn from them. I recommend you read that post before this one. The aim of this post is to use the same dataset to build a model that attempts to predict, based on passenger details, whether an individual would have survived the sinking of the Titanic.

It aims to introduce the topic of machine learning rather than to build the most accurate model possible. I am by no means an expert in this subject area!

Making a Training and Test Data Set

The first task is to split our original dataset into two. One part will be used to train the model (where we know the survival outcome of each individual). The other will be a smaller subset used to test the model (where the model will not know the survival outcome).

The dataset we are using (The Complete Titanic Dataset | Kaggle) contains 1,309 records. If we use 90% of the data for training and 10% for testing, we need a training set of 1,178 records, with the remaining 131 records for testing.

In an ideal world the selection of this training/test split would be evenly distributed across strata (roughly proportional in terms of gender, class, age and ticket fare of passengers), however as this is a small blog post I am just going to choose randomly using a pandas sampling method.

We also want to remove the outcome from our test dataset, so that this set has no known survival field. This is done with the following line of code.
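The original snippet isn't reproduced here, but a minimal sketch of the split might look like this (the tiny inline frame stands in for the real Kaggle data, and the 90/10 split uses pandas' `sample` method):

```python
import pandas as pd

# Illustrative stand-in for the loaded Kaggle dataset.
titanic = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "Survived": [1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
})

# Randomly sample 90% of the rows for training; the rest become the test set.
train = titanic.sample(frac=0.9, random_state=42)
test = titanic.drop(train.index)

# Keep the true outcomes aside, then hide them from the test set.
test_answers = test["Survived"]
test = test.drop(columns=["Survived"])

print(len(train), len(test))
```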

Data Pre-processing

First we want to drop or populate any missing values. Let's look at the null values we have in our two sets. I will show this approach on the training set, but the same updates will need to occur on the test set as well.
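A sketch of that null check, using a small illustrative frame with gaps similar to the real dataset:

```python
import pandas as pd
import numpy as np

# Illustrative frame with missing values similar to the real dataset.
train = pd.DataFrame({
    "Age": [22.0, np.nan, 24.0],
    "Cabin": [np.nan, np.nan, "C85"],
    "Embarked": ["S", "C", np.nan],
})

# Count nulls per column; the same call applies to the test set.
print(train.isnull().sum())
```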

We can see Cabin and Age are poorly populated, and Embarked and Fare are missing a couple of records.

For the purposes of tidying this up I am going to do the following.

  1. Drop the Cabin column
  2. Set missing ages to the median age
  3. Set missing embarked values to the most common embarkation location
  4. Set the missing fare to the median fare value

This is done with the following code.
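The four steps above can be sketched as follows (again with a small inline frame rather than the real data):

```python
import pandas as pd
import numpy as np

train = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan],
    "Age": [22.0, np.nan, 40.0],
    "Embarked": ["S", "S", np.nan],
    "Fare": [7.25, np.nan, 71.28],
})

# 1. Drop the Cabin column entirely.
train = train.drop(columns=["Cabin"])
# 2. Fill missing ages with the median age.
train["Age"] = train["Age"].fillna(train["Age"].median())
# 3. Fill missing embarkation values with the most common port.
train["Embarked"] = train["Embarked"].fillna(train["Embarked"].mode()[0])
# 4. Fill missing fares with the median fare.
train["Fare"] = train["Fare"].fillna(train["Fare"].median())

print(train.isnull().sum().sum())
```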

At this point both the training and test data sets have no null values.

Adding new Features

We want to create new fields, derived from the existing data, that better allow our model to find patterns. There are three new features I will create.

  1. Title – we will obtain the title from the passengers name
  2. IsAlone – a bit value that signifies if the passenger is a solo traveller
  3. Age Band – I will group the ages into bands rather than having the full range of ages to evaluate

Title

In our data the name field follows a fixed format: <Surname>, <Title>. <Given Names>

For example “Beckwith, Mr. Richard Leonard”

Because of this fixed pattern we should be able to parse the titles from the name field. The approach is to split the name value on the comma and take the element at index 1; from there, split that string on the full stop and take its initial element.

Using our example “Beckwith, Mr. Richard Leonard”:

Splitting on the comma gives the following array.

The subsequent split of element 1 on the full stop leads to this.

From here, a whitespace-trimmed element 0 contains our title. We can code this as follows.
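One way to express the comma/full-stop split with pandas string methods (the two example names are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "Name": ["Beckwith, Mr. Richard Leonard",
             "Allison, Miss. Helen Loraine"],
})

# Split on the comma, take the part after it, then split on the
# full stop and trim whitespace to isolate the title.
train["Title"] = (train["Name"]
                  .str.split(",").str[1]
                  .str.split(".").str[0]
                  .str.strip())

print(train["Title"].tolist())
```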

And we can evaluate the results by looking at the distinct titles in our data.

which shows us all the distinct titles in the training dataset.

There are a number of distinct titles here; it would be better for our model to simplify them into fewer categories, perhaps Mr, Mrs, Miss, Master and Other. We can do this with a replace statement on the newly created field.
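A sketch of that consolidation (the list of rarer titles here is illustrative, not the full set from the real data):

```python
import pandas as pd

train = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Miss", "Master", "Dr", "Rev", "Lady", "Col"],
})

# Collapse every title outside the four common ones into an "Other" bucket.
common = ["Mr", "Mrs", "Miss", "Master"]
rare = [t for t in train["Title"].unique() if t not in common]
train["Title"] = train["Title"].replace(rare, "Other")

print(sorted(train["Title"].unique()))
```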

So now we have five titles created from the name field. We can even look at survival rates by title with the following code.
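Since Survived is a 0/1 flag, its mean per group is the survival rate. A sketch with made-up rows:

```python
import pandas as pd

# Illustrative rows; the real frame has one row per passenger.
train = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mrs", "Miss", "Master", "Other"],
    "Survived": [0, 0, 1, 1, 1, 1],
})

# Mean of the 0/1 Survived flag gives the survival rate per title.
rates = train.groupby("Title")["Survived"].mean().sort_values(ascending=False)
print(rates)
```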

which results in the following.

This shows some things we noticed in the initial data analysis, and highlights one new factor. We showed in the analysis that women, and then children, had the highest chance of survival, and these results align with that. What is interesting is how the “Other” titles outperform their traditional counterparts. This is most likely because those titles suggest positions of wealth and power, perhaps meaning those passengers were prioritised in the evacuation, regardless of gender.

IsAlone

There are two fields in the dataset we can use to establish the size of the family travelling with a passenger: SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard). If SibSp + Parch = 0, the passenger was a solo traveller.

This field can be calculated as follows.

And it looks like travelling alone was detrimental to survival; if we group by this value we see the following.

We can now remove the Parch and SibSp fields.
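The IsAlone steps above can be sketched as follows (sample rows are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [2, 0, 1, 0],
    "Survived": [0, 0, 1, 1],
})

# A passenger with no siblings/spouses and no parents/children
# aboard is a solo traveller.
train["IsAlone"] = ((train["SibSp"] + train["Parch"]) == 0).astype(int)

# Survival rate grouped by the new flag.
print(train.groupby("IsAlone")["Survived"].mean())

# The raw family-size fields are no longer needed.
train = train.drop(columns=["SibSp", "Parch"])
```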

Age Banding

We want ordinal variables rather than continuous ones in our analysis. We don't want to treat everyone aged 12 as different from those aged 13, but we do want to treat a 12-year-old as different from a 40-year-old. Because of this we will band the ages into different categories. The code below achieves this.

First I created an age band field using the cut method built into pandas. Using the output values from that, I set up age groupings from 0-4, so our age column now consists of values 0-4 representing the different bands.
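The banding described above can be sketched with `pd.cut` (five equal-width bands; the sample ages are illustrative):

```python
import pandas as pd

# Illustrative ages standing in for the Age column.
train = pd.DataFrame({"Age": [2, 15, 28, 45, 70]})

# First inspect the five equal-width bands pd.cut produces...
train["AgeBand"] = pd.cut(train["Age"], 5)

# ...then replace Age with the ordinal band code (0-4) and drop the helper.
train["Age"] = pd.cut(train["Age"], 5, labels=False)
train = train.drop(columns=["AgeBand"])

print(train["Age"].tolist())
```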

Considering Class

Now that we have age shown as bands (0-4) and class already banded (1-3), it might be a good approach to combine these two fields into one field, Age*Class, as a numeric value to be considered in our model.

This is easily done as follows.
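A minimal sketch of the interaction feature (the banded values here are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "Age": [0, 2, 4],      # banded ages (0-4)
    "Pclass": [1, 2, 3],   # passenger class (1-3)
})

# Multiply the two ordinal fields into a single combined feature.
train["Age*Class"] = train["Age"] * train["Pclass"]

print(train[["Age", "Pclass", "Age*Class"]].head())
```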

If we print the head of age, class and age*class we now see the following.

Converting Categories to Numerics

Machine learning requires our categorical data to be converted into numeric values. If we look at our training dataframe currently we see this.

Sex, Embarked and Title need to be converted to numeric values. Thankfully pandas has an inbuilt method to achieve this: get_dummies. This method will give us a new column for each distinct column value. So for our embarkation locations we would convert this.

to this.

The code below performs this transformation for titles and embarkation locations.
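That transformation can be sketched with `pd.get_dummies` (sample rows are illustrative):

```python
import pandas as pd

train = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Other"],
    "Embarked": ["S", "C", "Q"],
})

# One-hot encode the categorical columns: each distinct value
# becomes its own 0/1 column (e.g. Embarked_S, Title_Mr).
train = pd.get_dummies(train, columns=["Title", "Embarked"])

print(sorted(train.columns))
```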

If we print the head of the dataframe we can now see this.

We also need to convert Fare into a banded (ordinal) category, as we want to group similar fares together exactly as we did with ages. This is shown below, with the bands coming from the cut method.
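A sketch of fare banding in the same style as the age bands (the choice of four equal-width bands here is an assumption for illustration, as is the sample data):

```python
import pandas as pd

train = pd.DataFrame({"Fare": [7.25, 8.05, 26.0, 71.28, 151.55]})

# Band fares the same way as ages: pd.cut with labels=False gives
# an ordinal band index instead of the raw continuous fare.
train["Fare"] = pd.cut(train["Fare"], 4, labels=False)

print(train["Fare"].tolist())
```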

Finally we need to convert the Sex column to a numeric value (0 for male, 1 for female).
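One simple way to do this mapping:

```python
import pandas as pd

train = pd.DataFrame({"Sex": ["male", "female", "female"]})

# Map the two categories onto 0/1.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

print(train["Sex"].tolist())
```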

At this point our dataset looks ready for modelling. Every field is numeric and is split appropriately by category. Below is the head of the training dataset.

Modelling

I am going to use the scikit-learn library for machine learning. It comes with a ton of classifiers for training models, each taking a different approach; however, to keep this blog post fairly concise I am just going to pick one, the Random Forest Classifier. I feel this should be a good fit as it is based on decision trees and our data should suit that approach relatively well. For further information on this algorithm visit this page (Random forest – Wikipedia).

First I need to split the data into independent variables (X) and response variables (Y). In our case the response variable is the Survived column and the independent variables are everything else. We want our model to classify survival based on all the other variables; once we have that, we will use the X_test data to obtain predictions from the model.

Lets set up these new data frames.
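A sketch of that split (the fully numeric frame below is illustrative):

```python
import pandas as pd

# Illustrative fully-numeric training frame.
train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": [1, 0, 1, 0],
    "Age": [1, 2, 3, 1],
    "Survived": [1, 0, 1, 0],
})

# The response variable is Survived; everything else is a predictor.
X_train = train.drop(columns=["Survived"])
Y_train = train["Survived"]

print(X_train.shape, Y_train.shape)
```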

We can now apply our classification.
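A minimal sketch of fitting and predicting with scikit-learn's RandomForestClassifier (tiny illustrative frames stand in for the prepared datasets; the hyperparameters are defaults, not the post's original settings):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative frames standing in for the prepared datasets.
X_train = pd.DataFrame({"Pclass": [1, 3, 1, 3, 2, 3],
                        "Sex":    [1, 0, 1, 0, 1, 0]})
Y_train = pd.Series([1, 0, 1, 0, 1, 0])
X_test = pd.DataFrame({"Pclass": [1, 3], "Sex": [1, 0]})

# Fit the forest on the training data, then predict on the test set.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

print(Y_pred)
```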

Here we are fitting the model on our training datasets and instructing it to predict using our X_test dataset. Y_pred contains our predictions for the test dataset.

At this point we can join back to the original dataset and compare the results to the original data, where we know whether each passenger survived or not.
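The comparison can be sketched like this (the prediction and answer values below are made up for illustration):

```python
import pandas as pd

# Illustrative: predictions versus the outcomes we held back earlier.
test_answers = pd.Series([1, 0, 1, 0])
Y_pred = [1, 0, 0, 0]

# Count matches between predictions and the known outcomes.
correct = int((test_answers == pd.Series(Y_pred)).sum())
accuracy = correct / len(test_answers)
print(correct, round(accuracy * 100))
```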

RESULTS

My results on this run were 110 correct predictions out of 131. 84% accuracy!

So the model did work to some extent, as we exceeded the 50% we would get from pure guesswork. Obviously more tuning and calibration of the model (k-fold validation, hyperparameter tuning) or different classifiers (support vector machines, CatBoost etc.) could increase this. I believe that with more work 90%+ is very possible in this situation; however, I don't really have the expertise in this field, nor the desire to invest more time in optimising it.

An example of a failure – “Miss Helen Loraine Allison”.

Miss Helen Loraine Allison's is a tragic story. To be honest, while performing this analysis it's easy to forget you are dealing with actual people's lives as you become focused on the numbers. Looking at this case in isolation you realise how sad this event actually was.

This was the first record in my failures and wasn’t chosen for dramatic effect. Below are the properties of Miss Helen Loraine Allison.

  Property   Value
  Class      1st Class
  Sex        Female
  Age        2
  Fare       151.55
  Siblings   1
  Parents    2

Based on the conditions within the model (1st class, young, female, with family), on every metric you would predict survival, and the model did. However, sadly, Helen did not survive the sinking of the Titanic.

I think this just highlights how predicting chaotic events can never truly be 100% accurate: although we've demonstrated factors that predict a significant increase in the likelihood of an event, it doesn't mean that event is guaranteed in real life. According to reports she was separated from her parents, along with her brother, at the time of the collision with the iceberg. Loraine Allison was the only child in first and second class to die.

Conclusion

Hopefully this post shows how you can create basic predictive machine learning models from known data to predict unknown values. In this example we trained on 90% of the available data to demonstrate it forecasting the remaining 10%. Such an approach can be, and is, used in a variety of industries: sales forecasts, currency and share values, marketing campaigns, medical trials and many more. Combined with AI it is a rapidly growing topic, and what is covered in this post barely scratches the surface, but it does show the preparation steps required and the use of one algorithm from the scikit-learn library to achieve pretty impressive forecasts.

To conclude, if you're on a sinking ship, it looks like your best hope is to be female, rich and young!