Introduction
The sinking of the Titanic is one of the most famous events of the last century. The “unsinkable” ship sinking on its maiden voyage was a truly humbling event, and the recent introduction of radio telegraphy allowed the story to be relayed around the world.
Datasets listing the passengers and whether they survived are readily available and are often used as an interesting example in data analysis and machine learning.
In this post we will look at this data and see if we can draw any conclusions about what would have made you more likely to survive the Titanic as a passenger.
Obtaining the Data
The data I will be using was obtained from the following page (The Complete Titanic Dataset | Kaggle). The fields contained in the dataset are defined below:
Field Name | Description |
Id | Unique Passenger ID |
pclass | Passenger class (1st, 2nd, 3rd) |
survived | A flag indicating survival (1 = Yes, 0 = No) |
name | Passenger Name |
sex | Passenger sex (male/female) |
age | Age of passenger |
sibsp | The number of siblings or spouses aboard |
parch | The number of parents/children aboard |
ticket | Passenger’s ticket number |
fare | The fare paid by the passenger |
cabin | The cabin number of the passenger |
embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
Data Types and Null Values of the Dataset
First, let’s import this file into pandas and run some high-level statistics on the dataset. The info method on the imported data frame shows us the data types and how well each field is populated.
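As a rough sketch of this step: the original file is not reproduced here, so the example below reads a few made-up rows from an in-memory string to stay runnable. The column names follow the field table above; with the real download you would pass the path of the CSV to read_csv instead.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the Kaggle CSV (not real passenger data).
# With the real file, replace the StringIO object with the path to the CSV.
csv_text = """Id,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1,1,1,Passenger A,female,29,0,0,T1,211.34,B5,S
2,3,0,Passenger B,male,,1,0,T2,7.25,,S
3,2,1,Passenger C,female,8,0,2,T3,26.00,,C
4,3,0,Passenger D,male,40,0,0,T4,7.75,,Q
"""

df = pd.read_csv(io.StringIO(csv_text))

# Prints the row count, plus the dtype and non-null count of every column.
df.info()
```

Empty cells in the CSV are read as missing values, which is what makes the non-null counts reported by info useful.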
The output shows a total of 1309 rows. We don’t know the age of all the passengers (263 don’t have an age), and the cabin field is poorly populated. This is useful to know.
We can count the missing data explicitly by summing the null values in each field.
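One way this count could be written is sketched below. The rows are made up for illustration and only a few of the dataset’s columns are included.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; NaN/None mark values that are not populated.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, np.nan, 8.0, np.nan],
    "cabin": ["B5", None, None, None],
    "embarked": ["S", "S", "C", "Q"],
})

# isnull() gives a True/False frame; summing counts the True values per column.
missing = df.isnull().sum()

# Keep only the fields that have at least one missing value.
print(missing[missing > 0])
```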
This shows us which fields are not fully populated.
Data Analysis – Categorical variables
Categorical variables are non-numeric values that might be of interest. In our case we can look at the sex, class and embarked values. These are interesting properties of passengers, but ones we cannot do arithmetic on (e.g. you can’t average a male/female value).
Survival by Sex
Firstly we can see there were a lot more men on the Titanic than women. This is easily calculated in pandas as follows.
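A minimal sketch of such a count, assuming the sex column from the field table and using made-up rows in place of the real data:

```python
import pandas as pd

# Made-up sample rows (not real passenger data).
df = pd.DataFrame({"sex": ["male", "male", "female", "male", "female"]})

# value_counts() returns the number of rows for each distinct value.
sex_counts = df["sex"].value_counts()
print(sex_counts)
```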
This returns the number of passengers of each sex.
Given that the survived value in the data is 0 or 1, we can calculate its mean per sex to get the survival rate for each sex. This could be done as follows.
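The group-by described above might look something like this. The rows are made up for illustration, so the resulting rates are not those of the real dataset.

```python
import pandas as pd

# Made-up sample rows (not real passenger data).
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male", "female", "male"],
    "survived": [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the proportion of 1s, i.e. the survival rate.
survival_by_sex = df.groupby("sex")["survived"].mean()
print(survival_by_sex)
```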
This returns the mean survival value for each sex.
Here you can see that women survived the sinking at a significantly higher rate than men. This is most likely because women were prioritised in the evacuation.
Survival by Age
The phrase “Women and Children First” is synonymous with evacuation, so let’s see if children were prioritised. First, let’s make a new field flagging records with an age of 16 or under as a child.
We also know from our earlier analysis that the age is not always known, so let’s make a new dataframe without the records for which we don’t know the age. We can then group it to look at mean survival for those aged 16 and under vs those over 16.
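The two steps above can be sketched as follows. The column name child and the variable names are my own choices for illustration, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (not real passenger data); one age is unknown.
df = pd.DataFrame({
    "age": [4.0, 16.0, 30.0, np.nan, 45.0, 12.0, 60.0],
    "survived": [1, 1, 0, 1, 0, 0, 1],
})

# Flag passengers aged 16 or under. A missing age compares as False here,
# which is why the unknown ages are removed before grouping.
df["child"] = df["age"] <= 16

# New dataframe containing only the records with a known age.
known_age = df.dropna(subset=["age"])

# Mean survival for children (True) vs everyone else (False).
survival_by_child = known_age.groupby("child")["survived"].mean()
print(survival_by_child)
```

Dropping the unknown ages matters because otherwise those passengers would silently be counted in the over-16 group.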
This returns the mean survival value for each group.
As you can see, you were more likely to survive if you were 16 or under, but the effect is not as large as that of being female.
Survival by Wealth (Class)
We can also look at mean survival rates by the class of ticket purchased, which indirectly looks at survival chance by the wealth of an individual. Let’s plot this as a graph using seaborn and related plotting libraries as follows.
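A possible version of this plot is sketched below. I am assuming a seaborn bar plot on top of matplotlib, since the exact chart type is not stated, and the rows are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample rows (not real passenger data).
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The numbers behind the bars: mean survival per class.
class_rates = df.groupby("pclass")["survived"].mean()
print(class_rates)

# barplot aggregates with the mean by default, so each bar is the
# survival rate for that class.
fig, ax = plt.subplots()
sns.barplot(data=df, x="pclass", y="survived", ax=ax)
ax.set_xlabel("Passenger class")
ax.set_ylabel("Mean survival")
# In a script (rather than a notebook), call plt.show() to display the figure.
```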
This results in a graph of the mean survival rate for each class.
So we can assume priority was given to 1st class passengers when evacuation took place.
Survival by Point of Embarkation
We can also check whether variables we don’t expect to affect survival actually do, for example the point of embarkation. Again we run a similar group-by query.
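The same pattern as before, this time grouped on the embarked column. As with the other sketches, the rows are made up and the rates are illustrative only.

```python
import pandas as pd

# Made-up sample rows (not real passenger data).
df = pd.DataFrame({
    "embarked": ["C", "C", "S", "S", "S", "Q", "Q"],
    "survived": [1, 1, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 survived flag per port of embarkation.
survival_by_port = df.groupby("embarked")["survived"].mean()
print(survival_by_port)
```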
This returns the mean survival value for each port of embarkation.
Those passengers who embarked from Cherbourg (C) have a higher survival rate than those who embarked from Queenstown or Southampton. But why?
One possible answer might be that the majority of 1st class passengers joined the ship in Cherbourg. Let’s see if this was the case. If we filter the dataframe to each embarkation port, we can plot the following.
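As a numeric alternative to filtering and plotting each port separately, the class breakdown per port can also be tabulated in one go. This is a swapped-in technique (pandas crosstab) rather than the plots described here, and the rows are made up.

```python
import pandas as pd

# Made-up sample rows (not real passenger data).
df = pd.DataFrame({
    "embarked": ["S", "S", "S", "S", "C", "C", "C", "Q", "Q"],
    "pclass": [3, 3, 2, 1, 1, 1, 3, 3, 3],
})

# Number of passengers in each class, per port of embarkation.
class_counts = pd.crosstab(df["embarked"], df["pclass"])
print(class_counts)

# The same table as proportions within each port (each row sums to 1).
class_share = pd.crosstab(df["embarked"], df["pclass"], normalize="index")
print(class_share)
```

The proportions are the more useful view here, because the ports differ a lot in how many passengers boarded at each.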
Embarked From Southampton
Note that a high number of 3rd class passengers and a low number of 1st class passengers embarked here.
Embarked From Cherbourg
Note that proportionally the number of 1st class passengers was highest embarking from Cherbourg.
Embarked From Queenstown
The vast majority of passengers embarking from Queenstown were third class.
So to conclude, while it does seem that Cherbourg passengers had a better survival rate on the Titanic, the location itself wasn’t the cause; it is simply that the majority of passengers who embarked in Cherbourg were wealthy 1st class passengers.
Data Analysis – Numeric variables
Let’s compare all the numeric fields using a heatmap of the pairwise correlations between them, to look for possible relationships.
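A sketch of how such a heatmap could be produced is below. The choice of columns (with survived first, so that it forms the top row), the colour map and the annotation format are my assumptions, and the rows are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up sample rows (not real passenger data).
df = pd.DataFrame({
    "survived": [1, 0, 1, 0, 1, 0],
    "age": [29.0, 40.0, 8.0, np.nan, 35.0, 50.0],
    "sibsp": [0, 1, 0, 0, 1, 0],
    "parch": [0, 0, 2, 0, 0, 1],
    "fare": [211.3, 7.3, 26.0, 7.8, 80.0, 8.1],
})

# Pairwise Pearson correlations; missing ages are skipped pair by pair.
numeric_cols = ["survived", "age", "sibsp", "parch", "fare"]
corr = df[numeric_cols].corr()

# Annotated heatmap with a colour scale fixed to the full -1..1 range.
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
# In a script (rather than a notebook), call plt.show() to display the figure.
```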
If we look at the top row, we can see that the only significant correlation with survived is fare (0.24), meaning that as the fare of the ticket increased, the chances of survival increased. This is pretty much the same outcome we noticed earlier when looking at the class of passenger (as 1st class is more expensive than 2nd and 3rd class).
There are other correlations, but they are very small: the number of siblings or spouses aboard shows a slight negative correlation with survival, and the number of parents or children aboard shows a slight positive correlation.
Age, as you would expect, has a negative correlation, meaning that the younger you are the higher the chance of survival, but at -0.06 it is not a significant correlation.
We can graph the age distributions / survival age distributions as follows.
In the resulting graph, the orange line shows the ages of passengers who survived, whereas the blue line shows the ages of passengers who died. You can see that younger passengers (16 and below) were more likely to survive, shown by the bump at the start of the orange line.
Conclusions
So from our analysis we can state that you were more likely to survive the Titanic if:
- You were female
- Your ticket fare was expensive / the class of ticket was better
- You were a child
These might all seem fairly sensible, but we now have underlying data that demonstrates them. Perhaps in a future post I will show how these values could be used to build a predictive model to determine whether a passenger would have survived based on their details.