MR DataSci - Titanic

Let’s have a look at the famous Kaggle dataset and build two different models to predict whether one would have survived the sinking of the Titanic. We will see if one model is significantly more accurate than the other, and then build a tool that lets you make your own predictions.

Let’s look at the dataset:
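A minimal sketch of that first look, assuming the column names of Kaggle’s train.csv (Survived, Pclass, Sex, Age and so on). The rows below are made-up stand-ins so that the snippet runs on its own; in practice you would pass the path of the downloaded file to pd.read_csv.

```python
import io

import pandas as pd

# Made-up sample rows standing in for Kaggle's train.csv (illustration only).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,Fare\n"
    "1,0,3,Passenger A,male,22,7.25\n"
    "2,1,1,Passenger B,female,38,71.28\n"
    "3,1,3,Passenger C,female,,7.92\n"
)

df = pd.read_csv(sample_csv)  # in practice: pd.read_csv("train.csv")

print(df.head())  # first rows of the table
print(df.shape)   # (number of rows, number of columns)
df.info()         # column types and non-null counts
```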

What interests us is the gender, class and age of each passenger, and whether they made it or not. The remaining columns will not be useful for training our model, so we can drop them. We first need to encode the gender and the class. Let’s start with the get_dummies() method on the Class column. We have no use for the third-class column (if a passenger is in neither first nor second class, they must be in third), so let’s drop it:
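The column selection and class encoding could look something like this. The column names (Survived, Pclass, Sex, Age) assume Kaggle’s train.csv, the Class_ prefix is my own choice, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names assume Kaggle's train.csv.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 13.0, 8.05],
})

# Keep only the columns used for modelling.
df = df[["Survived", "Pclass", "Sex", "Age"]]

# One-hot encode the class, then drop the redundant third-class column.
class_dummies = pd.get_dummies(df["Pclass"], prefix="Class", dtype=int)
class_dummies = class_dummies.drop(columns="Class_3")

# Replace the original class column with the two dummy columns.
df = pd.concat([df.drop(columns="Pclass"), class_dummies], axis=1)
print(df)
```

Note that get_dummies(drop_first=True) would drop the first class rather than the third, which is why the third-class column is dropped explicitly here.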

Then we map the gender to numeric values:
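A short sketch of the gender mapping with pandas. The encoding used here (male as 0, female as 1) is an assumption; either direction works as long as it is applied consistently.

```python
import pandas as pd

# Made-up rows; the "Sex" column name assumes Kaggle's train.csv.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Assumed encoding: male -> 0, female -> 1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(df["Sex"].tolist())
```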

We should check whether there are any missing values.
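One common way to run this check is to count the nulls in each column. The table below is a made-up stand-in, with two ages deliberately left empty.

```python
import numpy as np
import pandas as pd

# Made-up rows with two missing ages (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```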

Looks like there are just a few in the age column. Let’s fill them with the mean age of the passengers.
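Filling the gaps with the mean age can be sketched as follows, again on made-up rows and assuming the column is called Age.

```python
import numpy as np
import pandas as pd

# Made-up ages, two of them missing (illustration only).
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# mean() skips missing values, so this is the mean of the known ages.
mean_age = df["Age"].mean()
df["Age"] = df["Age"].fillna(mean_age)
print(df["Age"].tolist())
```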

Now let’s split our data and standardize our inputs.
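A sketch of the split and standardization with scikit-learn. The 80/20 split, the random seed and the column names are assumptions, and the table is synthetic so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned table (not real passenger data).
rng = np.random.default_rng(0)
n = 200
pclass = rng.integers(1, 4, n)
df = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),          # assumed: 0 = male, 1 = female
    "Age": rng.uniform(1, 70, n),
    "Class_1": (pclass == 1).astype(int),
    "Class_2": (pclass == 2).astype(int),
    "Survived": rng.integers(0, 2, n),
})

X = df[["Sex", "Age", "Class_1", "Class_2"]]
y = df["Survived"]

# Hold out 20% for testing (split size and seed are arbitrary choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step.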

Looks like we’re good to go. Let’s try two different solutions:

Logistic Regression
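A minimal sketch of the logistic regression step, using scikit-learn with its default settings (the post does not say which parameters were used). The data is synthetic, so the accuracy it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not real passengers): women survive more often.
rng = np.random.default_rng(0)
n = 400
sex = rng.integers(0, 2, n)                # assumed: 0 = male, 1 = female
age = rng.uniform(1, 70, n)
pclass = rng.integers(1, 4, n)
X = np.column_stack([sex, age, pclass == 1, pclass == 2]).astype(float)
y = (rng.random(n) < np.where(sex == 1, 0.75, 0.20)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model on the training set and score it on the held-out set.
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

log_reg_accuracy = accuracy_score(y_test, log_reg.predict(X_test_scaled))
print(f"Logistic regression accuracy: {log_reg_accuracy:.3f}")
```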

K-Nearest Neighbors
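The same experiment with a K-Nearest Neighbors classifier might look like this. The choice of five neighbors is an assumption (it is scikit-learn’s default), and the data is synthetic again.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not real passengers): women survive more often.
rng = np.random.default_rng(0)
n = 400
sex = rng.integers(0, 2, n)                # assumed: 0 = male, 1 = female
age = rng.uniform(1, 70, n)
pclass = rng.integers(1, 4, n)
X = np.column_stack([sex, age, pclass == 1, pclass == 2]).astype(float)
y = (rng.random(n) < np.where(sex == 1, 0.75, 0.20)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN is distance-based, so it relies on the standardized inputs.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

knn_predictions = knn.predict(X_test_scaled)
knn_accuracy = accuracy_score(y_test, knn_predictions)
print(f"KNN accuracy: {knn_accuracy:.3f}")
```

Comparing this figure with the logistic regression accuracy on the same held-out rows is what tells you which model to keep.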

It seems like K-Nearest Neighbors performs a bit better than Logistic Regression, so we will feed it new data and see how my better half and I would have fared.

I’m 33, male, and my wife would not let me book us anything other than first class (presumably because it would improve our chances of survival, which is hard to argue against). Let’s see how we did:
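Feeding new passengers to a fitted model could be sketched as below. The feature order (Sex, Age, Class_1, Class_2) and the gender encoding are assumptions, the model is trained on synthetic data with no train/test split for brevity, and the wife’s age is a placeholder because the post does not give it. The probabilities it prints therefore say nothing about the real odds.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in training data (not real passengers).
rng = np.random.default_rng(0)
n = 400
sex = rng.integers(0, 2, n)                # assumed: 0 = male, 1 = female
age = rng.uniform(1, 70, n)
pclass = rng.integers(1, 4, n)
X = np.column_stack([sex, age, pclass == 1, pclass == 2]).astype(float)
y = (rng.random(n) < np.where(sex == 1, 0.75, 0.20)).astype(int)

scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X), y)

# Columns: Sex, Age, Class_1, Class_2
new_passengers = np.array([
    [0, 33, 1, 0],  # the author: male, 33, first class
    [1, 33, 1, 0],  # his wife: female, first class; age is a placeholder
], dtype=float)

# New rows must go through the same scaler as the training data.
survival_proba = knn.predict_proba(scaler.transform(new_passengers))[:, 1]
for label, proba in zip(["husband", "wife"], survival_proba):
    print(f"{label}: estimated survival probability {proba:.2f}")
```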

Well… I don’t like my odds. She, on the other hand, will be fine (after a period of mourning).

I assume my probable demise is a consequence of the “women & children first” rule aboard. Let’s check this assumption: what is the survival rate of women vs. men, and the survival rate of passengers under 14?
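These rates can be computed with a groupby and a simple filter. The sketch below uses a handful of made-up rows and assumes the Kaggle column names, so its output will not match the real figures.

```python
import pandas as pd

# Made-up rows (illustration only); column names assume Kaggle's train.csv.
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "Sex": ["female"] * 4 + ["male"] * 4 + ["male", "female"],
    "Age": [30, 25, 40, 50, 22, 35, 60, 28, 5, 9],
    "Pclass": [1, 2, 1, 3, 3, 2, 3, 1, 2, 3],
})

# Survival rate per gender: the mean of a 0/1 column is a rate.
rate_by_sex = df.groupby("Sex")["Survived"].mean()

# Survival rate of passengers under 14.
children = df[df["Age"] < 14]
child_rate = children["Survived"].mean()

# Class breakdown of the children who did not survive.
lost_children_by_class = children[children["Survived"] == 0]["Pclass"].value_counts()

print(rate_by_sex)
print(f"Under 14: {child_rate:.2%}")
print(lost_children_by_class)
```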

The survival rate of women was 74.2%, while that of men was only 18.9%. For passengers under 14 it was 58.44%, although it should be noted that almost all of those who did not make it were in third class.

Where you were staying definitely had some bearing on the outcome, although clearly not as much as your gender.

I’ll just book a flight next time…

Dataset from Kaggle.