This analysis explores the Titanic dataset from Kaggle. It was my first experience applying machine learning techniques outside of academia, and it opened my eyes to practical applications beyond the theoretical concepts. I worked through the dataset independently while taking a Categorical Analysis course, which focuses on the analysis of categorical variables.
What is the Titanic Dataset?
The Titanic data set contains columns for different parameters that might affect a passenger's survival, such as whether the passenger survived, the passenger's sex, the passenger class, the number of siblings or spouses onboard, and the number of parents or children onboard.
The data set comes split into a training set and a test set. The training set is for training the model, and the test set is for testing the accuracy of the model.
Titanic Dataset Available on Kaggle
Visualization of Dataset
Now we can start looking at some graphs. Visualizations are a great way to look at data from a different perspective.
Number of Passengers Survived vs Not Survived
Let’s start with something simple like looking at the number of passengers who survived vs not survived.
The graph above shows that far more passengers did not survive than survived.
Expressed as proportions:
Survived | Proportion |
---|---|
Yes | 38.4% |
No | 61.6% |
About 61.6% of passengers did not survive, and only 38.4% survived.
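The proportions in the table can be computed with base R's `table` and `prop.table`. The sketch below is a minimal illustration on a made-up vector of outcomes (`survived_toy` is hypothetical, not the Kaggle data); with the real data, the same calls would presumably be applied to the Survived column.

```r
# Hypothetical toy outcomes (1 = survived, 0 = did not survive); not the real data
survived_toy <- c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0)

# Count each outcome, then convert the counts to proportions
counts <- table(survived_toy)
proportions <- prop.table(counts)

# Express the proportions as percentages rounded to one decimal place
round(100 * proportions, 1)
```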
Different Passenger Classes
Now that we have seen how many passengers survived vs not survived, let’s take a look at other variables. There is a variable called Pclass, which is the passenger class. Since Pclass also serves as a proxy for the passenger’s socioeconomic status, it is an ordinal categorical variable. It has 3 levels:
- Pclass of 1: Upper Class Passengers
- Pclass of 2: Middle Class Passengers
- Pclass of 3: Lower Class Passengers
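One way to make the ordering explicit in R is an ordered factor. This is only a sketch on a hypothetical toy vector (`pclass_toy` is not the real data), and it is not what the model later in this post does, since that model uses Pclass as a plain number.

```r
# Hypothetical toy passenger classes; not the real data
pclass_toy <- c(3, 1, 3, 2, 1)

# Ordered factor: levels run from upper (1) to lower (3) class
pclass_ordered <- factor(pclass_toy,
                         levels = c(1, 2, 3),
                         labels = c("Upper", "Middle", "Lower"),
                         ordered = TRUE)

# Order comparisons are defined for ordered factors
pclass_ordered < "Lower"
```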
Interestingly, there are more first-class passengers than second-class passengers. I was expecting something closer to a 50/50 split between the first and second classes. As expected, there are more passengers in the lower class than in the upper and middle classes.
A large number of lower-class passengers did not survive compared to lower-class passengers that did survive. The middle class has slightly more passengers that did not survive compared to middle-class passengers that did survive. Interestingly, slightly more upper-class passengers did survive compared to upper-class passengers that did not survive.
Distribution of Male and Female Passengers
From the samples in the dataset, this plot shows the number of male and female passengers on the Titanic. There are a lot more male passengers than female passengers.
The number of male passengers who did not survive is much greater than the number of female passengers who did not survive.
Distribution of Age Groups for the Passengers
There are quite a few passengers aged 0-5 on board the Titanic, but fewer aged 5-10 and 10-15. From the graph, the age distribution looks somewhat right-skewed.
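Five-year age groups like the ones described here can be built with `cut`. The sketch below uses a hypothetical toy vector (`age_toy` is not the real data), and the bin edges are an assumption, since the post does not show how the plot was binned.

```r
# Hypothetical toy ages; not the real data
age_toy <- c(0.42, 4, 7, 22, 38, 80)

# Five-year bins; right = FALSE makes each bin include its lower edge, e.g. [0,5)
age_group <- cut(age_toy, breaks = seq(0, 85, by = 5), right = FALSE)

# Number of passengers per age group
table(age_group)
```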
Number of Siblings or Spouses Onboard the Titanic
Many of the lower-class passengers who were onboard without siblings or a spouse did not survive, compared to middle and upper-class passengers in the same situation. Many passengers with 2 or more siblings or spouses onboard did not survive either. From the graph, it seems that among passengers with 0 or 1 siblings or spouses onboard, more upper-class passengers survived than middle and lower-class passengers.
Number of Parents or Children Onboard the Titanic
A trend is starting to emerge from all the previous graphs. It seems like lower-class passengers are more likely to not survive, and possibly that middle-class passengers are also more likely to not survive. Upper-class passengers seem to have a higher chance of survival compared to the middle and lower classes.
Embark Location and Survival
There are three different embarkation locations:
- C: Cherbourg
- Q: Queenstown
- S: Southampton
A large proportion of passengers embarked at Southampton compared to Queenstown and Cherbourg. Among the passengers who embarked at Southampton and did not survive, lower-class passengers outnumber upper and middle-class passengers. At all 3 locations, slightly more middle-class passengers did not survive than survived, while more upper-class passengers survived than did not survive. Few to no middle-class passengers embarked at Queenstown.
Data set Processing
From the data visualization section above, we have a general idea of our data, and there are some variables with missing values. This section is about transforming our data so it becomes ready to use for our data modeling step.
> summary(titanic_data)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891 Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000 Length:891
1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000 Class :character
Median :446.0 Median :0.0000 Median :3.000 Mode :character Mode :character Median :28.00 Median :0.000 Median :0.0000 Mode :character
Mean :446.0 Mean :0.3838 Mean :2.309 Mean :29.70 Mean :0.523 Mean :0.3816
3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
Max. :891.0 Max. :1.0000 Max. :3.000 Max. :80.00 Max. :8.000 Max. :6.0000
NA's :177
Fare Cabin Embarked
Min. : 0.00 Length:891 Length:891
1st Qu.: 7.91 Class :character Class :character
Median : 14.45 Mode :character Mode :character
Mean : 32.20
3rd Qu.: 31.00
Max. :512.33
Looking at the summary data, there are a few things we need to do first before we can use the data for modeling.
First, remove the PassengerId and Cabin columns: PassengerId is just the row index, and the Cabin column has too much missing data to be useful.
# dplyr provides select(), mutate(), and the %>% pipe used below
library(dplyr)
# Drop the PassengerId and Cabin columns
titanic_data <- select(titanic_data, -c("PassengerId", "Cabin"))
Next, we can change the name of the passengers to keep only their title.
The names in the Titanic data set follow the pattern Surname, Title. First Name. The keep_title function uses two nested sub calls: the inner call removes the surname and the comma after it, leaving Title. First Name, and the outer call removes the first period and everything after it, leaving only the title.
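# Keep only the title from names shaped like "Surname, Title. First Name"
keep_title <- function(data){
  # Inner sub drops the surname and comma; outer sub drops the first period and the rest
  return(sub('\\..*', '', sub('.*, ', '', data)))
}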
# Change the name to title of the person
titanic_data$Name <- keep_title(titanic_data$Name)
names(titanic_data)[names(titanic_data) == 'Name'] <- 'Title'
The following is the result of this function.
> titanic_data$Title[1:5]
[1] "Mr" "Mrs" "Miss" "Mrs" "Mr"
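To make the two substitution steps easier to follow, here is a small walkthrough on two sample names written in the data set's naming pattern. The intermediate variable names (`sample_names`, `without_surname`, `titles`) are only for illustration and are not part of the original code.

```r
# Sample names in the "Surname, Title. First Name" pattern
sample_names <- c("Braund, Mr. Owen Harris",
                  "Heikkinen, Miss. Laina")

# Step 1: remove the surname and the comma that follows it
without_surname <- sub("^.*, ", "", sample_names)

# Step 2: remove the first period and everything after it
titles <- sub("\\..*", "", without_surname)
titles
```

After step 1 the values are "Mr. Owen Harris" and "Miss. Laina"; after step 2 only "Mr" and "Miss" remain.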
Next, we have to make it so the model can understand what male and female passengers mean. To do this, we can just simply encode the male passengers as 0, and the female passengers as 1.
# Encode male as 0, female as 1
titanic_data <- titanic_data %>% mutate(Sex = ifelse(Sex == 'female', 1, 0))
The result of this transformation.
> titanic_data$Sex[1:5]
[1] 0 1 1 1 0
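We have to one-hot encode our Embarked and Title variables because they are nominal rather than ordinal, so their categories have no natural order.
Using the dummyVars function in the caret library, we can one-hot encode the Embarked and Title variables. After encoding, we drop one indicator column from each variable (Embarked.S and Title.the.Countess) so that those levels serve as the reference categories.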
library(caret)
# Set Sex, Embarked, Title as factor
titanic_data$Sex <- as.factor(titanic_data$Sex)
titanic_data$Embarked <- as.factor(titanic_data$Embarked)
titanic_data$Title <- as.factor(titanic_data$Title)
# Create a new dummy variable
dummy <- dummyVars('~ Embarked + Title', data=titanic_data)
# Add the dummy variable into the titanic data set
titanic_data <- cbind(titanic_data, data.frame(predict(dummy, newdata=titanic_data)))
# Drop the Embarked, Title column
titanic_data <- select(titanic_data, -c('Embarked', 'Title'))
# Drop the Embarked.S and Title.the.Countess Columns
titanic_data <- select(titanic_data, -c('Embarked.S', 'Title.the.Countess'))
The result is a data frame with expanded columns.
> colnames(titanic_data)
[1] "Survived" "Pclass" "Sex" "Age" "SibSp" "Parch" "Embarked.C" "Embarked.Q" "Title.Capt"
[10] "Title.Col" "Title.Don" "Title.Dr" "Title.Jonkheer" "Title.Lady" "Title.Major" "Title.Master" "Title.Miss" "Title.Mlle"
[19] "Title.Mme" "Title.Mr" "Title.Mrs" "Title.Ms" "Title.Rev" "Title.Sir"
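For readers without caret installed, the same idea can be sketched in base R with `model.matrix`. This is a swapped-in alternative rather than the method used above, the toy data frame is hypothetical, and the resulting column names differ from caret's (for example `EmbarkedC` rather than `Embarked.C`).

```r
# Hypothetical toy data; not the real data
toy <- data.frame(Embarked = factor(c("S", "C", "Q", "S")))

# "- 1" removes the intercept so that every level gets its own indicator column
one_hot <- model.matrix(~ Embarked - 1, data = toy)

# Drop one level (here S) so that it becomes the reference category
one_hot_reduced <- one_hot[, colnames(one_hot) != "EmbarkedS", drop = FALSE]
colnames(one_hot_reduced)
```

Dropping one indicator per variable matters for regression models: if every level keeps its own column, the columns always sum to 1 and are perfectly collinear with the intercept.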
Lastly, here is our age-old question: how do we deal with missing values in the Age column? Two common approaches are to fill the missing values with either the mean or the median of the Age column.
Here we will use the mean to fill the missing values in the Age column.
# Fill the missing value using the mean
titanic_data$Age <- ifelse(is.na(titanic_data$Age),
mean(titanic_data$Age, na.rm = TRUE),
titanic_data$Age)
As we can see here, the NA values have been filled with the mean of the Age column.
> mean(titanic_data$Age)
[1] 29.69912
> titanic_data$Age[1:10]
[1] 22.00000 38.00000 26.00000 35.00000 35.00000 29.69912 54.00000 2.00000 27.00000 14.00000
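The median alternative mentioned earlier follows the same pattern. Here is a minimal sketch on a hypothetical toy vector (`age_toy` is not the real data); the median is often preferred when a distribution is skewed, because it is less sensitive to extreme values than the mean.

```r
# Hypothetical toy ages with missing values; not the real data
age_toy <- c(22, 38, NA, 35, NA, 54)

# Median of the observed (non-missing) ages
age_median <- median(age_toy, na.rm = TRUE)

# Replace each NA with the median and keep the observed values unchanged
age_filled <- ifelse(is.na(age_toy), age_median, age_toy)
age_filled
```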
Data set Modeling
Now we can finally get to the fun part. The model we are going to use is a simple logistic regression model that includes only Pclass, Sex, Age, SibSp, and Parch.
# Using the processed data as our training data
training_data <- titanic_data
# Simple logistic model
logistic_model_1 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch,
data=training_data, family = binomial)
> # Summary for logistic model 1
> summary(logistic_model_1)
Call:
glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch,
family = binomial, data = training_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6536 -0.6147 -0.4224 0.6133 2.4324
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.473207 0.425219 5.816 6.02e-09 ***
Pclass -1.172848 0.119687 -9.799 < 2e-16 ***
Sex1 2.768189 0.198718 13.930 < 2e-16 ***
Age -0.040103 0.007778 -5.156 2.52e-07 ***
SibSp -0.334326 0.108557 -3.080 0.00207 **
Parch -0.081621 0.114688 -0.712 0.47666
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 790.33 on 885 degrees of freedom
AIC: 802.33
Number of Fisher Scoring iterations: 5
From the results above, we have the following basic regression model:
logit = 2.473207 - 1.172848*Pclass + 2.768189*Female - 0.040103*Age - 0.334326*SibSp - 0.081621*Parch
If we examine the odds of female passengers surviving by taking exp(2.768189), we get a value of 15.92976. This means that the odds of survival for female passengers are about 16 times the odds for male passengers, holding the other variables constant.
If we take the exponential of the Pclass coefficient, exp(-1.172848) = 0.3094843, this means that each one-level increase in Pclass (for example, from first to second class) multiplies the odds of survival by 0.3094843, a decrease of about 69%.
If we take the exponential of the SibSp coefficient, exp(-0.334326) = 0.7158204, this means that every additional sibling or spouse onboard multiplies the odds of survival by 0.7158204, a decrease of about 28%.
If we take the exponential of the Age coefficient, exp(-0.040103) = 0.9606905, this means that every 1-year increase in Age multiplies the odds of survival by 0.9606905, a decrease of about 4%.
If we take the exponential of the Parch coefficient, exp(-0.081621) = 0.9216212, this means that every additional parent or child onboard multiplies the odds of survival by 0.9216212, a decrease of about 8%. However, this effect is not statistically significant (p = 0.477), so we cannot be confident in this interpretation.
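The odds ratios above can all be computed in one step. The sketch below hard-codes the coefficient estimates from the model summary so that it runs on its own; the variable names (`coefs`, `odds_ratios`, `percent_change`) are only for illustration. With the fitted model available, `exp(coef(logistic_model_1))` should give the same numbers.

```r
# Coefficient estimates copied from the model summary
coefs <- c(Pclass = -1.172848, Sex1 = 2.768189, Age = -0.040103,
           SibSp = -0.334326, Parch = -0.081621)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)

# Percentage change in the odds for a one-unit increase
percent_change <- 100 * (odds_ratios - 1)

round(odds_ratios, 4)
round(percent_change, 1)
```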
To predict the probability of survival using the logistic model, we can use the built in predict function in R.
# Predicting the survival using the logistic model
test_prediction <- predict(logistic_model_1, training_data, type = 'response')
> test_prediction[1:10]
1 2 3 4 5 6 7 8 9 10
0.09432554 0.90117135 0.66377772 0.91138118 0.07951598 0.09653252 0.29625319 0.09884362 0.61699889 0.88079005
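To see where these probabilities come from, the prediction for passenger 1 can be rebuilt by hand from the coefficients. The sketch below assumes passenger 1 is a third-class male aged 22 with one sibling or spouse and no parents or children onboard (the first row of the standard Kaggle training file; only the age and sex appear in the output shown earlier, so the other values are an assumption). The variable names are only for illustration.

```r
# Coefficient estimates copied from the model summary
intercept <- 2.473207
b_pclass  <- -1.172848
b_female  <- 2.768189
b_age     <- -0.040103
b_sibsp   <- -0.334326
b_parch   <- -0.081621

# Assumed values for passenger 1: Pclass 3, male (0), age 22, SibSp 1, Parch 0
logit_1 <- intercept + b_pclass * 3 + b_female * 0 +
  b_age * 22 + b_sibsp * 1 + b_parch * 0

# The inverse logit converts log-odds into a probability
prob_1 <- plogis(logit_1)
prob_1
```

Under these assumptions the result comes out at roughly 0.094, in line with the first value returned by predict.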
The predict function in R outputs a probability. To turn that probability into a prediction of whether a passenger survived, we can use the ROC curve to choose a threshold.
Using the pROC package in R, we can find the threshold that corresponds to a specific true positive rate.
library(pROC)
# Compute the ROC curve once and reuse it
roc_obj <- roc(training_data$Survived, logistic_model_1$fitted.values)
# Pair each sensitivity (true positive rate) with the threshold that produces it
roc_values <- cbind(roc_obj$sensitivities, roc_obj$thresholds)
# Plot the ROC curve with percentage axes and the AUC printed on the plot
roc(training_data$Survived, logistic_model_1$fitted.values,
    legacy.axes = TRUE,
    plot = TRUE,
    percent = TRUE,
    print.auc = TRUE,
    xlab = "False Positive %",
    ylab = "True Positive %",
    col = rainbow(1))
This results in a ROC plot.
What we can do is pick a point toward the upper left of the curve that yields a high true positive % and a low false positive %. Let's say we pick a point around 75% true positive (this is somewhat arbitrary), which corresponds to a threshold of 0.4409982. Therefore, for any passenger with a predicted probability greater than 0.4409982, we will say they survived.
Let’s then calculate how well our model does compared to the actual values in the training set.
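The threshold lookup can also be sketched in base R without pROC. The example below uses hypothetical toy labels and probabilities (not the real data) and picks the highest threshold that still reaches a 75% true positive rate; the helper `tpr_at` and the other names are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical toy labels and predicted probabilities; not the real data
actual <- c(1, 1, 1, 1, 0, 0, 0, 0)
prob   <- c(0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1)

# True positive rate at a threshold: share of actual survivors predicted as survived
tpr_at <- function(threshold) {
  mean(prob[actual == 1] > threshold)
}

# Try each distinct predicted probability as a candidate threshold
candidates <- sort(unique(prob))
tpr <- sapply(candidates, tpr_at)

# Highest threshold that still reaches at least 75% true positives
chosen <- max(candidates[tpr >= 0.75])
chosen
```

Raising the threshold lowers both the true positive rate and the false positive rate, so choosing the highest threshold that meets the target keeps false positives as low as possible for that target.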
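> # Classify a passenger as survived when the predicted probability exceeds the threshold
> predicted_class <- ifelse(test_prediction > 0.4409982, 1, 0)
> # Proportion of predictions that match the actual values in the training set
> mean(predicted_class == training_data$Survived)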
[1] 0.7934905
This number suggests that our model got about 79% of the predictions right: out of 100 passengers, it correctly predicts the outcome for about 79. However, this is only a prediction on the training set. Realistically, there should have been a validation set for testing the model and refining the hyperparameters. It was quite fun coming back to this data set. There might be some mistakes in my interpretation of the data, and I am open to feedback.
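A holdout validation set of the kind mentioned above could be set up roughly as follows. This is only a sketch: the data are simulated (the `toy` data frame and its coefficients are made up), the 80/20 split and the 0.5 threshold are arbitrary assumptions, and none of it was run on the real Titanic data.

```r
set.seed(42)

# Hypothetical simulated data standing in for the processed training data
n <- 200
toy <- data.frame(Sex = rbinom(n, 1, 0.35),
                  Pclass = sample(1:3, n, replace = TRUE))
toy$Survived <- rbinom(n, 1, plogis(1 - 0.8 * toy$Pclass + 2 * toy$Sex))

# Hold out 20% of the rows for validation
val_idx <- sample(seq_len(n), size = round(0.2 * n))
train_set <- toy[-val_idx, ]
val_set <- toy[val_idx, ]

# Fit the model on the training rows only
fit <- glm(Survived ~ Pclass + Sex, data = train_set, family = binomial)

# Evaluate on rows the model has not seen
val_prob <- predict(fit, newdata = val_set, type = "response")
val_pred <- ifelse(val_prob > 0.5, 1, 0)
val_accuracy <- mean(val_pred == val_set$Survived)
val_accuracy
```

Accuracy measured this way is usually a more honest estimate than accuracy on the training set, because the model never saw the held-out rows while it was being fitted.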