Exploring the Titanic Dataset Using R

An image depicting a rusty ship in close proximity to an iceberg, highlighting the potential dangers and risks associated with maritime navigation.

In this analysis, I explore the Titanic dataset from Kaggle. It was my first experience applying machine learning techniques outside of academia, and it showed me practical applications that went beyond the theory. I worked through the dataset independently while taking a Categorical Analysis course, which emphasizes the analysis of categorical variables.

What is the Titanic Dataset?

The Titanic dataset contains columns for variables that might affect a passenger's survival, such as whether the passenger survived, the passenger's sex, the passenger class, the number of siblings or spouses onboard, and the number of parents or children onboard.

The dataset comes split into a training set and a test set. The training set is for fitting the model, and the test set is for checking the accuracy of the model.

Titanic Dataset Available on Kaggle

Visualization of Dataset

Now we can start looking at some graphs. Visualizations are a great way to look at data from a different perspective.

Number of Passengers Survived vs Not Survived

Let’s start with something simple like looking at the number of passengers who survived vs not survived.

The number of passengers who did not survive is higher than the number who survived.
Comparison between the number of passengers who survived and the number who did not

The graph above shows that far more passengers died than survived.

The same comparison expressed as proportions:

Survived   Proportion
Yes        38.4%
No         61.6%
Proportion of Survival computed using R

About 61.6% of passengers did not survive, and only 38.4% of passengers survived.
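A proportion table like the one above can be computed in base R with table and prop.table. The sketch below is only an illustration: instead of reading the Kaggle file, it rebuilds a stand-in Survived vector from counts that I am assuming (342 survivors out of 891 rows, which is what a mean of 0.3838 implies).

```r
# Stand-in for titanic_data$Survived: 342 ones and 549 zeros.
# These counts are an assumption derived from 891 rows and a 38.4% survival rate.
survived <- c(rep(1, 342), rep(0, 549))

# Label the outcomes, count them, and convert the counts to percentages
survival_counts <- table(factor(survived, levels = c(1, 0), labels = c("Yes", "No")))
survival_pct <- round(100 * prop.table(survival_counts), 1)

print(survival_pct)
```

With the real data, the same two calls would be applied directly to the Survived column.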

Different Passenger Classes

Now that we have seen how many passengers survived and how many did not, let’s take a look at other variables. One of them is Pclass, the passenger class. Because Pclass also serves as a proxy for the passenger’s socioeconomic status, it is an ordinal categorical variable. It has three levels:

  • Pclass of 1: Upper Class Passengers
  • Pclass of 2: Middle Class Passengers
  • Pclass of 3: Lower Class Passengers
There are more lower-class passengers onboard than middle- and upper-class passengers.
Graph of the number of passengers in each Pclass

Interestingly, there are more first-class passengers than second-class passengers; I was expecting something closer to a 50/50 split between the two. As expected, there are more passengers in the lower class than in the upper and middle classes.

More lower-class passengers did not survive than middle- and upper-class passengers. More upper-class passengers survived than lower- and middle-class passengers.
Graph of the passengers in each Pclass who survived or did not survive

A large number of lower-class passengers did not survive compared to lower-class passengers that did survive. The middle class has slightly more passengers that did not survive compared to middle-class passengers that did survive. Interestingly, slightly more upper-class passengers did survive compared to upper-class passengers that did not survive.
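The comparison described above boils down to a two-way table of class against outcome. Here is a minimal sketch using base R; the ten-row data frame is made up purely for illustration and does not reproduce the real counts.

```r
# Made-up toy rows (NOT the real Titanic counts), using the same column names
toy <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
)

# Counts of survived (1) and not survived (0) within each passenger class
class_by_survival <- table(Pclass = toy$Pclass, Survived = toy$Survived)

# Row-wise proportions: the share of each class that survived or not
survival_rate_by_class <- prop.table(class_by_survival, margin = 1)

print(class_by_survival)
print(round(survival_rate_by_class, 2))
```

Looking at row proportions rather than raw counts helps here because the classes have very different sizes.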

Distribution of Male and Female Passengers

Using the samples in the dataset, this bar chart shows the number of male passengers and the number of female passengers on the Titanic. There are many more male passengers than female passengers.

There are more male passengers onboard the Titanic than female passengers.
Bar chart of male and female passengers on the Titanic

The number of male passengers who did not survive is much greater than the number of female passengers who did not survive.

More Male passengers did not survive than female passengers.
Graph of male and female survived vs not survived

Distribution of Age Groups for the Passengers

There are quite a few passengers aged 0-5 on board the Titanic, but fewer passengers aged 5-10 and 10-15. From the graph, the age distribution looks somewhat right-skewed.

Most passengers onboard Titanic are around the age of 20-35 years old.
Graph of number of passengers in each age groups
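Age groups like these can be built with cut before counting. The sketch below assumes 5-year bins from 0 to 80 (the summary further down reports ages between 0.42 and 80); the ages themselves are a handful of made-up values, and the exact binning behind the original graph is not shown, so treat this as one possible way to do it.

```r
# A few illustrative ages (made up), including a missing value
ages <- c(0.42, 4, 5, 19, 22, 28, 35, 38, 54, 80, NA)

# 5-year bins: [0,5), [5,10), ..., [75,80]; the last bin is closed so 80 is kept
breaks <- seq(0, 80, by = 5)
age_group <- cut(ages, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Count passengers per bin, keeping track of missing ages
age_counts <- table(age_group, useNA = "ifany")
print(age_counts)
```

Setting right = FALSE puts a 5-year-old in the 5-10 group rather than the 0-5 group, which matches how the groups are described above.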

Number of Siblings and Spouses Onboard the Titanic

Among passengers traveling without any siblings or a spouse, many more lower-class passengers died than middle- and upper-class passengers. Many passengers with two or more siblings or spouses onboard also did not survive. From the graph, it seems that upper-class passengers with zero or one sibling or spouse onboard survived more often than middle- and lower-class passengers in the same situation.

Many passengers onboard the Titanic did not bring their siblings or spouses.
Graph of passenger that was onboard the Titanic with or without their sibling and spouse

Number of Parents and Children Onboard the Titanic

I believe there is starting to be a trend here from all the previous graphs. It seems like lower-class passengers are more likely to not survive, and possibly that middle-class passengers are also more likely to not survive. Upper-class passengers seem to have a higher chance of survival compared to the middle and lower class.

Graph of passengers onboard Titanic with or without their children by pclass

Embark Location and Survival

There are three different embarkation locations:

  • C: Cherbourg
  • Q: Queenstown
  • S: Southampton
Most passengers onboard the Titanic embarked at Southampton.
Graph of embarked location and survived or not by pclass

A large proportion of passengers embarked at Southampton compared to Queenstown and Cherbourg. Among the passengers who embarked at Southampton and did not survive, lower-class passengers outnumber upper- and middle-class passengers. At all three locations, slightly more middle-class passengers died than survived, while more upper-class passengers survived than died. Few, if any, middle-class passengers embarked at Queenstown.

Data set Processing

From the data visualization section above, we have a general idea of our data, and there are some variables with missing values. This section is about transforming our data so it becomes ready to use for our data modeling step.

> summary(titanic_data)
  PassengerId       Survived          Pclass          Name               Sex                 Age            SibSp           Parch           Ticket         
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891         Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character   Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character   Mode  :character   Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                                         Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                                         3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                                         Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
                                                                                        NA's   :177                                                        
      Fare           Cabin             Embarked        
 Min.   :  0.00   Length:891         Length:891        
 1st Qu.:  7.91   Class :character   Class :character  
 Median : 14.45   Mode  :character   Mode  :character  
 Mean   : 32.20                                        
 3rd Qu.: 31.00                                        
 Max.   :512.33                                                    

Looking at the summary data, there are a few things we need to do first before we can use the data for modeling.

First, remove the PassengerId and Cabin columns: PassengerId is just the row index, and the Cabin column has too much missing data to be useful.

library(dplyr)

# Drop the PassengerId and Cabin columns
titanic_data <- select(titanic_data, -c("PassengerId", "Cabin"))

Next, we can change the name of the passengers to keep only their title.

The names in the Titanic dataset follow the pattern Surname, Title. First Name. The keep_title function first strips the surname and the comma, which leaves Title. First Name; a second sub call then removes everything from the period onward, so only the title is returned.

# Function that only keeps the title of the person
keep_title <- function(data){
  
  # Inner sub drops "Surname, "; outer sub drops the first "." and everything after it
  return(sub('\\..*', '', sub('.*, ', '', data)))
  
}

# Change the name to title of the person
titanic_data$Name <- keep_title(titanic_data$Name)
names(titanic_data)[names(titanic_data) == 'Name'] <- 'Title'

The following is the result of this function.

> titanic_data$Title[1:5]
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"  
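To make the two substitution steps easier to follow, here is a small standalone sketch that keeps the intermediate result. The names are illustrative strings in the Surname, Title. First Name format, and the helper keep_title_stepwise is a hypothetical name used only for this example.

```r
# Hypothetical step-by-step version of the title extraction
keep_title_stepwise <- function(full_name) {
  # Step 1: drop everything up to and including ", " (the surname)
  without_surname <- sub(".*, ", "", full_name)
  # Step 2: drop the first period and everything after it
  sub("\\..*", "", without_surname)
}

# Illustrative names in the "Surname, Title. First Name" format
sample_names <- c("Braund, Mr. Owen Harris",
                  "Heikkinen, Miss. Laina",
                  "Doe, the Countess. Jane")

titles <- keep_title_stepwise(sample_names)
print(titles)
```

Multi-word titles such as "the Countess" survive intact because only the text before the first period is kept.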

Next, we have to make it so the model can understand what male and female passengers mean. To do this, we can just simply encode the male passengers as 0, and the female passengers as 1.

# Encode male as 0, female as 1
titanic_data <- titanic_data %>% mutate(Sex = ifelse(Sex == 'female', 1, 0))

The result of this transformation.

> titanic_data$Sex[1:5]
[1] 0 1 1 1 0

We have to one-hot encode our Embarked and Title variables because they are nominal rather than ordinal.

Using the dummyVars function in the caret library, we can one-hot encode the Embarked and Title variables.

library(caret)

# Set Sex, Embarked, Title as factor
titanic_data$Sex <- as.factor(titanic_data$Sex)
titanic_data$Embarked <- as.factor(titanic_data$Embarked)
titanic_data$Title <- as.factor(titanic_data$Title)

# Create a new dummy variable
dummy <- dummyVars('~ Embarked + Title', data=titanic_data)

# Add the dummy variable into the titanic data set
titanic_data <- cbind(titanic_data, data.frame(predict(dummy, newdata=titanic_data)))

# Drop the Embarked, Title column
titanic_data <- select(titanic_data, -c('Embarked', 'Title'))

# Drop the Embarked.S and Title.the.Countess Columns
titanic_data <- select(titanic_data, -c('Embarked.S', 'Title.the.Countess'))

The result is a data frame with expanded columns.

> colnames(titanic_data)
 [1] "Survived"       "Pclass"         "Sex"            "Age"            "SibSp"          "Parch"          "Embarked.C"     "Embarked.Q"     "Title.Capt"    
[10] "Title.Col"      "Title.Don"      "Title.Dr"       "Title.Jonkheer" "Title.Lady"     "Title.Major"    "Title.Master"   "Title.Miss"     "Title.Mlle"    
[19] "Title.Mme"      "Title.Mr"       "Title.Mrs"      "Title.Ms"       "Title.Rev"      "Title.Sir" 
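For readers without caret, the same idea can be sketched with base R's model.matrix. This is a swapped-in alternative rather than the code used above, it runs on a made-up four-row data frame, and it produces slightly different column names (EmbarkedC instead of Embarked.C).

```r
# Made-up toy data with one nominal column
toy <- data.frame(Embarked = factor(c("S", "C", "Q", "S")))

# "- 1" removes the intercept, so every level gets its own 0/1 column
embarked_dummies <- model.matrix(~ Embarked - 1, data = toy)

# Drop one level (here S) as the reference category
keep <- colnames(embarked_dummies) != "EmbarkedS"
embarked_reduced <- embarked_dummies[, keep, drop = FALSE]

print(embarked_dummies)
print(embarked_reduced)
```

One column per variable is dropped because the full set of indicator columns always sums to 1, which makes it perfectly collinear with the intercept of a regression model.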

Lastly, here is our age-old question: how do we deal with missing values in the Age column? Two common options are to fill the missing values with either the mean or the median of the Age column.

Here we will use the mean to fill the missing values in the Age column.

# Fill the missing value using the mean
titanic_data$Age <- ifelse(is.na(titanic_data$Age), 
                           mean(titanic_data$Age, na.rm = TRUE),
                           titanic_data$Age)

As we can see here, the NA values have been filled with the mean of the Age column.

> mean(titanic_data$Age)
[1] 29.69912
> titanic_data$Age[1:10]
[1] 22.00000 38.00000 26.00000 35.00000 35.00000 29.69912 54.00000  2.00000 27.00000 14.00000
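The sketch below compares mean and median imputation side by side on a tiny made-up vector (not the real Age column). It also shows why the mean reported above is the same as before the fill: replacing missing values with the mean leaves the mean unchanged.

```r
# Made-up ages with two missing values
age <- c(22, 38, NA, 35, NA, 54)

mean_age   <- mean(age, na.rm = TRUE)    # mean of the observed values
median_age <- median(age, na.rm = TRUE)  # median of the observed values

# Mean imputation, as in the code above
age_mean_filled <- ifelse(is.na(age), mean_age, age)

# Median imputation, an alternative that is less sensitive to outliers
age_median_filled <- replace(age, is.na(age), median_age)

print(age_mean_filled)
print(age_median_filled)
print(mean(age_mean_filled) == mean_age)
```

Either way, the filled column has less variance than the true ages would have, which is worth keeping in mind when interpreting the Age coefficient later.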

Data set Modeling

Now we can finally get to the fun part. The model we are going to use is a simple logistic regression model that only includes Pclass, Sex, Age, SibSp, and Parch.

# Using the processed data as our training data
training_data <- titanic_data

# Simple logistic model
logistic_model_1 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch, 
                        data=training_data, family = binomial)
> # Summary for logistic model 1
> summary(logistic_model_1)

Call:
glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch, 
    family = binomial, data = training_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6536  -0.6147  -0.4224   0.6133   2.4324  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.473207   0.425219   5.816 6.02e-09 ***
Pclass      -1.172848   0.119687  -9.799  < 2e-16 ***
Sex1         2.768189   0.198718  13.930  < 2e-16 ***
Age         -0.040103   0.007778  -5.156 2.52e-07 ***
SibSp       -0.334326   0.108557  -3.080  0.00207 ** 
Parch       -0.081621   0.114688  -0.712  0.47666    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  790.33  on 885  degrees of freedom
AIC: 802.33

Number of Fisher Scoring iterations: 5

From the results above, we have the following fitted logistic regression model:

logit = 2.473207 - 1.172848*Pclass + 2.768189*Female - 0.040103*Age - 0.334326*SibSp - 0.081621*Parch

Exponentiating a coefficient gives an odds ratio. For female passengers, exp(2.768189) = 15.92976, so the odds of survival for a female passenger are about 16 times the odds for a male passenger, holding the other variables constant.

For Pclass, exp(-1.172848) = 0.3094843, so each one-step increase in Pclass (for example, from first to second class) multiplies the odds of survival by 0.3094843, a decrease of about 69%.

For SibSp, exp(-0.334326) = 0.7158204, so each additional sibling or spouse onboard multiplies the odds of survival by 0.7158204, a decrease of about 28%.

For Age, exp(-0.040103) = 0.9606905, so each additional year of age multiplies the odds of survival by 0.9606905, a decrease of about 4%.

For Parch, exp(-0.081621) = 0.9216212, so each additional parent or child onboard multiplies the odds of survival by 0.9216212, a decrease of about 8%. However, this effect is not statistically significant (p = 0.477), so we cannot be confident in this interpretation.
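The exponentiated values above can be reproduced in one step. The sketch below copies the estimates from the model summary into a named vector so that it runs on its own; with the fitted model in memory, exp(coef(logistic_model_1)) would do the same job.

```r
# Coefficient estimates copied from the summary output (intercept left out)
coefs <- c(Pclass = -1.172848,
           Sex1   =  2.768189,
           Age    = -0.040103,
           SibSp  = -0.334326,
           Parch  = -0.081621)

# Odds ratio per one-unit increase, and the same thing as a percent change in odds
odds_ratio <- exp(coefs)
pct_change <- 100 * (odds_ratio - 1)

print(round(cbind(odds_ratio, pct_change), 3))
```

An odds ratio below 1 means the odds shrink as the variable increases, so reading it as a percent change is often easier than reading it as a multiplier.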

To predict the probability of survival using the logistic model, we can use the built in predict function in R.

# Predicting the survival using the logistic model
test_prediction <- predict(logistic_model_1, training_data, type = 'response')
> test_prediction[1:10]
         1          2          3          4          5          6          7          8          9         10 
0.09432554 0.90117135 0.66377772 0.91138118 0.07951598 0.09653252 0.29625319 0.09884362 0.61699889 0.88079005 
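To see what predict is doing, the first probability can be reproduced by hand from the coefficients. In this sketch the passenger's Sex (0, male) and Age (22) come from the output shown earlier; Pclass = 3, SibSp = 1, and Parch = 0 are values I am assuming for the first row of the Kaggle training file.

```r
# Coefficient estimates copied from the model summary
coefs <- c(Intercept =  2.473207,
           Pclass    = -1.172848,
           Sex1      =  2.768189,
           Age       = -0.040103,
           SibSp     = -0.334326,
           Parch     = -0.081621)

# First passenger; Pclass, SibSp and Parch are assumed values (see above)
passenger <- c(Intercept = 1, Pclass = 3, Sex1 = 0, Age = 22, SibSp = 1, Parch = 0)

# Linear predictor (log-odds), then the inverse logit to get a probability
log_odds    <- sum(coefs * passenger)
probability <- plogis(log_odds)

print(log_odds)
print(probability)
```

The result agrees with the first value returned by predict (about 0.094), which confirms that type = 'response' applies the inverse logit to the linear predictor.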

The predict function in R outputs a probability. To predict whether a passenger survived or not, we can use the ROC curve to choose a threshold.

Using the pROC package in R, we can find the threshold that corresponds to a specific true positive rate.

library(pROC)

# Compute the ROC object once, then pair each sensitivity with its threshold
roc_object <- roc(training_data$Survived, logistic_model_1$fitted.values)
roc_values <- cbind(roc_object$sensitivities, roc_object$thresholds)

# Plot the ROC curve
roc(training_data$Survived, logistic_model_1$fitted.values, 
    legacy.axes = TRUE,
    plot = TRUE,
    percent = TRUE,
    print.auc = TRUE,
    xlab = "False Positive %", 
    ylab = "True Positive %", 
    col = rainbow(1))

This results in a ROC plot.

What we can do is pick a point near the upper left of the curve that yields a high true positive rate and a low false positive rate. Let's say we pick a point at around a 75% true positive rate (this is somewhat arbitrary), which corresponds to a threshold of 0.4409982. Any passenger with a predicted probability greater than 0.4409982 is then classified as having survived.
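Applying the threshold is a one-liner. The sketch below uses the ten predicted probabilities printed earlier so that it runs on its own; the variable names are hypothetical.

```r
# The first ten predicted probabilities from the model output
predicted_prob <- c(0.09432554, 0.90117135, 0.66377772, 0.91138118, 0.07951598,
                    0.09653252, 0.29625319, 0.09884362, 0.61699889, 0.88079005)

# Threshold chosen from the ROC curve
threshold <- 0.4409982

# 1 = predicted to survive, 0 = predicted not to survive
predicted_class <- ifelse(predicted_prob > threshold, 1, 0)
print(predicted_class)

# A stricter threshold labels fewer passengers as survivors
print(sum(predicted_prob > 0.7))
```

Raising the threshold trades true positives for fewer false positives, which is exactly the trade-off the ROC curve visualizes.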

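Let’s then calculate how well our model does compared to the actual values in the training set.

> # Convert the probabilities to 0/1 predictions using the chosen threshold
> predicted_class <- ifelse(test_prediction > 0.4409982, 1, 0)
> # Compare the correct number of predictions to the training set
> sum(predicted_class == training_data$Survived)/dim(training_data)[1]
[1] 0.7934905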

This number suggests that our model got about 79% of the predictions right: out of 100 passengers, it correctly predicts the outcome for about 79. However, this is only a prediction on the training set. Realistically, there should have been a validation set for testing the model and refining the hyperparameters. It was quite fun coming back to this dataset. There might be some mistakes in my interpretation of the data, and I am open to feedback.
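A validation set like the one mentioned above could be set up roughly as follows. This is only a sketch on simulated data: the toy data frame, the 80/20 split, the seed, and the 0.5 threshold are all assumptions made for illustration, not part of the original analysis.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Simulated stand-in data (NOT the Titanic data)
n   <- 200
toy <- data.frame(Age = runif(n, 1, 80), Sex = rbinom(n, 1, 0.35))
toy$Survived <- rbinom(n, 1, plogis(-1 + 2.5 * toy$Sex - 0.02 * toy$Age))

# Hold out 20% of the rows for validation
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_set <- toy[train_idx, ]
valid_set <- toy[-train_idx, ]

# Fit on the training rows only, then score the held-out rows
fit         <- glm(Survived ~ Sex + Age, data = train_set, family = binomial)
valid_prob  <- predict(fit, newdata = valid_set, type = "response")
valid_class <- ifelse(valid_prob > 0.5, 1, 0)

# Accuracy on data the model has not seen
valid_accuracy <- mean(valid_class == valid_set$Survived)
print(valid_accuracy)
```

Accuracy measured this way is usually a more honest estimate than accuracy on the training set, and the threshold itself could also be tuned on the validation rows.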

Code for Exploring the Titanic Dataset Using R