1 Introduction

The sinking of the RMS Titanic in 1912 is one of the most famous disasters in history. It has inspired a wide variety of work in numerous fields, including an Oscar-winning movie directed by James Cameron, a great number of modern Internet memes, and now a Kaggle predictive modeling competition. On that note, the goal of this project is to use methods from nonparametric statistics and statistical learning to predict whether or not a passenger survived the Titanic disaster. Many factors contribute to how likely a passenger was to survive, from age and gender to travel class, cabin number, and number of family members on board. For this project, we build a random forest model using these explanatory variables to predict Titanic survival.


2 Data

2.1 Original Data

The data for this project are provided by the Kaggle competition “Titanic - Machine Learning from Disaster”. We are given two data files: a training set and a test set. The training set consists of 891 observations, where each row represents a passenger with the following attributes: whether or not they survived, travel class (first, second, or third), name, sex, age, number of siblings or spouses on board, number of parents or children on board, ticket number, passenger fare, cabin number, and port of embarkation (Cherbourg, Queenstown, or Southampton). The test set consists of 418 cases with the same variables as the training set, except that survival status is withheld; this is what we predict.

2.2 New Features

From the given information, we engineer additional features. We first extract each passenger’s title from their name and group this variable into five categories: Mr, Mrs, Miss, Master, and Other. From the Cabin variable in our original data, we obtain each traveler’s deck, which is the first letter of the cabin number. We also compute family size by adding the number of siblings or spouses to the number of parents or children, plus one for the passenger themselves. Upon further inspection, we group family size into three levels: single, small (2 or 3 members), and large (more than 3 members). The table below displays the first few rows of the possible predictors for survival outcome.

Pclass Sex Age SibSp Parch Fare Embarked Title Deck FamSize FamType
3 male 22 1 0 7.2500 S Mr NA 2 Small
1 female 38 1 0 71.2833 C Mrs C 2 Small
3 female 26 0 0 7.9250 S Miss NA 1 Single
1 female 35 1 0 53.1000 S Mrs C 2 Small
3 male 35 0 0 8.0500 S Mr NA 1 Single
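
As a minimal sketch of this feature engineering (an abridged version of the appendix code; here passengers stands in for the combined data frame, and case_when replaces the nested if_else calls):

library(tidyverse)

passengers <- passengers %>%
  mutate(
    # title = first word in Name ending in ".", e.g. "Braund, Mr. Owen" -> "Mr"
    Title   = str_remove(str_extract(Name, "[A-Za-z]+\\."), "\\."),
    # deck = first letter of the cabin number, e.g. "C85" -> "C"
    Deck    = substr(Cabin, 1, 1),
    # family size = siblings/spouses + parents/children + the passenger
    FamSize = SibSp + Parch + 1,
    FamType = case_when(
      FamSize == 1     ~ "Single",
      FamSize %in% 2:3 ~ "Small",
      TRUE             ~ "Large"
    )
  )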

2.3 Exploratory Data Analysis

To get started, we conduct exploratory data analysis with a focus on the relationships between survival and the other variables in our data. Numerical summaries are given in the tables below. First, it is clear that women (specifically those with titles Mrs and Miss) are more likely to survive than men, which is not surprising. Interestingly, there does not seem to be a big difference in age between those who survived (mean = 28.3, median = 28) and those who did not (mean = 30.6, median = 28). We would normally expect the age distribution of the surviving passengers to be more right-skewed, since younger people (children) would be more likely to board lifeboats and hence have a higher chance of surviving than older people. However, it is worth noting that there are 177 observations with missing age values in our train data, which we deal with shortly.

Regarding passenger class and fare, we observe that a higher class and fare are associated with a greater chance of surviving the Titanic. As one may expect, wealthier passengers, those in first class who paid higher fares, were more likely to remain alive after the ship hit the iceberg than passengers in lower classes. In terms of port of embarkation, people who departed from Cherbourg are more likely to survive than those who embarked from the other two ports, Queenstown and Southampton, which have roughly the same survival proportions. We also observe that passengers traveling in a small family group have a greater chance of surviving than those traveling alone or with a large group of family members.

Sex

Sex Survived Count Proportion
female 0 81 0.2579618
female 1 233 0.7420382
male 0 468 0.8110919
male 1 109 0.1889081

Passenger Class

Pclass Survived Count Proportion
1 0 80 0.3703704
1 1 136 0.6296296
2 0 97 0.5271739
2 1 87 0.4728261
3 0 372 0.7576375
3 1 119 0.2423625

Age

Survived min Q1 median Q3 max mean sd n missing
0 1.00 21 28 39 74 30.62618 14.17211 424 125
1 0.42 19 28 36 80 28.34369 14.95095 290 52

Fare

Survived min Q1 median Q3 max mean sd n missing
0 0 7.8542 10.5 26 263.0000 22.11789 31.38821 549 0
1 0 12.4750 26.0 57 512.3292 48.39541 66.59700 342 0

Embarked

Embarked Survived Count Proportion
C 0 75 0.4464286
C 1 93 0.5535714
Q 0 47 0.6103896
Q 1 30 0.3896104
S 0 427 0.6630435
S 1 217 0.3369565

Title

Title Survived Count Proportion
Master 0 17 0.4250000
Master 1 23 0.5750000
Miss 0 55 0.2972973
Miss 1 130 0.7027027
Mr 0 438 0.8439306
Mr 1 81 0.1560694
Mrs 0 26 0.2047244
Mrs 1 101 0.7952756
Other 0 13 0.6500000
Other 1 7 0.3500000

Family

FamType Survived Count Proportion
Large 0 60 0.6593407
Large 1 31 0.3406593
Single 0 374 0.6964618
Single 1 163 0.3035382
Small 0 115 0.4372624
Small 1 148 0.5627376
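
Each count/proportion table above is produced by the same grouped-summary pattern; a minimal sketch mirroring the appendix code (replace Sex with any other grouping variable):

train %>% 
  group_by(Sex, Survived) %>% 
  summarize(Count = n()) %>%              # counts within each Sex-Survived cell
  mutate(Proportion = Count/sum(Count))   # proportions within each Sex group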

3 Methods

3.1 Missing Data

As mentioned before, since there are missing values in our data, we impute the train and test sets separately, each with a single imputation using random forests. In the train data frame, the variables with missing values are Age and Deck (extracted from Cabin); the test set additionally has a small number of missing Fare values. To prevent data leakage, we do not include the response variable, Survived, when imputing the train data.
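
A minimal sketch of this step, assuming train_predictors is the predictor-only train frame (all2[1:891, ] in the appendix):

library(mice)

set.seed(99)
# single random-forest imputation; Survived is excluded from the input
# so the response cannot leak into the imputed values
train_complete <- train_predictors %>% 
  mice(m = 1, method = "rf", print = FALSE) %>% 
  mice::complete()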

3.2 Model Building

We choose the random forest as our statistical learning technique to predict Titanic survivorship. After imputing and obtaining completed versions of our data sets, we split the train data into two subsets, training and holdout, using a 70-30 split, and use the training set to fit our model. We end up with the following call: randomForest(Survived ~ . - PassengerId, data = training, importance = TRUE, mtry = 3, maxnodes = 4, ntree = 1000). To validate the model, we obtain a confusion matrix showing how many holdout observations were correctly or incorrectly classified. Our model correctly predicted that 153 passengers would not survive and 62 would, for a total of 215 correct predictions on the holdout set. Accordingly, the overall proportion of correct predictions is 215/268 ≈ 0.802.

Actual
Predicted 0 1
0 153 39
1 14 62
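
The split, fit, and holdout evaluation follow the pattern below (a sketch consistent with the appendix code, where train_new is the imputed train set):

library(tidyverse)
library(randomForest)

set.seed(99)
training <- train_new %>% slice_sample(prop = 0.7)              # ~70% of rows
holdout  <- train_new %>% anti_join(training, by = "PassengerId")

train_rf <- randomForest(Survived ~ . - PassengerId, data = training,
                         importance = TRUE, mtry = 3, maxnodes = 4,
                         ntree = 1000)

preds <- predict(train_rf, holdout)
table(Predicted = preds, Actual = holdout$Survived)   # confusion matrix
mean(preds == holdout$Survived)                       # holdout accuracy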

4 Results

We fit a final model on the entire train set, with the same parameters as the trained model. Variable importance plots for the fitted model show that the most important predictor in our random forest model is sex, followed by title and passenger class, as these are the covariates with the highest mean decrease in accuracy and mean decrease in Gini; removing them (especially Sex) would significantly reduce the overall prediction accuracy of our model.
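
These plots come from randomForest’s built-in importance tools; with final_rf denoting the final fit:

varImpPlot(final_rf)   # MeanDecreaseAccuracy (left) and MeanDecreaseGini (right)
importance(final_rf)   # the underlying importance scores as a matrix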

Our final model’s predictions were submitted to Kaggle, earning an accuracy of 78.708%, which ranked in the top 13% of the competition.


5 Conclusion and Discussion

Overall, we used random forests to build a statistical model and predict the survival of passengers on the RMS Titanic. We obtained a reasonably good result, as our submission returned a prediction accuracy of almost 79%. Our results also indicate that sex is the most important variable in predicting survival status for Titanic travelers.

In the future, it is certainly possible to build a better model and improve our prediction accuracy. We could extract more information from the given data, as features like ticket number or cabin number could prove helpful in predicting Titanic survival. We should also try other classification techniques such as CART, logistic regression, LDA, SVM, or naive Bayes, and examine how well they perform compared to the random forest model implemented in this project; one possible baseline is sketched below.
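
For instance, a logistic regression baseline could be fit on the same imputed features; a hypothetical sketch using base R's glm on train_new from the appendix:

# logistic regression on the same engineered features
glm_fit  <- glm(Survived ~ . - PassengerId, data = train_new, family = binomial)
glm_prob <- predict(glm_fit, type = "response")   # fitted survival probabilities
glm_pred <- ifelse(glm_prob > 0.5, 1, 0)          # classify at the 0.5 threshold
mean(glm_pred == as.numeric(as.character(train_new$Survived)))  # training accuracy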


Appendix: R Code

library(tidyverse)
library(randomForest)
library(mice)
theme_set(theme_bw())
train <- read_csv("~/Desktop/R/Titanic/train.csv")
test <- read_csv("~/Desktop/R/Titanic/test.csv") %>% 
  mutate(Survived = NA)  # placeholder so train and test share the same columns

########################################################################################
# Data wrangling
all <- train %>% full_join(test)  # stack train and test for consistent feature engineering

# Titles
Mr <- c("Mr", "Don", "Jonkheer")
Mrs <- c("Mrs", "Dona", "Mme", "Countess")
Miss <- c("Miss", "Mlle", "Ms")
Other <- c("Capt", "Col", "Dr", "Major", "Rev", "Sir", "Lady")

all <- all %>% 
  mutate(Title = str_extract(Name, "[a-zA-Z0-9]+\\."),  # first word ending in "."
         Title = gsub("[.]", "", Title),                # drop the trailing period
         Title = if_else(Title %in% Mr, "Mr",
                         if_else(Title %in% Mrs, "Mrs",
                                 if_else(Title %in% Miss, "Miss", 
                                         if_else(Title %in% Other, "Other", Title)))),
         Title = if_else(Sex == "female" & Title == "Mr", "Mrs", Title),
         Embarked = if_else(is.na(Embarked), "S", Embarked),
         Pclass = as.factor(Pclass),
         Survived = as.factor(Survived),
         Deck = substr(Cabin, 1, 1),              # deck = first letter of cabin number
         Deck = if_else(Deck == "T", "A", Deck),  # lone "T" cabin grouped with deck A
         Deck = as.factor(Deck),
         FamSize = SibSp + Parch + 1,
         FamType = if_else(FamSize == 1, "Single",
                           if_else(FamSize %in% 2:3, "Small", "Large")))
all2 <- all %>% 
  select(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title, Deck, FamSize, FamType)

########################################################################################
# EDA

head(all2, 5)
train <- train %>% mutate(Survived = as.factor(Survived))

# Survived vs Sex
train %>% 
  ggplot(aes(x = Sex, fill = Survived)) + 
  geom_bar(position = "fill") 
train %>% 
  group_by(Sex, Survived) %>% 
  summarize(Count = n()) %>% 
  mutate(Proportion = Count/sum(Count))

# Survived vs Pclass
train %>% 
  ggplot(aes(x = Pclass, fill = Survived)) + 
  geom_bar(position = "fill")
train %>% 
  group_by(Pclass, Survived) %>% 
  summarize(Count = n()) %>% 
  mutate(Proportion = Count/sum(Count))

# Survived vs Age
train %>% 
  ggplot(aes(x = Survived, y = Age)) + 
  geom_boxplot()
mosaic::favstats(Age ~ Survived, data = train)

# Survived vs Fare
all %>% 
  filter(PassengerId <= 891) %>% 
  ggplot(aes(x = Survived, y = Fare)) + 
  geom_boxplot()
mosaic::favstats(Fare ~ Survived, data = train)

# Survived vs Embarked
train %>% 
  filter(!is.na(Embarked)) %>% 
  ggplot(aes(x = Embarked, fill = Survived)) + 
  geom_bar(position = "fill")
train %>% 
  filter(!is.na(Embarked)) %>% 
  group_by(Embarked, Survived) %>% 
  summarize(Count = n()) %>% 
  mutate(Proportion = Count/sum(Count))

# Survived vs Title
all %>% 
  filter(PassengerId <= 891) %>% 
  ggplot(aes(x = Title, fill = Survived)) + 
  geom_bar(position = "fill")
all %>% 
  filter(PassengerId <= 891) %>% 
  group_by(Title, Survived) %>% 
  summarize(Count = n()) %>% 
  mutate(Proportion = Count/sum(Count))

# Survived vs Family Size
all %>% 
  filter(PassengerId <= 891) %>% 
  ggplot(aes(x = FamType, fill = Survived)) + 
  geom_bar(position = "fill")
all %>% 
  filter(PassengerId <= 891) %>% 
  group_by(FamType, Survived) %>% 
  summarize(Count = n()) %>% 
  mutate(Proportion = Count/sum(Count))

########################################################################################
# Imputation
library(naniar)
vis_miss(train) + ggtitle("Missingness in train data")
vis_miss(test) + ggtitle("Missingness in test data")

train_imp <- all2[1:891,]    # rows from the original train set
test_imp <- all2[892:1309,]  # rows from the original test set
set.seed(99)
train_new <- train_imp %>% 
  mice(m = 1, method = "rf", print = FALSE) %>% 
  mice::complete() %>%  # explicit namespace: tidyr also exports complete()
  mutate(Survived = train$Survived,
         PassengerId = train$PassengerId)
set.seed(99)
test_new <- test_imp %>% 
  mice(m = 1, method = "rf", print = FALSE) %>% 
  mice::complete() %>%  # explicit namespace: tidyr also exports complete()
  mutate(PassengerId = test$PassengerId)

########################################################################################
# Modeling
# Cross validation
# get training and holdout sets 
set.seed(99)
training <- train_new %>% slice_sample(prop = 0.7)               # ~70% of rows
holdout <- train_new %>% anti_join(training, by = "PassengerId") # remaining 30%

# model: RF
set.seed(99)
train_rf <- training %>%
  randomForest(
    as.factor(Survived) ~ . - PassengerId,
    data = .,
    importance = TRUE,
    mtry = 3,
    maxnodes = 4,
    ntree = 1000
  )
varImpPlot(train_rf)
# confusion matrix on the holdout set (rows = predicted, columns = actual)
table(Predicted = predict(train_rf, holdout), Actual = holdout$Survived)

# final model: RF
set.seed(99)
final_rf <- training %>% 
  bind_rows(holdout) %>% 
  randomForest(
    as.factor(Survived) ~ . - PassengerId,
    data = .,
    importance = TRUE,
    mtry = 3,
    maxnodes = 4,
    ntree = 1000
  )

########################################################################################
# Prediction
test_new %>% 
  mutate(Survived = predict(final_rf, test_new)) %>%
  select(PassengerId, Survived) %>%
  write_csv("titanic_pred_RF_final.csv")