The sinking of the RMS Titanic in 1912 is arguably one of the most famous disasters in world history. It has inspired a wide variety of work in numerous fields, including a famous Oscar-winning movie directed by James Cameron, a great number of modern Internet memes, and now a Kaggle predictive modeling competition. On that note, the goal of this project is to use methods from nonparametric statistics and statistical learning to predict whether or not a passenger survived the Titanic disaster. Many factors contributed to how likely a Titanic passenger was to survive, from their age and gender to travel class, cabin number, and number of family members on board. For this project, we build a random forest model using various explanatory variables to predict Titanic survival.
The data for this project is provided by the Kaggle competition “Titanic - Machine Learning from Disaster”. We are given two data files: a training set and a test set. The training set consists of 891 observations, where each row represents a passenger with the following attributes: whether or not they survived, travel class (first, second, or third), name, sex, age, number of siblings or spouses on board, number of parents or children on board, ticket number, passenger fare, cabin number, and port of embarkation (Cherbourg, Queenstown, or Southampton). The test set consists of 418 cases with the same variables as the training set, except for survival status, which is what we are going to predict.
From the given information, we create additional features for our data. We first extract every passenger’s title from their name, and divide this variable into 5 categories: Mr, Mrs, Miss, Master, and Other. From the Cabin variable in our original data, we obtain each traveler's deck, which is the first letter of the cabin number. We also calculate the family size for each passenger by adding the number of siblings or spouses to the number of parents or children, plus one for the passenger themselves. Upon further inspection, we decide to group the family size into three different levels: single, small (size of 2 or 3), and large (size bigger than 3). The table below displays the possible predictors for survival outcome.
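The feature construction described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for this project. It assumes names follow the Kaggle "Surname, Title. Given names" pattern, that any title outside the four main ones falls into Other, and that a missing cabin is represented by an empty string or None. The function names are hypothetical.

```python
import re

# Titles kept as their own category; every other title is mapped to "Other"
# (an assumption about how the grouping is done).
MAIN_TITLES = {"Mr", "Mrs", "Miss", "Master"}


def extract_title(name):
    """Return the title found between the comma and the first period of a name."""
    match = re.search(r",\s*([^.]+)\.", name)
    title = match.group(1).strip() if match else ""
    return title if title in MAIN_TITLES else "Other"


def extract_deck(cabin):
    """Return the first letter of the cabin string, or None when the cabin is missing."""
    return cabin[0] if cabin else None


def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children, plus one for the passenger."""
    return sibsp + parch + 1


def family_group(size):
    """Bucket family size into single (1), small (2 or 3), or large (more than 3)."""
    if size == 1:
        return "single"
    if size <= 3:
        return "small"
    return "large"


# Example values in the Kaggle format, used only for illustration.
print(extract_title("Braund, Mr. Owen Harris"))  # Mr
print(extract_deck("C85"))                       # C
print(family_group(family_size(1, 0)))           # small
```

Keeping each feature as a small function makes it easy to apply the same transformation to both the training and test sets.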
To get started, we conduct exploratory data analysis (EDA), focusing on the relationships between whether or not a traveler survived and the other variables in our data. Numerical and visual summaries of our data can be found in the tabs below. First, it is clear that women (specifically those with titles Mrs and Miss) are more likely to survive than men, which is not surprising. Interestingly, there does not seem to be a big difference in age between those who survived (mean = 28.3, median = 28) and those who did not survive (mean = 30.6, median = 28). We would normally expect the age distribution of the surviving passengers to be more skewed to the right, since younger people (children) would be more likely to get on lifeboats and hence have a higher chance of surviving than older people. However, it is worth noting that there are 177 observations with missing age values in our training data, which we will deal with shortly.
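A comparison of age by survival outcome like the one above has to skip the missing ages before computing the mean and median. The sketch below shows one way to do that in plain Python. The helper name age_summary and the toy rows are invented for illustration and are not Titanic data.

```python
from statistics import mean, median


def age_summary(rows):
    """Mean and median age per survival outcome, ignoring missing ages.

    rows: iterable of (survived, age) pairs, where age is None when missing.
    """
    groups = {}
    for survived, age in rows:
        if age is None:
            continue  # drop missing ages instead of treating them as zero
        groups.setdefault(survived, []).append(age)
    return {
        outcome: {"n": len(ages), "mean": mean(ages), "median": median(ages)}
        for outcome, ages in groups.items()
    }


# Made-up toy rows (survived, age); one age is missing on purpose.
toy = [(1, 4.0), (1, 30.0), (1, None), (0, 20.0), (0, 40.0), (0, 60.0)]
summary = age_summary(toy)
print(summary[1])  # {'n': 2, 'mean': 17.0, 'median': 17.0}
print(summary[0])  # {'n': 3, 'mean': 40.0, 'median': 40.0}
```

Reporting the count alongside each summary makes it visible how many observations were dropped because of missing ages.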
Regarding passenger class and fare, we observe that a higher class and a higher fare are associated with a greater chance of surviving the Titanic. As one may expect, wealthier passengers (those who traveled first class and paid higher fares) are more likely to have survived after the ship hit the iceberg than people in the lower classes. In terms of port of embarkation, people who embarked at Cherbourg are more likely to survive than those who embarked at the other two ports, Queenstown and Southampton, which have about the same survival proportions. We also observe that people traveling in a small family group have a greater chance of surviving than passengers traveling alone or with a large group of family members.
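The survival proportions compared above (by class, port, or family group) are simple grouped ratios. A minimal sketch of that computation is given here, assuming each passenger is a dictionary with a 0/1 Survived field and a categorical field such as Pclass. The helper name survival_rate_by and the toy rows are hypothetical and do not reproduce the actual data.

```python
from collections import defaultdict


def survival_rate_by(rows, key):
    """Proportion of survivors within each level of the categorical field `key`."""
    counts = defaultdict(lambda: [0, 0])  # level -> [survivors, total]
    for row in rows:
        level = row[key]
        counts[level][0] += row["Survived"]
        counts[level][1] += 1
    return {level: s / n for level, (s, n) in counts.items()}


# Made-up toy rows, for illustration only.
toy_rows = [
    {"Survived": 1, "Pclass": 1},
    {"Survived": 1, "Pclass": 1},
    {"Survived": 0, "Pclass": 1},
    {"Survived": 0, "Pclass": 1},
    {"Survived": 1, "Pclass": 3},
    {"Survived": 0, "Pclass": 3},
    {"Survived": 0, "Pclass": 3},
    {"Survived": 0, "Pclass": 3},
]
rates = survival_rate_by(toy_rows, "Pclass")
print(rates)  # {1: 0.5, 3: 0.25}
```

The same function can be reused for any categorical predictor by changing the key argument.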