Aim: Predicting The Survival of Titanic Passengers
Name: Kaushik.A.Jadhav Batch: A
Roll No: XIEIT161710 Class: BE-IT
Problem Statement:
In the Titanic disaster, 1502 people died and 722 survived. The baseline survival rate
therefore is 32.46%. We have a training dataset with labels (survived or not) and a test dataset
to predict on, with the same features (attributes) but no label. The hypothesis is that passenger
characteristics carry information that has predictive power for the outcome. This hypothesis
is valid, yet only to a certain degree.
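The baseline figure can be checked with a line of arithmetic. The sketch below is in Python purely for illustration (the analysis in this report was carried out in R); the variable names are hypothetical, and the counts are the ones quoted above.

```python
# Counts quoted in the problem statement above.
died = 1502
survived = 722

# Baseline survival rate: share of survivors among everyone counted.
baseline_rate = survived / (died + survived)
print(f"Baseline survival rate: {baseline_rate:.2%}")
```

By the same arithmetic, always guessing "did not survive" would be right for roughly 67.5% of these people, which is the figure a useful model has to beat.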
In this project, we will explore the data, form informal hypotheses to guide modelling, build
a machine-learning model, and evaluate it on its accuracy.
Data Description:
The training set is used to build the machine learning model. For the training set, the outcome
(also known as the “ground truth”) is provided for each passenger. The test set is used to see
how well the model performs on unseen data. For the test set, the ground truth for each
passenger is not provided. Our model predicts these outcomes. For each passenger in the test
set, the trained model is used to predict whether or not they survived the sinking of the
Titanic.
The training set has 891 passengers and 11 attributes (PassengerId is just an index, not an
attribute of a passenger). The attributes are:
In the combined train and test data, Age and Fare have 263 and 1 missing values respectively.
Survived also shows missing values, because those rows come from the test dataset, which is
unlabelled.
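A missing-value count of this kind can be sketched as follows. This is an illustrative Python version, not the code used for the report (which was written in R); the four rows are made up, and the helper name count_missing is hypothetical.

```python
# Hypothetical miniature of the combined train + test table.
# None marks a missing value; the last two rows mimic unlabelled test rows.
rows = [
    {"Survived": 0, "Age": 22.0, "Fare": 7.25},
    {"Survived": 1, "Age": None, "Fare": 71.28},
    {"Survived": None, "Age": 34.5, "Fare": None},
    {"Survived": None, "Age": None, "Fare": 8.05},
]


def count_missing(rows):
    """Return {column: number of rows where the value is None}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # A bool adds as 0 or 1, so this increments only on missing values.
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


missing = count_missing(rows)
print(missing)
```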
Here, we have plotted the age distribution of all the passengers on board on the day the
Titanic sank. We can observe that the mean age of the passengers was around 26 years.
Here, we have plotted a histogram of the gender and survived columns. We can observe
that many males did not survive the accident, whereas most of those who survived were
females.
Here, we have plotted a histogram of the survived and pclass attributes. We can observe that
many of those who did not survive were in Pclass 3, and those who survived were
mostly in Pclass 1 and 2.
According to the Titanic certificate of clearance, anyone aged 12 or older was
labelled as an adult. The median life expectancy in 1912 (when the Titanic sank) was 51.5 years
for both men and women.
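The age rule above could be turned into a simple grouping function. This is a minimal Python sketch for illustration only; the function name age_group and the "adult"/"child" labels are assumptions, and only the threshold of 12 comes from the text.

```python
ADULT_AGE = 12  # threshold quoted in the text above


def age_group(age):
    """Label a passenger as 'adult' or 'child'; return None if age is missing."""
    if age is None:
        return None
    return "adult" if age >= ADULT_AGE else "child"


print([age_group(a) for a in (4, 12, 35.5, None)])
```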
A correlation matrix is helpful in visualizing the entire dataset at one glance and recognizing
which attributes are positively correlated and which are negatively correlated.
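Each cell of such a matrix is a correlation coefficient between two attributes. The Python sketch below computes the Pearson coefficient for one pair; it assumes Pearson correlation (the report does not say which coefficient was plotted), and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: a higher class number paired with a lower fare.
pclass = [1, 1, 2, 3, 3, 3]
fare = [80.0, 60.0, 20.0, 8.0, 7.5, 7.0]

r = pearson(pclass, fare)
print(f"corr(Pclass, Fare) = {r:.2f}")  # negative for this toy sample
```

A value near +1 means the two attributes rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.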
Pre-processing:
The algorithm used for creating the machine learning model is RandomForestClassifier,
and the pre-processing done is specific to that algorithm.
The dataset contains some categorical values, which must be encoded before
applying an algorithm to it, because most algorithms do not accept categorical values directly.
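One simple form of encoding maps each category to an integer code. The Python sketch below shows that idea; it is an assumption about the kind of encoding meant (the report does not show its own encoding step), and the helper name label_encode is hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


sex = ["male", "female", "female", "male"]
codes, mapping = label_encode(sex)
print(mapping)
print(codes)
```

Integer codes imply an ordering between categories, which is harmless for a two-valued column such as Sex but may need more care for columns with many unordered values.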
It is necessary to fill all the missing values, because null values in the dataset might
affect the performance of the model. The mice package in R is used for imputation, as it provides
several popular imputation methods. The name mice stands for multivariate imputation by chained
equations.
Because the algorithm we are using is RandomForest, the “rf” method of mice is used.
The missing value in the Fare column is imputed with the median of that column. In the graph,
the red line indicates the mean of the column and the green line indicates the
median. We can observe that the mean lies far from where most fares are concentrated, so the
median is the more representative value.
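The effect of a few very large fares on the mean, and the median imputation itself, can be sketched in Python as follows. The fare values are made up to mimic a skewed column; this is not the report's R code.

```python
from statistics import mean, median

# Made-up fares with one very large value and one missing entry (None).
fares = [7.25, 7.9, 8.05, 13.0, 26.0, 512.33, None]

known = [f for f in fares if f is not None]
fare_mean = mean(known)      # pulled upwards by the single large fare
fare_median = median(known)  # stays close to the typical fare

# Replace every missing fare with the median of the known fares.
imputed = [fare_median if f is None else f for f in fares]

print(f"mean = {fare_mean:.2f}, median = {fare_median:.2f}")
```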
Feature Engineering:
Feature engineering is the process of creating new attributes using the existing attributes.
Family Size:
We can observe from the plot that many of those on the ship were travelling alone or in a
small group.
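The report does not show how family size was derived. A common construction, assumed here, adds the passenger's sibling/spouse count (SibSp) and parent/child count (Parch) and counts the passenger as well. A Python sketch of that assumption:

```python
def family_size(sibsp, parch):
    """Family size under the assumed definition: relatives aboard plus the passenger."""
    return sibsp + parch + 1


print(family_size(0, 0))  # a passenger travelling alone
print(family_size(1, 2))  # one sibling or spouse plus two parents or children
```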
Title:
The title of a person tells us a lot about that person's status. We can see that
many different titles were on board, such as Lady, Sir, Major, Mr, Mrs, Miss, etc.
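A title can be pulled out of the Name field with a regular expression. The Python sketch below assumes names are written as "Surname, Title. Given names"; the sample names are invented, and the helper name extract_title is hypothetical.

```python
import re

# Capture the text between the first comma and the following full stop.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title found in a 'Surname, Title. Given names' string, else None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


names = ["Smith, Mr. John", "Doe, Miss. Jane", "Brown, Major. Arthur"]
print([extract_title(n) for n in names])
```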
Family Name:
Sector:
We are finished with our pre-processing, and now it is time to create a model for our dataset
and make some predictions.
Modelling:
The combined dataset is split back into train and test sets, and the RandomForest algorithm
is used to create the model.
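One way to undo the combination is to use the label itself: rows with a known Survived value belong to the training set, and rows without one belong to the test set. The Python sketch below shows this under that assumption, with made-up rows; the report's own split was done in R and may have used row positions instead.

```python
# Made-up combined table: None in Survived marks an unlabelled test row.
combined = [
    {"PassengerId": 1, "Survived": 0},
    {"PassengerId": 2, "Survived": 1},
    {"PassengerId": 892, "Survived": None},
]

# Labelled rows go back to the training set, unlabelled rows to the test set.
train = [row for row in combined if row["Survived"] is not None]
test = [row for row in combined if row["Survived"] is None]

print(len(train), len(test))
```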
Visualizing the created model:
After creating the model, let us visualize which features were the most important in making
predictions and which were the least important. We can observe that “Title”, which was
produced by feature engineering, is the most used attribute in making predictions, followed
by fare, sex, age, etc.
Making Predictions:
Now it is time to make some predictions using the predict() function.
Evaluating the model:
A confusion matrix is a table that is often used to describe the performance of a classification
model (or “classifier”) on a set of test data for which the true values are known. It allows the
visualization of the performance of an algorithm.
We can observe that our model predicted 134 (97+37) correct values and 44 (31+13) wrong
values out of the total 178 values used for testing.
Let us check which of the predicted values did not match the actual ground-truth values.
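Both steps, building the confusion matrix and listing the mismatches, can be sketched in Python as follows. The labels are a made-up six-row example rather than the report's 178 test rows, and the helper name confusion_matrix is hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels: 1 = survived, 0 = did not survive.
actual = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1]

cm = confusion_matrix(actual, predicted)
correct = cm[(0, 0)] + cm[(1, 1)]  # diagonal cells: prediction equals truth
wrong = cm[(0, 1)] + cm[(1, 0)]    # off-diagonal cells: prediction differs

# Positions where the prediction does not match the ground truth.
mismatches = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]

print(correct, wrong, mismatches)
```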
Finally, let us calculate the accuracy of our model.
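Accuracy is the number of correct predictions divided by the total number of predictions. Using the counts reported from the confusion matrix above, the calculation can be sketched in Python (variable names are hypothetical):

```python
# Counts reported from the confusion matrix in this report.
correct = 97 + 37  # predictions that matched the ground truth
wrong = 31 + 13    # predictions that did not match

total = correct + wrong
accuracy = correct / total
print(f"Accuracy: {accuracy:.2%} on {total} test rows")
```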
Conclusion: Thus we have created a Random Forest classifier that predicts whether a person
would survive the Titanic disaster, with an accuracy of about 75%.