
Name: Kaushik A. Jadhav        Batch: A
Roll No: XIEIT161710        Class: BE-IT

Aim: Predicting the Survival of Titanic Passengers

Problem Statement:

In the Titanic disaster, 1502 people died and 722 survived, so the baseline survival rate is 32.46%. We have a training dataset with labels (survived or not) and a test dataset to predict on, with the same features (attributes) but no label. The hypothesis is that passenger characteristics carry information with predictive power for the outcome; this hypothesis holds, but only to a degree.
In this project, we will explore the data, form informal hypotheses to guide modelling, build a machine-learning model, and evaluate its accuracy.

Data Description:

The data has been split into two groups:


 Training set (train.csv)
 Test set (test.csv)

The training set is used to build the machine learning model. For the training set, the outcome
(also known as the “ground truth”) is provided for each passenger. The test set is used to see
how well the model performs on unseen data. For the test set, the ground truth for each
passenger is not provided. Our model predicts these outcomes. For each passenger in the test
set, the trained model is used to predict whether or not they survived the sinking of the
Titanic.

There are 891 passengers and 11 attributes (PassengerId is just an index not an attribute of a
passenger). The attributes are:

 Survived, integer, binary indicator (Survived = 1), and the target outcome or dependent variable we are to predict.
 Pclass, integer, an ordinal variable for the passenger class.
 Name, Factor w/ 891 levels (one level per passenger).
 Sex, Factor with two levels: “female”, “male”.
 Age, numerical, has 177 missing values coded as NA.
 SibSp, integer, an ordinal variable for the number of siblings or spouses.
 Parch, integer, an ordinal variable for the number of parents or children.
 Ticket, Factor w/ 681 levels.
 Fare, numerical, is in Pounds Sterling, a proxy for wealth or social status.
 Cabin, Factor w/ 147 levels, has 687 missing values.
 Embarked, Factor w/ 3 levels: “C”, “Q”, and “S” for the port of embarkation
(Cherbourg, Queenstown, and Southampton), has 2 missing values.
Approach:

Exploratory Data Analysis:

Set the working directory and import the required packages:

 setwd() is used to set the working directory.
 library() is used to import the required libraries.
The following libraries are used:
1. ggplot2 – creates elegant data visualizations.
2. visdat – helps with preliminary visualization of the whole dataset.
3. caTools – moving-window statistics and data-splitting utilities.
4. dplyr – package for data manipulation.
5. corrplot – visualization of a correlation matrix.
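
A minimal sketch of this setup (the working-directory path is a placeholder):

    setwd("~/titanic")     # placeholder path
    library(ggplot2)       # plotting
    library(visdat)        # visualising missing data
    library(caTools)       # moving-window statistics / data-splitting utilities
    library(dplyr)         # data manipulation
    library(corrplot)      # correlation-matrix plots
    library(mice)          # imputation, used later in pre-processing
    library(randomForest)  # modelling, used later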

Loading the dataset:

 read_csv() is used to load the datasets.
 bind_rows() is used to merge the datasets.
Here we have loaded the training dataset into the “train” variable and the test dataset into the “test” variable. We have combined the two datasets into the “data” variable for an in-depth exploratory data analysis.
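
A sketch of this step, assuming the standard Kaggle file names train.csv and test.csv:

    library(readr)   # provides read_csv()
    train <- read_csv("train.csv")
    test  <- read_csv("test.csv")

    # stack the two sets; the test rows have no Survived column, so it becomes NA for them
    data <- bind_rows(train, test)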

Check the top 6 values of the dataset:


 head() is used to inspect the first six rows of the dataframe.

Check the summary of the dataset:


 summary() is used to produce result summaries.
We can get various statistical components of our dataset, such as the minimum, mean, and median of each column, to get a better understanding of the data.
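
For example, a quick first look at the combined dataset:

    head(data)      # first six rows
    summary(data)   # per-column minimum, quartiles, mean, median and NA counts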

Getting a list of all the columns in the dataset:


 names() is used to get a list of all the attributes (column names).

Check for missing values in the dataset:


 The number of missing values in each column is counted (for example with colSums(is.na(data))).

Here we can observe that Age and Fare have 263 and 1 missing values respectively. Survived also shows missing values because those rows come from the test dataset, which is unlabelled.
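
A sketch of this check on the combined data:

    # count the NA values in each column; Survived is NA for all test rows by construction
    colSums(is.na(data))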

Visualizing missing values:


 vis_miss() is used to visualize these missing values.
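
For example:

    # visual overview of missing data across all columns (visdat package)
    vis_miss(data)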

Plotting the distribution of Age column:


 ggplot() is used to create the plots.

Here, we have plotted the distribution of the ages of all passengers on board on the day the Titanic sank. We can observe that the bulk of passengers were young adults, with ages concentrated in the twenties and early thirties.
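
A sketch of this plot, assuming the column names of the Kaggle dataset:

    # histogram of passenger ages (rows with a missing Age are dropped with a warning)
    ggplot(data, aes(x = Age)) +
      geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
      labs(title = "Distribution of passenger age", x = "Age (years)", y = "Count")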

Plotting Survived vs Gender:

Here, we have plotted a bar chart of survival by gender. We can observe that most of the males did not survive the accident, whereas most of those who survived were female.
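
A sketch of this plot:

    # survival counts split by gender, using only the labelled (training) rows
    ggplot(filter(data, !is.na(Survived)),
           aes(x = Sex, fill = factor(Survived))) +
      geom_bar(position = "dodge") +
      labs(fill = "Survived")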

Plotting Survival vs PClass:

Here, we have plotted a bar chart of the Survived and Pclass attributes. We can observe that many of those who did not survive were in third class (Pclass = 3), while those who survived were mostly in first and second class.
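
A sketch of this plot:

    # survival counts by passenger class
    ggplot(filter(data, !is.na(Survived)),
           aes(x = factor(Pclass), fill = factor(Survived))) +
      geom_bar(position = "dodge") +
      labs(x = "Pclass", fill = "Survived")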

Plotting Survival vs Age:

According to the Titanic's certificate of clearance, passengers aged 12 or older were classed as adults. The median life expectancy in 1912 (when the Titanic sank) was 51.5 years for both men and women.

Plotting Fare Distribution:

Here, we have plotted the distribution of the fare attribute.



Plotting Correlation Matrix:


 cor() is used to compute the correlation matrix, and corrplot() is used to plot it.

A correlation matrix is helpful for visualizing the entire dataset at a glance and recognizing which attributes are positively correlated and which are negatively correlated.
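
A sketch, assuming only the numeric attributes are used and missing values are excluded pairwise:

    # correlation matrix of the numeric attributes, visualized with corrplot
    num_cols <- data %>% select(Survived, Pclass, Age, SibSp, Parch, Fare)
    corr_mat <- cor(num_cols, use = "pairwise.complete.obs")
    corrplot(corr_mat, method = "circle")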

Pre-processing:

The algorithm used for creating the machine learning model is a random forest classifier (the randomForest package in R), and the pre-processing done is specific to this algorithm.

Set the working directory and import the required packages:

Importing the dataset:


Check the top 6 values of the dataframe:

Check for missing values:

Converting categorical variables to factors:


 as.factor() is used to encode a vector as a factor.

The dataset contains some categorical values which need to be encoded before applying an algorithm to it, because most algorithms do not accept raw categorical values.
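
A sketch of this step; the exact set of columns converted is an assumption:

    # convert categorical columns to factors so the model treats them as categories
    data$Sex      <- as.factor(data$Sex)
    data$Embarked <- as.factor(data$Embarked)
    data$Pclass   <- as.factor(data$Pclass)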

Filling Missing Values:

It is necessary to fill all the missing values because having null values in the dataset can affect the performance of the model. The mice package in R is used for imputation, as it contains popular imputation methods. mice stands for multivariate imputation by chained equations.

Imputing Age Column:


As the algorithm we are using is random forest, the “rf” method of mice will be used.
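
A sketch of the Age imputation, assuming identifier, outcome and free-text columns are left out of the imputation model:

    # impute missing Age values with mice using the random-forest ("rf") method
    set.seed(129)   # placeholder seed for reproducibility
    imp_vars <- data %>% select(Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
    mice_mod <- mice(imp_vars, method = "rf")
    data$Age <- complete(mice_mod)$Age   # copy only the completed Age column back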

Imputing Fare Column:

We impute the Fare column with the median of that column. In the plot, the red line indicates the mean of the column and the green line indicates the median; we can observe that the mean is pulled away from the bulk of the distribution, so the median is the more appropriate value.
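
A sketch of this step (a minimal version; one could also use the median fare within the same passenger class):

    # replace the single missing Fare with the overall median fare
    data$Fare[is.na(data$Fare)] <- median(data$Fare, na.rm = TRUE)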

Feature Engineering:

Feature engineering is the process of creating new attributes using the existing attributes.

Family Size:

We can observe from the plot that many of those on the ship were travelling alone or in small family groups.
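
A sketch of the family-size feature, assuming the common definition of siblings/spouses plus parents/children plus the passenger themselves:

    # family size = SibSp + Parch + 1 (the passenger)
    data$FamilySize <- data$SibSp + data$Parch + 1

    # bar chart of family size split by survival (labelled rows only)
    ggplot(filter(data, !is.na(Survived)),
           aes(x = FamilySize, fill = factor(Survived))) +
      geom_bar(position = "dodge") +
      labs(fill = "Survived")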

Title:

The title of a person tells us a lot about their social status. We can see that many different titles were on board, such as Lady, Sir, Major, Mr, Mrs, and Miss.
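
A sketch of extracting the title from the Name column, assuming the usual “Surname, Title. Firstname” format of the Kaggle data:

    # pull out the word between the comma and the full stop, e.g. "Braund, Mr. Owen Harris" -> "Mr"
    data$Title <- gsub("(.*, )|(\\..*)", "", data$Name)
    data$Title <- as.factor(data$Title)    # store as a factor for modelling
    table(data$Sex, data$Title)            # cross-tabulate titles by gender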

Family Name:

Sector:

We have finished our pre-processing, and now it is time to create a model for our dataset and make some predictions.

Modelling:

The combined dataset is split back into the train and test sets, and the randomForest algorithm is used to create the model.
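
A sketch of this step; the exact model formula is an assumption based on the features discussed above:

    # split the combined data back into the labelled and unlabelled partitions
    train <- data[!is.na(data$Survived), ]
    test  <- data[is.na(data$Survived), ]

    # fit a random forest on the engineered features
    set.seed(754)   # placeholder seed for reproducibility
    rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch +
                               Fare + Embarked + Title + FamilySize,
                             data = train,
                             na.action = na.roughfix)  # fills any leftover NAs with medians/modes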
Visualizing the created model:

Visualizing the variable importance:

After creating the model, let us visualize which features were the most important in making predictions and which were the least important. We can observe that “Title”, which was produced by feature engineering, is the most important attribute for making predictions, followed by Fare, Sex, Age, etc.
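
A sketch of this step:

    # mean decrease in Gini importance for each feature, and the corresponding plot
    importance(rf_model)
    varImpPlot(rf_model)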

Making Predictions:

Now it is time to make some predictions using the predict() function.
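
A sketch of this step; the output file name is an assumption:

    # predict survival for the unlabelled test rows
    prediction <- predict(rf_model, test)

    # assemble a submission-style data frame
    solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
    write.csv(solution, file = "rf_solution.csv", row.names = FALSE)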
Evaluating the model:

Now let us evaluate the model we have created.

Making the confusion matrix:

A confusion matrix is a table that is often used to describe the performance of a classification
model (or “classifier”) on a set of test data for which the true values are known. It allows the
visualization of the performance of an algorithm.
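
A sketch of how such a matrix could be built, assuming valid_actual and valid_pred (hypothetical names) hold the true labels and the model's predictions for the 178 hold-out rows:

    # 2x2 confusion matrix of actual vs predicted survival
    conf_mat <- table(Actual = valid_actual, Predicted = valid_pred)
    conf_mat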

We can observe that our model predicted 134 (97+37) correct values and 44 (31+13) wrong
values out of the total 178 values used for testing.

Let us check which of the predicted values did not match the actual ground-truth values.
Finally, let us calculate the accuracy of our model.

We can observe that our model gives an accuracy of 75%.
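
A sketch of the accuracy calculation, reusing the conf_mat table sketched above:

    # accuracy = correct predictions / all predictions
    accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
    accuracy    # about 0.75 here, i.e. 134 correct out of 178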

Conclusion: Thus we have created a random forest classifier that predicts whether a person would survive the Titanic disaster or not, with an accuracy of 75%.
