
FOR AFAME TECHNOLOGIES

TITANIC
SURVIVAL
PREDICTION
PROJECT REPORT

By - Rahul Badola

TABLE OF CONTENTS
1. Introduction

2. Python Libraries Used

3. Data Overview

4. Data Cleaning

5. Data Visualization (EDA)

6. Predictions

7. Conclusion

INTRODUCTION
The RMS Titanic's tragic sinking in 1912 stands as a stark reminder of the
vulnerability of human life at sea. Despite the ship's opulence and perceived
invincibility, its collision with an iceberg led to the loss of over 1,500 lives.
Today, leveraging data science techniques, we delve into the Titanic dataset,
examining factors such as passenger class, age, and sex to discern patterns in
survival rates. This project report explores the human stories within the data,
aiming to shed light on the factors that influenced survival aboard the iconic vessel.

PYTHON LIBRARIES USED
1. pandas (pd): Pandas is a powerful library for data manipulation and analysis
in Python. It provides data structures like DataFrames, which are ideal for
handling structured data.
2. numpy (np): NumPy is a fundamental package for scientific computing
with Python. It provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions to operate on
these arrays efficiently.
3. seaborn (sns): Seaborn is a statistical data visualization library based on
Matplotlib. It provides a high-level interface for drawing attractive and
informative statistical graphics.
4. matplotlib.pyplot (plt): Matplotlib is a comprehensive library for creating
static, animated, and interactive visualizations in Python. Pyplot is a module
within Matplotlib that provides a MATLAB-like interface for plotting.
5. warnings: The warnings module is used to handle warnings that occur
during program execution. In this project, all warnings are suppressed from
being displayed with warnings.filterwarnings('ignore').
6. imblearn.under_sampling.RandomUnderSampler: Imbalanced-learn is a
library used for dealing with imbalanced datasets in machine learning.
RandomUnderSampler is a technique used to balance class distribution by
randomly eliminating samples from the majority class(es).
7. sklearn.preprocessing.LabelEncoder: LabelEncoder is used to convert
categorical labels (e.g., strings) into numerical labels for machine learning
algorithms.
8. sklearn.model_selection.train_test_split: This function is used to split
datasets into random train and test subsets. It's commonly used for model
evaluation and validation.
9. sklearn.ensemble.RandomForestClassifier: RandomForestClassifier is an
ensemble learning method used for classification tasks. It fits a number of
decision tree classifiers on various sub-samples of the dataset and uses
averaging to improve the predictive accuracy and control over-fitting.
10. sklearn.metrics.accuracy_score: accuracy_score is a function used to
compute the accuracy classification score, which is the fraction of correct
predictions among the total number of predictions made.

11. sklearn.metrics.classification_report: classification_report is a function used
to build a text report showing the main classification metrics (precision, recall,
F1-score, and support) for each class in a classification task.
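
For reference, a consolidated import sketch for the libraries listed above is shown below; the aliases (pd, np, sns, plt) follow the conventions noted in the list, and the exact import order in the original notebook may differ.

```python
# Consolidated imports for the libraries listed above (a sketch)
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Suppress warning output, as described in item 5
warnings.filterwarnings('ignore')
```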

DATA OVERVIEW

Target variable: 0 = Not Survived, 1 = Survived

Dataset size: 891 rows, 11 columns

DATA CLEANING
Dtypes

The 'PassengerId' column is removed from the dataset since it does not
contribute to the prediction task. The column is dropped from the DataFrame
df in place, meaning the change is applied directly to the DataFrame without
the need to reassign it to a new variable.
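
A minimal sketch of this step is shown below; the CSV file name is a placeholder, since the report does not state how the dataset was loaded.

```python
# Load the dataset (file name is a placeholder) and drop the identifier column in place
df = pd.read_csv('titanic.csv')  # hypothetical file name
df.drop('PassengerId', axis=1, inplace=True)
```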

DEALING WITH NULL VALUES

Null values (before and after replacing)

Since the 'Age' column contains float values, its null entries are filled with
the column's mean. Null values in the 'Embarked' column have been replaced
with its mode (the most frequent value).
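
A minimal sketch of the imputation described above, assuming the standard Titanic column names:

```python
# Fill missing ages with the column mean (Age is a float column)
df['Age'].fillna(df['Age'].mean(), inplace=True)

# Fill missing embarkation ports with the most frequent value (the mode)
df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)

# Confirm the remaining null counts per column
print(df.isnull().sum())
```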

Balancing of the Dataset

Not Survived - 549, Survived - 342

The data is imbalanced, which leads to biased results, so an under-sampling
technique is used to balance the dataset.
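
The class counts above can be verified with a quick check, for example:

```python
# Class distribution of the target variable (expected: 0 -> 549, 1 -> 342)
print(df['Survived'].value_counts())
```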

EDA (EXPLORATORY DATA ANALYSIS)

This pie chart visualizes the percentage of passengers who survived in the
Titanic dataset. It shows that 38.4% of individuals survived the disaster,
while the majority, comprising 61.6%, did not. This stark contrast highlights
the tragic outcome of the Titanic's sinking and underscores the importance of
understanding the factors that influenced survival rates.
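
A minimal matplotlib sketch of such a pie chart is shown below; the exact labels and styling in the report may differ.

```python
# Pie chart of survival percentages (a sketch; styling may differ from the report)
counts = df['Survived'].value_counts()  # 0 = Not Survived, 1 = Survived
plt.pie(counts, labels=['Not Survived', 'Survived'], autopct='%1.1f%%')
plt.title('Percentage of Passengers Survived')
plt.show()
```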

This bar plot visualizes the survival count by sex from the Titanic dataset. It reveals that a
larger number of males did not survive the disaster, whereas a greater proportion of
females survived. This discrepancy suggests a potential correlation between sex and
survival on the Titanic, highlighting the importance of considering gender dynamics in
analyzing historical events like this.

This pie chart illustrates the distribution of males and females in the Titanic
dataset. It indicates that approximately 64.8% of the individuals in the dataset
are male, while around 35.2% are female. This gender distribution provides
valuable insight into the composition of passengers aboard the Titanic and
underscores the importance of considering gender dynamics in analyzing
historical events.

This histogram with density plot overlays depicts the distribution of ages among male and
female passengers in the Titanic dataset. The histogram bars represent the count of
individuals within different age groups, while the density curves provide an estimation of
the probability density function for each gender.

From the visualization, it can be observed that the age distribution is skewed towards
younger individuals, with a notable peak in the early 20s. Additionally, the density curves
reveal subtle differences between the age distributions of males and females, suggesting
potential variations in age demographics between the two genders aboard the Titanic.
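
A possible seaborn sketch of this plot, assuming the standard 'Age' and 'Sex' column names:

```python
# Age histogram with KDE overlays, split by sex (a sketch of the plot described above)
sns.histplot(data=df, x='Age', hue='Sex', kde=True)
plt.title('Age Distribution by Sex')
plt.show()
```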

This bar plot illustrates the count of passengers based on their port of
embarkation (S = Southampton, C = Cherbourg, Q = Queenstown) and their
respective genders. It is evident that the majority of passengers embarked
from Southampton, with a higher proportion of males compared to females.
Cherbourg had the second-highest number of passengers, with a relatively
balanced distribution between males and females. Queenstown had the fewest
passengers, with more males than females.
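
A possible seaborn sketch of this plot, assuming the standard 'Embarked' and 'Sex' column names:

```python
# Count of passengers by port of embarkation, split by sex (a sketch)
sns.countplot(data=df, x='Embarked', hue='Sex')
plt.title('Passenger Count by Port of Embarkation and Sex')
plt.show()
```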

PREDICTION

STEP - 1

Here, we instantiate a LabelEncoder object and then use the `fit_transform` method to
encode the 'Embarked' and 'Sex' columns of the DataFrame `df`. This process assigns
numerical labels to each unique category in these columns, making them suitable for
machine learning algorithms that require numerical input.
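
In code, this step might look like the following sketch:

```python
# Encode the categorical columns as integers for the model
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])
```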

STEP - 2

Here, x is assigned the DataFrame df after dropping the columns 'Survived' and 'Name'
along the columns axis (axis=1), making it the independent variable matrix. y is assigned
the 'Survived' column from the DataFrame df, representing the dependent variable. This
separation is crucial for training machine learning models, where x contains the features
used for prediction and y contains the target variable to be predicted.
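
A possible version of this step:

```python
# Features (everything except the target and the Name column) and target
x = df.drop(['Survived', 'Name'], axis=1)
y = df['Survived']
```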

STEP - 3

Here, a RandomUnderSampler object `us` is instantiated, which will randomly
undersample the majority class (0 for 'Not Survived') to balance the class distribution. The
`fit_resample` method is then applied to the independent variable matrix `x` and the
dependent variable `y` to perform the undersampling. The resulting `x_resample` and
`y_resample` contain the balanced dataset, where both classes are represented equally,
thus mitigating bias in the classification results.
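
A sketch of the under-sampling call:

```python
# Randomly under-sample the majority class to balance the classes
us = RandomUnderSampler()
x_resample, y_resample = us.fit_resample(x, y)
```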

STEP - 4

The code `x_resample.reset_index(drop=True)` resets the index of the DataFrame
`x_resample` after undersampling, dropping the old index. This ensures that the
index is reset to consecutive integers starting from 0, maintaining the
integrity of the data after undersampling.

The resulting DataFrame has 684 rows and 7 columns, indicating that the index
has been successfully reset and the DataFrame is ready for further analysis or
modeling tasks.
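
A sketch of this step; note that reset_index returns a new DataFrame, so the result is reassigned here (the report's own code may have handled this differently).

```python
# Reset the index to consecutive integers after under-sampling
x_resample = x_resample.reset_index(drop=True)
print(x_resample.shape)  # reported as 684 rows and 7 columns
```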

STEP - 5

Here, the `train_test_split` function from scikit-learn is used to split the independent
variable matrix `x` and the dependent variable `y` into training and testing sets. The
parameter `test_size=0.20` specifies that 20% of the data will be used for testing, while
the remaining 80% will be used for training. Additionally, `random_state=46` sets the
random seed for reproducibility, ensuring that the same random split is obtained each time
the code is run.
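
A sketch of the split; it assumes the balanced data from Steps 3 and 4 is what gets split, although the text above refers to x and y.

```python
# 80/20 train-test split with a fixed random seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    x_resample, y_resample, test_size=0.20, random_state=46)
```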

STEP - 6

Here, the RandomForestClassifier from scikit-learn is instantiated with parameters
`n_estimators=100` and `max_depth=5`. This creates a random forest classifier with 100
decision trees (estimators) and a maximum depth of 5 for each tree. The `rfc` variable now
holds this classifier object, which can be used for training and making predictions on the
dataset.
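
In code:

```python
# Random forest with 100 trees, each limited to a depth of 5
rfc = RandomForestClassifier(n_estimators=100, max_depth=5)
```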

STEP - 7

Here, the fit method of the RandomForestClassifier rfc is called with the training data
x_train and y_train as arguments. This step trains the random forest classifier on the
training data, allowing it to learn the patterns and relationships between the features and
the target variable.
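
The training call:

```python
# Train the classifier on the training split
rfc.fit(x_train, y_train)
```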

STEP - 8

Here, the predict method of the trained Random Forest Classifier rfc is used to predict the
target variable based on the features in the testing dataset x_test. The predicted values
are stored in the variable y_pred, which can be used for evaluating the model's
performance and comparing it against the actual values in y_test.
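
The prediction call:

```python
# Predict survival for the held-out test split
y_pred = rfc.predict(x_test)
```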

STEP - 9

Here, the accuracy_score function from scikit-learn is used to compute the accuracy of the
model by comparing the predicted values y_pred with the actual values y_test. This
function returns the accuracy score, which represents the proportion of correctly
classified instances out of the total number of instances in the testing dataset.
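
In code, together with the classification report shown below:

```python
# Overall accuracy (reported in the conclusion as 84%)
print(accuracy_score(y_test, y_pred))

# Precision, recall, F1-score, and support for each class
print(classification_report(y_test, y_pred))
```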

Classification Report

Feature Importance

CONCLUSION
In conclusion, this project aimed to predict the survival of passengers aboard the Titanic
using machine learning techniques.
The dataset was first explored and preprocessed to handle missing values and encode
categorical variables. Various visualizations were employed to understand the data
distribution and relationships between different features.
Next, a Random Forest Classifier model was trained on the preprocessed data. The model
was evaluated using metrics such as accuracy, precision, recall, and F1-score.
The classification report revealed that the model performed reasonably well, with an
overall accuracy of 84%. It achieved high precision and recall for predicting both survival
and non-survival cases, indicating a good balance between correctly identifying survivors
and non-survivors.
Overall, this project demonstrates the application of machine learning algorithms in
predicting survival outcomes based on historical data, offering insights into factors
influencing survival rates on the Titanic. Further improvements and optimizations could be
explored to enhance the model's performance and robustness.

THANK YOU
END OF THE PROJECT
