TITANIC SURVIVAL PREDICTION
PROJECT REPORT
By - Rahul Badola
CONTENTS
- Introduction
- Data Overview
- Data Cleaning
- Data Visualization (EDA)
- Predictions
- Conclusion
INTRODUCTION
The RMS Titanic's tragic sinking in 1912 stands as a stark reminder of the
vulnerability of human life in the face of disaster. Despite the ship's opulence and
perceived invincibility, its collision with an iceberg led to the loss of over
1,500 lives. Today, leveraging data science techniques, we delve into the Titanic
dataset, examining factors such as passenger class, age, and gender to discern
patterns in survival rates. This project report explores the human stories within the
data, aiming to unravel the mysteries surrounding the Titanic's fateful voyage and shed
light on the factors that influenced survival aboard the iconic vessel.
DATA OVERVIEW
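The original report presents the raw DataFrame here as a screenshot. A minimal
sketch of loading and inspecting the data (the file name 'titanic.csv' is an
assumption):

import pandas as pd

# Load the Titanic passenger data; the file name is an assumption.
df = pd.read_csv('titanic.csv')

print(df.shape)           # number of rows and columns
print(df.head())          # first few rows
print(df.isnull().sum())  # null counts per column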
DATA CLEANING
Dtypes: inspecting the column data types of the DataFrame shows which features
are numeric and which are categorical.

We remove the 'PassengerId' column from the dataset, since it does not
contribute to the prediction task. The code below drops the 'PassengerId'
column from the DataFrame df in place, meaning the change is applied directly
to the DataFrame without the need to reassign it to a new variable.
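A minimal sketch of this step:

# Remove the identifier column in place; it carries no predictive signal.
df.drop('PassengerId', axis=1, inplace=True)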
Null values (after replacing): since 'Age' is a float-valued column, its null
values can be filled with the column mean, as in the sketch below.
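A minimal sketch of this replacement:

# Fill missing ages with the mean age of all passengers.
df['Age'].fillna(df['Age'].mean(), inplace=True)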
DATA VISUALIZATION (EDA)
Survived: 342 passengers in the dataset survived.
From the visualization, it can be observed that the age distribution is skewed towards
younger individuals, with a notable peak in the early 20s. Additionally, the density curves
reveal subtle differences between the age distributions of males and females, suggesting
potential variations in age demographics between the two genders aboard the Titanic.
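The plotting code is not preserved in the extracted report; a sketch that would
produce such a chart with seaborn (assuming the cleaned DataFrame df):

import seaborn as sns
import matplotlib.pyplot as plt

# Age histogram with overlaid density curves, split by gender.
sns.histplot(data=df, x='Age', hue='Sex', kde=True, stat='density')
plt.title('Age distribution by gender')
plt.show()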
PREDICTION
STEP - 1
Here, we instantiate a LabelEncoder object and then use the `fit_transform` method to
encode the 'Embarked' and 'Sex' columns of the DataFrame `df`. This process assigns
numerical labels to each unique category in these columns, making them suitable for
machine learning algorithms that require numerical input.
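A sketch of this encoding step (the encoder variable name is an assumption):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the unique categories and maps each to an integer.
df['Embarked'] = le.fit_transform(df['Embarked'])
df['Sex'] = le.fit_transform(df['Sex'])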
STEP - 2
Here, x is assigned the DataFrame df after dropping the columns 'Survived' and 'Name'
along the columns axis (axis=1), making it the independent variable matrix. y is assigned
the 'Survived' column from the DataFrame df, representing the dependent variable. This
separation is crucial for training machine learning models, where x contains the features
used for prediction and y contains the target variable to be predicted.
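A minimal sketch matching the description:

# Independent variable matrix: every column except the target and Name.
x = df.drop(['Survived', 'Name'], axis=1)
# Dependent variable: the survival label.
y = df['Survived']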
STEP - 3
The resulting DataFrame shows 684 rows and 7 columns, indicating that the index has
been successfully reset, and the DataFrame is ready for further analysis or modeling tasks.
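The exact operations of this step are not preserved in the extracted text; a
minimal sketch, assuming rows with remaining null values were dropped before
the index was reset:

# Drop incomplete rows, then renumber the index from 0.
df = df.dropna().reset_index(drop=True)
print(df.shape)  # the report shows (684, 7)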
STEP - 4
Here, the `train_test_split` function from scikit-learn is used to split the independent
variable matrix `x` and the dependent variable `y` into training and testing sets. The
parameter `test_size=0.20` specifies that 20% of the data will be used for testing, while
the remaining 80% will be used for training. Additionally, `random_state=46` sets the
random seed for reproducibility, ensuring that the same random split is obtained each time
the code is run.
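A sketch matching the description:

from sklearn.model_selection import train_test_split

# 80/20 train/test split with a fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=46)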
STEP - 6
Here, the fit method of the RandomForestClassifier rfc is called with the training data
x_train and y_train as arguments. This step trains the random forest classifier on the
training data, allowing it to learn the patterns and relationships between the features and
the target variable.
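The instantiation of the classifier is not shown in the extracted text, so the
default hyperparameters are an assumption in this sketch:

from sklearn.ensemble import RandomForestClassifier

# Create the classifier (hyperparameters assumed) and fit it on the training set.
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)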
STEP - 7
Here, the predict method of the trained Random Forest Classifier rfc is used to predict the
target variable based on the features in the testing dataset x_test. The predicted values
are stored in the variable y_pred, which can be used for evaluating the model's
performance and comparing it against the actual values in y_test.
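A minimal sketch of this step:

# Predict survival labels for the held-out test set.
y_pred = rfc.predict(x_test)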
STEP - 9
Here, the accuracy_score function from scikit-learn is used to compute the accuracy of the
model by comparing the predicted values y_pred with the actual values y_test. This
function returns the accuracy score, which represents the proportion of correctly
classified instances out of the total number of instances in the testing dataset.
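A sketch matching the description:

from sklearn.metrics import accuracy_score

# Fraction of test instances classified correctly.
print(accuracy_score(y_test, y_pred))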
Classification Report
Feature Importance
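The report shows these two outputs as screenshots; a sketch of how they could
be produced (using pandas to sort the importances is an assumption):

from sklearn.metrics import classification_report
import pandas as pd

# Per-class precision, recall, and F1-score on the test set.
print(classification_report(y_test, y_pred))

# Importance assigned by the random forest to each feature column.
importances = pd.Series(rfc.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))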
THANK YOU
END OF THE PROJECT