Classification ML Project Report

A classification model was built to predict accident survival based on factors like age, gender, speed, and safety gear usage, using a dataset of 200 cases. The Random Forest model outperformed others with a balanced F1 score of 55%, while Logistic Regression had the highest precision but lower recall. Key insights indicated that helmet and seatbelt use significantly affect survival rates, and recommendations for model improvement include adding more variables and exploring different algorithms.

Uploaded by elliptiicclips

I chose to build a classification model to predict whether a person involved in an accident survives, based on age, gender, speed of impact, and helmet and seatbelt use. This prediction could help stakeholders improve safety regulations.

There were 200 accident cases in a CSV file I downloaded from Kaggle. I cleaned the data by encoding categorical fields as binary values, filled in missing information, and split the data into training and testing sets (80%/20%).

I trained three models: logistic regression, decision tree, and random forest. Their accuracy, precision, recall, and F1 scores are as follows:

Model 1: Logistic Regression

● Accuracy: 57.5%
● Precision: 60%
● Recall: 45%
● F1 Score: 51.4%

Model 2: Decision Tree

● Accuracy: 47.5%
● Precision: 47.4%
● Recall: 45%
● F1 Score: 46.1%

Model 3: Random Forest

● Accuracy: 55%
● Precision: 55%
● Recall: 55%
● F1 Score: 55%
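As a quick sanity check on the table above: the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch recomputing the reported F1 values from each model's precision and recall (small differences from the table come from the inputs already being rounded):

```python
# F1 is the harmonic mean of precision and recall.
def f1_from_pr(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1_from_pr(0.600, 0.45), 3))  # Logistic Regression -> 0.514
print(round(f1_from_pr(0.474, 0.45), 3))  # Decision Tree -> 0.462
print(round(f1_from_pr(0.550, 0.55), 3))  # Random Forest -> 0.55
```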

The Random Forest model performed best overall, with balanced precision and recall (both 55%), making it the most reliable predictor. While Logistic Regression had higher precision (60%), it suffered from lower recall (45%), meaning it missed more of the actual survival cases. The Decision Tree performed worst, likely due to overfitting.

Some key findings and insights: helmet and seatbelt use significantly impacted survival rates; higher speed was correlated with lower survival; age played a moderate role, with younger people surviving more often; and gender did not have much of an impact.

Some recommendations are adding more features such as weather conditions or time of day, tuning the models' hyperparameters, trying other algorithms like gradient boosting or neural networks, applying SMOTE to balance the classes, and eventually deploying the model to help people.
This analysis provides valuable insights into factors affecting road accident survival. The
Random Forest model proved to be the best predictor, but further improvements can be made
with additional data and tuning. These insights can be used to inform public safety policies,
vehicle safety features, and accident response strategies.


My Python code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Load the dataset (raw string avoids backslash-escape issues on Windows paths)
df = pd.read_csv(r"C:\Users\sigle\OneDrive\Desktop\accident.csv")  # Replace with your file path

# Preprocess the data: fill missing values, then encode categorical columns as binary
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
df["Speed_of_Impact"] = df["Speed_of_Impact"].fillna(df["Speed_of_Impact"].median())
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df["Helmet_Used"] = df["Helmet_Used"].map({"No": 0, "Yes": 1})
df["Seatbelt_Used"] = df["Seatbelt_Used"].map({"No": 0, "Yes": 1})

# Define features and target variable
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Split dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize models
log_reg = LogisticRegression()
decision_tree = DecisionTreeClassifier(random_state=42)
random_forest = RandomForestClassifier(random_state=42)

# Train models
log_reg.fit(X_train, y_train)
decision_tree.fit(X_train, y_train)
random_forest.fit(X_train, y_train)

# Make predictions
log_reg_preds = log_reg.predict(X_test)
decision_tree_preds = decision_tree.predict(X_test)
random_forest_preds = random_forest.predict(X_test)

# Function to evaluate models
def evaluate_model(y_true, y_pred):
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred),
    }

# Evaluate models
log_reg_results = evaluate_model(y_test, log_reg_preds)
decision_tree_results = evaluate_model(y_test, decision_tree_preds)
random_forest_results = evaluate_model(y_test, random_forest_preds)

# Print results
print("Logistic Regression:", log_reg_results)
print("Decision Tree:", decision_tree_results)
print("Random Forest:", random_forest_results)
