Decision Tree
The initial dataset preprocessing included handling missing values, particularly in the Age and
Embarked columns, where mean imputation and mode imputation were applied, respectively.
To prepare categorical variables, we used one-hot encoding for the "Sex" and "Embarked"
columns, creating separate binary columns and avoiding arbitrary numerical assignments that
might mislead the model. Non-predictive columns, such as PassengerID, Name, Ticket, and
Cabin, were removed from the feature set, as they do not directly influence survival outcomes.
Thus, the selected feature set consisted of Pclass, Sex, Age, Parch, Fare, and Embarked.
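The preprocessing steps above can be sketched with pandas. This is a minimal illustration on a hypothetical toy frame mirroring the named Titanic columns, not the study's actual code; the real pipeline would load the full dataset instead.

```python
import pandas as pd

# Hypothetical toy frame with the columns named in the text;
# stands in for the real Titanic dataset.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Cabin": [None, "C85", None, "B28"],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Mean imputation for Age, mode imputation for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Drop the non-predictive identifier columns.
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# One-hot encode the categorical columns into binary indicators.
df = pd.get_dummies(df, columns=["Sex", "Embarked"])
```

After these steps the frame contains only the selected features as numeric or binary columns, ready for model fitting.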
Two decision tree models were implemented: one based on the Gini Index and the other on Gain
Ratio. The models were trained on 80% of the data and evaluated on the remaining 20%. Key
performance metrics—accuracy, precision, recall, and F1-score—were used to assess each
model’s effectiveness.
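A scikit-learn sketch of this setup, under stated assumptions: the data here is synthetic (the real study used the Titanet features listed above), and scikit-learn offers "gini" and "entropy" (information gain) criteria rather than Gain Ratio itself, so "entropy" serves only as an approximate stand-in for the Gain Ratio model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for the six selected features and survival labels.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 1] + X[:, 4] > 1).astype(int)

# 80% train / 20% test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Note: "entropy" is information gain, not C4.5's gain ratio;
# a faithful Gain Ratio tree would need a custom implementation.
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)
    print(criterion,
          round(accuracy_score(y_test, pred), 3),
          round(f1_score(y_test, pred), 3))
```

Precision and recall can be reported the same way via `precision_score` and `recall_score` from `sklearn.metrics`.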
In conclusion, both the Gini Index and Gain Ratio are effective splitting criteria for decision tree
classification on this dataset, with minimal differences in performance; however, the Gini Index
model slightly outperformed the Gain Ratio model, making it a marginally better choice for
survival prediction in this case. The Gini Index model achieved an accuracy of 0.810, with
precision, recall, and F1-score all at 0.770. In comparison, the Gain Ratio model reached an
accuracy of 0.804, with a precision of 0.760, recall of 0.770, and F1-score of 0.765. Future work
could explore ensemble methods like random forests to potentially increase accuracy and
generalizability. This study highlights the utility of decision trees in survival analysis and their
adaptability to different criteria, offering a foundational approach for further predictive analytics
in similar datasets.
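The random forest extension suggested above could be sketched as follows; again the data is a synthetic stand-in, and the hyperparameters shown are illustrative defaults rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the six selected features and survival labels.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 1] + X[:, 4] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An ensemble of Gini-based trees; bagging plus per-split feature
# subsampling typically reduces the variance of a single tree.
forest = RandomForestClassifier(n_estimators=100, criterion="gini",
                                random_state=42)
forest.fit(X_train, y_train)
print("accuracy:", round(accuracy_score(y_test, forest.predict(X_test)), 3))
```

Comparing this ensemble's held-out accuracy against the single-tree results would directly test the generalizability claim.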