
Dataset Source Kaggle-1

The dataset from Kaggle contains 773 unique diseases and 377 symptoms, structured using one-hot encoding, with a high usability score of 10.00. It offers 246,000 samples and is designed to mimic real-world data, making it suitable for developing robust disease classification models. The custom Gaussian Naïve Bayes model achieved an accuracy of 86.65% in predicting diseases based on symptoms, demonstrating its effectiveness for healthcare applications.

Uploaded by

reqqwe647

Dataset Breakdown

Dataset Source: Kaggle


Dataset: https://www.kaggle.com/code/dhivyeshrk/disease-symtom-eda
Dataset Introduction:
The dataset covers 773 unique diseases described by 377 different symptoms. The symptoms
are one-hot encoded: each symptom is a column holding 1 if the symptom is present for the
disease in question and 0 if it is absent. The dataset is artificial, but it was generated
with careful attention to the relationships between the symptoms and their respective
diseases. It was designed to mimic realistic interactions, taking into account the severity
of symptoms and the probability of each disease's occurrence, which led to a rating of 10.00
on Kaggle's usability scale.
With 246,000 samples, the dataset provides a large amount of data to work with. Despite its
artificial nature, the methodology used to build it helps it mirror the complexity of
real-world data, making it a good foundation for the model to be built upon.
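
To make the one-hot layout concrete, here is a minimal sketch of how a list of observed symptoms could be turned into a 0/1 row. The five symptom names and the helper `encode_symptoms` are assumptions for illustration; the real dataset has 377 symptom columns whose exact names are not listed in this document.

```python
# Hypothetical symptom vocabulary (the real dataset has 377 symptom columns).
SYMPTOMS = ["headache", "nausea", "vomiting", "diarrhea", "fever"]


def encode_symptoms(present, vocabulary=SYMPTOMS):
    """Return a 0/1 list with a 1 wherever a vocabulary symptom is present."""
    present_set = {s.strip().lower() for s in present}
    return [1 if symptom in present_set else 0 for symptom in vocabulary]


row = encode_symptoms(["nausea", "vomiting"])
print(row)  # [0, 1, 1, 0, 0]
```

Each row of the dataset would then be one such vector plus a disease label.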

Reasons for selection:


1) High usability score: One of the primary reasons for selecting this dataset is its
usability score of 10.00, which Kaggle assigns based on Completeness, Credibility and
Compatibility. The score indicates that the dataset is fully populated with relevant data
with few or no missing values, comes from a trustworthy source with reliable and accurate
data, and integrates easily into different applications.
2) Exploratory Data Analysis: Exploration of the dataset found no null values. This
strengthens the case for it, since clean data is essential for accurate and reliable
training of the model later in the process.
3) Rich set of diseases: The dataset covers 773 unique diseases, a diverse set of medical
conditions that allows for the development of more generalized and robust models for
identifying and classifying diseases from the symptoms provided.
4) Quarterly updates: The dataset is updated quarterly, which made it an attractive option
because it stays current as the environment and human health conditions evolve.
Exploratory Data Analysis:
The steps below walk through preparing the data for use.

1. Importing Dependencies: Import the necessary libraries and modules for data
analysis and model implementation.

2. Load Dataset: Load the dataset from a CSV file containing information about
various diseases and their associated symptoms.

3. Checking for null data: Although the Kaggle score indicates that there is no null or
dirty data, we double-checked it.
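
The loading and null-check steps could look roughly like the sketch below. It assumes pandas and uses a tiny in-memory CSV as a stand-in, because the actual file name and column names are not given in this document; `diseases` and the symptom columns are illustrative only.

```python
import io

import pandas as pd

# Toy stand-in for the real CSV; the column names are assumptions.
toy_csv = io.StringIO(
    "diseases,headache,nausea,vomiting\n"
    "gastritis,0,1,1\n"
    "ileus,0,1,1\n"
    "hypovolemia,1,1,0\n"
)
df = pd.read_csv(toy_csv)

# Count missing values per column, then in total.
nulls_per_column = df.isnull().sum()
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column)
print("total missing values:", total_nulls)
```

A total of zero confirms what the usability score suggests; any non-zero column would need cleaning before training.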

4. Dataset Information: Understand the dataset through different summary measures.

5. Splitting data for the model:
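
The document does not state how the split was made. As one possible sketch, the helper below shuffles the rows and holds out a fraction for testing. The function name `split_frame`, the 80/20 ratio, the seed, and the names `XX_test` and `YY_test` are assumptions; only `XX_train` and `YY_train` appear in the source.

```python
import numpy as np
import pandas as pd


def split_frame(df, label_column, test_fraction=0.2, seed=0):
    """Shuffle rows and split into train/test features and labels."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))
    n_test = int(round(len(df) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    X = df.drop(columns=[label_column])
    y = df[label_column]
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]


# Toy data with assumed column names.
toy = pd.DataFrame({
    "diseases": ["ileus", "gastritis"] * 5,
    "nausea": [1, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    "headache": [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})
XX_train, XX_test, YY_train, YY_test = split_frame(toy, "diseases")
print(XX_train.shape, XX_test.shape)  # (8, 2) (2, 2)
```

With 773 classes of very uneven size, a stratified split may be preferable in practice so that rare diseases appear in both parts.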

6. Findings from distribution of symptoms:

Here, we found that some diseases are represented by far more symptom records than
others. This imbalance appears to be intentional: the dataset is tuned to mimic real-world
occurrence and distribution, and the under-represented diseases are those whose
probability of occurrence in the real world is comparatively negligible.
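
One way such a distribution could be inspected is sketched below with pandas on toy data. The column names are assumptions, and the document does not show which measure was actually plotted; the sketch computes both rows per disease and total symptom occurrences per disease.

```python
import pandas as pd

# Toy data with assumed column names.
toy = pd.DataFrame({
    "diseases": ["ileus", "ileus", "ileus", "gastritis", "hypovolemia"],
    "nausea": [1, 1, 0, 1, 1],
    "vomiting": [1, 0, 1, 1, 0],
})
symptom_columns = ["nausea", "vomiting"]

# Rows per disease: how often each label occurs in the data.
samples_per_disease = toy["diseases"].value_counts()

# Total symptom occurrences per disease, summed over all of its rows.
symptom_counts = toy.groupby("diseases")[symptom_columns].sum().sum(axis=1)

print(samples_per_disease)
print(symptom_counts)
```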
7. Custom Naïve Bayes code and fitting the model:

model2 = CustomGaussianNB()
model2.fit(XX_train.to_numpy(), YY_train.to_numpy())
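
The definition of `CustomGaussianNB` is not shown in this document. The sketch below is one plausible minimal version using NumPy, not the original implementation. The `var_smoothing` parameter and the attribute names are assumptions; the smoothing term is there because one-hot features often have zero variance within a class.

```python
import numpy as np


class CustomGaussianNB:
    """Minimal Gaussian Naive Bayes (illustrative sketch only)."""

    def __init__(self, var_smoothing=1e-9):
        # Assumed parameter: added to every variance to avoid division by zero.
        self.var_smoothing = var_smoothing

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Per-class feature means, variances, and log prior probabilities.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
        self.vars_ = self.vars_ + self.var_smoothing
        self.log_priors_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def _joint_log_likelihood(self, X):
        X = np.asarray(X, dtype=float)
        # Broadcast to shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.means_[None, :, :]
        log_pdf = -0.5 * (np.log(2.0 * np.pi * self.vars_)[None] + diff ** 2 / self.vars_[None])
        return log_pdf.sum(axis=2) + self.log_priors_[None, :]

    def predict_proba(self, X):
        jll = self._joint_log_likelihood(X)
        jll = jll - jll.max(axis=1, keepdims=True)  # stabilise before exponentiating
        probs = np.exp(jll)
        return probs / probs.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]


# Toy usage with made-up rows.
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_toy = np.array(["ileus", "ileus", "gastritis", "gastritis"])
toy_model = CustomGaussianNB().fit(X_toy, y_toy)
toy_pred = toy_model.predict(np.array([[1, 0], [0, 1]]))
toy_proba = toy_model.predict_proba(np.array([[1, 0], [0, 1]]))
print(toy_pred)
```

The broadcasted likelihood is simple to read but memory-hungry; with 773 classes and 377 features, prediction would likely need to be done in batches.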

8. Symptoms feeding and classification:
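
After the user's symptoms are encoded and passed to the model, the per-class probabilities need to be ranked. The sketch below shows one way to do that. The helper `rank_diseases` and the class order are assumptions; the three probability values are the ones reported in the Accuracy section and are used here only as sample input.

```python
import numpy as np


def rank_diseases(probabilities, class_names, top_k=3):
    """Return the top_k (disease, probability) pairs, most likely first."""
    probabilities = np.asarray(probabilities, dtype=float)
    order = np.argsort(probabilities)[::-1][:top_k]
    return [(class_names[i], float(probabilities[i])) for i in order]


# Assumed class order; probabilities taken from the reported result.
class_names = ["gastritis", "hypovolemia", "ileus"]
probabilities = [0.0001, 0.3401, 0.6598]

ranking = rank_diseases(probabilities, class_names)
for disease, p in ranking:
    print(f"{disease}: {p:.2%}")
```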

9. Accuracy:

The custom Gaussian Naïve Bayes model mapped the user-provided symptoms (feeling ill, vomiting,
headache, nausea, and diarrhea) to the most probable diseases. It ranked ileus as the most likely
condition with a probability of 65.98%, followed by hypovolemia at 34.01% and gastritis at 0.01%,
giving a ranked differential diagnosis rather than a single label.

The classification process achieved an accuracy of 86.65%, which supports the effectiveness of the
custom implementation in predicting diseases from symptom patterns. This level of accuracy
underscores the value of a well-designed probabilistic model for healthcare applications, where
quick and reliable predictions are crucial for patient care. The result reflects the model's ability
to handle noisy symptom data while offering useful input for clinical decision-making. Further
optimization and real-world validation could improve the model's precision and adaptability across
diverse datasets.
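
For reference, accuracy here is taken to mean the fraction of test samples whose predicted disease matches the true one. A minimal sketch of that calculation follows; the helper name and the toy labels are illustrative and are not the data behind the 86.65% figure.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float((y_true == y_pred).mean())


# Toy labels: three of four predictions are correct.
y_true = ["ileus", "gastritis", "ileus", "hypovolemia"]
y_pred = ["ileus", "gastritis", "hypovolemia", "hypovolemia"]
print(f"{accuracy(y_true, y_pred):.2%}")  # 75.00%
```

Because the classes are imbalanced, per-class recall or a macro-averaged score could complement this single number.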
