Thyroid Disease Detection Using Supervised Machine Learning Techniques
Abstract- Thyroid disease is a common endocrine disorder that affects millions of people around the world. Accurate diagnosis and prediction are essential for effective treatment. This study investigates the effectiveness of several machine learning models for predicting thyroid disease. The models used were Decision Tree, Logistic Regression, Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor (kNN), Naive Bayes, K-Means, and an ensemble model. The Logistic Regression model was 96% accurate, while the Decision Tree and Random Forest models were both 99% accurate. SVM had an accuracy of 87%, kNN 95%, and the ensemble model 97%. The accuracy of Naive Bayes and K-Means was lower. These outcomes emphasise the potential of machine learning algorithms, notably Logistic Regression, Decision Tree, Random Forest, and ensemble models, for accurately diagnosing and predicting thyroid disease.

Keywords: Machine Learning, Classification, Random Forest Model, kNN, Decision Tree Model, Logistic Regression Model, SVM Model, and Naive Bayes Model

I. INTRODUCTION
Thyroid illness is a common endocrine ailment that affects millions of individuals throughout the world. Machine learning algorithms have shown potential in assisting the detection of thyroid disease by analysing large volumes of patient data rapidly and consistently [5,19]. This study compares the Logistic Regression model with other commonly used machine learning approaches for predicting thyroid illness. The Decision Tree and Random Forest models were the most accurate, with a 99% accuracy rate, followed by Logistic Regression (96%) and kNN (95%). These findings highlight the promise of machine learning algorithms in the detection of thyroid illnesses, as well as the need to select appropriate algorithms for medical diagnosis [6].

Conventional thyroid disease diagnostic procedures include a physical examination, blood testing, and imaging investigations. These approaches, however, can be time-consuming and costly, and they may be limited in identifying early stages of the disease [8,13]. By evaluating massive volumes of patient data and discovering patterns and risk factors that human specialists may not be able to uncover, machine learning algorithms have shown promise in increasing the speed and accuracy of thyroid disease diagnosis.

The aim of this research is to analyse the performance of several machine learning algorithms in predicting thyroid illness and to identify the most accurate and efficient algorithm for thyroid disease detection. The findings of the study have the potential to influence clinical decision-making and improve outcomes in the diagnosis and management of thyroid disease.

II. LITERATURE REVIEW
By examining vast amounts of patient data quickly and reliably, supervised machine learning algorithms have shown considerable promise in the identification of thyroid illness. A typical approach to thyroid disease diagnosis is to use multiclass classification algorithms, which assign one of several possible labels to each instance of input data [16].

Emmanuel F. Werr [1] compiled the 'Thyroid Disease Diagnosis' dataset and carried out a major part of the pre-processing on the collected data. Khalid Salman and Emrullah Sonuç [2] observed an accuracy of 98.9% using the Random Forest model, 98.4% using the Decision Tree model, 92.27% using SVM, and 90.9% using the K-NN classifier. Lerina Aversano, Mario Luca Bernardi, Marta Cimitile, Martina Iammarino, Paolo Emidio Macchia, Immacolata Cristina Nettore, and Chiara Verdone [3] observed an accuracy of 82% using the K-Nearest Neighbor model. Chaganti R, Rustam F, De La Torre Díez I, Mazón JLV,
Authorized licensed use limited to: UNIVERSITY TENAGA NASIONAL. Downloaded on January 10,2024 at 06:17:08 UTC from IEEE Xplore. Restrictions apply.
Rodríguez CL, and Ashraf I. observed an accuracy of 98% using the Random Forest model and 85% using both the Logistic Regression and SVM models. In [4], Chaganti R, Rustam F, De La Torre Díez I, Mazón JLV, Rodríguez CL, and Ashraf I. reported an accuracy of 85% for both the logistic regression and SVM models in thyroid disease prediction.

Alpaydin [22] defined core machine learning concepts, including supervised learning, the use of data in supervised learning, classification and its applications, regression, and unsupervised learning algorithms and their applications. The same work also discussed reinforcement learning and other variations such as neural networks, and compared neural networks with the nervous system. H. Wang, Z. Lei, X. Zhang, B. Zhou, and J. Peng [20] formulated the mathematical expressions used in machine learning and generalized equations for performance and error. That work also covers the concepts of overfitting, underfitting, optimal capacity, and the curse of dimensionality.

Mahesh [24] and Jordan and Mitchell [16] discussed the algorithms and statistical techniques that computer systems use to perform tasks in machine learning. Zhou [25] and Mitchell [14] introduce the basics of machine learning, popular machine learning methods, and other advanced topics. El Naqa and Murphy [18] explored the further expansion of machine learning in computer vision.

III. METHODOLOGY
A. Data Set
A dataset is a collection of data used for academic or scientific investigation. A research paper should give comprehensive details of the dataset employed, including its origin, dimensions, characteristics, and any cleaning or refinement procedures that were applied [19,21].

The study used a thyroid illness dataset [1] obtained from Kaggle, which had 9,172 observations and 21 attributes. The data were used to train eight different models to predict thyroid disease outcomes using machine learning techniques. The study's goal was to compare the performance of these models using evaluation measures and to yield insights into the usefulness of various machine learning approaches for this dataset.

B. Proposed System
The proposed thyroid illness detection system, based on supervised machine learning algorithms, aims to improve the precision and effectiveness of thyroid illness diagnosis. The system is intended to evaluate massive amounts of patient data and uncover patterns and risk factors that human specialists may not be able to identify.

The system consists of four components: data pre-processing, feature selection, model training, and model assessment. The data pre-processing component cleans and normalises the patient data and detects any missing or incorrect data points. The feature selection component picks the most relevant characteristics from the data based on their predictive potential for thyroid illness. The flow chart of the proposed system is shown in Fig. 1.

Fig. 1: Suggested Accuracy Prediction System

IV. DATA PRE-PROCESSING
Data pre-processing is a very important step in data analysis. It entails cleaning, converting, and preparing data for analysis in order to ensure correctness and consistency. It comprises activities such as data cleansing, transformation, feature selection, and dimensionality reduction, and it is required for accurate and dependable outcomes.

A. Before Pre-processing
There were 9172 observations and 31 characteristics in the dataset. The features included demographic information, medical history, and various laboratory test results. Unfortunately, several of the characteristics had missing values or outliers, which might have an impact on the machine learning
algorithms' performance. The data before pre-processing are shown in Fig. 2.

Fig. 2: Unprocessed data bar graph

B. After Pre-Processing
The data preparation strategies used included missing value management, outlier reduction, and feature selection. Furthermore, feature engineering was used to create new attributes from existing ones, resulting in better machine learning model accuracy [23]. The data after pre-processing are shown in Fig. 3.

Fig. 3: Processed data bar graph

C. Attributes used in the models
In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

Accuracy: The proportion of test outcomes that are predicted correctly. It can be calculated by equation 1.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1) \]

Precision: The proportion of instances predicted as positive that are truly positive. It can be calculated by equation 2.

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (2) \]

Recall: The proportion of instances that actually belong to the positive class that are correctly identified. It can be calculated by equation 3.

\[ \text{Recall} = \frac{TP}{FN + TP} \qquad (3) \]

F1 Score: The F1 Score summarises the performance of a machine learning model by combining precision and recall. It can be computed by equation 4.

\[ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (4) \]

D. Implementation of given classifiers
A Kaggle dataset of 9172 observations and 21 characteristics was used to train the models and evaluate their performance. The eight models used are detailed below.

a) SVM Model
SVM is a machine learning technique that classifies data by finding a hyperplane that maximises the margin between two classes. By using a kernel function to map the data into a higher-dimensional space, it can handle both linear and nonlinear data [7, 9, 11, 15]. Prediction matrices for the SVM model are shown in Table 1.

TABLE 1: PREDICTION MATRICES FOR SVM MODEL
Accuracy   Precision   Recall   F1-Score
0.87       0.93        0.95     0.94

The SVM model gives an accuracy of 87%.

b) Decision Tree Model
The Decision Tree is a machine learning model that makes decisions based on input features using a tree-like structure. It builds the tree by iteratively splitting the data into subsets according to the most relevant attributes. The leaf nodes give the final decision [7, 9, 10]. The equations for Information Gain and the Gini coefficient (Gini index) are shown below.

\[ \text{Information Gain} = \text{Entropy}(\text{parent}) - \sum_{k} \frac{N_k}{N} \, \text{Entropy}(\text{child}_k) \]

\[ \text{Gini coefficient} = 1 - \sum_{i} p_i^2 \]

Here N_k / N is the fraction of the parent's samples that fall into child k (the weights of the weighted average), and p_i is the proportion of samples in the node that belong to class i.

Prediction matrices for the Decision Tree model are shown in Table 2.

TABLE 2: PREDICTION MATRICES FOR DECISION TREE MODEL
Accuracy   Precision   Recall   F1-Score
0.99       0.99        0.99     0.99

The accuracy achieved by the decision tree model is approximately 99.49%.

c) Logistic Regression Classifier
The Logistic Regression (LogReg) classifier is a binary classification technique that predicts the probability of a categorical dependent variable given one or more independent variables [9,12,17].
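To illustrate how a logistic regression classifier turns feature values into a class probability, a minimal sketch follows. It is not the study's implementation: the weights, bias, feature values, and the 0.5 decision threshold are hypothetical values chosen only for illustration, and in practice the weights would be learned from the training data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    # Linear score: bias plus the weighted sum of the feature values.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Return 1 (positive class) if the probability reaches the threshold."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical parameters and one hypothetical sample with two features.
weights = [0.8, -1.2]
bias = -0.5
sample = [2.0, 0.5]

probability = predict_probability(sample, weights, bias)
label = classify(sample, weights, bias)
print(f"probability={probability:.3f}, label={label}")
```

The sigmoid keeps the output between 0 and 1, which is what allows the model's score to be read as the likelihood of the positive class.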
Prediction matrices obtained for the LogReg model are presented in Table 3.

TABLE 3: PREDICTION MATRICES FOR LOGISTIC REGRESSION
Accuracy   Precision   Recall   F1-Score
0.96       0.97        0.99     0.98

The Logistic Regression (LogReg) classifier gives an accuracy of 96%.

d) Naive Bayes Classifier
The Naive Bayes model assumes that the attributes are independent of one another given the class [9,12,15]. Class probabilities are calculated using Bayes' theorem, given below.

\[ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \]

Prediction matrices for the Naïve Bayes model are shown in Table 4.

TABLE 4: PREDICTION MATRICES FOR NAÏVE BAYES
Accuracy   Precision   Recall   F1-Score
0.26       0.98        0.20     0.33

The Naïve Bayes classifier gives an accuracy of 26%.

e) Random Forest
Random Forest is a supervised machine learning model used here for classification [9, 12]. Prediction matrices for the Random Forest model are shown in Table 5.

TABLE 5: PREDICTION MATRICES FOR RANDOM FOREST
Accuracy   Precision   Recall   F1-Score
0.99       0.99        0.99     0.99

h) Ensemble Voting Model
The Ensemble Voting Model is a machine learning approach that merges different models to improve predictive performance. It operates by aggregating the predictions of several models and picking the most frequent prediction as the final result. The models may use distinct algorithms, or the same algorithm with different hyperparameters. The Ensemble Voting Model combines the strengths of each individual model, improving accuracy and robustness.

The ensemble model gives an accuracy of 97.4%.

V. RESULT AND COMPARISON
Using a dataset of 3,772 patient records, the efficacy of several supervised machine learning algorithms for predicting thyroid illness was assessed. The dataset includes age, gender, TSH level, T3 level, T4 level, thyroid disease history, and symptoms.

The following algorithms were tested: Decision Tree, Random Forest, Logistic Regression, Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Naive Bayes. Fig. 4 depicts a comparison of the accuracy of the models used in the study.
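The rule used by the Ensemble Voting Model, where the most frequent prediction among the individual models becomes the final result, can be sketched as below. This is a generic illustration of hard (majority) voting, not the study's code: the three prediction lists are hypothetical, and the tie-breaking behaviour (the label seen first among the tied ones wins) is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the models' predictions for one sample.

    Ties are resolved in favour of the label that appears first in `labels`,
    because Counter.most_common keeps first-encountered order for equal counts.
    """
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(per_model_predictions):
    """Combine predictions from several models, sample by sample.

    `per_model_predictions` holds one list of predicted labels per model;
    all lists must cover the same samples in the same order.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Hypothetical predictions of three models for three samples (1 = disease).
model_a = [1, 0, 0]
model_b = [1, 1, 0]
model_c = [0, 0, 1]

final = ensemble_predict([model_a, model_b, model_c])
print(final)  # [1, 0, 0]
```

Using an odd number of models avoids ties in binary classification, which is one reason voting ensembles are often built from three or five classifiers.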
holds great promise for improving public health outcomes and reducing the burden of thyroid-related illnesses.

TABLE 8: PREDICTION MATRICES OF DIFFERENT MODELS
Model Name                       Accuracy   Precision   Recall   F1-Score
Decision Tree                    0.99       0.99        0.99     0.99
Logistic Regression Classifier   0.96       0.97        0.99     0.98
Random Forest                    0.99       0.99        0.99     0.99
K-NN                             0.95       0.96        0.99     0.97
K-Means                          0.42       0.85        0.45     0.59
Naïve Bayes                      0.26       0.98        0.20     0.33
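The four evaluation measures in Equations 1 to 4 can be computed directly from confusion-matrix counts, and the F1 scores reported in Table 8 can be cross-checked against the reported precision and recall. The sketch below does both. The confusion-matrix counts are hypothetical and only illustrate the formulas; the precision, recall, and F1 values are copied from Table 8.

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 (Equations 1 to 4).

    Assumes at least one predicted positive and one actual positive,
    so that none of the denominators is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """F1 score from precision and recall (Equation 4)."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, used only to show the formulas at work.
accuracy, precision, recall, f1 = metrics_from_counts(tp=90, tn=50, fp=10, fn=10)
print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# (precision, recall, reported F1) for each model, as listed in Table 8.
table_8 = {
    "Decision Tree": (0.99, 0.99, 0.99),
    "Logistic Regression": (0.97, 0.99, 0.98),
    "Random Forest": (0.99, 0.99, 0.99),
    "K-NN": (0.96, 0.99, 0.97),
    "K-Means": (0.85, 0.45, 0.59),
    "Naive Bayes": (0.98, 0.20, 0.33),
}

for model, (p, r, reported_f1) in table_8.items():
    recomputed = f1_from(p, r)
    print(f"{model}: reported F1 {reported_f1:.2f}, recomputed {recomputed:.2f}")
```

For every row the recomputed value agrees with the reported F1 score to two decimal places, so the table is internally consistent with Equation 4. The Naïve Bayes row also shows why accuracy and precision alone can mislead: its precision is 0.98, but its recall of 0.20 pulls the F1 score down to 0.33.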
overfitting, making them versatile in data analysis.

It was closely followed by the Logistic Regression classifier, a classification algorithm that estimates probabilities for binary or multiclass outcomes, which achieved an admirable 96% accuracy. SVM, which is effective not only on linearly separable data but can also handle non-linear classification tasks through the use of kernel functions, lagged behind with an accuracy of 87%. The kNN (k-nearest neighbors) algorithm, whose simplicity and intuitive approach make it popular for both classification and regression tasks in machine learning, demonstrated a respectable 95% accuracy rate.

When the time each algorithm took to train and test on the dataset is considered, a pattern emerges. In terms of computation, the Decision Tree, Logistic Regression, and Random Forest algorithms proved to be the fastest. In contrast, SVM and K-Means were comparatively slower. Additionally, it is important to note that the Decision Tree and Random Forest models were more complex than the other methods used in the research, which produced somewhat simpler models.

In conclusion, the findings of the research suggest that supervised machine learning algorithms hold promise for effectively identifying thyroid illness. When accuracy and efficiency are considered, the Logistic Regression, kNN, Decision Tree, and Random Forest algorithms outperform the others. Nonetheless, the choice of a particular method should depend on the precise demands of the diagnostic task. Factors such as the need for a simple or complex model, as well as the available computing resources, should be considered when selecting the most suitable algorithm or model for thyroid illness identification.

VII. REFERENCES
[1] Werr, E. F. (2022, June). Thyroid Prediction Dataset. Kaggle.
[2] Salman, K., & Sonuç, E. (2021). J. Phys.: Conf. Ser. 1963 012140.
[3] Aversano, L., Bernardi, M. L., Cimitile, M., Iammarino, M., Macchia, P. E., Nettore, I. C., & Verdone, C. (2021). Thyroid Disease Treatment prediction with machine learning approaches. Procedia Computer Science, 192, 1031–1040.
[4] Chaganti, R., Rustam, F., De La Torre Díez, I., Mazón, J. L. V., Rodríguez, C. L., & Ashraf, I. (2022). Thyroid Disease Prediction Using Selective Features and Machine Learning Techniques. Cancers, 14(16), 3914.
[5] Azar, A. T., Member, I. S., Hassanien, A. E., & Kim, T. (2012). Expert System Based on Neural-Fuzzy Rules for Thyroid Diseases Diagnosis. Communications in Computer and Information Science, 94–105.
[6] Hosseinzadeh, M., Ahmed, O. H., Ghafour, M. Y., Safara, F., Hama, H. K., Ali, S., ... & Chiang, H. S. (2021). A multiple multilayer perceptron neural network with an adaptive learning algorithm for thyroid disease diagnosis in the internet of medical things. The Journal of Supercomputing, 77, 3616–3637.
[7] Al-muwaffaq, I., & Bozkus, Z. (2016). MLTDD: use of machine learning techniques for diagnosis of thyroid gland disorder. Comput Sci Inf Technol, 67-3.
[8] Borzouei, S., Mahjub, H., Sajadi, N. A., & Farhadian, M. (2020). Diagnosing thyroid disorders: Comparison of logistic regression and neural network models. Journal of Family Medicine and Primary Care, 9(3), 1470–1476.
[9] Soni, K. M., Gupta, A., & Jain, T. (2021, September). Supervised Machine Learning Approaches for Breast Cancer Classification and a high performance Recurrent Neural Network. In 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA) (pp. 1-7). IEEE.
[10] Jain, T., Verma, V. K., Agarwal, M., Yadav, A., & Jain, A. (2020, July). Supervised machine learning approach for the prediction of breast cancer. In 2020 International Conference on System, Computation, Automation and Networking (ICSCAN) (pp. 1-6). IEEE.
[11] Srivastava, A., Chaudhary, A., Gautam, A., Singh, D., & Khan, R. (2020). Prediction of students performance using KNN and decision tree: A Machine Learning Approach.
[12] Jain, T., Jain, A., Hada, P. S., Kumar, H., Verma, V. K., & Patni, A. (2021, September). Machine Learning Techniques for Prediction of Mental Health. In 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA) (pp. 1606-1613). IEEE.
[13] Begum, A. M., & Parkavi, A. (2019). Prediction of thyroid Disease Using Data Mining Techniques. 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), 342-345.
[14] Mitchell, T. M. (1997). Machine learning.
[15] Riajuliislam, M., Rahim, K. Z., & Mahmud, A. (2021, February). Prediction of Thyroid Disease (Hypothyroid) in Early Stage Using Feature Selection and Classification Techniques. In 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD) (pp. 60-64). IEEE.
[16] Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
[17] Sankar, S., Potti, A., Chandrika, G. N., & Ramasubbareddy, S. (2022). Thyroid Disease Prediction Using XGBoost Algorithms. Journal of Mobile Multimedia.
[18] El Naqa, I., & Murphy, M. J. (2015). What is machine learning (pp. 3-11). Springer International Publishing.
[19] Cai, J., Luo, J., Wang, S., & Yang, S. (2018). Feature selection in machine learning: A new perspective. Neurocomputing, 300, 70-79.
[20] Wang, H., Lei, Z., Zhang, X., Zhou, B., & Peng, J. (2016). Machine learning basics. Deep learning, 98-164.
[21] Rabby, G., & Berka, P. (2022). Multi-class classification of COVID-19 documents using machine learning algorithms. Journal of Intelligent Information Systems.
[22] Alpaydin, E. (2021). Machine learning. MIT Press.
[23] Tech G. FeatureSelection. [(accessed on 25 March 2023)].
[24] Mahesh, B. (2020). Machine learning algorithms-a review. International Journal of Science and Research (IJSR).[Internet], 9(1), 381-386.
[25] Zhou, Z. H. (2021). Machine learning. Springer Nature.