Predicting Autism Spectrum Disorder Using Machine Learning Technique
By:
Ruxosh Mohammad
Supervised by:
Dr. Awaz K. Muhammad
April-2024
Certification of the Supervisor
I certify that this work was prepared under my supervision at the Department of
Mathematics / College of Education / Salahaddin University-Erbil in partial
fulfillment of the requirements for the degree of Bachelor of philosophy of Science
in Mathematics.
Signature:
Supervisor: Dr. Awaz K. Muhammad
Scientific grade: Lecturer
Date: 8 / 4 / 2024
Signature:
Name: Dr. Rashad Rashid Haji
Scientific grade: Assistant Professor
Chairman of the Mathematics Department
Date: 8/ 4 / 2024
ACKNOWLEDGMENT
In the name of Allah, the Most Gracious, the Most Merciful. First of all, I want
to express my deepest praise to Allah Almighty for giving me patience.
I want to extend my sincere thanks to my math teacher, Dr. Awaz K. Muhammad,
for her dedication and support in helping me complete my project. Her diligent
guidance and encouragement made all the difference.
I am also grateful for her patience as she guided me step by step through the
math project.
I would also like to express my special appreciation to our head of department,
Asst. Prof. Dr. Rashad Rashid, for his support during these years, and my thanks
to the entire staff of the mathematics department.
I would like to express my gratitude to my parents, family, and friends for their
invaluable support in completing this project within a tight deadline. Thank you!
ABSTRACT
Table of Contents
Certification of the Supervisor
ACKNOWLEDGMENT
ABSTRACT
List of tables
List of figures
INTRODUCTION
CHAPTER ONE
Classification Methods:
Naive Bayes Classifier:
Support Vector Machine (SVM):
K-nearest neighbors (KNN):
Decision Tree Classifier:
Logistic Regression (LR):
Classifier Evaluation Measures
Accuracy:
Sensitivity:
Specificity:
Precision:
Recall:
F-measure:
ROC (Receiver Operating Characteristic curve):
Confusion Matrix:
CHAPTER TWO
Dataset
Description of Data Set Attributes:
Representation graph for the data set
CHAPTER THREE
Experiments And Results
Representation graph for the results
CONCLUSION
Reference
List of tables
List of figures
Figure 1: Target: ASD class
Figure 2: Histogram of A1 in Q-CHAT-A10
Figure 3: Histogram Ethnicity
Figure 4: Histogram Genetic Disorders
Figure 5: Histogram Qchat 10 Score
Figure 6: Age Distribution
Figure 7: Gender distribution
Figure 8: Histogram Childhood Autism Rating Scale
Figure 10: Sex distribution
Figure 11: Histogram Anxiety disorder
Figure 12: Histogram Jaundice
Figure 13: Patient has jaundice at birth
Figure 14: Histogram Family member with ASD
Figure 15: Precision of all Classification Algorithms.
Figure 16: Recall of all Classification Algorithms.
Figure 17: F-measure of all Classification Algorithms.
Figure 18: Accuracy of all Classification Algorithms.
Figure 19: ROC of all Classification Algorithms.
Figure 21: Correctly classified instances Algorithm
Figure 20: Incorrectly classified instances Algorithm
INTRODUCTION
Measure algorithms, they found that the best classifier is WCBA, with an
accuracy of 97%.
This project contains three chapters. In chapter one we describe several
classification algorithms and related material. The classifiers, namely Support
Vector Machine (SVM), Decision Tree (DT), K-nearest neighbors (KNN), Logistic
Regression (LR) and Naive Bayes (NB), are used to predict and analyze the
ASD problem.
Chapter two describes the dataset, which is composed of survey results from
people who filled in an app form. Labels indicate whether the person received
a diagnosis of autism. This dataset contains factors involved in developing ASD
in children. It consists of the following features: A10 Autism Spectrum
Quotient, Social Responsiveness Scale, Age Years, Qchat_10_Score, Speech
Delay/Language Disorder, Learning disorder, Genetic Disorders, Depression,
Global developmental delay/intellectual disability, Social/Behavioural Issues,
Childhood Autism Rating Scale, Anxiety disorder, Sex, Ethnicity, Jaundice, and
Family mem with ASD, along with various other information. The data consists
of different quantities and factors which are characteristic of ASD in
children. It was collected from AUTISM RESEARCH: UNIVERSITY OF
ARKANSAS Computer Science Department.
Chapter three includes all the experiments and results. We use WEKA tools to
obtain the results. Five machine learning classification algorithms are used:
Naive Bayes, SVM, Decision Tree, KNN and LR. The experimental performances of
all five algorithms are compared on different metrics such as Accuracy,
F-Measure, ROC and Recall. Accuracy is measured on correctly and incorrectly
classified instances. The obtained results show that LR outperforms the other
algorithms with the highest accuracy of 0.983 (98.3%).
CHAPTER ONE
Classification Methods:
Support Vector Machine (SVM):
SVM is one of the standard supervised machine learning models employed
in classification. Given a two-class training sample, the aim of a support vector
machine is to find the highest-margin separating hyperplane between the
two classes. For better generalization, the hyperplane should not lie close to the
data points of either class; the hyperplane selected should be far
from the data points of each category. The points that lie nearest to the
margin of the classifier are the support vectors. The accuracy of the experiment
is evaluated using the WEKA interface. The SVM finds the optimal separating
hyperplane by maximizing the distance between the two decision boundaries.
Mathematically, we maximize the distance between the hyperplane defined by
$w^T x + b = -1$ and the hyperplane defined by $w^T x + b = 1$.
This distance is equal to $\frac{2}{\|w\|}$. This means we want to solve
$\max \frac{2}{\|w\|}$. Equivalently, we want $\min \frac{\|w\|}{2}$. The SVM
should also correctly classify all training points.
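As a small illustration of the margin formula, the following Python sketch computes the distance $2/\|w\|$ between the two hyperplanes and checks the usual hard-margin condition $y_i (w^T x_i + b) \ge 1$ for labels $y_i \in \{-1, +1\}$. The function names and the toy data are assumptions made for illustration only; they are not part of the WEKA experiments reported in this project.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes w^T x + b = -1 and w^T x + b = 1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

def satisfies_margin(w, b, points, labels):
    """True if y_i * (w^T x_i + b) >= 1 holds for every training pair.

    Labels are assumed to be coded as +1 and -1.
    """
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two points per class in the plane.
w, b = [1.0, 0.0], 0.0
points = [[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-3.0, 2.0]]
labels = [1, 1, -1, -1]

print(margin_width(w))                          # 2.0
print(satisfies_margin(w, b, points, labels))   # True
```

A smaller $\|w\|$ gives a wider margin, which is why maximizing $2/\|w\|$ and minimizing $\|w\|/2$ lead to the same hyperplane.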
K-nearest neighbors (KNN):
KNN is a type of supervised learning algorithm used for both regression and
classification. KNN tries to predict the correct class for the test data by
calculating the distance between the test data and all the training points. It
then selects the K points which are closest to the test data. The KNN
algorithm calculates the probability of the test data belonging to each of the
classes of the K training points, and the class with the highest probability is
selected. In the case of regression, the value is the mean of the K selected
training points.
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
The K-NN working can be explained on the basis of the below algorithm:
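The steps described above (compute the Euclidean distance to every training point, keep the K closest points, and choose the most frequent class among them) might be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the choice K = 3 and the toy data are hypothetical, and the experiments in this project were run in WEKA rather than with this code.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (euclidean(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest training points.
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5).
train_points = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, [0.5, 0.5]))  # 0
print(knn_predict(train_points, train_labels, [5.5, 5.5]))  # 1
```

The share of each class among the K neighbours plays the role of the class probability mentioned above; taking the most frequent class is the same as taking the class with the highest such share.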
$$E(P) = -\sum_{j=1}^{n} \frac{|P_j|}{|P|} \log \frac{|P_j|}{|P|}$$
$$E(j \mid P) = \frac{|P_j|}{|P|} \log \frac{|P_j|}{|P|}$$
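To make the entropy formula $E(P)$ concrete, the short Python sketch below evaluates it from the class counts $|P_j|$ of a set $P$. The base of the logarithm is not stated in the formula; base 2 is assumed here, and the function name and example counts are hypothetical.

```python
import math

def entropy(class_counts):
    """E(P) = -sum_j (|P_j| / |P|) * log2(|P_j| / |P|), with log base 2 assumed."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count > 0:  # an empty class contributes nothing to the sum
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy([5, 5]))    # 1.0 for a perfectly balanced two-class set
print(entropy([10, 0]))   # 0.0 for a pure set
print(entropy([77, 23]))  # an unbalanced set lies between these two values
```

A split is more useful to a decision tree when it produces subsets whose entropy is lower than that of the set being split.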
$$B = m * (k - 1) \qquad (1)$$
The probability for class j, with the exception of the last class, is stated in
(2), and the last class probability is given in (3).
$$P_j(X_i) = \frac{e^{X_i B_j}}{1 + \sum_{j=1}^{k-1} e^{X_i B_j}} \qquad (2)$$
$$P'_j(X_i) = \frac{1}{1 + \sum_{j=1}^{k-1} e^{X_i B_j}} \qquad (3)$$
$$ridge * B^2 \qquad (4)$$
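The class probabilities of the logistic model can be illustrated with a short Python sketch. It assumes that $X_i$ is a list of $m$ attribute values and that $B$ is stored as an $m \times (k-1)$ matrix with one column of coefficients per class except the last; each of the first $k-1$ classes gets probability $e^{X_i B_j}$ divided by one plus the sum of these exponentials, and the last class gets one divided by the same denominator. The function name and the numbers are hypothetical, and the ridge term is not included.

```python
import math

def class_probabilities(x, B):
    """Return the k class probabilities for one instance.

    x : list of m attribute values.
    B : m x (k-1) coefficient matrix given as a list of m rows (assumed layout).
    The last returned value is the probability of the last (reference) class.
    """
    m = len(x)
    n_cols = len(B[0])  # k - 1 columns, one per non-reference class
    scores = [sum(x[a] * B[a][j] for a in range(m)) for j in range(n_cols)]
    exp_scores = [math.exp(s) for s in scores]
    denominator = 1.0 + sum(exp_scores)
    probabilities = [e / denominator for e in exp_scores]
    probabilities.append(1.0 / denominator)  # last class
    return probabilities

# Hypothetical two-class example (k = 2, m = 2): the score is 0.5 - 0.5 = 0.
print(class_probabilities([1.0, 2.0], [[0.5], [-0.25]]))  # [0.5, 0.5]
```

By construction the $k$ probabilities always add up to one, so only $k-1$ columns of coefficients need to be estimated.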
Classifier Evaluation Measures
Sensitivity (recall) = TP / (TP + FN) = probability of a positive test given
that the patient has the disease
F-measure: Also known as the F1 score, this is a metric used to evaluate the
performance of a binary classification model. It is defined as the harmonic mean
of precision and recall, and is used to balance the precision and recall of a
model in a single metric.
The F-measure is calculated using the following formula:
F-measure = 2 * (Precision * Recall) / (Precision + Recall)
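As a worked example of this formula, the Python sketch below computes precision, recall and the F-measure from counts of true positives (TP), false positives (FP) and false negatives (FN). The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not results from this project.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)

def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * (p * r) / (p + r)

# Hypothetical counts: 30 true positives, 10 false positives, 30 false negatives.
p = precision(30, 10)   # 0.75
r = recall(30, 30)      # 0.5
print(p, r, round(f_measure(p, r), 3))  # 0.75 0.5 0.6
```

The F-measure (0.6) lies closer to the smaller of the two values than their arithmetic mean (0.625) does, which is the balancing effect of the harmonic mean.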
Confusion Matrix:
The confusion matrix is a matrix used to determine the performance of
classification models for a given set of test data. It can only be determined if
the true values for the test data are known. The matrix itself can be easily
understood, but the related terminology may be confusing. Since it shows the
errors in the model performance in the form of a matrix, it is also known as an
error matrix.
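                      Actual positive        Actual negative
Predicted positive    True Positive (TP)     False Positive (FP)
Predicted negative    False Negative (FN)    True Negative (TN)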
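The four cells of the confusion matrix can be counted directly from the true and predicted labels. The Python sketch below is a minimal illustration, assuming that the positive class is coded as 1 and the negative class as 0 (as in the ASD target of this dataset); the function name and the eight example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels for eight test instances.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)                     # 2 1 1 4
print((tp + tn) / (tp + fp + fn + tn))    # accuracy: 0.75
print(tn / (tn + fp))                     # specificity: 0.8
```

Accuracy uses all four cells, whereas sensitivity uses only TP and FN, and specificity only TN and FP.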
CHAPTER TWO
Dataset
Autistic Spectrum Disorder (ASD) is a neurodevelopmental condition associated
with significant healthcare costs, and early diagnosis can significantly reduce
these. Unfortunately, waiting times for an ASD diagnosis are lengthy and
procedures are not cost effective. The economic impact of autism and the
increase in the number of ASD cases across the world reveal an urgent need for
the development of easily implemented and effective screening methods.
Therefore, a time-efficient and accessible ASD screening method is urgently
needed to help health professionals and to inform individuals whether they
should pursue formal clinical diagnosis. The rapid growth in the number of ASD
cases worldwide necessitates datasets related to behavior traits. However, such
datasets are rare, making it difficult to perform thorough analyses to improve
the efficiency, sensitivity, specificity and predictive accuracy of the ASD
screening process. Presently, very few autism datasets associated with clinical
diagnosis or screening are available, and most of them are genetic in nature.
In this work we use a dataset related to autism screening of adults that contains
20 features to be utilized for further analysis, especially in determining
influential autistic traits and improving the classification of ASD cases. This
dataset records ten behavioral features (AQ-10-Adult (see table below)) plus ten
individual characteristics that have proved to be effective in detecting ASD
cases from controls in behavior science.
This dataset contains factors involved in developing ASD in children. It consists
of the following features: A10 Autism Spectrum Quotient, Social
Responsiveness Scale, Age Years, Qchat_10_Score, Speech Delay/Language
Disorder, Learning disorder, Genetic Disorders, Depression, Global
developmental delay/intellectual disability, Social/Behavioral Issues, Childhood
Autism Rating Scale, Anxiety disorder, Sex, Ethnicity, Jaundice, and Family mem
with ASD, along with various other information. The data consists of different
quantities and factors which are characteristic of ASD in children. It was
collected from Autism Research: University of Arkansas Computer Science
Department. Around 77% of the target is ASD negative (0) and 23% is ASD
positive (1).
Table 2: Q-CHAT-10
Description of Data Set Attributes:
Table 3. Description of dataset for attributes.
12. Social Behavioral Issues: Acting out and showing unwanted behavior towards
others.
13. Childhood Autism Rating Scale: There are many assessment tools available to
help in the diagnosis of autism. One of them is the Childhood Autism Rating
Scale (CARS).
14. Anxiety disorder: A type of mental health condition. A person with an
anxiety disorder may respond to certain things and situations with fear, and
may also experience physical symptoms of anxiety, such as palpitations and
sweating.
15. Family mem with ASD: ASD affects the entire family: parents, siblings, and
in some families, grandparents, aunts, uncles, and cousins.
16. Who completed the test: Family member, Health Care Professional.
17. ASD traits: People with ASD often have problems with social communication
and interaction, and restricted or repetitive behaviors or interests.
Representation graph for the data set
Figure 6: Age Distribution
Figure 9: Sex distribution
Figure 10: Histogram Anxiety disorder
CHAPTER THREE
Table 8: Confusion Matrix of Naive Bayes.
Table 10: Classifier's Performance on The Basis of Classified Instances.
Table 10 shows the correctly classified instances and incorrectly classified
instances for each classification algorithm. We can see that LR has 1953
correctly classified instances, while KNN has 1912 correctly classified
instances. Figure 19 represents the ROC area of all classification algorithms.
Figures 20 and 21 represent the correctly classified instances and incorrectly
classified instances of all classification algorithms.
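The relation between the number of correctly classified instances and accuracy can be checked with a short calculation. The Python sketch below assumes that all 1985 participants of the dataset were evaluated; this total is an assumption taken from the dataset size, and the counts for LR and KNN are the ones quoted above.

```python
def accuracy(correct, total):
    """Accuracy = correctly classified instances / all evaluated instances."""
    return correct / total

# Assumption: every one of the 1985 participants in the dataset was evaluated.
TOTAL_INSTANCES = 1985
correctly_classified = {"LR": 1953, "KNN": 1912}

for name, n_correct in correctly_classified.items():
    n_incorrect = TOTAL_INSTANCES - n_correct
    acc = accuracy(n_correct, TOTAL_INSTANCES)
    print(f"{name}: correct={n_correct}, incorrect={n_incorrect}, accuracy={acc:.4f}")
# LR: correct=1953, incorrect=32, accuracy=0.9839
# KNN: correct=1912, incorrect=73, accuracy=0.9632
```

Under this assumption the LR value corresponds to an accuracy of about 98.3%.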
Representation graph for the results
Figure 14: Precision of all Classification Algorithms.
Figure 15: Recall of all Classification Algorithms.
Figure 16: F-measure of all Classification Algorithms.
Figure 17: Accuracy of all Classification Algorithms.
Figure 19: correctly classified instances Algorithm
CONCLUSION
In this work we aimed to predict autism spectrum disorder (ASD) using machine
learning techniques. We used a dataset related to autism screening of adults that
contains 20 features to be utilized for analysis, especially in determining
influential autistic traits and improving the classification of ASD cases. This
dataset contains 1985 participants and was collected from Autism Research:
University of Arkansas Computer Science Department. Around 77% of the
target is ASD negative (0) and 23% is ASD positive (1).
In this work, we applied five data mining classifiers. Data mining techniques
constitute essential aids in the decision-making processes in many critical areas
such as the medical field, online phishing prevention, text analysis, social
media, and many others. All of these algorithms showed good performance in
serving autism patients, in addition to enhancing the prediction process that
decides whether a person has autism spectrum disorder or not. We compared the
algorithms in terms of seven common statistical measures: Accuracy, Recall,
Precision, F-Measure, ROC, sensitivity and specificity, in order to achieve a
high level of accuracy when applied to such related domains. The accuracy is
calculated from the confusion matrix. We found that the LR algorithm had the
best accuracy in the percentage split evaluation test, with an accuracy of
0.983 (98.3%).
Abstract (Kurdish)