Autistic_Spectrum_Disorder_Screening_Prediction_with_Machine_Learning_Models
Abstract—Autistic Spectrum Disorder (ASD) is a developmental disorder that can be observed in all age groups. This paper uses ASD screening datasets for the analysis and prediction of probable cases in adults, children and adolescents. The dataset for each age group is analyzed and inferences are drawn from it. Machine learning algorithms such as Artificial Neural Networks (ANN), Random Forest, Logistic Regression, Decision Tree and Support Vector Machines (SVM) are used for prediction and comparison.

Keywords—Autism Spectrum Disorder, Machine Learning, Classification, Medical, Diagnosis

I. INTRODUCTION

Autism Spectrum Disorders bring about certain challenges in social, behavioral, communication and emotional understanding in an individual. People diagnosed with ASD have a range of symptoms, which is why it is termed a ‘spectrum’ disorder. Since ASD is a neurological developmental disorder, there is no specific medical test for it, which makes the diagnosis of ASD an arduous task. Although these disorders can now be perceived in early childhood, there are some cases in which the symptoms are not diagnosed until adolescence or adulthood. ASD currently has no standard treatment. An early diagnosis and a head start in therapies can potentially lead to better results.

This paper focuses on proposing a model which would assist in the prediction of ASD in an individual so that a diagnosis can be made and further treatment may follow. The dataset used is the Autistic Spectrum Disorder Screening Data [4]. The datasets provide insights into various factors affecting the prediction of the disorder. Machine learning algorithms such as decision tree, random forest, logistic regression, support vector classifier and artificial neural networks are used to find the optimal model for each dataset. Several performance metrics are used to analyze and compare the models.

II. RELATED WORK

Various methods for determining ASD have already been used. From employing image processing techniques to detect abnormalities in brain structure which may indicate ASD, to observing the genetic makeup of individuals, previous research on the topic has paved the way for better diagnosis of ASD. Heinsfeld et al. [1] used a large brain imaging dataset to identify ASD patients with deep learning algorithms. They showed anterior-posterior underconnectivity in autistic brains. Hassan et al. [2] utilized the decision tree algorithm for analysis of the National Database for Autism Research (NDAR) dataset. Thabtah et al. [3] aimed at extracting the most influential features which contribute to ASD prediction. For this purpose, they used Variable Analysis, which extracted features from the child, adolescent and adult datasets. Choudhery et al. [6] utilized a gene expression dataset. K-means clustering was used to cluster genes and then a support vector machine model classified functional connectivity changes associated with ASD. Pagnozzi et al. [7] investigated brain changes linked to ASD patients through MRI image data. They identified certain biomarkers by studying brain morphology. Stevens et al. [8] aimed at discovering certain phenotypes which heavily impact ASD diagnosis and at examining treatment responses related to them. They found 16 genetic subgroups along with 2 behavioral phenotypes. ASD prevalence has increased in the past two decades owing to the increased research on autism. Park et al. [9] concluded that the amygdala and the nucleus accumbens are two affected components of the brain in individuals diagnosed with ASD. Further research is required to better understand treatment of the disorder.

Wang et al. [10] used the ASD dataset to reach 99% sensitivity and specificity. They used only deep learning techniques to build their model by proposing a neural network architecture on the dataset. Islam et al. [11] devised merged models combining Random Forest with Classification and Regression Trees (CART) and Random Forest with Iterative Dichotomiser-3 (ID3). Two datasets were taken for each age group: one was the AQ-10 dataset by Thabtah [4] and the other was a set of real data containing 250 records with both ASD and non-ASD individuals. The CART model gave 97.10% accuracy with the adult AQ-10 dataset and the ID3 model achieved 85.10% accuracy with the adult real dataset. As per Hyde et al. [12], supervised learning algorithms like support vector machines, logistic regression, random forest, neural networks and selection operators like lasso regression are some of the few
Authorized licensed use limited to: Zhejiang University. Downloaded on November 05,2024 at 17:13:38 UTC from IEEE Xplore. Restrictions apply.
978-1-7281-4142-8/$31.00 ©2020 IEEE
2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE)
prevalent techniques. Neuroimaging data has also been used to deploy several models.

This paper aims to create an optimal model for autism spectrum disorder prediction based on the autism screening datasets for three age groups, viz., child, adolescent and adult, contributed by Fadi Fayez Thabtah [4] through his ASD screening application, ‘asdtests’.

B. Data Cleaning
The attributes gender, jaundice, autism, used_app_before and Class_ASD are converted into binary 0/1 values to ease the implementation of classification algorithms. Missing data in attributes like ethnicity, relation and age is dealt with by removing the affected records altogether from the datasets.
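The two cleaning steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the paper does not give the raw encodings, so the `yes`/`no` and `m`/`f` values, the `?` marker for missing data, and the helper name `clean_records` are all hypothetical.

```python
# Assumed raw encodings; the paper does not list them.
BINARY_MAP = {"yes": 1, "no": 0, "m": 1, "f": 0}
BINARY_COLUMNS = ["gender", "jaundice", "autism", "used_app_before", "Class_ASD"]
REQUIRED_COLUMNS = ["ethnicity", "relation", "age"]
MISSING_MARKERS = {"", "?", None}  # assumed markers for missing values


def clean_records(records):
    """Drop records with missing values, then encode binary attributes as 0/1."""
    cleaned = []
    for record in records:
        # Remove the whole record when ethnicity, relation or age is missing.
        if any(record.get(col) in MISSING_MARKERS for col in REQUIRED_COLUMNS):
            continue
        row = dict(record)
        for col in BINARY_COLUMNS:
            row[col] = BINARY_MAP[str(row[col]).strip().lower()]
        cleaned.append(row)
    return cleaned


# Two made-up records; the second one has missing ethnicity and relation.
sample = [
    {"gender": "f", "jaundice": "no", "autism": "yes", "used_app_before": "no",
     "Class_ASD": "YES", "ethnicity": "Asian", "relation": "Self", "age": "26"},
    {"gender": "m", "jaundice": "yes", "autism": "no", "used_app_before": "no",
     "Class_ASD": "NO", "ethnicity": "?", "relation": "?", "age": "31"},
]
cleaned_sample = clean_records(sample)
```

Dropping incomplete records keeps the sketch faithful to the text, at the cost of shrinking datasets that the paper already describes as small.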
Fig. 2 Box plot of age against gender in adult autism dataset.
A score of more than 7 in the Q & A of the screening test automatically results in a positive ASD classification.

Fig. 4 Mean number of records diagnosed as ASD for each ethnic group in adult autism dataset.

2) Adolescent Autism Screening Dataset: Individuals aged 12 to 16 lie in this age group. The number of instances classified as having ASD is about twice the number classified as not having ASD.

3) Child Autism Screening Dataset: The children in the dataset are 4 to 11 years of age. The median is 6. The number of instances classified as having ASD is slightly higher than the number classified as not having ASD. A score of more than 7 in the Q & A of the screening test automatically results in a positive ASD classification. The Black ethnic group seems to have higher scores in the screening test.

D. Data Analysis
An autism screening score of more than 7 automatically classifies the patient as lying on the autism spectrum. The score variable is therefore redundant and can be removed: the correlation between Class_ASD and score is 0.83, which is a very high value and can lead to multicollinearity.

E. Feature Importance
A Lasso regression model is used to determine feature importance. After shrinkage, the ‘especially influential’ factors are the ones with the highest coefficients. However, if a coefficient is very high, there is a chance of multicollinearity. The variables with coefficient zero are eliminated. It is clear that for all age groups, the Q & A of the screening test is the major deciding factor.
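The Lasso-based selection described under Feature Importance can be illustrated with a small coordinate-descent sketch. This is not the authors' implementation: the paper does not name a library or a penalty value, so the function names, the penalty of 0.05 and the toy data (one informative column labelled `A9` and one column labelled `unrelated`) are assumptions chosen only to show how shrinkage drives uninformative coefficients to exactly zero.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty; return 0.0 inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * sum((y - b - X.w)^2) + penalty * sum(|w|)
    by cyclic coordinate descent. Returns (intercept, weights)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        # Unpenalised intercept: mean of y minus the current linear part.
        b = sum(y[i] - sum(w[k] * X[i][k] for k in range(p)) for i in range(n)) / n
        for j in range(p):
            rho = 0.0
            norm = 0.0
            for i in range(n):
                # Prediction that ignores feature j.
                partial = b + sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (norm / n) if norm > 0 else 0.0
    return b, w


# Toy data (hypothetical): the label equals the first column,
# the second column carries no information about it.
names = ["A9", "unrelated"]
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

intercept, weights = lasso_fit(X, y, penalty=0.05)
selected = [name for name, weight in zip(names, weights) if weight != 0.0]
```

Variables whose weight ends at zero are the ones the model eliminates; ranking the remaining variables by absolute weight gives the importance ordering the section refers to.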
Fig. 6 Lasso model on adult autism screening dataset.

As per the model on the adult ASD screening dataset, the answer A9, corresponding to the question ‘I find it easy to work out what someone is thinking or feeling just by looking at their face’, is the most important attribute in ASD diagnosis. The Lasso model picks up 18 variables and eliminates 11 variables.

In the adolescent dataset, the Lasso model picks up 14 variables while eliminating the other 16 variables. The answer A5, corresponding to the question ‘S/he frequently finds that s/he doesn’t know how to keep a conversation going’, is the most important feature.

For the child dataset, the response A4, ‘S/he finds it easy to go back and forth between different activities’, is of utmost importance.

IV. PERFORMANCE METRICS
A. True Positive (TP)
The number of records that were actually positive and were classified positive.
B. False Negative (FN)
The number of records that were actually positive but were classified negative.
C. True Negative (TN)
The number of records that were actually negative and were classified negative.
D. False Positive (FP)
The number of records that were actually negative but were classified positive.
E. F1 Score
F1 Score is a better measure than accuracy because of the uneven distribution of classes among the instances. The records with Class_ASD as 0 cover more than half of the instances.
F. Accuracy Score
Accuracy score is the fraction of predictions that the classification model got right.
$$\text{Accuracy Score} = \frac{TP + TN}{TP + FN + TN + FP}$$
G. ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve)
ROC is a probability curve plotted with the True Positive Rate (y-axis) against the False Positive Rate (x-axis). AUC measures how well the model is capable of distinguishing between the classes. The higher the AUC value, the better the model is at correctly predicting whether the individual has ASD. An AUC of 0.5 is the worst case, because the model then cannot distinguish between the positive and negative classes. At an AUC of zero, the model is simply inverting the classes.
H. Sensitivity/Recall
Sensitivity is the fraction of actually positive individuals who are correctly predicted as the positive class. High sensitivity means that a greater number of individuals have been correctly predicted to have ASD.
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
I. Specificity
Specificity is the fraction of actually negative individuals who are correctly predicted as the negative class. High specificity means that a greater number of individuals have been correctly predicted not to have ASD.
$$\text{Specificity} = \frac{TN}{TN + FP}$$
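The metrics defined in this section can be computed directly from the four confusion counts. The sketch below is a plain-Python illustration, not the code used in the paper. The F1 formula 2TP / (2TP + FP + FN) is the standard definition and is not stated in the text, AUC is computed through its rank interpretation (the probability that a randomly chosen positive record is scored above a randomly chosen negative one), and the labels and scores are made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, TN, FP) for binary labels where 1 is the positive class."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def f1_score(tp, fn, fp):
    # Standard definition: harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels, hard predictions and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.35, 0.7]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
```

Negating the scores turns an AUC of a into 1 - a, which matches the remark that a model with an AUC near zero is inverting the classes.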
Fig. 9 Comparison graph for adult autism screening dataset between all
the algorithms in terms of F1 Score, Accuracy, AUC, Sensitivity and
Specificity.
Fig. 8 ROC graph for Decision Tree (left) and ANN (right) for adult
autism screening dataset.
Fig. 10 ROC graph for Decision Tree (left) and Logistic Regression
(right) for adolescent autism screening dataset.
Fig. 12 ROC graph for Decision Tree (left) and Support Vector (right)
for child autism screening dataset.
Fig. 13 Comparison graph for child autism screening dataset between all the algorithms in terms of F1 Score, Accuracy, AUC, Sensitivity and Specificity.

Decision tree and random forest models perform poorly when compared to the other models in terms of all the performance metrics. Support Vector Classifier performs the best, giving the best F1 score, accuracy, sensitivity and specificity rates.

VI. DISCUSSIONS
Image processing on MRIs, evaluation of gene expression, risk factors for ASD and various biomarkers which may relate to the disorder have been studied previously. Heinsfeld et al. [1] used the ABIDE (Autism Brain Imaging Data Exchange) dataset to identify the areas of the brain which may indicate whether an individual has ASD. Hassan et al. [2] used a decision tree to identify the genetic and environmental risk factors that may contribute to ASD. Similarly, Choudhery et al. [6] used gene expression data from ASD patients to observe gene expression changes which may predict changes in brain regions. Several papers used the AQ-10 ASD screening dataset [4] to build models for optimized predictions [3][5][10][11].

This paper focuses on the data collected from a self-screening application for ASD called ‘asdtests’, created by Thabtah [4]. The test can easily be taken by individuals who suspect that they may have ASD. All other datasets deal with heterogeneous data from multiple sources, which makes the diagnosis of ASD a time-consuming and costly affair. In this paper, machine learning models are proposed and compared to find the best model for each dataset in terms of F1 score, accuracy, sensitivity, specificity, ROC and AUC. The datasets, however, especially the child and adolescent datasets, are very small and are not well suited for building machine learning models. The adult autism screening dataset itself is biased towards class 0 for Class_ASD. More data is required for reliable ASD prediction.

VII. CONCLUSION
An individual with Autism Spectrum Disorder needs early treatment and a progressive learning curve. The sooner ASD is diagnosed, the better the long-term results. Often ASD does not get diagnosed until adulthood. This paper, based on the three autism screening datasets contributed by Thabtah [4], adopted various machine learning algorithms to find the optimal model for each of the datasets. Data analysis is done to figure out the relation between the attributes. Decision Tree results in an overfitted model on all the datasets. ANN performs the best on the adult autism dataset. Logistic Regression gives the optimal result for the adolescent autism dataset. Support Vector is the best for the child autism screening dataset. This paper proposed a time-saving method for self-screening of potential ASD individuals.

However, the data available so far is insufficient. The datasets are too small to derive a suitable model for prediction. Certain conclusions can still be derived through data analysis of the datasets. A proper diagnosis method for the disorder is crucial.

ACKNOWLEDGMENT
We would like to express our gratitude to the School of Information and Technology Engineering, VIT, Vellore for providing us this opportunity to contribute our part.

REFERENCES
[1] Anibal Sólon Heinsfeld, Alexandre Rosa Franco, R. Cameron Craddock, Augusto Buchweitz and Felipe Meneguzzi, “Identification of autism spectrum disorder using deep learning and the ABIDE dataset,” NeuroImage: Clinical, vol. 17, pp. 16-23, 2018.
[2] Mariam M. Hassan and Hoda M. O. Mokhtar, “Investigating autism etiology and heterogeneity by decision tree algorithm,” Informatics in Medicine Unlocked, vol. 16, 100215, 2019.
[3] Fadi Thabtah, Firuz Kamalov and Khairan Rajab, “A new computational intelligence approach to detect autistic features for autism screening,” International Journal of Medical Informatics, vol. 117, pp. 112-124, Sep. 2018.
[4] Fadi Thabtah, “Autistic Spectrum Disorder Screening Datasets,” UCI Machine Learning Repository, 2017. [Online]. Available: https://archive.ics.uci.edu/ml
[5] Fadi Thabtah, “Autism Spectrum Disorder Screening: Machine Learning Adaptation and DSM-5 Fulfillment,” Proceedings of the 1st International Conference on Medical and Health, pp. 1-6, May 2017.
[6] Sanjeevani Choudhery, Chuan Huang and Daifeng Wang, “T253. A Machine Learning Approach to Predict the Changes of Brain Functional Connectivity in Autism Spectrum Disorder From the Gene Expression Data,” Biological Psychiatry, vol. 83, issue 9, supplement, pp. S227-S228, 1 May 2018.
[7] Alex M. Pagnozzi, Eugenia Conti, Sara Calderoni, Jurgen Fripp and Stephen E. Rose, “A systematic review of structural MRI biomarkers in autism spectrum disorder: A machine learning perspective,” International Journal of Developmental Neuroscience, vol. 71, pp. 68-82, Dec. 2018.
[8] Elizabeth Stevens, Dennis R. Dixon, Marlena N. Novack, Doreen Granpeesheh, Tristram Smith and Erik Linstead, “Identification and analysis of behavioral phenotypes in autism spectrum disorder via unsupervised machine learning,” International Journal of Medical Informatics, vol. 129, pp. 29-36, Sep. 2019.
[9] Hye Ran Park, Jae Meen Lee, Hyo Eun Moon, Dong Soo Lee, Bung-Nyun Kim, Jinhyun Kim, Dong Gyu Kim and Sun Ha Paek, “A Short Review on the Current Understanding of Autism Spectrum Disorders,” Experimental Neurobiology, vol. 25, pp. 1-13, Feb. 2016.
[10] Haishuai Wang, Li Li, Lianhua Chi and Ziping Zhao, “Autism Screening Using Deep Embedding Representation,” International Conference on Computational Science, Lecture Notes in Computer Science, vol. 11537, pp. 160-173, Jun. 2019.
[11] Muhammad Nazrul Islam, Kazi Shahrukh Omar, Prodipta Mondal and Nabila Shahnaz Khan, “A Machine Learning Approach to Predict Autism Spectrum Disorder,” International Conference on Electrical, Computer and Communication Engineering, Feb. 2019.
[12] Kayleigh K. Hyde, Marlena N. Novack, Nicholas LaHaye, Chelsea Parlett-Pelleriti, Raymond Anden, Dennis R. Dixon and Erik Linstead, “Applications of Supervised Machine Learning in Autism Spectrum Disorder Research: a Review,” Review Journal of Autism and Developmental Disorders, vol. 6, pp. 128-146, Feb. 2019.