
2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE)

Autistic Spectrum Disorder Screening: Prediction with Machine Learning Models

978-1-7281-4142-8/20/$31.00 ©2020 IEEE. DOI: 10.1109/ic-ETITE47903.2020.186

Astha Baranwal
School of Information and Technology Engineering
Vellore Institute of Technology
Vellore, India
[email protected]

Vanitha M.
School of Information and Technology Engineering
Vellore Institute of Technology
Vellore, India
[email protected]

Abstract—Autistic Spectrum Disorder (ASD) is a developmental disorder that can be observed in all age groups. This paper uses the ASD screening dataset for analysis and prediction of probable cases in adults, children and adolescents. The dataset for each of the age groups is analyzed and inferences are drawn from it. Machine learning algorithms like Artificial Neural Networks (ANN), Random Forest, Logistic Regression, Decision Tree and Support Vector Machines (SVM) are used for prediction and comparison.

Keywords—Autism Spectrum Disorder, Machine Learning, Classification, Medical, Diagnosis

I. INTRODUCTION

Autism Spectrum Disorders bring about certain challenges in social, behavioral, communication and emotional understanding in an individual. People diagnosed with ASD have a range of symptoms. That is why it is termed a ‘spectrum’ disorder. Since ASD is a neurological developmental disorder, there is no specific medical test for it, which makes the diagnosis of ASD an arduous task. Although these disorders can now be perceived in early childhood, there are some cases in which the symptoms are not diagnosed until adolescence or adulthood. ASD currently has no standard treatment. An early diagnosis and a head start in therapies can potentially lead to better results.

This paper focuses on proposing a model which would assist in prediction of ASD in an individual so that diagnosis can be done and further treatments may be followed. The dataset used is the Autistic Spectrum Disorder Screening Data [4]. The datasets provide insights into various factors affecting the prediction of the disorder. Machine learning algorithms like decision tree, random forest, logistic regression, support vector classifier and artificial neural networks are used for finding the optimal model for each dataset. Several performance metrics are used in order to analyze and compare each model from every angle possible.

II. RELATED WORK

Various methods for determining ASD have already been used. From employing image processing techniques to gather abnormalities in brain structure which may indicate ASD, to observing the genetic makeup of individuals, previous research on the topic has paved the way for better diagnosis of ASD. Heinsfeld et al [1] used a large brain imaging dataset to identify ASD patients with deep learning algorithms. They showed the anterior-posterior underconnectivity of autistic brains. Hassan et al [2] utilized the decision tree algorithm for analysis of the National Database for Autism Research (NDAR) dataset. Thabtah et al [3] aimed at extracting the most influential features which contribute to ASD prediction. For this purpose, they used Variable Analysis, which extracted features from the child, adolescent and adult datasets. Choudhery et al [6] utilized a gene expression dataset. K-means clustering was used to cluster genes and then a support vector machine model classified functional connectivity changes associated with ASD. Pagnozzi et al [7] investigated brain changes linked to ASD patients through MRI image data. They identified certain biomarkers by studying brain morphology. Stevens et al [8] aimed at discovering certain phenotypes which heavily impact ASD diagnosis and at examining treatment responses related to them. They found 16 genetic subgroups along with 2 behavioral phenotypes. The ASD prevalence has increased in the past two decades owing to the increased research on autism. Park et al [9] concluded that the amygdala and the nucleus accumbens are two affected components of the brain in individuals diagnosed with ASD. Further research is required for better understanding and treatment of the disorder.

Wang et al [10] used the ASD dataset to achieve 99% sensitivity and specificity. They used only deep learning techniques to build their model by proposing a neural network architecture on the dataset. Islam et al [11] devised a merged model of Random Forest Classification and Regression Trees (CART) and Random Forest Iterative Dichotomiser-3 (ID3). Two datasets for each age group were taken: one was the AQ-10 dataset by Thabtah [4] and the other was a set of real data containing 250 records with both ASD and non-ASD individuals. The CART model gave 97.10% accuracy with the adult AQ-10 dataset and the ID3 model achieved 85.10% accuracy with the adult real dataset. As per Hyde et al [12], supervised learning algorithms like support vector machines, logistic regression, random forest, neural networks, selection operators like lasso regression are some of the few


prevalent techniques. Neuroimaging data has been used to deploy several models as well.

This paper aims to create an optimal model for autism spectrum disorder prediction based on the autism screening datasets for three age groups, viz., child, adolescent and adult, contributed by Fadi Fayez Thabtah [4] through his ASD screening application, ‘asdtests’.

The diagnostic processes for ASD are seldom cheap and often take up a lot of time. Screening by the models created in this paper will be a faster approach, especially when it comes to preliminary screening. Analyzing various algorithms for each dataset allows flexibility in establishing the best model possible for each dataset, thus providing a reliable initial self-screening for potential ASD patients. Since the optimal model for each dataset is concluded based on numerous performance metrics, this paper ensures accurate diagnosis for ASD.

III. METHODOLOGY

A. The Dataset
The datasets used are the Autism Screening Datasets for adult, adolescent and child age groups. There are 20 attributes in each dataset having continuous, categorical and binary values. The dependent attribute is Class_ASD, which determines if an individual has ASD (1) or not (0). The adult dataset has 704 records, the adolescent dataset has 104 records and the child dataset has 292 records. Clearly, the adult dataset is more suited for building machine learning models.

TABLE I. DATASET DESCRIPTION

Serial no. | Attribute | Description | Data type
1-10 | A1_Score to A10_Score | Answer code of the corresponding question | Binary
11 | age | Age of the individual | Integer
12 | gender | Gender of the individual | String (f, m)
13 | ethnicity | Ethnic group the individual belongs to | String
14 | jaundice | If the person had jaundice at birth | String (no, yes)
15 | autism | If any relative of the individual was diagnosed with autism | String (no, yes)
16 | country_of_residence | Native country | String
17 | used_app_before | If the screening test app has been used by the person before | String (no, yes)
18 | score | Score out of 10 based on the screening test answers | Integer
19 | relation | Who is answering the questions of screening test | String
20 | Class_asd | ASD diagnosis of individual by the screening app | String (NO, YES)

B. Data Cleaning
The attributes like gender, jaundice, autism, used_app_before and Class_ASD are converted into binary 0/1 values to ease the implementation of classification algorithms. The missing data in attributes like ethnicity, relation and age are dealt with by removing these records altogether from the datasets.

Fig. 1 Box Plot of Age.

In the adult dataset, an outlier in age is discovered having value 383. This is not a feasible age and must be a typo. So, this is changed to 38.

C. Data Analysis
A score of more than 7 in Q & A for screening test automatically results in a positive ASD classification. There is no significant relationship between a person being born with jaundice and being diagnosed with ASD. There is no significant relationship between a person having a relative diagnosed with ASD and the probability of that person lying on the autism spectrum. These facts hold for all the three datasets.

1) Adult Autism Screening Dataset: The adult autism dataset has records spanning a wide age range, covering the young adult phase to the senile phase for both the genders. It is observed that ASD is more widely distributed in males in terms of age.

Fig. 2 Box plot of age against gender in adult autism dataset.
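The cleaning steps of Section III-B (binary encoding, dropping records with missing ethnicity, relation or age, and correcting the age outlier 383 to 38) can be sketched as follows. This is a minimal illustration in plain Python and not the authors' code: the missing-value markers (`None`, empty string, `?`), the direction of the 0/1 mapping for gender, and the toy records are all assumptions made for the example.

```python
# Assumed encodings; the paper only says these attributes become binary 0/1.
BINARY_MAP = {"no": 0, "yes": 1, "f": 0, "m": 1}
BINARY_COLUMNS = ["gender", "jaundice", "autism", "used_app_before", "Class_ASD"]
REQUIRED_COLUMNS = ["ethnicity", "relation", "age"]
MISSING = {None, "", "?"}  # assumed markers for missing values


def clean_records(records):
    """Return cleaned copies of the records, following Section III-B."""
    cleaned = []
    for row in records:
        # Drop the whole record if ethnicity, relation or age is missing.
        if any(row.get(col) in MISSING for col in REQUIRED_COLUMNS):
            continue
        row = dict(row)  # copy so the input is left untouched
        for col in BINARY_COLUMNS:
            row[col] = BINARY_MAP[str(row[col]).lower()]
        row["age"] = int(row["age"])
        if row["age"] == 383:  # the single infeasible age reported in the paper
            row["age"] = 38
        cleaned.append(row)
    return cleaned


# Toy records (hypothetical values, not taken from the dataset).
raw_records = [
    {"age": "383", "gender": "f", "ethnicity": "GroupA", "jaundice": "no",
     "autism": "no", "used_app_before": "no", "relation": "Self", "Class_ASD": "NO"},
    {"age": "27", "gender": "m", "ethnicity": "?", "jaundice": "no",
     "autism": "no", "used_app_before": "no", "relation": "Self", "Class_ASD": "NO"},
    {"age": "24", "gender": "m", "ethnicity": "GroupB", "jaundice": "yes",
     "autism": "yes", "used_app_before": "no", "relation": "Parent", "Class_ASD": "YES"},
]
cleaned = clean_records(raw_records)
```

In this toy run the second record is dropped because its ethnicity is missing, and the first record's age is corrected from 383 to 38.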


A score of more than 7 in Q & A for screening test automatically results in a positive ASD classification.

Fig. 3 Box plot of score against gender in adult autism dataset.

The White European ethnicity accounts for nearly one-third of the data. The ethnic groups who score more in screening test are White Europeans and Asians. On average, Latino, White European and Black ethnic groups seem to have more ASD-positive records in the dataset.

Fig. 4 Mean number of records diagnosed as ASD for each ethnic group in adult autism dataset.

2) Adolescent Autism Screening Dataset: The individuals ranging from ages 12 to 16 lie in this age group. The number of instances classified to have ASD is twice more.

3) Child Autism Screening Dataset: The children in the dataset are 4 to 11 years of age. The median is 6. The number of instances classified to have ASD is slightly more. A score of more than 7 in Q & A for screening test automatically results in a positive ASD classification. Black ethnic group seems to have higher scores in screening test.

D. Data Analysis
An autism screening score of more than 7 automatically classifies the patient as lying on the autism spectrum. So, the score variable is redundant and can be removed, as the correlation between Class_ASD and score is 0.83, which is a really high value. It can lead to multicollinearity.

Fig. 5 Correlation matrix of the adult autism screening dataset.

The ethnicity attribute has 10 categories. Similarly, the relation attribute has 5 categories. In order to perform classification, these two attributes can be turned into an integer representation of categories, or one-hot encoding can be used. In one-hot encoding, each category of a categorical attribute is converted to a binary value to facilitate machine learning algorithms on the dataset.

E. Feature Importance
A Lasso regression model is used for feature importance determination. After shrinkage, the ‘especially influential’ factors are the ones with the highest coefficients. However, if a coefficient is really high, there is a chance of multicollinearity. The variables with coefficient zero are eliminated. It is clear that for all age groups, the Q and A of the screening test is the major deciding factor.
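The two ideas described in this section, one-hot encoding of a categorical attribute and Lasso shrinkage that drives the coefficients of uninformative variables to exactly zero, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain-Python coordinate descent rather than a library solver, the penalty `alpha = 0.05` is arbitrary, and the eight toy records are constructed so that the class label equals the screening answer while `relation` carries no information.

```python
def one_hot(values):
    """Map a categorical column to one 0/1 column per category."""
    categories = sorted(set(values))
    return {cat: [1.0 if v == cat else 0.0 for v in values] for cat in categories}


def center(column):
    """Subtract the column mean so no intercept term is needed."""
    mean = sum(column) / len(column)
    return [v - mean for v in column]


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything inside the band becomes exactly 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, target, alpha, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 on centred data."""
    n = len(target)
    cols = [center(c) for c in columns]
    residual = center(target)  # equals y - Xw while all weights are zero
    weights = [0.0] * len(cols)
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue  # a constant column carries no information
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(v * (r + weights[j] * v) for v, r in zip(col, residual)) / n
            new_weight = soft_threshold(rho, alpha) / scale
            delta = new_weight - weights[j]
            residual = [r - delta * v for r, v in zip(residual, col)]
            weights[j] = new_weight
    return weights


# Toy data (hypothetical): the label follows the screening answer exactly.
a1_score = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
relation = ["Self", "Self", "Parent", "Parent", "Self", "Self", "Parent", "Parent"]
class_asd = list(a1_score)

encoded = one_hot(relation)
names = ["A1_Score"] + ["relation_" + cat for cat in encoded]
columns = [a1_score] + list(encoded.values())

weights = lasso_coordinate_descent(columns, class_asd, alpha=0.05)
selected = [name for name, w in zip(names, weights) if w != 0.0]
```

Here only `A1_Score` keeps a nonzero (shrunken) coefficient, while both one-hot `relation` columns are eliminated, which mirrors the selection behaviour described above on a much smaller scale.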


Fig. 6 Lasso model on adult autism screening dataset.

As per the model on the adult ASD screening dataset, the answer A9, corresponding to the question, ‘I find it easy to work out what someone is thinking or feeling just by looking at their face’, is the most important attribute in ASD diagnosis. The Lasso model picks up 18 variables and eliminates 11 variables.

The Lasso model picks up 14 variables while eliminating the other 16 variables in the adolescent dataset. The question corresponding to the answer A5, ‘S/he frequently finds that s/he doesn’t know how to keep a conversation going’, is the most important feature.

For the child dataset, the response A4, ‘S/he finds it easy to go back and forth between different activities’, is of utmost importance.

IV. PERFORMANCE METRICS

A. True Positive (TP)
The number of records that were actually positive and were classified positive.

B. False Negative (FN)
The number of records that were actually positive but were classified negative.

C. True Negative (TN)
The number of records that were actually negative and were classified negative.

D. False Positive (FP)
The number of records that were actually negative but were classified positive.

E. F1 Score
F1 Score is a better measure than accuracy because of the uneven distribution of classes among the instances. The records with Class_ASD as 0 cover more than half of the instances.

Fig. 7 Class distribution in adult autism screening dataset.

F. Accuracy Score
Accuracy score is the fraction of predictions that the classification model predicted correctly.

$$\text{Accuracy Score} = \frac{TP + TN}{TP + FN + TN + FP}$$

G. ROC (Receiver Operating Characteristic) curve and AUC (Area Under the Curve)
ROC is a probability curve plotted with the True Positive Rate (y-axis) against the False Positive Rate (x-axis). AUC is the measure of how well the model is capable of distinguishing between the classes. The higher the AUC value, the better the model is at correctly predicting whether the individual has ASD. An AUC of 0.5 is the worst case, because the model then cannot distinguish between the positive and negative classes at all. At an AUC of zero, the model is inverting the classes, ranking every negative record above every positive one.

H. Sensitivity/ Recall
Sensitivity is the fraction of actually positive individuals who are correctly predicted as the positive class. High sensitivity means that a greater number of individuals with ASD have been correctly predicted to have ASD.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

I. Specificity
Specificity is the fraction of actually negative individuals who are correctly predicted as the negative class. High specificity means that a greater number of individuals without ASD have been correctly predicted not to have ASD.

$$\text{Specificity} = \frac{TN}{TN + FP}$$
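The definitions in this section reduce to a few lines of arithmetic on the four confusion counts. The sketch below is illustrative only. The F1 formula used, 2TP / (2TP + FP + FN), is the standard definition and is not written out in the paper, and the example counts (TP = 41, FN = 13, TN = 120, FP = 9) are not reported by the authors: they are one set of integers that happens to reproduce the Decision Tree row of Table II to four decimals.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, TN, FP for binary labels where 1 means ASD."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def screening_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and F1 from the confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),  # standard F1; assumed, not stated in the paper
    }


# Hypothetical counts, chosen only because they match a reported table row.
metrics = screening_metrics(tp=41, fn=13, tn=120, fp=9)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

With these counts the accuracy (0.8798) looks far better than the F1 score (0.7885), because the majority negative class dominates the accuracy. This is the effect Section IV-E points to when it prefers F1 for an unevenly distributed class.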


V. RESULTS AND EVALUATION

A. Model Building on Adult autism screening dataset
Five classification models are used: Decision Tree, Random Forest, Logistic Regression, Support Vector Classifier and Artificial Neural Network (ANN). The decision tree model is an overfitted model, with a train dataset accuracy of 1 and a test dataset accuracy of 0.8798. It is also the least optimal model as per the ROC curve. Its AUC is the lowest of all the models, indicating that the separability between the two classes is poor. Its F1 score is also the lowest, at 78.85%.

Fig. 8 ROC graph for Decision Tree (left) and ANN (right) for adult autism screening dataset.

Random Forest, Logistic Regression and Artificial Neural Network give a near optimal ROC curve with a high AUC. The F1 Score is the highest for the ANN model at 98.15%.

TABLE II. COMPARISON ACROSS TABLE FOR ADULT AUTISM DATASET

Algorithm | F1 Score | Accuracy | AUC | Sensitivity | Specificity
Decision Tree | 0.7885 | 0.8798 | 0.8447 | 0.7593 | 0.9302
Random Forest | 0.8544 | 0.9180 | 0.9812 | 0.8148 | 0.9690
Random Forest (hyperparameter) | 0.9524 | 0.9727 | 0.9977 | 0.9259 | 0.9922
Logistic Regression | 0.9725 | 0.9836 | 0.9959 | 0.9814 | 0.9845
Support Vector | 0.8627 | 0.9235 | 0.9886 | 0.8148 | 0.9690
Artificial Neural Network | 0.9815 | 0.9891 | 0.9887 | 0.9815 | 0.9922

Fig. 9 Comparison graph for adult autism screening dataset between all the algorithms in terms of F1 Score, Accuracy, AUC, Sensitivity and Specificity.

ANN, Logistic Regression and Random Forest (hyperparameter tuned) give exceptional F1 scores, while Decision Tree gives the worst. All in all, ANN performs the best and is the optimal model.

B. Model Building on Adolescent autism screening dataset
Since this is a small dataset, building machine learning models is not advisable. Overfitting is observed in the decision tree and the random forest models. The logistic regression model performs the best, with the best ROC curve and the highest AUC.


Fig. 10 ROC graph for Decision Tree (left) and Logistic Regression (right) for adolescent autism screening dataset.

TABLE III. COMPARISON ACROSS TABLE FOR ADOLESCENT AUTISM DATASET

Algorithm | F1 Score | Accuracy | AUC | Sensitivity | Specificity
Decision Tree | 0.8571 | 0.8000 | 0.7465 | 0.9375 | 0.4444
Random Forest | 0.9000 | 0.8800 | 0.9444 | 0.8750 | 0.8889
Logistic Regression | 0.9412 | 0.9200 | 0.9931 | 1.0000 | 0.7778
Support Vector | 0.8649 | 0.8000 | 0.9861 | 1.0000 | 0.4444
Artificial Neural Network | 0.8125 | 0.7600 | 0.9861 | 0.9815 | 0.8125

Fig. 11 Comparison graph for adolescent autism screening dataset between all the algorithms in terms of F1 Score, Accuracy, AUC, Sensitivity and Specificity.

Random forest model gives high specificity but fails to achieve the optimal results in all other performance measures. Decision tree and random forest models do not perform well. They are overfitted models too.

Logistic Regression is the best model when all the metrics are taken into account.

C. Model Building on Child autism screening dataset
Both the decision tree and the random forest models are overfitted and are hence poor models. The random forest model after hyperparameter tuning is a fair model. SVM is the most optimal model, followed by ANN.

Decision tree seems to give the worst ROC curve and AUC is the lowest for it. Support vector gives one of the best ROC curves and AUC values.

Fig. 12 ROC graph for Decision Tree (left) and Support Vector (right) for child autism screening dataset.

TABLE IV. COMPARISON ACROSS TABLE FOR CHILD AUTISM DATASET

Algorithm | F1 Score | Accuracy | AUC | Sensitivity | Specificity
Decision Tree | 0.8136 | 0.8226 | 0.8225 | 0.7742 | 0.8710
Random Forest | 0.8197 | 0.8226 | 0.9433 | 0.8065 | 0.8387
Random Forest (hyperparameter) | 0.9062 | 0.9032 | 0.9977 | 0.9355 | 0.8701
Logistic Regression | 0.9062 | 0.9032 | 0.9865 | 0.9354 | 0.8710
Support Vector | 0.9688 | 0.9677 | 0.9896 | 1.0000 | 0.9355
Artificial Neural Network | 0.9508 | 0.9516 | 0.9896 | 0.9677 | 0.9355
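The AUC values compared in Tables II to IV can be read as the probability that a randomly chosen ASD-positive record receives a higher model score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise definition. It is a plain-Python illustration rather than the authors' evaluation code, and the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative record")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]                              # toy ground truth
auc_mixed = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])   # one pair ranked wrongly
auc_perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])  # every positive above every negative
auc_constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5]) # scores carry no information
auc_inverted = roc_auc(labels, [0.9, 0.8, 0.2, 0.1]) # classes swapped
```

The four toy cases give 0.75, 1.0, 0.5 and 0.0 respectively, matching the reading of AUC in Section IV-G: 0.5 means no discrimination and 0 means the classes are inverted. The pairwise loop is quadratic in the number of records, which is acceptable for datasets of the size used here.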


Fig. 13 Comparison graph for child autism screening dataset between all the algorithms in terms of F1 Score, Accuracy, AUC, Sensitivity and Specificity.

Decision tree and random forest models perform poorly when compared to other models in terms of all the performance metrics. Support Vector Classifier performs the best, giving the best F1 score, accuracy, sensitivity and specificity rates.

VI. DISCUSSIONS

Image processing on MRIs, evaluation of gene expression, the risk factors involving ASD and various biomarkers which may relate to the disorder have been studied previously. The ABIDE (Autism Brain Imaging Data Exchange) dataset used by Heinsfeld et al [1] aims to identify the areas of the brain which may define if an individual has ASD. Hassan et al [2] used a decision tree to identify the genetic and environmental risk factors that may contribute to ASD. Similarly, Choudhery et al [6] used gene expression data for ASD patients to observe gene expression changes which may predict changes in brain regions. Several papers used the AQ-10 ASD screening dataset [4] for building models for optimized predictions [3][5][10][11].

This paper focuses on the data collected from a self-screening application for ASD called “asdtests”, created by Thabtah et al [4]. The test can easily be taken by individuals if there is any probability of them being diagnosed with ASD. All other datasets deal with heterogeneous data from multiple resources, which makes diagnosis of ASD a time-consuming and costly affair. In this paper, machine learning models are proposed and compared to attain the most optimal model for each dataset in terms of F1 score, accuracy, sensitivity, specificity, ROC and AUC. The datasets, however, especially the child and adolescent datasets, are really small in size and are not suitable for building machine learning models. The adult autism screening dataset itself is biased towards the class 0 for Class_ASD. More data is required for reliable ASD prediction.

VII. CONCLUSION

An individual with Autism Spectrum Disorder needs early treatment and a progressive learning curve. The sooner ASD is diagnosed, the better the results are in the long term. Oftentimes ASD does not even get diagnosed until adulthood. This paper, based on the three autism screening datasets contributed by Thabtah [4], adopted various machine learning algorithms to find the optimal models for each of the datasets. Data analysis is done to figure out the relation between the attributes.

Decision Tree results in an overfitted model in all the datasets. ANN performs the best on the adult autism dataset. Logistic Regression gives the optimal result for the adolescent autism dataset. Support Vector is the best for the child autism screening dataset. This paper proposed a time-conserving method for self-screening of potential ASD individuals.

However, the data we have so far is insufficient. The datasets are too small to derive a suitable model for prediction. Certain conclusions can still be derived through data analysis of the datasets. A proper diagnosis method for the disorder is crucial.

ACKNOWLEDGMENT

We would like to express our gratitude to the School of Information and Technology Engineering, VIT, Vellore for providing us this opportunity to contribute our part.

REFERENCES

[1] Anibal Sólon Heinsfeld, Alexandre Rosa Franco, R. Cameron Craddock, Augusto Buchweitz and Felipe Meneguzzi, “Identification of autism spectrum disorder using deep learning and the ABIDE dataset,” NeuroImage: Clinical, vol. 17, pp. 16-23, 2018.
[2] Mariam M. Hassan and Hoda M. O. Mokhtar, “Investigating autism etiology and heterogeneity by decision tree algorithm,” Informatics in Medicine Unlocked, vol. 16, 100215, 2019.
[3] Fadi Thabtah, Firuz Kamalov and Khairan Rajab, “A new computational intelligence approach to detect autistic features for autism screening,” International Journal of Medical Informatics, vol. 117, pp. 112-124, Sep. 2018.
[4] Fadi Thabtah, “Autistic Spectrum Disorder Screening Datasets,” UCI machine learning repository, 2017. [Online]. Available: https://archive.ics.uci.edu/ml
[5] Fadi Thabtah, “Autism Spectrum Disorder Screening: Machine Learning Adaptation and DSM-5 Fulfillment,” Proceedings of the 1st International Conference on Medical and Health, pp. 1-6, May 2017.
[6] Sanjeevani Choudhery, Chuan Huang and Daifeng Wang, “T253. A Machine Learning Approach to Predict the Changes of Brain Functional Connectivity in Autism Spectrum Disorder From the Gene Expression Data,” Biological Psychiatry, vol. 83, issue 9, supplement, pp. S227-S228, 1 May 2018.
[7] Alex M. Pagnozzi, Eugenia Conti, Sara Calderoni, Jurgen Fripp and Stephen E. Rose, “A systematic review of structural MRI biomarkers in autism spectrum disorder: A machine learning perspective,” International Journal of Developmental Neuroscience, vol. 71, pp. 68-82, Dec. 2018.
[8] Elizabeth Stevens, Dennis R. Dixon, Marlena N. Novack, Doreen Granpeesheh, Tristram Smith and Erik Linstead, “Identification and analysis of behavioral phenotypes in autism spectrum disorder via unsupervised machine learning,” International Journal of Medical Informatics, vol. 129, pp. 29-36, Sep. 2019.
[9] Hye Ran Park, Jae Meen Lee, Hyo Eun Moon, Dong Soo Lee, Bung-Nyun Kim, Jinhyun Kim, Dong Gyu Kim and Sun Ha Paek, “A Short Review on the Current Understanding of Autism Spectrum Disorders,” Experimental Neurobiology, vol. 25, pp. 1-13, Feb. 2016.
[10] Haishuai Wang, Li Li, Lianhua Chi and Ziping Zhao, “Autism Screening Using Deep Embedding Representation,” International Conference on Computational Science, Lecture Notes in Computer Science, vol. 11537, pp. 160-173, Jun. 2019.
[11] Muhammad Nazrul Islam, Kazi Shahrukh Omar, Prodipta Mondal and Nabila Shahnaz Khan, “A Machine Learning Approach to Predict Autism Spectrum Disorder,” International Conference on Electrical, Computer and Communication Engineering, Feb. 2019.
[12] Kayleigh K. Hyde, Marlena N. Novack, Nicholas LaHaye, Chelsea Parlett-Pelleriti, Raymond Anden, Dennis R. Dixon and Erik Linstead, “Applications of Supervised Machine Learning in Autism Spectrum Disorder Research: a Review,” Review Journal of Autism and Developmental Disorders, vol. 6, pp. 128–146, Feb. 2019.
