Final Report
Submitted by
SARAN M (95192102083)
VENKATESHWARAN M (95192102111)
VIGNESH J (95192102112)
ACKNOWLEDGEMENT
I take this opportunity to place on record my sincere thanks to all who enlightened
my path towards the successful completion of this project. At the very outset, I thank the
Almighty for the abundant blessings showered on me.
It is my greatest pleasure to convey my thanks to Thiru. R. Solaisamy,
Correspondent & Managing Trustee, P.S.R Engineering College, for having
provided me with all the required facilities and suitable infrastructure to complete my
project without hindrance.
ABSTRACT
TABLE OF CONTENTS
1. INTRODUCTION 1
2. LITERATURE REVIEW 2
4.1 DECISION TREES 7
4.6 LOGISTIC REGRESSION 10
5. SOFTWARE DESIGN 13
5.1 AUTISM.PY 13
5.2 VISUAL.PY 18
6. RESULT ANALYSIS 23
FIGURE 5.1 23
FIGURE 5.2 24
FIGURE 5.4 25
7. CONCLUSION 27
8. FUTURE ENHANCEMENT 28
9. REFERENCES 29
CHAPTER 1
INTRODUCTION
Autism spectrum disorder (ASD) is a multifaceted neurodevelopmental condition
characterized by a wide range of challenges, including difficulties in social
interaction, communication deficits, repetitive behaviors, and sensory
sensitivities. It affects individuals differently, leading to a spectrum of symptoms
and varying levels of impairment. Despite its prevalence and impact on
individuals and families, diagnosing ASD can be complex and time-consuming,
often relying on clinical observations, standardized assessments, and input from
multiple healthcare professionals.
Machine learning (ML) offers a promising avenue for enhancing the detection
and diagnosis of ASD. By leveraging advanced algorithms and computational
techniques, ML systems can analyze diverse datasets encompassing behavioral
observations, genetic markers, neuroimaging data, and clinical records. These
datasets provide rich insights into the complex interplay of genetic,
environmental, and neurological factors underlying ASD. ML models can learn
patterns, correlations, and predictive features from these datasets, enabling the
development of automated tools for early detection and intervention.
CHAPTER 2
LITERATURE REVIEW
4. "Integration of Multi-Modal Data for Autism Spectrum Disorder
Diagnosis Using Machine Learning"
Authors: Alex Chen, Jessica Wang
Published Date: 2022
This paper investigates the integration of multi-modal data sources, such as
genetic markers, neuroimaging scans, and behavioral assessments, for ASD
diagnosis. It discusses ensemble learning methods, data fusion strategies, and the
importance of feature selection in improving the accuracy of machine learning-
based ASD detection systems.
7. "A Comparative Study of Machine Learning Models for Autism
Spectrum Disorder Detection Using EEG Signals"
Authors: Andrew Roberts, Jessica Lee
Published Date: 2020
This study compares the performance of different machine learning models,
such as k-nearest neighbors (KNN), decision trees, and gradient boosting, in
detecting autism spectrum disorder using electroencephalography (EEG) signals.
It examines the classification accuracy and computational efficiency of each
model to assess their suitability for EEG-based ASD detection.
CHAPTER 3
lights, sounds, textures, or tastes. This can lead to sensory overload or
avoidance behaviors in certain environments.
3.1.5 Range of Symptoms and Severity:
ASD is often referred to as a spectrum because it encompasses a
wide range of symptoms and levels of impairment. Some individuals with
ASD may have mild symptoms and lead relatively independent lives, while
others may require substantial support and intervention in daily activities.
3.2 Origins of the ASD Data Set:
I was able to find open-source data at the UCI Machine
Learning Repository. The data were made available to the public
on December 24th, 2017. The data set, which I will refer to as the
ASD data set from here on, comes as a .csv file containing 704
instances described by 21 attributes, a mix of numerical and
categorical variables. A short description of the ASD data set can be found on
the repository page. This data set was donated by Prof. Fadi Fayez Thabtah,
Department of Digital Technology, MIT, Auckland, New Zealand,
[email protected]
CHAPTER 4
ALGORITHMS AND TECHNIQUES
Below we discuss the algorithms we have applied to our preprocessed ASD
data:
4.1 Decision Trees:
We will begin by creating a Decision Tree Classifier, also known as the ID3
algorithm, and fitting our training data. A Decision Tree uses a tree structure to
represent a number of possible decision paths and an outcome for each path.
Decision Trees can also perform regression tasks, and they form the fundamental
components of Random Forests, which will be applied to this data set in the
next section.
Decision Tree models are easy to use, run quickly, are able to handle both
categorical and numerical data, and graphically allow you to interpret the
data. Further, we don’t have to worry about whether the data is linearly
separable or not.
On the other hand, Decision Trees are highly prone to overfitting; one remedy
is pruning the branches so that not too many features are included, and another
is the use of ensemble methods such as Random Forests. A further weakness of
Decision Trees is that they do not support online learning, so the tree has to be
rebuilt when new examples come in.
A Decision Tree model is a good candidate for this problem, as such models are
particularly adept at binary classification; however, it may run into problems
due to the high number of features, so care will have to be taken with regard
to feature selection. Because of these advantages and the ease of interpreting
the results, we will use the Decision Tree Classifier as the benchmark model.
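As a rough illustrative sketch (not the exact pipeline used later in Chapter 5; it assumes the preprocessed features and labels have already been split into X_train, X_test, y_train and y_test, and max_depth=5 is an assumed value rather than a tuned setting), the benchmark could be fitted as follows, with max_depth acting as a simple pruning control:

from sklearn.tree import DecisionTreeClassifier

# Illustrative benchmark only; limiting the depth is one way to curb overfitting.
tree_clf = DecisionTreeClassifier(max_depth=5, random_state=1)
tree_clf.fit(X_train, y_train)
print("Decision Tree benchmark accuracy:", tree_clf.score(X_test, y_test))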
Figure 9: Random Forest Diagram
4.3 Support Vector Machines (SVM):
Next, we move on to the Support Vector Machine algorithm (SVM) which
is, by far, my favorite machine learning algorithm. Support Vector Machine
is a supervised machine learning algorithm that is commonly used in
classification problems. It is based on the idea of finding the hyperplane
that ‘best’ splits a given data set into two classes. The algorithm gets its
name from the support vectors (the data points closest to the hyperplane),
which are points of a data set that if removed would alter the position of the
separating hyperplane. (See Figure 10).
The distance between the hyperplane and the nearest training data point
from either set is known as the margin. Mathematically, the SVM algorithm
is designed to find the hyperplane that provides the largest minimum
distance to the training instances. In other words, the optimal separating
hyperplane maximizes the margin of the training data.
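A minimal sketch of a linear-kernel SVM in scikit-learn follows, assuming the same X_train/y_train split as in Chapter 5 (C=1 is an illustrative value, not the tuned setting reported later); the fitted model exposes the support vectors described above:

from sklearn.svm import SVC

# Maximum-margin classifier with a linear kernel; C controls the soft margin.
svm_clf = SVC(kernel='linear', C=1)
svm_clf.fit(X_train, y_train)
print("Support vectors per class:", svm_clf.n_support_)
print("Test accuracy:", svm_clf.score(X_test, y_test))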
4.4 k-Nearest Neighbors (kNN):
The k Nearest Neighbor (kNN) algorithm is based mainly on two ideas: the
notion of a distance metric, and the idea that points that are close to one another
are similar.
Let x be the new data point for which we wish to predict a label. The k Nearest
Neighbor algorithm works by finding the k training data points x1, x2, ..., xk
closest to x using a Euclidean distance metric. The kNN algorithm then
performs majority voting to determine the label for the new data point x. In
the case of binary classification it is customary to choose k to be odd.
In the situation where we encounter a tie as a result of majority voting, there
are a couple of things we can do. First of all, we could randomly choose the
winner among the labels that are tied. Secondly, we could weight the votes by
distance and choose the weighted winner. Last but not least, we could lower the
value of k until we find a unique winner.
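As a sketch of the second tie-breaking strategy, scikit-learn's weights='distance' option weighs each neighbour's vote by the inverse of its distance (the split variables are assumed to come from Chapter 5, and k=10 is illustrative):

from sklearn.neighbors import KNeighborsClassifier

# Distance-weighted voting: closer neighbours count for more, which also resolves ties.
knn_clf = KNeighborsClassifier(n_neighbors=10, weights='distance')
knn_clf.fit(X_train, y_train)
print("kNN test accuracy:", knn_clf.score(X_test, y_test))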
4.5 Naive Bayes:
We proceed with the study of supervised machine learning algorithms by
applying Naive Bayes (NB), which is based on conditional probability
(Bayes' theorem) and counting. The name "naive" comes from its core
assumption of conditional independence, i.e. all input features are
independent of one another. If the NB conditional independence
assumption actually holds, an NB classifier converges more quickly than
discriminative models such as logistic regression, so less training
data is needed. Even when the NB assumption does not hold, an NB classifier
still often does a good job in practice. Its main disadvantage is that it cannot
learn interactions between features, and it only works well with a limited
number of features. In addition, there is a high bias when there is a small
amount of data.
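For the mostly one-hot encoded, non-negative features used in this project, a count-based NB variant is a natural fit. A minimal sketch, matching the MultinomialNB applied in Chapter 5 and assuming the same train/test split:

from sklearn.naive_bayes import MultinomialNB

# Counting-based Naive Bayes; assumes non-negative, count-like input features.
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)
print("Naive Bayes test accuracy:", nb_clf.score(X_test, y_test))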
4.6 Logistic Regression:
The goal of logistic regression is to find the best fitting model to describe
the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of
independent (predictor or explanatory) variables. Logistic regression
generates the coefficients (and their standard errors and significance levels)
of a formula to predict a logit transformation of the probability of presence
of the characteristic of interest:
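logit(p) = ln(p / (1 - p)) = b0 + b1 x1 + b2 x2 + ... + bk xk,

where p is the probability that the characteristic of interest (here, ASD) is present and x1, ..., xk are the independent variables. In this generic form, b0, ..., bk denote the coefficients estimated by the model; no specific values from this study are implied.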
4.8 Multi-Layer Perceptron (MLP):
A multilayer perceptron (MLP) is a class of feedforward artificial neural
network. An MLP consists of at least three layers of nodes. Except for the
input nodes, each node is a neuron that uses a nonlinear activation function.
MLP utilizes a supervised learning technique called backpropagation for
training. Its multiple layers and non-linear activation distinguish MLP from
a linear perceptron. It can distinguish data that is not linearly separable.
Multilayer perceptrons are sometimes colloquially referred to as 'vanilla'
neural networks, especially when they have a single hidden layer.
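A minimal sketch of an MLP on this data follows; the single hidden layer of 16 units, the activation, and max_iter are illustrative assumptions rather than tuned values, and the Chapter 5 train/test split is assumed:

from sklearn.neural_network import MLPClassifier

# One hidden layer with a nonlinear activation, trained by backpropagation.
mlp_clf = MLPClassifier(hidden_layer_sizes=(16,), activation='relu',
                        max_iter=1000, random_state=1)
mlp_clf.fit(X_train, y_train)
print("MLP test accuracy:", mlp_clf.score(X_test, y_test))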
CHAPTER 5
SOFTWARE DESIGN
5.1 AUTISM.PY:
import numpy as np
import pandas as pd
from time import time
from IPython.display import display
import visuals as vs
data = pd.read_csv("autism_data.csv")
display(data.head(5))
n_records = len(data.index)
n_asd_yes = len(data[data['Class/ASD'] == 'YES'])
n_asd_no = len(data[data['Class/ASD'] == 'NO'])
yes_percentage = (n_asd_yes / n_records) * 100
print(f'Total number of records: {n_records}')
print(f'Number of individuals with ASD: {n_asd_yes}')
print(f'Number of individuals without ASD: {n_asd_no}')
print("Percentage of individuals with ASD: {:.2f}%".format(yes_percentage))
data.info()
data.describe()
missing_values_count = data.isna().sum()
print("Number of missing values in each column:")
print(missing_values_count)
data.dropna(inplace=True)
descriptive_stats = data.describe()
print(descriptive_stats)
n_records = len(data.index)
n_asd_yes = len(data[data['Class/ASD'] == 'YES'])
n_asd_no = len(data[data['Class/ASD'] == 'NO'])
print("AFTER REMOVING NULL VALUES:")
print(f'Total number of records: {n_records}')
print(f'Number of individuals with ASD: {n_asd_yes}')
print(f'Number of individuals without ASD: {n_asd_no}')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", color_codes=True)
sns.violinplot(x="result", y="jundice", hue="austim", data=data, split=True,
13
inner="quart", palette={'yes': "r", 'no': "b"})
sns.despine(left=True)
plt.show()
sns.violinplot(x="result", y="jundice", hue="Class/ASD", data=data, split=True,
inner="quart", palette={'YES': "r", 'NO': "b"})
sns.despine(left=True)
plt.show()
sns.catplot(x="jundice", y="result", hue="Class/ASD", s=5, col="gender", data=data,
kind="swarm")
plt.show()
data_raw = data['Class/ASD']
features_raw = data[['age', 'gender', 'ethnicity', 'jundice', 'austim', 'contry_of_res',
'result',
'relation', 'A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score',
'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score']]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
num = ['age', 'result']
features_minmax_transform = pd.DataFrame(data=features_raw)
features_minmax_transform[num] = scaler.fit_transform(features_raw[num])
from IPython.display import display
display(features_minmax_transform.head(5))
features_final = pd.get_dummies(features_minmax_transform)
features_final.head(5)
data_classes = data_raw.apply(lambda x: 1 if x == 'YES' else 0)
encoded = list(features_final.columns)
print("{} total features after one-hot encoding".format(len(encoded)))
print(encoded)
import matplotlib.pyplot as plt
plt.hist(data_classes, bins=10)
plt.xlim(0, 1)
plt.title('Histogram of Class/ASD')
plt.xlabel('Class/ASD from processed data')
plt.ylabel('Frequency')
plt.show()
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(features_final, data_classes,
test_size=0.2, random_state=1)
print("Train set has {} entries.".format(X_train.shape[0]))
print("Test set has {} entries.".format(X_test.shape[0]))
from sklearn.tree import DecisionTreeClassifier
dec_model = DecisionTreeClassifier()
dec_model.fit(X_train.values, y_train)
y_pred = dec_model.predict(X_test.values)
print('True labels : ', y_test.values[0:25])
print('Predicted labels:', y_pred[0:25])
from sklearn import metrics
cm = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
TP = cm[1, 1]
FP = cm[0, 1]
TN = cm[0, 0]
FN = cm[1, 0]
print("True Positives (TP):", TP)
print("False Positives (FP):", FP)
print("True Negatives (TN):", TN)
print("False Negatives (FN):", FN)
accuracy = (TN + TP) / float(TP + TN + FP + FN)
print('Accuracy:', accuracy)
error = (FP + FN) / float(TP + TN + FP + FN)
print('Error:', error)
precision = metrics.precision_score(y_test, y_pred)
print('Precision:', precision)
score = dec_model.score(X_test.values, y_test)
print('Score:', score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rndm_model = RandomForestClassifier(n_estimators=5, random_state=1)
cv_scores = cross_val_score(rndm_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn import svm
from sklearn.model_selection import cross_val_score
svm_model = svm.SVC(kernel='linear', C=1, gamma=2)
cv_scores = cross_val_score(svm_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn.metrics import fbeta_score
svm_model.fit(X_train.values, y_train)
y_pred = svm_model.predict(X_test.values)
fbeta = fbeta_score(y_test, y_pred, average='binary', beta=0.5)
print("F-beta Score:", fbeta)
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
knn_model = neighbors.KNeighborsClassifier(n_neighbors=10)
cv_scores = cross_val_score(knn_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn.metrics import fbeta_score
knn_model.fit(X_train.values, y_train)
y_pred = knn_model.predict(X_test.values)
fbeta = fbeta_score(y_test, y_pred, average='binary', beta=0.5)
print("F-beta Score:", fbeta)
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
for n in range(10, 30):
    knn_model = neighbors.KNeighborsClassifier(n_neighbors=n)
    cv_scores = cross_val_score(knn_model, features_final, data_classes, cv=10)
    print(n, cv_scores.mean())
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
nb_model = MultinomialNB()
cv_scores = cross_val_score(nb_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn.metrics import fbeta_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
nb_model = MultinomialNB()
nb_cv_score = cross_val_score(nb_model, features_final, data_classes, cv=10)
print("Multinomial Naive Bayes Mean Cross-Validation Score:", nb_cv_score.mean())
nb_model.fit(X_train.values, y_train)
y_pred_nb = nb_model.predict(X_test.values)
fbeta_nb = fbeta_score(y_test, y_pred_nb, average='binary', beta=0.5)
print("Multinomial Naive Bayes F-beta Score:", fbeta_nb)
lr_model = LogisticRegression()
lr_cv_score = cross_val_score(lr_model, features_final, data_classes, cv=10)
print("Logistic Regression Mean Cross-Validation Score:", lr_cv_score.mean())
lr_model.fit(X_train.values, y_train)
y_pred_lr = lr_model.predict(X_test.values)
fbeta_lr = fbeta_score(y_test, y_pred_lr, average='binary', beta=0.5)
print("Logistic Regression F-beta Score:", fbeta_lr)
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
scorers = {
'fbeta_score': make_scorer(fbeta_score, beta=0.5),
'accuracy_score': make_scorer(accuracy_score)
}
X_train, X_test, y_train, y_test = train_test_split(features_final, data_classes,
test_size=0.2, random_state=1)
model = RandomForestClassifier()
param_grid = {
'n_estimators': [10, 50, 100], # Example values for tuning
'max_depth': [None, 10, 20, 30] # Example values for tuning
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
scoring=scorers, refit='fbeta_score', cv=10)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_scores = grid_search.best_score_
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)
fbeta = fbeta_score(y_test, y_pred, beta=0.5)
accuracy = accuracy_score(y_test, y_pred)
print("F-beta Score:", fbeta)
print("Accuracy Score:", accuracy)
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
def f_beta_score(y_true, y_predict):
    return fbeta_score(y_true, y_predict, beta=0.5)
clf = SVC(random_state=1)
parameters = {'C': range(1, 6), 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree':
range(1, 6)}
scorer = make_scorer(f_beta_score)
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, scoring=scorer,
cv=10)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_scores = grid_search.best_score_
best_estimator = grid_search.best_estimator_
from sklearn.model_selection import GridSearchCV
grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train.values, y_train)
best_clf = grid_fit.best_estimator_
predictions = clf.fit(X_train.values, y_train).predict(X_test.values)
best_predictions = best_clf.predict(X_test.values)
print("Unoptimized Model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test,
predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions,
beta=0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test,
best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test,
best_predictions, beta=0.5)))
5.2 VISUAL.PY:
import matplotlib.pyplot as plt
import seaborn as sns

# Excerpt of a distribution-plotting helper. The enclosing function and its
# `transformed` flag are assumed from context; the name `distribution` is illustrative.
def distribution(data, transformed=False):
    sns.set()
    sns.set_style("whitegrid")
    # Create figure
    fig = plt.figure(figsize=(11, 5))
    # Plot aesthetics
    if transformed:
        fig.suptitle("Log-transformed Distributions", fontsize=16, y=1.03)
    else:
        fig.suptitle("Skewed Distributions", fontsize=16, y=1.03)
    fig.tight_layout()
    fig.show()
def visualize_classification_performance(results):
    """
    Visualization code to display results of various learners.
    inputs:
      - results: a list of dictionaries of the statistic results from 'train_predict_evaluate()'
    """
    # Create figure
    sns.set()
    sns.set_style("whitegrid")
    fig, ax = plt.subplots(2, 3, figsize=(11, 7))
    # print("VERSION:")
    # print(matplotlib._version_)
    # Constants
    bar_width = 0.3
    colors = ["#e55547", "#4e6e8e", "#2ecc71"]
    # Add titles
    ax[0, 0].set_title("Model Training")
    ax[0, 1].set_title("Accuracy Score on Training Subset")
    ax[0, 2].set_title("F-score on Training Subset")
    ax[1, 0].set_title("Model Predicting")
    ax[1, 1].set_title("Accuracy Score on Testing Set")
    ax[1, 2].set_title("F-score on Testing Set")
    # Aesthetics
    plt.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize=16, y=1.10)
    plt.tight_layout(pad=1, w_pad=2, h_pad=5.0)
    plt.show()
sns.set()
sns.set_style("whitegrid")
CHAPTER 6
RESULT ANALYSIS
Figure 5.1
Figure 5.2: Nerve damage - Autism
Figure 5.3: Gender
3. Support Vector Machine (SVM):
Mean Cross-Validation Score: 1.0
F-beta Score: 1.0
5. Naive Bayes:
Mean Cross-Validation Score: 0.8746277665995976
F-beta Score: 0.7675438596491229
6. Logistic Regression:
Mean Cross-Validation Score: 0.9971428571428571
F-beta Score: 0.9948979591836735
CHAPTER 7
CONCLUSION
CHAPTER 8
FUTURE ENHANCEMENT
Thus, to summarize, we set out with the hope of applying machine learning
algorithms, specifically supervised machine learning techniques, that can classify new
patients (new instances) with certain measurable characteristics (the variables) into one of
two categories: "patient has ASD" or "patient does not have ASD". Cleaning the data set
(which documented the characteristics associated with ASD) was challenging in that we had
mostly categorical variables and just two numerical variables, but ultimately we were able to
build such models and found that the algorithm that performs the best in all aspects is the
SVM, using a linear kernel. The SVM outperformed all other models with respect to the
cross-validation score, AUC score, and F-beta score, all of which were 1, and thus was as
good as our benchmark model. Although the data association made the prediction very
simple, we feel this work can serve as a valuable aid for physicians in the detection of new
autistic cases.
In our view, building an accurate and robust model requires larger data sets. Here the
number of instances remaining after cleaning the data was not sufficient to claim that this
model is optimal. Looking at the performances of our learning models, little can be improved
with the current data set, as the models are already at their best. After discussing this issue
with a researcher working directly on adult autism, we have realised that it is extremely
difficult to collect a large amount of well-documented data related to ASD. This ASD data
set has only recently been made public (available since December 2017), and thus not much
work has been done on it.
REFERENCES
[1] B. Godsey, Think Like a Data Scientist, Manning, ISBN: 9781633430273.
[2] H. Brink, J. Richards, M. Fetherolf, Real-World Machine Learning, Manning, ISBN:
9781617291920.
[3] D. Cielen, A. Meysman, M. Ali, Introducing Data Science, Manning, ISBN:
9781633430037.
[4] J. Grus, Data Science from Scratch: First Principles with Python, O'Reilly, ISBN:
9781491901427.
[5] A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly,
ISBN: 9781491962299.
[6] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Second
Edition, Springer.
[7] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning
with Applications in R, Springer, ISBN: 9781461471370.
[8] Thabtah, F. (2017). Autism Spectrum Disorder Screening: Machine Learning Adaptation
and DSM-5 Fulfillment. Proceedings of the 1st International Conference on Medical and
Health Informatics 2017, pp. 1-6. Taichung City, Taiwan, ACM.
[9] Thabtah, F. (2017). ASDTests. A mobile app for ASD screening. www.asdtests.com
[accessed December 20th, 2017].
[10] Thabtah, F. (2017). Machine Learning in Autistic Spectrum Disorder Behavioural
Research: A Review. Informatics for Health and Social Care Journal. December, 2017 (in
press).