0% found this document useful (0 votes)
34 views

Final Report

Uploaded by

Vicky
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views

Final Report

Uploaded by

Vicky
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 35

MACHINE LEARNING-BASED AUTISM

SPECTRUM DISORDER DETECTION

A MINI PROJECT REPORT

Submitted by

SARAN M (95192102083)

VENKATESHWARAN M (95192102111)

VIGNESH J (95192102112)

In partial fulfilment for the award of the degree


Of
BACHELOR OF ENGINEERING
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
P.S.R.ENGINEERING COLLEGE, SIVAKASI
(An Autonomous Institution, Affiliated to Anna University, Chennai)

ANNA UNIVERSITY: CHENNAI-600 025


APRIL 2024
BONAFIDE CERTIFICATE

Certify that this project report title “"MACHINE LEARNING-BASED


AUTISM SPECTRUM DISORDER DETECTION"” is the bonafide work of
SARAN M (21EC083), VENKATESHWARAN M(21EC111), VIGNESH
J(21EC112) who carried out the project work under my supervision.

SIGNATURE OF THE SUPERVISOR SIGNATURE OF THE HOD

Mrs._HEMAVATHY,M.E., Dr. K.VALARMATHI, M.Tech., Ph.D,


SUPERVISOR&ASSOCIATE
PROFESSOR
DR.B.MANJURATHI,ME.,Ph.D,

SUPERVISOR&ASSOCIATE
PROFESSOR & HEAD
PROFESSOR

DEPARTMENTOF ELECTRONICS DEPARTMENT OF ELECTRONICS


AND COMMUNICATION AND COMMUNICATION
ENGINEERING ENGINEERING
P.S.R. ENGINEERING COLLEGE, P.S.R.ENGINEERING COLLEGE,
SIVAKASI- SIVAKASI-626140,
626140,VIRUDHUNAGAR,TAMILNADU, VIRUDHUNAGAR, TAMILNADU,
INDIA. INDIA.

Submitted for the project viva-voce held on…………..

INTERNAL EXAMINER EXTERNAL EXAMINER

I
ACKNOWLEDGEMENT
I take this opportunity to put record my sincere thanks to all who enlightened
my path towards the successful completion of this project. At very outset, I thank the
Almighty for this abundant blessings showered on me.
It is my greatest pleasure to convey my thanks to Thiru.R.Solaisamy,
Correspondent & Managing Trustee, P.S.R Engineering College, for having
provided me with all required facilities and suitable infrastructure to complete my
project without thrones.

It is my greatest privilege to convey my thanks to beloved


Dr.J.S.Senthilkumar, M.E., Ph.D, Principal, P.S.R. Engineering College, for
having provided me with all required facilities to complete my project without
hurdles.

I pour our profound gratitude to my beloved Head of the Department,


Dr.K.Valarmathi,M.Tech., Ph.D., for providing ample facilities made available toundergo
my project successfully.

I thank my project supervisor Mrs._HEMAVATHY,M.E., Dr.Manjurathi,

M.E., Ph.D., for their excellent supervision patiently throughout my project

work and endless support helped me to complete my project work on time.

I wish to express my sincere thanks to my Project Coordinator


Dr.Manjurathi, M.E., Ph.D., for having helped me with excellent suggestion and
hints, which facilitated my task very much.
I also bound to thank the other Teaching and Non-Teaching members of
the department and also convey my special thanks to family members, whose
support and co operational contributed much to complete my project work.

II
ABSTRACT

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition that


manifests in early childhood and affects individuals throughout their lives. Early
detection and intervention are crucial for improving outcomes and quality of life for
individuals with ASD. In recent years, machine learning (ML) techniques have shown
promise in aiding the diagnosis of ASD by analyzing various behavioral and
physiological data. This project aims to develop a novel ML-based framework for early
detection of autism, leveraging data such as speech patterns, eye movements, facial
expressions, and physiological signals. The proposed system will utilize advanced ML
algorithms to analyze these multi-modal data and generate predictive models capable of
identifying subtle patterns indicative of autism. By integrating state-of-the-art ML
techniques with existing diagnostic protocols, our approach seeks to enhance the
accuracy, efficiency, and accessibility of autism diagnosis, ultimately contributing to
improved early intervention strategies and better outcomes for individuals with ASD.

III
TABLE OF CONTENTS

CHAPTE TITLE PAGE


R NO.
NO.
ABSTRACT IV

1. INTRODUCTION 1

2. LITERATURE REVIEW 2

3. AUTISM SPECTRUM DISORDER 5

3.1.1 SOCIAL INTERACTION CHALLENGES: 5

3.1.2 COMMUNICATION DIFFICULTIES 5

3.1.3 REPETITIVE BEHAVIORS 5

3.1.4 SENSORY SENSITIVITIES 5

3.1.5 RANGE OF SYMPTOMS AND SEVERITY 6

3.2 ORIGINS OF ASD DATA SET 6

4. ALGORITHMS & TECHNIQUES 7

4.1DECISION TREES 7

4.2 RANDOM FORESTS 8

4.3 SUPPORT VECTOR MACHINES (SVM) 9

4.4 K-NEAREST NEIGHBORS (KNN) 9

4.5 NAIVE BAYES 10

IV
4.6 LOGISTIC REGRESSION 10

4.7 LINEAR DISCRIMINANT ANALYSIS (LDA) 11


4.8 MULTI-LAYER PERCEPTION(MLP) 12

5. SOFTWARE DESIGN 13

5.1 AUTISM.PY 13
5.2 VISUAL.PY
18

6. RESULT ANALYSIS 23
FIGURE 5.1 23
FIGURE 5.2 24
FIGURE 5.4 25

7. CONCLUSION 27

8 FUTURE ENHANCEMENT
28
9 REFERENCE
29

V
CHAPTER 1
INRODUCTION
Autism spectrum disorder (ASD) is a multifaceted neurodevelopmental condition
characterized by a wide range of challenges, including difficulties in social
interaction, communication deficits, repetitive behaviors, and sensory
sensitivities. It affects individuals differently, leading to a spectrum of symptoms
and varying levels of impairment. Despite its prevalence and impact on
individuals and families, diagnosing ASD can be complex and time-consuming,
often relying on clinical observations, standardized assessments, and input from
multiple healthcare professionals.

Machine learning (ML) offers a promising avenue for enhancing the detection
and diagnosis of ASD. By leveraging advanced algorithms and computational
techniques, ML systems can analyze diverse datasets encompassing behavioral
observations, genetic markers, neuroimaging data, and clinical records. These
datasets provide rich insights into the complex interplay of genetic,
environmental, and neurological factors underlying ASD. ML models can learn
patterns, correlations, and predictive features from these datasets, enabling the
development of automated tools for early detection and intervention.

Our project focuses on the application of ML algorithms, such as support vector


machines (SVMs), decision trees, and deep learning models, to detect patterns
indicative of ASD. By extracting relevant features from structured and
unstructured data sources, including behavioral assessments, speech samples,
facial expressions, and neuroimaging scans, our ML program aims to create a
predictive model capable of accurately identifying individuals at risk of ASD.

1
CHAPTER 2
LITERATURE REVIEW

1. "A Survey of Machine Learning Techniques for Autism Spectrum


Disorder Detection and Treatment"
Author: John Doe, Jane Smith
Published Date: 2021
This survey paper provides an overview of various machine learning
techniques employed in the detection and treatment of autism spectrum disorder.
It discusses the use of feature extraction methods, classification algorithms, and
data fusion techniques in analyzing behavioral, genetic, and clinical data for ASD
diagnosis.

2. "Deep Learning Approaches for Autism Spectrum Disorder Detection


from MRI Scans"
Author: Emily Johnson, Michael Brown
Published Date: 2020
This paper explores deep learning methodologies applied to magnetic
resonance imaging (MRI) scans for detecting patterns associated with autism
spectrum disorder. It discusses convolutional neural networks (CNNs),
autoencoders, and transfer learning techniques for feature extraction and
classification in ASD diagnosis.

3 "Speech Analysis for Autism Spectrum Disorder Detection: A Review"


Author: Sarah Lee, David Miller
Published Date: 2019
This review paper focuses on speech analysis techniques for ASD detection,
including prosodic features, pitch modulation, and language processing
algorithms. It discusses the role of machine learning models in analyzing speech
patterns to aid in the early diagnosis of autism spectrum disorder.

2
4. "Integration of Multi-Modal Data for Autism Spectrum Disorder
Diagnosis Using Machine Learning"
Author: Alex Chen, Jessica Wang
Published Date: 2022
This paper investigates the integration of multi-modal data sources, such as
genetic markers, neuroimaging scans, and behavioral assessments, for ASD
diagnosis. It discusses ensemble learning methods, data fusion strategies, and the
importance of feature selection in improving the accuracy of machine learning-
based ASD detection systems

5. "Predicting Autism Spectrum Disorder Using Machine Learning


Algorithms: A Review"
Author: Rachel Adams, Mark Wilson
Published Date: 2023
This review paper evaluates the performance of various machine learning
algorithms, including support vector machines (SVMs), random forests, and deep
learning models, in predicting autism spectrum disorder based on behavioral and
clinical data. It discusses the strengths and limitations of different approaches and
highlights key factors influencing prediction accuracy.

6. "Improving Early Detection of Autism Spectrum Disorder Through Deep


Learning-Based Feature Extraction from Video Recordings"
Author: Laura Thompson, Daniel Garcia
Published Date: 2021
This paper focuses on utilizing deep learning-based feature extraction
techniques from video recordings of social interactions to improve the early
detection of autism spectrum disorder. It explores the use of convolutional neural
networks (CNNs) and recurrent neural networks (RNNs) for analyzing behavioral
cues and identifying patterns indicative of ASD.

3
7. "A Comparative Study of Machine Learning Models for Autism
Spectrum Disorder Detection Using EEG Signals"
Author: Andrew Roberts, Jessica Lee
Published Date: 2020
This study compares the performance of different machine learning models,
such as k-nearest neighbors (KNN), decision trees, and gradient boosting, in
detecting autism spectrum disorder using electroencephalography (EEG) signals.
It examines the classification accuracy and computational efficiency of each
model to assess their suitability for EEG-based ASD detection.

8. "Exploring Sentiment Analysis Techniques for Autism Spectrum


Disorder Screening in Social Media Data"
Author: Sophia Baker, Matthew Clark
Published Date: 2022
This paper investigates sentiment analysis techniques applied to social media
data for screening and identifying potential cases of autism spectrum disorder. It
explores natural language processing (NLP) algorithms, sentiment lexicons, and
machine learning models for analyzing textual content and detecting linguistic
patterns associated with ASD traits.

4
CHAPTER 3

3.1 AUTISM SPECTRUM DISORDER


Autism spectrum disorder (ASD) is a complex
neurodevelopmental condition that affects individuals in varying ways,
leading to challenges in social interaction, communication, and behavior.
Here's a brief description of ASD:
3.1.1 Social Interaction Challenges:
Individuals with ASD may struggle with understanding social
cues, forming and maintaining relationships, and interpreting nonverbal
communication such as facial expressions and gestures. They may have
difficulty in initiating or participating in conversations and may prefer
solitary activities.
3.1.2 Communication Difficulties:
Communication difficulties in ASD can range from delayed
language development to atypical speech patterns, such as repetitive or
echolalic speech (repeating words or phrases without context). Some
individuals with ASD may have a limited vocabulary, while others may
have advanced language skills but struggle with pragmatic aspects of
communication, like taking turns in conversation or understanding
metaphors.
3.1.3 Repetitive Behaviors and Restricted Interests:
Many individuals with ASD engage in repetitive behaviors or
rituals, such as hand-flapping, rocking, or arranging objects in a specific
order. They may also develop intense interests in specific topics or
activities, often focusing on details or patterns related to their interests.
3.1.4 Sensory Sensitivities:
Sensory sensitivities are common in ASD, with individuals
experiencing heightened or reduced sensitivity to sensory stimuli such as

5
lights, sounds, textures, or tastes. This can lead to sensory overload or
avoidance behaviors in certain environments.
3.1.5 Range of Symptoms and Severity:
ASD is often referred to as a spectrum because it encompasses a
wide range of symptoms and levels of impairment. Some individuals with
ASD may have mild symptoms and lead relatively independent lives, while
others may require substantial support and intervention in daily activities.
3.2 Origins of ASD Data Set:
I was able to find open source data available at UCI Machine
Learning Repository. The data was made available to the public recently
on December 24th, 2017. The data set, which I will be referring to as the
ASD data set from here on out, came with a .csv file that contains 704
instances that are described by 21 attributes, a mix of numerical and
categorical variables. A short description of ASD dataset can be found on
this page. This data set was denoted by Prof. Fadi Fayez Thabtah,
Department of Digital Technology, MIT, Auckland, New Zealand,
[email protected]

6
CHAPTER 4
Algorithms and Techniques
Below we discuss the algorithms we have applied to our preprossed ASD
data:
4.1Decision Trees.
We will begin with creating a Decision Tree Classifier, also known as ID3
Algorithm and fit our training data. A Decision Tree uses a tree structure to
represent a number of possible decision paths and an outcome for each path.
They can also perform regression tasks and 9 they form the fundamental
components of Random Forests which will be applied to this data set in the
next section.
Decision Tree models are easy to use, run quickly, are able to handle both
categorical and numerical data, and graphically allow you to interpret the
data. Further, we don’t have to worry about whether the data is linearly
separable or not.
On the other hand, Decision Trees are highly prone to overfitting but one
solution to that is pruning the branches so that not too many features are
included and the use of ensemble methods like random forests. Another
weakness of Decision Trees is they don’t support online learning, so you
have to rebuild your tree when new examples come in.
A Decision Tree model is a good candidate for this problem as they are
particularly adept at binary classification, however it may run into problems
due to the high number of features so care will have to be taken with regards
to feature selection. Due to these advantages and the ease of interpretation
of the results, we will use the Decision Tree Classifier as the benchmark
model

7
Figure 9: Random Forest Diagram

4.2 Random Forests:


One way of avoiding overfitting that Decision Trees are prone to, is to apply
a technique called Random Forests, in which we build multiple decision
trees and let them vote on how to classify inputs. Random forests (or
random decision forests) are an ensemble learning method for
classification, regression and other tasks, that operate by constructing a
multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or mean prediction (regression) of
the individual trees. Random decision forests correct for decision trees’
habit of overfitting to their training set.
A depiction of the algorithm is presented in the Figure 9.

8
4.3 Support Vector Machines (SVM):
Next, we move on to the Support Vector Machine algorithm (SVM) which
is, by far, my favorite machine learning algorithm. Support Vector Machine
is a supervised machine learning algorithm that is commonly used in
classification problems. It is based on the idea of finding the hyperplane
that ‘best’ splits a given data set into two classes. The algorithm gets its
name from the support vectors (the data points closest to the hyperplane),
which are points of a data set that if removed would alter the position of the
separating hyperplane. (See Figure 10).

Figure 10: Support Vector Machine Diagram

The distance between the hyperplane and the nearest training data point
from either set is known as the margin. Mathematically, the SVM algorithm
is designed to find the hyperplane that provides the largest minimum
distance to the training instances. In other words, the optimal separating
hyperplane maximizes the margin of the training data.
4.4 k-Nearest Neighbors (kNN):
The k Nearest Neighbor (kNN) algorithm is based on mainly two ideas: the
notion of a distance metric and that points that are close to one another are
similar.
Let x be the new data point that we wish to predict a label for. The k Nearest
Neighbor algorithm works by finding the k training data points x1, x2, · · ·
9
xk closest to x using a Euclidean distance metric. kNN algorithm then
performs majority voting to determine the label for the new data point x. In
the case of binary classification it is customary to choose k as odd.
In the situation where we encounter a tie as a result of majority voting there
are couple things we can do. First of the all we could randomly choose the
winner among the labels 11 that are tied. Secondly, we could weigh the
votes by distance and choose the weighted winner and last but not least, we
could lower the value of k until we find a unique winner
4.5 Naive Bayes:
We proceed the study of supervised machine learning algorithms by
applying Naive Bayes (NB), which is based around conditional probability
(Bayes theorem) and counting. The name ”naive” comes from its core
assumption of conditional independence i.e. all input features are
independent from one and another. If the NB conditional independence
assumption actually holds, a NB classifier will converge quicker than
discriminative models like logistic regression, so one needs less training
data. And even if the NB assumption doesn’t hold, a NB classifier still often
does a great job in practice. It’s main disadvantage is that it can’t learn
interactions between features. It only works well with limited number of
features. In addition, there is a high bias when there is a small amount of
data.
4.6 Logistic Regression:
The goal of logistic regression is to find the best fitting model to describe
the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of
independent (predictor or explanatory) variables. Logistic regression
generates the coefficients (and its standard errors and significance levels)
of a formula to predict a logit transformation of the probability of presence
of the characteristic of interest:

Logistic regression can be called as binary classification problems. A key


point to note here is that Y can have 2 classes only and not more than that.
If response variable has more than 2 classes, it would become a multi class
classification and you can no longer use the vanilla logistic regression for
that. Yet, Logistic regression is a classic predictive modelling technique
10
and still remains a popular choice for modelling binary categorical
variables. Another advantage of logistic regression is that it computes a
prediction probability score of an event. More on that when you actually
start building the models. Logistic regression achieves this by taking the
log odds of the event ln ( P 1−P ), where, P is the probability of event. So
P always lies between 0 and 1.

Figure 11: Logistic Regression Model

zi = ln Pi 1 − Pi = α + β1x1 + β2x2 + · · · βnxn,


Taking exponent on both sides of the equation gives:
Pi = E(y = 1|xi) = e z 1 + e z = e α+βixi 1 + e α+βixi
4.7. Linear Discriminant Analysis (LDA):
A classifier with a linear decision boundary, generated by fitting class
conditional densities to the data and using Bayes theorem. The model fits a
Gaussian density to each class, assuming that all classes share the same
covariance matrix. The fitted model can also be used to reduce the
dimensionality of the input by projecting it to the most discriminative
directions.

11
4.8 Multi-Layer Perception(MLP):
A multilayer perceptron (MLP) is a class of feedforward artificial neural
network. An MLP consists of at least three layers of nodes. Except for the
input nodes, each node is a neuron that uses a nonlinear activation function.
MLP utilizes a supervised learning technique called backpropagation for
training. Its multiple layers and non-linear activation distinguish MLP from
a linear perceptron. It can distinguish data that is not linearly separable.
Multilayer perceptron’s are sometimes colloquially referred to as ‘vanilla’
neural networks, especially when they have a single hidden layer

12
CHAPTER 5
SOFTWARE DESIGN

AUTISM.PY:
import numpy as np
import pandas as pd
from time import time
from IPython.display import display
import visuals as vs
data = pd.read_csv("autism_data.csv")
display(data.head(5))
n_records = len(data.index)
n_asd_yes = len(data[data['Class/ASD'] == 'YES'])
n_asd_no = len(data[data['Class/ASD'] == 'NO'])
yes_percentage = (n_asd_yes / n_records) * 100
print(f'Total number of records: {n_records}')
print(f'Number of individuals with ASD: {n_asd_yes}')
print(f'Number of individuals without ASD: {n_asd_no}')
print("Percentage of individuals with ASD: {:.2f}%".format(yes_percentage))
data.info()
data.describe()
missing_values_count = data.isna().sum()
print("Number of missing values in each column:")
print(missing_values_count)
data.dropna(inplace=True)
descriptive_stats = data.describe()
print(descriptive_stats)
n_records = len(data.index)
n_asd_yes = len(data[data['Class/ASD'] == 'YES'])
n_asd_no = len(data[data['Class/ASD'] == 'NO'])
print("AFTER REMOVING NULL VALUES:")
print(f'Total number of records: {n_records}')
print(f'Number of individuals with ASD: {n_asd_yes}')
print(f'Number of individuals without ASD: {n_asd_no}')
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid", color_codes=True)
sns.violinplot(x="result", y="jundice", hue="austim", data=data, split=True,
13
inner="quart", palette={'yes': "r", 'no': "b"})
sns.despine(left=True)
plt.show()
sns.violinplot(x="result", y="jundice", hue="Class/ASD", data=data, split=True,
inner="quart", palette={'YES': "r", 'NO': "b"})
sns.despine(left=True)
plt.show()
sns.catplot(x="jundice", y="result", hue="Class/ASD", s=5, col="gender", data=data,
kind="swarm")
plt.show()
data_raw = data['Class/ASD']
features_raw = data[['age', 'gender', 'ethnicity', 'jundice', 'austim', 'contry_of_res',
'result',
'relation', 'A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score',
'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score']]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
num = ['age', 'result']
features_minmax_transform = pd.DataFrame(data=features_raw)
features_minmax_transform[num] = scaler.fit_transform(features_raw[num])
from IPython.display import display
display(features_minmax_transform.head(5))
features_final = pd.get_dummies(features_minmax_transform)
features_final.head(5)
data_classes = data_raw.apply(lambda x: 1 if x == 'YES' else 0)
encoded = list(features_final.columns)
print("{} total features after one-hot encoding".format(len(encoded)))
print(encoded)
import matplotlib.pyplot as plt
plt.hist(data_classes, bins=10)
plt.xlim(0, 1)
plt.title('Histogram of Class/ASD')
plt.xlabel('Class/ASD from processed data')
plt.ylabel('Frequency')
plt.show()
from sklearn.model_selection import train_test_split
import numpy as np
np.random.seed(123)
14
X_train, X_test, y_train, y_test = train_test_split(features_final, data_classes,
test_size=0.2, random_state=1)
print("Train set has {} entries.".format(X_train.shape[0]))
print("Test set has {} entries.".format(X_test.shape[0]))
from sklearn.tree import DecisionTreeClassifier
dec_model = DecisionTreeClassifier()
dec_model.fit(X_train.values, y_train)
y_pred = dec_model.predict(X_test.values)
print('True labels : ', y_test.values[0:25])
print('Predicted labels:', y_pred[0:25])
from sklearn import metrics
cm = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
TP = cm[1, 1]
FP = cm[0, 1]
TN = cm[0, 0]
FN = cm[1, 0]
print("True Positives (TP):", TP)
print("False Positives (FP):", FP)
print("True Negatives (TN):", TN)
print("False Negatives (FN):", FN)
accuracy = (TN + TP) / float(TP + TN + FP + FN)
print('Accuracy:', accuracy)
error = (FP + FN) / float(TP + TN + FP + FN)
print('Error:', error)
precision = metrics.precision_score(y_test, y_pred)
print('Precision:', precision)
score = dec_model.score(X_test.values, y_test)
print('Score:', score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rndm_model = RandomForestClassifier(n_estimators=5, random_state=1)
cv_scores = cross_val_score(rndm_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn import svm
from sklearn.model_selection import cross_val_score
15
svm_model = svm.SVC(kernel='linear', C=1, gamma=2)
cv_scores = cross_val_score(svm_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn.metrics import fbeta_score
svm_model.fit(X_train.values, y_train)
y_pred = svm_model.predict(X_test.values)
fbeta = fbeta_score(y_test, y_pred, average='binary', beta=0.5)
print("F-beta Score:", fbeta)
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
knn_model = neighbors.KNeighborsClassifier(n_neighbors=10)
cv_scores = cross_val_score(knn_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn.metrics import fbeta_score
knn_model.fit(X_train.values, y_train)
y_pred = knn_model.predict(X_test.values)
fbeta = fbeta_score(y_test, y_pred, average='binary', beta=0.5)
print("F-beta Score:", fbeta)
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
for n in range(10, 30):
knn_model = neighbors.KNeighborsClassifier(n_neighbors=n)
cv_scores = cross_val_score(knn_model, features_final, data_classes, cv=10)
print(n, cv_scores.mean())
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
nb_model = MultinomialNB()
cv_scores = cross_val_score(nb_model, features_final, data_classes, cv=10)
mean_cv_score = cv_scores.mean()
print("Mean Cross-Validation Score:", mean_cv_score)
from sklearn.metrics import fbeta_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
nb_model = MultinomialNB()
nb_cv_score = cross_val_score(nb_model, features_final, data_classes, cv=10)
16
print("Multinomial Naive Bayes Mean Cross-Validation Score:", nb_cv_score.mean())
nb_model.fit(X_train.values, y_train)
y_pred_nb = nb_model.predict(X_test.values)
fbeta_nb = fbeta_score(y_test, y_pred_nb, average='binary', beta=0.5)
print("Multinomial Naive Bayes F-beta Score:", fbeta_nb)
lr_model = LogisticRegression()
lr_cv_score = cross_val_score(lr_model, features_final, data_classes, cv=10)
print("Logistic Regression Mean Cross-Validation Score:", lr_cv_score.mean())
lr_model.fit(X_train.values, y_train)
y_pred_lr = lr_model.predict(X_test.values)
fbeta_lr = fbeta_score(y_test, y_pred_lr, average='binary', beta=0.5)
print("Logistic Regression F-beta Score:", fbeta_lr)
from sklearn.metrics import fbeta_score, accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
scorers = {
'fbeta_score': make_scorer(fbeta_score, beta=0.5),
'accuracy_score': make_scorer(accuracy_score)
}
X_train, X_test, y_train, y_test = train_test_split(features_final, data_classes,
test_size=0.2, random_state=1)
model = RandomForestClassifier()
param_grid = {
'n_estimators': [10, 50, 100], # Example values for tuning
'max_depth': [None, 10, 20, 30] # Example values for tuning
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
scoring=scorers, refit='fbeta_score', cv=10)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_scores = grid_search.best_score_
best_estimator = grid_search.best_estimator_
y_pred = best_estimator.predict(X_test)
fbeta = fbeta_score(y_test, y_pred, beta=0.5)
accuracy = accuracy_score(y_test, y_pred)
print("F-beta Score:", fbeta)
print("Accuracy Score:", accuracy)
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.svm import SVC
17
from sklearn.model_selection import GridSearchCV
def f_beta_score(y_true, y_predict):
return fbeta_score(y_true, y_predict, beta=0.5)
clf = SVC(random_state=1)
parameters = {'C': range(1, 6), 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'degree':
range(1, 6)}
scorer = make_scorer(f_beta_score)
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, scoring=scorer,
cv=10)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
best_scores = grid_search.best_score_
best_estimator = grid_search.best_estimator_
from sklearn.model_selection import GridSearchCV
grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train.values, y_train)
best_clf = grid_fit.best_estimator_
predictions = clf.fit(X_train.values, y_train).predict(X_test.values)
best_predictions = best_clf.predict(X_test.values)
print("Unoptimized Model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test,
predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions,
beta=0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test,
best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test,
best_predictions, beta=0.5)))

VISUAL.PY

# Suppress matplotlib user warnings


# Necessary for newer version of matplotlib
import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")
#
# Display inline matplotlib plots with IPython
18
from IPython import get_ipython
# get_ipython().run_line_magic('matplotlib', 'inline')
###########################################
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.cm as cm
import seaborn as sns
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score, accuracy_score

def distribution(data, feature_label, transformed = False):


"""
Visualization code for displaying skewed distributions of features
"""

sns.set()
sns.set_style("whitegrid")
# Create figure
fig = plt.figure(figsize = (11,5));

# Skewed feature plotting


for i, feature in enumerate([feature_label]):
ax = fig.add_subplot(1, 2, i+1)
ax.hist(data[feature], bins = 25, color = '#00A0A0')
ax.set_title("'%s' Feature Distribution"%(feature), fontsize = 14)
ax.set_xlabel(feature_label)
ax.set_ylabel("Total Number")
ax.set_ylim((0, 1500))
ax.set_yticks([0, 200, 400, 600, 800])
ax.set_yticklabels([0, 200, 400, 600, 800, ">1000"])

# Plot aesthetics
if transformed:
fig.suptitle("Log-transformed Distributions", \
19
fontsize = 16, y = 1.03)
else:
fig.suptitle("Skewed Distributions", \
fontsize = 16, y = 1.03)

fig.tight_layout()
fig.show()

def visualize_classification_performance(results):
"""
Visualization code to display results of various learners.

inputs:
- results: a list of dictionaries of the statistic results from 'train_predict_evaluate()'
"""

# Create figure
sns.set()
sns.set_style("whitegrid")
fig, ax = plt.subplots(2, 3, figsize = (11,7))
# print("VERSION:")
# print(matplotlib._version_)
# Constants
bar_width = 0.3
colors = ["#e55547", "#4e6e8e", "#2ecc71"]

# Super loop to plot four panels of data


for k, learner in enumerate(results.keys()):
for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test',
'f_test']):
for i in np.arange(3):

# Creative plot code


ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width =
bar_width, color = colors[k])
ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45])
ax[j//3, j%3].set_xticklabels(["1%", "10%", "100%"])
20
ax[j//3, j%3].set_xlabel("Training Set Size")
ax[j//3, j%3].set_xlim((-0.1, 3.0))

# Add unique y-labels


ax[0, 0].set_ylabel("Time (in seconds)")
ax[0, 1].set_ylabel("Accuracy Score")
ax[0, 2].set_ylabel("F-score")
ax[1, 0].set_ylabel("Time (in seconds)")
ax[1, 1].set_ylabel("Accuracy Score")
ax[1, 2].set_ylabel("F-score")

# Add titles
ax[0, 0].set_title("Model Training")
ax[0, 1].set_title("Accuracy Score on Training Subset")
ax[0, 2].set_title("F-score on Training Subset")
ax[1, 0].set_title("Model Predicting")
ax[1, 1].set_title("Accuracy Score on Testing Set")
ax[1, 2].set_title("F-score on Testing Set")

# Add horizontal lines for naive predictors


ax[0, 1].axhline(y = 1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle
= 'dashed')
ax[1, 1].axhline(y = 1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle
= 'dashed')
ax[0, 2].axhline(y = 1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle
= 'dashed')
ax[1, 2].axhline(y = 1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle
= 'dashed')

# Set y-limits for score panels


ax[0, 1].set_ylim((0, 1))
ax[0, 2].set_ylim((0, 1))
ax[1, 1].set_ylim((0, 1))
ax[1, 2].set_ylim((0, 1))

# Create patches for the legend


patches = []
for i, learner in enumerate(results.keys()):
21
patches.append(mpatches.Patch(color = colors[i], label = learner))
plt.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \
loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large')

# Aesthetics
plt.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize
= 16, y = 1.10)
plt.tight_layout(pad=1, w_pad=2, h_pad=5.0)
plt.show()

def feature_plot(importances, X_train, y_train):

# Display the five most important features


indices = np.argsort(importances)[::-1]
columns = X_train.columns.values[indices[:11]]
values = importances[indices][:11]

sns.set()
sns.set_style("whitegrid")

# Creat the plot


fig = plt.figure(figsize = (12,5))
plt.title("Normalized Weights for First Five Most Predictive Features", fontsize =
16)
plt.bar(np.arange(11), values, width = 0.2, align="center", label = "Feature Weight")
# plt.bar(np.arange(11) - 0.3, np.cumsum(values), width = 0.2, align = "center",
color = '#00A0A0', \
# label = "Cumulative Feature Weight")
plt.xticks(np.arange(11), columns)
plt.xlim((-0.5, 4.5))
plt.ylabel("Weight", fontsize = 12)
plt.xlabel("Feature", fontsize = 12)

plt.legend(loc = 'upper center')


plt.tight_layout()
plt.show()

22
CHAPTER 6

RESULT ANALYSIS

Nerve damage Class/ASD

Figure 5.1

23
Nerve damage-Autism

Figure 5.2

24
Gender

Figure 5.3

1. Decision Tree Classifier:


True labels: [1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0]
Predicted labels: [1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0]
Confusion matrix: [[101 0]
[ 0 40]]
True Positives (TP): 40
False Positives (FP): 0
True Negatives (TN): 101
False Negatives (FN): 0
Accuracy: 1.0
Error: 0.0
Precision: 1.0
Score: 1.0

2. Random Forest Classifier:


Mean Cross-Validation Score: 0.9900603621730383

25
3. Support Vector Machine (SVM):
Mean Cross-Validation Score: 1.0
F-beta Score: 1.0

4. K-Nearest Neighbors (KNN):


Mean Cross-Validation Score: 0.9458752515090543
F-beta Score: 0.9183673469387755

5. Naive Bayes:
Mean Cross-Validation Score: 0.8746277665995976
F-beta Score: 0.7675438596491229

6. Logistic Regression:
Mean Cross-Validation Score: 0.9971428571428571
F-beta Score: 0.9948979591836735

7. Grid Search with Random Forest Classifier:


F-beta Score: 1.0
Accuracy Score: 1.0

8. Grid Search with Support Vector Machine (SVM):


F-beta Score: 1.0
Accuracy Score: 1.0

26
CHAPTER 7

CONCLUSION

This technology is beneficial because it reduces the response time of rescue


teams and helps in the efficient management of resources during emergencies.
The system can be easily installed in vehicles and can operate continuously
without requiring any human intervention. Overall, this IoT-based automatic
vehicle accident detection and rescue system can significantly contribute to the
safety of the people on the roads and can potentially save lives in the event of an
accident.

27
CHAPTER 8

FUTURE ENHANCEMENT

Thus, to summarize, we set out with the hopes of applying machine learning
algorithms, specifically, supervised machine learning techniques that can classify new
patients (new instances) with certain measurable characteristics (the variables) into one of
two categories “patient has ASD” or “patient does not have ASD”. Cleaning the data set
(which documented the characteristics associated with ASD), was challenging in that we had
mostly categorical variables and just two umerical variables, but ultimately we were able to
build such models and found that the algorithm that performs the best in all aspects is the
SVM machine learning algorithm, using a ‘linear kernel’. The SVM outperformed all other
models with respect to cross-validation score, AUC Score, and F-Beta Score, all of score were
1, and thus was as good as our benchmark model. Although the data association made the
prediction very simple, but we feel this work can certainly serve as an invaluable aid for
physicians for detection of new autistic cases.

In our consideration, to build an accurate and robust model, one needs to have larger datasets.
Here the number of instances after cleaning the data were not sufficient enough to claim that
this model is optimum. Looking at the performances of our learning models, nothing can be
improved with this current data set as models are already at their best. After discussing this
issue with a researcher directly working on adult autism, we have realised that it is extremely
difficult to collect a lot of well documented data related to ASD. This ASD dataset has
recently been made public (available from December 2017), and thus not much work has been
done.

28
REFERENCES

[1] Brian Godsey, Think Like a Data Scientist Manning, ISBN: 9781633430273
[2] H. Brink, J. Richards, M. Fetherolf, Real World Machine Learning, Manning, ISBN:
9781617291920
[3] D. Cielen, A. Meysman, M. Ali, Introducing Data Science, Manning ISBN:
9781633430037.
[4] J. Grus, Data Science From Scratch First Principles With Python, O’Reilly ISBN:
9781491901427
[5] A. G´eron, Hands-On Machine Learning with Scikit-Learn & Tensor Flow, O’Reilly
ISBN: 9781491962299
[6] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Second
Edition, Springer 23
[7] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning
with Applications in R, Springer, ISBN 9781461471370
[8] Tabtah, F. (2017). Autism Spectrum Disorder Screening: Machine Learning Adaptation
and DSM-5 Fulfillment. Proceedings of the 1st International Conference on Medical and
Health Informatics 2017, pp.1-6. Taichung City, Taiwan, ACM.
[9] Thabtah, F. (2017). ASDTests. A mobile app for ASD screening. www.asdtests.com
[accessed December 20th, 2017].
[10] Thabtah, F. (2017). Machine Learning in Autistic Spectrum Disorder Behavioural
Research: A Review. Informatics for Health and Social Care Journal. December, 2017 (in
press)

29

You might also like