ml documentation
By
C. MADHUMITHA
(38110284)
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
MARCH - 2022
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
Internal Guide
DR R. SATHYABAMA KRISHNAN
DECLARATION
DATE:
ACKNOWLEDGEMENT
I convey my thanks to Dr. T. Sasikala, M.E., Ph.D., Dean, School of
Computing, and to Dr. L. Lakshmanan, M.E., Ph.D., and Dr. S. Vigneshwari,
M.E., Ph.D., Heads of the Department of Computer Science and
Engineering, for providing the necessary support and details at the
right time during the progressive reviews.
ABSTRACT
Detecting mental health issues early helps professionals diagnose and
treat patients more effectively. In this article, we discuss the
current status of AI in the mental health field and its potential
applications in healthcare. Machine learning techniques can help
address common mental health issues that people face, such as anxiety
and depression. They can also detect patterns and provide helpful
suggestions for addressing the problems. The attribute data has been
reduced using Feature Selection algorithms. Various machine learning
algorithms have been compared in terms of accuracy over the full set of
attributes and a selected set of attributes. Although various algorithms
have been studied, further work is still needed to narrow the gap
between AI and mental health analysis.
TABLE OF CONTENT
LIST OF FIGURES
1 INTRODUCTION
Millions of people around the world are affected by one or more mental
disorders that interfere with their thinking and behaviour. Timely
detection of these issues is challenging but crucial, since it opens
the possibility of offering help to people before the illness gets
worse. One way to accomplish this is to monitor how people express
themselves, for example what and how they write or, a step further,
what emotions they express in their social media communications. In
this study, we analyse two computational representations that aim to
model the presence and changes of the emotions expressed by social
media users. In our evaluation we use recent public data sets for one
mental disorder, depression. The obtained results suggest that the
presence and variability of emotions, captured by the proposed
representations, make it possible to highlight important information
about social media users suffering from depression.
and where to locate the relevant data, as well as to mine, clean, and
present data. Businesses use data scientists to source, manage, and
analyze large amounts of unstructured data.
Required Skills for a Data Scientist:
AI research include reasoning, knowledge representation, planning,
learning, natural language processing, perception and the ability to
move and manipulate objects. General intelligence (the ability to solve
an arbitrary problem) is among the field’s long-term goals. To solve
these problems, AI researchers use versions of search
and mathematical optimization, formal logic, artificial neural networks,
and methods based on statistics, probability and economics. AI also
draws upon computer science, psychology, linguistics, philosophy, and
many other fields.
AI is important because it can give enterprises insights into their
operations that they may not have been aware of previously and
because, in some cases, AI can perform tasks better than humans.
Particularly when it comes to repetitive, detail-oriented tasks like
analyzing large numbers of legal documents to ensure relevant fields
are filled in properly, AI tools often complete jobs quickly and with
relatively few errors.
1.4 Machine Learning
without being explicitly programmed. Machine learning focuses on the
development of computer programs that can change when exposed to
new data. This section covers the basics of machine learning and the
implementation of a simple machine learning algorithm using Python. The
process of training and prediction involves the use of specialized
algorithms: the training data is fed to an algorithm, and the algorithm
uses this training data to make predictions on new test data.
The data set may be bi-class (for example, identifying whether a person
is male or female, or whether a mail is spam or non-spam) or it may be
multi-class. Some examples of classification problems are speech
recognition, handwriting recognition, biometric identification, and
document classification.
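The train-then-predict flow described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the project's own code: the generated data set, the choice of `LogisticRegression`, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic bi-class data standing in for real past data (assumption).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Hold out a quarter of the rows to play the role of new test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Training: the algorithm learns from the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Prediction: the trained model labels rows it has not seen before.
predictions = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print("Predicted classes for the first five test rows:", predictions[:5])
print("Accuracy on the test data:", test_accuracy)
```

For a multi-class problem the same `fit` and `predict` calls apply; only the label column would contain more than two distinct values.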
[Diagram: Past Data trains the Machine Learning model, which analyses new data and predicts the Result]
Fig. 1. Process of Machine Learning
1. Depression
2. Not depression
3 PROPOSED SYSTEM
[Diagram: Hospital Dataset, Training dataset, Classification ML Algorithm, Model]
3.4 Advantages
This report investigates the applicability of machine learning
techniques for mental health prediction in operational conditions.
Finally, it highlights some observations on future research issues,
4 LITERATURE SURVEY
Title : Predicting the Utilization of Mental Health Treatment with
Various Machine Learning Algorithms
Author: MEERA SHARMA, SONOK MAHAPATRA, ADEETHYIA SHANKAR
Year : 2020
In 2017, about 792 million people (more than 10% of the global
population) lived their lives with a mental disorder [24], 78 million of
which committed suicide because of it. In these unprecedented times of
COVID-19, mental health challenges have been even further
exacerbated as home environments have been proven to be major
sources of the creation and worsening of poor mental health.
Additionally, proper diagnosis and treatment for people with mental
health disorders remains underdeveloped in modern-day society due
to the ever-present public stigma attached to caring about
mental health. Recently there have been attempts in the data science
world to predict if a person is suicidal (and other diagnostic approaches)
yet all face major setbacks. To begin, big data has many ethical issues
related to privacy and reusability without permission—especially in
regards to using feeds from social media. Additionally, people diagnosed
with specific mental health conditions may not actually seek treatment,
so data may be incorrect. In this research, we address both of these
problems by using anonymous datasets to predict the answer to a
different question—whether or not people are seeking mental health
treatment. We also use a large variety of machine learning and deep
learning classifiers and predictive models to predict with a high
accuracy rate through statistical analysis. From our research, we were
able to conclude that machine learning can be used to predict likelihood
of individuals seeking treatment with a high degree of accuracy (76.3% -
82.5%) by utilizing a self-reported questionnaire. Similarly, through a
simple questionnaire that asks enough questions relevant to mental
health, machine learning should also be able to determine if the person
requires treatment. Despite stigma surrounding mental illness,
individuals would be able to utilize machine learning to determine the
correct course of action for their mental illness. As a result, these
individuals would be more productive, reducing social and economic
costs at the tech workplace.
Title : Prediction of Mental Disorder for employees in IT Industry
Author: Sandhya P, Mahek Kantesaria
Year : 2019
Mental health is nowadays a topic that is discussed most frequently
in research but least frequently in personal life. The wellbeing of a
person is the measure of mental health. The increasing use of
technology leads to a lifestyle with less physical work. Also, the
constant pressure on an employee in any industry makes them more
vulnerable to mental disorders. These vulnerabilities include peer
pressure, anxiety attacks, depression, and many more. Here we have
taken a dataset of questionnaires answered by IT industry employees.
The result is derived from their answers: the output is whether or not
the person needs attention. Different machine learning techniques are
used to get the results. This prediction also tells us that it is very
important for an IT employee to get regular mental health check-ups to
track their health. Employers should provide a medical service in their
company and should also give benefits to the affected employees.
There are many suggestions that employers and employees could keep
in mind. Employers need to keep track of the number of their employees
who have a mental disorder. Employers should allow a flexible work
environment with flexible work scheduling and break timings. They
should allow employees to work from home or have a flexible place of
work.
Title : Prediction of Mental Health Problems Among Children Using
Machine Learning Techniques
Author: Ms. Sumathi M.R, Dr. B. Poorna
Year : 2016
Early diagnosis of mental health problems helps the professionals to
treat it at an earlier stage and improves the patients’ quality of life. So,
there is an urgent need to treat basic mental health problems that
prevail among children which may lead to complicated problems, if not
treated at an early stage. Machine learning Techniques are currently
well suited for analyzing medical data and diagnosing the problem. This
research has identified eight machine learning techniques and has
compared their performances on different measures of accuracy in
diagnosing five basic mental health problems. A data set consisting of
sixty cases is collected for training and testing the performance of the
techniques. Twenty-five attributes have been identified as important for
diagnosing the problem from the documents. The attributes have been
reduced by applying Feature Selection algorithms over the full attribute
data set. The accuracy over the full attribute set and selected attribute
set on various machine learning techniques have been compared. It is
evident from the results that the three classifiers viz., Multilayer
Perceptron, Multiclass Classifier and LAD Tree produced more accurate
results and there is only a slight difference between their performances
over full attribute set and selected attribute set. Nowadays, a number of
expert systems are utilized in medical domain to predict diseases
accurately at an early stage so that treatment can be made effectively
and efficiently. Also, expert systems are developed in the mental health
domain to predict the mental health problem at an earlier stage. As a
number of machine learning techniques are available to construct
expert systems, it is necessary to compare them and identify the best
that suits the domain of interest. The research has compared eight
machine learning techniques (classifiers) on classifying the dataset to
different mental health problems. It is evident from the results that the
three classifiers viz., Multilayer Perceptron, Multiclass Classifier and LAD
Tree produce more accurate results than the others. The data set is very
small and in future, the research may be applied to a larger data
set to obtain more accuracy. The classifiers need to be trained prior to
the implementation of any technique in real prediction.
5 METHODOLOGY
5.1 Objectives
The goal is to develop a machine learning model for mental health
(depression) prediction that could potentially replace the existing
updatable supervised machine learning classification models. The model
is chosen by comparing supervised algorithms and selecting the one that
predicts with the best accuracy.
Comparing algorithm to predict the result :
6 FEASIBILITY STUDY
For quality control, open the given data, investigate it, and clean it.
Handle the records carefully and make sure that each cleaning decision
is justified.
6.3 Preprocessing
Machine learning requires data collection and a large amount of past
data. The collected data contains sufficient historical and raw data.
The raw data cannot be used directly; it must be preprocessed first.
The preprocessed data is then passed to an algorithm to build the
model. The model is trained and tested so that it works and predicts
well with minimal error, and it is tuned over time to improve accuracy.
Data Gathering
Data Pre-Processing
Choose model
Train model
Test model
Tune model
Prediction
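The seven steps listed above can be sketched end to end with scikit-learn. This is an illustrative outline under stated assumptions, not the project's implementation: the synthetic data, the `StandardScaler` pre-processing step, the decision tree model, and the `max_depth` grid are all choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Data gathering: synthetic records stand in for the real data set (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Data pre-processing and model choice, chained so that the scaler is
# fitted on training data only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Train and tune: try several tree depths with 5-fold cross-validation.
search = GridSearchCV(
    pipeline, param_grid={"tree__max_depth": [2, 4, 6]}, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# Test: accuracy of the tuned model on the held-out rows.
test_accuracy = search.score(X_test, y_test)
print("Best depth:", search.best_params_["tree__max_depth"])
print("Test accuracy:", test_accuracy)

# Prediction: label a few rows with the tuned model.
new_predictions = search.predict(X_test[:3])
print("Predictions:", new_predictions)
```

Keeping the pre-processing inside the pipeline means each cross-validation fold is scaled using only its own training portion, which avoids leaking test information into tuning.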
7 PROJECT REQUIREMENTS
1. Functional requirements
2. Non-Functional requirements
3. Technical requirements
A. Hardware requirements
B. Software requirements
I. Define the problem
II. Prepare the data
III. Evaluate algorithms
IV. Improve results
V. Predict the result
7.3 Technical Requirements
Software Requirements:
Hardware requirements:
RAM : minimum 2 GB
8 SOFTWARE DESCRIPTION
8.1 ANACONDA
The command-line program conda is both a package manager and an
environment manager. This helps data scientists ensure that each
version of each package has all the dependencies it requires and works
correctly.
◻ JupyterLab
◻ Jupyter Notebook
◻ Spyder
◻ PyCharm
◻ VSCode
◻ Glueviz
◻ Orange 3 App
◻ RStudio
◻ Anaconda Prompt (Windows only)
◻ Anaconda PowerShell (Windows only)
Fig. 4. Anaconda Navigator (1)
Fig. 5. Anaconda Navigator (2)
Conda :
8.2 JUPYTER NOTEBOOK
Save notebooks:
Modifications to the notebooks are automatically saved every few
minutes. To avoid modifying the original notebook, make a copy of the
notebook document (menu File -> Make a Copy…) and save the
modifications on the copy.
Executing a notebook:
Download the notebook you want to execute and put it in your notebook
folder (or a sub-folder of it).
❖ Launch the Jupyter Notebook App
The Jupyter Notebook App can be executed on a local desktop
requiring no internet access (as described in this document) or can be
installed on a remote server and accessed through the internet.
8.2.2 Kernel
A notebook kernel is a “computational engine” that executes the code
contained in a Notebook document. The IPython kernel, referenced in this
guide, executes Python code. Kernels for many other languages exist
(official kernels).
When a Notebook document is opened, the associated kernel is
automatically launched. When the notebook is executed (either cell-by-
cell or with menu Cell -> Run All), the kernel performs the computation
and produces the results.
Notebook Dashboard:
The Notebook Dashboard is the component which is shown first when
you launch Jupyter Notebook App. The Notebook Dashboard is mainly
used to open notebook documents, and to manage the running kernels
(visualize and shutdown).
The Notebook Dashboard has other features similar to a file manager,
namely navigating folders and renaming/deleting files.
◻ Load a dataset and understand its structure using statistical
summaries and data visualization.
◻ Build machine learning models, pick the best, and build confidence
that the accuracy is reliable.
Python is a popular and powerful interpreted language. Unlike R,
Python is a complete language and platform that you can use for both
research and development and developing production systems. There
are also a lot of modules and libraries to choose from, providing multiple
ways to do each task. It can feel overwhelming.
8.4 PYTHON
9 SYSTEM ARCHITECTURE:
9.1 WORK FLOW DIAGRAM
[Diagram: Source Data is split into a Training Dataset and a Testing Dataset]
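The split of source data into a training dataset and a testing dataset can be sketched as follows. The 30% test size, `random_state=1`, and stratification mirror the settings in the appendix code; the tiny synthetic arrays and their 20% positive rate are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical source data: 100 rows, 4 features, 20 positive labels.
X = np.arange(400).reshape(100, 4)
y = np.array([1] * 20 + [0] * 80)

# Stratifying on y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print("Training rows:", X_train.shape[0])
print("Testing rows:", X_test.shape[0])
print("Positive share in train:", y_train.mean())
print("Positive share in test:", y_test.mean())
```

Stratification matters most when one class is rare, because a purely random split could leave the test set with very few positive cases.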
10 MODULE DESCRIPTION
Some of these sources are just simple random mistakes. Other times,
there can be a deeper reason why data is missing. It is important
to understand these different types of missing data from a statistical
point of view, because the type of missing data influences how missing
values are detected and filled in, from basic imputation to a more
detailed statistical approach. Before jumping into code, it is
important to understand the sources of missing data. Here are some
typical reasons why data is missing:
● Users chose not to fill out a field tied to their beliefs about how
the results would be used or interpreted.
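Detecting missing values and applying basic imputation, as discussed above, can be sketched with pandas. The small table below is hypothetical: the column names echo the appendix code, but the values are invented for the example, and median/mode imputation is just one simple option rather than the method used in this project (the appendix code drops incomplete rows with `dropna`).

```python
import numpy as np
import pandas as pd

# Small hypothetical table; the values are illustrative only.
df = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "education_level": [8, 10, np.nan, 10],
    "depressed": [0, 1, 0, 0],
})

# Detect: count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: basic imputation. Fill a numeric column with its median
# and a coded column with its most frequent value.
imputed = df.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["education_level"] = imputed["education_level"].fillna(
    imputed["education_level"].mode()[0]
)
print(imputed)
```

Dropping rows is the simplest choice but discards information; imputation keeps every row at the cost of introducing estimated values.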
Variable identification with Uni-variate, Bi-variate and Multi-variate analysis:
Fig. 10. Data Type Identification
MODULE DIAGRAM
input : data
patterns, corrupt data, outliers, and much more. With
a little domain knowledge, data visualizations can be used to express
and demonstrate key relationships in plots and charts that are more
visceral to stakeholders than measures of association or significance.
Data visualization and exploratory data analysis are whole fields in
themselves, and a deeper dive into some of the books mentioned at the
end is recommended.
Sometimes data does not make sense until it can be looked at in a visual
form, such as with charts and plots. Being able to quickly visualize
data samples is an important skill both in applied statistics
and in applied machine learning. This module covers the types of plots
that you will need to know when visualizing data in Python and how to
use them to better understand your own data.
◻ How to chart time series data with line plots and categorical
quantities with bar charts.
◻ How to summarize data distributions with histograms and box plots.
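The four plot types named above can be sketched with Matplotlib. All of the numbers below are invented for illustration, and the output file name `plots_overview.png` is a hypothetical choice; the off-screen `Agg` backend is selected only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for the four plot types.
rng = np.random.default_rng(0)
ages = rng.integers(18, 80, size=200)
monthly_counts = [12, 15, 11, 18, 21, 17]
categories = ["married", "not married"]
category_counts = [120, 80]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Line plot for time series data.
axes[0, 0].plot(range(1, 7), monthly_counts, marker="o")
axes[0, 0].set_title("Line plot (time series)")

# Bar chart for categorical quantities.
axes[0, 1].bar(categories, category_counts)
axes[0, 1].set_title("Bar chart (categorical quantities)")

# Histogram and box plot for summarizing a distribution.
axes[1, 0].hist(ages, bins=10)
axes[1, 0].set_title("Histogram (distribution)")
axes[1, 1].boxplot(ages)
axes[1, 1].set_title("Box plot (distribution summary)")

fig.tight_layout()
fig.savefig("plots_overview.png")
```

In a Jupyter notebook the backend line can be dropped and `plt.show()` used instead of `savefig`.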
Fig. 13. Members in Family vs Depression
MODULE DIAGRAM
input : data
10.3 Comparing Algorithm with prediction in the form of best accuracy result
algorithm. The machine learning models are built using the Scikit-Learn
library. From this package we use preprocessing, the linear model with
the logistic regression method, cross-validation by the KFold method,
the ensemble module with the random forest method, and the tree module
with the decision tree classifier. Additionally, the data is split into
a train set and a test set, and the result is predicted by comparing
accuracy.
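Comparing the three classifiers named above by cross-validated accuracy can be sketched as follows. This is a minimal illustration on synthetic data, not the project's results: the data set, the five folds, and the hyperparameters are assumptions, and the printed scores say nothing about performance on the depression data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real data set (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

# The same shuffled 5-fold split is used for every model so that the
# accuracy scores are comparable.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Pick the model with the highest mean accuracy across folds.
best_model = max(mean_accuracy, key=mean_accuracy.get)
print("Best by mean accuracy:", best_model)
```

Accuracy alone can be misleading when the classes are imbalanced, so the confusion matrix, sensitivity, and specificity are worth checking alongside it.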
sklearn:
NumPy:
Matplotlib:
Pandas:
Fig. 16. Confusion Matrix of Logistic Regression
MODULE DIAGRAM
input : data
Fig. 19. Confusion Matrix of Random Forest
MODULE DIAGRAM
input : data
11 CONCLUSION AND FUTURE WORK
Conclusion
evaluation. The algorithm with the highest accuracy score on the public
test set is identified as the best. This application can help to predict
mental health.
Future Work
12 APPENDICES
A.SAMPLE CODE
Module – 1
Pre-Processing
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

# Load given dataset
data = pd.read_csv("depressed.csv")

# Before dropping rows with missing values:
data.head()

# shape
data.shape

# After dropping rows with missing values:
df = data.dropna()

df.head()

# shape
df.shape

# columns
df.columns

# To describe the dataframe
df.describe()

# Unique values of each attribute
df.sex.unique()
df.Age.unique()
df.Married.unique()
df.education_level.unique()
df.incoming_agricultural.unique()
df.total_members_in_family.unique()
df.incoming_own_farm.unique()
df.depressed.unique()

# Before Pre_Processing:
df.head()
After Pre_Processing:
Module – 2
Visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and drop rows with missing values
data = pd.read_csv("depressed.csv")
df = data.dropna()

df.columns

pd.crosstab(df.Married, df.depressed)

# Proportion by variable: pie chart plus the percentage share of each value
def PropByVar(df, variable):
    dataframe_pie = df[variable].value_counts()
    ax = dataframe_pie.plot.pie(figsize=(10, 10), autopct='%1.2f%%', fontsize=12)
    ax.set_title(variable + ' \n', fontsize=15)
    return np.round(dataframe_pie / df.shape[0] * 100, 2)

PropByVar(df, 'education_level')

# Box plot of the age distribution
fig, ax = plt.subplots(figsize=(15, 6))
sns.boxplot(x=df.Age, ax=ax)
plt.title("Age distribution")
plt.show()

# Pairwise relationships between all attributes
sns.pairplot(df)
plt.show()

# Age distribution by depression status
fig, ax = plt.subplots(figsize=(15, 6))
sns.violinplot(y=df['Age'], x=df['depressed'], ax=ax)
plt.title("Depressed Persons and their age")
plt.show()

# Heatmap plot diagram
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(df.corr(), ax=ax, annot=True)
Splitting Train/Test:
Module – 3
Logistic Regression
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

# Load given dataset
data = pd.read_csv("depressed.csv")

df = data.dropna()

df.columns

# Metrics used to evaluate the model on the test set
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

# Predictor variables
X = df.drop(labels='depressed', axis=1)
# Response variable
y = df.loc[:, 'depressed']

# We'll use a test size of 30%. We also stratify the split on the response
# variable so that the train and test sets keep the same class proportions.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1, stratify=y)

# Logistic Regression:
from sklearn.linear_model import LogisticRegression

logR = LogisticRegression()
logR.fit(X_train, y_train)
predictR = logR.predict(X_test)
print("")
print('Classification report of Logistic Regression Results:')
print("")
print(classification_report(y_test, predictR))
print("")
cm = confusion_matrix(y_test, predictR)
print('Confusion Matrix result of Logistic Regression is:\n', cm)
print("")
# scikit-learn lays out cm as [[TN, FP], [FN, TP]]; class 1 (depressed) is the positive class
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
print('Sensitivity : ', sensitivity)
print("")
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
print('Specificity : ', specificity)
print("")
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm2, title='Confusion matrix-LogisticRegression', cmap=plt.cm.Blues):
    # Tick labels are the two classes, in the order used by confusion_matrix
    target_names = ['Not depression', 'Depression']
    plt.imshow(cm2, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    plt.tight_layout()
    plt.ylabel('True label')
Module – 4
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

# Load given dataset
data = pd.read_csv("depressed.csv")

df = data.dropna()

df.columns

# Metrics used to evaluate the model on the test set
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

# Predictor variables
X = df.drop(labels='depressed', axis=1)
# Response variable
y = df.loc[:, 'depressed']

# We'll use a test size of 30%. We also stratify the split on the response
# variable so that the train and test sets keep the same class proportions.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1, stratify=y)

# RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
predictR = rfc.predict(X_test)
print("")
print('Classification report of Random Forest Classifier Results:')
print("")
print(classification_report(y_test, predictR))
print("")
cm = confusion_matrix(y_test, predictR)
print('Confusion Matrix result of Random Forest Classifier is:\n', cm)
print("")
# scikit-learn lays out cm as [[TN, FP], [FN, TP]]; class 1 (depressed) is the positive class
sensitivity = cm[1, 1] / (cm[1, 0] + cm[1, 1])
print('Sensitivity : ', sensitivity)
print("")
specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
print('Specificity : ', specificity)
print("")

# Accuracy in percent for the bar chart. Assumption: the listing plots a
# variable LR that is never defined, so test-set accuracy is used here.
accuracy = accuracy_score(y_test, predictR) * 100

def graph():
    import matplotlib.pyplot as plt
    data = [accuracy]
    alg = "Random Forest Classifier"
    plt.figure(figsize=(5, 5))
    b = plt.bar(alg, data, color="b")
    plt.title("Accuracy comparison of Depression", fontsize=15)
    plt.legend(b, data, fontsize=9)

graph()

# Cells of the confusion matrix [[TN, FP], [FN, TP]]
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]
print("True Positive :", TP)
print("True Negative :", TN)
Module – 5
Flask Deploy
import numpy as np
from flask import Flask, request, jsonify, render_template
import pickle
import joblib

app = Flask(__name__)
# Load the trained classifier. "model.pkl" is an assumed file name;
# the listing does not show where the model was saved.
model = joblib.load("model.pkl")

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    '''
    For rendering results on HTML GUI
    '''
    # Form values arrive as strings, so convert them to numbers
    int_features = [float(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    print(final_features)
    prediction = model.predict(final_features)
    print(prediction)
    output = prediction[0]
    print(output)
    if output == 1:
        return render_template('index.html', prediction_text='Person in depression')
    else:
        return render_template('index.html', prediction_text='Person not in depression')

if __name__ == "__main__":
    app.run()
B.SCREENSHOTS
C. REFERENCES
Information Retrieval: 42nd European Conference on IR Research, ECIR
2020, Lisbon, Portugal, 2020.
[10] S. Burdisso, M. Errecalde, and M. Montes-y-Gómez, “A text
classification framework for simple and effective early depression
detection over social media streams,” Expert Systems With Applications,
Vol. 133, 2019.