NLP and ML Project

The document describes steps to detect depression related posts on Reddit using machine learning techniques. It discusses preprocessing text data which includes removing URLs, mentions, punctuation and stemming. Feature extraction methods like n-grams, LIWC, LDA and TF-IDF are explained. Classification algorithms like logistic regression, support vector machine, random forest, adaptive boosting and multilayer perceptron are discussed. The document provides details about various natural language processing and machine learning concepts used to build a model for detecting depression from social media text data.

Uploaded by

Ahhishek Varanasi


DETECTION OF DEPRESSION RELATED POSTS

IN REDDIT SOCIAL MEDIA FORUM

GUIDE:
Dr. M. Sree Latha
Prof. in CSE
BATCH NO: 7

TEAM MEMBERS:
P. Sai Yasaswini
Nitheesha. B
K. Harish Kumar
INDEX
01 STEPS AND DATASET

02 PREPROCESSING

03 FEATURE EXTRACTION

04 CLASSIFICATION

05 CONCLUSION
STEPS:
• PREPROCESSING
• FEATURE EXTRACTION
• CLASSIFICATION
DATA SET
DATASET: SENTIMENT 140

DESCRIPTION:
• This sentiment dataset consists of 1,600,000 tuples and 6 columns.
• Of these 1,600,000 tuples, 800,000 are depression related tweets. The other 800,000 are non-depression related tweets.
• We mainly focus on two columns, namely Sentiment and Tweet.

COLUMNS:
• SENTIMENT (POLARITY)
• ID
• DATE
• FLAG (NO_QUERY)
• USERNAME
• TWEET

FILE: dataset.csv
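A minimal sketch of reading the two columns of interest, assuming the file has no header row and uses the column order listed above. The function name and the two sample rows are made up for illustration; the polarity values 0 and 4 follow the public Sentiment140 convention.

```python
import csv
import io

# Column order as listed on the slide (assumed to match the file layout).
COLUMNS = ["sentiment", "id", "date", "flag", "username", "tweet"]

def load_sentiment_and_tweets(handle):
    """Return (sentiment, tweet) pairs from an open, header-less CSV handle."""
    pairs = []
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        pairs.append((int(record["sentiment"]), record["tweet"]))
    return pairs

# Two made-up rows in the same layout, used here instead of the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 2009","NO_QUERY","user_a","feeling low today"\n'
    '"4","2","Mon Apr 06 2009","NO_QUERY","user_b","great day, all good"\n'
)
rows = load_sentiment_and_tweets(sample)
print(rows)
```

With the real file, the same function could be given an open handle for dataset.csv; the encoding may need to be set explicitly when opening it.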
PREPROCESSING
STEP-1
• REMOVING URLS
• REMOVING MENTIONS
• REMOVING PUNCTUATIONS

STEP-2
• STEMMING

STEP-3
• STOP WORDS

RESULT
• CLEAN TWEETS
HOW TO DO:

• URLS, MENTIONS: 're' and BEAUTIFUL SOUP, with the patterns r'https?://[^ ]+' and r'@[A-Za-z0-9_]+'
• STEMMING: NLTK STEM, WORDNETLEMMATIZER
• STOPWORDS: NLTK CORPUS, STOPWORDS
• RESULT: CLEAN TWEETS TO DATA FRAMES
EXAMPLE(1)
@caregiving couldn't bear to watch it. And I thought the UA loss was embarrassing . . . . .

Removing URLs and mentions, expanding negations, lowercasing

could not bear to watch it and thought the ua loss was embarrassing

Applying Lemmatization

could not bear to watch it and think the ua loss be embarrass

Removing Stop words

could not bear watch think ua loss embarrass
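The first cleaning step can be sketched with the two regular expressions given in the slides. This is an illustration only: the negation table is a small hypothetical stand-in, punctuation removal is assumed to mean "keep letters only", and the output may differ in detail from the slide (for instance, this sketch keeps the word "i").

```python
import re

MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")   # pattern from the slides
URL_RE = re.compile(r"https?://[^ ]+")       # pattern from the slides

# Hypothetical, deliberately tiny table of negation contractions.
NEGATIONS = {"couldn't": "could not", "don't": "do not", "isn't": "is not"}

def clean_tweet(text):
    """Strip mentions and URLs, lowercase, expand negations, drop punctuation."""
    text = MENTION_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.lower()
    for short, full in NEGATIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)    # keep letters and whitespace only
    return " ".join(text.split())            # collapse repeated spaces

cleaned = clean_tweet(
    "@caregiving couldn't bear to watch it. "
    "And I thought the UA loss was embarrassing . . . . ."
)
print(cleaned)
```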

EXAMPLE(2)
@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D

Removing URLs and mentions, expanding negations, lowercasing

awww that bummer you shoulda got david carr of third day to do it

Applying Lemmatization

awww that bummer you shoulda get david carr of third day to do it

Removing Stop words

awww bummer shoulda get david carr third day
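The final step of both examples, stop word removal, can be sketched as a simple filter. The stop word set below is a hypothetical stand-in containing only the words dropped in the two examples; the slides name the NLTK corpus as the actual source of the list.

```python
# Hypothetical stop word set: only the words removed in the two examples.
STOP_WORDS = {"to", "it", "and", "the", "be", "that", "you", "of", "do"}

def remove_stop_words(text):
    """Keep every whitespace-separated token that is not a stop word."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

example_1 = remove_stop_words(
    "could not bear to watch it and think the ua loss be embarrass"
)
example_2 = remove_stop_words(
    "awww that bummer you shoulda get david carr of third day to do it"
)
print(example_1)
print(example_2)
```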

IMPLEMENTATION
• PREPROCESSING: RE, NLP
• FEATURE EXTRACTION: NLP
• CLASSIFICATION: ML
• RESULT
FEATURE EXTRACTION
INPUT: CLEAN TWEETS

• N-GRAMS: Used to calculate the probability of co-occurrence of each input sentence as a unigram and bigram.
• LDA: Latent Dirichlet Allocation is a probabilistic generative model helpful in discovering underlying topic structures.
• LIWC: The Linguistic Inquiry and Word Count dictionary can be used to obtain scores for standard linguistic dimensions, psychological processes and personal concerns.
• TFIDF: Term Frequency - Inverse Document Frequency is a numeric statistic which highlights the importance of a word w.r.t. each document.
N-GRAM MODELING

•Used to calculate the probability of co-occurrence of each input sequence as unigrams and bigrams.

•An N-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n-1)-order Markov model.
N-GRAM MODELING(Contd..)
S1: This movie is Bad.
S2: This movie is Good.

Unigrams: [This, movie, is, Bad, Good]
Bigrams: [[This, movie], [movie, is], [is, Bad], [is, Good]]

Vectorising:

Unigrams:
S1: [1,1,1,1,0]
S2: [1,1,1,0,1]

Bigrams:
S1: [1,1,1,0]
S2: [1,1,0,1]
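The vectors above can be reproduced with a short sketch. It assumes lowercased, whitespace-split tokens and a vocabulary kept in order of first appearance; the helper names are illustrative, not taken from any library.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocab(docs, n):
    """Distinct n-grams over all documents, in order of first appearance."""
    vocab = []
    for tokens in docs:
        for gram in ngrams(tokens, n):
            if gram not in vocab:
                vocab.append(gram)
    return vocab

def vectorise(tokens, vocab, n):
    """Binary vector: 1 if the n-gram occurs in the sentence, else 0."""
    present = set(ngrams(tokens, n))
    return [1 if gram in present else 0 for gram in vocab]

sentences = ["This movie is Bad.", "This movie is Good."]
docs = [s.lower().rstrip(".").split() for s in sentences]

unigram_vocab = build_vocab(docs, 1)
bigram_vocab = build_vocab(docs, 2)
unigram_vectors = [vectorise(d, unigram_vocab, 1) for d in docs]
bigram_vectors = [vectorise(d, bigram_vocab, 2) for d in docs]
print(unigram_vectors, bigram_vectors)
```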
N-GRAM MODELING(Contd..)
•TF-IDF will be used as a numeric statistic which highlights the importance of a word w.r.t. each document.

•TF: Term Frequency
Number of times a particular word has occurred in a given document.

•IDF: Inverse Document Frequency
Derived from the number of documents in which the word has appeared: the fewer documents contain the word, the higher its IDF.

•Together called TF-IDF
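A minimal sketch of one common TF-IDF variant, assuming raw counts for TF and log(N / df) for IDF, where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed or normalised variants, so their numbers may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

scores = tf_idf([
    ["this", "movie", "is", "bad"],
    ["this", "movie", "is", "good"],
])
print(scores[0])
```

Words shared by every document ("this", "movie", "is") get weight 0, while the distinguishing word in each document gets log(2).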

LIWC
Linguistic Inquiry and Word Count
• The LIWC dictionary used in this demonstration is composed of 5,690 words and word stems. Each word or word stem defines one or more word categories.

• For example, the word 'cried' is part of four word categories: sadness, negative emotion, overall affect, and a past tense verb. Hence, if it is found in the target text, each of these four category scale scores will be incremented. As in this example, many of the LIWC categories are arranged hierarchically. All anger words, by definition, will be categorized as negative emotion and overall emotion words.
LIWC(Contd..)

•Basically, it reads a given text and counts the percentage of words that reflect different emotions, thinking styles, social concerns, and even parts of speech.

•LIWC Dictionary
For each dictionary word, there is a corresponding dictionary entry that defines one or more word categories.
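The counting idea can be sketched with a toy dictionary. The real LIWC dictionary is a separate product and is not reproduced here: only the 'cried' entry follows the slides, and the 'angry' entry and the function name are hypothetical.

```python
# Hypothetical mini-dictionary; only the 'cried' entry comes from the slides.
MINI_DICTIONARY = {
    "cried": ["sadness", "negative emotion", "overall affect", "past tense verb"],
    "angry": ["anger", "negative emotion", "overall affect"],
}

def category_percentages(text):
    """Percentage of words in the text that fall into each category."""
    words = text.lower().split()
    counts = {}
    for word in words:
        for category in MINI_DICTIONARY.get(word, []):
            counts[category] = counts.get(category, 0) + 1
    return {c: 100.0 * n / len(words) for c, n in counts.items()}

scores = category_percentages("I cried all day")
print(scores)
```

One matching word out of four gives each of its four categories a score of 25 percent.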
LDA
Latent Dirichlet Allocation

•Used for topic modeling.

•Generative probabilistic model of a collection of composites made up of parts.

•Typically used to detect underlying topics in text documents.

•Particularly useful for finding reasonably accurate mixtures of topics within a given document.
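To make "generative probabilistic model" concrete, the sketch below samples one document the way LDA assumes documents arise: draw a topic mixture for the document, then draw a topic and a word for each position. The two topics, their word probabilities and the alpha value are hypothetical. Fitting LDA to real text works in the opposite direction, inferring topics from words, and is not shown here.

```python
import random

# Hypothetical topics: each maps words to probabilities.
TOPICS = {
    "mood": {"sad": 0.5, "tired": 0.3, "alone": 0.2},
    "sport": {"game": 0.6, "team": 0.4},
}

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, rng):
    """Return (topic mixture, sampled words) for one synthetic document."""
    names = list(TOPICS)
    theta = sample_dirichlet([alpha] * len(names), rng)   # topic mixture
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]       # topic for this word
        vocab = TOPICS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return theta, words

theta, words = generate_document(8, alpha=0.5, rng=random.Random(0))
print(theta, words)
```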
LDA(Contd..)
Assumption

Documents with similar topics will use similar groups of words.

How to do

•NLP
•Topic Modeling
CLASSIFICATION
ALGORITHMS:
• LOGISTIC REGRESSION
• SUPPORT VECTOR MACHINE
• RANDOM FOREST
• ADAPTIVE BOOSTING
• MULTILAYER PERCEPTRON
LOGISTIC REGRESSION
•Simple algorithm used for binary/multivariate classification tasks.

•Sigmoid function is used to map the values into the range (0, 1).

•The output is the class label, either 1 (Yes) or 0 (No).
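A minimal sketch of the prediction step, assuming the weights and bias have already been learned. The 0.5 threshold and the example numbers are illustrative choices, not values from the project.

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, class label)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return probability, 1 if probability >= threshold else 0

probability, label = predict([0.0, 0.0], [1.0, 1.0], -2.0)
print(probability, label)
```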
LOGISTIC REGRESSION(Contd..)

MERITS

•Provides class probabilities through the logistic model

•Easy to interpret
•Quickly update model to incorporate new data.

DEMERITS

•Suffers from multicollinearity.
•Sensitive to extreme values of continuous variables.
SUPPORT VECTOR MACHINE

•Discriminative classifier formally defined by a separating hyperplane.

•Given labeled data, SVM outputs an optimal hyperplane which categorizes new examples.

•In a two-dimensional space, the hyperplane is a line dividing the plane into two parts, with each class lying on either side.
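Once a hyperplane w·x + b = 0 is known, classifying a new example only requires checking which side it falls on. In the sketch below the line x1 + x2 - 1 = 0 is hand-picked for illustration; an SVM would obtain w and b through training, which is not shown.

```python
def side_of_hyperplane(point, w, b):
    """Return +1 or -1 depending on the sign of w . x + b."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return 1 if score >= 0 else -1

# Hypothetical separating line in two dimensions: x1 + x2 - 1 = 0
W, B = (1.0, 1.0), -1.0

labels = [side_of_hyperplane(p, W, B) for p in [(2.0, 2.0), (0.0, 0.0)]]
print(labels)
```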
SUPPORT VECTOR MACHINE(Contd..)

MERITS

•Good prediction in a variety of situations.

•Low generalization errors.
•Easy to interpret results.

DEMERITS

•Computationally expensive.
•Complexity is high.
•Requires more memory and time for training the model.
RANDOM FOREST

• Ensemble method which uses multiple learning models to gain better predictive results.

• Creates a forest with a number of decision trees.

• Decision trees are created from randomly selected subsets of the training set.

• Aggregates the votes from different decision trees to decide the final class of the test object.
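Two of the steps above, random subsets and vote aggregation, can be sketched without training any trees. Sampling with replacement (bootstrapping) is assumed here as the way subsets are drawn, and the five votes are made up.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Random subset of the training set, drawn with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Final class: the label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]

# Each tree would be trained on its own sample of the training rows.
sample = bootstrap_sample(list(range(10)), random.Random(0))

# Hypothetical votes from five already-trained trees for one test tweet.
votes = [1, 0, 1, 1, 0]
final_label = majority_vote(votes)
print(sample, final_label)
```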
RANDOM FOREST(Contd..)

MERITS

•Efficient on large datasets.

•Deals with high dimensional data.

DEMERITS

•Observed to overfit on some datasets with noisy classification tasks.

•Large number of trees makes the algorithm slow for real-time prediction.
ADAPTIVE BOOSTING

•First practical boosting algorithm, proposed by Freund and Schapire in 1996.

•Focuses on classification problems and aims to convert a set of weak classifiers into a strong one.

•Used to boost the performance of a machine learning algorithm.
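How weak classifiers are combined can be illustrated with a single AdaBoost round in its standard binary form: the weak classifier's weighted error gives its weight alpha, and misclassified samples are then up-weighted for the next round. The sketch assumes the error lies strictly between 0 and 1; the four-sample example is made up.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round. Returns (alpha, new_weights).

    weights: current sample weights, summing to 1.
    correct: True where the weak classifier got the sample right.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Shrink weights of correct samples, grow weights of mistakes.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.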
ADAPTIVE BOOSTING(Contd..)
MERITS

•Simple to implement
•Performs feature selection, resulting in a simple classifier
•Fairly good generalization

DEMERITS

•Sensitive to noisy data and outliers

MULTI LAYER PERCEPTRON
•Class of feedforward artificial neural network.

•Consists of at least three layers: an input layer, a hidden layer and an output layer.

•Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

•Utilizes a supervised learning technique called backpropagation for training.

•Can distinguish data that is not linearly separable.
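XOR is the classic example of data that is not linearly separable. The sketch below is a forward pass only, with hand-picked weights rather than weights learned by backpropagation, and it simplifies the output node to a linear combination. It shows that one hidden layer with a nonlinear activation (ReLU here) is enough to compute XOR.

```python
def relu(x):
    """Nonlinear activation: pass positive values, clamp negatives to 0."""
    return max(0.0, x)

def mlp_xor(x1, x2):
    # Hidden layer: two neurons with hand-picked weights and biases.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output layer: linear combination of the hidden activations.
    return h1 - 2.0 * h2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [mlp_xor(a, b) for a, b in inputs]
print(outputs)
```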

MULTI LAYER PERCEPTRON(Contd..)

MERITS

•Powerful, can model complex functions

•Adapts to unknown situations

DEMERITS

•Can get stuck in a local minimum

•Number of hidden layers is to be set by the user
EXTENSION

•Existing methods are developed based on single features.

•Single features may be inefficient and result in lower accuracy.

•To increase the accuracy, combinations of these features are to be taken.
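One simple way to combine features is to concatenate the per-feature vectors of a tweet into a single input vector for the classifier. This is a sketch of that idea only; the slides do not say how the combination is done, and the LIWC scores below are hypothetical.

```python
def combine_features(*feature_vectors):
    """Concatenate several feature vectors into one."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined

unigram_vector = [1, 1, 1, 1, 0]   # S1 from the n-gram slides
bigram_vector = [1, 1, 1, 0]       # S1 from the n-gram slides
liwc_scores = [25.0, 0.0]          # hypothetical category scores

combined = combine_features(unigram_vector, bigram_vector, liwc_scores)
print(combined)
```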
CONCLUSION

•Perform Data Preprocessing, Feature Extraction and Classification.

•Multiple models are to be built to perform feature extraction and classification.

•Determine the best model, i.e. the one which gives the highest accuracy.
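Choosing the best model by accuracy can be sketched as follows. The labels and predictions are made-up toy values used only to show the comparison; they are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

# Toy labels and hypothetical predictions from three models.
y_true = [1, 0, 1, 1, 0]
predictions = {
    "logistic_regression": [1, 0, 1, 0, 0],
    "random_forest": [1, 0, 1, 1, 0],
    "svm": [0, 0, 1, 0, 0],
}

scores = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```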
