NLP and ML Project
GUIDE:
Dr. M. Sree Latha
Prof. in CSE
BATCH NO: 7
TEAM MEMBERS:
P. Sai Yasaswini
Nitheesha. B
K. Harish Kumar
INDEX
01 STEPS AND DATASET
02 PREPROCESSING
03 FEATURE EXTRACTION
04 CLASSIFICATION
05 CONCLUSION
STEPS:
PREPROCESSING
FEATURE EXTRACTION
CLASSIFICATION
DATA SET
DATASET: SENTIMENT 140
DESCRIPTION:
• This sentiment dataset consists of 1,600,000 tuples and 6 columns.
• Out of these 1,600,000 tuples, 800,000 are depression-related tweets. The other 800,000 are non-depression-related tweets.
• We mainly focus on two columns, namely Sentiment and Tweet.
COLUMNS
SENTIMENT (POLARITY)
ID
DATE
FLAG (NO_QUERY)
USERNAME
TWEET
dataset.csv
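A minimal sketch of reading only the Sentiment and Tweet columns from a file laid out like the six columns listed above. It assumes the file has no header row and that polarity is stored as an integer code (0 for negative and 4 for positive in the public Sentiment140 release, as far as we know). The column names, the function name and the two sample rows are hypothetical and stand in for the real file.

```python
import csv
import io

# Column order taken from the slide; the lowercase names are our own labels.
COLUMNS = ["sentiment", "id", "date", "flag", "username", "tweet"]

def load_sentiment_and_tweet(handle):
    """Yield (sentiment, tweet) pairs from a header-less six-column CSV."""
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        yield int(record["sentiment"]), record["tweet"]

# Hypothetical rows in the same layout, used here instead of the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","feeling low today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","what a great morning"\n'
)
pairs = list(load_sentiment_and_tweet(sample))
```

With the real file, the same function could be given a handle from `open("dataset.csv", encoding="latin-1", newline="")`; the encoding is an assumption about the downloaded file.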
PREPROCESSING
STEP-1
•REMOVING URLS
•REMOVING MENTIONS
•REMOVING PUNCTUATION
STEP-2
STEMMING
STEP-3
REMOVING STOP WORDS
RESULT
CLEAN TWEETS
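The cleaning steps can be sketched with regular expressions as below. This is an illustration under assumptions, not the project's actual code: the stop-word set is a tiny hypothetical list, the function name is ours, and stemming is left out because it would normally come from an NLP library.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "to", "it", "was", "is", "i", "of", "in"}

def clean_tweet(text):
    """Lowercase a tweet, then strip URLs, mentions, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @mentions
    # Remove punctuation last, so URL and mention patterns still match above.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_tweet("@user The UA loss was embarrassing!!! http://t.co/xyz")
```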
HOW TO DO:
could not bear to watch it and thought the ua loss was embarrassing
Applying Lemmatization
PREPROCESSING: RE, NLP
FEATURE EXTRACTION: NLP
CLASSIFICATION: ML
FEATURE EXTRACTION
CLEAN TWEETS
N-GRAMS
Used to calculate the probability of co-occurrence of each input sentence as a unigram and bigram.
LDA
Latent Dirichlet Allocation is a probabilistic generative model helpful in discovering underlying topic structures.
LIWC
Linguistic Inquiry and Word Count dictionary can be used to obtain scores for standard linguistic dimensions, psychological processes and personal concerns.
TFIDF
Term Frequency – Inverse Document Frequency is a numeric statistic which highlights the importance of a word w.r.t. each document.
N-GRAM MODELING
Unigrams: Bigrams:
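As a small illustration of unigrams and bigrams, the sketch below splits part of the cleaned example tweet into tokens and slides a window of size 1 and 2 over them. The helper name `ngrams` is our own.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "could not bear to watch it".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Counting the n-grams gives the raw material for a count-based feature vector.
bigram_counts = Counter(bigrams)
```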
Vectorising:
•TF: Term Frequency
Number of times a particular word has occurred in a given document.
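A rough sketch of TF-IDF weighting on three toy documents. TF is the raw count, as defined above. For IDF we assume the plain form log(N / df), where N is the number of documents and df is the number of documents containing the word; libraries often use smoothed variants. The function names and documents are hypothetical.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Raw count of each word in one document."""
    return Counter(tokens)

def inverse_document_frequency(documents):
    """log(N / df) per word; unsmoothed, which is an assumption of this sketch."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def tf_idf(documents):
    """One {word: weight} dictionary per document."""
    idf = inverse_document_frequency(documents)
    return [
        {word: count * idf[word] for word, count in term_frequency(doc).items()}
        for doc in documents
    ]

documents = [
    ["ua", "loss", "embarrassing"],
    ["great", "win", "great", "game"],
    ["loss", "hurts"],
]
weights = tf_idf(documents)
```

Words that appear in many documents, such as "loss" here, get a lower weight than words concentrated in one document, such as "great".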
LIWC
Linguistic Inquiry and Word Count
The LIWC dictionary used in this demonstration is composed of
5,690 words and word stems. Each word or word stem defines one or more
word categories.
•LIWC Dictionary
For each dictionary word, there is a corresponding dictionary entry that defines one or more word categories.
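The dictionary lookup can be sketched as follows. The real LIWC dictionary is not reproduced here: the entries and category names below are hypothetical, a trailing `*` is assumed to mark a word stem, and scores are reported as a percentage of all words in the text, which is our assumption about how the scores are normalised.

```python
from collections import Counter

# Hypothetical mini-dictionary: each word or stem maps to one or more categories.
TOY_DICTIONARY = {
    "i": ["pronoun"],
    "sad*": ["negemo", "sad"],
    "lonel*": ["negemo"],
    "happy": ["posemo"],
}

def category_scores(tokens, dictionary):
    """Percentage of tokens falling into each category."""
    counts = Counter()
    for token in tokens:
        for entry, categories in dictionary.items():
            if entry.endswith("*"):
                matched = token.startswith(entry[:-1])  # stem match
            else:
                matched = token == entry                # exact word match
            if matched:
                counts.update(categories)
    total = len(tokens)
    return {category: 100.0 * count / total for category, count in counts.items()}

scores = category_scores("i feel sad and lonely today".split(), TOY_DICTIONARY)
```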
LDA
Latent Dirichlet Allocation
How to do
•NLP
•Topic Modeling
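To make the topic-modeling idea concrete, here is a compact sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is one common way to fit LDA and not necessarily the method used in this project. The hyperparameters `alpha` and `beta`, the function name and the toy documents are all assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Return per-document topic proportions and per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]            # topic counts per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                sampling_weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=sampling_weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word

docs = [
    ["sad", "lonely", "tired", "sad"],
    ["game", "win", "team", "game"],
    ["tired", "lonely", "sad"],
    ["team", "win", "game"],
]
theta, topic_word = lda_gibbs(docs)
```

Each row of `theta` is a topic distribution for one tweet and can be used directly as a feature vector for the classifiers.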
LDA(Contd..)
CLASSIFICATION
ALGORITHMS
•LOGISTIC REGRESSION
•SUPPORT VECTOR MACHINE
•ADAPTIVE BOOSTING
•RANDOM FOREST
•MULTILAYER PERCEPTRON
LOGISTIC REGRESSION
•Simple algorithm used for binary/multivariate classification tasks.
MERITS
DEMERITS
•Suffers from multicollinearity.
•Sensitive to extreme values of continuous variables.
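For the binary case, logistic regression can be sketched as a weighted sum passed through a sigmoid and trained by gradient descent on the log loss. The learning rate, epoch count, function names and the one-feature toy data are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log loss; returns weights and bias."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * g / n_samples for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b

def logistic_predict(w, b, x):
    """Class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Toy data: one feature, label 1 for the larger values.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
predictions = [logistic_predict(w, b, x) for x in X]
```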
SUPPORT VECTOR MACHINE
MERITS
DEMERITS
•Computationally expensive.
•Complexity is high.
•Requires more memory and time for training the model.
RANDOM FOREST
• Ensemble method which uses multiple learning models to gain better predictive results.
• Creates a forest with a number of decision trees.
• Decision trees are created from a randomly selected subset of the training set.
• Aggregates the votes from different decision trees to decide the final class of the test object.
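The random-subset and voting steps can be sketched as below. To keep the example short, each "tree" is only a depth-1 decision stump on one randomly chosen feature, which is a simplification of a real random forest. The function names, the number of trees and the toy data are assumptions.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Depth-1 tree on one feature: pick the threshold with the fewest errors."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(labels) + 1, None, majority, majority)
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return feature, best[1], best[2], best[3]

def tree_predict(tree, row):
    feature, threshold, left_label, right_label = tree
    if threshold is None:  # sample had a single feature value: constant prediction
        return left_label
    return left_label if row[feature] <= threshold else right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        indices = [rng.randrange(len(rows)) for _ in range(len(rows))]
        sample_rows = [rows[i] for i in indices]
        sample_labels = [labels[i] for i in indices]
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump(sample_rows, sample_labels, feature))
    return forest

def forest_predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(tree_predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

rows = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0],
        [9.0, 10.0], [10.0, 9.0], [9.5, 9.5], [10.0, 10.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
```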
RANDOM FOREST(Contd..)
MERITS
DEMERITS
ADAPTIVE BOOSTING
•Focuses on classification problems and aims to convert a set of weak classifiers into a strong one.
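Turning weak classifiers into a strong one is the idea behind adaptive boosting (AdaBoost). The sketch below uses one-feature threshold stumps as the weak classifiers and labels in {+1, -1}. The function names, the number of rounds and the toy data are assumptions for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: +polarity at or above the threshold, -polarity below."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Return a list of (alpha, threshold, polarity) weak classifiers."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
```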
MERITS
•Simple to implement.
•Performs feature selection, resulting in a simple classifier.
•Fairly good generalization.
DEMERITS
MERITS
DEMERITS