Text Analysis
Dr Nitsa Herzog
NLP Steps
Syntactic analysis.
Semantic analysis.
Discourse integration.
Pragmatic analysis.
Text Preprocessing
● Lower casing
● Removal of Punctuation
● Removal of Stopwords
● Removal of Frequent words
● Removal of Rare words
● Stemming
● Lemmatization
● Removal of emojis (pictogram, logogram, ideogram, or
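A minimal sketch of a few of the steps above (lower casing, punctuation removal, stopword removal, and removal of frequent and rare words), using only the Python standard library. The stopword list, the function names, and the thresholds are assumptions made for illustration. Stemming and lemmatization are left out because they are normally delegated to an NLP library.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real projects use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to"}

def preprocess(text, stopwords=STOPWORDS):
    """Lower-case, strip ASCII punctuation, and drop stopwords; returns tokens."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in stopwords]

def drop_by_frequency(docs, n_frequent=1, min_count=2):
    """Drop the n most frequent tokens and tokens seen fewer than min_count times."""
    counts = Counter(tok for doc in docs for tok in doc)
    frequent = {tok for tok, _ in counts.most_common(n_frequent)}
    return [[t for t in doc if t not in frequent and counts[t] >= min_count]
            for doc in docs]

tokens = preprocess("The fox, and the DOG!")
print(tokens)  # ['fox', 'dog']
```

Frequency-based removal works on the whole collection rather than a single text, which is why drop_by_frequency takes a list of token lists.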
Part-of-Speech (POS) Tagging
Example:
• he saw a fox
Each tag marks the POS for the corresponding word, such as:
• PRP VBD DT NN (according to the Penn Treebank POS tags)
The four words are mapped to:
• pronoun (personal), verb (past tense), determiner, and noun (singular).
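The mapping in this example can be sketched as a toy lookup tagger. The lexicon and the tag function are hypothetical and cover only the four words of the example. A real tagger also uses context, since a word such as "saw" can be a noun or a verb.

```python
# Toy lexicon for illustration only; tags follow the Penn Treebank tag set.
TOY_LEXICON = {"he": "PRP", "saw": "VBD", "a": "DT", "fox": "NN"}

def tag(sentence, lexicon=TOY_LEXICON, default="NN"):
    """Return (word, tag) pairs; unknown words fall back to `default`."""
    return [(word, lexicon.get(word.lower(), default)) for word in sentence.split()]

tagged = tag("he saw a fox")
print(tagged)  # [('he', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('fox', 'NN')]
```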
Example
The texts “a dog bites a man” and “a man bites a dog” have very
different meanings, but they share the same bag-of-words
representation, because bag-of-words ignores word order.
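A short check of this example, assuming the simplest possible bag-of-words: whitespace tokenization and raw word counts. The bag_of_words helper is a name chosen for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count each whitespace-separated token; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("a dog bites a man")
b = bag_of_words("a man bites a dog")
print(a == b)  # True
```

Both texts reduce to the same counts (a: 2, dog: 1, bites: 1, man: 1), so any model that sees only these counts cannot tell them apart.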
Term frequency
Term frequency (TF) measures how often a word occurs in a document, and inverse document frequency (IDF) measures how
rare the word is across the whole collection of documents.
Their product, TF-IDF, is useful for decreasing the weight of common, low-information words.
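A minimal sketch of both quantities on a made-up three-document corpus, showing how a word that occurs in every document ends up with zero weight. The function names and the corpus are assumptions. The sketch uses an unsmoothed base-10 logarithm, and library implementations often use a smoothed variant instead.

```python
import math

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log10(N / number of documents containing `term`); term must occur somewhere."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Made-up corpus of pre-tokenized documents.
docs = [["the", "fox", "runs"], ["the", "dog", "sleeps"], ["the", "fox", "sleeps"]]

print(tf_idf("the", docs[1], docs))  # 0.0: "the" occurs in every document
print(tf_idf("dog", docs[1], docs) > tf_idf("the", docs[1], docs))  # True
```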
TF-IDF: Example
Consider a document containing 100 words wherein the word “apple”
appears 5 times. The term frequency (i.e., TF) for apple is then (5 / 100)
= 0.05.
Now, assume we have 10 million documents, and the word “apple”
appears in 1,000 of these. Then, the inverse document frequency (i.e.,
IDF), using a base-10 logarithm, is log10(10,000,000 / 1,000) = 4.
Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 =
0.20.
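The arithmetic of this example can be reproduced directly. The variable names are chosen for illustration, and the logarithm is taken to base 10, which is the base that gives the value 4 stated above.

```python
import math

tf = 5 / 100                          # "apple" appears 5 times in 100 words
idf = math.log10(10_000_000 / 1_000)  # 10 million documents, 1,000 contain "apple"
tf_idf = tf * idf

print(round(tf, 2), round(idf, 2), round(tf_idf, 2))  # 0.05 4.0 0.2
```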
The topics of 5-star reviews
Review Categorization with LDA
Supervised
• Classic ML (the most popular: Naive Bayes (NB), Support Vector Machine (SVM))
Unsupervised
• Clustering
• Deep Learning (the most popular: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN))