CS-875-Lecture 4
Lecture 4
Instructor: Dr. Hashir Kiani
General Conduct
• Be respectful of others
• Speak only when it is your turn, and preferably raise your hand if you want
to say something
2
Lecture Outline
3
Recap of Previous Lecture
• What is the pipeline for supervised machine learning?
• What is sentiment analysis?
• Why is it important?
• What is the sentiment analysis pipeline if we are using supervised
machine learning?
• What is vocabulary?
• How do we extract features from text using vectors equal to the
size of vocabulary?
• What are the problems with such a representation?
• How do we use positive and negative counts to extract features?
4
Preprocessing: Stop Words and Punctuation
• Stop words in a language are those words that add little or no
significant value to the meaning or context of the text.
• Examples include “the”, “is”, “in”, “to”, etc.
• Stop words occur very frequently in text.
• For example, in “I went to the market”, the words “to” and “the” only add
grammatical structure; the context is still clear without them.
• Therefore, stop words and punctuation are removed as a
preprocessing step.
5
Preprocessing: Stop Words and Punctuation
Example stop words: and, is, a, at, has, of, the, in, to
Example punctuation: , : ! ; ' "

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

stop_w = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_w
                   and word not in string.punctuation]
6
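A minimal, self-contained usage sketch of the snippet above, using the example sentence from the previous slide (assumes the NLTK data packages have been downloaded with nltk.download('punkt') and nltk.download('stopwords')):

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "I went to the market"
stop_w = set(stopwords.words('english'))
tokens = word_tokenize(text)                      # ['I', 'went', 'to', 'the', 'market']
filtered_tokens = [word for word in tokens if word.lower() not in stop_w
                   and word not in string.punctuation]
print(filtered_tokens)                            # expected: ['went', 'market']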
Preprocessing: Stemming
• Stemming is used to reduce words to their base or root form.
• It is a rule-based approach that involves stripping common prefixes
or suffixes.
• For example, “running” becomes “run” and “happiness” becomes
“happi”.
• Stemming does not consider the grammatical rules or linguistic
structure of the language.
• It reduces vocabulary size and simplifies text analysis.
7
Preprocessing: Stemming Code
from nltk.stem import PorterStemmer
porter = PorterStemmer()
tokens = word_tokenize(text)
stemmed_tokens = [porter.stem(word) for word in tokens]
8
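A minimal, self-contained usage sketch of the stemmer, using the example words from the previous slide (assumes nltk.download('punkt') has been run):

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

porter = PorterStemmer()
tokens = word_tokenize("running happiness")
stemmed_tokens = [porter.stem(word) for word in tokens]
print(stemmed_tokens)   # expected: ['run', 'happi']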
Text to Feature Vectors
"I am happy because I am learning NLP" → Preprocessing → [happy, learn, NLP] → Feature Extraction → [1, 10, 5]
9
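A minimal sketch of the feature-extraction step, assuming a hypothetical freqs dictionary that maps (word, sentiment) pairs to their counts in the training corpus (the positive/negative counts recapped from the previous lecture); the counts below are illustrative and chosen to reproduce the [1, 10, 5] vector on the slide:

import numpy as np

def extract_features(tokens, freqs):
    # feature vector: [bias, sum of positive counts, sum of negative counts]
    pos = sum(freqs.get((word, 1), 0) for word in tokens)
    neg = sum(freqs.get((word, 0), 0) for word in tokens)
    return np.array([1, pos, neg])

# hypothetical counts, for illustration only
freqs = {('happy', 1): 6, ('learn', 1): 3, ('NLP', 1): 1,
         ('happy', 0): 2, ('learn', 0): 2, ('NLP', 0): 1}
print(extract_features(['happy', 'learn', 'NLP'], freqs))   # [ 1 10  5]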
Final Feature/Input Matrix
$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} \end{bmatrix}$
10
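A minimal NumPy sketch of assembling this matrix, assuming one [1, x1, x2] feature vector per training example (the values are illustrative):

import numpy as np

feature_vectors = [np.array([1, 10, 5]),    # example 1: [bias, x1, x2]
                   np.array([1, 4, 9]),     # example 2
                   np.array([1, 7, 2])]     # example 3
X = np.vstack(feature_vectors)              # final feature/input matrix, shape (m, 3)
print(X.shape)                              # (3, 3)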
Supervised Machine Learning (Training)
[Diagram: the Features X and the Parameters θ feed into the Prediction Function, which produces the Output Ŷ (Predicted Values); the Cost Function then compares Ŷ against the Labels Y.]
11
Logistic Regression – A Linear Classifier
• Predictor Function: $h(\boldsymbol{x}) = g(\boldsymbol{w}^T \phi(\boldsymbol{x}))$,
where $g(z) = \dfrac{1}{1 + e^{-z}}$ (Sigmoid/Logistic Function)
12
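A minimal NumPy sketch of the predictor, assuming $\phi(\boldsymbol{x})$ is simply the feature vector itself (e.g. [1, pos, neg] from the earlier slides); the parameter values are illustrative:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    # h(x) = g(w^T x): probability of the positive class
    return sigmoid(np.dot(w, x))

w = np.array([0.0, 0.5, -0.5])   # example parameter vector
x = np.array([1, 10, 5])         # [bias, positive count, negative count]
print(predict(w, x))             # ~0.92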
Logistic Regression – Loss Function
• Logistic loss, also known as Binary Cross Entropy Loss:

$\mathrm{TrainLoss}(\boldsymbol{w}) = -\dfrac{1}{N}\sum_{i=1}^{N}\left[ y_i \log h(\boldsymbol{x}_i) + (1 - y_i)\log\left(1 - h(\boldsymbol{x}_i)\right) \right]$
13
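A minimal NumPy sketch of this loss, assuming h holds the predicted probabilities $h(\boldsymbol{x}_i)$ and y the binary labels:

import numpy as np

def train_loss(h, y):
    # binary cross-entropy, averaged over the N training examples
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

h = np.array([0.9, 0.2, 0.7])   # example predicted probabilities
y = np.array([1, 0, 1])         # example labels
print(train_loss(h, y))         # ~0.23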
Logistic Regression – Optimization Algorithm
• Gradient Descent:

$\nabla_{\boldsymbol{w}} \mathrm{TrainLoss}(\boldsymbol{w}) = \dfrac{1}{N}\sum_{i=1}^{N}\left(h(\boldsymbol{x}_i) - y_i\right)\boldsymbol{x}_i = \dfrac{1}{N} X^T (\boldsymbol{h} - \boldsymbol{y})$

$\boldsymbol{w}_{new} = \boldsymbol{w}_{old} - \dfrac{\alpha}{N} X^T (\boldsymbol{h} - \boldsymbol{y})$
14
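A minimal NumPy sketch of the full training loop, assuming X is the feature matrix (with the bias column), y the label vector, and alpha the learning rate; the toy data below is illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    N, d = X.shape
    w = np.zeros(d)                  # initialize parameters
    for _ in range(num_iters):
        h = sigmoid(X @ w)           # predictions for all N examples
        grad = X.T @ (h - y) / N     # (1/N) X^T (h - y)
        w = w - alpha * grad         # w_new = w_old - (alpha/N) X^T (h - y)
    return w

X = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 3.0], [1.0, 3.0, 1.0], [1.0, 0.5, 2.5]])
y = np.array([1, 0, 1, 0])
print(gradient_descent(X, y, alpha=0.1, num_iters=2000))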
Training, Validation and Testing
15
Classification Model Performance Evaluation
• Accuracy: Proportion of correct predictions.
• Confusion Matrix: Table summarizing TP, FP, FN and TN (true positives, false positives, false negatives and true negatives).
17
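A minimal NumPy sketch of both metrics for binary labels, assuming y_true and y_pred are 0/1 arrays (the values are illustrative):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0])   # example ground-truth labels
y_pred = np.array([1, 0, 0, 1, 0, 1])   # example model predictions

accuracy = np.mean(y_true == y_pred)

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives

print(accuracy)                  # ~0.67
print([[tp, fp], [fn, tn]])      # confusion matrix entries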
Text Representations
19
Text Representations
20
Why is Text Representation Important?
21
Common Text Representation Methods
● One-Hot Encoding
● Bag of Words
● TF-IDF
● N-grams
● Word Embeddings
22
One-Hot Encoding (OHE)
23
One-Hot Encoding (OHE)
24
One-Hot Encoding (OHE)
● It is generally not ideal for larger or more complex tasks because it does
not capture word frequencies, context, or semantic relationships between
words.
25
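A minimal sketch of one-hot encoding over a small, illustrative vocabulary; note that the vectors for "king" and "queen" are orthogonal, anticipating the cons on the next slide:

import numpy as np

vocab = ["i", "love", "apples", "king", "queen"]          # illustrative vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # vector of vocabulary size with a single 1 at the word's index
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("king"))    # [0 0 0 1 0]
print(one_hot("queen"))   # [0 0 0 0 1]  -- no notion of similarity to "king"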
One-Hot Encoding (OHE)
● Pros
○ Easy to implement and interpret.
○ Simple, which can be useful for small tasks.
● Cons
○ High dimensionality
■ As the vocabulary grows, the vectors become extremely large and sparse, which
can be computationally inefficient.
○ No semantic meaning
■ Words are treated independently, and the encoding does not capture relationships
between words.
■ For example, "king" and "queen" are just as unrelated as "king" and "apple" in
one-hot encoding.
26
Bag of Words (BoW): Conceptual Model
● The Bag of Words (BoW) model is one of the simplest ways to represent
text data numerically.
27
Bag of Words (BoW): Conceptual Model
● Tokenization
○ Break down each document into individual words (tokens).
● Vocabulary Construction
○ Create a list of all unique words (vocabulary) across the entire document set.
● Document-Term Matrix
○ Represent each document as a vector of word counts from the vocabulary.
○ The vector length is equal to the number of unique words in the corpus.
28
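A minimal sketch of these three steps in plain Python (the two documents are illustrative):

from collections import Counter

docs = ["I love apples", "I love oranges and apples"]      # illustrative corpus

# 1. Tokenization
tokenized = [doc.lower().split() for doc in docs]

# 2. Vocabulary construction: all unique words across the corpus
vocab = sorted(set(word for tokens in tokenized for word in tokens))

# 3. Document-term matrix: one count vector per document
matrix = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

print(vocab)    # ['and', 'apples', 'i', 'love', 'oranges']
print(matrix)   # [[0, 1, 1, 1, 0], [1, 1, 1, 1, 1]]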
Bag of Words (BoW): Conceptual Model
29
Bag of Words (BoW): Conceptual Model
30
Bag of Words (BoW): Conceptual Model
● Pros
○ Simplicity
■ The Bag of Words model is easy to understand and implement.
○ Sparse Representation
■ For many NLP tasks, a simple BoW model can work well and is computationally
efficient.
● Cons
○ Ignores Word Order
■ BoW doesn’t capture the sequence of words, so "cat sat on mat" and "mat sat on
cat" would be treated identically.
○ No Semantics
■ BoW doesn’t capture the meaning of words or their context.
■ Words with similar meanings or synonyms are treated as different.
○ High Dimensionality
■ As the vocabulary grows, the size of the document-term matrix increases, leading
to sparsity and higher computational costs.
31
TF-IDF
32
TF-IDF
33
TF-IDF
● The TF-IDF value for a term in a document is simply the product of the
term's TF and IDF:
34
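Written out, this is the standard formulation (consistent with the worked example on the following slides, which uses the base-10 logarithm):

$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$, where $\mathrm{TF}(t, d) = \dfrac{\text{count of } t \text{ in } d}{\text{total terms in } d}$ and $\mathrm{IDF}(t) = \log_{10}\dfrac{N}{\text{number of documents containing } t}$, with $N$ the total number of documents.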
TF-IDF
● Vocabulary
○ ["I", "love", "eat", "apples", "every", "day"]
● TF for Document 1
○ "I": 1/3, "love": 1/3, "apples": ⅓
● TF for Document 2
○ "I": 1/5, "eat": 1/5, "apples": 1/5, "every": 1/5, "day": 1/5
35
TF-IDF
● IDF
○ "I": log(2/2) = 0 (because it appears in both documents)
○ "love": log(2/1) = 0.301 (appears only in Document 1)
○ "eat": log(2/1) = 0.301
○ "apples": log(2/2) = 0
○ "every": log(2/1) = 0.301
○ "day": log(2/1) = 0.301
36
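A minimal sketch that reproduces the numbers above for the two example documents ("I love apples" and "I eat apples every day"), using the base-10 logarithm:

import math

docs = [["I", "love", "apples"],
        ["I", "eat", "apples", "every", "day"]]
vocab = ["I", "love", "eat", "apples", "every", "day"]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for doc in docs if term in doc)   # number of documents containing the term
    return math.log10(len(docs) / df)

for i, doc in enumerate(docs, start=1):
    scores = {term: round(tf(term, doc) * idf(term, docs), 3) for term in vocab if term in doc}
    print(f"Document {i}:", scores)
# Document 1: {'I': 0.0, 'love': 0.1, 'apples': 0.0}
# Document 2: {'I': 0.0, 'eat': 0.06, 'apples': 0.0, 'every': 0.06, 'day': 0.06}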
TF-IDF
● Pros
○ Balances word frequency and rarity
■ TF-IDF emphasizes words that are important within a document but discounts
common words that are frequent across many documents.
○ Easy to interpret
■ It is straightforward to understand and implement, making it a popular choice for
early text analysis.
○ Reduces the effect of stop words
■ Common stop words (e.g., "the", "is", "and") are given less weight because they
appear frequently across all documents.
○ Good for keyword extraction
■ TF-IDF is effective for identifying the most important terms in a document, which
can be useful in information retrieval and search engine algorithms.
38
TF-IDF
● Cons
○ Ignores word context:
■ TF-IDF treats words independently, so it doesn’t account for the order of words or
their context (e.g., "not good" would be treated the same as "good not").
○ Sparse and high-dimensional
■ For large corpora, the resulting vectors can become very large and sparse, which
can be inefficient to store and process.
○ No semantic understanding
■ TF-IDF does not capture any semantic relationships between words (e.g., "car"
and "automobile" are treated as unrelated). It only considers word frequency.
○ Doesn't handle synonyms
■ Since it only counts exact word matches, synonyms are treated as completely
different terms, which may not reflect their actual relationship.
39
N-grams
40
N-grams
○ Example sentence
■ "I love apples"
■ Unigrams (n=1): ["I", "love", "apples"]
■ Bigrams (n=2): ["I love", "love apples"]
■ Trigrams (n=3): ["I love apples"]
○ For multiple sentences, you would apply the same process to each sentence.
41
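A minimal sketch of N-gram extraction that reproduces the lists above:

def ngrams(tokens, n):
    # contiguous sequences of n tokens, joined into a single string
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love apples".split()
print(ngrams(tokens, 1))   # ['I', 'love', 'apples']
print(ngrams(tokens, 2))   # ['I love', 'love apples']
print(ngrams(tokens, 3))   # ['I love apples']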
N-grams
○ Unigram vocabulary
■ ["I", "love", "apples", "oranges"]
○ Bigram vocabulary
■ ["I love", "love apples", "love oranges"]
42
N-grams
43
N-grams
44
N-grams
● Pros
○ Captures Local Word Order
■ N-grams preserve some of the order of words, especially compared to Bag of
Words (BoW), where word order is entirely ignored.
○ Improved Performance on Short Phrases
■ They are useful for capturing common phrases, idiomatic expressions, or short
word dependencies (e.g., "New York", "United States").
○ Simple to Implement
■ N-grams are relatively easy to compute and can be effective for tasks like text
classification or sentiment analysis.
45
N-grams
● Cons
○ High Dimensionality
■ As n increases, the number of possible N-grams explodes, leading to very
high-dimensional and sparse feature vectors, which are hard to work with
computationally.
○ Ignores Long-Range Dependencies
■ N-grams only capture short-range dependencies. They fail to model longer context
relationships (e.g., "The movie was not good" vs. "The movie was good").
○ Overfitting
■ For small datasets, N-gram models can overfit easily because they learn specific
word sequences that may not generalize well.
46
Questions?
47