
CS-875: Natural Language Processing

Lecture 4
Instructor: Dr. Hashir Kiani
General Conduct
• Be respectful of others

• Only speak when it is your turn, and preferably raise your hand if you want
to say something

• Do not cut off others when they are talking

• Join the class on time

2
Lecture Outline

• Recap of Previous Lecture

• Sentiment Analysis with Logistic Regression

• Text Representation Techniques

3
Recap of Previous Lecture
• What is the pipeline for supervised machine learning?
• What is sentiment analysis?
• Why is it important?
• What is the sentiment analysis pipeline if we are using supervised
machine learning?
• What is vocabulary?
• How do we extract features from text using vectors equal to the
size of vocabulary?
• What are the problems with such a representation?
• How do we use positive and negative counts to extract features?
4
Preprocessing: Stop Words and Punctuation
• Stop words in a language are those words that add little or no
significant value to the meaning or context of the text.
• Examples include “the”, “is”, “in”, “to”, etc.
• Stop words occur very frequently in text.
• For example, in “I went to the market”, the words “to” and “the” only add grammatical structure; the context is still clear without them.
• Therefore, stop words and punctuation are removed as a preprocessing step.

5
Preprocessing: Stop Words and Punctuation
Stop Words: and, is, a, at, has, of, the, in, to
Punctuation: ,  :  !  ;

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# nltk.download('stopwords'); nltk.download('punkt')  # run once if the data is not already installed

stop_w = set(stopwords.words('english'))

tokens = word_tokenize(text)

filtered_tokens = [word for word in tokens
                   if word.lower() not in stop_w and word not in string.punctuation]

6
Preprocessing: Stemming
• Stemming is used to reduce words to their base or root form.
• It is a rule-based approach that involves stripping common prefixes
or suffixes.
• For example, “running” becomes “run” and “happiness” becomes “happi”.
• Stemming does not consider the grammatical rules or linguistic
structure of the language.
• It reduces vocabulary size and simplifies text analysis.

7
Preprocessing: Stemming Code

from nltk.stem import PorterStemmer

porter = PorterStemmer()
tokens = word_tokenize(text)
porter_stems = [porter.stem(token) for token in tokens]

8
Text to Feature Vectors
I am happy because I am learning NLP

        ↓  Preprocessing

[happy, learn, NLP]

        ↓  Feature Extraction

[1, 10, 5]
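A minimal sketch of this feature-extraction step, assuming a hypothetical freqs dictionary that maps (word, sentiment) pairs to their corpus counts, as in the positive/negative-count approach recapped from the previous lecture:

def extract_features(tokens, freqs):
    # Feature vector: [bias, total positive count, total negative count]
    pos_count = sum(freqs.get((word, 1), 0) for word in tokens)
    neg_count = sum(freqs.get((word, 0), 0) for word in tokens)
    return [1, pos_count, neg_count]

# e.g. extract_features(['happy', 'learn', 'NLP'], freqs) could yield [1, 10, 5]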

9
Final Feature/Input Matrix

$$X = \begin{bmatrix}
1 & x_1^{(1)} & x_2^{(1)} \\
1 & x_1^{(2)} & x_2^{(2)} \\
\vdots & \vdots & \vdots \\
1 & x_1^{(m)} & x_2^{(m)}
\end{bmatrix}$$

10
Supervised Machine Learning (Training)

[Diagram] The features $X$ and the parameters $\theta$ are fed into the prediction function, which produces the output $\hat{Y}$ (predicted values); the cost function then compares $\hat{Y}$ with the labels $Y$, and the result is used to update the parameters $\theta$.
11
Logistic Regression – A Linear Classifier
• Predictor Function: $h(\boldsymbol{x}) = g(\boldsymbol{w}^T \phi(\boldsymbol{x}))$

  where $g(z) = \dfrac{1}{1 + e^{-z}}$ (Sigmoid/Logistic Function)

• Output interpreted as the probability of the positive label.
• $g(z) > 0.5$ means positive label
• $g(z) \le 0.5$ means negative label
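A minimal NumPy sketch of this predictor; the vector x is assumed to already contain the features $\phi(\boldsymbol{x})$, and the names are illustrative:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(w, x):
    # h(x) = g(w . x); positive label when the probability exceeds 0.5
    prob = sigmoid(np.dot(w, x))
    return 1 if prob > 0.5 else 0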

12
Logistic Regression – Loss Function
• Logistic loss, also known as Binary Cross-Entropy Loss:

$$\mathrm{Loss}(y, h(\boldsymbol{x})) = -y \log h(\boldsymbol{x}) - (1 - y)\log\big(1 - h(\boldsymbol{x})\big)$$

• Training loss in this case:

$$\mathrm{TrainLoss}(\boldsymbol{w}) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log h(\boldsymbol{x}_i) + (1 - y_i)\log\big(1 - h(\boldsymbol{x}_i)\big)\Big]$$
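A small NumPy sketch of this training loss, assuming h holds the predicted probabilities $h(\boldsymbol{x}_i)$ and y the true labels; a tiny epsilon guards against log(0):

import numpy as np

def train_loss(h, y, eps=1e-12):
    # Mean binary cross-entropy over the N training examples
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))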

13
Logistic Regression – Optimization Algorithm
• Gradient Descent:

$$\boldsymbol{w}_{\text{new}} = \boldsymbol{w}_{\text{old}} - \alpha \nabla_{\boldsymbol{w}} \mathrm{TrainLoss}(\boldsymbol{w})$$

$$\nabla_{\boldsymbol{w}} \mathrm{TrainLoss}(\boldsymbol{w}) = \frac{1}{N}\sum_{i=1}^{N}\big[(h(\boldsymbol{x}_i) - y_i)\,\boldsymbol{x}_i\big] = \frac{1}{N} X^T(\boldsymbol{h} - \boldsymbol{y})$$

$$\boldsymbol{w}_{\text{new}} = \boldsymbol{w}_{\text{old}} - \frac{\alpha}{N} X^T(\boldsymbol{h} - \boldsymbol{y})$$
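A minimal sketch of this vectorized update, assuming X is the N x (d+1) feature matrix with a leading column of ones and y is the label vector (illustrative, not the lecture's exact code):

import numpy as np

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ w)))   # h = g(Xw)
        grad = X.T @ (h - y) / X.shape[0]    # (1/N) X^T (h - y)
        w -= alpha * grad                    # w_new = w_old - alpha * grad
    return w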

14
Training, Validation and Testing

• The dataset is usually split into a training set, a validation set and a testing set.
• Training set is used to train your model and estimate its
parameters.
• Validation set is used to validate the performance of your model
and tune the hyperparameters.
• Testing set is used to check the accuracy of your final model.
• We need our model to perform well on unseen data.
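One common way to obtain all three splits with scikit-learn is to call train_test_split twice; the 60/20/20 ratio below is only an illustrative choice:

from sklearn.model_selection import train_test_split

# First hold out 20% for testing, then split the remainder into train and validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%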

15
Classification Model Performance Evaluation
• Accuracy: Proportion of correct predictions.
• Confusion Matrix: Table summarizing true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).

• Precision: Proportion of true positives among predicted positives.


• Recall: Proportion of true positives among actual positives.
• F1 Score: Harmonic mean of precision and recall.
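In terms of the confusion-matrix counts, these metrics are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$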
16
Code for Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the classifier and evaluate it on the held-out test set
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

17
Text Representations
Text Representations

● Text representation refers to the process of


converting text data into a numerical format
that can be used by machine learning models
and algorithms.

● Since most machine learning models require


numerical input, text, which is naturally unstructured
○ needs to be transformed into structured numerical data.

19
Text Representations

● Simplifying Language for Computation


○ By converting words into numbers, we create a
form that allows algorithms to analyze patterns
and relationships within the text.

● Handling Large Text Data


○ Text representation makes it possible to process
and analyze large corpora (sets of texts) more
efficiently by compressing them into a more manageable
numerical form.

20
Why is Text Representation Important?

● ML/DL models operate on numbers, not words.


○ In order to analyze and derive insights from textual data, we need to convert
words, sentences, and documents into a numerical form.

● It helps to capture the important features of the text


○ such as word occurrence, frequency, or relationships between words
○ making it possible to apply algorithms for classification, clustering, sentiment
analysis, summarization, etc.

21
Common Text Representation Methods

● One-Hot Encoding

● Bag of Words

● TF-IDF

● N-grams

● Word Embeddings

22
One-Hot Encoding (OHE)

● One-Hot Encoding is a simple and widely-used technique to represent


words or tokens as binary vectors.

● In One-Hot Encoding, each unique


word in the corpus is represented
by a vector where all elements are
0, except for a single element that
corresponds to the word's position
in the vocabulary.
○ This specific position is set to 1.

23
One-Hot Encoding (OHE)

● Consider the sentences:


○ "I love apples"
○ "I eat apples"

● The vocabulary consists of these unique words:


○ ["I", "love", "eat", "apples"]

● The one-hot vectors for each word would be:


○ "I" = [1, 0, 0, 0]
○ "love" = [0, 1, 0, 0]
○ "eat" = [0, 0, 1, 0]
○ "apples" = [0, 0, 0, 1]
○ "I love apples" = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

24
One-Hot Encoding (OHE)

● Each word is represented as a vector of 0s and 1s.

● One-hot encoding does not capture word frequency


○ it only captures the presence or absence of a word.

● One-hot encoding is often used in simpler models or early stages of text


processing where simplicity is more important than capturing semantics.

● It is generally not ideal for larger or more complex tasks because it does
not capture word frequencies, context, or semantic relationships between
words.

25
One-Hot Encoding (OHE)

● Pros
○ Easy to implement and interpret.
○ Simple, which can be useful for small tasks.

● Cons
○ High dimensionality
■ As the vocabulary grows, the vectors become extremely large and sparse, which
can be computationally inefficient.
○ No semantic meaning
■ Words are treated independently, and the encoding does not capture relationships
between words.
■ For example, "king" and "queen" are just as unrelated as "king" and "apple" in
one-hot encoding.

26
Bag of Words (BoW): Conceptual Model

● The Bag of Words (BoW) model is one of the simplest ways to represent
text data numerically.

● The idea is to create a "bag" or


collection of all the words present in a
set of documents without caring about
the order or grammar.

● Each document is represented as a


vector of word frequencies, which counts the occurrence of each word in
the document.

27
Bag of Words (BoW): Conceptual Model

● Tokenization
○ Break down each document into individual words (tokens).

● Vocabulary Construction
○ Create a list of all unique words (vocabulary) across the entire document set.

● Document-Term Matrix
○ Represent each document as a vector of word counts from the vocabulary.
○ The vector length is equal to the number of unique words in the corpus.

28
Bag of Words (BoW): Conceptual Model

● Let’s consider three simple sentences as an example:


○ "The cat sat on the mat."
○ "The dog lay on the mat."
○ "The cat lay down."

● The unique vocabulary across these three sentences is:


○ [‘The’, ‘cat’, ‘sat’, ‘on’, ‘mat’, ‘dog’, ‘lay’, ‘down’]

29
Bag of Words (BoW): Conceptual Model

● Now, each sentence is represented as a vector based on the word


frequency:

● Document Term Matrix (DTM)


○ Each row in this matrix corresponds to a document (sentence),
○ Each column corresponds to a term (word).
○ The values are the counts of the words in each sentence.
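One way to compute this document-term matrix is scikit-learn's CountVectorizer; note that it lowercases the text and sorts the vocabulary alphabetically, so the column order differs slightly from the listing on the previous slide (a sketch, not the lecture's code):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.",
        "The dog lay on the mat.",
        "The cat lay down."]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)            # sparse document-term matrix
print(vectorizer.get_feature_names_out())       # ['cat' 'dog' 'down' 'lay' 'mat' 'on' 'sat' 'the']
print(dtm.toarray())                            # one row of word counts per sentence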

30
Bag of Words (BoW): Conceptual Model

● Pros
○ Simplicity
■ The Bag of Words model is easy to understand and implement.
○ Sparse Representation
■ For many NLP tasks, a simple BoW model can work well and is computationally
efficient.
● Cons
○ Ignores Word Order
■ BoW doesn’t capture the sequence of words, so "cat sat on mat" and "mat sat on
cat" would be treated identically.
○ No Semantics
■ BoW doesn’t capture the meaning of words or their context.
■ Words with similar meanings or synonyms are treated as different.
○ High Dimensionality
■ As the vocabulary grows, the size of the document-term matrix increases, leading
to sparsity and higher computational costs.
31
TF-IDF

● Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical


measure used to evaluate
○ how important a word is to a document in a
collection or corpus.

● Unlike simple BoW, which only counts word


occurrences, TF-IDF adjusts word counts by
their importance across documents,
○ helping to reduce the influence of commonly used words (like "the", "is", etc.)
that may not carry much meaning in relation to the content.

32
TF-IDF

● TF-IDF is the product of two metrics:


○ TF (Term Frequency): Measures how frequently a word appears in a
document.

○ IDF (Inverse Document Frequency): Measures how unique or rare a word


is across all documents in the corpus.

33
TF-IDF

● The more common a word is across documents,


○ the lower its IDF value, meaning common words get discounted.
○ Rare words have a high IDF, highlighting their importance.

● The TF-IDF value for a term in a document is simply the product of the
term's TF and IDF:
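With N documents in the corpus and df(t) the number of documents that contain term t, the variant used in the worked example on the following slides is (log base 10):

$$\mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of terms in } d}, \qquad \mathrm{IDF}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad \text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$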

34
TF-IDF

● Consider the following corpus of two documents:


○ Document 1: "I love apples"
○ Document 2: "I eat apples every day"

● Vocabulary
○ ["I", "love", "eat", "apples", "every", "day"]

● TF for Document 1
○ "I": 1/3, "love": 1/3, "apples": ⅓

● TF for Document 2
○ "I": 1/5, "eat": 1/5, "apples": 1/5, "every": 1/5, "day": 1/5

35
TF-IDF

● IDF
○ "I": log(2/2) = 0 (because it appears in both documents)
○ "love": log(2/1) = 0.301 (appears only in Document 1)
○ "eat": log(2/1) = 0.301
○ "apples": log(2/2) = 0
○ "every": log(2/1) = 0.301
○ "day": log(2/1) = 0.301

36
TF-IDF

● TF-IDF values for Document 1


○ "I": 1/3 × 0 = 0
○ "love": 1/3 × 0.301 = 0.100
○ "apples": 1/3 × 0 = 0

● TF-IDF values for Document 2


○ "I": 1/5 × 0 = 0
○ "eat": 1/5 × 0.301 = 0.060
○ "apples": 1/5 × 0 = 0
○ "every": 1/5 × 0.301 = 0.060
○ "day": 1/5 × 0.301 = 0.060
● TF-IDF emphasizes "love" in Document 1, and "eat," "every," and "day" in
Document 2, while discounting common words like "I" and "apples."
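For comparison, scikit-learn's TfidfVectorizer implements the same idea, but with a smoothed, natural-log IDF and L2 normalization by default, so its numbers will not match the hand calculation above exactly (a sketch):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love apples", "I eat apples every day"]

# token_pattern keeps one-letter tokens such as "I", which the default pattern would drop
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())       # ['apples' 'day' 'eat' 'every' 'i' 'love']
print(tfidf.toarray())                          # TF-IDF weight of each term in each document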
37
TF-IDF

● Pros
○ Balances word frequency and rarity
■ TF-IDF emphasizes words that are important within a document but discounts
common words that are frequent across many documents.
○ Easy to interpret
■ It is straightforward to understand and implement, making it a popular choice for
early text analysis.
○ Reduces the effect of stop words
■ Common stop words (e.g., "the", "is", "and") are given less weight because they
appear frequently across all documents.
○ Good for keyword extraction
■ TF-IDF is effective for identifying the most important terms in a document, which
can be useful in information retrieval and search engine algorithms.

38
TF-IDF
● Cons
○ Ignores word context:
■ TF-IDF treats words independently, so it doesn’t account for the order of words or
their context (e.g., "not good" would be treated the same as "good not").
○ Sparse and high-dimensional
■ For large corpora, the resulting vectors can become very large and sparse, which
can be inefficient to store and process.
○ No semantic understanding
■ TF-IDF does not capture any semantic relationships between words (e.g., "car"
and "automobile" are treated as unrelated). It only considers word frequency.
○ Doesn't handle synonyms
■ Since it only counts exact word matches, synonyms are treated as completely
different terms, which may not reflect their actual relationship.

39
N-grams

● N-grams are a sequence of n consecutive


words (or characters) from a given text.

● They are used to capture the local word


context and the order of words.

● N-grams break the text into groups of n


words, allowing algorithms to consider
neighboring words when analyzing text.
○ Unigrams: Single words (n=1).
○ Bigrams: Two consecutive words (n=2).
○ Trigrams: Three consecutive words (n=3), and so on.

40
N-grams

● Text Tokenization into N-Grams


○ First, we split the text into N-grams (e.g., unigrams, bigrams, trigrams).

○ Example sentence
■ "I love apples"
■ Unigrams (n=1): ["I", "love", "apples"]
■ Bigrams (n=2): ["I love", "love apples"]
■ Trigrams (n=3): ["I love apples"]

○ For multiple sentences, you would apply the same process to each sentence.
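A minimal sketch of n-gram tokenization using nltk.util.ngrams (word_tokenize as imported earlier in the lecture):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("I love apples")                  # ['I', 'love', 'apples']
bigrams = [" ".join(g) for g in ngrams(tokens, 2)]       # ['I love', 'love apples']
trigrams = [" ".join(g) for g in ngrams(tokens, 3)]      # ['I love apples']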

41
N-grams

● Create a Vocabulary of N-Grams


○ Create a vocabulary of unique N-grams from the entire corpus (set of
documents).
○ Each unique N-gram is assigned an index in the vocabulary.

○ For example, if we have these sentences:


■ Sentence 1: "I love apples"
■ Sentence 2: "I love oranges"

○ Unigram vocabulary
■ ["I", "love", "apples", "oranges"]
○ Bigram vocabulary
■ ["I love", "love apples", "love oranges"]

42
N-grams

● Each N-gram is given an index:


○ "I love" → index 0
○ "love apples" → index 1
○ "love oranges" → index 2
● Convert N-Grams into a Feature Vector
○ Once the N-grams are tokenized and you have a vocabulary, the next step is
to represent the text as a feature vector.
○ This is done using a document-term matrix, where:
■ Rows represent documents (or sentences).
■ Columns represent the unique N-grams from the vocabulary.
■ For each sentence, the value in the matrix can be:
● Frequency: The number of times the N-gram appears in the sentence.
● Binary Indicator (Presence/Absence): Whether the N-gram is present (1) or not (0).

43
N-grams

● Let’s create a document-term matrix where each cell contains the


frequency of the corresponding bigram:

● Each sentence has been converted into a vector of numbers


○ "I love apples" → [1, 1, 0]
○ "I love oranges" → [1, 0, 1]

44
N-grams

● Pros
○ Captures Local Word Order
■ N-grams preserve some of the order of words, especially compared to Bag of
Words (BoW), where word order is entirely ignored.
○ Improved Performance on Short Phrases
■ They are useful for capturing common phrases, idiomatic expressions, or short
word dependencies (e.g., "New York", "United States").
○ Simple to Implement
■ N-grams are relatively easy to compute and can be effective for tasks like text
classification or sentiment analysis.

45
N-grams

● Cons
○ High Dimensionality
■ As n increases, the number of possible N-grams explodes, leading to very
high-dimensional and sparse feature vectors, which are hard to work with
computationally.
○ Ignores Long-Range Dependencies
■ N-grams only capture short-range dependencies. They fail to model longer context
relationships (e.g., "The movie was not good" vs. "The movie was good").
○ Overfitting
■ For small datasets, N-gram models can overfit easily because they learn specific
word sequences that may not generalize well.

46
Questions?
47
