NLP Module 2
Lexical analysis
Lexical analysis is the first phase of a compiler or interpreter process, where the input source
code (usually in the form of text) is transformed into a sequence of tokens. These tokens
represent the meaningful elements or "lexical units" of the code, which can then be processed
by the syntax analyzer (parser).
Here's an overview of lexical analysis:
Purpose
1. Tokenization: It converts the raw source code into tokens, which are the smallest
units of meaningful information. Tokens could represent keywords, operators,
identifiers, literals, and punctuation symbols.
2. Filtering Comments and Whitespace: Lexical analysis removes irrelevant
information such as comments and unnecessary whitespace that does not affect the
program's logic.
3. Error Detection: It can also detect certain lexical errors, such as invalid characters or
malformed strings.
Components of Lexical Analysis
1. Input Stream: The source code provided by the user.
2. Lexical Analyzer (Scanner): The program responsible for reading the input stream
and producing tokens.
3. Tokens: The output of lexical analysis, a sequence of symbols that the parser
interprets according to the language's syntax rules.
Example
For the code:
x = 42 + 3
The lexical analysis would break it into tokens:
x (identifier)
= (assignment operator)
42 (integer literal)
+ (addition operator)
3 (integer literal)
How Lexical Analysis Works
1. Reading the Input: The lexical analyzer reads the source code character by character.
2. Pattern Matching: It checks the characters against predefined patterns (regular
expressions) to identify valid tokens.
3. Tokenization: Once a pattern is matched, the lexical analyzer creates a token and
moves on to the next part of the code.
4. Output: The analyzer produces a sequence of tokens for further processing by the
parser.
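The following is a minimal Python sketch of these steps, using regular expressions for pattern matching. The token names and patterns are illustrative and not tied to any particular language.
import re

# Illustrative token patterns for expressions like "x = 42 + 3".
TOKEN_SPEC = [
    ("INTEGER",    r"\d+"),          # integer literal
    ("IDENTIFIER", r"[A-Za-z_]\w*"), # identifier (or keyword candidate)
    ("ASSIGN",     r"="),            # assignment operator
    ("PLUS",       r"\+"),           # addition operator
    ("SKIP",       r"\s+"),          # whitespace: matched but discarded
]
MASTER_PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    # Scan the input left to right, matching each position against the patterns.
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER_PATTERN.match(source, pos)
        if not match:
            raise SyntaxError(f"Unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":  # drop whitespace
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("x = 42 + 3"))
# [('IDENTIFIER', 'x'), ('ASSIGN', '='), ('INTEGER', '42'), ('PLUS', '+'), ('INTEGER', '3')]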
Tools for Lexical Analysis
Lex: A tool used to generate lexical analyzers, commonly used in combination with
Yacc or Bison (for parsing).
Flex: A free reimplementation of Lex ("Fast Lexical Analyzer Generator") with additional features.
ANTLR: Another popular tool for generating lexical analyzers and parsers.
Importance
Efficiency: By breaking the input into tokens, lexical analysis makes it easier for the
parser to process the code.
Error Handling: It helps catch simple errors (such as unrecognized symbols) early in
the process.
Simplification: Reducing the source text to a stream of tokens gives the later
compiler or interpreter stages manageable units to work with.
Unsmoothed N-grams
Unsmoothed N-grams refer to a sequence of N items (such as words or characters) from a
given text or speech. In the context of language modeling, unsmoothed N-grams are simply
the raw N-grams extracted from the text without applying any smoothing techniques to
handle unseen N-grams.
N-grams Overview
An N-gram is a contiguous sequence of N items (words, characters, etc.) from a given
sample of text. The basic types are:
Unigrams (1-grams): Single words or items.
Bigrams (2-grams): Sequences of two adjacent words or items.
Trigrams (3-grams): Sequences of three adjacent words or items.
And so on for higher-order N-grams.
For example, in the sentence:
I am learning machine learning.
Unigrams (1-grams): I, am, learning, machine, learning
Bigrams (2-grams): I am, am learning, learning machine, machine learning
Trigrams (3-grams): I am learning, am learning machine, learning machine learning
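A short Python sketch of how such N-grams can be extracted from a tokenized sentence (the function name ngrams is illustrative):
def ngrams(tokens, n):
    # Return all contiguous n-length sequences of tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am learning machine learning".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams: ('I', 'am'), ('am', 'learning'), ...
print(ngrams(tokens, 3))  # trigrams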
Unsmoothed N-grams
An unsmoothed N-gram model counts the frequency of each N-gram in the text and
calculates the probability of a sequence of words based on those frequencies. This is the
maximum likelihood estimate (MLE) approach to estimating N-gram probabilities. The
probability of a word sequence in an unsmoothed N-gram model is calculated as:
P(w_1, w_2, \dots, w_N) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \dots \times P(w_N \mid w_1, w_2, \dots, w_{N-1})
Where:
P(w_1) is the probability of the first word.
P(w_2 \mid w_1) is the probability of the second word given the first word.
P(w_3 \mid w_1, w_2) is the probability of the third word given the first two words, and so on.
This calculation is based purely on observed frequencies of the N-grams in the training data.
Example of Unsmoothed N-gram Model
Given the following text:
the cat sat on the mat
Unigrams: the, cat, sat, on, the, mat
Bigrams: the cat, cat sat, sat on, on the, the mat
Count the frequencies of the bigrams:
the cat: 1
cat sat: 1
sat on: 1
on the: 1
the mat: 1
The probability of seeing "cat" after "the" in an unsmoothed bigram model is calculated as:
P(\text{cat} \mid \text{the}) = \frac{\text{count}(\text{the cat})}{\text{count}(\text{the})}
Where:
count(the cat) = 1 (number of times "the cat" appears).
count(the) = 2 (number of times "the" appears).
So, the probability would be:
P(\text{cat} \mid \text{the}) = \frac{1}{2} = 0.5
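As a rough illustration, the following Python sketch builds this unsmoothed bigram model from the toy corpus, reproduces the 0.5 estimate, and multiplies the bigram estimates together (the chain rule with a bigram approximation) to score the whole sentence. Names such as bigram_prob are illustrative.
from collections import Counter

corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w2, w1):
    # MLE estimate: count(w1 w2) / count(w1); zero if the bigram was never seen.
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

def sentence_prob(tokens):
    # Bigram approximation of the chain rule: P(w1) * P(w2 | w1) * ... * P(wN | wN-1).
    prob = unigram_counts[tokens[0]] / len(corpus)
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_prob(w2, w1)
    return prob

print(bigram_prob("cat", "the"))    # 1 / 2 = 0.5
print(bigram_prob("runs", "cat"))   # 0.0 -- unseen bigram
print(sentence_prob(corpus))        # (2/6) * 0.5 * 1 * 1 * 1 * 0.5 ≈ 0.083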
Issues with Unsmoothed N-grams
Unsmoothed N-grams have a few limitations:
1. Zero Probability for Unseen N-grams: If an N-gram does not appear in the training
data, its probability will be zero, making it impossible to calculate the likelihood of
any unseen sequence.
For example, if "cat runs" does not appear in the training data, then
P(\text{runs} \mid \text{cat}) = 0.
2. Data Sparsity: As N increases, the number of possible N-grams grows exponentially.
For large N, most N-grams will not be seen in the training data, making the model
inaccurate or unusable without smoothing.
Smoothing Techniques
To address these issues, smoothing techniques are applied to N-gram models. These
techniques adjust the probabilities to ensure that every possible N-gram (including unseen
ones) has a non-zero probability. Some popular smoothing methods include:
Laplace Smoothing: Adds a small constant (usually 1) to all N-gram counts to avoid
zero probabilities.
Kneser-Ney Smoothing: More sophisticated and generally more effective, especially
for language modeling tasks.
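As a rough sketch of the simplest option, Laplace (add-one) smoothing on the same toy bigram model might look like this, where V is the vocabulary size:
from collections import Counter

corpus = "the cat sat on the mat".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size (5 distinct words here)

def laplace_bigram_prob(w2, w1):
    # Add-one estimate: (count(w1 w2) + 1) / (count(w1) + V).
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(laplace_bigram_prob("cat", "the"))   # (1 + 1) / (2 + 5) ≈ 0.29
print(laplace_bigram_prob("runs", "cat"))  # (0 + 1) / (1 + 5) ≈ 0.17, no longer zero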
In summary, unsmoothed N-grams are basic statistical models based on raw counts of N-
grams, but they can suffer from issues like zero probabilities for unseen N-grams. Smoothing
techniques are used to improve their performance in practical applications.
Evaluating N-grams
In the context of natural language processing (NLP) and text analysis, N-grams refer to
contiguous sequences of N items (typically words or characters) from a given sample of
text. They are used for various tasks such as language modeling, text prediction, and feature
extraction.
Types of N-grams:
Unigrams (1-grams): Single words or tokens.
Bigrams (2-grams): Pairs of consecutive words.
Trigrams (3-grams): Triplets of consecutive words.
And so on for higher N-grams.
For example, given the sentence "I love programming," the following N-grams can be
extracted:
Unigrams: ["I", "love", "programming"]
Bigrams: ["I love", "love programming"]
Trigrams: ["I love programming"]
Evaluation of N-grams:
The evaluation of N-grams typically involves:
1. Frequency Count: Counting how often each N-gram appears in a given text or
corpus (a small counting sketch follows this list).
2. Contextual Analysis: Determining how the occurrence of N-grams contributes to text
understanding, sentiment analysis, or other NLP tasks.
3. Model Training: Using N-grams in models like n-gram language models, where they
help in predicting the next word or identifying word dependencies.
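A possible frequency-count sketch using collections.Counter (the corpus string here is illustrative):
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-length sequences of tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "I love programming and I love NLP".split()
bigram_freq = Counter(ngrams(corpus, 2))
print(bigram_freq.most_common(3))
# [(('I', 'love'), 2), (('love', 'programming'), 1), (('programming', 'and'), 1)]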
Key Points to Consider:
Sparsity: As N increases, the number of possible N-grams increases exponentially,
which can lead to sparsity, meaning many N-grams may not appear in the training
data.
Memory Requirements: Storing and processing higher-order N-grams can be
memory-intensive.
Smoothing: Smoothing techniques (like Laplace smoothing) are often used to
handle cases where certain N-grams don't appear in the training data but are likely to
occur in real-world text.
1. Transformation-Based Models
Overview:
Transformation-Based Models (TBMs), also known as Brill Tagging (named after the
linguist Eric Brill who developed it), are a rule-based machine learning approach where
rules are learned from training data to transform an initial set of labels (often based on simple
heuristics) into more accurate labels.
These models are based on iteratively improving the labeling system by applying rules that
"transform" the incorrect tags into correct ones. TBMs combine aspects of rule-based
systems with machine learning, making them unique and powerful.
How Transformation-Based Models Work:
1. Initial Labeling:
o The process begins by assigning an initial label to each word in the text (this
could be done using a simpler method, such as using a dictionary or a POS
tagger).
o For example, a first-pass labeling might assign tags based on simple heuristics
or by using a basic POS tagger.
2. Rule Learning:
o The model then learns rules that can correct the initial labeling. A rule
typically has the form:
if (Condition) then (Action)
where the Condition is based on the word or context (e.g., a specific word or
neighboring tags), and the Action is a transformation (e.g., changing a tag
from NN to VB).
o These rules are learned by looking at the disagreements between the initial
labeling and the correct labeling in the training data.
3. Rule Application:
o The rules are applied iteratively to improve the labeling. Each rule is applied
to the data to correct the tag sequence, and the model continues to refine its
rules to better align with the target labels.
4. Iterative Process:
o The process of learning and applying rules is iterative. After each round of
rule application, the system checks the errors, generates new rules, and applies
them again.
o The goal is to minimize error by refining the set of transformation rules.
Advantages of TBMs:
Flexibility: TBMs combine rule-based methods with machine learning, allowing them
to handle ambiguous or rare cases better than purely statistical methods.
Interpretability: The learned transformation rules are easy to understand, making the
model more interpretable.
Efficiency: TBMs can be computationally efficient, especially when the number of
rules to be learned is small.
Example:
Consider the sentence "I saw the dog". A simple first-pass tagger that picks each word's
most frequent tag might label it as:
"I/PRP saw/NN the/DT dog/NN" (since "saw" can also be a noun).
If the model learns the rule:
"If the previous tag is 'PRP' and the current tag is 'NN', change the tag to 'VBD'."
applying this rule transforms "saw/NN" into "saw/VBD", correcting the error.
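A minimal sketch of applying such a transformation rule in Python (the rule and initial tags come from the example above; apply_rule is an illustrative name):
def apply_rule(tagged, prev_tag, from_tag, to_tag):
    # If the previous token's tag is prev_tag and the current tag is from_tag,
    # rewrite the current tag to to_tag.
    corrected = list(tagged)
    for i in range(1, len(corrected)):
        word, tag = corrected[i]
        if corrected[i - 1][1] == prev_tag and tag == from_tag:
            corrected[i] = (word, to_tag)
    return corrected

initial = [("I", "PRP"), ("saw", "NN"), ("the", "DT"), ("dog", "NN")]
print(apply_rule(initial, prev_tag="PRP", from_tag="NN", to_tag="VBD"))
# [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('dog', 'NN')]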
Use Cases:
POS Tagging: One of the most common applications of TBMs is in part-of-speech
tagging, where a sequence of words is assigned correct POS tags.
Named Entity Recognition (NER): TBMs can also be used to tag named entities
(e.g., Person, Location, Organization) based on context.
Comparison: Transformation-Based Models vs. Maximum Entropy Models
Aspect | Transformation-Based Models | Maximum Entropy Models
Flexibility | Less flexible, heavily depends on rule construction | Highly flexible, can use any feature set
Training Data Requirement | Needs a labeled training corpus with correct tags | Needs a labeled corpus but can incorporate many types of features
Conclusion:
Transformation-Based Models (e.g., Brill Tagging) combine rule-based learning
with machine learning, applying transformations iteratively to improve initial
predictions.
Maximum Entropy Models use a probabilistic framework to model the distribution
of labels based on observed features, choosing the distribution that is as uniform
(least committed) as possible subject to the constraints provided by the data.
Both models have their strengths and are applied widely in tasks like POS tagging, Named
Entity Recognition, and other sequence labeling tasks in NLP.