
Module 2

Lexical analysis
Lexical analysis is the first phase of a compiler or interpreter process, where the input source
code (usually in the form of text) is transformed into a sequence of tokens. These tokens
represent the meaningful elements or "lexical units" of the code, which can then be processed
by the syntax analyzer (parsing).
Here's an overview of lexical analysis:
Purpose
1. Tokenization: It converts the raw source code into tokens, which are the smallest
units of meaningful information. Tokens could represent keywords, operators,
identifiers, literals, and punctuation symbols.
2. Filtering Comments and Whitespace: Lexical analysis removes irrelevant
information such as comments and unnecessary whitespace that does not affect the
program's logic.
3. Error Detection: It can also detect certain lexical errors, such as invalid characters or
malformed strings.
Components of Lexical Analysis
1. Input Stream: The source code provided by the user.
2. Lexical Analyzer (Scanner): The program responsible for reading the input stream
and producing tokens.
3. Tokens: The result of lexical analysis, typically represented as a sequence of symbols
that represent the language's syntax rules.
Example
For the code:
x = 42 + 3
The lexical analysis would break it into tokens:
 x (identifier)
 = (assignment operator)
 42 (integer literal)
 + (addition operator)
 3 (integer literal)
How Lexical Analysis Works
1. Reading the Input: The lexical analyzer reads the source code character by character.
2. Pattern Matching: It checks the characters against predefined patterns (regular
expressions) to identify valid tokens.
3. Tokenization: Once a pattern is matched, the lexical analyzer creates a token and
moves on to the next part of the code.
4. Output: The analyzer produces a sequence of tokens for further processing by the
parser (a minimal tokenizer sketch of these steps follows below).
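As a rough illustration of steps 1–4, here is a minimal regular-expression tokenizer in Python; the token names and patterns are invented for this example and do not come from any particular tool:

import re

# Illustrative token patterns, tried in order.
TOKEN_PATTERNS = [
    ("NUMBER",     r"\d+"),           # integer literals such as 42
    ("IDENTIFIER", r"[A-Za-z_]\w*"),  # names such as x
    ("ASSIGN",     r"="),             # assignment operator
    ("PLUS",       r"\+"),            # addition operator
    ("SKIP",       r"\s+"),           # whitespace is filtered out
]

MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(source):
    """Scan the input left to right, matching patterns and emitting tokens."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":   # drop whitespace, just as a real scanner drops comments
            continue
        tokens.append((kind, match.group()))
    return tokens

print(tokenize("x = 42 + 3"))
# [('IDENTIFIER', 'x'), ('ASSIGN', '='), ('NUMBER', '42'), ('PLUS', '+'), ('NUMBER', '3')]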
Tools for Lexical Analysis
 Lex: A tool used to generate lexical analyzers, commonly used in combination with
Yacc or Bison (for parsing).
 Flex: A modern version of Lex, offering enhanced features.
 ANTLR: Another popular tool for generating lexical analyzers and parsers.
Importance
 Efficiency: By breaking the input into tokens, lexical analysis makes it easier for the
parser to process the code.
 Error Handling: It helps catch simple errors (such as unrecognized symbols) early in
the process.
 Code Simplification: It reduces the raw input into manageable units, streamlining the
later phases of compilation or interpretation.

Unsmoothed N-grams
Unsmoothed N-grams refer to a sequence of N items (such as words or characters) from a
given text or speech. In the context of language modeling, unsmoothed N-grams are simply
the raw N-grams extracted from the text without applying any smoothing techniques to
handle unseen N-grams.
N-grams Overview
An N-gram is a contiguous sequence of N items (words, characters, etc.) from a given
sample of text. The basic types are:
 Unigrams (1-grams): Single words or items.
 Bigrams (2-grams): Sequences of two adjacent words or items.
 Trigrams (3-grams): Sequences of three adjacent words or items.
 And so on for higher-order N-grams.
For example, in the sentence:
I am learning machine learning.
 Unigrams (1-grams): I, am, learning, machine, learning
 Bigrams (2-grams): I am, am learning, learning machine, machine learning
 Trigrams (3-grams): I am learning, am learning machine, learning machine learning
Unsmoothed N-grams
An unsmoothed N-gram model counts the frequency of each N-gram in the text and
calculates the probability of a sequence of words based on those frequencies. This is the
maximum likelihood estimate (MLE) approach to estimating N-gram probabilities. The
probability of a word sequence in an unsmoothed N-gram model is calculated as:
P(w_1, w_2, \dots, w_N) = P(w_1) \times P(w_2 \mid w_1) \times P(w_3 \mid w_1, w_2) \times \dots \times P(w_N \mid w_1, w_2, \dots, w_{N-1})
Where:
 P(w_1) is the probability of the first word.
 P(w_2 \mid w_1) is the probability of the second word given the first word.
 P(w_3 \mid w_1, w_2) is the probability of the third word given the first two words, and so on.
This calculation is based purely on observed frequencies of the N-grams in the training data.
Example of Unsmoothed N-gram Model
Given the following text:
the cat sat on the mat
 Unigrams: the, cat, sat, on, the, mat
 Bigrams: the cat, cat sat, sat on, on the, the mat
Count the frequencies of the bigrams:
 the cat: 1
 cat sat: 1
 sat on: 1
 on the: 1
 the mat: 1
The probability of seeing "cat" after "the" in an unsmoothed bigram model is calculated as:
P(\text{cat} \mid \text{the}) = \frac{\text{count}(\text{the cat})}{\text{count}(\text{the})}
Where:
 count(the cat) = 1 (number of times "the cat" appears).
 count(the) = 2 (number of times "the" appears).
So, the probability would be:
P(\text{cat} \mid \text{the}) = \frac{1}{2} = 0.5
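A minimal sketch of this calculation in Python, using the same toy corpus (no sentence-boundary markers or smoothing are applied; the function name is illustrative):

from collections import Counter

tokens = "the cat sat on the mat".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def mle_bigram_prob(w_prev, w):
    """Unsmoothed (maximum likelihood) estimate: count(w_prev w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(mle_bigram_prob("the", "cat"))   # 1 / 2 = 0.5
print(mle_bigram_prob("cat", "runs"))  # 0.0 -- an unseen bigram gets zero probability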
Issues with Unsmoothed N-grams
Unsmoothed N-grams have a few limitations:
1. Zero Probability for Unseen N-grams: If an N-gram does not appear in the training
data, its probability will be zero, making it impossible to calculate the likelihood of
any unseen sequence.
For example, if "cat runs" does not appear in the training data, then
P(runs∣cat)=0P(\text{runs} \mid \text{cat}) = 0P(runs∣cat)=0.
2. Data Sparsity: As N increases, the number of possible N-grams grows exponentially.
For large N, most N-grams will not be seen in the training data, making the model
inaccurate or unusable without smoothing.
Smoothing Techniques
To address these issues, smoothing techniques are applied to N-gram models. These
techniques adjust the probabilities to ensure that every possible N-gram (including unseen
ones) has a non-zero probability. Some popular smoothing methods include:
 Laplace Smoothing: Adds a small constant (usually 1) to all N-gram counts to avoid
zero probabilities.
 Kneser-Ney Smoothing: More sophisticated and generally more effective, especially
for language modeling tasks.
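For reference, the usual add-one (Laplace) adjustment for a bigram takes the form below, where V is the vocabulary size (this is the standard textbook formulation rather than something stated above):

P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1} w_i) + 1}{\text{count}(w_{i-1}) + V}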
In summary, unsmoothed N-grams are basic statistical models based on raw counts of N-
grams, but they can suffer from issues like zero probabilities for unseen N-grams. Smoothing
techniques are used to improve their performance in practical applications.

Evaluating N-grams
In the context of natural language processing (NLP) and text analysis, N-grams refer to
contiguous sequences of N items (typically words or characters) from a given sample of
text. They are used for various tasks such as language modeling, text prediction, and feature
extraction.
Types of N-grams:
 Unigrams (1-grams): Single words or tokens.
 Bigrams (2-grams): Pairs of consecutive words.
 Trigrams (3-grams): Triplets of consecutive words.
 And so on for higher-order N-grams.
For example, given the sentence "I love programming," the following N-grams can be
extracted:
 Unigrams: ["I", "love", "programming"]
 Bigrams: ["I love", "love programming"]
 Trigrams: ["I love programming"]
Evaluation of N-grams:
The evaluation of N-grams typically involves:
1. Frequency Count: Counting how often each N-gram appears in a given text or
corpus.
2. Contextual Analysis: Determining how the occurrence of N-grams contributes to text
understanding, sentiment analysis, or other NLP tasks.
3. Model Training: Using N-grams in models like n-gram language models, where they
help in predicting the next word or identifying word dependencies.
Key Points to Consider:
 Sparsity: As N increases, the number of possible N-grams increases exponentially,
which can lead to sparsity, meaning many N-grams may not appear in the training
data.
 Memory Requirements: Storing and processing higher-order N-grams can be
memory-intensive.
 Smoothing: Smoothing techniques (such as Laplace smoothing) are often used to
handle cases where certain N-grams do not appear in the training data but are likely to
occur in real-world text.

Morphology and Finite State Transducers
Morphology is the branch of linguistics concerned with the structure and construction of
words. It involves the study of how words are formed from smaller units known as
morphemes (the smallest units of meaning), and how these morphemes combine to create
complex word forms.
Finite State Transducers (FSTs) are computational models used to represent and process the
morphology of natural languages. They are a type of finite state automaton (FSA), a formal
mathematical model that represents sequences of states and transitions between
them. FSTs are a powerful tool for modeling morphological processes, including word
formation, inflection, derivation, and compounding.
How FSTs Work
An FST consists of two components:
1. States: These are the points in the computation that represent different stages of the
word-forming process.
2. Transitions: These are the rules that govern how the machine moves from one state to
another. Transitions are labeled with pairs of symbols (or labels), with each pair
consisting of an input symbol and an output symbol. The FST is used to map one form
of a word to another.
For example, an FST could take a verb like "run" and produce its past tense form "ran" by
following a sequence of transitions that map the symbols of the base form to those of the
past-tense form.
Types of Morphological Processes Handled by FSTs
FSTs are particularly suited for handling regular morphological processes. Some common
tasks include:
1. Inflection: Changing the form of a word to express grammatical information such as
tense, number, or case. For example:
o "cat" → "cats" (pluralization)
o "run" → "ran" (past tense)
2. Derivation: Creating a new word by adding a prefix or suffix to a base word, which
may change the word's part of speech or meaning. For example:
o "happy" → "unhappy" (negative derivation)
o "teach" → "teacher" (noun derivation)
3. Compounding: Combining two words to form a new word. For example:
o "tooth" + "paste" → "toothpaste"
4. Reduplication: A process where part of a word is repeated for emphasis or to create a
grammatical effect.
Example of a Simple FST for English Plural Formation
Consider an FST that handles the transformation from singular to plural forms in English. For
example:
 "cat" → "cats" (regular plural formation)
 "box" → "boxes" (irregular plural formation)
For a simple FST, the process could involve:
1. Starting in an initial state.
2. Checking the final character of the word (e.g., if it ends in "x", add "es" to form the
plural, as in "boxes").
3. Otherwise, add "s" to make the word plural.
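The decision logic that such a transducer encodes can be sketched in plain Python as a toy stand-in (this is not a real FST toolkit, and the extra word endings handled below are an assumption beyond the two cases above):

def pluralize(word):
    """Toy stand-in for a pluralization transducer: inspect the final symbol(s)
    of the input and emit the corresponding output form."""
    if word.endswith(("x", "s", "z", "ch", "sh")):  # e.g. "box" -> "boxes"
        return word + "es"
    return word + "s"                                # default rule: "cat" -> "cats"

print(pluralize("cat"))  # cats
print(pluralize("box"))  # boxes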
FSTs for Morphological Analysis and Generation
 Analysis: Given a word form, FSTs can analyze it to decompose it into its morphemic
components. For instance, the word "unhappiness" can be analyzed as the root
"happy", the prefix "un-", and the suffix "-ness".
 Generation: FSTs can generate forms of a word by applying rules to create new word
forms. For example, given the verb "play", the FST can generate forms like "played"
(past tense) or "playing" (present participle).
Applications of FSTs
FSTs are widely used in natural language processing (NLP) tasks that involve morphological
analysis, such as:
 Spell-checking
 Machine translation
 Information retrieval
 Speech recognition and synthesis
FSTs are especially valuable for languages with regular morphological patterns (e.g.,
languages with predictable verb conjugations or noun declensions).
Advantages of FSTs
 Efficiency: FSTs can handle large lexicons and complex morphological rules while
maintaining computational efficiency.
 Simplicity: They provide a simple, elegant way of modeling complex linguistic
processes.
 Determinism: FSTs can be made deterministic (DFA), meaning there is only one
possible state transition for each input symbol, leading to efficient processing.
Conclusion
Finite State Transducers (FSTs) are a powerful tool in computational linguistics for modeling
and processing morphological systems in languages. They provide an effective way to
analyze and generate word forms based on their morphological structure, and they are widely
used in various NLP applications. Their ability to model regular morphological processes
makes them a popular choice for both linguistic research and practical language processing
tasks.

Interpolation and Backoff – Word Classes


Interpolation and Backoff in Natural Language Processing
Interpolation and backoff are two strategies used in smoothing techniques to handle unseen or
low-frequency events in language models, particularly in the context of n-gram models.
1. Interpolation
Interpolation involves combining different estimates of probability to smooth the model. This
technique blends probabilities from different sources, such as higher-order and lower-order n-
grams, to make more accurate predictions, especially for unseen or rare events. The goal is to
use a mixture of probabilities from different models to account for cases where the higher-
order n-gram model (e.g., trigram or bigram) may be too sparse.
Example of interpolation:
Suppose you're working with a bigram model and a trigram model:
 Bigram model: Calculates the probability of a word given the previous word.
 Trigram model: Calculates the probability of a word given the previous two words.
The probability of a word sequence can be interpolated by combining the probabilities from
both the bigram and trigram models:
P_{\text{interp}}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_{\text{bigram}}(w_n \mid w_{n-1}) + \lambda_2 P_{\text{trigram}}(w_n \mid w_{n-2}, w_{n-1})
Here, \lambda_1 and \lambda_2 are the interpolation weights, where \lambda_1 + \lambda_2 = 1.
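A minimal sketch of this interpolation in Python, assuming the component bigram and trigram estimators already exist as functions (the names and weight values are illustrative):

def interpolated_prob(w, prev2, prev1, p_bigram, p_trigram, lam1=0.4, lam2=0.6):
    """Blend the bigram and trigram estimates; lam1 + lam2 must sum to 1."""
    return lam1 * p_bigram(w, prev1) + lam2 * p_trigram(w, prev2, prev1)

Here p_bigram and p_trigram stand for whatever estimators are in use (for example, the MLE counts from the unsmoothed N-gram section), and the weights would normally be tuned on held-out data.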
2. Backoff
Backoff is another smoothing technique where, if a higher-order n-gram (e.g., trigram) is not
found in the training data, the model "backs off" to a lower-order n-gram (e.g., bigram or
unigram). Essentially, if the trigram probability is zero (or very low), the model will use the
bigram model, and if the bigram is also missing, the model will revert to the unigram model.
Example of backoff:
In a trigram model:
 If P(w_n \mid w_{n-2}, w_{n-1}) = 0, the model backs off to the bigram: P(w_n \mid w_{n-1})
 If the bigram probability is also zero, it will then use the unigram model: P(w_n)
In practice, backoff methods assign probabilities to sequences that are not observed in the
training data by "discounting" the probability mass of observed sequences and redistributing
it to unseen sequences.
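A simplified sketch of the backoff idea in Python; this version simply falls back when a count is zero, whereas a real backoff scheme (e.g., Katz backoff) also discounts and redistributes probability mass as described above:

def backoff_prob(w, prev2, prev1, trigram_counts, bigram_counts, unigram_counts, total):
    """Use the trigram estimate if observed, otherwise back off to bigram, then unigram."""
    if trigram_counts.get((prev2, prev1, w), 0) > 0:
        return trigram_counts[(prev2, prev1, w)] / bigram_counts[(prev2, prev1)]
    if bigram_counts.get((prev1, w), 0) > 0:
        return bigram_counts[(prev1, w)] / unigram_counts[prev1]
    return unigram_counts.get(w, 0) / total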
3. Combining Interpolation and Backoff
Many modern language models use a combination of both interpolation and backoff. For
example, Kneser-Ney smoothing, a widely-used technique in n-gram models, combines the
principles of interpolation and backoff, providing more robust smoothing.
Word Classes (Part of Speech)
In natural language processing, word classes, or parts of speech (POS), refer to the
categories that words can be classified into based on their function in a sentence. Common
word classes include:
1. Nouns: Names of people, places, things, or concepts (e.g., dog, city, happiness).
2. Verbs: Words that describe actions, occurrences, or states of being (e.g., run, is,
seem).
3. Adjectives: Words that describe or modify nouns (e.g., happy, blue, large).
4. Adverbs: Words that modify verbs, adjectives, or other adverbs (e.g., quickly, very,
too).
5. Pronouns: Words that replace nouns (e.g., he, she, they).
6. Prepositions: Words that show relationships between nouns or pronouns and other
parts of the sentence (e.g., on, in, under).
7. Conjunctions: Words that connect words, phrases, or clauses (e.g., and, but, or).
8. Interjections: Words or phrases that express strong emotion (e.g., oh, wow, oops).
9. Determiners: Words that introduce nouns, often providing additional information
about quantity or definiteness (e.g., the, a, some, my).
In the context of interpolation and backoff, word classes play a crucial role in defining how a
language model should treat different types of words. For instance, verbs and nouns may
follow different syntactic patterns, and recognizing their word classes can improve the
model’s ability to predict the next word in a sequence.
Summary:
 Interpolation combines different models (e.g., n-grams of different orders) to
estimate probabilities more robustly.
 Backoff "backs off" to lower-order models when higher-order ones have insufficient
data.
 Word classes (such as nouns, verbs, adjectives, etc.) are important for understanding
syntactic structure and improving language modeling.

Part of Speech Tagging – Markov Models


Part of Speech (POS) Tagging with Markov Models
POS tagging involves assigning each word in a sentence a part of speech label (e.g., noun,
verb, adjective). Markov models, specifically Hidden Markov Models (HMMs), are
commonly used for this task because they provide a probabilistic framework for predicting
the sequence of POS tags based on observed words.
How it works:
1. States and Observations:
o States: Each state represents a POS tag (e.g., NN for noun, VB for verb).
o Observations: Each observation corresponds to a word in the sentence.
2. Markov Assumption:
o The probability of a particular POS tag depends only on the previous tag (first-
order Markov assumption). So, the probability of a tag sequence given the words is proportional to:
P(t_1, t_2, \dots, t_n \mid w_1, w_2, \dots, w_n) \propto \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \cdot P(w_i \mid t_i)
where t_i is the POS tag at position i and w_i is the observed word at position i.
3. Training:
o The model is trained on a labeled corpus (e.g., sentences with POS tags). From
this data, it learns:
 Transition probabilities P(t_i \mid t_{i-1}) — the likelihood of transitioning
from one POS tag to another.
 Emission probabilities P(w_i \mid t_i) — the likelihood of a word w_i
occurring with a given POS tag t_i.
4. Decoding:
o Given a sequence of words, the goal is to find the most likely sequence of POS
tags. This is typically done using the Viterbi algorithm, which efficiently
computes the most probable tag sequence.
Example:
For the sentence "The dog runs":
 Observations: The, dog, runs
 States: DT (Determiner), NN (Noun), VBZ (Verb, 3rd person singular)
The model would calculate:
1. Transition probabilities (e.g., P(\text{NN} \mid \text{DT}), P(\text{VBZ} \mid \text{NN}))
2. Emission probabilities (e.g., P(\text{The} \mid \text{DT}), P(\text{dog} \mid \text{NN}), P(\text{runs} \mid \text{VBZ}))
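As a toy numerical sketch, with made-up probability values (the numbers below are invented purely for illustration), the score of the tag sequence DT NN VBZ for "The dog runs" would be computed as:

# Hypothetical transition and emission probabilities, for illustration only.
transition = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.5, ("NN", "VBZ"): 0.3}
emission = {("DT", "The"): 0.4, ("NN", "dog"): 0.01, ("VBZ", "runs"): 0.02}

words = ["The", "dog", "runs"]
tags = ["DT", "NN", "VBZ"]

score = 1.0
prev_tag = "<s>"  # assumed sentence-start marker
for tag, word in zip(tags, words):
    score *= transition[(prev_tag, tag)] * emission[(tag, word)]
    prev_tag = tag

print(score)  # product of the transition and emission probabilities above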
Summary:
 HMMs are used for POS tagging, with the goal of finding the most likely sequence of
POS tags for a given sentence.
 The model relies on transition probabilities (from one POS tag to another) and
emission probabilities (from a word to its POS tag).
 The Viterbi algorithm is used to decode the most probable tag sequence.

Hidden Markov Models (HMMs)
A Hidden Markov Model (HMM) is a statistical model used to represent systems that
follow a Markov process with hidden states. It is particularly useful in applications where the
system’s internal state cannot be directly observed, but the observations made at each time
step are influenced by those hidden states.
HMMs are widely used in areas like speech recognition, part-of-speech tagging, biological
sequence analysis, and natural language processing.
Key Components of an HMM:
An HMM consists of the following elements:
1. States:
o The system has a set of hidden states, typically denoted as S_1, S_2, \dots, S_N,
where N is the number of states.
o The states represent the underlying system that cannot be directly observed. In
POS tagging, for example, these hidden states would correspond to the POS
tags (e.g., Noun, Verb).
2. Observations:
o At each time step, an observable output is generated by the system, which
depends on the current hidden state. These observations are usually denoted as
O_1, O_2, \dots, O_T.
o The observations are visible and are typically words in a sentence or
phonemes in speech recognition.
3. Transition Probabilities:
o These are the probabilities of transitioning from one state to another. The
transition probability from state i to state j is denoted as A_{ij},
where A_{ij} = P(q_t = j \mid q_{t-1} = i).
o The transition matrix A is square, with dimensions N \times N.
4. Emission Probabilities:
o These are the probabilities of observing a particular observation given a
specific hidden state. The emission probability of observing O_t given
state S_i is denoted as B_{ij} = P(O_t = o_j \mid S_t = s_i).
o The emission matrix B is of size N \times M, where N is the number of states
and M is the number of possible observations.
5. Initial State Probabilities:
o These represent the probabilities of starting in each state at time t = 1.
The initial probability vector is denoted as \pi = [\pi_1, \pi_2, \dots, \pi_N],
where \pi_i = P(q_1 = S_i).
How HMMs Work:
Given the above components, an HMM is a model for a sequence of observations where the
true sequence of states is hidden, and we aim to infer the most likely sequence of hidden
states based on the observed data.
1. Forward Algorithm (Calculating the Probability of an Observation Sequence):
The forward algorithm calculates the probability of a sequence of observations
O_1, O_2, \dots, O_T given the model parameters. It uses dynamic programming to avoid
recalculating intermediate results and works as follows:
 Define the forward variable \alpha_t(i) as the probability of the partial
observation sequence O_1, O_2, \dots, O_t ending in state S_i:
\alpha_t(i) = P(O_1, O_2, \dots, O_t, q_t = S_i)
 The algorithm recursively computes \alpha_t(i) using:
\alpha_t(i) = \sum_{j=1}^{N} \alpha_{t-1}(j) \, A_{ji} \, B_i(O_t)
Here:
o A_{ji} is the transition probability from state j to state i,
o B_i(O_t) is the emission probability of observing O_t in state S_i,
o \alpha_{t-1}(j) is the probability of reaching state S_j at time t-1.
 The probability of the observation sequence is given by:
P(O_1, O_2, \dots, O_T) = \sum_{i=1}^{N} \alpha_T(i)
2. Viterbi Algorithm (Finding the Most Likely State Sequence):
The Viterbi algorithm is used to find the most likely sequence of hidden states (also called the
state sequence or path) given a sequence of observations.
 Let \delta_t(i) be the maximum probability of observing the sequence
O_1, O_2, \dots, O_t and ending in state S_i at time t:
\delta_t(i) = \max_{1 \leq j \leq N} \left( \delta_{t-1}(j) A_{ji} \right) B_i(O_t)
 The Viterbi algorithm uses dynamic programming to compute the most likely path by
recursively choosing the best transition to each state. The most likely sequence of
states can be traced back from the final time step TTT to the initial step using
backtracking.
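A compact sketch of the Viterbi recursion in Python, with the model supplied as dictionaries (a minimal illustration rather than an optimized implementation; replacing the max over j with a sum gives the forward algorithm described earlier):

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its probability."""
    # delta[t][s]: best probability of any path ending in state s at time t
    # back[t][s]:  the predecessor state on that best path
    delta = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]

    for t in range(1, len(observations)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda j: delta[t - 1][j] * trans_p[j][s])
            delta[t][s] = delta[t - 1][best_prev] * trans_p[best_prev][s] * emit_p[s][observations[t]]
            back[t][s] = best_prev

    # Backtrack from the best final state to recover the full tag/state path.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path, delta[-1][last]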
3. Baum-Welch Algorithm (Training an HMM):
The Baum-Welch algorithm is used to train an HMM when the model parameters
(transition, emission, and initial state probabilities) are unknown. It is a special case of the
Expectation-Maximization (EM) algorithm, and it works by iterating between two steps:
 Expectation step: Calculate the expected number of transitions and emissions for
each state and observation based on the current model parameters.
 Maximization step: Re-estimate the model parameters (transition, emission, and
initial probabilities) based on these expectations.
This iterative process improves the model parameters until they converge to values that
maximize the likelihood of the observed data.
Applications of HMMs:
HMMs are used in various fields due to their versatility in modeling sequential data:
1. Speech Recognition: In speech recognition, HMMs model the sequence of spoken
phonemes, where the observations are the acoustic features of the speech, and the
hidden states correspond to phonemes or words.
2. Part-of-Speech Tagging: In NLP, HMMs are used to assign POS tags to words in a
sentence by modeling the sequence of tags (hidden states) and the likelihood of words
given these tags (observations).
3. Bioinformatics: In DNA sequence analysis, HMMs are used to model the hidden
biological processes that generate observable sequences, such as in gene prediction.
4. Machine Translation: HMMs can be used to model the relationship between words
in source and target languages by treating the translation process as a sequence of
hidden states.
Summary:
A Hidden Markov Model (HMM) is a probabilistic model where:
 There is a sequence of hidden states (which we cannot observe directly).
 The system produces observable outputs based on those hidden states.
 The model is characterized by transition probabilities between states, emission
probabilities from states to observations, and initial state probabilities.
HMMs are powerful tools for modeling sequential data and are widely applied in fields such
as speech recognition, natural language processing, and bioinformatics.

Transformation-Based Models – Maximum Entropy Models


Transformation-Based Models and Maximum Entropy Models
In Natural Language Processing (NLP), there are various approaches to modeling language
data, with Transformation-Based Models and Maximum Entropy Models being two key
techniques used for tasks like Part-of-Speech (POS) tagging, Named Entity Recognition
(NER), and other sequence labeling tasks.
Let's dive into each of these models:

1. Transformation-Based Models
Overview:
Transformation-Based Models (TBMs), also known as Brill Tagging (named after the
linguist Eric Brill who developed it), are a rule-based machine learning approach where
rules are learned from training data to transform an initial set of labels (often based on simple
heuristics) into more accurate labels.
These models are based on iteratively improving the labeling system by applying rules that
"transform" the incorrect tags into correct ones. TBMs combine aspects of rule-based
systems with machine learning, making them unique and powerful.
How Transformation-Based Models Work:
1. Initial Labeling:
o The process begins by assigning an initial label to each word in the text (this
could be done using a simpler method, such as using a dictionary or a POS
tagger).
o For example, a first-pass labeling might assign tags based on simple heuristics
or by using a basic POS tagger.
2. Rule Learning:
o The model then learns rules that can correct the initial labeling. A rule
typically has the form:
if (Condition) then (Action)
where the Condition is based on the word or context (e.g., a specific word or
neighboring tags), and the Action is a transformation (e.g., changing a tag
from NN to VB).
o These rules are learned by looking at the disagreements between the initial
labeling and the correct labeling in the training data.
3. Rule Application:
o The rules are applied iteratively to improve the labeling. Each rule is applied
to the data to correct the tag sequence, and the model continues to refine its
rules to better align with the target labels.
4. Iterative Process:
o The process of learning and applying rules is iterative. After each round of
rule application, the system checks the errors, generates new rules, and applies
them again.
o The goal is to minimize error by refining the set of transformation rules.
Advantages of TBMs:
 Flexibility: TBMs combine rule-based methods with machine learning, allowing them
to handle ambiguous or rare cases better than purely statistical methods.
 Interpretability: The learned transformation rules are easy to understand, making the
model more interpretable.
 Efficiency: TBMs can be computationally efficient, especially when the number of
rules to be learned is small.
Example:
Let’s consider a simple sentence like:
 "I saw the dog". Initially, using a basic tagger, you might label it as:
 "I/PRP saw/VBD the/DT dog/NN".
If the model learns the rule:
"If the previous word is 'saw' and the current word is 'NN', change the tag to 'VB'."
It would then apply this rule to transform "dog/NN" into "dog/VB".
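A rough sketch of how such a transformation could be applied to an initial tagging (the rule representation below is invented for illustration and is far simpler than Brill's actual rule templates):

# Initial tagging from a simple first-pass tagger.
tagged = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("dog", "NN")]

# One learned transformation: (previous word, current tag) -> new tag.
rules = [({"prev_word": "the", "cur_tag": "NN"}, "VB")]

def apply_rules(tagged, rules):
    """Apply each transformation wherever its condition matches the current tagging."""
    out = list(tagged)
    for condition, new_tag in rules:
        for i in range(1, len(out)):
            word, tag = out[i]
            if out[i - 1][0] == condition["prev_word"] and tag == condition["cur_tag"]:
                out[i] = (word, new_tag)
    return out

print(apply_rules(tagged, rules))
# [('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('dog', 'VB')]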
Use Cases:
 POS Tagging: One of the most common applications of TBMs is in part-of-speech
tagging, where a sequence of words is assigned correct POS tags.
 Named Entity Recognition (NER): TBMs can also be used to tag named entities
(e.g., Person, Location, Organization) based on context.

2. Maximum Entropy Models


Overview:
Maximum Entropy Models, also known as MaxEnt models, are a class of statistical models
used for various NLP tasks, especially when the goal is to estimate a probability distribution
given partial knowledge. MaxEnt models are based on the principle of maximum entropy,
which suggests that the probability distribution should be as "uninformative" as possible,
given the constraints.
Principle of Maximum Entropy:
The idea behind the maximum entropy principle is that, when estimating probabilities, we
should choose the distribution that has the highest entropy (i.e., the least bias or assumptions),
subject to the constraints imposed by the observed data. In other words, given the available
data and constraints (like known feature values), the model should not make any assumptions
beyond what is necessary.
Mathematically, we want to find the probability distribution P(y \mid x) over a set of
outcomes y given some input x that maximizes the entropy:
\text{Entropy}(P) = -\sum_{y} P(y \mid x) \log P(y \mid x)
subject to constraints based on the feature functions:
\sum_{y} P(y \mid x) f_i(x, y) = \text{the value observed in the training data for feature } f_i
where f_i(x, y) are the features derived from the data.
How Maximum Entropy Models Work:
1. Feature Functions:
o The model uses a set of features f_i(x, y), which are functions of the
input x and the output y. These features capture important information
about the relationship between the input and output.
o For example, in a POS tagging task, one feature might be whether a word is
capitalized, or if a word is preceded by a determiner.
2. Training:
o The goal is to train the model to learn the best weights for each feature, which
maximizes the likelihood of the observed data while keeping the entropy (or
uncertainty) of the model as high as possible, given the constraints.
o This is done by adjusting the parameters iteratively through a process like
gradient descent.
3. Prediction:
o Once trained, the model can predict the most likely output for a given input by
using the learned probability distribution. This is done by computing the
probabilities for each possible output and selecting the one with the highest
probability.
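A small sketch of how a trained MaxEnt (log-linear) model scores candidate outputs, assuming binary feature functions and already-learned weights (all feature names, weights, and tags below are invented for illustration):

import math

# Hypothetical binary features and learned weights.
def f_capitalized(x, y):
    return 1.0 if x["word"][0].isupper() and y == "NNP" else 0.0

def f_prev_is_determiner(x, y):
    return 1.0 if x["prev_word"] in ("the", "a") and y == "NN" else 0.0

features = [f_capitalized, f_prev_is_determiner]
weights = [1.2, 0.8]
tags = ["NN", "NNP", "VB"]

def maxent_prob(x, y):
    """P(y | x) = exp(sum_i w_i * f_i(x, y)) / Z(x), with Z(x) normalizing over all tags."""
    def score(label):
        return math.exp(sum(w * f(x, label) for w, f in zip(weights, features)))
    return score(y) / sum(score(label) for label in tags)

x = {"word": "dog", "prev_word": "the"}
print({tag: round(maxent_prob(x, tag), 3) for tag in tags})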
Advantages of Maximum Entropy Models:
 Flexibility: MaxEnt models can handle a wide range of feature types (binary, discrete,
continuous, etc.).
 No Strong Assumptions: The maximum entropy principle ensures that the model
doesn't make unjustified assumptions about the data (e.g., assuming independence of
features unless supported by the data).
 Well-suited for Structured Prediction: MaxEnt is often used for sequence labeling
tasks like POS tagging, NER, and chunking.
Example:
For POS tagging, a MaxEnt model could use features such as:
 The word itself (e.g., dog),
 The previous word (e.g., saw),
 Whether the word starts with a capital letter,
 Whether the word is plural, etc.
These features would influence the probability of a tag for the word, and the model would
compute the likelihood of all possible tags for a word based on these features.
Use Cases:
 POS Tagging: MaxEnt models are commonly used for tagging words with their
appropriate part of speech.
 Named Entity Recognition (NER): MaxEnt is also used to classify named entities in
text.
 Text Classification: MaxEnt models can be used for classifying text into categories
(e.g., spam vs. non-spam emails).
 Machine Translation: MaxEnt models are sometimes applied in translation tasks,
where features might include word alignment, syntactic dependencies, etc.

Comparison: Transformation-Based Models vs. Maximum Entropy Models

Aspect | Transformation-Based Models | Maximum Entropy Models
Approach | Rule-based machine learning | Statistical model based on entropy maximization
Flexibility | Less flexible, heavily depends on rule construction | Highly flexible, can use any feature set
Interpretability | Highly interpretable, easy-to-understand rules | Less interpretable, as the model is more of a black box
Handling of Ambiguity | Handles ambiguity with learned rules | Handles ambiguity via feature-based probabilistic reasoning
Training Data Requirement | Needs a labeled training corpus with correct tags | Needs a labeled corpus but can incorporate many types of features

Conclusion:
 Transformation-Based Models (e.g., Brill Tagging) combine rule-based learning
with machine learning, applying transformations iteratively to improve initial
predictions.
 Maximum Entropy Models use a probabilistic framework to model the distribution
of labels based on observed features, ensuring that the distribution is as
"uninformative" as possible subject to the constraints provided by the data.
Both models have their strengths and are applied widely in tasks like POS tagging, Named
Entity Recognition, and other sequence labeling tasks in NLP.
