Unit 1 Notes.pptx
Unit 1 reference
https://www.goseeko.com/reader/notes/other-university/bebtech-g/computer-engineering/level-4/semester/speech-natural-language-processing-1/unit-1-introduction-and-word-level-analysis-3?srsltid=AfmBOopmbN_DEmA6oguGAWQqjmb1xmGlOvE-agYIjaai6UjSoF1Eq7qx
Language modelling
• It is a technique in natural language processing (NLP) that
predicts the next word in a sentence.
• It uses statistical and probabilistic methods to analyze text data
and determine the likelihood of a word sequence.
How it works
1. Grammar-based LM
• In Natural Language Processing (NLP), "grammar-based language modeling" refers to a technique where a language model explicitly incorporates grammatical rules and structures to predict the next word in a sequence. It relies on an understanding of syntax and sentence construction to generate more accurate and contextually relevant text, unlike purely statistical models, which may not fully capture grammatical relationships.
Ex: a grammar rule such as S → NP VP tells the model that after a determiner like "the", a noun is far more likely to follow than a verb.
2. Statistical LM:
• Statistical Language Modeling (Language Modeling, or LM for short) is the development of probabilistic models that can predict the next word in a sequence given the words that precede it.
• A statistical language model learns the probability of word
occurrence based on examples of text. Simpler models may
look at a context of a short sequence of words, whereas larger
models may work at the level of sentences or paragraphs. Most
commonly, language models operate at the level of words.
What are the types of statistical language
models?
• 1. N-Gram
• This is one of the simplest approaches to language modelling. Here, a probability distribution is created over sequences of 'n' words, where 'n' can be any number and defines the size of the gram (the sequence of words being assigned a probability). If n = 4, a gram may look like: "can you help me". Basically, 'n' is the amount of context that the model is trained to consider. There are different types of N-Gram models such as unigrams, bigrams, trigrams, etc.
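A minimal sketch of extracting n-grams from a tokenized sentence (the helper name and example sentence are illustrative):

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "can you help me please".split()
print(ngrams(tokens, 2))  # bigrams: ('can', 'you'), ('you', 'help'), ...
print(ngrams(tokens, 4))  # 4-grams: the first is ('can', 'you', 'help', 'me')
```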
• 2. Exponential
• This type of statistical model evaluates text using an equation that combines n-grams with feature functions, where the features and parameters of the desired results are specified in advance. The model is based on the maximum-entropy principle, which states that the probability distribution with the highest entropy (subject to the feature constraints) is the best choice. Exponential models make fewer statistical assumptions, which means the chances of obtaining accurate results are higher.
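The combination of feature functions described above can be sketched as a log-linear (maximum-entropy) score, P(w | h) ∝ exp(Σᵢ λᵢ fᵢ(h, w)). The feature functions, weights, and vocabulary below are hypothetical toy values, not part of any real model:

```python
import math

def maxent_prob(history, word, vocab, features, weights):
    """Log-linear probability: P(word | history) = exp(sum_i w_i * f_i) / Z."""
    def score(w):
        return math.exp(sum(wt * f(history, w) for f, wt in zip(features, weights)))
    z = sum(score(v) for v in vocab)  # normalising constant over the vocabulary
    return score(word) / z

# Two toy feature functions (hypothetical, for illustration only):
features = [
    lambda h, w: 1.0 if (h[-1], w) == ("the", "cat") else 0.0,  # a bigram feature
    lambda h, w: 1.0 if w.endswith("s") else 0.0,               # a suffix feature
]
weights = [2.0, 0.5]
vocab = ["cat", "dogs", "runs"]
p = maxent_prob(["the"], "cat", vocab, features, weights)  # high, since "the cat" fires
```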
• 3. Continuous Space
• In this type of statistical model, words are represented as a non-linear combination of weights in a neural network. The process of assigning a weight vector to a word is known as word embedding. This type of model proves helpful in scenarios where the vocabulary continues to grow and includes unique words.
• In cases where the data set is large and consists of rarely used or unique words, count-based models such as n-grams do not work well. This is because, as the vocabulary grows, the number of possible word sequences increases, and the patterns for predicting the next word become weaker.
How do you build a simple Statistical
Language Model?
• Language models start with a Markov assumption: a simplifying assumption that the (k+1)-st word depends only on the previous k words. A first-order assumption (k = 1) results in a bigram model. The models are trained using Maximum Likelihood Estimation (MLE) on an existing corpus; the MLE estimate is then simply a ratio of word counts.
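The MLE recipe above, a bigram probability as a ratio of counts, can be sketched as follows (the toy corpus and the sentence-boundary markers <s>/</s> are illustrative):

```python
from collections import Counter

def train_bigram(corpus):
    """MLE bigram model: P(w2 | w1) = count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))   # consecutive word pairs
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

corpus = ["the cat sat", "the cat ran", "the dog sat"]
p = train_bigram(corpus)
p("the", "cat")  # 2/3: "cat" follows "the" in two of the three occurrences of "the"
```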
N-Grams: Example
• A classic example of a statistical language model (LM) in NLP
is a bigram model, which predicts the next word in a sentence
based solely on the previous word;
• for instance, if the current word is "the", the model might predict
"cat" as the next word with a higher probability because "the
cat" is a common phrase in language data.
Regular Expression (regex)
• In Natural Language Processing (NLP), a regular expression (regex) is a sequence of characters used to identify patterns within text, allowing you to extract specific information like phone numbers, dates, or email addresses. For example, to find phone numbers, you might use a regex like "[\d]{3}-[\d]{3}-[\d]{4}", which matches a pattern of three digits, a hyphen, another three digits, a hyphen, and finally four digits.
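A minimal sketch of that phone-number pattern using Python's re module (the sample text is made up):

```python
import re

text = "Call 555-867-5309 or 212-555-0123; fax: 555-0000."
pattern = r"\d{3}-\d{3}-\d{4}"   # three digits, hyphen, three digits, hyphen, four digits
print(re.findall(pattern, text))  # ['555-867-5309', '212-555-0123']
# "555-0000" is not matched: it lacks the middle group of three digits.
```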
Finite State Automata
• The theory of automata plays a significant role in providing
solutions to many problems in natural language processing.
• For example, speech recognition, spelling correction,
information retrieval, etc.
• Finite state methods are pretty useful in processing natural
language as the modeling of information using rules has many
advantages for language modeling.
• A finite state automaton has a mathematical model which is easy to understand; data can be represented in a compact form using finite state automata, and they allow automatic compilation of system components.
• Finite state automata (deterministic and non-deterministic finite
automata) provide decisions regarding the acceptance and
rejection of a string while transducers provide some output for a
given input.
• Thus the two machines are quite useful in language processing
tasks. Finite state automata are useful in deciding whether a
given word belongs to a particular language or not.
Features of Finite Automata
• It allows null (ϵ) moves, where the machine can change states without consuming any input.
• Example:
• Construct an NFA that accepts strings ending in ‘a’.
• Given:
• Σ = {a, b},
• Q = {q0, q1},
• F = {q1},
• δ(q0, a) = {q0, q1}, δ(q0, b) = {q0}
• Non-Deterministic Finite Automata (NFA)
In an NFA, if any possible sequence of transitions leads to an accepting state,
the string is accepted.
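The NFA above can be simulated by tracking the set of states reachable after each input symbol; this sketch hard-codes the transitions δ(q0, a) = {q0, q1} and δ(q0, b) = {q0}:

```python
def nfa_accepts(s):
    """Simulate the NFA for strings over {a, b} that end in 'a'.
    Track every state reachable so far; accept if any final state is reached."""
    delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
    states = {"q0"}
    for ch in s:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return bool(states & {"q1"})  # F = {q1}

nfa_accepts("ba")  # True: the string ends in 'a'
nfa_accepts("ab")  # False: the string ends in 'b'
```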
Step by Step working of Finite State
Transducer in NLP
• One common application of Finite-State Transducers (FSTs) in Natural Language Processing (NLP) is morphological analysis, which involves analyzing the structure and meaning of words at the morpheme level. Here is an explanation of the application of FSTs in morphological analysis, with an example of stemming using a finite-state transducer for English.
Stemming with FSTs
• Step 1. Define States and Transitions:
• State 0: Initial state
• Transition: If the input ends with "ing", remove "ing" and transition to state 1.
• State 1: "ing" suffix removed
• Transition: If the input ends with "ly", remove "ly" and transition to state 2.
• State 2: "ly" suffix removed
• Final state: Output the stemmed word
• Step 2. Construct the FST
• Based on the defined states and transitions, construct the FST using a tool like OpenFST or write
code to implement the FST.
• Step 3. Apply the FST to Input Words:
• Given an input word, apply the FST to find the stem.
• The FST traverses through the states according to the input word and transitions until it reaches a
final state, outputting the stemmed word.
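Steps 1-3 can be sketched as a toy Python stemmer mirroring the state transitions above; the minimum-length guards are an added assumption to avoid over-stripping short words like "sing":

```python
def stem(word):
    """Toy FST-style stemmer following the states above:
    state 0 -> strip a trailing 'ing' -> state 1 -> strip a trailing 'ly' -> final."""
    if word.endswith("ing") and len(word) > 4:   # state 0 -> state 1
        word = word[:-3]
    if word.endswith("ly") and len(word) > 3:    # state 1 -> state 2
        word = word[:-2]
    return word                                  # final state: emit the stem

stem("walking")  # 'walk'
stem("quickly")  # 'quick'
```

A real morphological analyzer would be compiled with a toolkit like OpenFST rather than hand-written conditionals, but the state-by-state logic is the same.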