Unit 1 Notes.pptx

Unit 1 covers language modeling techniques in natural language processing (NLP), including grammar-based and statistical models like N-Gram, Exponential, and Continuous Space models. It also discusses finite state automata, their types (DFA and NFA), and their applications in tasks like morphological analysis and stemming. Additionally, the document addresses morphemes, inflectional and derivational morphology, and the use of transducers for lexicon and rules in NLP.

Unit 1

Unit 1 reference
https://www.goseeko.com/reader/notes/other-university/bebtech-g/computer-engineering/level-4/semester/speech-natural-language-processing-1/unit-1-introduction-and-word-level-analysis-3?srsltid=AfmBOopmbN_DEmA6oguGAWQqjmb1xmGlOvE-agYIjaai6UjSoF1Eq7qx
Language modelling
• It is a technique in natural language processing (NLP) that
predicts the next word in a sentence.
• It uses statistical and probabilistic methods to analyze text data
and determine the likelihood of a word sequence.
How it works
1. Grammar-based LM
• In Natural Language Processing (NLP), grammar-based language modeling refers to a technique in which a language model explicitly incorporates grammatical rules and structures to predict the next word in a sequence. It relies on an understanding of syntax and sentence construction to generate more accurate and contextually relevant text, unlike purely statistical models, which may not fully capture grammatical relationships.
Ex: given the fragment “The dog”, a grammar-based model that encodes the rule S → NP VP would favour a verb-phrase continuation such as “barks” over another noun.
Statistical LM:
• Statistical Language Modeling, or Language Modeling and LM
for short, is the development of probabilistic models that can
predict the next word in the sequence given the words that
precede it.
• A statistical language model learns the probability of word
occurrence based on examples of text. Simpler models may
look at a context of a short sequence of words, whereas larger
models may work at the level of sentences or paragraphs. Most
commonly, language models operate at the level of words.
What are the types of statistical language
models?
• 1. N-Gram
• This is one of the simplest approaches to language modelling. Here, a probability distribution is created over sequences of ‘n’ words, where ‘n’ can be any number and defines the size of the gram (the sequence of words being assigned a probability). If n=4, a gram may look like: “can you help me”. Basically, ‘n’ is the amount of context that the model is trained to consider. There are different types of N-Gram models such as unigrams, bigrams, trigrams, etc. (a short sketch of extracting n-grams follows below).
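To make this concrete, here is a minimal sketch (not from the original notes) of how word n-grams can be extracted from text; the helper name extract_ngrams and the sample sentence are assumptions for illustration.

```python
# Minimal sketch (illustrative, not from the slides): extract word n-grams from a sentence.
def extract_ngrams(text, n):
    words = text.lower().split()
    # Slide a window of size n over the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(extract_ngrams("can you help me with this", 4))
# [('can', 'you', 'help', 'me'), ('you', 'help', 'me', 'with'), ('help', 'me', 'with', 'this')]
```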
Conti..
• 2. Exponential
• This type of statistical model evaluates text using an equation that combines n-grams and feature functions. Here the features and parameters of the desired results are already specified. The model is based on the principle of maximum entropy, which states that the probability distribution with the most entropy is the best choice. Exponential models make fewer statistical assumptions, which means the results are more likely to be accurate.
• 3. Continuous Space
• In this type of statistical model, words are arranged as a non-linear
combination of weights in a neural network. The process of assigning
weight to a word is known as word embedding. This type of model proves helpful in scenarios where the data set of words grows large and includes many unique words.
• In cases where the data set is large and consists of rarely used or
unique words, linear models such as n-gram do not work. This is
because, with increasing words, the possible word sequences
increase, and thus the patterns predicting the next word become
weaker.
How do you build a simple Statistical
Language Model?
• Language models start with a Markov assumption. This is a simplifying assumption that the (k+1)-th word depends only on the previous k words. A first-order assumption (k = 1) results in a bigram model. The models are trained using Maximum Likelihood Estimation (MLE) on an existing corpus. The MLE estimate is then simply a ratio of word counts.
N-Grams:
EX
• A classic example of a statistical language model (LM) in NLP
is a bigram model, which predicts the next word in a sentence
based solely on the previous word;
• for instance, if the current word is "the", the model might predict
"cat" as the next word with a higher probability because "the
cat" is a common phrase in language data.
Regular Expressions (regex)
• In Natural Language Processing (NLP), a regular expression
(regex) is a sequence of characters used to identify patterns
within text, allowing you to extract specific information like
phone numbers, dates, or email addresses; for example, to find
phone numbers, you might use a regex
like "[\d]{3}-[\d]{3}-[\d]{4}" which matches a pattern of three
digits, a hyphen, another three digits, a hyphen, and finally four
digits.
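The same pattern can be tried with Python's re module; this is a small illustrative sketch, and the sample text is made up.

```python
# Minimal sketch: matching the phone-number pattern from the example with Python's re module.
import re

text = "Call 555-123-4567 or 555-987-6543 before Friday."
pattern = r"\d{3}-\d{3}-\d{4}"  # three digits, hyphen, three digits, hyphen, four digits

print(re.findall(pattern, text))  # ['555-123-4567', '555-987-6543']
```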
Finite State Automata
• The theory of automata plays a significant role in providing
solutions to many problems in natural language processing.
• For example, speech recognition, spelling correction,
information retrieval, etc.
• Finite state methods are pretty useful in processing natural
language as the modeling of information using rules has many
advantages for language modeling.
• Finite state automaton has a mathematical model which is quite
understandable; data can be represented in a compacted form
using finite state automaton and it allows automatic compilation
of system components.
• Finite state automata (deterministic and non-deterministic finite
automata) provide decisions regarding the acceptance and
rejection of a string while transducers provide some output for a
given input.
• Thus the two machines are quite useful in language processing
tasks. Finite state automata are useful in deciding whether a
given word belongs to a particular language or not.
Features of Finite Automata

• Input: Set of symbols or characters provided to the machine.


• Output: Accept or reject based on the input pattern.
• States of Automata: The conditions or configurations of the
machine.
• State Relation: The transitions between states.
• Output Relation: Based on the final state, the output decision
is made.
• Mathematically, an automaton can be represented by 5-tuple
(Q, Σ, δ, q0, F), where −

► Q is a finite set of states.


► Σ is a finite set of symbols, called the alphabet of the
automaton
► δ is the transition function
► q0 is the initial state from where any input is processed (q0 ∈
Q)
► F is a set of final state/states of Q (F ⊆ Q)
• https://www.slideshare.net/slideshow/lecture-notesfinite-state-automata-for-nlppdf/255750487
Deterministic Finite Automata (DFA)

• A DFA is represented as (Q, Σ, δ, q0, F). In a DFA, for each input symbol, the machine transitions to one and only one state. A DFA does not allow any null transitions, meaning every state must have a transition defined for every input symbol.
• Example:

• Construct a DFA that accepts all strings ending with ‘a’.


• Given:
• Σ = {a, b},
• Q = {q0, q1},
• F = {q1}
• Transitions: δ(q0, a) = q1, δ(q0, b) = q0; δ(q1, a) = q1, δ(q1, b) = q0
• In this example, if the string ends in ‘a’, the machine reaches state q1, which is an accepting state.
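A minimal sketch of this DFA as a transition table in Python (the dictionary layout is an assumption made for illustration, not a standard library API):

```python
# DFA over {a, b} that accepts strings ending in 'a' (illustrative representation).
dfa = {
    "start": "q0",
    "accept": {"q1"},
    # delta[(state, symbol)] -> exactly one next state (no null moves).
    "delta": {("q0", "a"): "q1", ("q0", "b"): "q0",
              ("q1", "a"): "q1", ("q1", "b"): "q0"},
}

def dfa_accepts(dfa, string):
    state = dfa["start"]
    for symbol in string:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accept"]

print(dfa_accepts(dfa, "abba"))  # True  (ends in 'a')
print(dfa_accepts(dfa, "ab"))    # False (ends in 'b')
```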
Non-Deterministic Finite Automata (NFA)

• NFA is similar to DFA but includes the following features:


• It can transition to multiple states for the same input.

• It allows null (ϵ) moves, where the machine can change states without consuming any input.

• Example:
• Construct an NFA that accepts strings ending in ‘a’.
• Given:
• Σ = {a, b},
• Q = {q0, q1},
• F = {q1}
• Transitions: δ(q0, a) = {q0, q1}, δ(q0, b) = {q0}
• In an NFA, if any of the possible transition paths leads to an accepting state, the string is accepted.
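A minimal sketch of simulating this NFA by tracking the set of reachable states (an assumed representation for illustration; epsilon moves are not handled here):

```python
# NFA over {a, b} that accepts strings ending in 'a' (illustrative representation).
# delta maps (state, symbol) to a *set* of next states; missing entries mean no move.
nfa_delta = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}}
start, accept = "q0", {"q1"}

def nfa_accepts(string):
    current = {start}
    for symbol in string:
        # Follow every possible transition from every currently reachable state.
        current = set().union(*(nfa_delta.get((s, symbol), set()) for s in current))
    # Accept if any reachable state is an accepting state.
    return bool(current & accept)

print(nfa_accepts("ba"))  # True
print(nfa_accepts("ab"))  # False
```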
Step by Step working of Finite State
Transducer in NLP
• One common application of Finite-State Transducers (FSTs) in
Natural Language Processing (NLP) is morphological analysis,
which involves analyzing the structure and meaning of words
at the morpheme level. Here is an explanation of the application of FSTs to morphological analysis, with an example of stemming using a finite-state transducer for English.
Stemming with FSTs

• Stemming is the process of reducing words to their root or base


form, often by removing affixes such as prefixes and suffixes.
FSTs can be used to perform stemming efficiently by defining
rules for stripping affixes and producing the stem of a word.
• Step 1. Define the FST's States and Transitions
• Start by defining the states of the FST, representing different stages of stemming.

• Define transitions between states based on rules for removing suffixes.

• Example transitions:
• State 0: Initial state
• Transition: If the input ends with "ing", remove "ing" and transition to state 1.
• State 1: "ing" suffix removed
• Transition: If the input ends with "ly", remove "ly" and transition to state 2.
• State 2: "ly" suffix removed
• Final state: Output the stemmed word
• Step 2. Construct the FST
• Based on the defined states and transitions, construct the FST using a tool like OpenFST or write
code to implement the FST.
• Step 3. Apply the FST to Input Words:
• Given an input word, apply the FST to find the stem.

• The FST traverses through the states according to the input word and transitions until it reaches a
final state, outputting the stemmed word.

• Example Input and Output:


• 1. Input: "running"
• FST transitions: State 0 (input: "running") → State 1 (remove "ing") → State 2 (output: "run")
• 2. Input: "quickly"
• FST transitions: State 0 (input: "quickly") → State 1 (no "ing") → State 2 (remove "ly") → State 3 (output: "quick")
Morphological parsing and generation

• Morphological parsing is the process of producing a lexical


structure of a word, that is, breaking a word into stem and
affixes and labeling the morphemes with category labels.
• For example the word ‘books’ can be parsed as book+s and
further as book + N+PL, the word ‘went’ can be parsed as
go+V+PAST etc.
• Generation is the reverse process of parsing, that is, combining the lexical form of a word to produce the word. For example, box+N+PL generates the word ‘boxes’.
• Finite state transducers are quite useful in morphological parsing. Let us consider the lexical form of the regular inflectional noun ‘girl +N +PL’. A transducer (shown as a figure in the original slides) converts the lexical form ‘girl+N+PL’ into the surface word ‘girls’. For simplification, let us assume that x represents the word ‘girl’: for every regular singular noun the input and output remain the same in the transducer, which is why a variable x can be used for the noun. The word ‘girl’ can be replaced by any other regular noun like ‘boy’.
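For regular nouns, the mapping in both directions can be sketched as a small lookup table (an illustrative simplification of the transducer; the entries are assumed examples):

```python
# Minimal sketch: mapping lexical forms to surface forms (generation) and back (parsing).
lexicon = {
    "girl+N+PL": "girls",
    "boy+N+PL": "boys",
    "box+N+PL": "boxes",
}
surface_to_lexical = {v: k for k, v in lexicon.items()}

def generate(lexical_form):
    return lexicon.get(lexical_form)             # e.g. 'box+N+PL' -> 'boxes'

def parse(surface_form):
    return surface_to_lexical.get(surface_form)  # e.g. 'girls' -> 'girl+N+PL'

print(generate("box+N+PL"))  # 'boxes'
print(parse("girls"))        # 'girl+N+PL'
```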
• In a broader sense, we can say that morphology is the study of −
► The formation of words.
► The origin of the words.
► Grammatical forms of the words.
► Use of prefixes and suffixes in the formation of words.
► How parts-of-speech (PoS) of a language are formed.
Reference important
• https://www.syllabussolved.com/syllabuses/153-artificial-intelligence-and-machine-learning-al-504-b-rgpv/1733-english-morphology-transducers-for-lexicon-and-rules-tokenization
Morphemes and their types
• Morphemes are the smallest meaningful units of language.
They can be classified into two types: free morphemes and
bound morphemes.
1. Free Morphemes
• Free morphemes are standalone words that can convey
meaning on their own. Examples of free morphemes include
'dog,' 'book,' and 'run.'
2. Bound Morphemes
• Bound morphemes, on the other hand, cannot stand alone
and need to be attached to other morphemes. They modify
the meaning or function of the word. Examples of bound
morphemes include prefixes like 'un-' and suffixes like '-ed'
and '-s.'
Inflectional Morphology
• Inflectional morphology deals with the modification of words to express
grammatical relationships. It includes processes like pluralization, verb
conjugation, and the formation of comparative and superlative forms.
1. Pluralization
• Pluralization is the process of forming the plural form of a noun. It involves
adding suffixes like '-s' or '-es' to the base form of the noun. For example,
'cat' becomes 'cats,' and 'box' becomes 'boxes.'
2. Verb conjugation
• Verb conjugation involves modifying the form of a verb to indicate tense,
mood, aspect, and agreement with the subject. For example, the verb 'run'
can be conjugated as 'runs' (present tense, third person singular) or 'ran'
(past tense).
3. Comparative and superlative forms
• Comparative and superlative forms are used to compare the degree of an
adjective or adverb. They involve adding suffixes like '-er' and '-est' or using
the words 'more' and 'most.' For example, 'big' becomes 'bigger'
(comparative) and 'biggest' (superlative).
Derivational Morphology

• Derivational morphology involves the creation of new words by adding prefixes or


suffixes to existing words. It helps in expanding the vocabulary and creating words
with different meanings or grammatical categories.
1. Prefixes
• Prefixes are morphemes added at the beginning of a word to modify its meaning or
create a new word. For example, the prefix 'un-' added to the word 'happy' changes
its meaning to 'unhappy.'
2. Suffixes
• Suffixes are morphemes added at the end of a word to modify its meaning or create a
new word. For example, the suffix '-er' added to the word 'teach' changes its
meaning to 'teacher.'
3. Conversion
• Conversion is a process in which a word changes its grammatical category without
any change in form. For example, the noun 'book' can be converted into a verb by
using it in a sentence like 'I will book a flight.'
Morphological Analysis and
Generation
• Morphological analysis involves breaking down a word into its constituent
morphemes and identifying their meanings and grammatical properties. It helps in
understanding the structure and meaning of words.
1. Stemming
• Stemming is a process in which the affixes of a word are removed to obtain its base
or root form. It is a rule-based approach that helps in reducing words to their common
stem. For example, the word 'running' can be stemmed to 'run.'
2. Lemmatization
• Lemmatization is a process similar to stemming but aims to obtain the base form of a
word using vocabulary and morphological analysis. It considers the context and part
of speech of the word to determine its lemma. For example, the word 'better' can be
lemmatized to 'good.'
3. Word formation rules
• Word formation rules govern the creation of new words by combining morphemes.
These rules specify the permissible combinations of prefixes, suffixes, and base
forms to form valid words.
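If the NLTK library is available, stemming and lemmatization as described above can be tried as follows; this is a sketch that assumes NLTK is installed and the WordNet data has been downloaded, and exact outputs may vary slightly between versions.

```python
# Sketch using NLTK (assumes `pip install nltk`; the WordNet corpus is needed for lemmatization).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the lemmatizer's dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                  # 'run'  (rule-based suffix stripping)
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (dictionary-based, uses part of speech)
```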
Transducers for Lexicon and Rules

• Transducers are computational devices that transform input sequences into output sequences based on a set of rules.
• A. Definition and Purpose of Transducers
• In the context of NLP, transducers are used to represent and manipulate lexical entries and linguistic rules, enabling efficient language processing.
Lexicon Transducers

• Lexicon transducers are used to represent and manipulate lexical


entries, which are the building blocks of language. They provide
mappings between surface forms (words or phrases) and their
corresponding lexical entries.
1. Construction and representation of lexicons
• Lexicons are constructed by compiling a list of words or phrases along
with their associated information such as part of speech,
morphological properties, and semantic features. Lexicon transducers
represent this information and enable efficient retrieval and
manipulation of lexical entries.
2. Mapping between surface forms and lexical entries
• Lexicon transducers establish mappings between surface forms
(words or phrases) and their corresponding lexical entries. These
mappings are used to retrieve the relevant information associated with
a particular word or phrase.
Rule Transducers

• Rule transducers are used to represent and manipulate linguistic


rules that govern the transformation of linguistic structures.
These rules can be used to perform tasks such as syntactic
parsing, semantic analysis, and text generation.
1. Construction and representation of rules
• Rules are constructed by specifying the conditions and actions
for transforming linguistic structures. Rule transducers represent
these rules and enable efficient application of rule-based
transformations.
2. Rule-based transformations of linguistic structures
• Rule transducers apply rule-based transformations to linguistic
structures. These transformations can involve operations such
as substitution, deletion, insertion, and reordering of linguistic
elements.
Tokenization

• Tokenization is the process of dividing a text into smaller


units called tokens. These tokens can be words, sentences,
or even smaller units like characters. Tokenization is a
fundamental step in NLP tasks as it helps in text
preprocessing, information retrieval, and text analysis.
• A. Definition and Importance of Tokenization
• Tokenization is important in NLP because it provides the basic units for further analysis and processing.
Tokenization Techniques
• There are several techniques for tokenization, including rule-based
tokenization, statistical tokenization, and hybrid tokenization.
1. Rule-based Tokenization
• Rule-based tokenization involves defining a set of rules to determine
the boundaries between tokens. These rules can be based on
punctuation marks, white spaces, or other linguistic patterns.
2. Statistical Tokenization
• Statistical tokenization uses machine learning algorithms to learn
patterns from a large corpus of text. These algorithms analyze the
frequency and distribution of characters or words to determine token
boundaries.
3. Hybrid Tokenization
• Hybrid tokenization combines rule-based and statistical approaches to
achieve better accuracy and coverage. It uses rules as a starting point
and then applies statistical models to refine the token boundaries.
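A minimal sketch of rule-based tokenization using a single regular expression (an illustrative rule set, not a production tokenizer):

```python
# Minimal sketch: rule-based tokenization that keeps words (with internal apostrophes)
# and treats every other punctuation mark as its own token.
import re

def tokenize(text):
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(tokenize("Don't split state-of-the-art carelessly!"))
# ["Don't", 'split', 'state', '-', 'of', '-', 'the', '-', 'art', 'carelessly', '!']
```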
Challenges in Tokenization

• Tokenization can be challenging due to various factors such as


ambiguity in word boundaries and handling punctuation marks
and special characters.
1. Ambiguity in word boundaries
• In some languages, word boundaries are not clearly defined,
leading to ambiguity in tokenization. For example, in German,
compound words can be written as a single word or separated
by spaces.
2. Handling punctuation marks and special characters
• Punctuation marks and special characters can pose challenges in tokenization. For example, the apostrophe in contractions like 'don't' and the hyphens in words like 'state-of-the-art' need to be handled carefully.
Real-world Applications of
Tokenization
• Tokenization has various real-world applications in NLP and related
fields.
1. Text preprocessing in Natural Language Processing
• Tokenization is an essential step in text preprocessing, where it helps
in breaking down the text into meaningful units for further analysis. It
enables tasks like part-of-speech tagging, named entity recognition,
and syntactic parsing.
2. Information retrieval and search engines
• Tokenization is used in information retrieval systems and search
engines to index and retrieve documents based on user queries. It
helps in matching query terms with indexed tokens and retrieving
relevant documents.
3. Sentiment analysis and text classification
• Tokenization is used in sentiment analysis and text classification tasks
to extract features from text. It helps in representing text data in a
format suitable for machine learning algorithms.
Advantages and Disadvantages of
English Morphology, Transducers, and
Tokenization
• Advantages
1. Improved accuracy in language processing tasks
• English Morphology, Transducers, and Tokenization techniques help in
improving the accuracy of various language processing tasks like text
classification, sentiment analysis, and information retrieval.
2. Efficient handling of morphological variations
• English Morphology and Transducers enable efficient handling of
morphological variations in words. They can handle inflections,
derivations, and other morphological changes to ensure accurate
language processing.
3. Enhanced text understanding and analysis
• English Morphology, Transducers, and Tokenization techniques
provide a deeper understanding of the structure and meaning of text.
They enable advanced analysis and interpretation of textual data.
Disadvantages

1. Complexity in rule construction and maintenance


• English Morphology and Transducers involve the construction
and maintenance of complex rules and lexicons. Developing and
updating these resources can be time-consuming and require
linguistic expertise.
2. Difficulty in handling irregular forms and exceptions
• English Morphology and Transducers may face challenges in
handling irregular forms and exceptions in the language. These
irregularities can lead to errors or inconsistencies in language
processing tasks.
3. Potential loss of information during tokenization
• Tokenization may result in the loss of certain linguistic
information. For example, tokenizing a word like 'can't' into 'can'
and 't' may lose the contraction information.
Detecting and Correcting Spelling Errors
• There are many commercial as well as non-commercial spelling error
detection and correction tools available in the market for almost all
popular languages.
• Every tool works at the word level, with an integral dictionary/WordNet as the backend database for detection and correction.
• Every word from the text is looked up in the speller lexicon.
• When a word is not in the dictionary, it is detected as an error. In
order to correct the error, a spell checker searches the
dictionary/Wordnet for the word that is most resembled to the
erroneous word.
• These words are then suggested to the user, who chooses the intended word.
• Spell checking is used in various applications like machine translation, search, information retrieval, etc.
• The spell-checking technique comprises two stages:
• i. Error detection and
• ii. Error correction.
TYPES OF SPELL ERRORS
• Spelling errors are generally divided into two types:
• Typographic errors and
• Cognitive errors.
• Typographic errors (Non Word Errors): These errors occur when the
correct spelling of the word is known but the word is mistyped by
mistake. These errors are mostly related to the keyboard and
therefore do not follow any linguistic criteria.
• Cognitive errors (Real Word Errors): These errors occur when the
correct spellings of the word are not known. In the case of cognitive
errors, the pronunciation of the misspelled word is the same or similar to
the pronunciation of the intended correct word.
ERROR DETECTION
• For error detection, each word in a sentence or paragraph is tokenized using a tokenizer and checked for its validity. The candidate word is valid if it has a meaning; otherwise it is a non-word.
• Two commonly used techniques for error detection are
• N-gram analysis and
• Dictionary/Wordnet lookup.
N-gram Analysis
• N-gram analysis is a method to find incorrectly spelled words in
a mass of text. Instead of comparing each entire word in a text
to a dictionary, just n-grams are checked.
• A check is done by using an n-dimensional matrix where real
n-gram frequencies are stored. If a non-existent or rare n-gram
is found the word is flagged as a misspelling, otherwise not.
• An n-gram is a set of consecutive characters taken from a string
with a length of whatever n is set to
• This method is language independent as it requires no
knowledge of the language for which it is used. In this algorithm,
each string that is involved in the comparison process is split up
into sets of adjacent n-grams.
• The similarity between two strings is achieved by discovering
the number of unique n-grams that they share and then
calculating a similarity coefficient,
• i.e. the number of n-grams the two strings share (intersection) divided by the total number of n-grams in the two words (union): similarity = |ngrams(s1) ∩ ngrams(s2)| / |ngrams(s1) ∪ ngrams(s2)|.
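A minimal sketch of this similarity coefficient using character bigrams (the function names and the example pair are assumptions for illustration):

```python
# Minimal sketch: character-bigram similarity between two strings (intersection / union).
def char_ngrams(s, n=2):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=2):
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    return len(a & b) / len(a | b)  # shared n-grams divided by all distinct n-grams

print(round(ngram_similarity("beleive", "believe"), 2))  # 0.33 (3 shared bigrams out of 9 distinct)
```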
Dictionary Lookup
• A dictionary/WordNet is a lexical resource that contains a list of the correct words of a particular language. Non-word errors can be easily detected by checking each word against a dictionary.
• Drawbacks of this method
• i. Difficulty in keeping such a dictionary up to date and sufficiently extensive to cover all the words in a text.
• A large-scale lexical resource is given by a linguistic ontology that covers many words of a language and has a hierarchical structure based on the relationships between concepts.
• WordNet is a widely used lexical resource. It covers nouns, verbs, adjectives and adverbs.
ERROR CORRECTION
• Error correction consists of two steps:
• The generation of candidate corrections: The candidate
generation process usually makes use of a precompiled table of
legal n-grams to locate one or more potential correction terms.
• The ranking of candidate corrections: The ranking process usually invokes some lexical similarity measure between the misspelled string and the candidates, or a probabilistic estimate of the likelihood of the correction, to rank-order the candidates.
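A minimal sketch of the two steps, with a toy dictionary and difflib's built-in similarity ratio standing in for the lexical similarity measure (any other measure, such as edit distance, could be substituted):

```python
# Minimal sketch: generate candidate corrections, then rank them by lexical similarity.
import difflib

dictionary = ["receive", "recipe", "relieve", "believe"]  # toy dictionary for illustration

def correct(misspelled, top_k=3):
    # Step 1: candidate generation (here the whole dictionary; real systems prefilter, e.g. by n-grams).
    candidates = dictionary
    # Step 2: rank candidates by similarity to the misspelled string.
    ranked = sorted(candidates,
                    key=lambda w: difflib.SequenceMatcher(None, misspelled, w).ratio(),
                    reverse=True)
    return ranked[:top_k]

print(correct("recieve"))  # 'receive' should rank at or near the top on this toy dictionary
```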
SPELL-CORRECTION ALGORITHMS
• https://www.studocu.com/in/document/university-of-madras/computer-application/2021-26-spelling-err-detectn-and-correctn/31242105
• The isolated-word methods that will be described here are the
most studied spelling correction algorithms, they are:
• i. Edit distance
• ii. Similarity keys
• iii. Rule-based techniques
• iv. n-gram-based techniques
• v. Probabilistic techniques
• vi. Neural networks and
• vii. Noisy channel model
• All of these methods can be thought of as calculating a distance between the misspelled word and each word in the dictionary or index. The shorter the distance, the higher the dictionary word is ranked.
Edit Distance

• It is one of the simplest methods, based on the assumption that the person usually makes few errors, i.e. only one erroneous character operation (insertion, deletion, or substitution) is necessary to convert a dictionary word into the non-word.
• Edit distance is useful for correcting errors resulting from
keyboard input, since these are often of the same kind as the
allowed edit operations.
• It is not quite as good for correcting phonetic spelling errors,
especially if the difference between spelling and pronunciation is
big as in English or French.
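A minimal sketch of the Levenshtein edit distance computed by dynamic programming (an illustrative implementation of the idea described above):

```python
# Minimal sketch: Levenshtein edit distance between two strings.
def edit_distance(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = minimum edits to turn s[:i] into t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("speling", "spelling"))  # 1 (one insertion)
```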
