Word Level Analysis NLP Mod 2
● Morphology Analysis
● Stemming and Lemmatization
● Regular Expressions
● Finite State Transducers
● DFA & NDFA
● Morphology Parsing
● Language Model
● N-Grams: N-gram Language Model
Morphology Analysis
Morpheme: A morpheme is the smallest unit of form that has meaning in a given
language.
Example: Unlikely (un + like + ly) --- 3 morphemes.
Receive (re + ceive), transport (trans + port) --- 2 morphemes.
Division/Classification of morphemes
Derivational affixes
● Affixes that are used to derive words, i.e. to change a word from one class to
another, are derivational affixes, and the study of such morphemes is called
derivational morphology.
● Derivational morphemes may be either prefixes or suffixes. Example: disagreement
(dis + agree + ment).
Inflectional affixes
● Inflectional affixes are only suffixes; they express grammatical features such as
singular/plural and past/present forms of verbs.
● The study of such morphemes is called inflectional morphology. Examples: bigger
(big + er), loves (love + s).
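To make the idea concrete, here is a toy sketch of affix stripping in Python; the affix lists and the greedy strategy are illustrative assumptions, not a real morphological analyzer (note that without orthographic rules it splits bigger as bigg + er).

```python
# Toy morpheme segmentation: strip known prefixes/suffixes from a word.
# The affix lists and greedy strategy are illustrative assumptions only.
DERIVATIONAL_PREFIXES = ["dis", "un", "re", "trans"]
DERIVATIONAL_SUFFIXES = ["ment", "ly"]
INFLECTIONAL_SUFFIXES = ["er", "s"]

def segment(word):
    morphemes = []
    # Strip one derivational prefix, if present.
    for p in DERIVATIONAL_PREFIXES:
        if word.startswith(p):
            morphemes.append((p, "derivational prefix"))
            word = word[len(p):]
            break
    # Strip suffixes from the end, longest first, while any still match.
    suffixes = []
    changed = True
    while changed:
        changed = False
        for s in sorted(DERIVATIONAL_SUFFIXES + INFLECTIONAL_SUFFIXES,
                        key=len, reverse=True):
            if word.endswith(s) and len(word) > len(s) + 2:
                kind = ("derivational suffix" if s in DERIVATIONAL_SUFFIXES
                        else "inflectional suffix")
                suffixes.insert(0, (s, kind))
                word = word[:-len(s)]
                changed = True
                break
    return morphemes + [(word, "stem")] + suffixes

print(segment("disagreement"))  # dis + agree + ment
print(segment("unlikely"))      # un + like + ly
print(segment("loves"))         # love + s
```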
Stemming
Stemming reduces a word to its stem by chopping off affixes using crude heuristic
rules, so the output need not be a valid dictionary word. Example: studies → studi.
Lemmatization
Lemmatization reduces a word to its lemma (dictionary form) using a vocabulary and
morphological analysis, often taking the part of speech into account, so the output
is always a valid word. Examples: studies → study, better → good (as an adjective).
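A minimal sketch of both operations using the NLTK library (an assumption, since these notes do not name a toolkit; NLTK's wordnet data must be downloaded first):

```python
# Stemming vs. lemmatization with NLTK.
# Setup (one time): pip install nltk; then nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "feet", "better"]:
    print(word,
          "-> stem:", stemmer.stem(word),          # crude suffix stripping
          "| lemma:", lemmatizer.lemmatize(word))  # dictionary form (noun by default)

# The lemmatizer uses part of speech: as an adjective, "better" maps to "good".
print(lemmatizer.lemmatize("better", pos="a"))
```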
Regular Expressions
● A regular expression (RE) is a language for specifying text search strings.
● An RE helps us match or find strings, or sets of strings, using a specialized
syntax held in a pattern.
● Regular expressions are used to search text in UNIX tools as well as in MS Word
in much the same way.
● Various search engines also support a number of RE features.
● (a+b)* — the set of strings of a's and b's of any length, which also includes
the null string, i.e. {ε, a, b, aa, ab, bb, ba, aaa, …}
● (a+b)*abb — the set of strings of a's and b's ending with the string abb,
i.e. {abb, aabb, babb, aaabb, ababb, …}
● (11)* — the set consisting of an even number of 1's, which also includes the
empty string, i.e. {ε, 11, 1111, 111111, …}
● (aa + ab + ba + bb)* — the set of strings of a's and b's of even length that can
be obtained by concatenating any combination of the strings aa, ab, ba and bb,
including the null string, i.e. {ε, aa, ab, ba, bb, aaab, aaba, …}
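These formal patterns can be tried out with Python's re module, where the union operator + is written | (or a character class); a small sketch with illustrative test strings:

```python
# Checking the formal-language examples with Python's re module.
# Formal "+" (union) becomes "|" or a character class; "*" is unchanged.
import re

patterns = {
    "(a+b)*":         r"[ab]*",
    "(a+b)*abb":      r"[ab]*abb",
    "(11)*":          r"(11)*",
    "(aa+ab+ba+bb)*": r"(aa|ab|ba|bb)*",
}

tests = ["", "abb", "ababb", "11", "111", "aaba", "aab"]
for name, pat in patterns.items():
    accepted = [t for t in tests if re.fullmatch(pat, t)]
    print(f"{name:15} accepts: {accepted}")
```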
Example of NDFA
Suppose an NDFA is defined by
● Q = {a, b, c},
● Σ = {0, 1},
● q0 = {a},
● F = {c},
● a transition function δ that maps each (state, input) pair to a set of possible
next states.
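A minimal sketch of how such an NDFA is simulated, by tracking the set of reachable states; since the transition table itself does not appear in these notes, the δ below is an assumed example over the same Q and Σ:

```python
# Simulating an NDFA by tracking the set of states reachable so far.
# NOTE: this delta is an assumed example; the notes' own table is not shown.
DELTA = {
    ("a", "0"): {"a", "b"},
    ("a", "1"): {"b"},
    ("b", "0"): {"c"},
    ("b", "1"): {"a", "c"},
    ("c", "0"): {"c"},
    ("c", "1"): {"c"},
}
START, FINAL = "a", {"c"}

def accepts(s):
    states = {START}  # all states the machine could currently be in
    for ch in s:
        states = set().union(*(DELTA.get((q, ch), set()) for q in states))
    return bool(states & FINAL)  # accept if any path ends in a final state

print(accepts("01"))  # True: one path is a -0-> b -1-> c
print(accepts("1"))   # False: only b is reachable
```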
Types of Morphemes
Morphemes, the smallest meaning-bearing units, can be divided into two types −
● Stems
● Affixes
Stems
The stem is the core meaningful unit of a word; we can also say that it is the root
of the word. For example, in the word foxes, the stem is fox.
Affixes
As the name suggests, affixes add additional meaning and grammatical functions to
words. For example, in the word foxes, the affix is -es.
Word Order
The order of morphemes within a word is decided by morphological parsing. Let us now
see the requirements for building a morphological parser −
Lexicon
The very first requirement for building a morphological parser is a lexicon, which
includes the list of stems and affixes along with basic information about them, for
example whether a stem is a noun stem or a verb stem.
Morphotactics
This is basically the model of morpheme ordering, in other words a model explaining
which classes of morphemes can follow which other classes inside a word. For
example, one morphotactic fact of English is that the plural morpheme always follows
the noun rather than preceding it.
Orthographic rules
These spelling rules are used to model the changes that occur when morphemes combine
in a word. For example, the rule converting y to ie in words like city + -s =
cities, not citys.
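The three requirements can be seen working together in a toy pluralizer; a minimal sketch, assuming a hand-made lexicon and only two spelling rules (real parsers use finite-state transducers):

```python
# Toy morphological generator: lexicon + morphotactics + orthographic rules.
# The lexicon entries and the two spelling rules are illustrative assumptions.
LEXICON = {"fox": "noun-stem", "city": "noun-stem", "walk": "verb-stem"}

def pluralize(stem):
    # Morphotactics: the plural affix -s attaches only AFTER a noun stem.
    if LEXICON.get(stem) != "noun-stem":
        raise ValueError(f"{stem!r} is not a noun stem; plural -s cannot attach")
    # Orthographic rules: y -> ie before -s; insert -e- after sibilants like x.
    if stem.endswith("y"):
        return stem[:-1] + "ies"  # city + s -> cities, not citys
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"        # fox + s -> foxes
    return stem + "s"

print(pluralize("city"))  # cities
print(pluralize("fox"))   # foxes
```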
Language Model/Modeling
Spelling Correction: to spell-correct the sentence "Put you name into form", the
language model should prefer
● P(name into form) > P(name into from)
Speech Recognition: "Call my nurse."
● P(Call my nurse.) ≫ P(coal miners)
Example 2: "I have no idea."
● P(no idea.) ≫ P(No eye deer.)
Language models are also used in summarization, question answering, sentiment
analysis, etc.
Working of Language Model
Part 1: Defining Language Models
The goal of probabilistic language modelling is to calculate the probability of a
sentence as a sequence of words:

P(W) = P(w_1, w_2, …, w_n)

which, by the chain rule of probability, can also be used to find the probability of
the next word in the sequence:

P(w_1, …, w_n) = ∏_i P(w_i | w_1, …, w_{i-1}), e.g. P(w_5 | w_1, w_2, w_3, w_4)

This is a lot to calculate. Could we not simply estimate it by counting and
dividing, as shown in the following formula?

P(w_i | w_1, …, w_{i-1}) = count(w_1, …, w_{i-1}, w_i) / count(w_1, …, w_{i-1})

In general, no! There are far too many possible sentences for all the required
counts to be observed, so the data would be extremely sparse, making the results
unreliable. The Markov Assumption therefore approximates the full history by the
last k words:

P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-k}, …, w_{i-1})

so if k = 1 this is P(w_i | w_{i-1}), or if k = 2:

P(w_i | w_{i-2}, w_{i-1})
N-gram Models
From the Markov Assumption, we can formally define N-gram models, where k = n − 1,
as the following:

P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-(n-1)}, …, w_{i-1})

The simplest versions of this are the Unigram Model (n = 1, so k = 0) and the
Bigram Model (n = 2, so k = 1).

Unigram Model (n = 1):

P(w_1, w_2, …, w_n) ≈ ∏_i P(w_i)

Bigram Model (n = 2):

P(w_i | w_1, …, w_{i-1}) ≈ P(w_i | w_{i-1})
Small Example
Given a corpus with the following three sentences, we would like to find the probability
that “I” starts the sentence.
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
where "<s>" and "</s>" denote the start and end of the sentence respectively.
Therefore, we have:

P(I | <s>) = count(<s> I) / count(<s>) = 2/3 ≈ 0.67

In other words, of the three times a sentence started in our corpus, "I" appeared as
the first word twice.
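A minimal sketch reproducing this estimate from the three-sentence corpus; the whitespace tokenization and variable names are illustrative:

```python
# Bigram MLE estimate P(I | <s>) from the toy three-sentence corpus.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs

# MLE: P(w2 | w1) = count(w1 w2) / count(w1)
p = bigrams[("<s>", "I")] / unigrams["<s>"]
print(f"P(I | <s>) = {bigrams[('<s>', 'I')]}/{unigrams['<s>']} = {p:.2f}")  # 2/3 ≈ 0.67
```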