1. Introduction to Natural Language Processing
Natural language refers to the language spoken by people, e.g. English, Hindi, Marathi, as opposed to artificial or programming languages like C, C++, Java, etc.
A natural language or ordinary language is any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. Natural languages can take different forms, such as written text, speech or signing. They are distinguished from constructed and formal languages, such as those used to program computers or to study logic.
Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret and manipulate human language. NLP draws from many disciplines, including computer science and computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.
Natural Language Processing (NLP) is a field of research and application that explores the way computers can be used to understand and manage natural language text or speech to do useful things. The term "natural" in the context of language is used to distinguish human languages (such as Gujarati, English, Spanish and French) from computer languages (such as C, C++, Java and Prolog). The definition of Natural Language Processing clarifies that it is a theoretically motivated range of computational techniques (multiple methods or techniques for language analysis) for analyzing and representing naturally occurring text (such as English, Gujarati, etc.) at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.
2. Need for Natural Language Processing
Significant growth in the volume and variety of data is due to the amount of unstructured text data—in fact, up to 80% of all your data is unstructured text data. Companies collect huge amounts of documents, emails, social media posts, and other text-based information to get to know their customers better and to offer services or market their products. Most of this data is unused and untouched. Text analytics, through the use of natural language processing, can unlock the business value within these vast data assets so that businesses can fully utilize them.
Consider the example given in Figure 1:

    kufmmi mmmvw nnnfinaa3
    Ujiheaie elece mnster vensi credur
    Baboi oi cestnize
    COOVOE!2* ekk; IdsIIk Ikdf vnnjfi?
    Famgimiik mifin kire xnnnt

Figure 1: Sample text in natural form
Computers "see" text in English the same way you have just seen Figure 1. Normally, people have no trouble understanding natural language because they have common-sense knowledge, reasoning capacity, and experience. Unless we teach computers to do the same, they will not understand any natural language.
3. Goals of Natural Language Processing
- The ultimate goal of natural language processing is for computers to achieve human-like comprehension of texts/languages. When this is achieved, computer systems will be able to understand, draw inferences from, summarize, translate and generate accurate and natural human text and language.
- The goal of natural language processing is to specify a language comprehension and production theory to such a level of detail that a person is able to write a computer program which can understand and produce natural language.
- The basic goal of NLP is to accomplish human-like language processing. The choice of the word "processing" is very deliberate and should not be replaced with "understanding". For although the field of NLP was originally referred to as Natural Language Understanding (NLU), that goal has not yet been accomplished. A full NLU system would be able to:
- Paraphrase an input text.
- Translate the text into another language.
- Answer questions about the contents of the text.
- Draw inferences from the text.
4. Brief overview of NLP
The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics.
The essence of Natural Language Processing lies in making computers understand natural language. That's not an easy task, though. Computers can understand the structured form of data, like spreadsheets and the tables in a database, but human languages, texts, and voices form an unstructured category of data, and it gets difficult for the computer to understand it; there arises the need for Natural Language Processing.
There's a lot of natural language data out there in various forms, and it would get very easy if computers could understand and process that data. We can train the models in accordance with expected output in different ways. Humans have been writing for thousands of years; there are a lot of literature pieces available, and it would be really helpful if we made computers understand those. But the task is never going to be easy. There are various challenges floating out there, like understanding the correct meaning of a sentence, correct named-entity recognition (NER), correct prediction of various parts of speech, and coreference resolution (the most challenging thing in my opinion).
Computers can't truly understand human language. If we feed enough data and train a model properly, it can distinguish and try categorizing various parts of speech (noun, verb, adjective, etc.) based on previously fed data and experiences. If it encounters a new word, it tries making the nearest guess, which can be embarrassingly wrong a few times.
It's very difficult for a computer to extract the exact meaning from a sentence. For example: "The boy radiated fire like ..." Did he actually radiate fire? As you can see, extracting the intended meaning is going to be complicated.
[Figure 2 is a taxonomy diagram placing NLP within computer science, alongside databases, intelligence/algorithms, networking, and robots, with NLP branching into information retrieval (using ontology), machine translation, web search, and text categorization/summarization (including extractive summarization).]
Figure 2: NLP in the computer science taxonomy
5. History of NLP
NLP began in the 1950s as the intersection of artificial intelligence and linguistics. NLP was originally distinct from text information retrieval (IR), which employs highly scalable statistics-based techniques to index and search large volumes of text efficiently; Manning et al. [1] provide an excellent introduction to IR. With time, however, NLP and IR have converged somewhat. Currently, NLP borrows from several very diverse fields, requiring today's NLP researchers and developers to broaden their mental knowledge-base significantly.
Early simplistic approaches, for example word-for-word Russian-to-English machine translation, were defeated by homographs (identically spelled words with multiple meanings).
7. Levels of NLP
Natural Language Processing works on multiple levels and, most often, these different areas synergize well with each other. NLP can broadly be divided into various levels, as shown in the figure: morphological analysis, syntax, semantics, discourse, and contextual reasoning.
Phonology: It deals with the interpretation of speech sounds within and across words.
Morphology: It is a study of the way words are built up from smaller meaning-bearing units called morphemes. For example, the word 'fox' has a single morpheme, while the word 'cats' has two morphemes: the morpheme 'cat' and the morpheme '-s', which distinguishes the singular and plural concepts. A morphological lexicon is the list of stems and affixes together with basic information, such as whether a stem is a Noun stem or a Verb stem [21]. The detailed analysis of this level is discussed in chapter 4.
Syntax: It is a study of the formal relationships between words. It is a study of how words are clustered into classes and combine to form the grammatical structure of sentences.
Semantics: It is a study of the meaning of words and of how word meanings combine to form sentence meaning.
Discourse: It deals with text beyond the single sentence. In the discourse context, one level of resolution is the replacing of words such as pronouns. Discourse structure recognition determines the function of sentences in the text, which adds to the meaningful representation of the text.
Reasoning: To produce an answer to a question which is not explicitly stored in a database, a Natural Language Interface to Database (NLIDB) carries out reasoning based on the data stored in the database. For example, consider a database that holds academic information about students, and a user poses a query such as: 'Which student is likely to fail in the Maths subject?' To answer the query, NLIDB needs a domain expert to narrow down the reasoning process.
8. Knowledge in Language Processing
A natural language understanding system must have knowledge about what the words mean, how words combine to form sentences, how word meanings combine to form sentence meanings, and so on. The different forms of knowledge required for natural language understanding are given below.
PHONETIC AND PHONOLOGICAL KNOWLEDGE
Phonetics is the study of language at the level of sounds, while phonology is the study of the combination of sounds into organized units of speech, the formation of syllables and larger units. Phonetic and phonological knowledge are essential for speech-based systems, as they deal with how words are related to the sounds that realize them.
MORPHOLOGICAL KNOWLEDGE
Morphology concerns word formation. It is a study of the patterns of formation of words by the combination of sounds into minimal distinctive units of meaning called morphemes.
Morphological knowledge concerns how words are constructed from morphemes.
SYNTACTIC KNOWLEDGE
Syntax is the level at which we study how words combine to form phrases, phrases combine to form clauses, and clauses join to make sentences. Syntactic analysis concerns sentence formation. It deals with how words can be put together to form correct sentences. It also determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases.
SEMANTIC KNOWLEDGE
It concerns the meanings of the words and sentences. This is the study of context-independent meaning, that is, the meaning a sentence has regardless of the context in which it is used. Defining the meaning of a sentence is difficult because of the ambiguities involved.
PRAGMATIC KNOWLEDGE
It concerns the study of context: how the context in which a sentence is used affects its interpretation.
DISCOURSE KNOWLEDGE
This level of linguistic processing deals with the analysis of structure and meaning beyond a single sentence, making connections between words and sentences. Anaphora resolution is achieved by identifying the entity referenced by an anaphor (commonly in the form of, but not limited to, a pronoun). An example is shown in Figure 5: "... voted for Obama because he was most ...", where resolving "he" requires discourse-level processing.
Figure 5: Anaphora resolution illustration
from dé11. Applications of NLP
, information extraction, machine learning systems
question answering system, dialogue system, email fouting, telephone banking, speech
management, multilingual query processing, and natural language interface to database
system. Currently interactive applications may be classified into following categories
Speech Recognition / Speech Understanding and Synthesis / Speech Generation: A speech understanding system attempts to perform semantic and pragmatic processing of a spoken utterance to understand what the user is saying and act on what is being said. The research areas in this category include linguistic analysis and the design and development of efficient and effective algorithms for speech recognition and synthesis.
Language Translator: It is the task of automatically converting one natural language into another, preserving the meaning of the input text and producing an equivalent text in the output language. The research area in this category includes language modelling.
Information Retrieval (IR): It is a scientific discipline that deals with the analysis, design and implementation of computerized systems that address the representation, organization of, and access to large amounts of heterogeneous information encoded in digital format. The search engine is the best-known application of IR; it accepts a query from the user and returns the relevant documents. It returns documents, not the relevant answers; users are left to extract answers from the returned documents. The research areas in IR include information searching, information extraction, information categorization and information summarization from unstructured information.
Information Extraction: It includes the extraction of structured information from unstructured text. It is an activity of filling a predefined template from natural language text. The research areas in this category include identifying named entities, resolving anaphora, and identifying relationships between entities.
Question Answering (QA): It is passage retrieval in a specific domain. It involves finding answers for a given question from a large collection of documents.
Natural Language Interface to Database (NLIDB): It is a process of querying a database by asking questions in natural language.
Dialogue Systems: This is the study of dialog between humans and computers. A dialogue system determines the grammar and style of the user's sentence and, based on that, gives a response. The research areas in this category include the design of conversational agents, human-robot dialog, and the analysis of human-human dialog.
Text Generation: The task of automatically generating natural language text.
Discourse Management / Story Understanding: Identifying the discourse structure means identifying the nature of the discourse relationships between sentences, such as elaboration, explanation and contrast, and also identifying the speech acts in a chunk of text (for example, yes-no question, statement and assertion).
Expected Questions
1. What is Natural Language Processing (NLP)? Discuss various stages involved in the NLP process with suitable examples.
2. What is Natural Language Understanding? Discuss various levels of analysis under it with examples.
3. What do you mean by ambiguity in natural language? Explain with a suitable example. Discuss various ways to resolve ambiguity in NL.
4. What do you mean by lexical ambiguity and syntactic ambiguity in natural language? What are different ways to resolve these ambiguities?
5. List various applications of NLP and discuss any 2 applications in detail.
1. Morphology Analysis
What are words?
Words are the fundamental building block of language. Every human language, spoken, signed, or written, is composed of words. Every area of speech and language processing, from speech recognition to machine translation to information retrieval on the web, requires extensive knowledge about words. Psycholinguistic models of language processing and models from generative linguistics are also heavily based on lexical knowledge.
Words are orthographic tokens separated by white space. In some languages the distinction between words and sentences is less clear.
Chinese, Japanese: no white space between words
    nowhitespace -> no white space / no whites pace / now hit esp ace
Turkish: a word can represent a complete "sentence"
    E.g.: uygarlastiramadiklarimizdanmissinizcasina
    "(behaving) as if you are among those whom we could not civilize"
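To make the whitespace heuristic concrete, here is a minimal Python sketch (the sample strings are illustrative only): str.split() is enough for English text, but returns a single unsegmented token once the spaces are gone.

    # Naive tokenization: words as whitespace-separated orthographic tokens.
    text = "Words are orthographic tokens separated by white space"
    print(text.split())
    # ['Words', 'are', 'orthographic', 'tokens', 'separated', 'by', 'white', 'space']

    # Without whitespace the heuristic yields one token, mirroring the
    # segmentation ambiguity discussed above.
    print("nowhitespace".split())   # ['nowhitespace']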
Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning".
Consider a word like "unhappiness". This has three parts (morphemes): un-, happy, and -ness.
Stem: happy
Affixes: the prefix un- and the suffix -ness
Bound Morphemes: These are lexical items incorporated into a word as a dependent part. They cannot stand alone, but must be connected to another morpheme. Bound morphemes operate in the connection processes by means of derivation, inflection, and compounding. Free morphemes, on the other hand, are autonomous, can occur on their own and are thus also words at the same time. Technically, bound morphemes and free morphemes are said to differ in terms of their distribution or freedom of occurrence. As a rule, lexemes consist of at least one free morpheme.
Morphology handles the formation of words by using morphemes: a base form (stem, lemma), e.g. believe, and affixes (suffixes, prefixes, infixes), e.g. un-, -able, -ly.
Morphological parsing is the task of recognizing the morphemes inside a word (e.g. hands, foxes, children). It is important for many tasks like machine translation and information retrieval, and useful in parsing, text simplification, etc.
Survey of English Morphology
Morphology is the study of the way words are built up from smaller meaning bearing
units, morphemes. A morpheme is often defined as the minimal meaning-bearing unit in
a language. So for example the word fox consists of a single morpheme (the morpheme
fox) while the word cats consists of two: the morpheme cat and the morpheme -s. As
this example suggests, it is often useful to distinguish two broad classes of morphemes:
stems and affixes. The exact details of the distinction vary from language to language,
but intuitively, the stem is the ‘main’ morpheme of the word, supplying the main meaning,
while the affixes add 'additional' meanings of various kinds. Affixes are further divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede the stem, suffixes follow the stem, circumfixes do both, and infixes are inserted inside the stem. The word eats is composed of a stem eat and the suffix -s; the word unbuckle is composed of a stem buckle and the prefix un-. English doesn't have any good examples of circumfixes or infixes. For example, in Tagalog the affix um- is infixed into the stem hingi 'borrow' to produce humingi.
Prefixes and suffixes are often called concatenative morphology, since a word is composed of a number of morphemes concatenated together. A number of languages have extensive non-concatenative morphology, in which morphemes are combined in more complex ways. The Tagalog infixation example above is one example of non-concatenative morphology, since two morphemes (hingi and um) are intermingled.
Another kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology. This is very common in Arabic, Hebrew and other Semitic languages. In Hebrew, for example, a verb is constructed from a root, usually of three consonants and carrying the basic meaning, combined with a template that specifies the ordering of consonants and vowels. Derivation combines a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. For example, the verb computerize can take the derivational suffix -ation to produce the noun computerization.
Inflectional Morphology and Derivational Morphology
Morphemes are defined as smallest meaning-bearing units. Morphemes can be
classified in various ways. One common classification is to separate those morphemes
that mark the grammatical forms of words (-s, -ed, -ing and others) from those that form
new lexemes conveying new meanings, e.g. un- and -ment. The former morphemes
are inflectional morphemes and form a key part of grammar, the latter are derivational
morphemes and play a role in word-formation, as we have seen. The following criteria
help you to distinguish the two types:
* Effect: Inflectional morphemes encode grammatical categories and relations, thus
marking word-forms, while derivational morphemes create new lexemes.
* Position: Derivational morphemes are closer to the stem than inflectional morphemes,
cf. amendments (amend[stem] - ment[derivational] - s[inflectional]) and legalized (legal[stem] - ize[derivational] - ed[inflectional]).
* Productivity: Inflectional morphemes are highly productive, which means that they
can be attached to the vast majority of the members of a given class (say, verbs,
nouns or adjectives), whereas derivational morphemes tend to be more restricted
with regard to their scope of application. For example, the past morpheme can in principle be attached to all verbs; suffixation by means of the adjective-forming derivational morpheme -able, however, is largely restricted to dynamic transitive verbs, which excludes formations such as *bleedable or *lieable.
* Class properties: Inflectional morphemes make up a closed and fairly stable class of items which can be listed exhaustively, while derivational morphemes are much more numerous and more open to changes in their membership.
Both inflectional and derivational morphemes must be attached to a base; they cannot occur by themselves, in isolation, and are therefore bound morphemes.
Inflected words are variations of already existing lexemes with a different grammatical shape. Therefore many of the inflected forms are not listed separately in the dictionary. If you know the word surprise and look it up, you will not find a separate entry for surprise-s, which simply expresses a grammatical variant of it.
Derivational morphology, on the other hand, uses affixes to create new words out of already existing lexemes. Typical affixes are -ness, -ish, -ship and so on. These affixes do not change the grammatical form of a word the way inflectional affixes do, but instead often create a new meaning of the base or change the word class of the base. An example would be the word light. The plural form light-s would be considered inflectional morphology, but if we consider de-light, the prefix de- has changed the meaning of the word completely: we no longer think of light in the form of sunshine or lamps, but instead of a feeling. Also, if we consider light-en, the suffix -en has changed the word class of light from noun to verb.
INFLECTIONAL MORPHOLOGY
Inflection is a morphological process that adapts existing words so that they function
effectively in sentences without changing the category of the base morphemes.
Inflection can be seen as the "realization of morphosyntactic features through morphological means". But what exactly does that mean? It means that the inflectional forms of a word do not have to be listed in a dictionary, since we can guess their meaning from the root word. When we see such a word we know what it connects to, and most times we can even guess its difference from the original. For example, let us consider help-s, help-ed and help-er. According to what I have said about words listed in the dictionary, all of these variants might be inflectional morphemes, but does help-s really need an extra listing, or can we guess from the root help and the suffix -s what it means? Does our natural feeling and instinct for language tell us that the suffix -s indicates the third person singular, and that help-s is therefore just a variant of help (considering help as a verb and not a noun here)? Yes, it does. As a native speaker one instantly knows that -s, like the past-form indicator -ed, marks a grammatical variant of the root lexeme help; the root here being help, we can find it if we remove all affixes. To illustrate this, consider the following two sentences:
1. I help my grandmother in her garden.
2. He is my grandmother's help.
Here our general knowledge of words and their meaning shows us that in 1. help is used as a verb and expresses us working with our grandmother in order to support her. In 2. help is a noun and stands for the person that regularly supports my grandmother. This variation of a word without actually changing its form is called a zero morpheme; it cannot only distinguish verb and noun (which makes it a derivational morpheme) but also singular and plural, which makes it an inflectional morpheme. I will talk about this later in 2.2: Inflection in nouns, though.
‘We may define inflectional morphology as the branch of morphology that deals with
Paradigms. It is therefore concerned with two things: on the one hand, with the semantic
oppositions among categories; and on the other, with the formal means, including
inflections, that distinguish them.” (Matthews, 1991)
The essence of inflectional morphology is that it changes the word form, it determines the grammar, and it does not form a new lexeme but rather a variant of a lexeme that does not need its own entry in the dictionary.
word stem + grammatical morphemes
    cat + s (only for nouns, verbs, and some adjectives)
Nouns
    plural: regular: +s, +es; irregular: mouse - mice, ox - oxen
        many spelling rules, e.g. -y -> -ies: butterfly - butterflies
    possessive: +'s, +'
Verbs
    main verbs (sleep, eat, walk)
    modal verbs (can, will, should)
    primary verbs (be, have, do)
VERB INFLECTIONAL SUFFIXES
1. The suffix -s functions in the present simple as the marking of the third person singular of the verb: to work — he work-s
2. The suffix -ed functions in the past simple of regular verbs: to love — lov-ed
3. The suffixes -ed/-en function in the marking of the past participle and the perfect aspect: To study — studied — studied / To eat — ate — eaten
4. The suffix -ing functions in the marking of the present participle, the gerund, and in the marking of the continuous aspect: To eat — eating / To study — studying
NOUN INFLECTIONAL SUFFIXES
1. The suffix -s functions in the marking of the plural of nouns: dog — dogs
2. The suffix -'s functions as a possessive marker (Saxon genitive): Laura — Laura's book
ADJECTIVE INFLECTIONAL SUFFIXES
The suffix -er functions as a comparative marker: quick — quicker
The suffix -est functions as a superlative marker: quick — quickest
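The regular rules above are simple enough to sketch in code. The following toy Python function (my own simplification, not from the text) applies a small irregular lookup, the -y -> -ies spelling rule, and the regular +s/+es suffixes:

    # Toy pluralizer illustrating the inflectional rules listed above.
    IRREGULAR = {"mouse": "mice", "ox": "oxen", "goose": "geese"}

    def pluralize(noun: str) -> str:
        if noun in IRREGULAR:                          # irregular lookup
            return IRREGULAR[noun]
        if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
            return noun[:-1] + "ies"                   # butterfly -> butterflies
        if noun.endswith(("s", "x", "z", "ch", "sh")):
            return noun + "es"                         # fox -> foxes
        return noun + "s"                              # dog -> dogs

    assert pluralize("butterfly") == "butterflies"
    assert pluralize("mouse") == "mice"
    assert pluralize("dog") == "dogs"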
DERIVATIONAL MORPHOLOGY
Derivation is concerned with the way morphemes are connected to existing lexemes as affixes. Derivational morphology is a type of word formation that creates new lexemes, either by changing syntactic category or by adding substantial new meaning (or both) to a free or bound base. Derivation may be contrasted with inflection on the one hand and with compounding on the other. The distinctions between derivation and inflection and between derivation and compounding, however, are not always clear-cut. New words may be derived by a variety of formal means, including affixation, reduplication, internal modification of various sorts, subtraction, and conversion. Affixation is the best attested cross-linguistically, especially prefixation and suffixation. Reduplication is also widely found, along with various internal changes like ablaut and root-and-pattern morphology. Derived words may fit into a number of semantic categories: for nouns, personal and participant, collective and abstract formations are common; evaluative categories are well-attested, as are relational derivations. Languages frequently also have ways of deriving negatives and privatives. Most languages have derivational morphology of some sort.
ative. Most languages have deristudy of derivation has also been important in a num
concerning the perception and Production of language.
Derivational morphology is defined as morphology that creates new lexemes, either by
langing the syntactic category (part Of speech) of a base or by adding substantial, non=
grammatical meaning or both. On the one hand, derivation may be distinguished from
change category but rather modifies
ber of psycholinguistic debates
e number, case, tense, aspect, person, among
lay be distinguished from compounding, which
combining two or more bases rather than by affi
ternal modification of various sorts, Although thi
practice applying them is not always easy,
le can distinguish affixes in two principal types:
iT.
others. On the other hand, derivation
also creates new lexemes, but by
ixation, reduplication, subtraction, or
é distinctions are generally useful, in
Prefixes - attached at the beginning of a lexical item or base-morpheme — ex: un-,
pre-, post-, dis, im-, etc.
2. Suffixes — attached at the end of a lexical item ex: -age, “ng, ful, -able, “ness,
5 Aeensipeea
-hood, -ly, etc.
EXAMPLES OF MORPHOLOGICAL DERIVATION
a. Lexical item (free morpheme): like (verb)
b. Lexical item + prefix (bound morpheme) dis- = dislike (verb)
Derivational affixes can cause semantic change:
- The prefix pre- means before; post- means after; un- means not.
- Unhappy = not happy
- The prefix de- added to a verb conveys a sense of reversal or negativity: to decompose; to defame.
Derivation Versus Inflection
The distinction between derivation and inflection is a functional one rather than a formal one, as Booij (2000, p. 360) has pointed out. Either derivation or inflection may be effected by formal means like affixation, reduplication, internal modification of bases, and other morphological processes. But derivation serves to create new lexemes, while inflection prototypically serves to modify lexemes to fit different grammatical contexts, expressing categories like number, case, tense, aspect (perfective, imperfective, habitual) and person, among others. In the clearest cases, derivation changes category, for example taking a verb like employ and making it a noun (employment, employer, employee) or an adjective (employable), or taking a noun like union and making it a verb (unionize) or an adjective (unionized, unionesque). Derivation need not change category, however; for example, the derivation of abstract nouns from concrete ones in English (king ~ kingdom; child ~ childhood) leaves the category unchanged.
A criterion often invoked to distinguish inflection from derivation is that inflection is invariably relevant to syntax, derivation not. But Booij (1996) has argued that even this criterion is problematic unless we are clear what we mean by relevance to syntax. Case inflections, for example, mark grammatical context and are therefore clearly inflectional. Number-marking on verbs is arguably inflectional when it is triggered by the number of a subject or object, but number on nouns or tense and aspect on verbs is a matter of semantic choice, independent of grammatical configuration. Booij therefore distinguishes what he calls contextual inflection, inflection triggered by distinctions elsewhere in a sentence, from inherent inflection, inflection that does not depend on the syntactic context, the latter being closer to derivation than the former. Some theorists (Bybee, 1985; Dressler, 1989) postulate a continuum from derivation to inflection, with no clear dividing line between them. Another position is that of Scalise (1984), who has argued that evaluative morphology is neither inflectional nor derivational but rather constitutes a third category of morphology.
2. Stemming and Lemmatization
In natural language processing, there may come a time when you want your program to recognize that the words "ask" and "asked" are just different tenses of the same verb. This is the idea of reducing different forms of a word to a core root. Words that are derived from one another can be mapped to a central word or symbol, especially if they have the same core meaning.
Maybe this is in an information retrieval setting and you want to boost your algorithm's recall. Or perhaps you are trying to analyze word usage in a corpus and wish to condense related words so that you don't have as much variability. Either way, this technique of text normalization may be useful to you.
This is where something like stemming or lemmatization comes in.
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:
am, are, is => be
car, cars, car's, cars' => car
Stemming Algorithms: Examples
Porter stemmer: This stemming algorithm is an older one. It's from the 1980s, and its main concern is removing the common endings of words so that they can be resolved to a common form. It's not too complex, and development on it is frozen. Typically, it's a nice starting basic stemmer, but it's not really advised to use it for any production/complex application. Instead, it has its place in research as a nice, basic stemming algorithm that can guarantee reproducibility. It is also a very gentle stemming algorithm when compared to others.
Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That being said, it is also more aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer were because of issues noticed with the Porter stemmer. There is about a 5% difference in the way that Snowball stems versus Porter.
Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another algorithm that you can use. This one is the most aggressive stemming algorithm of the bunch. However, if you use the stemmer in NLTK, you can add your own custom rules to this algorithm very easily; it's a good choice for that. One complaint about this stemming algorithm, though, is that it sometimes is overly aggressive and can really transform words beyond recognition.
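All three stemmers are available in NLTK, so their behavior is easy to compare side by side. A minimal sketch (assumes NLTK is installed; exact outputs depend on the NLTK version):

    from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

    words = ["caresses", "ponies", "operational", "generously"]
    for stemmer in (PorterStemmer(), SnowballStemmer("english"), LancasterStemmer()):
        # Each stemmer applies its own suffix-stripping rules; Lancaster is
        # the most aggressive of the three, Porter the gentlest.
        print(type(stemmer).__name__, [stemmer.stem(w) for w in words])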
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Lemmatization is a more calculated process. It involves resolving words to their dictionary form. In fact, a lemma of a word is its dictionary or canonical form. Because lemmatization is more nuanced in this respect, it requires a little more to make it work. For lemmatization to resolve a word to its lemma, it needs to know its part of speech. That requires extra computational linguistics power, such as a part-of-speech tagger. This allows it to do better resolutions (like resolving "is" to "be").
Another thing to note about lemmatization is that it's often harder to create a lemmatizer for a new language than a stemming algorithm, because a lemmatizer requires a lot more knowledge about the structure of a language; it's a much more intensive process than just trying to set up a heuristic stemming algorithm.
If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.
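A minimal sketch of lemmatization with NLTK's WordNet lemmatizer. Note that the part of speech is supplied by hand here; in practice a POS tagger would provide it, which is exactly the extra knowledge discussed above:

    import nltk
    nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("is", pos="v"))    # 'be'
    print(lemmatizer.lemmatize("saw", pos="v"))   # 'see'
    print(lemmatizer.lemmatize("saw", pos="n"))   # 'saw'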
The most common algorithm for stemming English, and one that has repeatedly been
shown to be empirically very effective, is Porter's algorithm (Porter, 1980). The entire
algorithm is too long and intricate to present here, but we will indicate its general nature.
Porter's algorithm consists of 5 phases of word reductions, applied sequentially. Within
each phase there are various conventions to select rules, such as selecting the rule from
each rule group that applies to the longest suffix. In the first phase, this convention is
used with the following rule group:
Rule          Example
SSES -> SS    caresses -> caress
IES  -> I     ponies   -> poni
SS   -> SS    caress   -> caress
S    -> (null)  cats   -> cat
In general, stemming increases recall while harming precision. As an example of what can go wrong, note that the Porter stemmer stems all of the following words:
operate operating operates operation operative operatives operational
to oper. However, since operate in its various forms is a common verb, we would expect to lose considerable precision on queries such as the following with Porter stemming:
operational and research
operating and system
operative and dentistry
In a case like this, moving to a lemmatizer would not completely fix the problem, because particular inflectional forms are used in particular collocations: a sentence with the words operate and system is not a good match for the query operating and system. Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.
The situation is different for languages with much more morphology (such as Spanish, German, Hindi and Finnish).
3. Regular Expression
A regular expression (RE) is a language for specifying text search strings or search patterns. The regular expression languages used for searching texts in UNIX (vi, Perl, Emacs, grep) and Microsoft Word are almost identical, and many RE features exist in the various Web search engines. Usually, search patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory. Besides this practical use, the regular expression is an important theoretical concept throughout computer science and linguistics. A regular expression is a formula in a special language that is used for specifying simple classes of strings. A string is a sequence of symbols; for the purpose of most text-based search techniques, a string is a sequence of alphanumeric characters (letters, numbers, spaces, and punctuation). For these purposes a space is just a character like any other, and we represent it with the symbol ␣. Formally, a regular expression is an algebraic notation for characterizing a set of strings. Thus, regular expressions can be used to specify search strings as well as to define a language in a formal way.
Basically, a regular expression is a pattern describing a certain amount of text. The name comes from the mathematical theory on which they are based, but we will not dig into that. You will usually find the name abbreviated to "regex" or "regexp". Regular expressions (regex or regexp) are extremely useful in extracting information from any text by searching for one or more matches of a specific search pattern (i.e. a specific sequence of ASCII or Unicode characters).
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software. Statistical NLP aims to do statistical inference for the field of natural language.
Figure 1: Matching the string NLP in a given text on the site https://www.regextester.com/
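The same kind of match can be reproduced with Python's re module; a minimal sketch (the sample sentence is my own):

    import re

    text = "NLP draws from many disciplines, and NLP has many applications."
    print(re.findall(r"\bNLP\b", text))   # ['NLP', 'NLP']; \b marks word boundaries

    # Replace only the first occurrence:
    print(re.sub(r"\bNLP\b", "Natural Language Processing", text, count=1))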
Regular Expression Patterns
Anchors — ^ and $
^The       matches any string that starts with The
end$       matches a string that ends with end
^The end$  exact string match (starts and ends with The end)
roar       matches any string that has the text roar in it
Quantifiers — * + ? and {}
abc*       matches a string that has ab followed by zero or more c
abc+       matches a string that has ab followed by one or more c
abc?       matches a string that has ab followed by zero or one c
abc{2}     matches a string that has ab followed by 2 c
a(b|c)     matches a string that has a followed by b or c
a[bc]      same as previous
Character classes — \d \w \s and .
\d         matches a single character that is a digit
\w         matches a word character (alphanumeric character plus underscore)
\s         matches a whitespace character (includes tabs and line breaks)
.          matches any character
\d, \w and \s also present their negations with \D, \W and \S respectively. For example, \D will perform the inverse match with respect to that obtained with \d:
\D         matches a single non-digit character
In order to be taken literally, you must escape the characters ^.[$()|*+?{\ with a backslash, as they have special meaning:
\$\d       matches a string that has a $ before one digit
You can also match non-printable characters like tabs \t, new-lines \n, and carriage returns \r.
Flags
We are learning how to construct a regex but forgetting a fundamental concept: flags. A regex usually comes within this form /abc/, where the search pattern is delimited by two slash characters /. At the end we can specify a flag with these values (we can also combine them with each other):
g (global) does not return after the first match, restarting the subsequent searches from the end of the previous match
m (multi-line) when enabled, ^ and $ will match the start and end of a line, instead of the whole string
i (insensitive) makes the whole expression case-insensitive (for instance, /aBc/i would match AbC)
Grouping and capturing — ()
a(bc)        parentheses create a capturing group with value bc
a(?:bc)*     using ?: we disable the capturing group
a(?<foo>bc)  using ?<foo> we put a name to the group
This operator is very useful when we need to extract information from strings or data using your preferred programming language. Any multiple occurrences captured by several groups will be exposed in the form of a classical array: we will access their values by specifying an index on the result of the match. If we put a name to the groups (using (?<foo>...)), we will be able to retrieve the group values using the match result like a dictionary, where the keys will be the names of the groups.
Bracket expressions — []
[abc]        matches a string that has either an a or a b or a c -> is the same as a|b|c
[a-c]        same as previous
[a-fA-F0-9]  a string that represents a single hexadecimal digit, case insensitively
[0-9]%       a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]    a string that has not a letter from a to z or from A to Z; in this case the ^ is used as negation of the expression
Greedy and Lazy match
The quantifiers (* + {}) are greedy operators, so they expand the match as far as they can through the provided text.
For example, <.+> matches <div> simple div </div> in This is a <div> simple div </div> test. In order to catch only the div tags we can use a ? to make the quantifier lazy:
<.+?> matches any character one or more times, but as few times as possible, included inside < and >.
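The difference is easy to verify in Python:

    import re

    s = "This is a <div> simple div </div> test"
    print(re.findall(r"<.+>", s))    # greedy: ['<div> simple div </div>']
    print(re.findall(r"<.+?>", s))   # lazy:   ['<div>', '</div>']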
Boundaries — \b and \B
\babc\b performs a "whole words only" search
\b represents an anchor like caret (it is similar to $ and ^), matching positions where one side is a word character and the other side is not.
4. Finite Automata
A regular expression is more than just a convenient metalanguage for text searching. First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe in this book. Any regular expression can be implemented as a finite-state automaton (except regular expressions that use the memory feature; more on this later). Symmetrically, any finite-state automaton can be described with a regular expression. Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the figure below.
Figure 2: The relationship between finite automata, regular expressions, and regular languages
A formal language is completely determined by the 'words in the dictionary', rather than by any grammatical rules.
A (formal) language L over an alphabet Σ is just a set of strings in Σ*. Thus any subset L ⊆ Σ* determines a language over Σ.
The language determined by a regular expression r over Σ is
L(r) = {v ∈ Σ* | v matches r}.
Two regular expressions r and s (over the same alphabet) are equivalent iff L(r) and L(s) are equal sets (i.e. have exactly the same members).
A finite automaton has a finite set of states with which it processes its input. Finite State Automata (FSA) can be:
Deterministic: for each input there is one and only one state to which the automaton can transition from its current state.
Nondeterministic: an automaton can be in several states at once.
Deterministic finite state automaton
1. A finite set of states, often denoted Q
2. A finite set of input symbols, often denoted Σ
3. A transition function that takes as arguments a state and an input symbol and returns a state. The transition function is commonly denoted δ. If q is a state and a is a symbol, then δ(q, a) is a state p (and in the graph that represents the automaton there is an arc from q to p labeled a)
4. A start state, one of the states in Q
5. A set of final or accepting states F (F ⊆ Q)
A DFA is a tuple A = (Q, Σ, δ, q0, F)
Other notations for DFAs
Transition diagrams
* Each state is a node
* For each state q ∈ Q and each symbol a ∈ Σ, let δ(q, a) = p
* Then the transition diagram has an arc from q to p, labeled a
* There is an arrow to the start state q0
* Nodes corresponding to final states are marked with a double circle
Transition tables
* Tabular representation of a function
* The rows correspond to the states and the columns to the input symbols
* The entry for the row corresponding to state q and the column corresponding to input a is the state δ(q, a)
Example: A = ({q0, q1, q2}, {0, 1}, δ, q0, {q1}), where the transition function δ is given by such a table.
The extended transition function describes what happens when we start in any state and follow any sequence of inputs. If δ is our transition function, then the extended transition function is denoted by δ̂. The extended transition function is a function that takes a state q and a string w and returns a state p (the state that the automaton reaches when starting in state q and processing the sequence of inputs w).
Formal definition of the extended transition function (by induction on the length of the input string):
Basis: δ̂(q, ε) = q
If we are in a state q and read no inputs, then we are still in state q.
Induction: Suppose w is a string of the form xa; that is, a is the last symbol of w, and x is the string consisting of all but the last symbol.
Then: δ̂(q, w) = δ(δ̂(q, x), a)
To compute δ̂(q, w), first compute δ̂(q, x), the state that the automaton is in after processing all but the last symbol of w. Suppose this state is p, i.e., δ̂(q, x) = p. Then δ̂(q, w) is what we get by making a transition from state p on the last symbol of w.
Example: Design a DFA to accept the language
L = {w | w has both an even number of 0s and an even number of 1s}
The Language of a DFA
The language of a DFA A = (Q, Σ, δ, q0, F), denoted L(A), is defined by
L(A) = {w | δ̂(q0, w) is in F}
The language of A is the set of strings w that take the start state q0 to one of the accepting states.
If L is L(A) for some DFA, then L is a regular language.
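The even-0s/even-1s DFA just described can be simulated directly; a minimal Python sketch (the state names are my own encoding of the parity pair, with "ee" = even/even serving as both start and accepting state):

    # DFA for L = {w | w has an even number of 0s and an even number of 1s}.
    DELTA = {
        ("ee", "0"): "oe", ("ee", "1"): "eo",
        ("oe", "0"): "ee", ("oe", "1"): "oo",
        ("eo", "0"): "oo", ("eo", "1"): "ee",
        ("oo", "0"): "eo", ("oo", "1"): "oe",
    }

    def accepts(w: str) -> bool:
        state = "ee"                 # start state q0 = (even 0s, even 1s)
        for symbol in w:             # the extended transition function in action
            state = DELTA[(state, symbol)]
        return state == "ee"         # F = {q0}

    assert accepts("0101")           # two 0s, two 1s
    assert not accepts("010")        # two 0s, one 1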
Nondeterministic Finite Automata (NFA)
An NFA has the power to be in several states at once. This ability is often expressed as an ability to "guess" something about its input. Each NFA accepts a language that is also accepted by some DFA. NFAs are often more succinct and easier to design than DFAs. We can always convert an NFA to a DFA, but the latter may have exponentially more states than the NFA (a rare case). The difference between the DFA and the NFA is the type of the transition function δ.
For an NFA, δ is a function that takes a state and an input symbol as arguments (like the DFA transition function), but returns a set of zero or more states (rather than returning exactly one state, as the DFA must).
Example: An NFA accepting strings that end in 01: a nondeterministic automaton that accepts all and only the strings of 0s and 1s that end in 01.
NFA: Formal definition
A nondeterministic finite automaton (NFA) is a tuple A = (Q, Σ, δ, q0, F) where:
1. Q is a finite set of states
2. Σ is a finite set of input symbols
3. q0 ∈ Q is the start state
4. F (F ⊆ Q) is the set of final or accepting states
5. δ, the transition function, is a function that takes a state in Q and an input symbol in Σ as arguments and returns a subset of Q
The only difference between an NFA and a DFA is in the type of value that δ returns.
Example: An NFA accepting strings that end in 01:
A = ({q0, q1, q2}, {0, 1}, δ, q0, {q2}), where the transition function δ is given by the table.
Extended Transition Function
Basis: δ̂(q, ε) = {q}
Without reading any input symbols, we are only in the state we began in.
Induction: Suppose w is a string of the form xa; that is, a is the last symbol of w and x is the string consisting of all but the last symbol. Also suppose that δ̂(q, x) = {p1, p2, ..., pk}.
Let ⋃_{i=1}^{k} δ(pi, a) = {r1, r2, ..., rm}
Then: δ̂(q, w) = {r1, r2, ..., rm}
We compute δ̂(q, w) by first computing δ̂(q, x) and by then following any transition from any of these states that is labeled a.
Example: An NFA accepting strings that end in 01
Processing w = 00101:
1. δ̂(q0, ε) = {q0}
2. δ̂(q0, 0) = δ(q0, 0) = {q0, q1}
3. δ̂(q0, 00) = δ(q0, 0) ∪ δ(q1, 0) = {q0, q1} ∪ ∅ = {q0, q1}
4. δ̂(q0, 001) = δ(q0, 1) ∪ δ(q1, 1) = {q0} ∪ {q2} = {q0, q2}
5. δ̂(q0, 0010) = δ(q0, 0) ∪ δ(q2, 0) = {q0, q1} ∪ ∅ = {q0, q1}
6. δ̂(q0, 00101) = δ(q0, 1) ∪ δ(q1, 1) = {q0} ∪ {q2} = {q0, q2}
The Language of an NFA
The language of an NFA A = (Q, Σ, δ, q0, F), denoted L(A), is defined by
L(A) = {w | δ̂(q0, w) ∩ F ≠ ∅}
The language of A is the set of strings w ∈ Σ* such that δ̂(q0, w) contains at least one accepting state. The fact that some sequence of choices using the input symbols of w leads to a non-accepting state, or to no state at all, does not prevent w from being accepted by the NFA as a whole.
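The "ends in 01" NFA can be simulated by tracking the set of currently possible states, directly mirroring the extended transition function computed above; a minimal Python sketch:

    # NFA A = ({q0, q1, q2}, {0, 1}, delta, q0, {q2}) accepting strings ending in 01.
    DELTA = {
        ("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
        ("q1", "1"): {"q2"},
    }

    def accepts(w: str) -> bool:
        current = {"q0"}             # start in the set {q0}
        for symbol in w:
            # Union of delta over all currently possible states.
            current = set().union(*(DELTA.get((q, symbol), set()) for q in current))
        return "q2" in current       # accept iff an accepting state is reached

    assert accepts("00101")          # matches the worked example above
    assert not accepts("0010")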
Equivalence of Deterministic and Nondeterministic Finite Automata
Every language that can be described by some NFA can also be described by some
DFA. In practice, the DFA has about as many states as the NFA, although it often has more transitions. In the worst case, the smallest DFA can have 2^n states (for a smallest NFA with n states).
5. Finite-State Morphological Parsing
Consider a simple example: parsing just the productive nominal plural (-s) and the verbal progressive (-ing). Our goal will be to take input forms like those in the first column below and produce output forms like those in the second column.

Input      Morphologically Parsed Output
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
goose      (goose +N +SG) or (goose +V)
gooses     goose +V +3SG
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
The second column contains the stem of each word as well as assorted morphological features. These features specify additional information about the stem. For example, the feature +N means that the word is a noun; +SG means it is singular, +PL that it is plural. We consider +SG to be a primitive unit that means 'singular'. Note that some of the input forms (like caught or goose) will be ambiguous between different morphological parses.
In order to build a morphological parser, we'll need at least the following:
1. lexicon: the list of stems and affixes, together with basic information about them (whether a stem is a Noun stem or a Verb stem, etc.).
2. morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. For example, the rule that the English plural morpheme follows the noun rather than preceding it.
3. orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (for example, the y -> ie spelling rule discussed above that changes city + -s to cities rather than citys).
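As a toy illustration of the first two requirements (a stem lexicon plus one morphotactic rule; the orthographic rules are omitted, and the lexicon entries are my own), a few lines of Python suffice:

    # Minimal lexicon + morphotactics: a noun stem may be followed by plural -s.
    LEXICON = {"cat": "N", "dog": "N", "fox": "N", "walk": "V"}

    def parse(word: str):
        if word in LEXICON and LEXICON[word] == "N":
            return f"{word} +N +SG"
        if word.endswith("s") and LEXICON.get(word[:-1]) == "N":
            return f"{word[:-1]} +N +PL"   # plural morpheme follows the noun
        return None                        # no parse without orthographic rules

    print(parse("cats"))    # cat +N +PL
    print(parse("cat"))     # cat +N +SG
    print(parse("foxes"))   # None: 'foxes' needs the e-insertion orthographic rule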
6. Building a Finite-State Lexicon
A lexicon is a repository for words. The simplest possible lexicon would consist of an explicit list of every word of the language (every word, i.e. including abbreviations ('AAA') and proper names ('Jane' or 'Beijing')), as follows: a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, ...
Since it will often be inconvenient or impossible, for the various reasons we discussed above, to list every word in the language, computational lexicons are usually structured with a list of each of the stems and affixes of the language together with a representation of the morphotactics that tells us how they can fit together. There are many ways to model morphotactics; one of the most common is the finite-state automaton. A very simple finite-state model for English nominal inflection might look like Figure 3.
The FSA in Figure 3 assumes that the lexicon includes regular nouns (reg-noun) that take the regular -s plural (e.g. cat, dog, fox, aardvark). These are the vast majority of English nouns, since for now we will ignore the fact that the plural of words like fox have an inserted e: foxes. The lexicon also includes irregular noun forms that don't take -s, both singular irreg-sg-noun (goose, mouse) and plural irreg-pl-noun (geese, mice).
Figure 3: A finite-state automaton for English nominal inflection (reg-noun, irreg-sg-noun and irreg-pl-noun stems, with a plural -s arc)
Figure 4: A finite-state automaton for English verbal inflection (reg-verb-stem, irreg-verb-stem and irreg-past-verb-form stems, with arcs for the preterite -ed, past participle -ed, progressive -ing and 3rd singular -s)
This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, and irreg-past-verb-form), plus four affix classes (-ed past, -ed participle, -ing participle, and 3rd singular -s):

reg-verb-stem      irreg-verb-stem     irreg-past-verb-form
walk, fry, talk    cut, speak, sing    caught, ate, eaten, sang, spoken, cut
Some models of English derivation, in fact, are based on more complex context-free grammars. As a preliminary example, though, of the kind of analysis this would require, consider a small part of the morphotactics of English adjectives, taken from Antworth (1990). Antworth offers the following data on English adjectives:
big, bigger, biggest
cool, cooler, coolest, coolly
red, redder, reddest
clear, clearer, clearest, clearly, unclear, unclearly
happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
real, unreal, really
An initial hypothesis might be that adjectives can have an optional prefix (un-), an obligatory root (big, cool, etc.) and an optional suffix (-er, -est, or -ly). This might suggest the FSA in Figure 5. Alas, while this FSA will recognize all the adjectives in the table above, it will also recognize ungrammatical forms like unbig, redly, and realest. We need to set up classes of roots and specify which can occur with which suffixes. So adj-root1 would include adjectives that can occur with un- and -ly (clear, happy, and real), while adj-root2 will include adjectives that can't (big, cool, and red). Antworth (1990) presents Figure 6 as a partial solution to these problems. This gives an idea of the complexity to be expected from English derivation.
Figure 5: An FSA for English adjective morphology: an optional prefix un-, an adjective root, and an optional suffix (-er, -est, or -ly)
Recognition with the Nominal Inflection FSA
Figure 8 shows the noun-recognition FSA produced by expanding the nominal inflection FSA of Figure 9 with sample regular and irregular nouns for each class. We can use the FSA of Figure 8 to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc.
Finite-State Transducers
We've now seen that FSAs can represent the morphotactic structure of a lexicon and can be used for word recognition. A transducer maps between one representation and another; a finite-state transducer or FST is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton which recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape. The FST thus has a more general function than an FSA: where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings. Another way of looking at an FST is as a machine that reads one string and generates another. Here's a summary of this four-fold way of thinking about transducers:
FST as recognizer: a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings.
FST as translator: a machine that reads a string and outputs another string.
FST as set relater: a machine that computes relations between sets.
All of these have applications in speech and language processing. For morphological parsing (and for many other NLP applications), we will apply the FST-as-translator metaphor, taking as input a string of letters and producing as output a string of morphemes.
An FST can be formally defined in a number of ways; we will rely on the following definition, based on what is called the Mealy machine extension to a simple FSA:
Q: a finite set of states q0, q1, ..., qn-1
Σ: a finite set corresponding to the input alphabet
Δ: a finite set corresponding to the output alphabet
q0: the start state
F: the set of final states, F ⊆ Q
δ(q, w): the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q' ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (because there are 2^Q possible subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous as to which state it maps to.
σ(q, w): the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).
Where FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations. Regular relations are sets of pairs of strings, a natural extension of the regular languages, which are sets of strings. Like FSAs and regular languages, FSTs and regular relations are closed under union, although in general they are not closed under difference, complementation and intersection (although some useful subclasses of FSTs are closed under these operations; in general, FSTs that are not augmented with the ε are more likely to have such closure properties). Besides union, FSTs have two additional closure properties that turn out to be extremely useful:
Inversion: The inversion of a transducer T (T⁻¹) simply switches the input and output labels. Thus, if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
Composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∘ T2 maps from I1 to O2.
*T2 maps from 11 to 02.inversion is useful because it makes it easy to convert a FS parser into an F:
h is useful because it it easy to
version is use| caus "
nerator. Composition is useft allows us to take two transdi
ator. Composition is useful because it allows u:
ger o s
re CO transducer
in series and replace them with one more complex
ical to applying T1 to $ and yp
algebra; applying T1 « T2 to an input sequence S is identical to applying a
T2 to the result; thus T1 » T2(S) = T2(T1(S))
Fig. 10, for example, shows the composition of [a:b]+ with [b:c]+, which produces [a:c]+.
Figure 10: The composition of [a:b]+ with [b:c]+
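Since composition behaves like function composition, the identity T1 ∘ T2(S) = T2(T1(S)) can be sketched with plain Python functions standing in for the two relabeling transducers of Fig. 10 (a deliberate simplification; real FSTs relate sets of strings, not single values):

    def t1(s: str) -> str:
        return s.replace("a", "b")   # plays the role of [a:b]+

    def t2(s: str) -> str:
        return s.replace("b", "c")   # plays the role of [b:c]+

    def compose(f, g):
        return lambda s: g(f(s))     # apply f first, then g, as in the text

    print(compose(t1, t2)("aaa"))    # 'ccc', the behavior of [a:c]+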
The projection of an FST is the FSA that is produced by extracting only one side of the relation. We can refer to the projection to the left or upper side of the relation as the upper or first projection, and the projection to the lower or right side of the relation as the lower or second projection.
Morphological Parsing with Finite-State Transducers
Let's now turn to the task of morphological parsing. Given the input cats, for instance, we'd like to output cat +N +PL, telling us that cat is a plural noun.
Figure 11: Schematic examples of the lexical and surface tapes; the actual transducers will involve intermediate tapes as well.
Following Koskenniemi (1983), we allow each arc only to have a single symbol from each alphabet. We can then combine the two symbol alphabets Σ and Δ to create a new alphabet, Σ', which makes the relationship to FSAs quite clear. Σ' is a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i:o, with one symbol i from the input alphabet Σ and one symbol o from an output alphabet Δ; thus Σ' ⊆ Σ × Δ. Σ and Δ may each also include the epsilon symbol ε. Thus, where an FSA accepts a language stated over a finite alphabet of single symbols, such as the alphabet of our sheep language
Σ = {b, a, !}
an FST defined this way accepts a language stated over pairs of symbols, as in
Σ' = {a:a, b:b, !:!, a:!, a:ε, ε:!}
In two-level morphology, the pairs of symbols in Σ' are also called feasible pairs. Thus each symbol a:b in the transducer alphabet Σ' expresses how the symbol a from one tape is mapped to the symbol b on the other tape. For example, a:ε means that an a on the upper tape will correspond to nothing on the lower tape. Just as for an FSA, we can write regular expressions in the complex alphabet Σ'. Since it's most common for symbols to map to themselves, in two-level morphology we call pairs like a:a default pairs and just refer to them by the single letter a.
We are now ready to build an FST morphological parser out of our earlier morphotactic FSAs and lexica by adding an extra "lexical" tape and the appropriate morphological features. Fig. 12 shows an augmentation of Fig. 13 with the nominal morphological features (+Sg and +Pl) that correspond to each morpheme. The symbol ^ indicates a morpheme boundary, while the symbol # indicates a word boundary. The morphological features map to the empty string ε or the boundary symbols, since there is no segment corresponding to them on the output tape.
Figure 12: A schematic transducer for English nominal number inflection. The symbols above each arc represent elements of the morphological parse in the lexical tape; the symbols below each arc represent elements of the surface tape.