
HEMWATI NANDAN BAHUGUNA GARHWAL UNIVERSITY

(A Central University), Srinagar Garhwal, Uttarakhand


School of Engineering and Technology
Department of Computer Science and Technology

Session 2020 - 2021

PRACTICAL FILE
FOR

NATURAL LANGUAGE PROCESSING

Submitted to -                                         Submitted by -

Ms Kanchan Naithani                                    Bhawini Raj
Department of Computer Science and Engineering         B.Tech - VIII Semester
                                                       Roll no - 17134501001

CONTENTS

Experiment No.   Experiment Name                                                           Page No.

1    Introduction to Natural Language Processing                                             01
2    Introduction to Grammars, Parsing and PoS tags                                          14
3    Introduction to NLTK                                                                    40
4    Write a Python Program to remove "stopwords" from a given text and generate
     word tokens and filtered text.                                                          48
5    Write a Python Program to generate "tokens" and assign "PoS tags" for a given
     text using the NLTK package.                                                            51
6    Write a Python Program to generate a "wordcloud" with maximum words used = 100,
     in different shapes, and save it as a .png file for a given text file.                  53
7    Perform an experiment to learn about morphological features of a word by
     analyzing it.                                                                           57
8    Perform an experiment to generate word forms from root and suffix information.          62
9    Perform an experiment to understand the morphology of a word by the use of an
     Add-Delete table.                                                                       65
10   Perform an experiment to learn to calculate bigrams from a given corpus and
     calculate the probability of a sentence.                                                69
11   Perform an experiment to learn how to apply add-one smoothing on a sparse
     bigram table.                                                                           73
12   Perform an experiment to calculate the emission and transition matrices, which
     are helpful for tagging Parts of Speech using a Hidden Markov Model.                    77
13   Perform an experiment to know the importance of context and size of the training
     corpus in learning Parts of Speech.                                                     83
14   Perform an experiment to understand the concept of chunking and get familiar
     with the basic chunk tagset.                                                            87
15                                                                                           93
EXPERIMENT NO 1
INTRODUCTION TO NATURAL LANGUAGE PROCESSING

1. What is NLP?

NLP is an interdisciplinary field concerned with the interactions between computers and natural
human languages (e.g., English) — speech or text. NLP-powered software helps us in our daily lives
in various ways, for example:

● Personal assistants: Siri, Cortana, and Google Assistant.


● Auto-complete: In search engines (e.g., Google, Bing).
● Spell checking: Almost everywhere, in your browser, your IDE (e.g., Visual Studio),
desktop apps (e.g., Microsoft Word).
● Machine Translation: Google Translate.

Figure1.1 - Natural language processing

NLP is divided into two fields: Linguistics and Computer Science.

The Linguistics side is concerned with language itself: its formation, syntax, meaning, the different kinds of
phrases (noun or verb), and so on.

The Computer Science side is concerned with applying that linguistic knowledge by transforming it into
computer programs with the help of sub-fields such as Artificial Intelligence (Machine Learning &
Deep Learning).

2. How does Natural Language Processing Work?


NLP enables computers to understand natural language as humans do. Whether the language
is spoken or written, natural language processing uses artificial intelligence to take real-world
input, process it, and make sense of it in a way a computer can understand. Just as humans
have different sensors -- such as ears to hear and eyes to see -- computers have programs to read
text and microphones to collect audio. And just as humans have a brain to process that input,
computers have a program to process their respective inputs. At some point in processing, the
input is converted to code that the computer can understand.

There are two main phases to natural language processing: data preprocessing and algorithm
development.

Data preprocessing involves preparing and "cleaning" text data so that machines are able to
analyze it. Preprocessing puts data in a workable form and highlights features in the text that an
algorithm can work with. There are several ways this can be done, including the following (a short
code sketch of these steps appears after the list):

● Tokenization. This is when text is broken down into smaller units to work with.
● Stop word removal. This is when common words are removed from the text so that the unique words
that offer the most information about the text remain.
● Lemmatization and stemming. This is when words are reduced to their root forms for
processing.
● Part-of-speech tagging. This is when words are marked based on the part of speech they
are -- such as nouns, verbs and adjectives.
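
As a minimal sketch of these preprocessing steps using NLTK (assuming the punkt, stopwords, wordnet and
averaged_perceptron_tagger resources have been downloaded; the sample sentence is only illustrative):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads (uncomment on first run):
# nltk.download('punkt'); nltk.download('stopwords')
# nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')

text = "The dogs are barking loudly in the park"

tokens = nltk.word_tokenize(text)                                   # Tokenization
filtered = [w for w in tokens
            if w.lower() not in stopwords.words('english')]         # Stop word removal
lemmas = [WordNetLemmatizer().lemmatize(w) for w in filtered]       # Lemmatization
stems = [PorterStemmer().stem(w) for w in filtered]                 # Stemming
tags = nltk.pos_tag(tokens)                                         # Part-of-speech tagging

print(filtered)
print(lemmas)
print(stems)
print(tags)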

Once the data has been preprocessed, an algorithm is developed to process it. There are many
different natural language processing algorithms, but two main types are commonly used:

● Rules-based system. This system uses carefully designed linguistic rules. This approach
was used early on in the development of natural language processing, and is still used.
● Machine learning-based system. Machine learning algorithms use statistical methods.
They learn to perform tasks based on the training data they are fed, and adjust their methods
as more data is processed. Using a combination of machine learning, deep learning and
neural networks, natural language processing algorithms hone their own rules through
repeated processing and learning.

Figure1.2 Steps of Natural Language processing

3. Phases of NLP:-

There are the following five phases of NLP:

Figure1.3 Phases of NLP

1. Lexical and Morphological Analysis

The first phase of NLP is Lexical Analysis. This phase scans the source text as a stream
of characters and converts it into meaningful lexemes. It divides the whole text into paragraphs,
sentences, and words.
2. Syntactic Analysis (Parsing)

Syntactic Analysis is used to check grammar and word arrangement, and to show the relationships
among the words.

Example: Agra goes to the Poonam

In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence
is rejected by the syntactic analyzer.

3. Semantic Analysis

Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal
meaning of words, phrases, and sentences.

4. Discourse Integration

Discourse Integration depends upon the sentences that precede it and also invokes the
meaning of the sentences that follow it.

5. Pragmatic Analysis

Pragmatic Analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by
applying a set of rules that characterize cooperative dialogues.

For Example: "Open the door" is interpreted as a request instead of an order.

4. Why is Natural Language Processing Important?

Businesses use massive quantities of unstructured, text-heavy data and need a way to
efficiently process it. A lot of the information created online and stored in databases is natural
human language, and until recently, businesses could not effectively analyze this data. This is
where natural language processing is useful.

The advantage of natural language processing can be seen when considering the following two
statements: "Cloud computing insurance should be part of every service-level agreement," and,
"A good SLA ensures an easier night's sleep -- even in the cloud." If a user relies on natural
language processing for search, the program will recognize that cloud computing is an entity,
that cloud is an abbreviated form of cloud computing and that SLA is an industry acronym for
service-level agreement.

Figure1.4 Elements of Natural language Processing

These are some of the key areas in which a business can use natural language processing
(NLP).

These are the types of vague elements that frequently appear in human language and that
machine learning algorithms have historically been bad at interpreting. Now, with
improvements in deep learning and machine learning methods, algorithms can effectively
interpret them. These improvements expand the breadth and depth of data that can be analyzed.

5. Techniques and Methods of Natural Language Processing.

Syntax and semantic analysis are two main techniques used in natural language processing.

Syntax is the arrangement of words in a sentence so that they make grammatical sense, while semantics
involves the use of and meaning behind words. Natural language processing applies
algorithms to understand the meaning and structure of sentences.

➔ Parsing. What is parsing? According to the dictionary, to parse is to "resolve a sentence into
its component parts and describe their syntactic roles."

That definition captures the idea, but it can be made a little more comprehensive. Parsing refers to the formal
analysis of a sentence by a computer into its constituents, which results in a parse tree showing
their syntactic relation to one another in visual form, which can be used for further processing
and understanding.

Figure 1.5 parse tree for the sentence "The thief robbed the apartment." Included is a
description of the three different information types conveyed by the sentence.

The letters directly above the single words show the parts of speech for each word (noun, verb
and determiner). One level higher is some hierarchical grouping of words into phrases. For
example, "the thief" is a noun phrase, "robbed the apartment" is a verb phrase and when put
together the two phrases form a sentence, which is marked one level higher.

But what is actually meant by a noun or verb phrase? Noun phrases are one or more words that
contain a noun and maybe some descriptors, verbs or adverbs. The idea is to group nouns with
words that are in relation to them.

A parse tree also provides us with information about the grammatical relationships of the words
due to the structure of their representation. For example, we can see in the structure that "the
thief" is the subject of "robbed."

By structure we mean that we have the verb ("robbed"), which is marked with a "V" above it
and a "VP" above that, which is linked by an "S" to the subject ("the thief"), which has an "NP"
above it. This is like a template for a subject-verb relationship, and there are many others for
other types of relationships.
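
As a small hedged sketch, the same kind of parse tree can be produced with NLTK by defining a toy grammar
for this sentence (the grammar rules below are assumed purely for illustration):

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> V NP
DT -> 'the'
NN -> 'thief' | 'apartment'
V -> 'robbed'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the thief robbed the apartment".split()):
    tree.pretty_print()   # prints the tree with the S, NP, VP, DT, NN and V labels described above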

➔ Word segmentation. This is the act of taking a string of text and deriving word forms from it.
Example: A person scans a handwritten document into a computer. The algorithm would be
able to analyze the page and recognize that the words are divided by white spaces.
➔ Sentence breaking. This places sentence boundaries in large texts. Example: A natural
language processing algorithm is fed the text, "The dog barked. I woke up." The algorithm can
recognize the period that splits up the sentences using sentence breaking.
➔ Morphological segmentation. This divides words into smaller parts called morphemes.
Example: The word untestably would be broken into [[un[[test]able]]ly], where the algorithm
recognizes "un," "test," "able" and "ly" as morphemes. This is especially useful in machine
translation and speech recognition.
➔ Named Entity Recognition (NER). This technique is one of the most popular and
advantageous techniques in semantic analysis; semantics is what is conveyed by the text.
Under this technique, the algorithm takes a phrase or paragraph as input and identifies all the
nouns or names present in that input.
➔ Tokenization. Tokenization is basically the splitting of the whole text into a list of tokens,
where the tokens can be anything such as words, sentences, characters, numbers, punctuation, etc.
Tokenization has two main advantages: one is to reduce the search space to a significant degree,
and the second is to make effective use of storage space.

➔ Stemming and Lemmatization. The amount of data and information on the web has been at an
all-time high for the past couple of years. This huge volume of data and information demands the necessary
tools and techniques to extract inferences with ease.

"Stemming is the process of reducing inflected (or sometimes derived) words to their word
stem, base or root form - generally a written form of the word." In essence, stemming
cuts off suffixes. So after applying stemming to the word
"playing", it becomes "play", and "asked" becomes "ask".

Figure 1.6 Stemming and Lemmatization

Lemmatization usually refers to doing things with the proper use of a vocabulary and morphological
analysis of words, normally aiming to remove inflectional endings only and to return the base
or dictionary form of a word, which is known as the lemma. In simple words, lemmatization
deals with the lemma of a word, reducing the word form after understanding the
part of speech (POS) or context of the word in the document.
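
A minimal NLTK sketch contrasting the two on the examples above (assuming the wordnet corpus has been
downloaded; the word list is illustrative):

from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # one-time download needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["playing", "asked", "studies"]:
    # lemmatize() treats words as nouns by default; pass pos='v' to lemmatize them as verbs
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos='v'))
# playing -> play | play
# asked -> ask | ask
# studies -> studi | study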

➔ Bag of Words. Bag of words technique is used to pre-process text and to extract all the features
from a text document to use in Machine Learning modeling. It is also a representation of any
text that elaborates/explains the occurrence of the words within a corpus (document). It is also
called “Bag” due to its mechanism, i.e. it is only concerned with whether known words occur
in the document, not the location of the words.

Let’s take an example to understand bag-of-words in more detail. Like below, we are taking 2
text documents:

“Neha was angry on Sunil and he was angry on Ramesh.”


“Neha love animals.”

Above we have two documents; we treat each document as a separate entity and
make a list of all the words present in both documents, excluding punctuation:

"Neha", "was", "angry", "on", "Sunil", "and", "he", "Ramesh", "love", "animals"

Then we convert these documents into vectors (converting text into numbers is
called vectorization in ML) for further modelling.

The representation of "Neha was angry on Sunil and he was angry on Ramesh" in vector form is
[1,1,1,1,1,1,1,1,0,0], and similarly "Neha love animals" has the vector form
[1,0,0,0,0,0,0,0,1,1]. So, the bag-of-words technique is mainly used for feature generation
from text data.
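
The two vectors above can be reproduced by hand with a few lines of Python (no external library; the
vocabulary order follows the word list in the text):

vocab = ["Neha", "was", "angry", "on", "Sunil", "and", "he", "Ramesh", "love", "animals"]

doc1 = "Neha was angry on Sunil and he was angry on Ramesh".split()
doc2 = "Neha love animals".split()

def to_vector(tokens, vocabulary):
    # 1 if the vocabulary word occurs in the document, else 0 (presence, not counts)
    return [1 if word in tokens else 0 for word in vocabulary]

print(to_vector(doc1, vocab))   # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(to_vector(doc2, vocab))   # [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]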

➔ Natural Language Generation. Natural language generation (NLG) is a technique that converts
raw structured data into plain English (or any other) language. We also call it data
storytelling. This technique is very helpful in many organizations where a large amount of data
is used; it converts structured data into natural language for a better understanding of patterns
or detailed insights into any business.

There are several stages in any NLG system:

1. Content Determination: Deciding what are the main content to be represented in text
or information provided in the text.
2. Document Clustering: Deciding the overall structure of the information to convey.
3. Aggregation: Merging of sentences to improve sentence understanding and
readability.
4. Lexical Choice: Putting appropriate words to convey the meaning of the sentence
more clearly.
5. Referring Expression Generation: Creating references to identify main objects and
regions of the text properly.
6. Realization: Creating and optimizing text that should follow all the norms of
grammar (like syntax, morphology, orthography).

➔ Sentiment Analysis It is one of the most common natural language processing techniques.
With sentiment analysis, we can understand the emotion/feeling of the written text. Sentiment
analysis is also known as Emotion AI or Opinion Mining.

The basic task of Sentiment analysis is to find whether expressed opinions in any document,
sentence, text, social media, film reviews are positive, negative, or neutral, it is also called
finding the Polarity of Text.

Figure1.7 Analysing sentiments

For example, Twitter is filled with sentiments: users share their reactions and
express their opinions on every topic wherever possible. To access users' tweets
in a real-time scenario, there is a powerful Python library called "Tweepy".

➔ Sentence Segmentation. The most fundamental task of this technique is to divide all text into
meaningful sentences or phrases. This task involves identifying sentence boundaries between
words in text documents. Almost all languages have punctuation marks that are
present at sentence boundaries, so sentence segmentation is also referred to as sentence
boundary detection, sentence boundary disambiguation or sentence boundary recognition.

There are many libraries available to do sentence segmentation, such as NLTK, spaCy, and Stanford
CoreNLP, that provide specific functions for the task.
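
A quick sketch of sentence segmentation with NLTK (assuming the punkt tokenizer models have been
downloaded; the sample text is illustrative):

import nltk
# nltk.download('punkt')  # one-time download of the sentence tokenizer models

text = "The dog barked. I woke up. Why not tell someone?"
print(nltk.sent_tokenize(text))
# ['The dog barked.', 'I woke up.', 'Why not tell someone?']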

Three tools used commonly for natural language processing include Natural Language Toolkit
(NLTK), Gensim and Intel natural language processing Architect. NLTK is an open source
Python module with data sets and tutorials. Gensim is a Python library for topic modeling and
document indexing. Intel NLP Architect is another Python library for deep learning topologies
and techniques.

6. What is Natural Language Processing Used for?

Some of the main functions that natural language processing algorithms perform are:

● Text classification. This involves assigning tags to texts to put them in categories. This
can be useful for sentiment analysis, which helps the natural language processing algorithm
determine the sentiment, or emotion behind a text. For example, when brand A is
mentioned in X number of texts, the algorithm can determine how many of those mentions
were positive and how many were negative. It can also be useful for intent detection, which
helps predict what the speaker or writer may do based on the text they are producing.
● Text extraction. This involves automatically summarizing text and finding important
pieces of data. One example of this is keyword extraction, which pulls the most important
words from the text, which can be useful for search engine optimization. Doing this with
natural language processing requires some programming -- it is not completely automated.

However, there are plenty of simple keyword extraction tools that automate most of the
process -- the user just has to set parameters within the program. For example, a tool might
pull out the most frequently used words in the text. Another example is named entity
recognition, which extracts the names of people, places and other entities from text.
● Machine translation. This is the process by which a computer translates text from one
language, such as English, to another language, such as French, without human
intervention.
● Natural language generation. This involves using natural language processing
algorithms to analyze unstructured data and automatically produce content based on that
data. One example of this is in language models such as GPT3, which are able to analyze
an unstructured text and then generate believable articles based on the text.

7. Benefits of Natural language Processing

The main benefit of NLP is that it improves the way humans and computers communicate with
each other. The most direct way to manipulate a computer is through code -- the computer's
language. By enabling computers to understand human language, interacting with computers
becomes much more intuitive for humans.

Other benefits include:


● improved accuracy and efficiency of documentation;
● ability to automatically make a readable summary of a larger, more complex
original text;
● useful for personal assistants such as Alexa, by enabling them to understand the spoken
word;
● enables an organization to use chatbots for customer support;
● easier to perform sentiment analysis; and
● provides advanced insights from analytics that were previously unreachable due to
data volume.

8. Challenges of Natural language Processing

There are a number of challenges of natural language processing and most of them boil down to the
fact that natural language is ever-evolving and always somewhat ambiguous. They include:

● Precision. Computers traditionally require humans to "speak" to them in a programming
language that is precise, unambiguous and highly structured -- or through a limited number
of clearly enunciated voice commands. Human speech, however, is not always precise; it
is often ambiguous and the linguistic structure can depend on many complex variables,
including slang, regional dialects and social context.
● Tone of voice and inflection. Natural language processing has not yet been perfected. For
example, semantic analysis can still be a challenge. Other difficulties include the fact that
the abstract use of language is typically tricky for programs to understand. For instance,
natural language processing does not pick up sarcasm easily. These topics usually require
understanding the words being used and their context in a conversation. As another
example, a sentence can change meaning depending on which word or syllable the speaker
puts stress on. NLP algorithms may miss the subtle, but important, tone changes in a
person's voice when performing speech recognition. The tone and inflection of speech may
also vary between different accents, which can be challenging for an algorithm to parse.
● Evolving use of language. Natural language processing is also challenged by the fact that
language -- and the way people use it -- is continually changing. Although there are rules
to language, none are written in stone, and they are subject to change over time. Hard
computational rules that work now may become obsolete as the characteristics of real-
world language change over time.

9. The Evolution of Natural Language Processing

NLP draws from a variety of disciplines, including computer science and computational linguistics
developments dating back to the mid-20th century. Its evolution included the following major
milestones:
● 1950s. Natural language processing has its roots in this decade, when Alan Turing
developed the Turing Test to determine whether or not a computer is truly intelligent.
The test involves automated interpretation and the generation of natural language as
criterion of intelligence.
● 1950s-1990s. NLP was largely rules-based, using handcrafted rules developed by
linguists to determine how computers would process language.
● 1990s. The top-down, language-first approach to natural language processing was
replaced with a more statistical approach, because advancements in computing made this
a more efficient way of developing NLP technology. Computers were becoming faster
and could be used to develop rules based on linguistic statistics without a linguist
creating all of the rules. Data-driven natural language processing became mainstream
during this decade. Natural language processing shifted from a linguist-based approach to
an engineer-based approach, drawing on a wider variety of scientific disciplines instead
of delving into linguistics.
● 2000-2020s. Natural language processing saw dramatic growth in popularity as a term.
With advances in computing power, natural language processing has also gained
numerous real-world applications. Today, approaches to NLP involve a combination of
classical linguistics and statistical methods.

Natural language processing plays a vital part in technology and the way humans interact with it. It is used
in many real-world applications in both the business and consumer spheres, including chatbots,
cybersecurity, search engines and big data analytics. Though not without its challenges, NLP is expected to
continue to be an important part of both industry and everyday life.

EXPERIMENT NO 2
INTRODUCTION TO GRAMMARS, PARSERS, POS TAGS

1. What is Grammar?
Grammar is defined as the rules for forming well-structured sentences.
Grammar plays an essential role in describing the syntactic structure of well-formed programs.
In simple words, grammar denotes the syntactical rules that are used for conversation in natural
languages.

For Example, in the ‘C’ programming language, the precise grammar rules state how functions are made
with the help of lists and statements.

Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P) where,

N or VN = set of non-terminal symbols, or variables.
T or ∑ = set of terminal symbols.
S = start symbol, where S ∈ N.
P = production rules for terminals as well as non-terminals.
    Each production has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α
    belongs to VN.

2. Types of Grammar:-
A. Context Free Grammar - A context-free grammar, which is represented in short as
CFG, is a notation used for describing languages and is a superset of regular
grammar, as you can see from the following diagram:

Figure2.1 CFG - A superset of regular grammar

CFG consists of a finite set of grammar rules having the following four components

● Set of Non-Terminals
● Set of Terminals
● Set of Productions
● Start Symbol

Set of Non-terminals

It is represented by V. The non-terminals are syntactic variables that denote sets of
strings, which help in defining the language generated by the
grammar.

Set of Terminals

Terminals are also known as tokens and are represented by Σ. Strings are formed from the
basic symbols of terminals.

Set of Productions

It is represented by P. The set gives an idea about how the terminals and non-terminals
can be combined. Every production consists of the following components:
● a non-terminal,
● an arrow,
● a sequence of terminals and/or non-terminals.
The left side of a production is a single non-terminal, while the right side is a
sequence of terminals and/or non-terminals.

Start Symbol

Production (derivation) begins from the start symbol. It is represented by the symbol S.
The start symbol is always a designated non-terminal.
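
As a small sketch, the four components can be seen directly when a CFG is defined in NLTK (the toy
productions below are assumed purely for illustration):

import nltk

# Non-terminals: S, NP, VP, DT, NN, V   Terminals: 'the', 'dog', 'park', 'runs'
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> V
DT -> 'the'
NN -> 'dog' | 'park'
V -> 'runs'
""")

print(grammar.start())         # S  -- the start symbol
print(grammar.productions())   # the set of production rules P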

B. Constituency Grammar

It is also known as phrase structure grammar. It is called constituency grammar as it
is based on the constituency relation. It is the opposite of dependency grammar.
Before diving deep into the discussion of CG, let's see some fundamental points about
constituency grammar and the constituency relation.

● All the related frameworks view the sentence structure in terms of


constituency relation.
● To derive the constituency relation, we take the help of subject-predicate
division of Latin as well as Greek grammar.
● Here we study the clause structure in terms of noun phrase NP and verb
phrase VP.

For Example,

Sentence: This tree is illustrating the constituency relation

Figure 2.2 A tree illustrating constituency grammar

Now, let's dive deeper into the discussion of constituency grammar:

In constituency grammar, the constituents can be any word, group of words, or phrase,
and the goal of constituency grammar is to organize any sentence into its constituents
using their properties. To derive these properties we generally take the help of:

● Part of speech tagging,


● A noun or Verb phrase identification, etc

For Example, constituency grammar can organize any sentence into its three
constituents- a subject, a context, and an object.

Sentence: <subject> <context> <object>

These three constituents can take different values and, as a result, they can generate
different sentences. For example, if we have the following constituents:

<subject> The horses / The dogs / They
<context> are running / are barking / are eating
<object> in the park / happily / since the morning

Example sentences that can be generated with the help of the above constituents are:

"The dogs are barking in the park"
"They are eating happily"
"The horses are running since the morning"

Another view of constituency grammar is to define the grammar in
terms of part of speech tags. Say a grammar structure contains

[determiner, noun] [adjective, verb] [preposition, determiner, noun]

which corresponds to the same sentence – "The dogs are barking in the park".

Another view (using part of speech tags):

< DT NN > < JJ VB > < PRP DT NN > -------------> The dogs are barking in the park

C. Dependency Grammar

It is opposite to the constituency grammar and is based on the dependency relation.


Dependency grammar (DG) is opposite to constituency grammar because it lacks
phrasal nodes.

Before diving deep into the discussion of DG, let's see some fundamental points about
dependency grammar and the dependency relation.

● In Dependency Grammar, the words are connected to each other by directed


links.
● The verb is considered the center of the clause structure.
● Every other syntactic unit is connected to the verb by a directed link.
These syntactic units are called dependencies.

For Example,

Sentence: This tree is illustrating the dependency relation

Figure 2.3 A tree illustrating a dependency relation

Now, let's dive deeper into the discussion of dependency grammar:

1. Dependency grammar states that the words of a sentence are dependent upon other words of the sentence.
For example, in the previous sentence which we discussed in CG, "barking dog" was mentioned, and the dog
was modified with the help of barking, as an adjective-modifier dependency exists between the two.

2. It organizes the words of a sentence according to their dependencies. One of the words in a sentence behaves
as a root and all the other words except that word itself are linked directly or indirectly with the root using
their dependencies. These dependencies represent relationships among the words in a sentence and
dependency grammars are used to infer the structure and semantic dependencies between the words.

For Example, Consider the following sentence:

Sentence: Analytics Vidhya is the largest community of data

scientists and provides the best resources for understanding

data and analytics

The dependency tree of the above sentence is shown below:

In the above tree, the root word is “community” having NN as the part of speech tag
and every other word of this tree is connected to root, directly or indirectly, with the
help of dependency relation such as a direct object, direct subject, modifiers, etc.

These relationships define the roles and functions of each word in the sentence and
how multiple words are connected together.

We can represent every dependency in the form of a triplet which contains a governor,
a relation, and a dependent,

Relation : ( Governor, Relation, Dependent )

which implies that a dependent is connected to the governor with the help of relation,
or in other words, they are considered the subject, verb, and object respectively.

For Example, Consider the following same sentence again:

Sentence: Analytics Vidhya is the largest community of data

scientists

Then, we separate the sentence in the following manner:

<Analytics Vidhya> <is> <the largest community of data scientists>

Now, let’s identify different components in the above sentence:

● Subject: “Analytics Vidhya” is the subject and is playing the role of a


governor.
● Verb: “is” is the verb and is playing the role of the relation.
● Object: “the largest community of data scientists” is the dependent or the
object.
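
A small sketch of extracting such (governor, relation, dependent) triplets with spaCy (one of the
libraries mentioned earlier; this assumes the en_core_web_sm model is installed and is only
illustrative of the idea, not part of the original experiment):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Analytics Vidhya is the largest community of data scientists")

# Each token is a dependent linked to its governor (head) by a relation (dep_)
for token in doc:
    print((token.head.text, token.dep_, token.text))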

Introduction to Parsers

1. Introduction to Parsing
Parsing is defined as "the analysis of an input to organize the data according to the rule of a grammar."

There are a few ways to define parsing. However, the gist remains the same: parsing means to find the
underlying structure of the data we are given.

Figure 2.4 Parsing example

In a way, parsing can be considered the inverse of templating: identifying the structure and extracting
the data. In templating, instead, we have a structure and we fill it with data. In the case of parsing, you
have to determine the model from the raw representation, while for templating, you have to combine

the data with the model to create the raw representation. Raw representation is usually text, but it can
also be binary data.

Fundamentally, parsing is necessary because different entities need the data to be in different forms.
Parsing allows transforming data in a way that can be understood by a specific software. The obvious
example is programs — they are written by humans, but they must be executed by computers. So,
humans write them in a form that they can understand, then a software transforms them in a way that
can be used by a computer.

2. Role of Parser
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by the
lexical analyzer are grouped according to the syntactic rules of the language. This is done by a parser.
The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be
generated by the grammar of the source language. It detects and reports any syntax errors and produces a parse tree
from which intermediate code can be generated.

Figure 2.5 Parsing

3. Structure of Parser

We can now look at the general structure of a parser.
A complete parser is usually composed of two parts: a lexer, also known as scanner or
tokenizer, and the proper parser. The parser needs the lexer because it does not work directly
on the text but on the output produced by the lexer. Not all parsers adopt this two-step schema;
some parsers do not depend on a separate lexer and they combine the two steps. They are called
scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens;
the parser then scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse addition.

437 + 734

The lexer scans the text and finds 4, 3, and 7, and then a space ( ). The job of the lexer is to recognize
that the characters 437 constitute one token of type NUM. Then the lexer finds a + symbol, which
corresponds to the second token of type PLUS, and lastly, it finds another token of type NUM.

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. In our example, a lexer rule
will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule will
specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to a sum expression.

It is now typical to find suites that can generate both a lexer and parser. In the past, it was instead more
common to combine two different tools: one to produce the lexer and one to produce the parser. For
example, this was the case of the venerable lex and yacc couple: using lex, it was possible to generate
a lexer, while using yacc, it was possible to generate a parser.
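
A minimal lexer sketch in Python for the "437 + 734" example (the token names NUM and PLUS follow the
text above; the implementation itself is only illustrative):

import re

TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),
]

def tokenize(text):
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    for match in re.finditer(pattern, text):
        if match.lastgroup != "SKIP":          # discard whitespace, as a lexer usually does
            yield (match.lastgroup, match.group())

print(list(tokenize("437 + 734")))
# [('NUM', '437'), ('PLUS', '+'), ('NUM', '734')]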

4. Lexers and Parsers

A lexer transforms a sequence of characters into a sequence of tokens.

Lexers are also known as scanners or tokenizers. Lexers play a role in parsing because they transform
the initial input into a form that is more manageable for the proper parser, which works at a later stage.
Typically, lexers are easier to write than parsers, although there are special cases when both are quite
complicated; for instance, in the case of C.

A very important part of the job of the lexer is dealing with whitespace. Most of the time, you want
the lexer to discard whitespace. That is because otherwise, the parser would have to check for the
presence of whitespace between every single token, which would quickly become annoying.

5. Parsing Tree and Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: parse tree and
abstract syntax tree (AST). Technically, the parse tree could also be called a concrete syntax tree
(CST) because it should reflect more concretely the actual syntax of the input, at least compared to
the AST.

Conceptually, they are very similar. They are both trees: there is a root that represents the
whole source code, and it has children nodes containing subtrees that represent smaller and
smaller portions of code, until single tokens (terminals) appear in the tree.

The difference is in the level of abstraction. A parse tree might contain all the tokens that appeared in
the program and possibly, a set of intermediate rules. The AST, instead, is a polished version of the
parse tree, in which only the information relevant to understanding the code is maintained. We are
going to see an example of an intermediate rule in the next section.

Some information might be absent both in the AST and the parse tree. For instance, comments and
grouping symbols (i.e. parentheses) are usually not represented. Things like comments are superfluous
for a program and grouping symbols are implicitly defined by the structure of the tree.

Figure 2.6 Example of a parse tree

In the AST the indication of the specific operator has disappeared and all that remains is the operation
to be performed. The specific operator is an example of an intermediate rule.

Graphical Representation of a Tree


The output of a parser is a tree, but the tree can also be represented in graphical ways. That allows
easier understanding for the developer. Some parser generator tools can output a file in the DOT
language, a language designed to describe graphs (a tree is a particular kind of graph). Then this file
is fed to a program that can create a graphical representation starting from this textual description.
Let's see a DOT text based on the previous sum example.

digraph sum {
  sum -> 10;
  sum -> 21;
}

The appropriate tool can create the following graphical representation.

6. Parsing Algorithms

Overview
Let’s start with a global overview of the features and strategies of all parsers.

Two Strategies
There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are defined
in relation to the parse tree generated by the parser. Explained in a simple way:
● A top-down parser tries to identify the root of the parse tree first, then moves down the subtrees
until it finds the leaves of the tree.
● A bottom-up parser instead starts from the lowest part of the tree, the leaves, and rises up until
it determines the root of the tree.

Let’s see an example, starting with a parse tree.

Figure 2.7 Example parse tree

The same tree would be generated in a different order by a top-down and a bottom-up parser. In the
following images, the number indicates the order in which the nodes are created.

Figure2.8 Top-down order of generation of the tree

Figure 2.9 Bottom-up order of generation of the tree

Tables of Parsing Algorithms


We provide a table below to offer a summary of the main information needed to understand and
implement a specific parser algorithm. The table lists:

● A formal description, to explain the theory behind the algorithm


● A more practical explanation
● One or two implementations, usually one easier and the other a professional parser. Sometimes,
though, there is no easier version or a professional one.

Figure 2.10 Table for Parsing algorithms

To understand how a parsing algorithm works, you can also look at the syntax analytic toolkit. It is an
educational parser generator that describes the steps that a generated parser takes to accomplish its
objective. It implements an LL and an LR algorithm.

The second table shows a summary of the main features of the different parsing algorithms and for
what they are generally used.

Figure2.11 Table for features of parsing algorithms

1. Top-Down Algorithms
The top-down strategy is the most widespread of the two strategies and there are several successful
algorithms applying it.

LL Parser
LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without
backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide which
rule to apply. The parsing table uses nonterminals as rows and terminals as columns.

To find the correct rule to apply:

1. Firstly, the parser looks at the current token and the appropriate amount of lookahead tokens.

2. Then, it tries to apply the different rules until it finds the correct match.

The concept of the LL parser does not refer to a specific algorithm, but more to a class of parsers.
They are defined in relation to grammars. That is to say, an LL parser is one that can parse an LL
grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens that are
needed to parse them. This number is indicated between parentheses next to LL, so in the form LL(k).

An LL(k) parser uses k tokens of lookahead and thus it can parse, at most, a grammar that needs k
tokens of lookahead to be parsed. Effectively, the concept of the LL(k) grammar is more widely
employed than the corresponding parser — which means that LL(k) grammars are used as a measure
when comparing different algorithms. For instance, you would read that PEG parsers can handle LL(*)
grammars.

Earley Parser

The Earley parser is a chart parser named after its inventor, Jay Earley. The algorithm is usually
compared to CYK, another chart parser, which is simpler but also usually worse in performance and
memory. The distinguishing feature of the Earley algorithm is that, in addition to storing partial results,
it implements a prediction step to decide which rule it is going to try to match next.

The Earley parser fundamentally works by dividing a rule into segments, like in the following example.

Figure 2.12 example for Earley Parser

Then, working on these segments, which can be connected at the dot (.), the parser tries to reach a completed state,
that is to say, one with the dot at the end.

The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages,
while other famous algorithms (i.e. LL, LR) can parse only a subset of them. For instance, it has no
problem with left-recursive grammars. More generally, an Earley parser can also deal with
nondeterministic and ambiguous grammars.

It can do that at the risk of worse performance (O(n³)) in the worst case. However, it has linear time
performance for normal grammars. The catch is that the set of languages parsed by more traditional
algorithms are the ones we are usually interested in.

There is also a side effect of the lack of limitations: by forcing a developer to write the grammar in a
certain way, the parsing can be more efficient; i.e., building an LL(1) grammar might be harder for the
developer, but the parser can apply it very efficiently. With Earley, you do less work, so the parser
does more of it.

In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in
terms of performance.

Recursive Descent Parser

A recursive descent parser is a parser that works with a set of (mutually) recursive procedures, usually
one for each rule of the grammar. Thus, the structure of the parser mirrors the structure of the
grammar.

The term predictive parser is used in a few different ways: some people mean it as a synonym for a
top-down parser, some as a recursive descent parser that never backtracks.

Typically, recursive descent parsers have problems parsing left-recursive rules because the algorithm
would end up calling the same function again and again. A possible solution to this problem is using
tail recursion. Parsers that use this method are called tail recursive parsers.
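
A minimal recursive-descent sketch for the toy grammar  sum -> NUM ('+' NUM)* , reusing the
(type, value) token tuples from the lexer sketch above (purely illustrative, one procedure per rule):

def parse_sum(tokens):
    tokens = list(tokens)
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] == kind:
            token = tokens[pos]
            pos += 1
            return token
        raise SyntaxError(f"expected {kind} at position {pos}")

    # sum -> NUM ('+' NUM)*
    node = ["sum", expect("NUM")]
    while pos < len(tokens) and tokens[pos][0] == "PLUS":
        expect("PLUS")
        node.append(expect("NUM"))
    return node

print(parse_sum([("NUM", "437"), ("PLUS", "+"), ("NUM", "734")]))
# ['sum', ('NUM', '437'), ('NUM', '734')]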

Pratt Parser

A Pratt parser is a widely unused, but much appreciated (by the few who know it), parsing algorithm
defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts
with a polemic on BNF grammars, which the author argues wrongly are the exclusive concerns of
parsing studies. This is one of the reasons for the lack of success. In fact, the algorithm does not rely
on a grammar but works directly on tokens, which makes it unusual to parsing experts.

The second reason is that traditional top-down parsers work great if you have a meaningful prefix that
helps distinguish between different rules. For example, if you get the token FOR, you are looking at a
for statement. Since this essentially applies to all programming languages and their statements, it is
easy to understand why the Pratt parser did not change the parsing world.

Parser Combinator

A parser combinator is a higher-order function that accepts parser functions as input and returns a new
parser function as output. A parser function usually means a function that accepts a string and outputs
a parse tree.

Parser combinators are modular and easy to build, but they are also slower (they have O(n⁴)
complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing
tasks or for prototyping. In a sense, the user of a parser combinator builds the parser partially by hand
but relies on the hard work done by whoever created the parser combinator.

The most basic example is the Maybe monad. This is a wrapper around a normal type, like integer,
that returns the value itself when the value is valid (i.e. 567), but a special value, Nothing, when it is
not (i.e. undefined or divided by zero). Thus, you can avoid using a null value and unceremoniously
crashing the program. Instead, the Nothing value is managed normally, like the program would manage any other
value.

2. Bottom-Up Algorithms

The bottom-up strategy's main success is the family of many different LR parsers. The
reason for their relative unpopularity is that, historically, they have been harder to build, although LR
parsers are more powerful than traditional LL(1) parsers. So, we mostly concentrate on them, apart
from a brief description of CYK parsers.
This means that we avoid talking about the more generic class of shift-reduce parsers, which also
includes LR parsers.

Shift-reduce algorithms work with two steps:

1. Shift: Read one token from the input, which will become a new (momentarily isolated) node.
2. Reduce: Once the proper rule is matched, join the resulting tree with a precedent existing
subtree.

Basically, the Shift step reads the input until completion, while the Reduce step joins the subtrees until
the final parse tree is built.

CYK Parser
The Cocke-Younger-Kasami (CYK) algorithm was formulated independently by three authors. Its
notability is due to its great worst-case performance (O(n³)), although it is hampered by comparatively
bad performance in most common scenarios.

However, the real disadvantage of the algorithm is that it requires grammars to be expressed in
Chomsky normal form.

The CYK algorithm is used mostly for specific problems; for instance, the membership problem: to
determine if a string is compatible with a certain grammar. It can also be used in natural language
processing to find the most probable parsing between many options.

LR Parser
LR (Left-to-right read of the input; Rightmost derivation) parsers are bottom-up parsers that can
handle deterministic context-free languages in linear time with lookahead and without backtracking.
The invention of LR parsers is credited to the renowned Donald Knuth.

Traditionally, they have been compared to and have competed with LL parsers. There's a similar
analysis related to the number of lookahead tokens necessary to parse a language. An LR(k) parser
can parse grammars that need k tokens of lookahead to be parsed. However, LR grammars are less
restrictive, and thus more powerful, than the corresponding LL grammars. For example, there is no
need to exclude left-recursive rules.

Technically, LR grammars are a superset of LL grammars. One consequence of this is that you need
only LR(1) grammars, so usually, the (k) is omitted.

They are also table-based, just like LL parsers, but they need two complicated tables. In very simple
terms:

1. One table tells the parser what to do depending on the current token, the state it is in, and the
tokens that could possibly follow the current one (lookahead sets).
2. The other table tells the parser which state to move to next.

Introduction to POS TAGS

1. Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the process of assigning labels known as POS tags to the words in
a sentence, which tell us about the part of speech of each word.

It is a process of converting a sentence to other forms – a list of words, or a list of tuples (where each tuple has the
form (word, tag)). The tag in this case is a part-of-speech tag, and signifies whether the word is a noun,
adjective, verb, and so on.

It is a popular Natural Language Processing process which refers to categorizing words in a text (corpus) in
correspondence with a particular part of speech, depending on the definition of the word and its context.

Broadly there are two types of POS tags:

1. Universal POS Tags: These tags are used in the Universal Dependencies (UD) framework (latest version 2), a
project that is developing cross-linguistically consistent treebank annotation for many languages. These
tags are based on the type of words. E.g., NOUN (Common Noun), ADJ (Adjective), ADV (Adverb).

2. Detailed POS Tags: These tags are the result of the division of universal POS tags into various tags,
like NNS for common plural nouns and NN for the singular common noun compared to NOUN for
common nouns in English. These tags are language-specific.

Figure 2.13 List of Universal POS Tags

Example 1 of Part-of-speech (POS) tagged corpus

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.

The format for a tagged corpus is of the form word/tag. Each word is paired with a tag denoting its POS. For example,
nn refers to a noun and vb to a verb.
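
A quick way to see such word/tag pairs is NLTK's built-in tagger (a sketch; note that nltk.pos_tag uses
the Penn Treebank tagset, so the tags differ slightly from the Brown-style tags shown above):

import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time downloads

tokens = nltk.word_tokenize("The expense and time involved are astronomical.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'), ('involved', 'VBN'), ...]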

Example 2 of Part-of-speech (POS) tagged corpus


Figure 2.14: Example of POS tagging

In Figure 2.14, we can see that each word has its own lexical term written underneath; however, having to constantly
write out these full terms when we perform text analysis can very quickly become cumbersome — especially
as the size of the corpus grows. Therefore, we use a short representation referred to as "tags" to represent the
categories.

As earlier mentioned, the process of assigning a specific tag to a word in our corpus is referred to as part-of-
speech tagging (POS tagging for short) since the POS tags are used to describe the lexical terms that we have
within our text.

Figure 2.15: Grid displaying different types of lexical terms, their tags, and random examples

Part-of-speech tags describe the characteristic structure of lexical terms within a sentence or text, therefore,
we can use them for making assumptions about semantics. Other applications of POS tagging include:

● Named Entity Recognition

● Co-reference Resolution
● Speech Recognition
When we perform POS tagging, it’s often the case that our tagger will encounter words that were not within
the vocabulary that was used. Consequently, augmenting your dataset to include unknown word tokens will
aid the tagger in selecting appropriate tags for those words.

Markov Chains
Taking the example text we used in Figure 2.14, "Why not tell someone?", imagine the sentence is truncated to
"Why not tell … " and we want to determine whether the following word in the sentence is a noun, verb,
adverb, or some other part of speech.
Now, if you are familiar with English, you'd instantly identify the verb and assume that it is more likely the
word is followed by a noun rather than another verb. Therefore, the idea shown in this example is that the
POS tag assigned to the next word is dependent on the POS tag of the previous word.

Figure 2.16: Representing Likelihoods visually

By associating numbers with each arrow direction, which imply the likelihood of the next word given the
current word, we can say there is a higher likelihood that the next word in our sentence would be a noun
rather than a verb if we are currently on a verb. The image in Figure 2.16 is
a great example of how a Markov Model works on a very small scale.

Given this example, we may now describe Markov models as "a stochastic model used to model randomly
changing systems. It is assumed that future states depend only on the current state, not on the events that
occurred before it (that is, it assumes the Markov property)". Therefore, to get the probability of the next
event, we need only the state of the current event.

We can depict a Markov chain as a directed graph:

Figure 2.17: Depiction of Markov Model as Graph

The lines with arrows are an indication of the direction hence the name “directed graph”, and the circles may
be regarded as the states of the model — a state is simply the condition of the present moment.

We could use this Markov model to perform POS tagging. Considering we view a sentence as a sequence of words,
we can represent the sequence as a graph where we use the POS tags as the events that occur, which would be
illustrated by the states of our model graph.

For example, q1 in the graph above would become NN indicating a noun, q2 would be VB which is short for verb, and
q3 would be O signifying all other tags that are not NN or VB. As before, the directed lines would be
given a transition probability that defines the probability of going from one state to the next.

Figure 2.17: Example of Markov Model to perform POS tagging.

A more compact way to store the transition and state probabilities is using a table, better known as a “transition
matrix”.

Figure 2.18: Transition Matrix (Image by Author)

Notice this model only tells us the transition probability of one state to the next when we know the previous
word. Hence, this model does not show us what to do when there is no previous word. To handle this case,
we add what is known as the “initial state”.

Figure 2.19: Adding an initial state to deal with the beginning of a sentence

You may now be wondering, how did we populate the transition matrix? Great question. I will use 3 sentences
for our corpus: "<s> in a station of the metro", "<s> the apparition of these faces in the crowd", and
"<s> petals on a wet, black bough." (Note these are the same sentences used in the course.) Next, we will
break down how to populate the matrix into steps:

1. Count occurrences of tag pairs in the training dataset

Formula 1: Counting the occurrences of tag pairs

At the end of step one, our table would look something like this…

Figure 2.20: applying step one with our corpus.

2. Calculate the probabilities using the counts

Formula 2: Calculate probabilities using the counts

Applying the above formula to the table in Figure 2.20, our new table would look as follows…

Figure 2.21: Probabilities populating the transition matrix.

You may notice that there are many 0’s in our transition matrix which would result in our model being
incapable of generalizing to other text that may contain verbs. To overcome this problem, we add
smoothing.

Adding smoothing requires that we slightly adjust the formula by adding a small value, epsilon, to each of
the counts in the numerator, and N * epsilon to the denominator, so that each row still sums to
1.

Formula 3: Calculating the probabilities with smoothing

Figure 2.22: New probabilities with smoothing added. N is the number of tags and epsilon is some
very small number.
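
A small sketch of these two steps in Python (the tag sequences below are hand-assigned NN/VB/O tags for
the three toy sentences, purely for illustration, not the exact counts used in the course):

from collections import defaultdict

tag_sequences = [
    ["<s>", "O", "O", "NN", "O", "O", "NN"],              # "<s> in a station of the metro"
    ["<s>", "O", "NN", "O", "O", "NN", "O", "O", "NN"],   # "<s> the apparition of these faces in the crowd"
    ["<s>", "NN", "O", "O", "O", "O", "NN"],              # "<s> petals on a wet, black bough."
]

# Step 1: count occurrences of tag pairs
pair_counts = defaultdict(int)
prev_counts = defaultdict(int)
for tags in tag_sequences:
    for prev_tag, tag in zip(tags, tags[1:]):
        pair_counts[(prev_tag, tag)] += 1
        prev_counts[prev_tag] += 1

# Step 2: turn the counts into smoothed probabilities, as in Formula 3
epsilon = 0.001
all_tags = ["NN", "VB", "O"]
N = len(all_tags)
for prev_tag in ["<s>"] + all_tags:
    row = [(pair_counts[(prev_tag, t)] + epsilon) / (prev_counts[prev_tag] + N * epsilon)
           for t in all_tags]
    print(prev_tag, [round(p, 3) for p in row])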

Hidden Markov Model

Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed
to be a Markov process with unobservable (“hidden”) states . In our case, the unobservable states are the POS
tags of a word.

If we rewind back to our Markov Model , we see that the model has states for part of speech such as VB for
verb and NN for a noun. We may now think of these as hidden states since they are not directly observable
from the corpus. Though a human may be capable of deciphering which POS applies to a specific word, a
machine only sees the text (which is observable) and is unaware of whether that word's POS tag is noun,
verb, or something else, which in turn means the tags are unobservable (hidden).
The emission probabilities describe the transitions from the hidden states in the model — remember the hidden
states are the POS tags — to the observable states — remember the observable states are the words.

Figure 2.23: Example of Hidden Markov model.

In Figure 2.23 we see that for the hidden VB state we have several observable states. The emission probability from
the hidden state VB to the observable word "eat" is 0.5, hence there is a 50% chance that the model would output this
word when the current hidden state is VB.
We can also represent the emission probabilities as a table…

Figure 2.24: Emission matrix expressed as a table — The numbers are not accurate representations,
they are just random

Similar to the transition probability matrix, the row values must sum to 1. Also, all of our emission
probabilities are greater than 0, since words can take different POS tags depending on the context.

To populate the emission matrix, we’d follow a procedure very similar to the way we’d populate the transition
matrix. We’d first count how often a word is tagged with a specific tag.

Figure 2.25: Calculating the counts of a word and how often it is tagged with a specific tag.

Since the process is so similar to calculating the transition matrix, I will instead provide you with the formula
with smoothing applied to see how it would be calculated.

Formula 4: Formula for calculating emission probabilities with smoothing, where N is the number of words in
the vocabulary and epsilon is a very small number

EXPERIMENT NO 3
INTRODUCTION TO NLTK

What is the Natural Language Toolkit(NLTK) in NLP?

Natural language processing is about building applications and services that can understand
human languages. It is a field concerned with the interaction between computers and humans, and it is
mainly used for text analysis, giving computers a way to recognize human language.

NLP is the technology behind the chatbots, voice assistants, predictive text and similar text
applications that have emerged in recent years, and there is a wide variety of open-source NLP tools
available.

With the help of NLP tools and techniques, most NLP tasks can be performed; a few examples of such
tasks are speech recognition, summarization, topic segmentation, understanding what a text is about,
and sentiment analysis.

Understanding NLTK

NLTK is a preeminent platform for developing Python programs that work with human
language data. It is a suite of open-source program modules, tutorials and problem sets providing
ready-to-use computational linguistics courseware. NLTK covers symbolic and statistical Natural
Language Processing and is integrated with annotated corpora, making it especially suitable for teachers and students.

The most significant features of NLTK include:


1. It provides easy-to-use interfaces to over 50 corpora and lexical resources, for
example WordNet, along with text processing libraries for classification and tokenization, and wrappers for
industrial-strength NLP libraries.
2. NLTK is suitable for translators, educators, researchers, and industrial applications and is
available on Windows, Mac OS X, and Linux.
3. It comes with a hands-on guide that introduces computational linguistics and Python programming
fundamentals, which makes it a good fit for lexicographers who do not have deep
programming knowledge.
4. NLTK is an effective combination of three factors: first, it was deliberately designed as
courseware and gives pedagogical objectives primary status; second, its target audience
comprises both linguists and computer specialists, and it is both convenient and challenging
at various levels of initial computational skill; and third, it is built on an object-oriented
scripting language that supports rapid prototyping and literate programming.

Requirements of NLTK

1. Ease of use: One of the main objectives of the toolkit is to let users
focus on developing NLP components and systems. The more time students must spend learning
to use the toolkit, the less useful it is.
2. Consistency: The toolkit must use consistent data structures and interfaces.
3. Extensibility: The toolkit should easily accommodate new components, whether those components
replicate or extend the existing functionality of the toolkit. The toolkit should be
organised in such a way that adding new extensions fits naturally into its
existing infrastructure.
4. Documentation: The toolkit, its data structures and its implementation need to be documented
carefully. All nomenclature must be chosen carefully and used consistently.
5. Simplicity: The toolkit should hide the tedious work involved in building NLP systems rather than
add to it. Every class defined by the toolkit should be simple enough that students could
implement it by the time they finish an introductory course in computational linguistics.
6. Modularity: The interaction between different components of the toolkit should be
kept to a minimum, using simple, well-defined interfaces. It should be possible to complete
individual projects using small parts of the toolkit, without worrying about how they interact with
the rest of the toolkit.

Uses of NLTK

1. Assignments: NLTK can be used to create assignments for students of varying difficulty and
scope. After becoming familiar with the toolkit, users can make small changes or extensions
to an existing module in NLTK. For developing a new module, NLTK provides several useful
starting points: pre-defined interfaces and data structures, and existing modules that implement
the same interface.
2. Class demonstrations: NLTK offers graphical tools that can be used in class
demonstrations to help explain elementary NLP concepts and algorithms. These
interactive tools can display the relevant data structures and show the step-by-
step execution of algorithms.
3. Advanced Projects: NLTK gives users a flexible framework for advanced projects.
Typical projects include developing entirely new functionality for a previously
unsupported NLP task, or building a complete system out of existing and new modules.

Text Analysis Operations using NLTK

NLTK is a powerful Python package that provides a diverse set of natural language algorithms. It is
free, open source, easy to use, well documented, and has a large community. NLTK includes the most
common algorithms such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic
segmentation, and named entity recognition. NLTK helps the computer to analyze, preprocess, and
understand written text.

Now we first install and import nltk in our system. Open the terminal and type the following command-

!pip install nltk

Now we can see the following message.
Requirement already satisfied: nltk in /home/northout/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: six in /home/northout/anaconda2/lib/python2.7/site-packages (from
nltk)
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Further we import the nltk using the following command and start with the operations.

#Loading NLTK
import nltk
1. Tokenization

Tokenization is the first step in text analytics. The process of breaking down a text paragraph
into smaller chunks such as words or sentences is called tokenization. A token is a single entity
that is a building block of a sentence or paragraph.

2. Sentence Tokenization

Sentence tokenizer breaks text paragraph into sentences.


from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is
awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

Output -

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is
awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]

Here, the given text is tokenized into sentences.

3. Word Tokenization

Word tokenizer breaks text paragraph into words.

from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',',
'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
'cardboard']

4. Frequency Distribution

from nltk.probability import FreqDist


fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 25 samples and 30 outcomes>

fdist.most_common(2)

[('is', 3), (',', 2)]

# Frequency Distribution Plot


import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()

5. Stopwords

Stopwords are considered noise in the text. Text may contain stop words such as is, am, are,
this, a, an, the, etc.

To remove stopwords in NLTK, you need to create a list of stopwords and filter your list of
tokens against these words.
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)
output:-
{'their', 'then', 'not', 'ma', 'here', 'other', 'won', 'up', 'weren', 'being', 'we', 'those', 'an', 'them',
'which', 'him', 'so', 'yourselves', 'what', 'own', 'has', 'should', 'above', 'in', 'myself', 'against', 'that',
'before', 't', 'just', 'into', 'about', 'most', 'd', 'where', 'our', 'or', 'such', 'ours', 'of', 'doesn', 'further',
'needn', 'now', 'some', 'too', 'hasn', 'more', 'the', 'yours', 'her', 'below', 'same', 'how', 'very', 'is',
'did', 'you', 'his', 'when', 'few', 'does', 'down', 'yourself', 'i', 'do', 'both', 'shan', 'have', 'itself',
'shouldn', 'through', 'themselves', 'o', 'didn', 've', 'm', 'off', 'out', 'but', 'and', 'doing', 'any', 'nor',
'over', 'had', 'because', 'himself', 'theirs', 'me', 'by', 'she', 'whom', 'hers', 're', 'hadn', 'who', 'he',
'my', 'if', 'will', 'are', 'why', 'from', 'am', 'with', 'been', 'its', 'ourselves', 'ain', 'couldn', 'a', 'aren',
'under', 'll', 'on', 'y', 'can', 'they', 'than', 'after', 'wouldn', 'each', 'once', 'mightn', 'for', 'this', 'these',
's', 'only', 'haven', 'having', 'all', 'don', 'it', 'there', 'until', 'again', 'to', 'while', 'be', 'no', 'during',
'herself', 'as', 'mustn', 'between', 'was', 'at', 'your', 'were', 'isn', 'wasn'}

Removing Stopwords

tokenized_sent = word_tokenize(tokenized_text[0])   # word-tokenize the first sentence from the example above
filtered_sent=[]
for w in tokenized_sent:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:",tokenized_sent)
print("Filterd Sentence:",filtered_sent)

output:-
Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?']
Filterd Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']

6. Lexicon Normalization

Lexicon normalization deals with another type of noise in the text. For example, the words connection,
connected and connecting all reduce to the common word "connect". It reduces derivationally
related forms of a word to a common root word.

7. Stemming

Stemming is a process of linguistic normalization which reduces words to their root form
by chopping off derivational affixes. For example, the words connection, connected and connecting
all reduce to the common word "connect".
# Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence:",filtered_sent)
print("Stemmed Sentence:",stemmed_words)

Output:-
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?']

8. Lemmatization

Lemmatization reduces words to their base word, which is a linguistically correct lemma. It
finds the root word using vocabulary and morphological analysis. Lemmatization
is usually more sophisticated than stemming: a stemmer works on an individual word without
knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming
misses this, because it requires a dictionary look-up.
#Lexicon Normalization
#performing stemming and Lemmatization

from nltk.stem.wordnet import WordNetLemmatizer


lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer


stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))

output-
Lemmatized Word: fly
Stemmed Word: fli

9. POS Tagging

The primary target of Part-of-Speech (POS) tagging is to identify the grammatical
group of a given word, whether it is a NOUN, PRONOUN, ADJECTIVE, VERB,
ADVERB, etc., based on the context. POS tagging looks for relationships within the
sentence and assigns a corresponding tag to each word.

sent = "Albert Einstein was born in Ulm, Germany in 1879."

tokens=nltk.word_tokenize(sent)
print(tokens)
output:-
['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']

nltk.pos_tag(tokens)
Output-
[('Albert', 'NNP'),
('Einstein', 'NNP'),
('was', 'VBD'),
('born', 'VBN'),
('in', 'IN'),
('Ulm', 'NNP'),
(',', ','),
('Germany', 'NNP'),
('in', 'IN'),

('1879', 'CD'),
('.', '.')]

EXPERIMENT NO 4
WRITE A PYTHON PROGRAM TO REMOVE “STOPWORDS” FROM A GIVEN
TEXT AND GENERATE WORD TOKENS AND FILTERED TEXT

In NLTK for removing stopwords, you need to create a list of stopwords and filter out your
list of tokens from these words.

Code:-
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords') may be required the first time
stop_words = set(stopwords.words('english'))
print(stop_words)

Output:-

Removing Stopwords

Code:-

from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize
example_sent = """ A stop word is a commonly used word that a search engine
has been programmed to ignore, both when indexing entries for
searching and when retrieving them as the result of a search query."""
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print("Tokenized_Sentence:",word_tokens)
print("Filtered_Sentence:",filtered_sentence)

Output

EXPERIMENT NO 5
WRITE A PYTHON PROGRAM TO GENERATE “TOKENS” AND ASSIGN “POS
TAGS” FOR A GIVEN TEXT USING NLTK PACKAGE

Code -
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

text = "Tokenization is one of the least glamorous parts of NLP. How do we split our text"\
"so that we can do interesting things on it. "\
"Despite its lack of glamour, it’s super important."\
"Tokenization defines what our NLP models can express. "\
"Even though tokenization is super important, it’s not always top of mind."\
"In the rest of this article, I’d like to give you a high-level overview of tokenization, where it came from,"\
"what forms it takes, and when and how tokenization is important "\

tokenized = sent_tokenize(text)
for i in tokenized:

wordsList = nltk.word_tokenize(i)

wordsList = [w for w in wordsList if not w in stop_words]

pos_tag= nltk.pos_tag(wordsList)

print("Pos-tags",pos_tag)

Output:-

EXPERIMENT NO 6
WRITE A PYTHON PROGRAM TO GENERATE “WORDCLOUD” WITH
MAXIMUM WORDS USED = 100, IN DIFFERENT SHAPES AND SAVE AS
A .PNG FILE FOR A GIVEN TEXT FILE.

Wordcloud 1
Code:-

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

text = open('batman.txt', 'r').read()


stopwords = set(STOPWORDS)

custom_mask = np.array(Image.open('like.png'))
wc = WordCloud(background_color = 'black',
               stopwords = stopwords,
               max_words = 100,            # limit to the 100 most frequent words, as required
               mask = custom_mask,
               contour_width = 3,
               contour_color = 'black')

wc.generate(text)
image_colors = ImageColorGenerator(custom_mask)
wc.recolor(color_func = image_colors)

#Plotting

wc.to_file('like_cloud.png')

The Image :-

Output :-

Wordcloud 2
Code:-

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

text = open('batman.txt', 'r').read()


stopwords = set(STOPWORDS)

custom_mask = np.array(Image.open('girl.png'))
wc = WordCloud(background_color = 'white',
               stopwords = stopwords,
               max_words = 100,            # limit to the 100 most frequent words, as required
               mask = custom_mask,
               contour_width = 3,
               contour_color = 'black')

wc.generate(text)
image_colors = ImageColorGenerator(custom_mask)
wc.recolor(color_func = image_colors)

wc.to_file('girl_cloud.png')

The Image : -

Output :-

EXPERIMENT NO 7
PERFORM AN EXPERIMENT TO LEARN ABOUT MORPHOLOGICAL
FEATURES OF A WORD BY ANALYZING IT.

Introduction : Word Analysis

A word can be simple or complex. For example, the word 'cat' is simple because one cannot further
decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is
made up of two parts: the root 'cat' and the plural suffix '-s'.

Theory

Analysis of a word into root and affix(es) is called Morphological analysis of a word. It is
mandatory to identify the root of a word for any natural language processing task. A root word can
have various forms. For example, the word 'play' in English has the following forms: 'play', 'plays',
'played' and 'playing'. Hindi shows a greater number of forms for the word 'खेल' (khela) which is
equivalent to 'play'. The forms of 'खेल'(khela) are the following:

खेल(khela), खेला(khelaa), खेली(khelii), खेलंगू ा(kheluungaa), खेलंगू ी(kheluungii), खेलेगा(khelegaa),


खेलेगी(khelegii), खेलते(khelate), खेलती(khelatii), खेलने(khelane), खेलकर(khelakar)

For the Telugu root Adadam ('to play'), the forms are the following:

Adutaanu, AdutunnAnu, Adenu, Ademu, AdevA, AdutAru, Adutunnaru, AdadAniki, Adesariki,
AdanA, Adinxi, Adutunxi, AdinxA, AdeserA, Adestunnaru

Thus we understand that the morphological richness of one language might vary from one language
to another. Indian languages are generally morphologically rich languages and therefore
morphological analysis of words becomes a very significant task for Indian languages.

Types of Morphology

Morphology is of two types,

1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For example, 'played'
is an inflection of the root word 'play'. Here, both 'played' and 'play' are verbs.

2. Derivational morphology

Deals with word forms of a root, where there is a change in the lexical category. For example, the
word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness' is a derived noun form of
the adjective 'happy'.

Morphological Features:

All words will have their lexical category attested during morphological analysis.

A noun and pronoun can take suffixes of the following features: gender, number, person, case

For example, morphological analysis of a few words is given below:

Language    input: word          output: analysis

Hindi       लड़के (ladake)         rt=लड़का (ladakaa), cat=n, gen=m, num=sg, case=obl

Hindi       लड़के (ladake)         rt=लड़का (ladakaa), cat=n, gen=m, num=pl, case=dir

Hindi       लड़कों (ladakoM)       rt=लड़का (ladakaa), cat=n, gen=m, num=pl, case=obl

English     boy                  rt=boy, cat=n, gen=m, num=sg

English     boys                 rt=boy, cat=n, gen=m, num=pl

A verb can take suffixes of the following features: tense, aspect, modality, gender, number, person.

Language    input: word    output: analysis

English     plays          rt=play, cat=v, num=sg, per=3, tense=pr

English     played         rt=play, cat=v, tense=past

'rt' stands for root. 'cat' stands for lexical category. The value of lexical category can be noun, verb,
adjective, pronoun, adverb, preposition. 'gen' stands for gender. The value of gender can be
masculine or feminine.

◆ 'num' stands for number. The value of number can be singular (sg) or plural (pl).
◆ 'per' stands for person. The value of person can be 1, 2 or 3.
◆ The value of tense can be present, past or future. This feature is applicable for verbs.
◆ The value of aspect can be perfect (pft), continuous (cont) or habitual (hab). This feature is
also applicable for verbs.
◆ 'case' can be direct or oblique. This feature is applicable for nouns. A case is an oblique case
when a postposition occurs after the noun; if no postposition can occur after the noun, then the
case is a direct case. This is applicable for Hindi but not English, as English does not have any
postpositions. Some of the postpositions in Hindi are: का (kaa), की (kii), के (ke), को (ko), में (meM).

Objective :- The objective of the experiment is to learn about morphological features of a word by
analysing it.

Procedure and Experiment

STEP1: Select the language.

OUTPUT: Drop down for selecting words will appear.

STEP2: Select the word.

OUTPUT: Drop down for selecting features will appear.

STEP3: Select the features.

STEP4: Click "Check" button to check your answer.

OUTPUT: Right features are marked by tick and wrong features are marked by cross.

EXPERIMENT NO 8

PERFORM AN EXPERIMENT TO GENERATE WORD FORMS FROM
ROOT AND SUFFIX INFORMATION

Introduction : Word Generation

A word can be simple or complex. For example, the word 'cat' is simple because one cannot further
decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is
made up of two parts: the root 'cat' and the plural suffix '-s'.

Theory:- Given the root and suffix information, a word can be generated. For example,

Language    input: analysis                                       output: word

Hindi       rt=लड़का (ladakaa), cat=n, gen=m, num=sg, case=obl      लड़के (ladake)

Hindi       rt=लड़का (ladakaa), cat=n, gen=m, num=pl, case=dir      लड़के (ladake)

English     rt=boy, cat=n, num=pl                                 boys

English     rt=play, cat=v, num=sg, per=3, tense=pr               plays

- Morphological analysis and generation: inverse processes.

- Analysis may involve non-determinism, since more than one analysis is possible.

- Generation is a deterministic process. In case a language allows spelling variation, then to that
extent generation would also involve non-determinism.
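
As a minimal illustration of generation for the English rows of the table above, the sketch below builds a word form
from a root and its features. The suffix rules are simplified assumptions (regular nouns and verbs only), not a
complete generator.

Code (illustrative):-

def generate(rt, cat, num=None, per=None, tense=None):
    """Generate a word form from the root and feature information."""
    if cat == "n":                                   # nouns: plural takes -s
        return rt + "s" if num == "pl" else rt
    if cat == "v":                                   # verbs: 3rd person singular present takes -s
        if tense == "pr" and num == "sg" and per == 3:
            return rt + "s"
        return rt
    return rt

print(generate("boy", "n", num="pl"))                       # boys
print(generate("play", "v", num="sg", per=3, tense="pr"))   # plays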

Objective : The objective of the experiment is to generate word forms from root and suffix information

Procedure :-

STEP1: Select the language.

OUTPUT: Drop downs for selecting root and other features will appear.

STEP2: Select the root and other features.

STEP3: After selecting all the features, select the word corresponding to the features selected above.

STEP4: Click the check button to see whether the right word is selected or not.

OUTPUT: The output tells whether the selected word is right or wrong.

EXPERIMENT NO 9
PERFORM AN EXPERIMENT TO UNDERSTAND THE MORPHOLOGY OF
A WORD BY THE USE OF ADD-DELETE TABLE

Introduction : Morphology

Morphology is the study of the way words are built up from smaller meaning bearing units i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit. For eg:

● बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa) carries the information of the root
noun "बच्चा" (bachchaa), and ओं (oM) carries the information of plural number and oblique case.
● played has two morphemes, play and -ed, carrying the information of the verb "play" and "past tense", so the
given word is the past tense form of the verb "play".

Words can be analysed morphologically if we know all variants of a given root word. We can use an
'Add-Delete' table for this analysis.

Theory :-

Morph Analyser

Definition
Morphemes are considered the smallest meaningful units of language. These morphemes can either
be a root word (play) or an affix (-ed). The combination of these morphemes is called a morphological
process. So, the word "played" is made out of 2 morphemes, "play" and "-ed". Finding all parts of a
word (its morphemes) and thereby describing the properties of the word is called "Morphological Analysis". For
example, "played" carries the information verb "play" and "past tense", so the given word is the past tense form of
the verb "play".

Analysis of a word :
बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix)

(ओं = plural, oblique case)

A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be
classified according to shared inflectional categories (eg: number, case etc) and arranged into tables.

Paradigm for बच्चा

case/num    singular            plural

direct      बच्चा (bachchaa)      बच्चे (bachche)

oblique     बच्चे (bachche)       बच्चों (bachchoM)

Algorithm to get बच्चों(bachchoM) from बच्चा(bachchaa)

1. Take the root बच्चा (bachchaa) = बच्च (bachch) + आ (aa)

2. Delete आ (aa)

3. Output बच्च (bachch)

4. Add ओं (oM) to the output

5. Return बच्चों (bachchoM)

Therefore आ is deleted and ओं is added to get बच्चों.

Add-Delete table for बच्चा

Delete    Add       Number    Case    Variants

आ (aa)    आ (aa)    sing      dr      बच्चा (bachchaa)

आ (aa)    ए (e)     plu       dr      बच्चे (bachche)

आ (aa)    ए (e)     sing      ob      बच्चे (bachche)

आ (aa)    ओं (oM)   plu       ob      बच्चों (bachchoM)

Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same paradigm class
as बच्च, so लड़का behaves in the same way as बच्चा, since they share the same paradigm class.
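
The add-delete table can also be applied mechanically. The sketch below generates the four variants of बच्चा from its
root using the rows of the table above; note that the string operations work on the vowel signs (ा, े, ों) rather than
on the independent vowel letters used as labels in the table.

Code (illustrative):-

root = "बच्चा"

add_delete_table = [
    # (delete, add, number, case)
    ("ा", "ा",  "sg", "dr"),   # बच्चा
    ("ा", "े",  "pl", "dr"),   # बच्चे
    ("ा", "े",  "sg", "ob"),   # बच्चे
    ("ा", "ों", "pl", "ob"),   # बच्चों
]

for delete, add, num, case in add_delete_table:
    # delete the old suffix from the root, then add the new suffix
    stem = root[:-len(delete)] if root.endswith(delete) else root
    form = stem + add
    print(form, "num=" + num, "case=" + case)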

Objective :- Understanding the morphology of a word by the use of Add-Delete table

Procedure :-

STEP1: Select a word root.

STEP2: Fill the add-delete table and submit.

STEP3: If wrong, see the correct answer or repeat STEP1.

Wrong output:-

Right output:-

EXPERIMENT NO 10
PERFORM AN EXPERIMENT TO LEARN TO CALCULATE BIGRAMS
FROM A GIVEN CORPUS AND CALCULATE PROBABILITY OF A
SENTENCE.

Introduction :- N - Grams
The probability of a sentence can be calculated from the probability of the sequence of words occurring
in it. We can use the Markov assumption, that the probability of a word in a sentence depends only on the
word occurring just before it. Such a model is called a first order Markov model or the
bigram model.

Here, Wn refers to the word token corresponding to the nth word in a sequence.

Theory

A combination of words forms a sentence. However, such a formation is meaningful only when the
words are arranged in some order.

Eg: Sit I car in the

Such a sentence is not grammatically acceptable. However some perfectly grammatical sentences can
be nonsensical too!

Eg: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to the strings of
words i.e, how likely the sentence is.
Probability of a sentence

If we consider each word occurring in its correct location as an independent event, the probability of
the sentence is: P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:

= P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) ... P(w(n) | w(1)w(2)…w(n-1))

Bigrams

We can avoid this very long calculation by approximating that the probability of a given word depends
only on its previous word. This assumption is called the Markov assumption, and such
a model is called a Markov model, or bigram model. Bigrams can be generalized to the n-gram, which looks at
(n-1) words in the past. A bigram is a first-order Markov model.

Therefore ,

P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) * P(w(2)|w(1)) * P(w(3)|w(2)) * ... * P(w(n)|w(n-1))

We use (eos) tag to mark the beginning and end of a sentence

A bigram table for a given corpus can be generated and used as a lookup table for calculating
probability of sentences.

Eg: Corpus – (eos) You book a flight (eos) I read a book (eos) You read (eos)

Bigram Table:

          (eos)   you    book   a      flight   I      read

(eos)     0       0.5    0      0      0        0.25   0

you       0       0      0.5    0      0        0      0.5

book      0.5     0      0      0.5    0        0      0

a         0       0      0.5    0      0.5      0      0

flight    1       0      0      0      0        0      0

I         0       0      0      0      0        0      1

read      0.5     0      0      0.5    0        0      0

Each entry is the bigram count divided by the unigram count of the row word; for example, count(eos) = 4 and
(eos) is followed by 'you' twice, giving P(you|eos) = 0.5.

P((eos) you read a book (eos))

= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)

= 0.5 * 0.5 * 0.5 * 0.5 * 0.5

= 0.03125
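
The bigram table and the sentence probability above can be reproduced with a short script. The sketch below uses plain
maximum-likelihood estimates (each bigram count divided by the unigram count of the first word), with no smoothing, so
any unseen bigram would make the whole product zero.

Code (illustrative):-

from collections import defaultdict

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)"
tokens = [w.lower() for w in corpus.split()]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for tok in tokens:
    unigram_counts[tok] += 1
for prev, curr in zip(tokens, tokens[1:]):
    bigram_counts[(prev, curr)] += 1

def bigram_prob(prev, curr):
    # P(curr | prev) = C(prev curr) / C(prev)
    return bigram_counts[(prev, curr)] / unigram_counts[prev]

sentence = "(eos) you read a book (eos)".split()
prob = 1.0
for prev, curr in zip(sentence, sentence[1:]):
    prob *= bigram_prob(prev, curr)
print(prob)   # 0.03125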

Objective :- The objective of this experiment is to learn to calculate bigrams from a given corpus and
calculate probability of a sentence.

Procedure:-
STEP1: Select a corpus and click on

Generate bigram table

STEP2: Fill up the table that is generated and hit

Submit

STEP3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step 2.

STEP4: If correct (green), click on take a quiz and fill the correct answer

EXPERIMENT NO 11
PERFORM AN EXPERIMENT TO LEARN HOW TO APPLY ADD-ONE
SMOOTHING ON SPARSE BIGRAM TABLE.

Introduction : - N-Grams Smoothing


One major problem with standard N-gram models is that they must be trained from some corpus, and
because any particular training corpus is finite, some perfectly acceptable N-grams are bound to be
missing from it. We can see that the bigram matrix for any given training corpus is sparse. There are a large
number of bigrams with zero probability that should really have some non-zero probability.
This method tends to underestimate the probability of strings that happen not to have occurred
in the training corpus.

There are some techniques that can be used for assigning a non-zero probability to these 'zero
probability bigrams'. This task of re-evaluating some of the zero-probability and low-probability N-
grams, and assigning them non-zero values, is called smoothing.

Theory :-
The standard N-gram models are trained from some corpus. The finiteness of the training corpus leads
to the absence of some perfectly acceptable N-grams. This results in sparse bigram matrices, and the
method tends to underestimate the probability of strings that do not occur in the training corpus.

There are some techniques that can be used for assigning a non-zero probability to these 'zero
probability bigrams'. This task of re-evaluating some of the zero-probability and low-probability N-
grams, and assigning them non-zero values, is called smoothing. Some of the techniques are: Add-
One Smoothing, Witten-Bell Discounting, Good-Turing Discounting.

Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before normalizing them into probabilities.

Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram probability can be computed by
dividing the count of the word by the total number of word tokens N:

P(wx) = c(wx) / sumi c(wi) = c(wx) / N

Let there be an adjusted count c*:

ci* = (ci + 1) * N / (N + V)

where V is the total number of word types in the language.
Now, probabilities can be calculated by normalizing the counts by N:

pi* = (ci + 1) / (N + V)

Application on bigrams
Normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

For add-one smoothed bigram counts we need to augment the unigram count by the total number of
word types in the vocabulary V:

p*(wn | wn-1) = ( C(wn-1 wn) + 1 ) / ( C(wn-1) + V )
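
A minimal sketch of the bigram case is given below; the unigram and bigram counts are those of the toy corpus from the
previous experiment, written out directly for brevity, and V is the vocabulary size.

Code (illustrative):-

vocab = ["(eos)", "you", "book", "a", "flight", "i", "read"]
V = len(vocab)

unigram = {"(eos)": 4, "you": 2, "book": 2, "a": 2, "flight": 1, "i": 1, "read": 2}
bigram = {("(eos)", "you"): 2, ("(eos)", "i"): 1, ("you", "book"): 1, ("you", "read"): 1,
          ("book", "a"): 1, ("book", "(eos)"): 1, ("a", "flight"): 1, ("a", "book"): 1,
          ("flight", "(eos)"): 1, ("i", "read"): 1, ("read", "a"): 1, ("read", "(eos)"): 1}

def smoothed(prev, curr):
    # p*(w_n | w_n-1) = (C(w_n-1 w_n) + 1) / (C(w_n-1) + V)
    return (bigram.get((prev, curr), 0) + 1) / (unigram[prev] + V)

print(round(smoothed("you", "book"), 3))    # a seen bigram: (1 + 1) / (2 + 7)
print(round(smoothed("you", "flight"), 3))  # an unseen bigram now gets a small non-zero probability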

Objective:- The objective of this experiment is to learn how to apply add-one smoothing on sparse bigram
table.

Procedure :-
STEP1: Select a corpus

STEP2: Apply add one smoothing and calculate bigram probabilities using the given bigram counts,N
and V. Fill the table and hit

Submit

STEP3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step 2

EXPERIMENT NO 12
PERFORM AN EXPERIMENT TO CALCULATE EMISSION AND TRANSITION
MATRIX WHICH WILL BE HELPFUL FOR TAGGING PARTS OF SPEECH
USING HIDDEN MARKOV MODEL.

Introduction:-
POS TAGGING - Hidden Markov Model

POS tagging or part-of-speech tagging is the procedure of assigning a grammatical category like noun,
verb, adjective etc. to a word. In this process both the lexical information and the context play an
important role as the same lexical form can behave differently in a different context.

For example the word "Park" can have two different lexical categories based on the context.

1. The boy is playing in the park. ('Park' is Noun)


2. Park the car. ('Park' is Verb)

Assigning part of speech to words by hand is a common exercise one can find in an elementary
grammar class. But here we wish to build an automated tool which can assign the appropriate part-of-
speech tag to the words of a given sentence. One can think of creating hand crafted rules by observing
patterns in the language, but this would limit the system's performance to the quality and number of
patterns identified by the rule crafter. Thus, this approach is not practically adopted for building POS
Tagger. Instead, a large corpus annotated with correct POS tags for each word is given to the computer
and algorithms then learn the patterns automatically from the data and store them in form of a trained
model. Later this model can be used to POS tag new sentences

In this experiment we will explore how such a model can be learned from the data.

Theory : -

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is
assumed to be a Markov process with unobserved (hidden) states.In a regular Markov model (Markov
Model (Ref: http://en.wikipedia.org/wiki/Markov_model)), the state is directly visible to the observer,
and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the
state is not directly visible, but output, dependent on the state, is visible.

Hidden Markov Model has two important components-

1)Transition Probabilities: The one-step transition probability is the probability of transitioning from
one state to another in a single step.

2) Emission Probabilities: The output probabilities for an observation from a state. Emission
probabilities B = { bi,k = bi(ok) = P(ok | qi) }, where ok is an observation. Informally, B is the
probability that the output is ok given that the current state is qi.

For POS tagging, it is assumed that POS tags are generated as a random process, and each process randomly
generates a word. Hence, the transition matrix denotes the transition probability from one POS to another
and the emission matrix denotes the probability that a given word can have a particular POS. Words act
as the observations. Some of the basic assumptions are:

1. First-order (bigram) Markov assumptions:


a. Limited Horizon: Tag depends only on previous tag

P(ti+1 = tk | t1 = tj1, …, ti = tji) = P(ti+1 = tk | ti = tj)

b. Time invariance: No change over time

P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj -> tk)

2. Output probabilities:

Probability of getting word wk for tag tj: P(wk | tj) is independent of other tags or words!

Calculating the Probabilities

Consider the given toy corpus

EOS/eos They/pronoun cut/verb the/determiner paper/noun

EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun

EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun

EOS/eos
Calculating Emission Probability Matrix

Count the number of times a specific word occurs with a specific POS tag in the corpus.

Here, say for "cut"

count(cut,verb)=1

count(cut,noun)=2

count(cut,determiner)=0

... and so on zero for other tags too.

count(cut) = total count of cut = 3

Now, calculating the probability

Probability to be filled in the matrix cell at the intersection of cut and verb

P(cut/verb)=count(cut,verb)/count(cut)=1/3=0.33

Similarly,

Probability to be filled in the cell at the intersection of cut and determiner

P(cut/determiner)=count(cut,determiner)/count(cut)=0/3=0

Repeat the same for all the word-tag combinations and fill the emission matrix.

Calculating Transition Probability Matrix

Count the no. of times a specific tag comes after other POS tags in the corpus.

Here, say for "determiner"

count(verb,determiner)=2

count(preposition,determiner)=1

count(determiner,determiner)=0

count(eos,determiner)=0

count(noun,determiner)=0

... and so on zero for other tags too.

count(determiner) = total count of tag 'determiner' = 3

Now, calculating the probability

Probability to be filled in the cell at the intersection of determiner (in the column) and verb (in the row)

P(determiner/verb)=count(verb,determiner)/count(determiner)=2/3=0.66

Similarly,

Probability to be filled in the cell at the intersection of determiner(in the column) and noun(in the
row)

P(determiner/noun)=count(noun,determiner)/count(determiner)=0/3=0

Repeat the same for all the tags

Note: EOS/eos is a special marker which represents End Of Sentence.
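
The hand calculations above can be reproduced with a short script. The sketch below computes the counts and
probabilities exactly as they are defined in this experiment (no smoothing), using the toy corpus given above.

Code (illustrative):-

from collections import defaultdict

corpus = """EOS/eos They/pronoun cut/verb the/determiner paper/noun
EOS/eos He/pronoun asked/verb for/preposition his/pronoun cut/noun
EOS/eos Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun
EOS/eos"""

pairs = [tok.rsplit("/", 1) for tok in corpus.split()]
words = [w.lower() for w, t in pairs]
tags = [t for w, t in pairs]

word_tag = defaultdict(int)     # count(word, tag)
word_count = defaultdict(int)   # count(word)
tag_pair = defaultdict(int)     # count(tag1, tag2) for consecutive tags
tag_count = defaultdict(int)    # count(tag)

for w, t in zip(words, tags):
    word_tag[(w, t)] += 1
    word_count[w] += 1
    tag_count[t] += 1
for t1, t2 in zip(tags, tags[1:]):
    tag_pair[(t1, t2)] += 1

# Emission probability as defined above: P(cut/verb) = count(cut, verb) / count(cut)
print(round(word_tag[("cut", "verb")] / word_count["cut"], 2))                 # 0.33
# Transition probability as defined above:
# P(determiner/verb) = count(verb, determiner) / count(determiner)
print(round(tag_pair[("verb", "determiner")] / tag_count["determiner"], 2))    # 0.67 (i.e. 2/3)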

Objective - The objective of the experiment is to calculate emission and transition matrix which will be
helpful for tagging Parts of Speech using Hidden Markov Model.

Procedure :-

STEP1: Select the corpus.

STEP2: For the given corpus fill the emission and transition matrix. Answers are rounded to 2 decimal
digits.

STEP3: Press Check to check your answer.

Wrong answers are indicated by the red cell.

EXPERIMENT NO 13
PERFORM AN EXPERIMENT TO KNOW THE IMPORTANCE OF CONTEXT
AND SIZE OF TRAINING CORPUS IN LEARNING PARTS OF SPEECH

Introduction-
Building POS Tagger

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or
word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a
particular part of speech, based on both its definition, as well as its context, i.e., its relationship with adjacent and
related words in a phrase, sentence, or paragraph. A simplified form of this is identification of words as nouns,
verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of
computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech,
in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups: rule-
based and stochastic.

Theory:-

Hidden Markov Model

In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to disambiguate
parts of speech. HMMs involve counting cases, and making a table of the probabilities of certain sequences.
For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an
adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more
likely to be a noun than a verb or a modal. The same method can of course be used to benefit from
knowledge about the following words.

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger
sequences. So, for example, if you've just seen an article and a verb, the next item may be very likely a
preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate
every combination and to assign a relative probability to each one, by multiplying together the probabilities
of each choice in turn.

It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language
parsing, that merely assigning the most common tag to each known word and the tag "proper noun" to all
unknowns, will approach 90% accuracy because many words are unambiguous.

HMMs underlie the functioning of stochastic taggers and are used in various algorithms. Accuracies for one
such algorithm (TnT) on various training data are shown here.

Conditional Random Field

Conditional random fields (CRFs) are a class of statistical modelling methods often applied in machine
learning for structured prediction. Whereas an ordinary classifier predicts a label for a
single sample without regard to "neighboring" samples, a CRF can take context into account. Since it can
consider context, a CRF can be used in Natural Language Processing, and hence Parts of Speech
tagging is also possible. It predicts the POS using the lexicons as the context.

If only one neighbour is considered as a context, then it is called bigram. Similarly, two neighbours as the
context is called trigram. In this experiment, size of training corpus and context were varied to know their
importance.
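
As a rough illustration of how the size of the training corpus affects tagging accuracy, the sketch below trains NLTK's
HMM tagger on increasingly large portions of the Penn Treebank sample and evaluates each model on a held-out set. The
corpus sizes and slices are arbitrary choices, and the first run may require nltk.download('treebank'); this is not the
exact setup used by the virtual-lab experiment.

Code (illustrative):-

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

sents = list(treebank.tagged_sents())
test = sents[3000:3300]                      # held-out evaluation set

for size in (500, 1000, 2000, 3000):
    train = sents[:size]
    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)
    acc = tagger.accuracy(test)              # on older NLTK versions use tagger.evaluate(test)
    print("training sentences:", size, " accuracy:", round(acc, 3))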

Objective - The objective of the experiment is to know the importance of context and size of training
corpus in learning Parts of Speech

Procedure :-

STEP1: Select the language.

OUTPUT: Drop down to select size of corpus, algorithm and features will appear.

STEP2: Select corpus size.

STEP3: Select algorithm "CRF" or "HMM".

STEP4:

Select feature "bigram" or "trigram".

OUTPUT: Corresponding accuracy will be shown.

EXPERIMENT NO 14
PERFORM AN EXPERIMENT TO UNDERSTAND THE CONCEPT OF
CHUNKING AND GET FAMILIAR WITH THE BASIC CHUNK TAGSET.

Introduction : - Chunking
Chunking of text involves dividing a text into groups of syntactically correlated words. For example, the
sentence 'He ate an apple.' can be divided as follows:

Each chunk has an open boundary and close boundary that delimit the word groups as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes.

Theory : -
Chunking of text involves dividing a text into groups of syntactically correlated words.

Eg: He ate an apple to satiate his hunger.

[NP He] [VP ate] [NP an apple] [VP to satiate] [NP his hunger]

Eg: दरवाज़ा खुल गया

[NP दरवाज़ा] [VP खुल गया]

Chunk Types

The chunk types are based on the syntactic category of the chunk head. Besides the head, a chunk also
contains modifiers (like determiners, adjectives, postpositions in NPs).

The basic types of chunks in English are:


Chunk Type Tag Name
1. Noun NP
2. Verb VP
3. Adverb ADVP
4. Adjectivial ADJP
5. Prepositional PP

The basic Chunk Tag Set for Indian Languages

Sl. No Chunk Type Tag Name


1 Noun Chunk NP
2.1 Finite Verb Chunk VGF
2.2 Non-finite Verb Chunk VGNF
2.3 Verb Chunk (Gerund) VGNN
3 Adjectival Chunk JJP
4 Adverb Chunk RBP

NP Noun Chunks

Noun chunks will be given the tag NP and include non-recursive noun phrases and postpositions for Indian
languages and prepositions for English. Determiners, adjectives and other modifiers will be part of the noun
chunk.

Eg:

(इस/DEM किताब/NN में/PSP)NP
'this' 'book' 'in'

((in/IN the/DT big/ADJ room/NN))NP

Verb Chunks

The verb chunks are marked as VP for English; however, they would be of several types for Indian
languages. A verb group will include the main verb and its auxiliaries, if any.

For English:

I (will/MD be/VB loved/VBD)VP

The types of verb chunks and their tags are described below.

1. VGF Finite Verb Chunk

The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any verb group
which is finite will be tagged as VGF. For example,

Eg: मैंने घर पर (खाया/VM)VGF


'I erg''home' 'at''meal' 'ate'

2. VGNF Non-finite Verb Chunk

A non-finite verb chunk will be tagged as VGNF.

Eg: सेब (खाता/VM हुआ/VAUX)VGNF लड़का जा रहा है
'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'is'

3. VGNN Gerunds

A verb chunk having a gerund will be annotated as VGNN.

Eg: शराब (पीना/VM)VGNN सेहत के लिए हानिकारक है
'liquor' 'drinking' 'health' 'for' 'harmful' 'is'

JJP/ADJP Adjectival Chunk

An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This chunk will
consist of all adjectival chunks including the predicative adjectives.

Eg:

वह लड़की है (सुन्दर/JJ)JJP

The fruit is (ripe/JJ)ADJP

Note: Adjectives appearing before a noun will be grouped together within the noun chunk.

RBP/ADVP Adverb Chunk

This chunk will include all pure adverbial phrases.

Eg:

वह (धीरे-धीरे/RB)RBP चल रहा था
'he' 'slowly' 'walk' 'PROG' 'was'

He walks (slowly/ADV)/ADVP

PP Prepositional Chunk

This chunk type is present for only English and not for Indian languages. It consists of only the
preposition and not the NP argument.

Eg:

(with/IN)PP a pen

IOB prefixes

Each chunk has an open boundary and a close boundary that delimit the word groups as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes: B-CHUNK for the first word of
the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format:

Tokens POS Chunk-Tags

He PRP B-NP
ate VBD B-VP
an DT B-NP
apple NN I-NP
to TO B-VP
satiate VB I-VP
his PRP$ B-NP
hunger NN I-NP
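
As a rough illustration of chunking and the IOB format with NLTK, the sketch below chunks the example sentence using a
small hand-written regular-expression grammar and prints IOB tags. The grammar marks only NP chunks (pronouns, and
determiner/possessive + adjectives + noun groups), so the verb chunks of the table above come out as O;
nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') may be needed on a first run.

Code (illustrative):-

import nltk
from nltk.chunk import RegexpParser, tree2conlltags

sentence = "He ate an apple to satiate his hunger"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # determiner/possessive, adjectives and nouns
      {<PRP>}                     # pronouns
"""
chunker = RegexpParser(grammar)
tree = chunker.parse(tagged)

for token, pos, chunk_tag in tree2conlltags(tree):
    print(token, pos, chunk_tag)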

Objective : - The objective of this experiment is to understand the concept of chunking and get familiar
with the basic chunk tagset.

Procedure : -

STEP1: Select a language

STEP2: Select a sentence

STEP3: Select the corresponding chunk-tag for each word in the sentence and click the

Submit button.

OUTPUT1: The submitted answer will be checked.

Click on the Get Answer button for the correct answer.

EXPERIMENT NO 15
THE OBJECTIVE OF THIS EXPERIMENT IS TO FIND POS TAGS OF
WORDS IN A SENTENCE USING VITERBI DECODING.

Introduction- POS Tagging - Viterbi Decoding


In this experiment the transition and emission matrices will be used to find the POS tag sequence for a
given sentence. When we have an emission and a transition matrix, various algorithms can be applied to find
the POS tags for the words. Some of the possible algorithms are: the backward algorithm, the forward algorithm
and the Viterbi algorithm. Here, in this experiment, you can get familiar with Viterbi decoding.
Theory - Viterbi decoding is based on dynamic programming. This algorithm takes the emission and
transition matrices as input. The emission matrix gives us information about the probability of a POS tag for
a given word, and the transition matrix gives the probability of transition from one POS tag to another POS
tag. It observes the sequence of words and returns the state sequence of POS tags along with its probability.

Here "s" denotes words and "t" denotes tags. "a" is the transition matrix and "b" is the emission matrix.

Using the above algorithm, we have to fill the Viterbi table column by column.
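
A minimal Python sketch of the algorithm is given below. The transition matrix "a" and emission matrix "b" are small
made-up examples (they are not the matrices shown by the experiment); the Viterbi table is filled column by column and
the best tag sequence is recovered by backtracking.

Code (illustrative):-

import numpy as np

tags = ["noun", "verb"]
words = ["they", "cut", "paper"]            # the observation sequence

# a[i][j] = P(tag_j | previous state i); row 0 is the start state
a = np.array([[0.6, 0.4],                   # from start
              [0.3, 0.7],                   # from noun
              [0.8, 0.2]])                  # from verb

# b[i][k] = P(word_k | tag_i)
b = np.array([[0.4, 0.3, 0.3],              # noun
              [0.1, 0.8, 0.1]])             # verb

n_tags, n_obs = len(tags), len(words)
viterbi = np.zeros((n_tags, n_obs))
backptr = np.zeros((n_tags, n_obs), dtype=int)

# Initialisation: start-state transition times the emission of the first word.
viterbi[:, 0] = a[0, :] * b[:, 0]

# Recursion: fill the table column by column.
for t in range(1, n_obs):
    for j in range(n_tags):
        scores = viterbi[:, t - 1] * a[1:, j] * b[j, t]
        backptr[j, t] = int(np.argmax(scores))
        viterbi[j, t] = np.max(scores)

# Backtracking: recover the most probable tag sequence.
best = int(np.argmax(viterbi[:, -1]))
path = [best]
for t in range(n_obs - 1, 0, -1):
    path.append(backptr[path[-1], t])
path.reverse()

print([tags[i] for i in path], float(np.max(viterbi[:, -1])))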

Objective :- The objective of this experiment is to find POS tags of words in a sentence using Viterbi
decoding.

Procedure : -

STEP1:Select the corpus.

OUTPUT: The emission and transition matrices will appear.

STEP2: Fill the column with the probability of the possible POS tags given the word (i.e. form the Viterbi
matrix by filling a column for each observation). Submitted answers are rounded off to 3 digits after the decimal
and are then checked.

STEP3: Check the column.

Wrong answers are indicated by a red background in the cell.

If the answers are right, then go to STEP 2.

STEP4: Repeat steps 2 and 3 until all words of the sentence are covered.

STEP5: Finally, check the POS tag for each word obtained from backtracking.
