
UNIT – 2 MORPHOLOGY AND PART-OF-SPEECH PROCESSING

2.1 Lexicons
Lexicons in natural language processing (NLP) are structured collections of words and phrases in which each entry is associated with linguistic information such as its part of speech, sentiment, and other grammatical properties. They serve as a resource for interpreting human language by providing information about word meanings and usage, and they often record semantic relationships between words such as synonymy, antonymy, and hypernymy. Lexicons can be built by hand or generated automatically using machine learning, and they may also be multilingual.

Importance of Lexicons
Lexicons are an important resource in natural language processing for several reasons. They provide information about the meanings and grammatical characteristics of words, which is necessary for accurate interpretation and analysis of human language.

They are used in many well-known and essential NLP applications, such as sentiment analysis, machine translation, and spam detection. Without lexicons, the accuracy of these applications would be at risk.

They capture patterns and relationships between words, information that is essential for tasks such as text classification and information retrieval.

They reduce the computational resources required for certain tasks, which significantly improves the efficiency of applications: by restricting attention to entries recorded in the lexicon, a system can reduce the number of possible interpretations and focus on those most relevant to the topic at hand. The importance of lexicons can hardly be overstated, as they are among the basic resources that underpin many major developments in natural language processing.

Types of Lexicons
 Sentiment Lexicons: These lexicons contain words or phrases annotated with sentiment labels
(positive, negative, neutral). They are used in sentiment analysis tasks to classify the sentiment of
texts.
 WordNet: WordNet is a lexical database that groups words into sets of synonyms (synsets),
provides short definitions, and records the semantic relationships between these sets. It's widely
used in tasks involving semantic analysis and word sense disambiguation.
 Named Entity Recognition (NER) Lexicons: Lexicons used to identify and classify named
entities (such as persons, organizations, locations) in text. These lexicons often contain lists of
names, places, and other entities of interest.
 Part-of-Speech (POS) Lexicons: Lexicons that map words to their corresponding part-of-speech
tags (e.g., noun, verb, adjective) based on their syntactic properties. These are essential for tasks
like POS tagging and syntactic parsing.
 Domain-Specific Lexicons: Lexicons tailored to specific domains or industries, containing
terminology and jargon relevant to those domains. These lexicons help improve the accuracy of
NLP tasks in specialized contexts.
 Emotion Lexicons: Similar to sentiment lexicons, these contain words or phrases annotated with
emotional labels (e.g., joy, sadness, anger) and are used in emotion detection tasks.
 Slang and Informal Lexicons: Lexicons that capture informal language, slang, abbreviations,
and colloquial expressions commonly used in social media and informal communication.
 Multi-lingual Lexicons: Lexicons that provide translations or equivalents of words and phrases
across multiple languages, facilitating tasks like machine translation and cross-lingual information
retrieval.
 Phonetic Lexicons: Lexicons that include information about the pronunciation of words,
including phonetic transcriptions, stress patterns, and other phonological features. These are
useful for speech synthesis and recognition tasks.
 Acronym and Abbreviation Lexicons: Lexicons containing mappings of acronyms and
abbreviations to their expanded forms, helping in tasks involving text normalization and
disambiguation.

Applications of Lexicons in NLP

1. Sentiment Analysis:
Sentiment lexicons are used to classify text according to its emotional tone. A sentiment lexicon annotates words and phrases with positive or negative sentiment; by matching words in the lexicon against words in the text, sentiment analysis algorithms can determine the overall sentiment of the text being analyzed.
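To make this concrete, here is a minimal sketch in Python of lexicon-based sentiment scoring. The tiny lexicon and the scoring rule (summing positive and negative word values) are illustrative assumptions, not a standard resource.

# Minimal lexicon-based sentiment scoring (illustrative lexicon, not a standard resource).
sentiment_lexicon = {
    "good": 1, "great": 1, "excellent": 1, "love": 1,
    "bad": -1, "poor": -1, "terrible": -1, "hate": -1,
}

def sentiment_score(text):
    # Tokenize very simply and sum the sentiment values of words found in the lexicon.
    tokens = text.lower().split()
    score = sum(sentiment_lexicon.get(tok.strip(".,!?"), 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment_score("The movie was great, I love it!"))   # positive
print(sentiment_score("Terrible plot and poor acting."))    # negative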

2. Named Entity Recognition:


These lexicons contain lists of named entities such as people, organizations, and locations. Named entity recognition algorithms use such lexicons (often called gazetteers) to identify and categorize named entities mentioned in text.
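The sketch below shows how a simple lookup against a hypothetical, hand-made gazetteer can label entity mentions in text; the entity lists are made-up examples.

# Gazetteer-style entity lookup (the entity lists are made-up examples).
gazetteer = {
    "PERSON": {"Alan Turing", "Ada Lovelace"},
    "ORG": {"Google", "United Nations"},
    "LOC": {"Paris", "India"},
}

def tag_entities(text):
    # Report every gazetteer entry that occurs verbatim in the text, with its label.
    found = []
    for label, names in gazetteer.items():
        for name in names:
            if name in text:
                found.append((name, label))
    return found

print(tag_entities("Alan Turing visited Paris before joining the United Nations."))
# [('Alan Turing', 'PERSON'), ('United Nations', 'ORG'), ('Paris', 'LOC')]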

3. Part-of-Speech Tagging:
Lexicons can be used to assign part-of-speech tags to the words in a sentence. A part-of-speech lexicon contains lists of words and their associated parts of speech (e.g., noun, verb, adjective). Part-of-speech tagging is an important preliminary step in widely used applications such as parsing and machine translation.
4. Word Sense Disambiguation:
Lexicons can also be used to disambiguate words with multiple senses. A word-sense lexicon contains lists of words and their possible meanings, and disambiguation algorithms use it to determine the correct meaning of a word in a specific context.

5. Machine translation:
Lexicons can be used in machine translation systems to map words and phrases from one language to another. A bilingual lexicon contains pairs of words or phrases in two languages along with their corresponding translations.

6. Information retrieval:
Lexicons are used in information retrieval systems to improve the accuracy of search results. An index lexicon contains lists of words and their associated documents or web pages. By matching search queries against entries in the index, information retrieval systems can quickly retrieve relevant documents or web pages.
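As a small illustration, the following Python sketch builds a toy index of this kind (an inverted index) over made-up documents and answers a query by intersecting the document sets of the query words.

from collections import defaultdict

# Toy document collection (illustrative).
docs = {
    1: "lexicons help sentiment analysis",
    2: "part of speech tagging uses lexicons",
    3: "machine translation maps words between languages",
}

# Build the inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query):
    # Return the documents containing every query word.
    word_sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()

print(search("lexicons"))           # {1, 2}
print(search("lexicons tagging"))   # {2}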
In general, lexicons play an important role in many NLP applications by providing a rich source of linguistic
information that can be used to improve the accuracy and efficiency of text analysis and processing.

Challenges of Lexicon-based NLP

1. Lexical gaps:
One of the main challenges in lexicon-based NLP is lexical gaps, which occur when a word or phrase is not included in the lexicon. New words appear over time, so to address this limitation researchers have developed methods for expanding lexicons automatically, such as bootstrapping and active learning.

2. Polysemy and homonymy:


Another challenge for lexicon-based NLP is dealing with polysemy and homonymy, which arise when one word has multiple meanings, or when several words share the same spelling or pronunciation but differ in meaning. This can lead to ambiguity and errors in NLP systems, especially in tasks such as part-of-speech tagging and word sense disambiguation. To address this challenge, NLP researchers have developed context-sensitive disambiguation methods, such as word embeddings and deep learning models.

3. Domain specificity:
Lexicons typically cover only specific domains or languages, which limits the applicability of NLP systems to the tasks or contexts for which the lexicon was built. For example, a lexicon designed for general English may be of little use for parsing text in other languages, and a general-purpose lexicon may be ineffective for analyzing domain-specific text such as medical or legal documents. To address this, researchers build domain-specific and multilingual lexicons and develop methods for adapting lexicons across domains and languages.
Methods for Building Lexicons
There are several methods for building lexicons in natural language processing (NLP), each with its own
strengths and weaknesses. Some methods rely on manual annotation, while others use automatic
extraction techniques. Hybrid approaches that combine manual and automatic methods are also commonly
used.
1. Manual Annotation: Manual annotation involves human experts or crowdsourcing workers adding
linguistic information to a corpus of text. This information can include part-of-speech tags, word
senses, named entities, and sentiment labels. Manual annotation can be time-consuming and
expensive, but it is often necessary for creating high-quality lexicons for specialized domains or low-
resource languages.
2. Automatic Extraction: Automatic extraction methods use statistical and machine learning
techniques to extract linguistic information from large amounts of unannotated text. For example,
collocation extraction can be used to identify words that tend to co-occur with other words, which can
be a useful way to identify synonyms and related words. Word sense induction can be used to group
words with similar meanings together, even if they have different surface forms. Automatic extraction
methods can be fast and scalable, but they are also prone to errors and may require significant
manual validation.
3. Hybrid Approaches: Hybrid approaches combine manual and automatic methods to leverage the
strengths of both. For example, a lexicon may be created using automatic extraction methods, and
then manually validated and corrected by human experts. This can help to ensure the accuracy and
completeness of the lexicon while also reducing the time and cost required for manual annotation.
In recent years, there has been growing interest in using neural language models, such as BERT and
GPT, for building lexicons. These models are trained on large amounts of text and can learn to represent
the meanings of words and phrases in a dense vector space. By clustering these vectors, it is possible to
identify groups of words that have similar meanings, which can be used to create a word embedding
lexicon. Neural language models can be highly effective for building lexicons, but they also require large
amounts of training data and significant computational resources.
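A rough sketch of the clustering idea is shown below. It uses scikit-learn's KMeans over a handful of made-up word vectors; in a real setting the vectors would come from a pretrained model such as BERT or a word-embedding table, which is assumed here rather than computed.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 3-dimensional "embeddings"; real vectors would come from a pretrained model.
words = ["happy", "joyful", "glad", "car", "truck", "vehicle"]
vectors = np.array([
    [0.9, 0.1, 0.0], [0.85, 0.15, 0.05], [0.8, 0.2, 0.1],   # emotion-like vectors
    [0.1, 0.9, 0.8], [0.15, 0.85, 0.75], [0.2, 0.8, 0.9],   # vehicle-like vectors
])

# Cluster the vectors; each cluster approximates a group of related words.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
lexicon = {}
for word, label in zip(words, labels):
    lexicon.setdefault(int(label), []).append(word)
print(lexicon)   # e.g. {0: ['happy', 'joyful', 'glad'], 1: ['car', 'truck', 'vehicle']}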
2.2 POS Tagging
Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text
is labeled with its corresponding part of speech. This can include nouns, verbs, adjectives, and other
grammatical categories.
POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity recognition,
and machine translation. It can also be used to identify the grammatical structure of a sentence and to
disambiguate words that have multiple meanings.
Let’s take an example,
Text: “The cat sat on the mat.”
POS tags:
 The: determiner
 cat: noun
 sat: verb
 on: preposition
 the: determiner
 mat: noun
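The same sentence can be tagged with NLTK's off-the-shelf tagger, as in the sketch below. Note that NLTK uses Penn Treebank tag names (DT for determiner, NN for noun, VBD for past-tense verb, IN for preposition), and the tagger model must be downloaded once.

import nltk

# One-time download of the default tagger model (the package is named
# "averaged_perceptron_tagger_eng" in newer NLTK releases).
nltk.download("averaged_perceptron_tagger")

tokens = "The cat sat on the mat .".split()
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#  ('the', 'DT'), ('mat', 'NN'), ('.', '.')]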

Use of Parts of Speech Tagging in NLP


There are several reasons why we might tag words with their parts of speech (POS) in natural language
processing (NLP):
 To understand the grammatical structure of a sentence: By labeling each word with its POS,
we can better understand the syntax and structure of a sentence. This is useful for tasks such as
machine translation and information extraction, where it is important to know how words relate to
each other in the sentence.
 To disambiguate words with multiple meanings: Some words, such as “bank,” can have
multiple meanings depending on the context in which they are used. By labeling each word with its
POS, we can disambiguate these words and better understand their intended meaning.
 To improve the accuracy of NLP tasks: POS tagging can help improve the performance of
various NLP tasks, such as named entity recognition and text classification. By providing additional
context and information about the words in a text, we can build more accurate and sophisticated
algorithms.
 To facilitate research in linguistics: POS tagging can also be used to study the patterns and characteristics of language use and to gain insights into the structure and function of different parts of speech.

Steps Involved in the POS tagging


Here are the steps involved in a typical example of part-of-speech (POS) tagging in natural language
processing (NLP):
 Collect a dataset of annotated text: This dataset will be used to train and test the POS tagger.
The text should be annotated with the correct POS tags for each word.
 Preprocess the text: This may include tasks such as tokenization (splitting the text into individual
words), lowercasing, and removing punctuation.
 Divide the dataset into training and testing sets: The training set will be used to train the POS
tagger, and the testing set will be used to evaluate its performance.
 Train the POS tagger: This may involve building a statistical model, such as a hidden Markov
model (HMM), or defining a set of rules for a rule-based or transformation-based tagger. The
model or rules will be trained on the annotated text in the training set.
 Test the POS tagger: Use the trained model or rules to predict the POS tags of the words in the
testing set. Compare the predicted tags to the true tags and calculate metrics such as precision
and recall to evaluate the performance of the tagger.
 Fine-tune the POS tagger: If the performance of the tagger is not satisfactory, adjust the model or
rules and repeat the training and testing process until the desired level of accuracy is achieved.
 Use the POS tagger: Once the tagger is trained and tested, it can be used to perform POS
tagging on new, unseen text. This may involve preprocessing the text and inputting it into the
trained model or applying the rules to the text. The output will be the predicted POS tags for each
word in the text.
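The steps above can be sketched end-to-end with NLTK's annotated Penn Treebank sample. A unigram tagger with a default-tag backoff stands in for the statistical model here, which is a deliberate simplification.

import nltk
from nltk.corpus import treebank

nltk.download("treebank")

# Steps 1-3: collect annotated sentences and split them into training and testing sets.
tagged_sents = treebank.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Step 4: train a simple statistical tagger (most frequent tag per word, backing off to 'NN').
backoff = nltk.DefaultTagger("NN")
tagger = nltk.UnigramTagger(train_sents, backoff=backoff)

# Step 5: test the tagger by comparing predicted tags with the gold tags
# (this method is called .evaluate() in older NLTK versions).
print("accuracy:", tagger.accuracy(test_sents))

# Step 7: use the tagger on new, unseen text.
print(tagger.tag("The cat sat on the mat .".split()))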

Application of POS Tagging


There are several real-life applications of part-of-speech (POS) tagging in natural language processing
(NLP):
 Information extraction: POS tagging can be used to identify specific types of information in a
text, such as names, locations, and organizations. This is useful for tasks such as extracting data
from news articles or building knowledge bases for artificial intelligence systems.
 Named entity recognition: POS tagging can be used to identify and classify named entities in a
text, such as people, places, and organizations. This is useful for tasks such as building customer
profiles or identifying key figures in a news story.
 Text classification: POS tagging can be used to help classify texts into different categories, for tasks such as spam detection or sentiment analysis. By analyzing the POS tags of the words in a text,
algorithms can better understand the content and tone of the text.
 Machine translation: POS tagging can be used to help translate texts from one language to
another by identifying the grammatical structure and relationships between words in the source
language and mapping them to the target language.
 Natural language generation: POS tagging can be used to generate natural-sounding text by
selecting appropriate words and constructing grammatically correct sentences. This is useful for
tasks such as chatbots and virtual assistants.

Types of POS Tagging in NLP

Rule Based POS Tagging


Rule-based part-of-speech (POS) tagging is a method of labeling words with their corresponding parts of
speech using a set of pre-defined rules. This is in contrast to machine learning-based POS tagging, which
relies on training a model on a large annotated corpus of text.
In a rule-based POS tagging system, words are assigned POS tags based on their characteristics and the
context in which they appear. For example, a rule-based POS tagger might assign the tag “noun” to any
word that ends in “-tion” or “-ment,” as these suffixes are often used to form nouns.
Rule-based POS taggers can be relatively simple to implement and are often used as a starting point for
more complex machine learning-based taggers. However, they can be less accurate and less efficient than
machine learning-based taggers, especially for tasks with large or complex datasets.

Here is an example of how a rule-based POS tagger might work:


 Define a set of rules for assigning POS tags to words. For example:
 If the word ends in “-tion,” assign the tag “noun.”
 If the word ends in “-ment,” assign the tag “noun.”
 If the word is all uppercase, assign the tag “proper noun.”
 If the word is a verb ending in “-ing,” assign the tag “verb.”
 Iterate through the words in the text and apply the rules to each word in turn. For
example:
 “Nation” would be tagged as “noun” based on the first rule.
 “Investment” would be tagged as “noun” based on the second rule.
 “UNITED” would be tagged as “proper noun” based on the third rule.
 “Running” would be tagged as “verb” based on the fourth rule.
 Output the POS tags for each word in the text.
This is a very basic example of a rule-based POS tagger, and more complex systems can include
additional rules and logic to handle more varied and nuanced text.
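A minimal Python implementation of the example rules listed above might look as follows; it is only an illustration of the rule-based idea, not a complete tagger.

import re

def rule_based_tag(word):
    # Apply the example rules in order; fall back to "unknown" if none matches.
    if word.lower().endswith("tion") or word.lower().endswith("ment"):
        return "noun"
    if word.isupper():
        return "proper noun"
    if re.match(r".+ing$", word.lower()):
        return "verb"
    return "unknown"

for w in ["Nation", "Investment", "UNITED", "Running", "quickly"]:
    print(w, "->", rule_based_tag(w))
# Nation -> noun, Investment -> noun, UNITED -> proper noun,
# Running -> verb, quickly -> unknown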

Statistical POS Tagging


Statistical part-of-speech (POS) tagging is a method of labeling words with their corresponding parts of
speech using statistical techniques. This is in contrast to rule-based POS tagging, which relies on pre-
defined rules, and to unsupervised learning-based POS tagging, which does not use any annotated
training data.
In statistical POS tagging, a model is trained on a large annotated corpus of text to learn the patterns and
characteristics of different parts of speech. The model uses this training data to predict the POS tag of a
given word based on the context in which it appears and the probability of different POS tags occurring in
that context.
Statistical POS taggers can be more accurate and efficient than rule-based taggers, especially for tasks
with large or complex datasets. However, they require a large amount of annotated training data and can
be computationally intensive to train.
Here is an example of how a statistical POS tagger might work:
 Collect a large annotated corpus of text and divide it into training and testing sets.
 Train a statistical model on the training data, using techniques such as maximum
likelihood estimation or hidden Markov models.
 Use the trained model to predict the POS tags of the words in the testing data.
 Evaluate the performance of the model by comparing the predicted tags to the true tags
in the testing data and calculating metrics such as precision and recall.
 Fine-tune the model and repeat the process until the desired level of accuracy is
achieved.
 Use the trained model to perform POS tagging on new, unseen text.
There are various statistical techniques that can be used for POS tagging, and the choice of technique will
depend on the specific characteristics of the dataset and the desired level of accuracy.
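As a small illustration of the statistical idea, the sketch below estimates P(tag | word) by maximum likelihood from NLTK's Penn Treebank sample and tags each word with its most probable tag. Because it ignores context, it is a deliberately simplified model.

import nltk
from collections import Counter, defaultdict
from nltk.corpus import treebank

nltk.download("treebank")

# Count how often each word occurs with each tag in the annotated corpus.
counts = defaultdict(Counter)
for sent in treebank.tagged_sents():
    for word, tag in sent:
        counts[word.lower()][tag] += 1

def most_likely_tag(word):
    # Maximum likelihood estimate: argmax over tags of count(word, tag) / count(word).
    tag_counts = counts.get(word.lower())
    return tag_counts.most_common(1)[0][0] if tag_counts else "NN"

print([(w, most_likely_tag(w)) for w in "The cat sat on the mat .".split()])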

Transformation-based tagging (TBT)


Transformation-based tagging (TBT) is a method of part-of-speech (POS) tagging that uses a series of
rules to transform the tags of words in a text. This is in contrast to rule-based POS tagging, which assigns
tags to words based on pre-defined rules, and to statistical POS tagging, which relies on a trained model to
predict tags based on probability.
In TBT, a set of rules is defined to transform the tags of words in a text based on the context in which they
appear. For example, a rule might change the tag of a verb to a noun if it appears after a determiner such
as “the.” The rules are applied to the text in a specific order, and the tags are updated after each
transformation.
TBT can be more accurate than rule-based tagging, especially for tasks with complex grammatical
structures. However, it can be more computationally intensive and requires a larger set of rules to achieve
good performance.
Here is an example of how a TBT system might work:
 Define a set of rules for transforming the tags of words in the text. For example:
 If a word is tagged as a verb but the previous word is a determiner, change the tag to “noun.”
 If a word is tagged as a noun but the previous word is “to,” change the tag to “verb.”
 Iterate through the words in the text and apply the rules in a specific order. For example:
 In the sentence “The race was close,” the word “race,” if initially tagged as a verb, would be
changed to a noun based on the first rule.
 In the sentence “They like to race,” the word “race,” if initially tagged as a noun, would be
changed to a verb based on the second rule.
 Output the transformed tags for each word in the text.
This is a very basic example of a TBT system, and more complex systems can include additional rules and
logic to handle more varied and nuanced text.
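The sketch below applies two hand-written transformation rules of this kind to an initially tagged sentence. Real transformation-based systems such as the Brill tagger learn their rules automatically from a corpus, so this is only an illustration of how transformations update tags.

# Each rule: (from_tag, to_tag, condition on the previous (word, tag) pair).
rules = [
    ("verb", "noun", lambda prev: prev is not None and prev[1] == "determiner"),
    ("noun", "verb", lambda prev: prev is not None and prev[0].lower() == "to"),
]

def transform(tagged):
    # Apply every rule to every word, updating tags after each pass.
    tagged = list(tagged)
    for from_tag, to_tag, condition in rules:
        for i, (word, tag) in enumerate(tagged):
            prev = tagged[i - 1] if i > 0 else None
            if tag == from_tag and condition(prev):
                tagged[i] = (word, to_tag)
    return tagged

# "race" is deliberately mis-tagged by the initial tagger in both sentences.
print(transform([("The", "determiner"), ("race", "verb"), ("was", "verb"), ("close", "adjective")]))
print(transform([("They", "pronoun"), ("like", "verb"), ("to", "particle"), ("race", "noun")]))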

Hidden Markov Model POS tagging


Hidden Markov models (HMMs) are a type of statistical model that can be used for part-of-speech (POS)
tagging in natural language processing (NLP). In an HMM-based POS tagger, a model is trained on a large
annotated corpus of text to learn the patterns and characteristics of different parts of speech. The model
uses this training data to predict the POS tag of a given word based on the probability of different tags
occurring in the context of the word.
An HMM-based POS tagger consists of a set of states, each corresponding to a possible POS tag, and a
set of transitions between the states. The model is trained on the training data to learn the probabilities of
transitioning from one state to another and the probabilities of observing different words given a particular
state.
To perform POS tagging on a new text using an HMM-based tagger, the model uses the probabilities
learned during training to compute the most likely sequence of POS tags for the words in the text. This is
typically done using the Viterbi algorithm, which calculates the probability of each possible sequence of
tags and selects the most likely one.
HMMs are widely used for POS tagging and other tasks in NLP due to their ability to model complex
sequential data and their efficiency in computation. However, they can be sensitive to the quality of the
training data and may require a large amount of annotated data to achieve good performance.
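The Viterbi decoding step can be sketched in a few lines of Python. The transition and emission probabilities below are made-up numbers for a two-tag toy model, not values learned from a real corpus.

# Toy HMM with two tags; all probabilities are illustrative, not learned from data.
tags = ["DET", "NOUN"]
start_p = {"DET": 0.6, "NOUN": 0.4}
trans_p = {"DET": {"DET": 0.1, "NOUN": 0.9}, "NOUN": {"DET": 0.4, "NOUN": 0.6}}
emit_p = {"DET": {"the": 0.9, "cat": 0.0, "mat": 0.0},
          "NOUN": {"the": 0.05, "cat": 0.5, "mat": 0.45}}

def viterbi(words):
    # V[i][tag] = (best probability of any tag sequence ending in `tag` at word i, backpointer).
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        V.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: V[i - 1][p][0] * trans_p[p][t])
            prob = V[i - 1][best_prev][0] * trans_p[best_prev][t] * emit_p[t].get(words[i], 0.0)
            V[i][t] = (prob, best_prev)
    # Follow backpointers from the most probable final tag to recover the best sequence.
    last = max(tags, key=lambda t: V[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, V[i][path[0]][1])
    return path

print(viterbi(["the", "cat"]))   # ['DET', 'NOUN']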

Challenges in POS Tagging


Some common challenges in part-of-speech (POS) tagging include:
 Ambiguity: Some words can have multiple POS tags depending on the context in which they
appear, making it difficult to determine their correct tag. For example, the word “bass” can be a
noun (a type of fish) or an adjective (having a low frequency or pitch).
 Out-of-vocabulary (OOV) words: Words that are not present in the training data of a POS tagger
can be difficult to tag accurately, especially if they are rare or specific to a particular domain.
 Complex grammatical structures: Languages with complex grammatical structures, such as
languages with many inflections or free word order, can be more challenging to tag accurately.
 Lack of annotated training data: Some languages or domains may have limited annotated
training data, making it difficult to train a high-performing POS tagger.
 Inconsistencies in annotated data: Annotated data can sometimes contain errors or
inconsistencies, which can negatively impact the performance of a POS tagger.

2.3 Word sense disambiguation


 Word sense disambiguation (WSD), in natural language processing (NLP), may be defined as the ability
to determine which meaning of a word is activated by its use in a particular context.
 Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system
faces. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word's syntactic
ambiguity.
 On the other hand, the problem of resolving semantic ambiguity is called WSD (word sense
disambiguation). Resolving semantic ambiguity is harder than resolving syntactic ambiguity.

For example, consider the two examples of the distinct sense that exist for the word “bass” −

I can hear bass sound.

He likes to eat grilled bass.

The occurrences of the word bass clearly denote distinct meanings: in the first sentence it refers to a low-frequency sound, and in the second it refers to a fish. Hence, if the sentences are disambiguated by WSD, the correct meanings can be assigned as follows −

I can hear bass/frequency sound.

He likes to eat grilled bass/fish.

Evaluation of WSD

The evaluation of WSD requires the following two inputs −

A Dictionary

The very first input for the evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.

Test Corpus

Another input required for WSD evaluation is a sense-annotated test corpus that contains the target or correct senses. Test corpora can be of two types:

Lexical sample − this kind of corpus is used when the system is required to disambiguate only a small sample of target words.

All-words − this kind of corpus is used when the system is expected to disambiguate all the words in a piece of running text.

Approaches and Methods to Word Sense Disambiguation (WSD)

Approaches and methods to WSD are classified according to the source of knowledge used in
word disambiguation.

Methods in WSD
1. Dictionary-based or Knowledge-based Methods

As the name suggests, these methods rely primarily on dictionaries, thesauri, and lexical knowledge bases for disambiguation; they do not use corpus evidence. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is to “measure overlap between sense definitions for all words in context”. However, in 2000, Kilgarriff and Rosenzweig gave the simplified Lesk definition as “measure overlap between sense definitions of word and current context”, which means identifying the correct sense for one word at a time. Here the current context is the set of words in the surrounding sentence or paragraph.
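NLTK ships an implementation of the simplified Lesk algorithm, applied below to the two “bass” sentences used earlier (the WordNet data must be downloaded first). The returned sense depends on overlap with the WordNet glosses, so it may not always match the intuitive reading.

import nltk
from nltk.wsd import lesk

nltk.download("wordnet")

# Disambiguate "bass" in two different contexts using definition overlap.
sent1 = "I can hear bass sound".split()
sent2 = "He likes to eat grilled bass".split()

sense1 = lesk(sent1, "bass")
sense2 = lesk(sent2, "bass")
print(sense1, "-", sense1.definition())
print(sense2, "-", sense2.definition())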

2. Supervised Methods

For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context alone can provide enough evidence to disambiguate the sense, so explicit world knowledge and reasoning are deemed unnecessary. The context is represented as a set of “features” of the word, including information about the surrounding words. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which is very expensive to create.

3. Semi-supervised Methods

Because of the lack of training corpora, many word sense disambiguation algorithms use semi-supervised learning methods, which make use of both labelled and unlabelled data. These methods require only a very small amount of annotated text together with a large amount of plain unannotated text. A common technique used by semi-supervised methods is bootstrapping from seed data.

4. Unsupervised Methods

These methods assume that similar senses occur in similar contexts, so senses can be induced from text by clustering word occurrences using some measure of contextual similarity. This task is called word sense induction or discrimination. Unsupervised methods have great potential to overcome the knowledge-acquisition bottleneck because they do not depend on manual effort.

Applications of Word Sense Disambiguation (WSD)

Word sense disambiguation (WSD) is applied in almost every application of language technology.

1. Machine Translation

Machine translation (MT) is the most obvious application of WSD. In MT, WSD performs lexical choice for words that have distinct translations for different senses; the senses in MT are represented as words in the target language. However, most machine translation systems do not use an explicit WSD module.

2. Information Retrieval (IR)

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return answers to their questions. WSD can be used to resolve ambiguities in the queries provided to an IR system. As with MT, however, current IR systems do not explicitly use a WSD module; they rely on the user typing enough context in the query to retrieve only relevant documents.

3. Text Mining and Information Extraction (IE)

In most applications, WSD is necessary for accurate analysis of text. For example, WSD helps an intelligence-gathering system flag the correct words: a medical intelligence system might need to flag occurrences of “illegal drugs” rather than “medical drugs”.

4. Lexicography

WSD and lexicography can work together in a loop, because modern lexicography is corpus-based. WSD provides lexicography with rough empirical sense groupings as well as statistically significant contextual indicators of sense.

Difficulties in Word Sense Disambiguation (WSD)

Following are some difficulties faced by word sense disambiguation (WSD) −

1. Differences between dictionaries

The major problem of WSD is to decide the sense of the word because different senses can be very
closely related. Even different dictionaries and thesauruses can provide different divisions of words
into senses.

2. Different algorithms for different applications

Another problem of WSD is that a completely different algorithm might be needed for different applications. For example, in machine translation it takes the form of target-word selection, whereas in information retrieval a sense inventory is not required.

3. Inter-judge variance

Another problem of WSD is that WSD systems are generally tested by comparing their results on a task against the judgments of human beings. This is called the problem of inter-judge variance.
4. Word-sense discreteness

Another difficulty in WSD is that words cannot be easily divided into discrete sub-meanings.
