UNIT-2 NLP
2.1 Lexicons
Lexicons in natural language processing (NLP) are collections of words or phrases annotated with linguistic features, such as part of speech or sentiment polarity. They serve as a resource for interpreting human language by providing information about the meanings, uses, and grammatical properties of words, and they often encode semantic relationships between words such as synonymy, antonymy, and hypernymy. Lexicons can be built by hand or generated automatically using machine learning, and they can also be multilingual in nature.
Importance of Lexicons
Lexicons are an important resource in natural language processing for several reasons.
They provide information about the meanings and grammatical characteristics of words, which is necessary for accurate interpretation and analysis of human language.
They underpin many well-known and essential NLP applications, such as sentiment analysis, machine translation, and the detection of spam or fraudulent messages; without lexicons, the quality of these applications would suffer.
They help identify patterns and relationships between words, information that is needed for tasks such as text classification and information retrieval.
They reduce the computational resources required for certain tasks, which significantly improves the efficiency of applications; for example, by restricting attention to the senses recorded in the lexicon, a system can reduce the number of possible interpretations and focus on those most relevant to the topic being processed.
In short, the importance of lexicons can hardly be overstated, as they are among the basic resources supporting many major developments in natural language processing.
Types of Lexicons
Sentiment Lexicons: These lexicons contain words or phrases annotated with sentiment labels
(positive, negative, neutral). They are used in sentiment analysis tasks to classify the sentiment of
texts.
WordNet: WordNet is a lexical database that groups words into sets of synonyms (synsets),
provides short definitions, and records the semantic relationships between these sets. It's widely
used in tasks involving semantic analysis and word sense disambiguation.
Named Entity Recognition (NER) Lexicons: Lexicons used to identify and classify named
entities (such as persons, organizations, locations) in text. These lexicons often contain lists of
names, places, and other entities of interest.
Part-of-Speech (POS) Lexicons: Lexicons that map words to their corresponding part-of-speech
tags (e.g., noun, verb, adjective) based on their syntactic properties. These are essential for tasks
like POS tagging and syntactic parsing.
Domain-Specific Lexicons: Lexicons tailored to specific domains or industries, containing
terminology and jargon relevant to those domains. These lexicons help improve the accuracy of
NLP tasks in specialized contexts.
Emotion Lexicons: Similar to sentiment lexicons, these contain words or phrases annotated with
emotional labels (e.g., joy, sadness, anger) and are used in emotion detection tasks.
Slang and Informal Lexicons: Lexicons that capture informal language, slang, abbreviations,
and colloquial expressions commonly used in social media and informal communication.
Multi-lingual Lexicons: Lexicons that provide translations or equivalents of words and phrases
across multiple languages, facilitating tasks like machine translation and cross-lingual information
retrieval.
Phonetic Lexicons: Lexicons that include information about the pronunciation of words,
including phonetic transcriptions, stress patterns, and other phonological features. These are
useful for speech synthesis and recognition tasks.
Acronym and Abbreviation Lexicons: Lexicons containing mappings of acronyms and
abbreviations to their expanded forms, helping in tasks involving text normalization and
disambiguation.
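To make the WordNet entry in the list above concrete, the following minimal Python sketch looks up the synsets of a word and prints their definitions and synonym lemmas. It assumes the NLTK package is installed and that the WordNet data can be downloaded.

# Minimal sketch of querying WordNet with NLTK
# (assumes `pip install nltk`; downloads the WordNet data if missing).
import nltk
nltk.download("wordnet", quiet=True)

from nltk.corpus import wordnet as wn

# List the synsets (sense groups) for the word "bank" with their definitions.
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", [lemma.name() for lemma in synset.lemmas()])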
Applications of Lexicons
1. Sentiment Analysis:
Lexicons are used to classify text by its sentiment or emotional tone. A sentiment lexicon lists words and phrases together with the positive or negative feelings associated with them. By matching words in the lexicon against words in the text, sentiment analysis algorithms can determine the overall sentiment of the text (a minimal sketch is given after this list of applications).
2. Part-of-Speech Tagging:
Lexicons can be used to assign part-of-speech tags to the words in a sentence. A part-of-speech lexicon contains lists of words and their associated parts of speech (e.g., noun, verb, adjective). Part-of-speech tagging is an important step in widely used applications such as parsing and machine translation.
3. Word Sense Disambiguation:
Lexicons can also be used to disambiguate words that have multiple senses. A word-sense lexicon contains lists of words and their possible meanings, and disambiguation algorithms use it to determine the correct sense of a word in a specific context.
4. Machine Translation:
Lexicons can be used in machine translation systems to map words and phrases from one language to another. A bilingual lexicon contains pairs of words or phrases in two languages and their corresponding translations.
5. Information Retrieval:
Lexicons are used in information retrieval systems to improve the accuracy of search results. An index lexicon contains lists of words and their associated documents or web pages. By matching search queries against entries in the index, information retrieval systems can quickly retrieve relevant documents or web pages.
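To illustrate the lexicon-based sentiment analysis described in item 1 above, here is a minimal Python sketch. The tiny positive and negative word sets are hypothetical placeholders standing in for a real sentiment lexicon with thousands of entries.

# Minimal sketch of lexicon-based sentiment scoring.
# The word sets below are hypothetical placeholders for a real sentiment lexicon.
POSITIVE = {"good", "great", "excellent", "happy", "love"}
NEGATIVE = {"bad", "terrible", "awful", "sad", "hate"}

def sentiment(text: str) -> str:
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The movie was great and the acting was excellent"))  # positive
print(sentiment("The service was terrible"))                          # negative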
In general, lexicons play an important role in many NLP applications by providing a rich source of linguistic
information that can be used to improve the accuracy and efficiency of text analysis and processing.
Challenges in Lexicon-Based NLP
1. Lexical gaps:
One of the main challenges in lexicon-based NLP is lexical gaps, which occur when a word or phrase is not included in the lexicon. New words appear over time, so researchers have developed methods for expanding lexicons automatically, such as bootstrapping and active learning.
2. Domain specificity:
Lexicons often cover only specific domains or languages, which limits the applicability of NLP systems to the tasks or contexts for which the lexicon was built. For example, a lexicon designed for English may be of little use for analyzing text in other languages, and a general-purpose lexicon may be ineffective for analyzing the text of a specialized field, such as medical or legal documents. To work around this, developers and researchers build domain-specific and multilingual lexicons suited to the material being analyzed.
Methods for Building Lexicons
There are several methods for building lexicons in natural language processing (NLP), each with its own
strengths and weaknesses. Some methods rely on manual annotation, while others use automatic
extraction techniques. Hybrid approaches that combine manual and automatic methods are also commonly
used.
1. Manual Annotation: Manual annotation involves human experts or crowdsourcing workers adding
linguistic information to a corpus of text. This information can include part-of-speech tags, word
senses, named entities, and sentiment labels. Manual annotation can be time-consuming and
expensive, but it is often necessary for creating high-quality lexicons for specialized domains or low-
resource languages.
2. Automatic Extraction: Automatic extraction methods use statistical and machine learning
techniques to extract linguistic information from large amounts of unannotated text. For example,
collocation extraction can be used to identify words that tend to co-occur with other words, which can
be a useful way to identify synonyms and related words. Word sense induction can be used to group
words with similar meanings together, even if they have different surface forms. Automatic extraction
methods can be fast and scalable, but they are also prone to errors and may require significant
manual validation.
3. Hybrid Approaches: Hybrid approaches combine manual and automatic methods to leverage the
strengths of both. For example, a lexicon may be created using automatic extraction methods, and
then manually validated and corrected by human experts. This can help to ensure the accuracy and
completeness of the lexicon while also reducing the time and cost required for manual annotation.
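As a small illustration of the automatic extraction approach (method 2 above), the sketch below uses NLTK's collocation finder to pull frequently co-occurring word pairs out of raw tokens, which could then be reviewed as candidate multi-word lexicon entries. The toy token list is only a placeholder for a large unannotated corpus.

# Minimal sketch of automatic lexicon building via collocation extraction
# (assumes `pip install nltk`; the toy token list stands in for real text).
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("natural language processing builds lexicons from text "
          "natural language processing uses machine learning on text").split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only pairs seen at least twice

# Rank candidate multi-word entries by pointwise mutual information (PMI).
print(finder.nbest(measures.pmi, 5))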
In recent years, there has been growing interest in using neural language models, such as BERT and
GPT, for building lexicons. These models are trained on large amounts of text and can learn to represent
the meanings of words and phrases in a dense vector space. By clustering these vectors, it is possible to
identify groups of words that have similar meanings, which can be used to create a word embedding
lexicon. Neural language models can be highly effective for building lexicons, but they also require large
amounts of training data and significant computational resources.
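The following sketch illustrates the clustering idea just described: it loads pre-trained word vectors, clusters a handful of words with k-means, and prints the resulting groups, each of which could seed a lexicon entry. It assumes the gensim and scikit-learn packages and gensim's downloadable "glove-wiki-gigaword-50" vectors; any other embedding model could be substituted.

# Minimal sketch of building a lexicon by clustering word embeddings
# (assumes `pip install gensim scikit-learn`; the vectors are downloaded
# by gensim on first use).
import gensim.downloader as api
from sklearn.cluster import KMeans

vectors = api.load("glove-wiki-gigaword-50")

words = ["happy", "joyful", "glad", "angry", "furious", "sad",
         "doctor", "nurse", "hospital"]
X = [vectors[w] for w in words]

# Group words with similar vectors; each cluster can seed a lexicon entry.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for cluster in range(3):
    print(cluster, [w for w, label in zip(words, labels) if label == cluster])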
2.2 POS Tagging
Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text
is labeled with its corresponding part of speech. This can include nouns, verbs, adjectives, and other
grammatical categories.
POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity recognition,
and machine translation. It can also be used to identify the grammatical structure of a sentence and to
disambiguate words that have multiple meanings.
Let’s take an example,
Text: “The cat sat on the mat.”
POS tags:
The: determiner
cat: noun
sat: verb
on: preposition
the: determiner
mat: noun
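A minimal Python sketch of the same example, assuming the NLTK package and its tagger data are available; NLTK returns Penn Treebank tags (DT, NN, VBD, IN), which correspond to the determiner, noun, verb, and preposition labels listed above.

# Minimal POS-tagging sketch with NLTK (assumes `pip install nltk` plus the
# punkt tokenizer and averaged_perceptron_tagger data).
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The cat sat on the mat.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#       ('the', 'DT'), ('mat', 'NN'), ('.', '.')]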
2.3 Word Sense Disambiguation (WSD)
Word sense disambiguation (WSD) is the task of identifying which sense of a word is being used in a given context when the word has multiple meanings. For example, consider two sentences that use the word "bass" in distinct senses:
I can hear the bass sound.
He likes to eat grilled bass.
The occurrence of the word "bass" clearly denotes a distinct meaning in each case: in the first sentence it refers to frequency (low-pitched sound), and in the second it refers to the fish. A WSD system would therefore assign the frequency sense to the first occurrence and the fish sense to the second.
Evaluation of WSD
A Dictionary
The very first input for the evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.
Test Corpus
Another input required by WSD is a sense-annotated test corpus that contains the target or correct senses.
The test corpora can be of two types −
Lexical sample − This kind of corpus is used in systems where only a small sample of target words needs to be disambiguated.
All-words − This kind of corpus is used in systems that are expected to disambiguate all the words in a piece of running text.
Approaches and methods to WSD are classified according to the source of knowledge used in
word disambiguation.
Methods in WSD
1. Dictionary-based or Knowledge-based Methods
As the name suggests, these methods rely primarily on dictionaries, thesauri, and lexical knowledge bases for disambiguation; they do not use corpus evidence. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is to "measure overlap between sense definitions for all words in context". However, in 2000, Kilgarriff and Rosensweig gave the simplified Lesk definition as "measure overlap between sense definitions of word and current context", which means identifying the correct sense for one word at a time; here the current context is the set of words in the surrounding sentence or paragraph. (A minimal Lesk sketch is given after this list of methods.)
2. Supervised Methods
For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context alone can provide enough evidence to disambiguate the sense, so world knowledge and reasoning are deemed unnecessary. The context is represented as a set of "features" of the words, which includes information about the surrounding words as well. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which is very expensive to create.
3. Semi-supervised Methods
Because of the lack of training corpora, most word sense disambiguation algorithms use semi-supervised learning methods, which make use of both labelled and unlabelled data. These methods require only a very small amount of annotated text together with a large amount of plain unannotated text. The technique typically used by semi-supervised methods is bootstrapping from seed data.
4. Unsupervised Methods
These methods assume that similar senses occur in similar contexts, so senses can be induced from text by clustering word occurrences using some measure of contextual similarity. This task is called word sense induction or discrimination. Unsupervised methods have great potential to overcome the knowledge-acquisition bottleneck because they do not depend on manual effort.
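As referenced under the dictionary-based methods above, here is a minimal sketch of the simplified Lesk algorithm using NLTK's built-in implementation over WordNet; it disambiguates "bass" in two contexts like those in the earlier example. It assumes the NLTK package and the WordNet data are available.

# Minimal sketch of simplified Lesk WSD with NLTK
# (assumes `pip install nltk` and that the WordNet corpus is downloaded).
import nltk
nltk.download("wordnet", quiet=True)

from nltk.wsd import lesk

sent1 = "I can hear the bass sound of the speakers".split()
sent2 = "He likes to eat grilled bass with lemon".split()

# lesk() picks the WordNet sense whose definition overlaps most with the context.
for sent in (sent1, sent2):
    sense = lesk(sent, "bass")
    print(sense, "-", sense.definition() if sense else "no sense found")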
Applications of WSD
Word sense disambiguation (WSD) is applied in almost every application of language technology.
1. Machine Translation
Machine translation (MT) is the most obvious application of WSD. In MT, the lexical choice for words that have distinct translations for different senses is made by WSD; the senses in MT are represented as words in the target language. However, most machine translation systems do not use an explicit WSD module.
2. Information Retrieval (IR)
Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return the answers to their questions. WSD is used to resolve the ambiguities of the queries given to an IR system. As with MT, current IR systems do not explicitly use a WSD module; they rely on the assumption that the user will type enough context in the query to retrieve only relevant documents.
3. Text Mining and Information Extraction
In most of these applications, WSD is necessary for accurate analysis of text. For example, WSD helps an intelligent gathering system flag the correct words: a medical intelligence system might need to flag occurrences of "illegal drugs" rather than "medical drugs".
4. Lexicography
WSD and lexicography can work together in a loop, because modern lexicography is corpus-based. For lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.
Difficulties in WSD
1. Differences between dictionaries
The major problem of WSD is deciding on the sense of a word, because different senses can be very closely related, and even different dictionaries and thesauri can provide different divisions of words into senses.
2. Different algorithms for different applications
Another problem of WSD is that a completely different algorithm might be needed for each application. For example, in machine translation WSD takes the form of target-word selection, while in information retrieval a sense inventory is not required.
3. Inter-judge variance
Another problem is that WSD systems are generally tested by comparing their results on a task against those of human judges, and human judges themselves often disagree about the correct sense. This is called the problem of inter-judge variance.
4. Word-sense discreteness
Another difficulty in WSD is that words cannot easily be divided into discrete sub-meanings.