NLP-Unit-I - Notes
Master of Information Technology (University of Mumbai)

UNIT – I
Introduction to NLP, brief history, NLP applications: Speech to Text (STT), Text to Speech (TTS), Story Understanding, NL Generation, QA systems, Machine Translation, Text Summarization, Text Classification, Sentiment Analysis, Grammar/Spell Checkers etc., challenges/open problems, NLP abstraction levels, Natural Language (NL) characteristics and NL computing approaches/techniques and steps, NL tasks: Segmentation, Chunking, Tagging, NER, Parsing, Word Sense Disambiguation, NL Generation, Web 2.0 applications: Sentiment Analysis, Text Entailment, Cross Lingual Information Retrieval (CLIR).

Introduction to NLP:

NLP stands for Natural Language Processing, a field at the intersection of Computer Science, human language, and Artificial Intelligence. It is the technology used by machines to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

Natural Language Processing is a powerful branch of Artificial Intelligence that enables computers to understand, interpret, and generate human-readable text that is meaningful. NLP is a method for processing and analysing text data. In Natural Language Processing the text is first tokenized, meaning it is broken into tokens, which can be words, phrases, or characters; this is the first step in an NLP task. The text is cleaned and preprocessed before NLP techniques are applied.


Natural Language Processing techniques are used in machine translation, healthcare, finance, customer service, sentiment analysis, and extracting valuable information from text data. NLP is also used in text generation and language modelling, and it can be used for question answering. Many companies use NLP techniques to solve their text-related problems. Tools such as ChatGPT and Google Bard, which are trained on large corpora of text data, use NLP techniques to answer user queries.

Brief History:
(1940-1960) - Focused on Machine Translation (MT)
Work on Natural Language Processing started in the 1940s.
1948 - In 1948, the first recognisable NLP application was introduced at Birkbeck College, London.
1950s - In the 1950s, there were conflicting views between linguistics and computer science. Chomsky published his first book, Syntactic Structures, and claimed that language is generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based descriptions of syntactic structures.
(1960-1980) - Flavored with Artificial Intelligence (AI)
In the year 1960 to 1980, the key developments were:
Augmented Transition Networks (ATN)
An Augmented Transition Network is a graph-based parsing formalism that extends a finite state machine with recursion and registers, making it powerful enough to analyse the structure of natural-language sentences.
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar examines how languages such as English express the relationship between nouns and verbs, for example through prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example, case grammar identifies Neha as the agent, the mirror as the theme, and the hammer as the instrument.
In the year 1960 to 1980, key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It let users communicate with the computer and move objects in a simulated world. It could handle instructions such as "pick up the green ball" and answer questions like "What is inside the black box?". The main importance of SHRDLU is that it showed that syntax, semantics, and reasoning about the world can be combined to produce a system that understands natural language.
LUNAR
LUNAR is a classic example of a natural language database interface system; it used ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language expressions into database queries and handled 78% of requests without errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language processing.
In the early 1990s, NLP started growing faster and achieved good accuracy, especially for English grammar. In the 1990s, electronic text corpora were also introduced, which provided good resources for training and evaluating natural language programs.


Other factors included the availability of computers with fast CPUs and more memory. The major factor behind the advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine translation, and machine reading of text. When all these applications are combined, they allow artificial intelligence to gain knowledge of the world. Consider the example of Amazon Alexa: you can ask Alexa a question, and it will reply to you.

NLP applications:
There are the following applications of NLP -
1. Question Answering
2. Spam Detection.
3. Sentiment Analysis
4. Machine Translation
5. Spelling correction
6. Speech Recognition
7. Chatbot
8. Information extraction
9. Natural Language Understanding (NLU)

Speech to Text(STT):
STT applications convert spoken language into text format, enabling users to dictate text instead of
typing. Examples include voice-activated assistants like Siri, Google Assistant, and Amazon Alexa, as
well as dictation software for transcription purposes.

STT systems use various techniques such as acoustic modeling, language modeling, and speech
recognition algorithms to accurately transcribe spoken words into written text.

Transcription is a complex process that involves multiple stages and AI models working together.
Here's an overview of key steps in speech-to-text:

• Pre-processing: Before the input audio can be transcribed, it often undergoes some pre-
processing steps. This can include noise reduction, echo cancellation, and other techniques to
enhance the quality of the audio signal.
• Feature extraction: The audio waveform is then converted into a more suitable representation
for analysis. This usually involves extracting features from the audio signal that capture
important characteristics of the sound, such as frequency, amplitude, and duration. Mel-
frequency cepstral coefficients (MFCCs) are commonly used features in speech processing.
• Acoustic modelling: Involves training a statistical model that maps the extracted features to
phonemes, the smallest units of sound in a language.
• Language modelling: Language modeling focuses on the linguistic aspect of speech. It involves
creating a probabilistic model of how words and phrases are likely to appear in a particular
language. This helps the system make informed decisions about which words are more likely to
occur, given the previous words in the sentence.


• Decoding: In the decoding phase, the system uses the acoustic and language models to
transcribe the audio into a sequence of words or tokens. This process involves searching for the
most likely sequence of words that correspond to the given audio features.
• Post-processing: The decoded transcription may still contain errors, such as misrecognitions or
homophones (words that sound the same but have different meanings). Post-processing
techniques, including language constraints, grammar rules, and contextual analysis, are applied
to improve the accuracy and coherence of the transcription before producing the final output.
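A minimal speech-to-text sketch is shown below, assuming the open-source Python SpeechRecognition library (not prescribed by these notes) and a placeholder audio file name; it mirrors the pipeline above by loading pre-processed audio and delegating feature extraction, acoustic/language modelling, and decoding to a recognition engine.

#Example (illustrative sketch):
import speech_recognition as sr   # pip install SpeechRecognition

recognizer = sr.Recognizer()

# Load the (already pre-processed) audio file and capture it as audio data
with sr.AudioFile("meeting.wav") as source:          # placeholder file name
    recognizer.adjust_for_ambient_noise(source)      # simple noise handling
    audio = recognizer.record(source)

# Decoding is delegated to Google's free web recogniser in this sketch;
# offline engines (e.g. recognize_sphinx) can be substituted.
try:
    print("Transcription:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible")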

Text to Speech(TTS):
TTS applications convert written text into spoken language, allowing computers to generate
human-like speech. Examples include screen readers for visually impaired users, virtual
assistants, and audiobook narration.

TTS systems utilize techniques such as text analysis, prosody modeling, and speech synthesis
to produce natural-sounding speech from written text.

OpenAI's Audio API, for example, provides a speech endpoint based on its TTS (text-to-speech) model. It comes with 6 built-in voices and can be used to:

• Narrate a written blog post


• Produce spoken audio in multiple languages
• Give real time audio output using streaming

The speech endpoint takes in three key inputs: the model, the text that should be turned into audio, and the voice to be used for the audio generation.

By default, the endpoint outputs an MP3 file of the spoken audio, but it can also be configured to output any of the other supported formats.

Audio quality

For real-time applications, the standard tts-1 model provides the lowest latency but at a lower quality than the tts-1-hd model. Due to the way the audio is generated, tts-1 is more likely to produce audible static in certain situations than tts-1-hd. In some cases, the difference may not be noticeable, depending on the listening device and the individual listener.

Voice options

Experiment with different voices (alloy, echo, fable, onyx, nova, and shimmer) to find one that
matches your desired tone and audience.

Supported output formats

The default response format is "mp3", but other formats like "opus", "aac", "flac", and "pcm"
are available.

• Opus: For internet streaming and communication, low latency.


• AAC: For digital audio compression, preferred by YouTube, Android, iOS.
• FLAC: For lossless audio compression, favored by audio enthusiasts for archiving.


• WAV: Uncompressed WAV audio, suitable for low-latency applications to avoid decoding
overhead.
• PCM: Similar to WAV but containing the raw samples in 24 kHz (16-bit signed, little-endian), without the header.

There is no direct mechanism to control the emotional tone of the generated audio. Certain factors, such as capitalization or grammar, may influence the output audio, but internal tests with these have yielded mixed results.
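The description above follows OpenAI's speech endpoint; a minimal sketch using the official openai Python package might look like the following. The package usage, model and voice names, and output handling are assumptions based on that API and may differ between SDK versions.

#Example (illustrative sketch):
from openai import OpenAI          # pip install openai; requires OPENAI_API_KEY to be set

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",                 # lowest-latency model; "tts-1-hd" for higher quality
    voice="alloy",                 # one of: alloy, echo, fable, onyx, nova, shimmer
    input="Natural language processing lets computers read and write text.",
    response_format="mp3",         # or "opus", "aac", "flac", "wav", "pcm"
)

# Save the returned audio bytes to disk (the exact helper may vary by SDK version)
with open("speech.mp3", "wb") as f:
    f.write(response.read())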

Story Understanding:
Story understanding involves analyzing and comprehending the structure, themes, and meaning
of written narratives.

NLP techniques such as semantic analysis, entity recognition, and topic modeling can be used
to extract key information and infer relationships between characters, events, and plot
elements in a story.

A large body of work in story understanding has focused on learning scripts.
Scripts represent structured knowledge about stereotypical event sequences together with their participants.
‘A narrative or story is anything which is told in the form of a causally (logically) linked set of events involving some shared characters’.
In some formulations, instead of predicting a single event, the system is tasked with choosing an entire sentence to complete the given story.
Application: Story Wizard

NL Generation:
NL Generation involves generating human-like text based on input data or prompts.
Applications include chatbots, language translation, and text generation for creative writing
or content generation purposes.

NL Generation techniques include language modeling, sequence-to-sequence models, and neural


text generation algorithms.

Natural Language Generation (NLG) is the process of generating descriptions or narratives in


natural language from structured data.
NLG often works closely with Natural Language Understanding (NLU).
NLG, along with NLU, is at the core of chatbots and voice assistants.
In the computer science domain, NLG has been used to write specifications from UML diagrams or to describe source code changes.
Application: Tableau

QA system:
QA (Question Answering) systems provide answers to user queries based on a given context
or knowledge base.


Examples include virtual assistants, search engines, and customer support chatbots.

QA systems use techniques such as information retrieval, natural language understanding, and
machine learning to analyze questions and retrieve relevant answers from structured or
unstructured data sources.

It is used to answer questions in the form of natural language and has a wide range of
applications.
Typical applications include: intelligent voice interaction, online customer service, knowledge
acquisition, personalized emotional chatting, and more.
There are two types of QA systems: open and closed.
A system that tries to answer any question you could possibly ask is called an open (or open-domain) system – think of Google or Alexa.
Closed (or closed-domain) systems are built for a particular subject, function, domain of knowledge, or company, e.g. PayPal, Finn AI, Digital Genius.

Machine Translation:
Machine translation systems automatically translate text from one language to another.
Examples include Google Translate, Microsoft Translator, and language translation features in
mobile devices and web browsers.

Machine translation techniques include statistical machine translation, neural machine


translation, and rule-based translation approaches.

Machine Translation (MT) is the task of automatically converting one natural language into
another, preserving the meaning of the input text, and producing fluent text in the output
language.
There are many challenging aspects of MT:
the large variety of languages, alphabets and grammars;
translating a sequence (a sentence, for example) into another sequence is harder for a computer than working only with numbers;
there is no single correct answer.
Application: Google Translate
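As a hedged illustration of neural machine translation, the sketch below uses the Hugging Face transformers library with a publicly available Helsinki-NLP English-to-German checkpoint; neither the library nor the model is prescribed by these notes.

#Example (illustrative sketch):
from transformers import pipeline   # pip install transformers sentencepiece

# Load a pre-trained English-to-German translation model (downloaded on first use)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation preserves the meaning of the input text.")
print(result[0]["translation_text"])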

Text Summarization:
Text summarization involves condensing large amounts of text into shorter, more concise
summaries while preserving key information and main ideas. Applications include summarization
of news articles, research papers, and legal documents.

Text summarization techniques include extractive methods (selecting important sentences or


phrases) and abstractive methods (generating summaries using natural language generation
techniques).

It is a process of generating a concise and meaningful summary of text from multiple text
resources such as books, news articles, blog posts, research papers, emails, and tweets.


Text summarization can broadly be divided into two categories: Extractive Summarization and Abstractive Summarization.
Extractive Summarization: These methods rely on extracting several parts, such as phrases
and sentences, from a piece of text and stack them together to create a summary. E.g. Text-
Processing, Skyttle 2.0, Textuality.
Abstractive Summarization: These methods use advanced NLP techniques to generate an
entirely new summary. Some parts of this summary may not even appear in the original text.
E.g. Digital Tesseract
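A small abstractive-summarization sketch, again assuming the Hugging Face transformers library (its default summarization checkpoint is downloaded automatically); the article text is purely illustrative.

#Example (illustrative sketch):
from transformers import pipeline   # pip install transformers

summarizer = pipeline("summarization")   # uses the library's default seq2seq model

article = ("Text summarization condenses large amounts of text into shorter summaries "
           "while preserving key information. Extractive methods select important "
           "sentences from the source, while abstractive methods generate entirely "
           "new sentences that may not appear in the original text.")

summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])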

Text classification:
Text classification involves categorizing text documents into predefined classes or categories
based on their content or topic. Examples include spam detection, sentiment analysis, and topic
classification in news articles.

Text classification techniques include machine learning algorithms such as Naive Bayes,
Support Vector Machines (SVM), and deep learning models like Convolutional Neural Networks
(CNN) and Recurrent Neural Networks (RNN).

Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups.
By using NLP, text classifiers can automatically analyse text and then assign a set of pre-
defined tags or categories based on its content.
Unstructured text is everywhere, such as emails, chat conversations, websites, and social
media but it’s hard to extract value from this data unless it’s organized in a certain way.
Text classifiers with NLP have proven to be a great alternative to structure textual data in a
fast, cost-effective, and scalable way.
Application: Email services / medical report / news
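The sketch below shows the Naive Bayes approach mentioned above using scikit-learn (an assumed toolkit, not prescribed by these notes); the tiny spam/ham dataset is purely illustrative.

#Example (illustrative sketch):
from sklearn.feature_extraction.text import CountVectorizer   # pip install scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny labelled training set (illustrative only)
train_texts = ["win a free prize now", "limited offer click here",
               "meeting agenda for monday", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features + Naive Bayes classifier in one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["click here to claim your free prize"]))   # expected: ['spam']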

Sentiment Analysis:
Sentiment analysis involves analyzing text data to determine the sentiment or emotion
expressed within it. Applications include social media monitoring, customer feedback analysis,
and brand sentiment analysis.

Sentiment analysis techniques include lexicon-based methods, machine learning classifiers, and
deep learning models for emotion detection and sentiment classification.

Sentiment analysis (or opinion mining) is an NLP technique used to determine whether data is positive, negative or neutral.
Sentiment analysis is often performed on textual data to help businesses monitor brand and
product sentiment in customer feedback, and understand customer needs.
Sentiment analysis models focus on polarity (positive, negative, neutral) and also on feelings
and emotions (angry, happy, sad, etc), urgency (urgent, not urgent) and even intentions
(interested v. not interested).
Application: pros and cons of a product – useful in e-commerce
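As a hedged example of the lexicon-based methods mentioned above, the sketch below uses NLTK's VADER sentiment analyzer (the same NLTK toolkit used later in these notes); the review text is illustrative.

#Example (illustrative sketch):
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")       # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()
review = "The battery life is great, but the camera quality is disappointing."

# Returns negative/neutral/positive proportions and an overall compound score
print(analyzer.polarity_scores(review))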


Grammar/Spell Checkers:
Grammar and spell checkers automatically detect and correct errors in written text, such as
grammatical mistakes, spelling errors, and punctuation errors. Examples include built-in spell
checkers in word processing software, grammar checking tools, and browser extensions.

Grammar and spell checkers use language rules, dictionaries, and statistical models to identify
and suggest corrections for errors in text.

A word needs to be checked for spelling correctness and corrected if necessary, often in the context of the surrounding words.
Spell Check is a process of detecting and sometimes providing suggestions for incorrectly
spelled words in a text.
A basic spell checker carries out the following processes:
It scans the text and extracts the words contained in it.
It then compares each word with a known list of correctly spelled words (i.e. a dictionary).
Application: Word / PowerPoint / Google Search autocorrect
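A minimal sketch of the basic dictionary-lookup spell checker described above, using only the Python standard library; the small word list stands in for a full dictionary.

#Example (illustrative sketch):
import difflib
import re

# A toy dictionary of correctly spelled words (a real checker would load a full word list)
dictionary = {"natural", "language", "processing", "makes", "machines", "understand", "text"}

def spell_check(sentence):
    """Return {misspelled_word: suggestions} for words not found in the dictionary."""
    words = re.findall(r"[a-z]+", sentence.lower())    # scan the text and extract the words
    report = {}
    for word in words:
        if word not in dictionary:                     # compare with the known word list
            report[word] = difflib.get_close_matches(word, dictionary, n=2)
    return report

print(spell_check("Natural langauge procesing makes machines understand text"))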

Challenges/Open Problems:
Natural Language Processing (NLP) faces various challenges due to the complexity and
diversity of human language. Let’s discuss 10 major challenges in NLP:

1. Language differences

Human language is rich and intricate. Thousands of languages are spoken around the world, each with its own grammar, vocabulary, and cultural nuances. No single person can understand all of them, and the productivity of human language is high. Natural language is also ambiguous, since the same words and phrases can have different meanings in different contexts. This is one of the major challenges in understanding natural language.

Natural languages also have complex syntactic structures and grammatical rules, covering word order, verb conjugation, tense, aspect, and agreement. Human language carries rich semantic content that allows speakers to convey a wide range of meanings through words and sentences. Language is also pragmatic, meaning that how it is used in context serves communication goals. Finally, human language evolves over time through processes such as lexical change, and this change reflects cultural, social, and historical factors.

2.Training Data

Training data is a curated collection of input-output pairs, where the input represents the
features or attributes of the data, and the output is the corresponding label or target.
Training data is composed of both the features (inputs) and their corresponding labels
(outputs). For NLP, features might include text data, and labels could be categories,
sentiments, or any other relevant annotations.


It helps the model generalize patterns from the training set to make predictions or
classifications on new, previously unseen data.

3. Development Time and Resource Requirements

Development time and resource requirements for Natural Language Processing (NLP) projects depend on various factors, including the complexity of the task, the size and quality of the data, the availability of existing tools and libraries, and the expertise of the team involved. Here are some key points:

• Complexity of the task: Tasks such as text classification or sentiment analysis may require less time than more complex tasks such as machine translation or question answering.
• Availability and quality of data: NLP models require high-quality annotated data. Collecting, annotating, and preprocessing large text datasets can be time-consuming and resource-intensive, especially for tasks that require specialized domain knowledge or fine-grained annotations.
• Selection of algorithms and development of models: It can be difficult to choose the machine learning algorithms best suited to a given NLP task.
• Training and evaluation: Training requires powerful computational resources (GPUs or TPUs) and time for iterating over the algorithms. It is also important to evaluate the model's performance with suitable metrics and validation techniques to confirm the quality of the results.

4. Navigating Phrasing Ambiguities in NLP

Navigating phrasing ambiguities is a crucial aspect of NLP because of the inherent complexity of human languages. A phrasing ambiguity arises when a phrase can be interpreted in multiple ways, which leads to uncertainty about its meaning. Here are some key points for navigating phrasing ambiguities in NLP:

• Contextual understanding: Contextual information such as previous sentences, topic focus, or conversational cues can give valuable clues for resolving ambiguities.
• Semantic analysis: The semantic content of the text is analyzed to find meaning based on words, lexical relationships, and semantic roles. Tools such as word sense disambiguation and semantic role labeling can help resolve phrasing ambiguities.
• Syntactic analysis: The syntactic structure of the sentence is analyzed to find the possible interpretations based on grammatical relationships and syntactic patterns.
• Pragmatic analysis: Pragmatic factors such as the speaker's intentions and implicatures are used to infer the meaning of a phrase. This analysis involves understanding the pragmatic context.
• Statistical methods: Statistical methods and machine learning models are used to learn patterns from data and make predictions about the intended interpretation of an input phrase.


5. Misspellings and Grammatical Errors

Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these are forms of linguistic noise that can affect the accuracy of understanding and analysis. Here are some key points for handling misspellings and grammatical errors in NLP:

• Spell checking: Implement spell-check algorithms and dictionaries to find and correct misspelled words.
• Text normalization: The text is normalized by converting it into a standard format, which may involve converting text to lowercase, removing punctuation and special characters, and expanding contractions.
• Tokenization: The text is split into individual tokens with the help of tokenization techniques. This makes it easier to identify and isolate misspelled words and grammatical errors so that they can be corrected.
• Language models: Language models trained on large corpora of data can predict how likely a word or phrase is to be correct based on its context.

6. Mitigating Innate Biases in NLP Algorithms

Mitigating innate biases in NLP algorithms is a crucial step towards ensuring fairness, equity, and inclusivity in natural language processing applications. Here are some key points for mitigating biases in NLP algorithms:

• Data collection and annotation: It is very important to ensure that the training data used to develop NLP algorithms is diverse, representative, and free from biases.
• Bias analysis and detection: Apply bias detection and analysis methods to the training data to find biases based on demographic factors such as race, gender, or age.
• Data preprocessing: Preprocess the training data to mitigate biases, for example by debiasing word embeddings, balancing class distributions, and augmenting underrepresented samples.
• Fair representation learning: NLP models are trained to learn fair representations that are invariant to protected attributes such as race or gender.
• Auditing and evaluation of models: NLP models are evaluated for fairness and bias with the help of metrics and audits; they are evaluated on diverse datasets, and post-hoc analyses are performed to find and mitigate innate biases.

7. Words with Multiple Meanings

Words with multiple meanings pose a lexical challenge in Natural Language Processing because of the ambiguity they introduce. Such words, known as polysemous or homonymous, have different meanings depending on the context in which they are used. Here are some key points for addressing the lexical challenge posed by words with multiple meanings in NLP:

• Semantic analysis: Apply semantic analysis techniques to find the underlying meaning of a word in its various contexts. Semantic representations such as word embeddings or semantic networks can capture the similarity and relatedness between different word senses.
• Domain-specific knowledge: Domain knowledge provides valuable context and constraints for determining the correct sense of a word in a given NLP task.
• Multi-word expressions (MWEs): The meaning of the entire sentence or phrase is analyzed to disambiguate a word with multiple meanings.
• Knowledge graphs and ontologies: Apply knowledge graphs and ontologies to find the semantic relationships between different word senses.

8. Addressing Multilingualism

It is very important to address language diversity and multilingualism in Natural Language Processing to ensure that NLP systems can handle text data in multiple languages effectively. Here are some key points for addressing language diversity and multilingualism:

• Multilingual corpora: Multilingual corpora consist of text data in multiple languages and serve as valuable resources for training NLP models and systems.
• Cross-lingual transfer learning: These techniques transfer knowledge learned from one language to another.
• Language identification: Build language identification models to automatically detect the language of a given text.
• Machine translation: Machine translation enables communication and access to information across language barriers and can be used as a preprocessing step for multilingual NLP tasks.

9. Reducing Uncertainty and False Positives in NLP

Reducing uncertainty and false positives in Natural Language Processing (NLP) is a crucial task for improving the accuracy and reliability of NLP models. Here are some key points:

• Probabilistic models: Use probabilistic models to quantify the uncertainty in predictions. Models such as Bayesian networks give probabilistic estimates of outputs, allowing uncertainty quantification and better decision making.
• Confidence scores: Confidence scores or probability estimates are calculated for NLP predictions to assess how certain the model is of its output. They help identify cases where the model is uncertain or likely to produce false positives.
• Threshold tuning: For classification tasks, decision thresholds are adjusted to balance sensitivity (recall) against specificity. Setting appropriate thresholds can reduce false positives.
• Ensemble methods: Apply ensemble learning techniques that combine multiple models to reduce uncertainty.


10. Facilitating Continuous Conversations with NLP

Facilitating continuous conversations with NLP involves developing systems that understand and respond to human language in real time, enabling seamless interaction between users and machines. Implementing real-time NLP pipelines gives the system the capability to analyze and interpret user input as it is received; algorithms and systems are optimized for low-latency processing to ensure quick responses to user queries and inputs.

It also requires building NLP models that can maintain context throughout a conversation. Understanding context enables systems to interpret user intent, track the conversation history, and generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are applied to find the underlying goals and intentions expressed by users in their messages.

How to overcome NLP Challenges

Overcoming the challenges of NLP requires a combination of innovative technologies, domain experts, and methodological approaches. Here are some key points for overcoming the challenges of NLP tasks:

• Quantity and quality of data: High-quality, diverse data is needed to train NLP algorithms effectively. Data augmentation, data synthesis, and crowdsourcing are techniques for addressing data scarcity.
• Ambiguity: NLP algorithms should be trained to disambiguate words and phrases.
• Out-of-vocabulary words: Techniques such as tokenization, character-level modeling, and vocabulary expansion are implemented to handle out-of-vocabulary words.
• Lack of annotated data: Techniques such as transfer learning and pre-training can be used to transfer knowledge from large datasets to specific tasks with limited labeled data.

NLP abstraction levels:


In natural language processing (NLP), abstraction levels refer to different levels of
representation and analysis of language data, each capturing different aspects of linguistic
information. These abstraction levels help in understanding and processing language at various
granularities. Here are the common abstraction levels in NLP:

• Tokenization and sentence segmentation


• Lexical analysis
• Syntactic Analysis
• Semantic Analysis
• Pragmatic Analysis
Tokenization and sentence segmentation: Natural language text is generally not made up of short, neat, well-formed, and well-delimited sentences, so we need to separate the individual tokens. The "minimal unit of meaning" is referred to as the morpheme. For example, in "happiness" the stem "happy" is considered a free morpheme, since it is a word in its own right, while bound morphemes (prefixes and suffixes such as "-ness") require a free morpheme to which they can be attached and therefore cannot appear as words on their own.

Lexical analysis: The lexical analysis in NLP deals with the study at the level of words with
respect to their lexical (word) meaning and part-of-speech. This level of linguistic processing
utilizes a language’s lexicon (dictionary), which is a collection of individual lexemes. A lexeme is a basic unit of lexical meaning: an abstract unit of morphological analysis that represents the set of forms or “senses” taken by a single morpheme.
“duck”, for example, can take the form of a noun (swimming bird) or a verb (bow down) but its
part-of-speech and lexical meaning can only be derived in context with other words used in
the phrase/sentence. This, in fact, is an early step towards a more sophisticated Information
Retrieval system where precision is improved through part-of-speech tagging.

Syntactic Analysis: Syntactic Analysis also referred to as “parsing”, allows the extraction of
phrases which convey more meaning than the individual words by themselves. In Information Retrieval, parsing can be used to improve indexing, since phrases can serve as representations of documents that provide better information than single-word indices. In the same way, phrases that are syntactically derived from the query offer better search keys to match with documents that are similarly parsed.
Nevertheless, syntax can still be ambiguous at times as in the case of the news headline:
“Boy paralyzed after tumour fights back to gain black belt” — which actually refers to how a
boy was paralyzed because of a tumour but endured the fight against the disease and
ultimately gained a high level of competence in martial arts.

Semantic Analysis: understanding the meaning of syntactically correct tokens or statements.


The semantic level of linguistic processing deals with the determination of what a sentence
really means by relating syntactic features and disambiguating words with multiple definitions
to the given context. This level entails the appropriate interpretation of the meaning of
sentences, rather than the analysis at the level of individual words or phrases.

Pragmatic Analysis: The pragmatic(Realistic / Logical) level of linguistic processing deals with
the use of real-world knowledge and understanding of how this impacts the meaning of what is
being communicated. By analyzing the contextual dimension of the documents and queries, a
more detailed representation is derived.
In Information Retrieval, this level of Natural Language Processing primarily engages query
processing and understanding by integrating the user’s history and goals as well as the context
upon which the query is being made. Contexts may include time and location.
This level of analysis enables major breakthroughs in Information Retrieval as it facilitates
the conversation between the information retrieval system and the users, allowing the
induction of the purpose upon which the information being sought is planned to be used,
thereby ensuring that the information retrieval system is fit for purpose.


Natural Language (NL) Characteristics:


Natural Language (NL) has several characteristics that distinguish it from other forms of
communication and make it challenging yet fascinating to work with in the field of Natural
Language Processing (NLP). Some key characteristics of natural language:

• Language is Arbitrary
• Language is Social
• Language is Symbolic
• Language is Systematic
• Language is Vocal
• Language is Non-instinctive(instinct – natural / unlearn)
• Language is Productive and Creative

Language is Arbitrary: Language is arbitrary in the sense that there is no inherent relation
between the words of a language and their meanings or the ideas conveyed by them.

Language is Social: Language is a set of conventional communicative signals used by humans


for communication in a community.

Language is Symbolic: Language consists of various sound symbols and their graphological
counterparts that are employed to denote some objects, occurrences or meaning.

Language is Systematic: Although language is symbolic, yet its symbols are arranged in a
particular system. All languages have their system of arrangements. Furthermore, all
languages have phonological and syntactic systems and within a system, there are also several
sub-systems.

Language is Vocal: Language is primarily made up of vocal sounds produced by the physiological articulatory mechanism of the human body. Writing is only the graphic representation of the sounds of the language.

Language is Non-instinctive(instinct – natural / unlearn): No language was created in a day


out of a mutually agreed upon formula by a group of humans. Language is the outcome of
evolution and convention. Each generation transmits this convention on to the next.

Language is Productive and Creative: Language has creativity and productivity. The structural elements of human language can be combined to produce new utterances, which neither the speaker nor the hearer may ever have made or heard before, yet which both sides understand without difficulty.

NL computing approaches/techniques and steps:


1. Text Pre-processing(Tokenization):
Not all languages deliver text in the form of words neatly delimited by spaces.
It is of crucial importance to ensure that, given a text, we can break it into sentence-sized
pieces.


Tokenization involves breaking down text into smaller units, such as words, subwords, or
characters, for further processing.
Techniques include whitespace tokenization, word tokenization, and subword tokenization
(e.g., Byte-Pair Encoding, WordPiece).
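A short tokenization sketch using NLTK (the toolkit used in the chunking example later in these notes); the punkt resource provides the pre-trained sentence tokenizer.

#Example (illustrative sketch):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")               # one-time download of the tokenizer models

text = "NLP breaks text into sentences. Each sentence is then split into tokens."

print(sent_tokenize(text))           # sentence segmentation
print(word_tokenize(text))           # word tokenization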

2. Lexical Analysis:
Lexical analysis decomposes words into their parts and maintains rules for how combinations are formed. Morphological processing can go some way toward handling unrecognized words. Morphological processes alter stems to derive new words; they may change the word's meaning (derivational) or its grammatical function (inflectional).

3. Syntactic Parsing:
The basic unit of meaning analysis is the sentence: a sentence expresses a proposition, an idea,
or a thought, and says something about some real or imaginary world. In NLP approaches based
on generative linguistics, this is generally taken to involve the determining of the syntactic or
grammatical structure of each sentence.
Syntactic parsing analyzes the grammatical structure of sentences to identify phrases and
dependencies between words. Techniques include constituency parsing (e.g., Context-Free
Grammars), dependency parsing (e.g., Transition-Based Parsing, Graph-Based Parsing), and
neural network-based parsers (e.g., Recursive Neural Networks, Transformer models).
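A hedged dependency-parsing sketch using spaCy (mentioned later in the NER section); it assumes the small English pipeline en_core_web_sm has been downloaded.

#Example (illustrative sketch):
import spacy   # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("The boy gained a black belt after a long fight.")

# Print each word with its grammatical relation (dependency label) and its head word
for token in doc:
    print(f"{token.text:10} {token.dep_:10} head={token.head.text}")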

4. Semantic Analysis:
It is these subsequent steps that derive a meaning for the sentence in question. Semantic
analysis focuses on extracting meaning from text, including tasks such as semantic role
labeling, semantic parsing, and word sense disambiguation.
Techniques include statistical models, graph-based approaches, and neural network-based
methods (e.g., Transformers).

5. Sentiment Analysis
This is the dissection of data (text, voice, etc) in order to determine whether it’s positive,
neutral, or negative. It tags each statement with ‘sentiment’ then aggregates the sum of all
the statements in a given dataset.

Sentiment analysis can transform large archives of customer feedback, reviews, or social
media reactions into actionable, quantified results. These results can then be analyzed for
customer insight and further strategic results.


Sentiment analysis determines the sentiment expressed in text, such as positive, negative,
or neutral. Techniques include lexicon-based methods, machine learning classifiers (e.g.,
Support Vector Machines, Naive Bayes), and deep learning models (e.g., CNNs, LSTMs,
Transformers).

6. Named Entity Recognition

Named Entity Recognition, or NER (because we in the tech world are huge fans of our acronyms), is a Natural Language Processing technique that tags named entities within text and extracts them for further analysis.

In the way it is applied, NER is similar to sentiment analysis. NER, however, simply tags the entities, whether they are organization names, people, proper nouns, locations, etc., and keeps a running tally of how many times they occur within a dataset.

How many times an entity (meaning a specific thing) crops up in customer feedback can indicate the need to fix a certain pain point. Within reviews and searches it can indicate a preference for specific kinds of products, allowing you to tailor each customer journey to fit the individual user, thus improving their customer experience.

The limits to NER’s application are only bounded by your feedback and content teams’
imaginations.

NER identifies and categorizes named entities such as persons, organizations, locations, dates,
and numerical expressions in text.

Techniques include rule-based approaches, statistical models (e.g., Conditional Random Fields),
and neural network-based models (e.g., BiLSTM-CRF).

7. Lemmatization and Stemming

More technical than the other topics here, lemmatization and stemming refer to the breakdown, tagging, and restructuring of text data based on either the root stem or the dictionary definition.

That might seem like saying the same thing twice, but both sorting processes can yield different valuable data. Guides on text cleaning for NLP explain how to make the best of both techniques.
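The sketch below contrasts the two processes with NLTK: stemming crudely strips suffixes, while lemmatization looks up dictionary roots (WordNet data is downloaded once); the word list is illustrative.

#Example (illustrative sketch):
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")             # one-time download of the WordNet lexicon

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "leaves"]:
    print(word, "-> stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))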

That’s a lot to tackle at once, but by understanding each process and combing through the
linked tutorials, you should be well on your way to a smooth and successful NLP application.


NL tasks:
Natural language processing (NLP) encompasses a wide range of tasks aimed at understanding,
generating, and processing human language. Some common NLP tasks:

Text Classification:

Assigning predefined categories or labels to text documents based on their content. Examples
include sentiment analysis, topic classification, spam detection, and language identification.

Named Entity Recognition (NER):

Identifying and classifying named entities mentioned in text, such as persons, organizations,
locations, dates, and numerical expressions.

Part-of-Speech (POS) Tagging:

Assigning grammatical categories (e.g., noun, verb, adjective) to each word in a sentence.

Syntactic Parsing:

Analyzing the grammatical structure of sentences to identify phrases and dependencies


between words. This includes constituency parsing and dependency parsing.

Semantic Parsing:

Translating natural language utterances into formal representations, such as logical forms or
executable queries, for tasks like question answering and database querying.

Semantic Role Labeling (SRL):

Identifying the predicate-argument structure of sentences by labeling words or phrases with


their semantic roles, such as agent, patient, and instrument.

Coreference Resolution:

Identifying which words or phrases in a text refer to the same entity, resolving pronouns and
noun phrases to their corresponding antecedents.

Sentiment Analysis:

Determining the sentiment expressed in text, such as positive, negative, or neutral. This can
be applied at the document, sentence, or aspect level.

Text Summarization:

Generating concise summaries of longer texts while preserving the most important information
and key points.

Machine Translation:

Translating text from one language to another automatically using computational methods.

Question Answering (QA):


Generating answers to questions posed in natural language, often based on a given context or
a large corpus of knowledge.

Language Generation:

Generating human-like text based on input prompts or contexts. This includes tasks such as
text generation, dialogue generation, and story generation.

Text Similarity and Clustering:

Measuring the similarity between text documents or clustering them into groups based on
their semantic content or features.

Text Alignment and Alignment-based Tasks:

Aligning text segments across languages, documents, or versions for tasks such as bilingual
dictionary creation, parallel corpus alignment, and plagiarism detection.

Speech Recognition and Speech-to-Text:

Transcribing spoken language into written text, often as part of automatic speech recognition
(ASR) systems.

Language Understanding and Generation:

Understanding natural language commands, queries, or instructions and generating appropriate


responses or actions. This includes tasks like intent recognition, dialogue management, and
task-oriented dialogue systems.

Segmentation:
Text segmentation is the process of dividing written text into meaningful units, such as words,
sentences, or topics. The term applies both to mental processes used by humans when reading
text, and to artificial processes implemented in computers, which are the subject of natural
language processing.
• Word segmentation - Word segmentation is the problem of dividing a string of written
language into its component words.
• Intent segmentation - Intent segmentation is the problem of dividing written words into
key phrases (2 or more group of words).
• Sentence segmentation - Sentence segmentation is the problem of dividing a string of
written language into its component sentences.
• Topic segmentation - Topic analysis consists of two main tasks: topic identification and
text segmentation. While the first is a simple classification of a specific text, the latter
case implies that a document may contain multiple topics, and the task of computerized
text segmentation may be to discover these topics automatically and segment the text
accordingly. The topic boundaries may be apparent from section titles and paragraphs.
In other cases, one needs to use techniques similar to those used in document
classification.
Segmenting text into topics or discourse turns can be useful in several natural language processing tasks: it can significantly improve information retrieval or speech recognition (by indexing/recognizing documents more precisely, or by returning as a result the specific part of a document corresponding to the query). It is also needed in topic detection and tracking systems and in text summarization.

Tagging:
POS tagging (parts-of-speech tagging) is the process of marking up the words in a text with a particular part of speech based on their definition and context. It involves reading text in a language and assigning a specific token (part of speech) to each word. It is also called grammatical tagging.
For example:
Input: Everything to permit us.
Output: [('Everything', NN),('to', TO), ('permit', VB), ('us', PRP)]
Steps involved in the POS tagging example:
1. Tokenize the text (word_tokenize).
2. Apply pos_tag to the tokens from the step above, i.e. nltk.pos_tag(tokenize_text).
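A runnable version of the steps above, using NLTK; the tokenizer and tagger resources are downloaded once.

#Example:
import nltk
from nltk import word_tokenize, pos_tag

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = word_tokenize("Everything to permit us.")
print(pos_tag(tokens))
# e.g. [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]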

Chunking:
Chunking in NLP is a process that takes small pieces of information and groups them into larger units. The primary use of chunking is to form groups of "noun phrases". It is used to add structure to the sentence by applying POS tagging combined with regular expressions. The resulting groups of words are called "chunks". Chunking is also called shallow parsing.
In shallow parsing, there is at most one level between roots and leaves, while deep parsing comprises more than one level. Shallow parsing is also called light parsing or chunking.

Rules for Chunking:


There are no pre-defined rules; you combine tag patterns according to your needs and requirements.
For example, to chunk a noun, a verb (past tense), an adjective, and a coordinating conjunction (e.g. and, but) from a sentence, you can use the rule below:
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

The symbols used in such rules have the following meanings: NN – noun, VBD – verb (past tense), JJ – adjective, CC – coordinating conjunction; "." matches any character except a newline, "?" matches 0 or 1 repetitions, and "*" matches 0 or more repetitions.

#Example:
import nltk
from nltk import pos_tag, RegexpParser

nltk.download('averaged_perceptron_tagger')

# Split the sentence into words
text = "learn NLP from NPTEL and make study easy".split()
print("After Split:", text)

# Tag each word with its part of speech
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

# Define a chunk grammar and build the chunk parser
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

# Group the tagged tokens into chunks
output = chunker.parse(tokens_tag)
print("After Chunking", output)
Output
After Split: ['learn', 'NLP', 'from', 'NPTEL', 'and', 'make', 'study', 'easy']

After Token: [('learn', 'JJ'), ('NLP', 'NNP'), ('from', 'IN'), ('NPTEL', 'NNP'), ('and', 'CC'),
('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]

After Regex: chunk.RegexpParser with 1 stages: RegexpChunkParser with 1 rules:
<ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>

After Chunking (S
  (mychunk learn/JJ)
  (mychunk NLP/NNP)
  from/IN
  (mychunk NPTEL/NNP and/CC)
  make/VB
  (mychunk study/NN easy/JJ))


Chunking is used for entity detection. An entity is the part of the sentence from which the machine gets the value for an intention.
Example:
Temperature of India.
Here Temperature is the intention and India is an entity.
In other words, chunking can be seen as selecting subsets of tokens.

NER:
Named entity recognition (NER) is one of the most common uses of Information Extraction
(IE) technology.
NER systems identify different types of proper names, such as person and company names, and sometimes special types of entities, such as dates and times, that can be easily identified.
NER is especially important in biomedical applications, where terminology is a formidable (difficult) problem. But IE is much more than just NER.

A much more difficult and potentially much more significant capability is the recognition of
events and their participants.
For example, in each of the sentences
“Microsoft acquired Powerset.”
“Powerset was acquired by Microsoft.”
we would like to recognize not only that Microsoft and Powerset are company names, but also
that an acquisition event took place, that the acquiring company was Microsoft, and the
acquired company was Powerset.

How Does Named Entity Recognition Work?


When we read a text, we naturally recognize named entities such as people, organization names, locations, and so on. For example, in the sentence “Mark Zuckerberg is one of the founders of Facebook, a company from the United States” we can identify three types of entities:
“Person”: Mark Zuckerberg
“Company”: Facebook
“Location”: United States

For computers, however, we need to help them recognize entities first so that they can
categorize them.
This is done through machine learning and Natural Language Processing (NLP).
NLP studies the structure and rules of language and creates intelligent systems capable of
deriving meaning from text and speech, while machine learning helps machines learn and
improve over time.
To learn what an entity is, an NER model needs to be able to detect a word, or string of
words that form an entity (e.g. New York City) and know which entity category it belongs to.

So first, we need to create entity categories, like Name, Location, Event, Organization, etc., and feed the NER model relevant training data. Then, by tagging some word and phrase samples with their corresponding entities, you'll eventually teach your NER model how to detect entities itself.
You may create an entity extractor to extract people from a given corpus, or company names, or any other entity.

Extracting the main entities in a text helps sort unstructured data and detect important
information, which is crucial if you have to deal with large datasets.

Some interesting use cases of named entity recognition:


Categorize Tickets in Customer Support: You can also use entity extraction to pull relevant
pieces of data, like product names or serial numbers, making it easier to route tickets to the
most suitable agent or team for handling that issue.
Gain Insights from Customer Feedback: you could use NER to detect locations that are
mentioned most often in negative customer feedback, which might lead you to focus on a
particular office branch.

Content Recommendation: if you watch a lot of comedies on Netflix, you’ll get more
recommendations that have been classified as the entity Comedy.
Process Resumes: By using an entity extractor, recruitment teams can instantly extract the
most relevant information about candidates, from personal information (like name, address,
phone number, date of birth and email), to data related to their training and experience
(such as certifications, degree, company names, skills, etc).

How to use NER?


By adding a sufficient number of examples to the doc_list, one can train a customized NER model using spaCy.
spaCy supports the following entity types:
PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports etc.),
ORG (organizations), GPE (countries, cities etc.), LOC (mountain ranges, water bodies etc.),
PRODUCT (products), EVENT (event names), WORK_OF_ART (books, song titles), LAW
(legal document titles), LANGUAGE (named languages), DATE, TIME, PERCENT, MONEY,
QUANTITY, ORDINAL and CARDINAL.
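A minimal NER sketch with spaCy's pre-trained pipeline, which labels the entity types listed above (it assumes the en_core_web_sm model has been downloaded); the sentence reuses the earlier Mark Zuckerberg example.

#Example (illustrative sketch):
import spacy   # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mark Zuckerberg is one of the founders of Facebook, "
          "a company from the United States.")

# Print each detected entity with its predicted label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# expected labels include PERSON, ORG and GPE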

Parsing:
There is a significant difference between NLP and traditional machine learning tasks, with
the former dealing with unstructured text data while the latter deals with structured
tabular data. Hence, it is necessary to understand how to deal with text before applying
machine learning techniques to it. This is where text parsing comes into the picture.
Text parsing is a common programming task that separates a given series of text into smaller components based on some rules. Its applications range from document parsing to deep learning NLP.
Two popular options for text parsing are regular expressions and word tokenization.


An assumption in most work in natural language processing is that the basic unit of meaning
analysis is the sentence: a sentence expresses a proposition(plan / suggestion) , an idea, or a
thought, and says something about some real or imaginary world.
Extracting the meaning from a sentence is thus a key issue. Sentences are not, however, just
linear sequences of words, and so it is widely recognized that to carry out this task requires
an analysis of each sentence, which determines its structure in one way or another.
In NLP approaches based on generative linguistics, this is generally taken to involve the
determining of the syntactic or grammatical structure of each sentence.

Word Sense Disambiguation:


We understand that words have different meanings based on the context of their usage in the sentence. Human languages are ambiguous because many words can be interpreted in multiple ways depending upon the context of their occurrence.
Word sense disambiguation, in NLP, may be defined as the ability to determine which meaning of a word is activated by its use in a particular context.
Lexical (word / vocabulary) ambiguity, whether syntactic or semantic, is one of the very first problems that any NLP system faces.

Part-of-speech (POS) taggers with a high level of accuracy can resolve a word's syntactic ambiguity.
The problem of resolving semantic ambiguity, on the other hand, is called WSD (word sense disambiguation). Resolving semantic ambiguity is harder than resolving syntactic ambiguity.
For example, consider the two distinct senses that exist for the word “bass”:

I can hear bass sound.
He likes to eat grilled bass.

Each occurrence of the word bass denotes a distinct meaning: in the first sentence it means a low-pitched frequency, and in the second it means a fish. Hence, if the sentences are disambiguated by word sense disambiguation (WSD), the correct meanings can be assigned as follows −
I can hear bass/low-pitched frequency sound.
He likes to eat grilled bass/fish.
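As a hedged illustration, the sketch below applies NLTK's simplified Lesk algorithm (a dictionary-overlap WSD method, not necessarily what these notes prescribe) to the two "bass" sentences; results depend on WordNet's sense inventory.

#Example (illustrative sketch):
import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("wordnet")

for sentence in ["I can hear bass sound.", "He likes to eat grilled bass."]:
    sense = lesk(word_tokenize(sentence), "bass")   # pick the WordNet sense with most overlap
    print(sentence, "->", sense, "-", sense.definition() if sense else "no sense found")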

NL Generation:
Natural language generation is one of the frontiers of artificial intelligence. It is the idea that
computers and technologies can take non-language sources, for example, Excel spreadsheets,
videos, metadata and other sources, and create natural language outputs that seem human,
given that humans are the only biological creatures that use complex natural language.
NLG solutions are made of three main components:
▪ the data behind the narrative,
▪ the conditional logic and software that makes sense of that data, and
▪ the resulting content that is generated.

From video game and fantasy football match recaps, to custom BI dashboard analysis and client
communications, natural language generation is valuable wherever there is a need to generate
content from data.
Natural language generation empowers organizations to create data-driven narratives that are
personalized, insightful, and sound as if a human wrote each one individually, all at massive
scale.
NLG is different from NLP. NLP is focused on deriving analytic insights from textual data, whereas NLG
is used to synthesize textual content by combining analytic output with contextualized
narratives.
In other words, NLP reads while NLG writes. NLP systems look at language and figure out what
ideas are being communicated. NLG systems start with a set of ideas locked in data and turn
them into language that, in turn, communicates them.

For example, using the historical data for July 1, 2005, the software produces:
Grass pollen levels for Friday have increased from the moderate to high levels of yesterday
with values of around 6 to 7 across most parts of the country. However, in Northern areas,
pollen levels will be moderate with values of 4.

How does NLG work?


An automated text generation process involves six stages. For the sake of simplicity, let’s
consider each stage using the example of robot-journalist news about a football match:
• Content Determination: The limits of the content should be determined. The data often
contains more information than necessary. In the football news example, content regarding goals,
cards, and penalties will be important for readers.
• Data interpretation: The analyzed data is interpreted. With machine learning
techniques, patterns can be recognized in the processed data. This is where data is put
into context. For instance, information such as the winner of the match, goal scorers and
assisters, and the minutes when goals are scored is identified in this stage.
• Document planning: In this stage, the structures in the data are organized with the aim
of creating a narrative structure and document plan. Football news generally starts with
a paragraph that indicates the score of the game with a comment that describes the
level of intensity and competitiveness in the game, then the writer reminds the pre-
game standings of teams, describes other highlights of the game in the next paragraphs,
and ends with player and coach interviews.
• Sentence Aggregation: It is also called micro planning, and this stage is where different
sentences are aggregated in context because of their relevance.

For example, the first two sentences below provide different pieces of information. However, if the second
event occurs right before half time, then the two sentences can be aggregated as in the third
sentence:
“[X team] maintained their lead into halftime. “
“VAR (video assistant referee) overruled a decision to award [Y team]’s [Football player Z] a
penalty after replay showed [Football player T]’s apparent kick didn’t connect.”

“[X team] maintained their lead into halftime after VAR overruled a decision to award [Y
team]’s [Football player Z] a penalty after replay showed [Football player T]’s apparent kick
didn’t connect.”

• Grammaticalization: This stage makes sure that the whole report follows
the correct grammatical form, spelling, and punctuation. This includes validation of the
actual text according to the rules of syntax, morphology, and orthography. For instance,
football games are written in the past tense.
• Language Implementation: This stage involves inputting data into templates and
ensuring that the document is output in the right format and according to the
preferences of the user.
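A minimal sketch of this last stage, filling a hand-written template from structured match data (the field names and the template are invented for illustration; production NLG systems use far richer document plans):

# Structured data produced by the earlier stages (invented values).
match = {
    "winner": "X team",
    "loser": "Y team",
    "score": "2-1",
    "scorer": "Player Z",
    "minute": 78,
}

# A hand-written sentence template for the "language implementation" stage.
template = ("{winner} beat {loser} {score} after {scorer} scored "
            "the decisive goal in the {minute}th minute.")

print(template.format(**match))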

Web 2.0 Applications:-Sentiment Analysis:


Sentiment Analysis is the computational treatment of opinions, sentiments and subjectivity in
text, and the use of them for the benefit of business operations.
It is the process of examining a piece of text for opinions and feelings. There are innumerable
real-life use cases for sentiment analysis, including understanding how consumers feel about
a product or service, looking for signs of depression, or seeing how people respond to certain
advertising and political campaigns.
There are three main classification levels in SA:
• document-level,
• sentence-level and
• aspect-level SA.

Document-level SA aims to classify whether the whole document expresses a positive
or negative opinion or sentiment. It considers the whole document a basic information unit.

Sentence-level SA aims to classify sentiment expressed in each sentence. The first step is to
identify whether the sentence is subjective or objective. If the sentence is subjective,
Sentence-level SA will determine whether the sentence expresses positive or negative
opinions. Classifying text at the document level or at the sentence level does not provide the
necessary detail needed in many applications. To obtain these details, we need to go to the
aspect level.

Aspect-level SA aims to classify the sentiment with respect to the specific aspects of
entities. The first step is to identify the entities and their aspects.

It involves breaking down text data into smaller fragments, allowing you to obtain more
granular and accurate insights from your data.
For example: “The food was great but the service was poor.”
In cases like this, there is more than one sentiment and more than one topic in a single
sentence, so to label the whole review as either positive or negative would be incorrect. Use
aspect-based sentiment analysis here, which extracts and separates each aspect and
sentiment polarity in the sentence.

In this instance, the aspects are Food and Service, resulting in the following sentiment
attribution:
The food was great = Food → Positive
The service was poor = Service → Negative

With the proper representation of the text after text pre-processing, many of the
techniques, such as clustering and classification, can be adapted to text mining.
For example, the k-means algorithm can be modified to cluster text documents into groups, where each
group represents a collection of documents with a similar topic.
The distance of a document to a centroid represents how closely the document talks about
that topic.
Classification tasks such as sentiment analysis and spam filtering are prominent use cases for
the Naive Bayes classifier. Text mining may utilize methods and techniques from natural
language processing.
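A hedged sketch of sentence-level sentiment classification with a Naive Bayes classifier in scikit-learn (the tiny training set below is invented; real systems are trained on thousands of labelled reviews):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled examples (invented for illustration only).
train_texts = ["The food was great", "Excellent and friendly service",
               "The service was poor", "Terrible, bland food"]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features + multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["The food was amazing"]))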

[Figure: Sentiment Analysis Steps]

Text Entailment:
Textual entailment (TE) in natural language processing is a directional
relation between text fragments. The relation holds whenever the truth of one text fragment
follows from another text fragment.
In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h),
respectively.
Textual entailment measures natural language understanding as it asks for a semantic
interpretation of the text, and due to its generality remains an active area of research.

Examples
Textual entailment can be illustrated with examples of three different relations:
An example of a positive TE (text entails hypothesis) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has good consequences.
An example of a negative TE (text contradicts hypothesis) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has no consequences.
An example of a non-TE (text does not entail nor contradict) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man will make you a better person.
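As a purely illustrative baseline (not how research TE systems actually work), one can guess entailment from lexical overlap between text and hypothesis; the threshold below is arbitrary and the method cannot detect contradictions:

def overlap_entailment(text, hypothesis, threshold=0.7):
    # Guess "entails" if most hypothesis words also occur in the text.
    t_words = set(text.lower().split())
    h_words = set(hypothesis.lower().split())
    overlap = len(h_words & t_words) / len(h_words)
    return "entails" if overlap >= threshold else "unknown"

text = "If you help the needy, God will reward you."
hypothesis = "Giving money to a poor man has good consequences."
print(overlap_entailment(text, hypothesis))   # a weak baseline; here it prints "unknown"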

Cross Lingual Information Retrieval (CLIR).


Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with
retrieving information written in a language different from the language of the user's query.
The term "cross-language information retrieval" has many synonyms, of which the following
are perhaps the most frequent: cross-lingual information retrieval, translingual information
retrieval, multilingual information retrieval.
The term "multilingual information retrieval" refers more generally both to technology for
retrieval of multilingual collections and to technology which has been moved to handle material
in one language to another.
The term Multilingual Information Retrieval (MLIR) involves the study of systems that accept
queries for information in various languages and return objects (text, and other media) of
various languages, translated into the user's language.

Cross-language information retrieval refers more specifically to the use case where users
formulate their information need in one language and the system retrieves relevant documents
in another.

Approaches to CLIR
Various approaches can be adopted to create a cross lingual search system. They are as
follows:
Query translation approach: In this approach, the query is translated into the language of
the documents. Many translation schemes are possible, such as dictionary-based translation or
more sophisticated machine translation.

The dictionary-based approach uses a lexical resource such as a bilingual dictionary to translate
words from the source language to the target document language. This translation can be done at the word
level or the phrase level.
The main assumption in this approach is that the user can read and understand documents in the target
language. If the user is not conversant with the target language, he/she can use external
tools to translate a document in the foreign language into his/her native language. Such
tools need not be available for all language pairs.
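A minimal sketch of dictionary-based query translation followed by simple term-matching retrieval (the toy bilingual dictionary and document collection are invented for illustration):

# Toy English -> Spanish dictionary (invented).
bilingual_dict = {
    "weather": ["tiempo", "clima"],
    "forecast": ["pronóstico"],
}

# Toy Spanish document collection (invented).
documents = [
    "El pronóstico del tiempo para mañana es soleado.",
    "La receta necesita dos huevos y harina.",
]

def translate_query(query):
    # Expand each query word into all of its dictionary translations.
    terms = []
    for word in query.lower().split():
        terms.extend(bilingual_dict.get(word, [word]))  # keep untranslatable words as-is
    return terms

def retrieve(query):
    # Rank documents by how many translated query terms they contain.
    terms = translate_query(query)
    scored = [(sum(term in doc.lower() for term in terms), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, reverse=True) if score > 0]

print(retrieve("weather forecast"))   # -> the Spanish weather document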

Document translation approach: This approach translates the documents in foreign languages
into the query language. Although it eases the problem stated above, this approach
has scalability issues: there are too many documents to be translated, and each document is
quite large compared to a query. This makes the approach practically unsuitable.

Interlingua-based approach: In this case, the documents and the query are both translated
into some common interlingua. This approach generally requires huge resources as the
translation needs to be done online. A possible solution to overcome the problems in query and
document translation is to use query translation followed by snippet translation instead of
document translation.

A snippet generally contains parts of a document containing query terms. This can give a clue
to the end user about usability of document. If the user finds it useful, then document
translation can be used to translate the document in language of the user. With every approach
comes a challenge with an associated cost.
