NLP-Unit-I - Notes
UNIT – I
Introduction to NLP, brief history, NLP applications: Speech to
Text(STT), Text to Speech(TTS), Story Understanding, NL Generation,
QA system, Machine Translation, Text Summarization, Text
classification, Sentiment Analysis, Grammar/Spell Checkers etc.,
challenges/Open Problems, NLP abstraction levels, Natural Language
(NL) Characteristics and NL computing approaches/techniques and
steps, NL tasks: Segmentation, Chunking, tagging, NER, Parsing, Word
Sense Disambiguation, NL Generation, Web 2.0 Applications : Sentiment
Analysis; Text Entailment; Cross Lingual Information Retrieval
(CLIR).
Introduction to NLP:
NLP stands for Natural Language Processing, which is a part of Computer Science, Human
language, and Artificial Intelligence. It is the technology that is used by machines to
understand, analyse, manipulate, and interpret human languages. It helps developers to
organize knowledge for performing tasks such as translation, automatic summarization, Named
Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
Brief History:
(1940-1960) - Focused on Machine Translation (MT)
The history of Natural Language Processing begins in the 1940s.
1948 - In 1948, the first recognisable NLP application was introduced at Birkbeck College,
London.
1950s - In the 1950s, there were conflicting views between linguistics and computer science.
Chomsky published his first book, Syntactic Structures, and claimed that language is generative
in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based
descriptions of syntactic structures.
(1960-1980) - Flavored with Artificial Intelligence (AI)
In the year 1960 to 1980, the key developments were:
Augmented Transition Networks (ATN)
An Augmented Transition Network extends a finite-state machine with recursion and registers,
allowing it to recognize structures beyond regular languages.
Case Grammar
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar analyses
how languages such as English express the relationship between nouns and verbs, often through
prepositions.
In Case Grammar, case roles are defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example, case grammar identifies
Neha as the agent, the mirror as the theme, and the hammer as the instrument.
In the year 1960 to 1980, key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It allowed users to communicate with
the computer and move objects in a simulated blocks world. It could handle instructions such as
"pick up the green ball" and also answer questions like "What is inside the black box?". The main
importance of SHRDLU is that it showed that syntax, semantics, and reasoning about the world can
be combined to produce a system that understands natural language.
LUNAR
LUNAR is the classic example of a natural language database interface system; it used ATNs and
Woods' Procedural Semantics. It was capable of translating elaborate natural language
expressions into database queries and handled 78% of requests without errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of hand-written
rules. After 1980, NLP introduced machine learning algorithms for language processing.
In the early 1990s, NLP started growing faster and achieved good accuracy, especially for
English grammar. Large electronic text collections also became available around this time,
providing a good resource for training and evaluating natural language programs. Other factors
include the availability of computers with fast CPUs and more memory. A major factor behind the
advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine translation, and
machine text reading. Combining these applications allows artificial intelligence systems to
gain knowledge of the world. Consider the example of Amazon Alexa: you can ask Alexa a question
and it will reply to you.
NLP applications:
The following are common applications of NLP:
1. Question Answering
2. Spam Detection.
3. Sentiment Analysis
4. Machine Translation
5. Spelling correction
6. Speech Recognition
7. Chatbot
8. Information extraction
9. Natural Language Understanding (NLU)
Speech to Text(STT):
STT applications convert spoken language into text format, enabling users to dictate text instead of
typing. Examples include voice-activated assistants like Siri, Google Assistant, and Amazon Alexa, as
well as dictation software for transcription purposes.
STT systems use various techniques such as acoustic modeling, language modeling, and speech
recognition algorithms to accurately transcribe spoken words into written text.
Transcription is a complex process that involves multiple stages and AI models working together.
Here's an overview of key steps in speech-to-text:
• Pre-processing: Before the input audio can be transcribed, it often undergoes some pre-
processing steps. This can include noise reduction, echo cancellation, and other techniques to
enhance the quality of the audio signal.
• Feature extraction: The audio waveform is then converted into a more suitable representation
for analysis. This usually involves extracting features from the audio signal that capture
important characteristics of the sound, such as frequency, amplitude, and duration. Mel-
frequency cepstral coefficients (MFCCs) are commonly used features in speech processing.
• Acoustic modelling: Involves training a statistical model that maps the extracted features to
phonemes, the smallest units of sound in a language.
• Language modelling: Language modeling focuses on the linguistic aspect of speech. It involves
creating a probabilistic model of how words and phrases are likely to appear in a particular
language. This helps the system make informed decisions about which words are more likely to
occur, given the previous words in the sentence.
• Decoding: In the decoding phase, the system uses the acoustic and language models to
transcribe the audio into a sequence of words or tokens. This process involves searching for the
most likely sequence of words that correspond to the given audio features.
• Post-processing: The decoded transcription may still contain errors, such as misrecognitions or
homophones (words that sound the same but have different meanings). Post-processing
techniques, including language constraints, grammar rules, and contextual analysis, are applied
to improve the accuracy and coherence of the transcription before producing the final output.
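As a rough end-to-end illustration of this pipeline, the sketch below uses the open-source
Python SpeechRecognition package, which wraps the feature extraction, acoustic/language
modelling, and decoding steps inside a cloud recognizer call. The file name sample.wav is only a
placeholder.
# Illustrative sketch: transcribing a WAV file with the SpeechRecognition package
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:          # placeholder audio file
    recognizer.adjust_for_ambient_noise(source)     # simple pre-processing step
    audio = recognizer.record(source)               # read the audio into memory

try:
    # acoustic modelling, language modelling and decoding happen inside the recognizer
    print("Transcription:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible")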
Text to Speech(TTS):
TTS applications convert written text into spoken language, allowing computers to generate
human-like speech. Examples include screen readers for visually impaired users, virtual
assistants, and audiobook narration.
TTS systems utilize techniques such as text analysis, prosody modeling, and speech synthesis
to produce natural-sounding speech from written text.
As a concrete example, OpenAI's Audio API provides a speech endpoint based on its TTS
(text-to-speech) model, which comes with six built-in voices.
The speech endpoint takes three key inputs: the model, the text that should be turned into
audio, and the voice to be used for the audio generation.
By default, the endpoint outputs an MP3 file of the spoken audio, but it can also be configured
to output any of the other supported formats.
Audio quality
For real-time applications, the standard tts-1 model provides the lowest latency but at a lower
quality than the tts-1-hd model. Due to the way the audio is generated, tts-1 is likely to
generate content that has more static in certain situations than tts-1-hd. In some cases, the
audio may not have noticeable differences depending on your listening device and the individual
person.
Voice options
Experiment with different voices (alloy, echo, fable, onyx, nova, and shimmer) to find one that
matches your desired tone and audience.
The default response format is "mp3", but other formats like "opus", "aac", "flac", and "pcm"
are available.
• WAV: Uncompressed WAV audio, suitable for low-latency applications to avoid decoding
overhead.
• PCM: Similar to WAV but containing the raw samples in 24kHz (16-bit signed, little-
endian), without the header.
There is no direct mechanism to control the emotional output of the generated audio. Certain
factors, such as capitalization or grammar, may influence the output audio, but OpenAI's
internal tests with these have yielded mixed results.
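The sketch below shows roughly how such a request might look with the official openai Python
SDK; it assumes the package is installed and an API key is configured in the environment, and
exact method names may differ between SDK versions.
# Illustrative sketch: generating speech with OpenAI's speech endpoint
from openai import OpenAI

client = OpenAI()                      # reads OPENAI_API_KEY from the environment
response = client.audio.speech.create(
    model="tts-1",                     # low-latency model; tts-1-hd trades latency for quality
    voice="alloy",                     # one of: alloy, echo, fable, onyx, nova, shimmer
    input="Natural language processing lets computers work with human language.",
)
response.stream_to_file("speech.mp3")  # default response format is mp3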
Story Understanding:
Story understanding involves analyzing and comprehending the structure, themes, and meaning
of written narratives.
NLP techniques such as semantic analysis, entity recognition, and topic modeling can be used
to extract key information and infer relationships between characters, events, and plot
elements in a story.
NL Generation:
NL Generation involves generating human-like text based on input data or prompts.
Applications include chatbots, language translation, and text generation for creative writing
or content generation purposes.
QA system:
QA (Question Answering) systems provide answers to user queries based on a given context
or knowledge base.
Examples include virtual assistants, search engines, and customer support chatbots.
QA systems use techniques such as information retrieval, natural language understanding, and
machine learning to analyze questions and retrieve relevant answers from structured or
unstructured data sources.
It is used to answer questions in the form of natural language and has a wide range of
applications.
Typical applications include: intelligent voice interaction, online customer service, knowledge
acquisition, personalized emotional chatting, and more.
There are two types of QA systems, open and closed.
A system that tries to answer any question you could possibly ask is called an open system or
open domain system – think of Google or Alexa.
And then there are closed (or closed domain) systems that are built for a certain subject or
function or domain of knowledge or company. E.g. Paypal, Finn AI, Digital genius etc.
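As a small illustration of a closed-context (extractive) QA system, the sketch below assumes the
Hugging Face transformers library; the pipeline downloads a default extractive QA model, which
is an assumption of this example.
# Illustrative sketch: extractive question answering with Hugging Face transformers
from transformers import pipeline

qa = pipeline("question-answering")    # downloads a default extractive QA model
result = qa(
    question="Who developed Case Grammar?",
    context="Case Grammar was developed by the linguist Charles J. Fillmore in 1968.",
)
print(result["answer"], result["score"])   # answer span plus a confidence score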
Machine Translation:
Machine translation systems automatically translate text from one language to another.
Examples include Google Translate, Microsoft Translator, and language translation features in
mobile devices and web browsers.
Machine Translation (MT) is the task of automatically converting one natural language into
another, preserving the meaning of the input text, and producing fluent text in the output
language.
There are many challenging aspects of MT:
• the large variety of languages, alphabets and grammars;
• translating a sequence (a sentence, for example) to another sequence is harder for a
computer than working with numbers only;
• there is no single correct answer.
Application: Google Translate
Text Summarization:
Text summarization involves condensing large amounts of text into shorter, more concise
summaries while preserving key information and main ideas. Applications include summarization
of news articles, research papers, and legal documents.
It is a process of generating a concise and meaningful summary of text from multiple text
resources such as books, news articles, blog posts, research papers, emails, and tweets.
Text classification:
Text classification involves categorizing text documents into predefined classes or categories
based on their content or topic. Examples include spam detection, sentiment analysis, and topic
classification in news articles.
Text classification techniques include machine learning algorithms such as Naive Bayes,
Support Vector Machines (SVM), and deep learning models like Convolutional Neural Networks
(CNN) and Recurrent Neural Networks (RNN).
Text classification, also known as text tagging or text categorization, is the process of
categorizing text into organized groups.
By using NLP, text classifiers can automatically analyse text and then assign a set of pre-
defined tags or categories based on its content.
Unstructured text is everywhere, such as emails, chat conversations, websites, and social
media but it’s hard to extract value from this data unless it’s organized in a certain way.
Text classifiers with NLP have proven to be a great alternative to structure textual data in a
fast, cost-effective, and scalable way.
Application: Email services / medical report / news
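A minimal sketch of statistical text classification with scikit-learn is shown below; the tiny
spam/ham training set is invented purely for illustration.
# Illustrative sketch: spam detection with TF-IDF features and Naive Bayes (scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["win a free prize now", "lowest price guaranteed, click here",
               "meeting rescheduled to monday", "please review the attached report"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())   # vectorize, then classify
model.fit(train_texts, train_labels)
print(model.predict(["claim your free prize", "see you at the meeting"]))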
Sentiment Analysis:
Sentiment analysis involves analyzing text data to determine the sentiment or emotion
expressed within it. Applications include social media monitoring, customer feedback analysis,
and brand sentiment analysis.
Sentiment analysis techniques include lexicon-based methods, machine learning classifiers, and
deep learning models for emotion detection and sentiment classification.
Sentiment analysis (or opinion mining) is an NLP technique used to determine whether data is
positive, negative or neutral.
Sentiment analysis is often performed on textual data to help businesses monitor brand and
product sentiment in customer feedback, and understand customer needs.
Sentiment analysis models focus on polarity (positive, negative, neutral) and also on feelings
and emotions (angry, happy, sad, etc), urgency (urgent, not urgent) and even intentions
(interested v. not interested).
Application: pros and cons of a product – useful in e-commerce
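The sketch below shows the lexicon-based approach using NLTK's VADER analyzer (one possible
choice of lexicon, assumed to be downloadable).
# Illustrative sketch: lexicon-based sentiment analysis with NLTK's VADER
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The battery life is great but the camera is disappointing."))
# returns neg/neu/pos proportions and a 'compound' polarity score in [-1, 1]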
Grammar/Spell Checkers:
Grammar and spell checkers automatically detect and correct errors in written text, such as
grammatical mistakes, spelling errors, and punctuation errors. Examples include built-in spell
checkers in word processing software, grammar checking tools, and browser extensions.
Grammar and spell checkers use language rules, dictionaries, and statistical models to identify
and suggest corrections for errors in text.
A word needs to be checked for spelling correctness and corrected if necessary, often in the
context of the surrounding words.
Spell Check is a process of detecting and sometimes providing suggestions for incorrectly
spelled words in a text.
A basic spell checker carries out the following processes:
It scans the text and extracts the words contained in it.
It then compares each word with a known list of correctly spelled words (i.e. a dictionary).
Application: word /power point / google search
Autocorrect
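A minimal sketch of the dictionary-lookup idea is given below; the tiny word list stands in for
a real dictionary.
# Illustrative sketch: a basic dictionary-lookup spell checker
import re

dictionary = {"natural", "language", "processing", "is", "fun"}   # toy word list

def check_spelling(text):
    words = re.findall(r"[a-z']+", text.lower())        # scan the text and extract the words
    return [w for w in words if w not in dictionary]    # words not found in the dictionary

print(check_spelling("Natural langage processing is fun"))   # -> ['langage']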
Challenges/Open Problems:
Natural Language Processing (NLP) faces various challenges due to the complexity and
diversity of human language. Let’s discuss 10 major challenges in NLP:
1. Language differences
Human language is rich and intricate, and there are many languages spoken by humans. Thousands
of human languages are spoken around the world, each with its own grammar, vocabulary and
cultural nuances. No person can understand all of them, and the productivity of human language
is high. There is also ambiguity in natural language, since the same words and phrases can have
different meanings in different contexts. This is a major challenge in understanding natural
language.
Natural languages have complex syntactic structures and grammatical rules, covering word order,
verb conjugation, tense, aspect and agreement. Human language has rich semantic content that
allows speakers to convey a wide range of meanings through words and sentences. Language is
also pragmatic, which means that how it is used in context shapes how communication goals are
achieved. Human language evolves over time through processes such as lexical change, and this
change reflects cultural, social and historical factors.
2. Training Data
Training data is a curated collection of input-output pairs, where the input represents the
features or attributes of the data, and the output is the corresponding label or target.
Training data is composed of both the features (inputs) and their corresponding labels
(outputs). For NLP, features might include text data, and labels could be categories,
sentiments, or any other relevant annotations.
It helps the model generalize patterns from the training set to make predictions or
classifications on new, previously unseen data.
Development time and resource requirements for Natural Language Processing (NLP) projects
depend on various factors, including task complexity, the size and quality of the data, the
availability of existing tools and libraries, and the expertise of the team involved. Here are
some key points:
• Complexity of the task: Tasks such as text classification or sentiment analysis may require
less time than more complex tasks such as machine translation or question answering.
• Availability and Quality of Data: Natural Language Processing models require high-quality
annotated data. Collecting, annotating, and preprocessing large text datasets can be
time-consuming and resource-intensive, especially for tasks that require specialized domain
knowledge or fine-grained annotations.
• Selection of algorithm and development of model: It is difficult to choose the machine
learning algorithms that are best suited to a given Natural Language Processing task.
• Training and Evaluation: Training requires powerful computational resources, including
hardware such as GPUs or TPUs, and time for iterating over the algorithms. It is also
important to evaluate the performance of the model with suitable metrics and validation
techniques to confirm the quality of the results.
Overcoming misspellings and grammatical errors is a basic challenge in NLP, as these are forms
of linguistic noise that can impact the accuracy of understanding and analysis. Here are some
key points for handling misspellings and grammatical errors in NLP:
• Spell Checking: Implement spell-check algorithms and dictionaries to find and correct
misspelled words.
• Text Normalization: The text is normalized by converting it into a standard format, which
may include tasks such as converting text to lowercase, removing punctuation and special
characters, and expanding contractions.
• Tokenization: The text is split into individual tokens with the help of tokenization
techniques. This makes it possible to identify and isolate misspelled words and grammatical
errors, which makes them easier to correct.
• Language Models: Language models trained on large corpora of data can predict how likely a
word or phrase is to be correct based on its context.
Mitigating innate biases in NLP algorithms is a crucial step for ensuring fairness, equity, and
inclusivity in natural language processing applications. Here are some key points for mitigating
biases in NLP algorithms.
• Collection of data and annotation: It is very important to confirm that the training data
used to develop NLP algorithms is diverse, representative and free from biases.
• Analysis and Detection of bias: Apply bias detection and analysis methods to the training
data to find biases based on demographic factors such as race, gender and age.
• Data Preprocessing: Preprocessing the training data is an important way to mitigate biases,
for example by debiasing word embeddings, balancing class distributions and augmenting
underrepresented samples.
• Fair representation learning: Natural Language Processing models can be trained to learn
fair representations that are invariant to protected attributes like race or gender.
• Auditing and Evaluation of Models: Natural language models are evaluated for fairness and
bias with the help of metrics and audits. NLP models are evaluated on diverse datasets, and
post-hoc analyses are performed to find and mitigate innate biases in NLP algorithms.
Words with multiple meanings pose a lexical challenge in Natural Language Processing because of
the ambiguity they introduce. Such words, known as polysemous or homonymous words, have
different meanings depending on the context in which they are used, so the surrounding context
must be taken into account to select the intended sense.
8. Addressing Multilingualism
• Multilingual Corpora: Multilingual corpora consist of text data in various languages and
serve as valuable resources for training NLP models and systems.
• Cross-Lingual Transfer Learning: These techniques transfer knowledge learned from one
language to another.
• Language Identification: Build language identification models to automatically detect the
language of a given text.
• Machine Translation: Machine translation enables communication and information access
across language barriers and can be used as a preprocessing step for multilingual NLP tasks.
Reducing uncertainty and false positives in Natural Language Processing (NLP) is a crucial task
for improving the accuracy and reliability of NLP models.
Facilitating continuous conversations with NLP involves developing systems that understand and
respond to human language in real time, enabling seamless interaction between users and
machines. Implementing real-time natural language processing pipelines gives the system the
capability to analyse and interpret user input as it is received; the algorithms and systems
are optimized for low-latency processing to ensure quick responses to user queries and inputs.
It also requires building NLP models that can maintain context throughout a conversation.
Understanding context enables systems to interpret user intent, track conversation history, and
generate relevant responses based on the ongoing dialogue. Intent recognition algorithms are
applied to find the underlying goals and intentions expressed by users in their messages.
In general, the following points help in overcoming these challenges:
• Quantity and Quality of data: High-quality, diverse data is needed to train NLP algorithms
effectively. Data augmentation, data synthesis and crowdsourcing are techniques for
addressing data scarcity issues.
• Ambiguity: The NLP algorithm should be trained to disambiguate words and phrases.
• Out-of-vocabulary Words: Techniques such as subword tokenization, character-level modeling,
and vocabulary expansion are used to handle out-of-vocabulary words.
• Lack of Annotated Data: Techniques such as transfer learning and pre-training can be used
to transfer knowledge from large datasets to specific tasks with limited labeled data.
NLP abstraction levels:
Tokenization and sentence segmentation: we need to separate the text into individual tokens. The
"minimal unit of meaning" is referred to as the morpheme. E.g. in "happiness", the stem "happy"
is considered a free morpheme since it is a "word" in its own right. Bound morphemes (prefixes
and suffixes, such as "-ness") require a free morpheme to which they can be attached and can
therefore not appear as a "word" on their own.
Lexical analysis: The lexical analysis in NLP deals with the study at the level of words with
respect to their lexical (word) meaning and part-of-speech. This level of linguistic processing
utilizes a language’s lexicon (dictionary), which is a collection of individual lexemes. A lexeme
is a basic unit of lexical meaning, an abstract unit of morphological analysis that
represents the set of forms or “senses” taken by a single morpheme.
“duck”, for example, can take the form of a noun (swimming bird) or a verb (bow down) but its
part-of-speech and lexical meaning can only be derived in context with other words used in
the phrase/sentence. This, in fact, is an early step towards a more sophisticated Information
Retrieval system where precision is improved through part-of-speech tagging.
Syntactic Analysis: Syntactic Analysis also referred to as “parsing”, allows the extraction of
phrases which convey more meaning than just the individual words by themselves. In
Information Retrieval, parsing can be used to improve indexing since phrases can be used as
representations of documents which provide better information than just single-word indices.
In the same way, phrases that are syntactically derived from the query offers better search
keys to match with documents that are similarly parsed.
Nevertheless, syntax can still be ambiguous at times as in the case of the news headline:
“Boy paralyzed after tumour fights back to gain black belt” — which actually refers to how a
boy was paralyzed because of a tumour but endured the fight against the disease and
ultimately gained a high level of competence in martial arts.
Pragmatic Analysis: The pragmatic(Realistic / Logical) level of linguistic processing deals with
the use of real-world knowledge and understanding of how this impacts the meaning of what is
being communicated. By analyzing the contextual dimension of the documents and queries, a
more detailed representation is derived.
In Information Retrieval, this level of Natural Language Processing primarily engages query
processing and understanding by integrating the user’s history and goals as well as the context
upon which the query is being made. Contexts may include time and location.
This level of analysis enables major breakthroughs in Information Retrieval as it facilitates
the conversation between the information retrieval system and the users, allowing the
induction of the purpose upon which the information being sought is planned to be used,
thereby ensuring that the information retrieval system is fit for purpose.
Natural Language (NL) Characteristics:
• Language is Arbitrary
• Language is Social
• Language is Symbolic
• Language is Systematic
• Language is Vocal
• Language is Non-instinctive(instinct – natural / unlearn)
• Language is Productive and Creative
Language is Arbitrary: Language is arbitrary in the sense that there is no inherent relation
between the words of a language and their meanings or the ideas conveyed by them.
Language is Symbolic: Language consists of various sound symbols and their graphological
counterparts that are employed to denote some objects, occurrences or meaning.
Language is Systematic: Although language is symbolic, yet its symbols are arranged in a
particular system. All languages have their system of arrangements. Furthermore, all
languages have phonological and syntactic systems and within a system, there are also several
sub-systems.
Language is Productive and Creative: Language has creativity and productivity. The structural
elements of human language can be combined to produce new utterances, which neither the speaker
nor the hearer may ever have made or heard before, yet which both sides understand without
difficulty.
NL computing approaches/techniques and steps:
1. Tokenization:
Tokenization involves breaking down text into smaller units, such as words, subwords, or
characters, for further processing.
Techniques include whitespace tokenization, word tokenization, and subword tokenization
(e.g., Byte-Pair Encoding, WordPiece).
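The short sketch below contrasts whitespace tokenization with NLTK's word tokenizer (the punkt
tokenizer data is assumed to be downloadable).
# Illustrative sketch: whitespace vs. word tokenization with NLTK
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

sentence = "Don't split contractions naively, please."
print(sentence.split())           # whitespace tokenization keeps "Don't" and "please." intact
print(word_tokenize(sentence))    # word tokenization separates "Do"/"n't" and the punctuation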
2. Lexical Analysis:
Lexical analysis decomposes words into their parts and maintains rules for how combinations are
formed. Morphological processing can go some way toward handling unrecognized words.
Morphological processes alter stems to derive new words. They may change the word’s
meaning (derivational) or grammatical function (inflectional).
3. Syntactic Parsing:
The basic unit of meaning analysis is the sentence: a sentence expresses a proposition, an idea,
or a thought, and says something about some real or imaginary world. In NLP approaches based
on generative linguistics, this is generally taken to involve the determining of the syntactic or
grammatical structure of each sentence.
Syntactic parsing analyzes the grammatical structure of sentences to identify phrases and
dependencies between words. Techniques include constituency parsing (e.g., Context-Free
Grammars), dependency parsing (e.g., Transition-Based Parsing, Graph-Based Parsing), and
neural network-based parsers (e.g., Recursive Neural Networks, Transformer models).
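A small constituency-parsing sketch with a toy context-free grammar in NLTK is shown below; the
grammar and sentence are invented for illustration.
# Illustrative sketch: constituency parsing with a toy context-free grammar (NLTK)
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)    # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))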
4. Semantic Analysis:
It is these subsequent steps that derive a meaning for the sentence in question. Semantic
analysis focuses on extracting meaning from text, including tasks such as semantic role
labeling, semantic parsing, and word sense disambiguation.
Techniques include statistical models, graph-based approaches, and neural network-based
methods (e.g., Transformers).
5. Sentiment Analysis
This is the dissection of data (text, voice, etc) in order to determine whether it’s positive,
neutral, or negative. It tags each statement with ‘sentiment’ then aggregates the sum of all
the statements in a given dataset.
Sentiment analysis can transform large archives of customer feedback, reviews, or social
media reactions into actionable, quantified results. These results can then be analyzed for
customer insight and further strategic results.
Sentiment analysis determines the sentiment expressed in text, such as positive, negative,
or neutral. Techniques include lexicon-based methods, machine learning classifiers (e.g.,
Support Vector Machines, Naive Bayes), and deep learning models (e.g., CNNs, LSTMs,
Transformers).
6. Named Entity Recognition (NER):
Named Entity Recognition, or NER (because we in the tech world are huge fans of our acronyms),
is a Natural Language Processing technique that tags named entities within text and extracts
them for further analysis.
NER is similar to sentiment analysis in how it processes text. NER, however, simply tags the
entities, whether they are organization names, people, proper nouns, locations, etc., and keeps
a running tally of how many times they occur within a dataset.
How many times an identity (meaning a specific thing) crops up in customer feedback can
indicate the need to fix a certain pain point. Within reviews and searches it can indicate a
preference for specific kinds of products, allowing you to custom tailor each customer journey
to fit the individual user, thus improving their customer experience.
The limits to NER’s application are only bounded by your feedback and content teams’
imaginations.
NER identifies and categorizes named entities such as persons, organizations, locations, dates,
and numerical expressions in text.
Techniques include rule-based approaches, statistical models (e.g., Conditional Random Fields),
and neural network-based models (e.g., BiLSTM-CRF).
7. Lemmatization and Stemming:
More technical than the other topics here, lemmatization and stemming refer to the breakdown,
tagging, and restructuring of text data based on either the root stem or the dictionary
definition of a word.
That might seem like saying the same thing twice, but both sorting processes can lend different,
valuable data. By understanding each process, you should be well on your way to a smooth and
successful NLP application.
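The sketch below contrasts the two processes using NLTK's Porter stemmer and WordNet lemmatizer
(WordNet data assumed to be downloadable).
# Illustrative sketch: stemming vs. lemmatization with NLTK
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "happiness"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))
print("better ->", lemmatizer.lemmatize("better", pos="a"))  # 'good' when treated as an adjective
# stemming chops affixes (studies -> studi); lemmatization maps to dictionary forms (studies -> study)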
NL tasks:
Natural language processing (NLP) encompasses a wide range of tasks aimed at understanding,
generating, and processing human language. Some common NLP tasks:
Text Classification:
Assigning predefined categories or labels to text documents based on their content. Examples
include sentiment analysis, topic classification, spam detection, and language identification.
Named Entity Recognition (NER):
Identifying and classifying named entities mentioned in text, such as persons, organizations,
locations, dates, and numerical expressions.
Part-of-Speech (POS) Tagging:
Assigning grammatical categories (e.g., noun, verb, adjective) to each word in a sentence.
Syntactic Parsing:
Analyzing the grammatical structure of sentences to identify phrases and dependencies between
words.
Semantic Parsing:
Translating natural language utterances into formal representations, such as logical forms or
executable queries, for tasks like question answering and database querying.
Coreference Resolution:
Identifying which words or phrases in a text refer to the same entity, resolving pronouns and
noun phrases to their corresponding antecedents.
Sentiment Analysis:
Determining the sentiment expressed in text, such as positive, negative, or neutral. This can
be applied at the document, sentence, or aspect level.
Text Summarization:
Generating concise summaries of longer texts while preserving the most important information
and key points.
Machine Translation:
Translating text from one language to another automatically using computational methods.
Question Answering:
Generating answers to questions posed in natural language, often based on a given context or
a large corpus of knowledge.
Language Generation:
Generating human-like text based on input prompts or contexts. This includes tasks such as
text generation, dialogue generation, and story generation.
Text Similarity and Clustering:
Measuring the similarity between text documents or clustering them into groups based on
their semantic content or features.
Text Alignment:
Aligning text segments across languages, documents, or versions for tasks such as bilingual
dictionary creation, parallel corpus alignment, and plagiarism detection.
Speech Recognition:
Transcribing spoken language into written text, often as part of automatic speech recognition
(ASR) systems.
Segmentation:
Text segmentation is the process of dividing written text into meaningful units, such as words,
sentences, or topics. The term applies both to mental processes used by humans when reading
text, and to artificial processes implemented in computers, which are the subject of natural
language processing.
• Word segmentation - Word segmentation is the problem of dividing a string of written
language into its component words.
• Intent segmentation - Intent segmentation is the problem of dividing written words into
key phrases (groups of 2 or more words).
• Sentence segmentation - Sentence segmentation is the problem of dividing a string of
written language into its component sentences.
• Topic segmentation - Topic analysis consists of two main tasks: topic identification and
text segmentation. While the first is a simple classification of a specific text, the latter
case implies that a document may contain multiple topics, and the task of computerized
text segmentation may be to discover these topics automatically and segment the text
accordingly. The topic boundaries may be apparent from section titles and paragraphs.
In other cases, one needs to use techniques similar to those used in document
classification.
Segmenting the text into topics or discourse turns can be useful in some natural language
processing tasks: it can improve information retrieval or speech recognition significantly.
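A short sentence-segmentation sketch with NLTK is given below (punkt data assumed to be
downloadable); note how it avoids splitting on the abbreviation "Mr.".
# Illustrative sketch: sentence segmentation with NLTK
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = "Mr. Smith went to Washington. He arrived at 10 a.m. and gave a speech."
print(sent_tokenize(text))
# -> ['Mr. Smith went to Washington.', 'He arrived at 10 a.m. and gave a speech.']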
Tagging:
POS Tagging (Parts of Speech Tagging) is a process to mark up the words in a text for a
particular part of speech based on its definition and context. It is responsible for text
reading in a language and assigning some specific token (Parts of Speech) to each word. It is
also called grammatical tagging.
For example:
Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
Steps Involved in the POS tagging example:
Tokenize the text (word_tokenize).
Apply pos_tag to the tokens from the previous step, i.e. nltk.pos_tag(tokens).
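Putting these two steps together in NLTK (tagger and tokenizer data assumed to be downloadable):
# Illustrative sketch: POS tagging with NLTK
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

tokens = word_tokenize("Everything to permit us.")
print(pos_tag(tokens))
# e.g. [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]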
Chunking:
Chunking in NLP is a process to take small pieces of information and group them into large
units. The primary use of Chunking is making groups of "noun phrases." It is used to add
structure to the sentence by following POS tagging combined with regular expressions. The
resulted group of words are called "chunks." It is also called shallow parsing.
In shallow parsing, there is at most one level between roots and leaves, while deep parsing
comprises more than one level. Shallow parsing is also called light parsing or chunking.
There are no pre-defined rules, but you can combine them according to need and
requirement.
For example, you need to tag Noun, verb (past tense), adjective, and coordinating
conjunction(e.g. and, but..) from the sentence. You can use the rule as below
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
#Example:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk import RegexpParser

text = "learn NLP from NPTEL and make study easy".split()
print("After Split:", text)
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)
output = chunker.parse(tokens_tag)
print("After Chunking", output)
Output
After Split: ['learn', 'NLP', 'from', 'NPTEL', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('NLP', 'NNP'), ('from', 'IN'), ('NPTEL', 'NNP'), ('and', 'CC'),
('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Chunking (S
  (mychunk learn/JJ)
  (mychunk NLP/NNP)
  from/IN
  (mychunk NPTEL/NNP and/CC)
  make/VB
  (mychunk study/NN easy/JJ))
Chunking is used for entity detection. An entity is that part of the sentence by which
machine gets the value for any intention.
Example:
Temperature of India.
Here Temperature is the intention and India is an entity.
In other words, chunking is used as selecting the subsets of tokens.
NER:
Named entity recognition (NER) is one of the most common uses of Information Extraction
(IE) technology.
NER systems identify different types of proper names, such as person and company names, and
sometimes special types of entities, such as dates and times, that can be easily recognized.
NER is especially important in biomedical applications, where terminology is a formidable
(difficult) problem. But IE is much more than just NER.
A much more difficult and potentially much more significant capability is the recognition of
events and their participants.
For example, in each of the sentences
“Microsoft acquired Powerset.”
“Powerset was acquired by Microsoft.”
we would like to recognize not only that Microsoft and Powerset are company names, but also
that an acquisition event took place, that the acquiring company was Microsoft, and the
acquired company was Powerset.
For computers, however, we need to help them recognize entities first so that they can
categorize them.
This is done through machine learning and Natural Language Processing (NLP).
NLP studies the structure and rules of language and creates intelligent systems capable of
deriving meaning from text and speech, while machine learning helps machines learn and
improve over time.
To learn what an entity is, an NER model needs to be able to detect a word, or string of
words that form an entity (e.g. New York City) and know which entity category it belongs to.
So first, we need to create entity categories, like Name, Location, Event, Organization, etc.,
and feed NER model relevant training data. Then, by tagging some word and phrase samples
with their corresponding entities, you’ll eventually teach your NER model how to detect
entities itself.
You may create an entity extractor to extract people from a given corpus, or company names from
a given corpus, or any other entity.
Extracting the main entities in a text helps sort unstructured data and detect important
information, which is crucial if you have to deal with large datasets.
Content Recommendation: if you watch a lot of comedies on Netflix, you’ll get more
recommendations that have been classified as the entity Comedy.
Process Resumes: By using an entity extractor, recruitment teams can instantly extract the
most relevant information about candidates, from personal information (like name, address,
phone number, date of birth and email), to data related to their training and experience
(such as certifications, degree, company names, skills, etc).
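As a concrete sketch, the example below runs spaCy's pre-trained English model over the
acquisition sentence discussed earlier; it assumes spaCy and the en_core_web_sm model are
installed.
# Illustrative sketch: named entity recognition with spaCy
# (pip install spacy && python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft acquired Powerset.")
for ent in doc.ents:
    print(ent.text, ent.label_)    # typically: Microsoft ORG, Powerset ORG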
Parsing:
There is a significant difference between NLP and traditional machine learning tasks, with
the former dealing with unstructured text data while the latter deals with structured
tabular data. Hence, it is necessary to understand how to deal with text before applying
machine learning techniques to it. This is where text parsing comes into the picture.
Text parsing is a common programming task that separates the given series of text into
smaller components based on some rules. Its application ranges from document parsing to
deep learning NLP.
Text parsing can be done with the two popular options are regular expressions and word
tokenization.
An assumption in most work in natural language processing is that the basic unit of meaning
analysis is the sentence: a sentence expresses a proposition(plan / suggestion) , an idea, or a
thought, and says something about some real or imaginary world.
Extracting the meaning from a sentence is thus a key issue. Sentences are not, however, just
linear sequences of words, and so it is widely recognized that to carry out this task requires
an analysis of each sentence, which determines its structure in one way or another.
In NLP approaches based on generative linguistics, this is generally taken to involve the
determining of the syntactic or grammatical structure of each sentence.
Word Sense Disambiguation:
Part-of-speech (POS) taggers with a high level of accuracy can resolve a word's syntactic
ambiguity.
On the other hand, the problem of resolving semantic ambiguity is called WSD (word sense
disambiguation).
Resolving semantic ambiguity is harder than resolving syntactic ambiguity.
For example, consider the two distinct senses that exist for the word “bass” −
I can hear bass sound.
He likes to eat grilled bass.
The occurrences of the word "bass" clearly denote distinct meanings: in the first sentence it
means frequency (low-pitched) and in the second it means a fish. Hence, if the sentences are
disambiguated by word sense disambiguation (WSD), the correct meanings can be assigned as
follows −
I can hear bass/low-pitched frequency sound.
He likes to eat grilled bass/fish.
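NLTK ships a simple implementation of the Lesk algorithm that can be used to sketch this
(WordNet and tokenizer data assumed to be downloadable); Lesk picks the WordNet sense whose
dictionary gloss overlaps most with the context, so its results can be imperfect.
# Illustrative sketch: word sense disambiguation with the Lesk algorithm in NLTK
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize
from nltk.wsd import lesk

sense = lesk(word_tokenize("He likes to eat grilled bass."), "bass")
print(sense, "-", sense.definition() if sense else "no sense found")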
NL Generation:
Natural language generation is one of the frontiers of artificial intelligence. It is the idea that
computers and technologies can take non-language sources, for example, Excel spreadsheets,
videos, metadata and other sources, and create natural language outputs that seem human,
given that humans are the only biological creatures that use complex natural language.
NLG solutions are made of three main components:
▪ the data behind the narrative,
▪ the conditional logic and software that makes sense of that data, and
▪ the resulting narrative text that is generated from them.
From video game and fantasy football match recaps, to custom BI dashboard analysis and client
communications, natural language generation is valuable wherever there is a need to generate
content from data.
Natural language generation empowers organizations to create data-driven narratives that are
personalized and insightful, and that sound as if a human wrote each one individually, all at a
massive scale.
NLG is different from NLP. NLP is focused on deriving analytic insights from textual data, while
NLG is used to synthesize textual content by combining analytic output with contextualized
narratives.
In other words, NLP reads while NLG writes. NLP systems look at language and figure out what
ideas are being communicated. NLG systems start with a set of ideas locked in data and turn
them into language that, in turn, communicates them.
For example, using the historical data for July 1, 2005, the software produces:
Grass pollen levels for Friday have increased from the moderate to high levels of yesterday
with values of around 6 to 7 across most parts of the country. However, in Northern areas,
pollen levels will be moderate with values of 4.
For example, below first two sentences provide different meanings. However, if the second
event occurs right before half time, then these two sentences can be aggregated like the third
sentence:
“[X team] maintained their lead into halftime. “
“VAR (video assistant referee) overruled a decision to award [Y team]’s [Football player Z] a
penalty after replay showed [Football player T]’s apparent kick didn’t connect.”
“[X team] maintained their lead into halftime after VAR overruled a decision to award [Y
team]’s [Football player Z] a penalty after replay showed [Football player T]’s apparent kick
didn’t connect.”
• Grammaticalization: Grammaticalization stage makes sure that the whole report follows
the correct grammatical form, spelling, and punctuation. This includes validation of
actual text according to the rules of syntax, morphology, and orthography. For instance,
football games are written in the past tense.
• Language Implementation: This stage involves inputting data into templates and
ensuring that the document is output in the right format and according to the
preferences of the user.
Web 2.0 Applications - Sentiment Analysis:
Sentence-level SA aims to classify sentiment expressed in each sentence. The first step is to
identify whether the sentence is subjective or objective. If the sentence is subjective,
Sentence-level SA will determine whether the sentence expresses positive or negative
opinions. Classifying text at the document level or at the sentence level does not provide the
necessary detail needed in many applications. To obtain these details, we need to go to the
aspect level.
Aspect-level SA aims to classify the sentiment with respect to the specific aspects of
entities. The first step is to identify the entities and their aspects.
It involves breaking down text data into smaller fragments, allowing you to obtain more
granular and accurate insights from your data.
For example: “The food was great but the service was poor.”
In cases like this, there is more than one sentiment and more than one topic in a single
sentence, so to label the whole review as either positive or negative would be incorrect. Use
aspect-based sentiment analysis here, which extracts and separates each aspect and
sentiment polarity in the sentence.
In this instance, the aspects are Food and Service, resulting in the following sentiment
attribution:
The food was great = Food → Positive
The service was poor = Service → Negative
With the proper representation of the text after text pre-processing, many of the
techniques, such as clustering and classification, can be adapted to text mining.
For example, the k-means can be modified to cluster text documents into groups, where each
group represents a collection of documents with a similar topic.
The distance of a document to a centroid represents how closely the document talks about
that topic.
Classification tasks such as sentiment analysis and spam filtering are prominent use cases for
the Naive Bayes classifier. Text mining may utilize methods and techniques from natural
language processing.
Text Entailment:
Textual entailment (TE- semantic interpretation) in natural language processing is a directional
relation between text fragments. The relation holds whenever the truth of one text fragment
follows from another text.
In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h),
respectively.
Textual entailment measures natural language understanding as it asks for a semantic
interpretation of the text, and due to its generality remains an active area of research.
Examples
Textual entailment can be illustrated with examples of three different relations:
An example of a positive TE (text entails hypothesis) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has good consequences.
An example of a negative TE (text contradicts hypothesis) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man has no consequences.
An example of a non-TE (text does not entail nor contradict) is:
text: If you help the needy, God will reward you.
hypothesis: Giving money to a poor man will make you a better person.
Cross Lingual Information Retrieval (CLIR):
Cross-language information retrieval refers more specifically to the use case where users
formulate their information need in one language and the system retrieves relevant documents
in another.
Approaches to CLIR
Various approaches can be adopted to create a cross lingual search system. They are as
follows:
Query translation approach: In this approach, the query is translated into the language of
the document. Many translation schemes could be possible like dictionary-based translation or
more sophisticated machine translations.
The dictionary-based approach uses a lexical resource like bi-lingual dictionary to translate
words from source language to target document language. This translation can be done at word
level or phrase level.
The main assumption in this approach is that the user can read and understand documents in the
target language. If the user is not conversant with the target language, he/she can use external
tools to translate the document from the foreign language into his/her native language. Such
tools may not be available for all language pairs.
Document translation approach: This approach translates the documents in foreign languages
to the query language. Although this approach eases the problem stated above, this approach
has scalability issues.
There are too many documents to be translated and each document is quite large as compared
to a query. This makes the approach practically unsuitable.
Interlingual based approach: In this case, the documents and the query are both translated
into some common Interlingua. This approach generally requires huge resources as the
translation needs to be done online. A possible solution to overcome the problems in query and
document translations is to use query translation followed by snippet translation instead of
document translation.
A snippet generally contains parts of a document containing query terms. This can give a clue
to the end user about usability of document. If the user finds it useful, then document
translation can be used to translate the document in language of the user. With every approach
comes a challenge with an associated cost.