NLP Module 1
21AIM72
Dr.Jimsha K Mathew
SAP/AIML
Module 1
NLG converts a computer’s machine-readable language into text and can also convert
that text into audible speech using text-to-speech technology.
First, the NLP system identifies what data should be converted to text.
If you asked the computer a question about the weather, it most likely did an online search to find
your answer, and from there it decides that the temperature, wind, and humidity are the factors
that should be read aloud to you.
Then, it organizes the structure of how it’s going to say it.
An NLG system can construct full sentences using a lexicon and a set of grammar rules.
Finally, text-to-speech takes over.
TTS first analyzes the structure of the sentence and its punctuation, then uses
linguistic rules to determine the pronunciation of each word and select its phonetic representation.
The text-to-speech engine uses a Prosody model to evaluate the text and identify breaks,
duration, and pitch. The engine then combines all the recorded phonemes into one cohesive
string of speech using a speech database. (A prosody model in linguistics refers to the
patterns of rhythm, stress, and intonation in speech.)
NLU and NLG
NLU
NLU can understand and process the meaning of
speech or text of a natural language. To do so,
NLU systems need a lexicon of the language, a
software component called a parser for taking
input data and building a data structure,
grammar rules, and a semantic theory. NLU’s
core functions are understanding unstructured
data and converting text into a structured data set
which a machine can more easily consume.
NLG
NLG is a software process that turns
structured data – converted by NLU and a
(generally) non-linguistic representation of
information – into a natural language output
that humans can understand, usually in text
format.
Disadvantages of NLP
• For the training of the NLP model, a lot of data and computation are
required.
• Many issues arise for NLP when dealing with informal expressions,
idioms, and cultural jargon.
• They often have multiple meanings and can be context-dependent. NLP models
may struggle to accurately interpret and disambiguate such expressions.
• NLP results are sometimes inaccurate, and their accuracy is directly
proportional to the accuracy of the data.
• NLP systems are often designed for a single, narrow task, since they cannot
easily adapt to new domains and have limited functionality.
Origin and challenges of NLP
History of NLP
There are different rules for different languages. The syntax represents the set
of rules that the official language will have to follow. Violation of these rules
will give a syntax error.
Syntactic analysis in Natural Language Processing (NLP) involves parsing
sentences or text to analyze the grammatical structure and relationships
between words and phrases.
Here the sentence is transformed into a structure that represents the correlations
between the words. These correlations may occasionally violate the grammar rules.
For example, “To the movies, we are going.” would give a syntax error.
The syntactic analysis uses the results given by morphological analysis to
develop the description of the sentence. This process is called parsing.
Syntactic Analysis:
Example Sentence: "The cat chased the mouse."
In syntactic analysis, together with the morphological step, the sentence can be broken down into its
constituent parts and analyzed as follows:
1. Tokenization: The first step in NLP is typically tokenization, where the sentence is divided into individual
tokens (words or punctuation marks). In this example, the tokens are: "The," "cat," "chased," "the,"
"mouse," and "."
2. Part-of-Speech (POS) Tagging: Each token is assigned a part-of-speech tag that represents its grammatical
category. Common POS tags include nouns, verbs, adjectives, adverbs, pronouns, and more. For the example
sentence:
1. "The" is tagged as a determiner (DET).
2. "cat" is tagged as a noun (NOUN).
3. "chased" is tagged as a verb (VERB).
4. "the" is tagged as a determiner (DET).
5. "mouse" is tagged as a noun (NOUN).
3. Parsing: Parsing involves determining the syntactic structure of the sentence, including how words relate to
each other.
From the grammar rules, a parse tree is constructed for the sentence.
If no complete parse tree can be built for the sentence, the sentence is syntactically invalid.
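The tokenization and POS-tagging steps above can be sketched in plain Python. This is a minimal illustration: the regular expression and the hand-written lookup table (a toy lexicon standing in for a trained tagger) are assumptions for this example only.

```python
import re

# Step 1: Tokenization -- split the sentence into word and punctuation tokens.
def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

# Step 2: POS tagging -- a toy lookup table standing in for a trained tagger.
LEXICON = {
    "the": "DET",
    "cat": "NOUN",
    "mouse": "NOUN",
    "chased": "VERB",
}

def pos_tag(tokens):
    return [(tok, LEXICON.get(tok.lower(), "PUNCT" if not tok.isalnum() else "UNK"))
            for tok in tokens]

tokens = tokenize("The cat chased the mouse.")
print(tokens)            # ['The', 'cat', 'chased', 'the', 'mouse', '.']
print(pos_tag(tokens))
```

A real system would use a statistical or neural tagger rather than a fixed lexicon, since most words are ambiguous between tags.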
For example, "The cat chases the mouse in the garden" would be
represented as a parse tree.
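One plausible constituency analysis of that sentence can be written as a bracketed tree. The sketch below builds it from nested tuples and prints the bracketed form; the node labels (S, NP, VP, PP) are standard, but this particular analysis is an assumption for illustration.

```python
# A parse tree as nested (label, children...) tuples.
tree = ("S",
        ("NP", ("DET", "the"), ("N", "cat")),
        ("VP", ("V", "chases"),
               ("NP", ("DET", "the"), ("N", "mouse")),
               ("PP", ("P", "in"),
                      ("NP", ("DET", "the"), ("N", "garden")))))

def render(node):
    """Render a tree as a bracketed string, e.g. (S (NP ...) (VP ...))."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + label + " " + " ".join(render(c) for c in children) + ")"

print(render(tree))
# (S (NP (DET the) (N cat)) (VP (V chases) (NP (DET the) (N mouse))
#    (PP (P in) (NP (DET the) (N garden)))))
```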
Concept of Parser:
Semantic Analysis
The semantic analysis looks after the meaning.
It allocates the meaning to all the structures built by the syntactic analyzer.
Then every syntactic structure and the objects are mapped together into the
task domain.
If the mapping is possible, the structure is accepted; if not, it is rejected. For example,
“hot ice-cream” will give a semantic error.
During semantic analysis two main operations are executed:
First, each separate word will be mapped with appropriate objects in the database.
The dictionary meaning of every word will be found. A word might have more than
one meaning.
Secondly, all the meanings of each different word will be integrated to find a proper
correlation between the word structures. This process of determining the correct
meaning is called lexical disambiguation. It is done by associating each word with
the context.
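The two operations above can be sketched as follows. The sense inventory and the context-clue words here are made up for illustration; real systems use lexical resources such as WordNet and far more sophisticated scoring.

```python
# A toy sense inventory: each word maps to (sense, context-clue words) pairs.
SENSES = {
    "bank": [
        ("financial institution", {"money", "deposit", "loan"}),
        ("river side",            {"river", "water", "fishing"}),
    ],
}

def disambiguate(word, context_words):
    """Pick the sense whose clue words overlap most with the context
    (a crude stand-in for lexical disambiguation)."""
    best_sense, best_overlap = None, -1
    for sense, clues in SENSES[word]:
        overlap = len(clues & set(context_words))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", ["he", "sat", "on", "the", "river", "bank"]))
# river side
```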
Pragmatic Analysis
This phase checks the real-world knowledge or context to derive the real
meaning of the sentence(Intended meaning of the sentence).
• Eg: When we are ready to leave someone’s house, we don’t say “I would like to leave
now, so let’s end this conversation.”
• We simply say “Well, it is getting late.”
Sentence parsing
A parser in NLP uses the grammar rules (formal grammar rules) to verify if the
input text is valid or not syntactically.
The parser builds a structural representation of the provided text, which is a
first step toward extracting its meaning.
Because the parser helps us detect syntax errors in the text, the parsing
process is also known as syntax analysis or syntactic analysis.
We have mainly two types of parsing techniques: top-down parsing and
bottom-up parsing.
In the top-down parsing approach, construction of the parse tree starts
from the root node. In the bottom-up parsing approach, construction
of the parse tree starts from the leaf nodes.
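A top-down parser can be sketched as a recursive-descent recognizer: starting from the root symbol S, it expands grammar rules and tries to match the input left to right. The tiny grammar below is an assumption for illustration, not a grammar of English.

```python
# A tiny context-free grammar: nonterminal -> list of alternative productions.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["DET", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "DET": [["the"]],
    "N":   [["cat"], ["mouse"]],
    "V":   [["chased"], ["slept"]],
}

def parse(symbol, tokens, pos):
    """Try to derive tokens[pos:] from `symbol`, expanding rules top-down.
    Returns the set of positions reachable after consuming the symbol."""
    if symbol not in GRAMMAR:                      # terminal symbol
        if pos < len(tokens) and tokens[pos] == symbol:
            return {pos + 1}
        return set()
    results = set()
    for production in GRAMMAR[symbol]:
        positions = {pos}
        for sym in production:
            positions = {q for p in positions for q in parse(sym, tokens, p)}
        results |= positions
    return results

def accepts(sentence):
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)

print(accepts("the cat chased the mouse"))  # True
print(accepts("cat the chased"))            # False
```

A bottom-up parser would instead start from the words themselves ("the", "cat", ...) and repeatedly reduce matched productions until it reaches S.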
Probability and Statistics
“Once upon a time, there was a . . . ”
• Can you guess the next word?
• Hard in general, because language is not deterministic.
• But some words are more likely than others.
• We can model uncertainty using probability theory.
• We can use statistics to ground our models in empirical data.
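The guess-the-next-word idea can be grounded in counts: given a corpus, estimate P(next word | previous word) by relative frequency and pick the most likely continuation. The toy corpus below is an assumption for illustration.

```python
from collections import Counter, defaultdict

corpus = ("once upon a time there was a princess . "
          "once upon a time there was a frog .").split()

# Count bigrams: how often each word follows each history word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def most_likely_next(word):
    """Return the most frequent continuation of `word` and its probability."""
    counts = following[word]
    nxt, c = counts.most_common(1)[0]
    return nxt, c / sum(counts.values())

print(most_likely_next("upon"))  # ('a', 1.0) -- "upon" is always followed by "a"
print(most_likely_next("a"))     # ('time', 0.5)
```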
Statistical Inference
• Statistical inference in Natural Language Processing (NLP) involves drawing conclusions
about a language's underlying structure and patterns based on statistical models and
sample data. This technique is essential for enabling machines to make predictions,
classify text, and understand language probabilistically. Two main kinds of statistical
inference:
• 1. Estimation
• 2. Hypothesis testing
In natural language processing:
• Estimation – learn model parameters
Methods include Maximum Likelihood Estimation (MLE), Bayesian Inference, or Markov Chain Monte Carlo (MCMC),
depending on the complexity of the model.
• Hypothesis tests – assess statistical significance of test results
• Example: A/B testing in machine translation systems can statistically compare two models’ performance to determine which
produces more accurate translations.
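Maximum Likelihood Estimation, in its simplest form, estimates a word's probability as its relative frequency in the sample. A minimal sketch on a toy corpus (the corpus is an assumption for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total = len(corpus)

# MLE: P(w) = count(w) / N, the estimate that maximizes the likelihood
# of the observed sample under a unigram model.
p_mle = {w: c / total for w, c in counts.items()}

print(p_mle["the"])  # 2/6 ≈ 0.333
print(p_mle["cat"])  # 1/6 ≈ 0.167
```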
Language modeling (LM)
A language model assigns probabilities to sequences of tokens; the tokens can be:
• characters
• words (individual words or sets of multiple words together)
• part of words
• punctuations
• sentences
• regular expressions
Why do we tokenize?
Word tokenization
• NLTK offers a bunch of different methods for word tokenization. We
will explore the following:
1. word_tokenize()
2. TreebankWordTokenizer
3. WordPunctTokenizer
4. RegEx
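The NLTK tokenizers listed above require the nltk package; the regex approach (method 4) can be shown with Python's standard `re` module alone. The patterns below are approximations of the tokenizers' behavior, not NLTK's actual implementations: the first mimics WordPunctTokenizer, which splits contractions apart.

```python
import re

text = "Don't hesitate, it's easy!"

# RegEx tokenization: runs of word characters vs runs of punctuation.
# Approximates NLTK's WordPunctTokenizer ("Don't" -> "Don", "'", "t").
wordpunct = re.findall(r"\w+|[^\w\s]+", text)
print(wordpunct)

# A different pattern that keeps contractions together as single tokens.
keep_contractions = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(keep_contractions)
```

Comparing the two outputs shows why the choice of tokenizer matters: downstream components see different token sequences for the same text.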
Inflectional morphology
Derivational morphology.
1. Inflectional Morphology
•Example in English:
• walk → walks (plural)
• run → ran (past tense)
• big → bigger (comparative)
2. Derivational Morphology
•Example in English:
• happy → unhappy (prefix changing meaning)
• teach → teacher (verb to noun)
• nation → national (noun to adjective)
Derivational morphemes have more semantic impact and can create
entirely new words from existing ones.
Morphological parsing
Stemming
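Stemming strips affixes to reduce inflected forms to a common base. The suffix list below is a simplified sketch for illustration; it is not the Porter algorithm, which applies ordered rules with conditions on the remaining stem.

```python
# A naive suffix-stripping stemmer (illustrative only).
SUFFIXES = ["ational", "ing", "ers", "er", "ed", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # Require at least 3 characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walks", "walked", "walking", "teachers", "cat"]:
    print(w, "->", stem(w))
# walks -> walk, walked -> walk, walking -> walk, teachers -> teach, cat -> cat
```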
The word giraffe, which differs by only one letter from graffe,
seems intuitively more similar to it than, say, grail or graf.
• We can distinguish three increasingly broader problems:
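That intuition of similarity can be made precise with minimum edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                               # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                               # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

for w in ["giraffe", "grail", "graf"]:
    print(w, edit_distance("graffe", w))
# giraffe 1, grail 3, graf 2 -- giraffe is indeed the closest candidate
```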
THANK YOU
IMPORTANT QUESTIONS