
NATURAL LANGUAGE PROCESSING

21AIM72
Dr. Jimsha K Mathew
SAP/AIML
Module 1

Origins and challenges of NLP – Language Modeling: Grammar-based LM, Statistical LM – Regular Expressions, Finite-State Automata – English Morphology, Transducers for lexicon and rules, Tokenization, Detecting and Correcting Spelling Errors, Minimum Edit Distance
Natural Language Processing
Language
• Human-to-human communication – verbal (English, Hindi, Kannada, Tamil, Telugu, Marathi) or non-verbal (sign language, hand gestures).
• Machine-to-machine communication – high-level languages, assembly-level languages, low-level languages.
Natural language
• Languages that are spoken naturally by human beings – English, Hindi, Kannada, Tamil, Telugu, Marathi.
What is Natural Language Processing?
• Natural language processing is the ability of a computer program to understand human language as it is spoken.
• How do humans communicate with each other? We listen, interpret, understand, and reply.
• A computer should replicate the same process; the broader aim is also to enable two computers to communicate with each other.
Study of Human Languages
Language is a crucial component of human lives and also the most fundamental aspect of our behavior. We experience it in mainly two forms – written and spoken.
In the written form, it is a way to pass our knowledge from one generation to the next.
In the spoken form, it is the primary medium for human beings to coordinate with each other in their day-to-day behavior.
Language is studied in various academic disciplines. Each discipline comes with its own set of problems and a set of solutions to address those.
What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of AI that helps computers understand, interpret and manipulate human languages in order to analyze them and derive their meaning.
Human language can be in text or audio form.
 NLP helps developers to organize and structure knowledge to perform tasks like translation, summarization, named entity recognition, relationship extraction, speech recognition, topic segmentation, etc.
Applications of Natural Language Processing
The applications of Natural Language Processing are as follows:
Spell checker in MS Word – dictionary-based lookup.
Mail filtering – reads mail and detects spam based on keywords, content and sender.
Text prediction –
• When we type in Google, it suggests the most probable completion.
• Automatic mail-reply suggestions are based on the mail we receive, e.g. "Thank you", "Thanks a lot".
Machine Translation: NLP is used to develop machine translation systems like Google Translate, which can automatically translate text or speech from one language to another.
Applications of Natural Language
Processing
Sentiment Analysis: NLP can analyze text data from sources such as social media,
customer reviews, or surveys to determine sentiment or opinions. This is valuable
for brand monitoring, market research, and customer feedback analysis.
Chatbots and Virtual Assistants: NLP is behind the development of chatbots and
virtual assistants like Siri, Alexa, and chatbots on websites. These systems can
understand and respond to natural language queries and commands.
Named Entity Recognition (NER): NLP can identify and classify entities such as
names of people, places, organizations, and dates within text, which is useful in
information extraction and document categorization.
Text Summarization: NLP can automatically generate concise summaries of long
documents, which is useful in content curation, news aggregation, and document
summarization.
Working of Natural Language Processing
(NLP)
• Working in natural language processing (NLP) typically involves using
computational techniques to analyze and understand human language.
• This can include tasks such as language understanding, language
generation, and language interaction.
• The field is divided into three different parts:
1. Speech Recognition — The translation of spoken language into text.
2. Natural Language Understanding (NLU) — The computer’s ability to
understand what we say.
3. Natural Language Generation (NLG) — The generation of natural
language by a computer.
Speech Recognition:
First, the computer must take natural language and convert it into machine-readable
language. This is what speech recognition or speech-to-text does. This is the first step of
NLU.
Hidden Markov Models (HMMs) are used in the majority of voice recognition systems
nowadays.
 These are statistical models that use mathematical calculations to determine what you said in order to
convert your speech to text.
HMMs do this by listening to you talk,
 breaking it down into small units (typically 10-20 milliseconds),
 comparing it to pre-recorded speech to figure out which phoneme you uttered in each unit (a
phoneme is the smallest unit of speech).
 The program then examines the sequence of phonemes and uses statistical analysis to determine the
most likely words and sentences you were speaking.
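A minimal Viterbi-decoding sketch for an HMM, just to make the "most likely sequence" idea concrete. All state names, probabilities and observation symbols below are made-up toy values (not real acoustic data), and the variable names are illustrative assumptions:

states = ["/k/", "/ae/", "/t/"]
start_p = {"/k/": 0.8, "/ae/": 0.1, "/t/": 0.1}
trans_p = {"/k/": {"/k/": 0.1, "/ae/": 0.8, "/t/": 0.1},
           "/ae/": {"/k/": 0.1, "/ae/": 0.1, "/t/": 0.8},
           "/t/":  {"/k/": 0.3, "/ae/": 0.3, "/t/": 0.4}}
emit_p = {"/k/":  {"o1": 0.7, "o2": 0.2, "o3": 0.1},
          "/ae/": {"o1": 0.1, "o2": 0.7, "o3": 0.2},
          "/t/":  {"o1": 0.1, "o2": 0.1, "o3": 0.8}}
observations = ["o1", "o2", "o3"]   # stand-ins for the 10-20 ms acoustic frames

# Viterbi: best probability of reaching each state at each time step, with a backpointer.
V = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
for obs in observations[1:]:
    prev_layer = V[-1]
    V.append({s: max((prev_layer[p][0] * trans_p[p][s] * emit_p[s][obs], p)
                     for p in states) for s in states})

# Backtrace the most likely hidden (phoneme) sequence.
state = max(V[-1], key=lambda s: V[-1][s][0])
path = [state]
for layer in reversed(V[1:]):
    state = layer[state][1]
    path.append(state)
path.reverse()
print(path)   # e.g. ['/k/', '/ae/', '/t/'] with these toy numbers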
Natural Language Understanding (NLU) :
The next and hardest step of NLP is the understanding part.
First, the computer must understand the meaning of each word.
It tries to figure out whether the word is a noun or a verb, whether it’s in the past or
present tense, and so on. This is called Part-of-Speech tagging (POS).
A lexicon (a vocabulary) and a set of grammatical rules are also built into NLP
systems. The most difficult part of NLP is understanding.
The main steps involved are tokenization, POS tagging, syntactic analysis, semantic analysis and knowledge-base analysis.
There are several challenges in accomplishing this when considering problems such
as words having several meanings (polysemy) or different words having similar
meanings (synonymy), but developers encode rules into their NLU systems and
train them to learn to apply the rules correctly.
Example
• Grammar is defined as the rules for forming well-
structured sentences.
• S → NP VP
• NP → DT N
• NP → DT N PP
• PP → P NP
• VP → V NP
• VP → V NP PP
Noun Phrase: A group of words that work together to describe a person,
place, thing or idea.
Eg: The green apple
A happy child
An old friend
A noun phrase typically includes: a noun + a determiner + adjectives + a PP.
A determiner is a word placed in front of a noun to specify quantity, ownership or specificity.
All determiners can be classified as one of the following:
Articles: (a/an, the)
Demonstratives: (this, that, these, those)
Possessives: (my, your, his, her, its, our, their)
Quantifiers: (many, much, more, most, some)
• Verb Phrase: a combination of an auxiliary verb (also known as a helping verb) and a main verb.
• My mother is cooking my favorite dish. (is: aux verb, cooking: main verb)
• I have been working for a long time. (have been: aux verb, working: main verb)

• Prepositional Phrase: a combination of a preposition, a modifier and its object.
• E.g.: The submarine sinks into the ocean.
• That puppy at the park is so happy.
Polysemy (different meaning for
same word)-Example
• Bank:
• Financial Institution: "I have an account at the bank."
• The Side of a River: "We had a picnic by the riverbank."
• Bat:
• Animal: "I saw a bat flying in the night."
• Sporting Equipment: "He swung the baseball bat."
• Crane:
• Bird Species: "I spotted a crane in the wetlands."
• Heavy Machinery: "The construction site had a large crane."
Synonymy (Different word has
same meaning)- Example
• Big and Large:
• Happy and Joyful:
• Fast and Quick:
• Smart and Intelligent:
Natural Language Generation (NLG):

NLG converts a computer’s machine-readable language into text and can also convert
that text into audible speech using text-to-speech technology.
First, the NLP system identifies what data should be converted to text.
 If you asked the computer a question about the weather, it most likely did an online search to find
your answer, and from there it decides that the temperature, wind, and humidity are the factors
that should be read aloud to you.
 Then, it organizes the structure of how it’s going to say it.
 NLG system can construct full sentences using a lexicon and a set of grammar rules.
Finally, text-to-speech takes over.
 The TTS engine first analyzes the structure of the sentence and its punctuation.
 It then uses linguistic rules for the pronunciation of words and selects a phonetic representation.
 The text-to-speech engine uses a prosody model to evaluate the text and identify breaks, duration, and pitch. The engine then combines all the recorded phonemes into one cohesive string of speech using a speech database. (A prosody model in linguistics refers to the patterns of rhythm, stress, and intonation in speech.)
NLU and NLG
NLU
NLU can understand and process the meaning of speech or text in a natural language. To do so, NLU systems need a lexicon of the language, a software component called a parser for taking input data and building a data structure, grammar rules, and a semantic theory. NLU's core functions are understanding unstructured data and converting text into a structured data set which a machine can more easily consume.

NLG
NLG is a software process that turns
structured data – converted by NLU and a
(generally) non-linguistic representation of
information – into a natural language output
that humans can understand, usually in text
format.
Advantages of NLP
• NLP helps us to analyze data from both structured and unstructured sources.
• NLP is very fast and time-efficient.
• NLP offers exact, end-to-end answers to a question, so it saves the time that would otherwise be spent on unnecessary and unwanted information.
• NLP allows users to ask questions about any subject and gives a direct response within milliseconds.
Disadvantages of NLP

• For the training of the NLP model, a lot of data and computation are
required.
• Many issues arise for NLP when dealing with informal expressions,
idioms, and cultural jargon.
• They often have multiple meanings and can be context-dependent. NLP models
may struggle to accurately interpret and disambiguate such expressions.
• NLP results are sometimes inaccurate, and accuracy is directly proportional to the accuracy of the data.
• Many NLP systems are designed for a single, narrow job, since they cannot adapt to new domains and have limited functionality.
Origin and challenges of NLP

History of NLP

 We have divided the history of NLP into four phases.

 The phases have distinctive concerns and styles.


Origins of Natural Language Processing (NLP):
1.Early Foundations (1950s-1960s):
1. Alan Turing's Influence: The concept of NLP can be
traced back to Alan Turing’s 1950 paper, "Computing
Machinery and Intelligence," where he introduced the
idea of a machine's ability to exhibit intelligent
behavior indistinguishable from that of a human,
laying the groundwork for artificial intelligence (AI)
and NLP.
2. Machine Translation: During the Cold War, there was
significant interest in automatic translation between
languages, especially for translating Russian texts into
English. The Georgetown-IBM experiment in 1954 was
one of the first demonstrations, translating more than
60 Russian sentences into English.
3. Chomsky's Linguistic Theories: Noam Chomsky’s work
in the 1950s on formal grammar and syntax,
particularly the idea of generative grammar, influenced
the early computational models for language
processing.
Rule-Based Systems (1960s-1980s):
 Symbolic NLP: Early NLP systems were heavily rule-based,
relying on hand-crafted linguistic rules. These systems used
syntactic and semantic rules to process language, focusing
on tasks like parsing sentences, information retrieval, and
machine translation.

Statistical Methods (1980s-1990s):


 Introduction of Probabilistic Models: In the late 1980s, the
NLP field shifted from rule-based approaches to statistical
methods. This change was driven by the availability of large
text corpora and computational power, allowing for the use
of probabilistic models like Hidden Markov Models (HMMs)
for tasks such as part-of-speech tagging and speech
recognition.
 Machine Learning Techniques: The 1990s saw the rise of
machine learning in NLP, with techniques like decision trees
and early neural networks being applied to language tasks.
Deep Learning and Modern NLP (2000s-Present):
1. Neural Networks and Deep Learning: The 2010s
marked a significant breakthrough with the advent of
deep learning. Techniques like recurrent neural
networks (RNNs), long short-term memory (LSTM), and
convolutional neural networks (CNNs) improved NLP
tasks such as sentiment analysis, translation, and
question answering.
2. Transformers and Large Language Models: The
introduction of the Transformer model in 2017
revolutionized NLP. Transformers, and later, models like
BERT, GPT, and T5, demonstrated superior performance
on a wide range of NLP tasks by leveraging attention
mechanisms and large-scale pre-training on diverse
datasets.
Challenges of NLP
Natural Language Processing (NLP) faces several challenges due to the
complexity and ambiguity of human language. Here are some key
challenges:
1. Ambiguity: Lexical, Syntactic, Semantic Ambiguity
2. Contextual Understanding: Same sentence may have different meaning.
3. Handling Idioms and Metaphors: Kick the bucket, Break the ice.
4. Sarcasm and Irony
5. Domain-Specific Knowledge
6. Morphological Complexity
7. Multilinguality
8. Coreference Resolution
9. Named Entity Recognition (NER)
10. Emotion and Sentiment Analysis
11. Speech Understanding
12. Data Scarcity and Bias
13. Ethical Concerns
Components of NLP
• The main components of Natural Language Processing in AI are:
• Morphological and Lexical Analysis
• Syntactic Analysis
• Semantic Analysis
• Pragmatic Analysis
Morphological Processing
It is the first phase of NLP.
It is the study of the different forms of words.
In morphological analysis, the sentence is analyzed word by word.
The purpose of this phase is to break chunks of language input into sets of
tokens corresponding to paragraphs, sentences and words.
Non-word tokens such as punctuation are removed from the words. Hence the
remaining words are assigned categories.
 For instance: Ram's iPhone cannot convert the video from .mkv to .mp4. Here, Ram is a proper noun, the 's in Ram's is assigned as a possessive suffix, and .mkv and .mp4 are assigned as file extensions.
Morphological Processing Contd..
Each word is assigned a syntactic category.
The file extensions present in the sentence are also identified; in the above example they behave as adjectives.
In the above example, the possessive suffix is also identified. This is a very important step, as the judgement of prefixes and suffixes depends on the syntactic category of the word.
For example, swims and swim's are different: in swims the -s suffix marks a plural noun or a third-person singular verb, while in swim's the 's marks possession.
If a prefix or suffix is incorrectly interpreted, the meaning and understanding of the sentence are completely changed.
The interpretation assigns a category to the word and hence removes uncertainty from it.
For example, a sentence like "The school goes to the boy" would be rejected by the syntax analyzer or parser.
Syntactic Analysis:

There are different rules for different languages. The syntax represents the set
of rules that the official language will have to follow. Violation of these rules
will give a syntax error.
Syntactic analysis in Natural Language Processing (NLP) involves parsing
sentences or text to analyze the grammatical structure and relationships
between words and phrases.
Here the sentence is transformed into a structure that represents the correlation between the words. This correlation might violate the rules occasionally.
For example, "To the movies, we are going." will give a syntax error.
The syntactic analysis uses the results given by morphological analysis to
develop the description of the sentence. This process is called parsing.
Syntactic Analysis:
Example Sentence: "The cat chased the mouse."

 In syntactic analysis along with morphological step, the sentence can be broken down into its constituent
parts and analyzed as follows:
1. Tokenization: The first step in NLP is typically tokenization, where the sentence is divided into individual
tokens (words or punctuation marks). In this example, the tokens are: "The," "cat," "chased," "the," and
"mouse."
2. Part-of-Speech (POS) Tagging: Each token is assigned a part-of-speech tag that represents its grammatical
category. Common POS tags include nouns, verbs, adjectives, adverbs, pronouns, and more. For the example
sentence:
1. "The" is tagged as a determiner (DET).
2. "cat" is tagged as a noun (NOUN).
3. "chased" is tagged as a verb (VERB).
4. "the" is tagged as a determiner (DET).
5. "mouse" is tagged as a noun (NOUN).
3. Parsing: Parsing involves determining the syntactic structure of the sentence, including how words relate to
each other.
From the grammar rules, a parse tree is built for the sentence. If a complete parse tree cannot be built for the sentence, the sentence is syntactically wrong.
For example, "the cat chases the mouse in the garden" would be represented as a parse tree (see the sketch below).
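The following is a minimal sketch of such a parse using NLTK. The grammar follows the production rules listed earlier in this module (S → NP VP, NP → DT N | DT N PP, PP → P NP, VP → V NP | V NP PP), extended with the lexical entries this sentence needs; those lexical rules are illustrative assumptions, not part of the original slides. Because of PP attachment, the parser may return more than one tree.

import nltk

# Toy grammar based on the rules listed earlier, plus illustrative lexical entries.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N | DT N PP
PP -> P NP
VP -> V NP | V NP PP
DT -> 'the'
N -> 'cat' | 'mouse' | 'garden'
V -> 'chases'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "the cat chases the mouse in the garden".split()
for tree in parser.parse(tokens):
    print(tree)  # prints each bracketed parse tree (PP attachment gives two)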
Concept of Parser:
Semantic Analysis
Semantic analysis looks after the meaning.
 It allocates meaning to all the structures built by the syntactic analyzer.
Then every syntactic structure and its objects are mapped together into the task domain.
 If the mapping is possible the structure is passed on; if not, it is rejected. For example,
"hot ice-cream" will give a semantic error.
 During semantic analysis two main operations are executed:
 First, each separate word will be mapped with appropriate objects in the database.
The dictionary meaning of every word will be found. A word might have more than
one meaning.
 Secondly, all the meanings of each different word will be integrated to find a proper
correlation between the word structures. This process of determining the correct
meaning is called lexical disambiguation. It is done by associating each word with
the context.
Pragmatic Analysis

This phase checks real-world knowledge or context to derive the real meaning of the sentence (the intended meaning of the sentence).

Pragmatic analysis in Natural Language Processing (NLP) is the process of


interpreting the meaning of a sentence based on the context in which it is
used, including factors like speaker intent, social norms, and real-world
knowledge. It goes beyond the literal meaning of words to understand the
intended meaning in a given situation.

• E.g.: When we are ready to leave someone's house, we don't say "I would like to leave now, so let's end this conversation."
• We simply say "Well, it is getting late".
Sentence parsing
A parser in NLP uses the grammar rules (formal grammar rules) to verify if the
input text is valid or not syntactically.
The parser helps us to get the meaning of the provided text (like the
dictionary meaning of the provided text).
As the parser helps us to analyze the syntax error in the text; so, the parsing
process is also known as the syntax analysis or the Syntactic analysis.
 We have mainly two types of parsing techniques- top-down parsing, and
bottom-up parsing.
In the top-down parsing approach, the construction of the parse tree starts
from the root node. And in the bottom-up parsing approach, the
construction of the parse tree starts from the leaf node.
Probability and Statistics
“Once upon a time, there was a . . . ”
• Can you guess the next word?
• Hard in general, because language is not deterministic.
• But some words are more likely than others
• We can model uncertainty using probability theory.
• We can use statistics to ground our models in empirical data
Statistical Inference
• Statistical inference in Natural Language Processing (NLP) involves drawing conclusions
about a language's underlying structure and patterns based on statistical models and
sample data. This technique is essential for enabling machines to make predictions,
classify text, and understand language probabilistically. Two main kinds of statistical
inference:
• 1. Estimation
• 2. Hypothesis testing
In natural language processing:
• Estimation – learn model parameters
Methods include Maximum Likelihood Estimation (MLE), Bayesian Inference, or Markov Chain Monte Carlo (MCMC),
depending on the complexity of the model.
• Hypothesis tests – assess statistical significance of test results
• Example: A/B testing in machine translation systems can statistically compare two models’ performance to determine which
produces more accurate translations.
Language modeling (LM)

Language modeling (LM) is the use of various statistical and probabilistic


techniques to determine the probability of a given sequence of words
occurring in a sentence.
Language models analyze bodies of text data to provide a basis for their
word predictions. It is widely used in predictive text input systems,
speech recognition, machine translation, spelling correction etc.
The input to a language model is usually a training set of example
sentences. The output is a probability distribution over sequences of
words.
Types of Language Model
Grammar-based models
 Statistical models
Grammar-based Model
A grammar contains symbols, rules, and a procedure of rule application.
More technically, a formal grammar consists of a finite set of terminal symbols, a finite set of nonterminal symbols, a set of rules (also called production rules) each with a left-hand and a right-hand side consisting of these symbols, and a start symbol.
 Formal grammars usually have two special symbols:
• S: the start symbol
• ε: the empty string (sometimes: λ)
If no restrictions are placed on the form of the production rules, any complex language, such as a natural language or a context-sensitive language, can be described by such a grammar.
Statistical Language Models:

 These models rely on statistical methods to predict language


patterns based on large amounts of real-world text data. They
capture language by learning probabilities from training data, rather
than relying on predefined grammar rules.
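A minimal sketch of this statistical idea, using a tiny made-up corpus (the corpus, the next_word_probability() helper and the resulting numbers below are illustrative assumptions, not from the slides): bigram probabilities are estimated simply by counting.

from collections import Counter, defaultdict

# Toy corpus for illustration only
corpus = [
    "once upon a time there was a king",
    "once upon a time there was a frog",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def next_word_probability(prev, word):
    """P(word | prev) estimated by maximum likelihood from the toy corpus."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(next_word_probability("a", "king"))   # 0.25 in this toy corpus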
Tokenization
• Tokenization is the process of creating tokens.
• Tokens can be thought of as a building unit of the text sequence (the
data).
• A string is constructed by piecing different characters together.
• Characters come together to build a word. Then many words come
together to form a sentence, and sentences form a paragraph. Many
paragraphs form a document, and so on.
• These units that build up the text corpus are tokens and the process
of splitting a text sequence into its tokens is tokenization.
• These tokens can be:

• characters
• words (individual words or sets of multiple words together)
• part of words
• punctuations
• sentences
• regular expressions
Why do we tokenize?

• As we understood, tokens are the building blocks of text


in the natural language. Therefore, most of the
preprocessing and modeling happens at the token level.
• For example, removing stopwords, stemming,
lemmatization and many other preprocessing steps
happen at token levels.
• Even neural network architectures process individual
tokens to make sense of the document.
• A simple approach is to define a subset of characters as whitespace,
and then split the text on these tokens.
• However, whitespace-based tokenization is not ideal: we may want to split contractions like isn't and hyphenated phrases like prize-winning and half-asleep, and we likely want to separate words from commas and periods that immediately follow them.
• At the same time, it would be better not to split abbreviations like U.S. and Ph.D.
Words-Tokenization
• Tokenization is a simple process that takes raw data and converts it
into a useful data string.
• Tokenization is used in natural language processing to split paragraphs
and sentences into smaller units that can be more easily assigned
meaning.
• The first step of the NLP process is gathering the data (a sentence)
and breaking it into understandable parts (words).
Example
• “What restaurants are nearby?“
• In order for this sentence to be understood by a machine, tokenization is
performed on the string to break it into individual parts. With tokenization,
we’d get something like this:
• ‘what’ ‘restaurants’ ‘are’ ‘nearby’
• This may seem simple, but breaking a sentence into its parts allows a machine
to understand the parts as well as the whole.
• This will help the program understand each of the words by themselves, as well
as how they function in the larger text.
• This is especially important for larger amounts of text as it allows the machine
to count the frequencies of certain words as well as where they frequently
appear.
Tokenization with the `split()` function
• Word Tokenization
• By default, the split() function "splits" the text into chunks on whitespace characters.

text = "Ayush and Smrita are a beautiful couple"
tokens = text.split()
print(tokens)
# ['Ayush', 'and', 'Smrita', 'are', 'a', 'beautiful', 'couple']
Tokenization with NLTK (Natural Language Toolkit)
• NLTK is a popular NLP library. It offers some great in-built tokenizers; let us explore them.

Word tokenization
• NLTK offers a bunch of different methods for word tokenization. We will explore the following:
1. word_tokenize()
2. TreebankWordTokenizer
3. WordPunctTokenizer
4. RegEx (RegexpTokenizer)
Ex. 1: word_tokenize() does a good job of tokenizing individual words along with the punctuation.
Ex. 2: TreebankWordTokenizer(), WordPunctTokenizer() and RegexpTokenizer() are alternative tokenizers; a sketch of each is given below.
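A minimal sketch of these four tokenizers (the sample sentence is an illustrative assumption; newer NLTK versions may also require the 'punkt_tab' resource for word_tokenize):

import nltk
from nltk.tokenize import (word_tokenize, TreebankWordTokenizer,
                           WordPunctTokenizer, RegexpTokenizer)

nltk.download("punkt", quiet=True)  # models needed by word_tokenize

text = "Isn't NLP amazing? Dr. Smith thinks so!"

print(word_tokenize(text))                      # splits words and punctuation, e.g. 'Is', "n't", '?'
print(TreebankWordTokenizer().tokenize(text))   # Penn Treebank conventions for contractions
print(WordPunctTokenizer().tokenize(text))      # splits on all punctuation: 'Isn', "'", 't', ...
print(RegexpTokenizer(r"\w+").tokenize(text))   # keeps only alphanumeric word tokens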
Morphemes and morphs
Morphemes and morphs are both related to word structure in
morphology, but they refer to different aspects of how meaning is
expressed in language.
Morphemes
•Definition: A morpheme is the smallest meaningful unit of language.
It is an abstract concept that represents a specific meaning or
grammatical function.
•Example: In the word cats, there are two morphemes:
• cat (which carries the meaning of the animal)
• -s (which signifies plural).
Morphemes can be free (standalone words, like cat) or bound (must
attach to other morphemes, like -s).
Morphs:

•Definition: A morph is the physical or phonological realization


of a morpheme. In other words, it is the actual form (spoken or
written) that represents the morpheme in a specific context.

•Example: In the word cats, the sound [s] is the morph


representing the plural morpheme -s.

Morphemes are abstract units of meaning, while morphs are


their concrete realizations in speech or writing.
Morphology
• In linguistics, morphology is the study of the internal structure and functions of
the words.
• How the words are formed from smaller meaningful units called
morphemes.
• The morpheme is the smallest element of a word that has grammatical
function and meaning.
• the two types of morphemes
Free morpheme : These can stand alone as words.
Examples include: book, run, happy
bound morpheme : These cannot stand alone and must be attached to other morphemes.
They typically include:
• Prefixes (e.g., un- in unhappy, pre- in prepaid)
• Suffixes (e.g., -ed in walked, -ing in singing)
• Infixes (rare in English but seen in some languages)
• Circumfixes (in some languages, parts that appear at both ends of a word)
For example, in the word unhappiness:
•un- is a bound morpheme (prefix).
•happy is a free morpheme.
•-ness is a bound morpheme (suffix).
Morphology, the study of the structure and formation of
words, can be divided into different classes based on how
morphemes combine and interact to form words.
These classes of morphology are typically categorized into
two main types:

Inflectional morphology
Derivational morphology.
1. Inflectional Morphology

Inflectional morphology deals with the modification of a word to express


different grammatical categories like tense, case, number, gender, or
aspect. It does not change the word's basic meaning or its part of speech.

•Example in English:
• walk → walks (plural)
• run → ran (past tense)
• big → bigger (comparative)

Inflection in English is relatively minimal compared to languages like Latin


or Russian, where word forms change more extensively to indicate case,
gender, etc.
2. Derivational Morphology

Derivational morphology focuses on creating new words by adding


affixes (prefixes or suffixes) to a base word. This process often
changes the word's meaning and sometimes its grammatical
category (e.g., from noun to adjective or verb to noun).

•Example in English:
• happy → unhappy (prefix changing meaning)
• teach → teacher (verb to noun)
• nation → national (noun to adjective)
Derivational morphemes have more semantic impact and can create
entirely new words from existing ones.
Morphological parsing
Stemming
A stemmer is a tool used in natural language processing (NLP) and text mining that reduces words to their base or root form, known as a "stem." The purpose of stemming is to simplify words to their core meaning, which helps in various tasks like text analysis, information retrieval, and machine learning.
Key points about stemming:
1. Reduction: stemming algorithms remove suffixes and prefixes from words. For example:
   • "running" → "run"
   • "happily" → "happi"
Regular Expression:

 Regular expression (often shortened to regex), a language for


specifying text search expression strings.
 This practical language is used in every computer language, word
processor, and text processing tools like the Unix tools grep or Emacs.
 Formally, a regular expression is an algebraic notation for
characterizing a set of strings. Regular expressions are particularly
useful for searching in texts, when we have a pattern to search corpus
for and a corpus of texts to search through.
 A regular expression search function will search through the corpus,
returning all texts that match the pattern.
The corpus can be a single document or a collection. For example, the
Unix command-line tool grep takes a regular expression and returns
every line of the input document that matches the expression
Regex rules include quantifiers and metacharacters. The backslash \ tells the computer to treat the following metacharacter as an ordinary search character, e.g. \+, \., \-.
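A small sketch of regex search with Python's re module (the corpus string and the patterns are illustrative assumptions):

import re

corpus = "The woodchuck chucked wood. How much wood would a woodchuck chuck?"

print(re.findall(r"wood[a-z]*", corpus))        # all words starting with 'wood'
print(re.findall(r"\bwould\b", corpus))         # whole-word match for 'would'
print(re.search(r"chuck(ed)?", corpus).group()) # '?' makes the 'ed' suffix optional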
Regular Language
 A regular language is a specific type of formal language that
can be expressed using regular expressions and is
recognized by finite automata.
 Regular languages are the simplest class in the Chomsky
hierarchy of formal languages, which classifies languages
based on the computational power required to recognize
them.

Formal Definition of Regular Languages:

A regular language is defined by one of the following


equivalent methods:

1.It can be generated by a regular expression.


2.It can be accepted by a deterministic finite automaton (DFA)
or nondeterministic finite automaton (NFA).
3.It can be described by a regular grammar, which is a type of
grammar in the Chomsky hierarchy where production rules
follow strict forms.
Introduction of Finite Automata
Finite automata are abstract machines used to recognize
patterns in input sequences, forming the basis for
understanding regular languages in computer science.
They consist of states, transitions, and input symbols,
processing each symbol step-by-step.
 If the machine ends in an accepting state after
processing the input, it is accepted; otherwise, it is
rejected.
 Finite automata come in deterministic (DFA) and non-
deterministic (NFA), both of which can recognize the
same set of regular languages.
They are widely used in text processing, compilers, and
network protocols.
Figure: Features of Finite Automata
Features of Finite Automata

•Input: Set of symbols or characters provided to the machine.


•Output: Accept or reject based on the input pattern.
•States of Automata: The conditions or configurations of the
machine.
•State Relation: The transitions between states.
•Output Relation: Based on the final state, the output decision is
made.
Formal Definition of Finite
Automata
A finite automaton can be defined
as a tuple:
{ Q, Σ, q, F, δ }, where:
•Q: Finite set of states
•Σ: Set of input symbols
•q: Initial state
•F: Set of final states
•δ: Transition function
Types of Finite Automata

There are two types of finite automata:

 Deterministic Finite Automata (DFA)


 Non-Deterministic Finite Automata (NFA)
Deterministic Finite Automata
(DFA)
Automata and languages
Different types of automata define different language classes:
• - Finite-state automata define regular languages
• - Pushdown automata define context-free languages
• - Turing machines define recursively enumerable languages
Finite-state automata
Ex. 1: Accept – the automaton ends up in an accepting state.
Ex. 2: Rejection – the automaton does not end up in an accepting state.
Ex. 3: Rejection – a transition is not defined.
Finite-state methods for morphology

Union: merging automata


Automata and languages: transition table
Design a finite-state automaton for baa+! (strings such as baa!, baaa!, baaaa!); a sketch of the accepting automaton in code is given below.
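A minimal sketch of this DFA in code, following the transition-table idea above (the state names q0–q4 and the accepts() helper are illustrative assumptions):

# DFA for the language baa+! (baa!, baaa!, baaaa!, ...)
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",   # loop: one or more additional a's
    ("q3", "!"): "q4",
}
START, ACCEPTING = "q0", {"q4"}

def accepts(string):
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False              # rejection: transition not defined
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING         # accept only if we end in an accepting state

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, accepts(s))
# baa! True, baaaa! True, ba! False, baa False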


Transducers in the context of lexicon and rules

 Finite State Transducers (FSTs) are computational models


used to describe the relationship between input and
output sequences or strings.
 They are an extension of Finite State Automata (FSA) with
the addition of output capabilities.
 Specifically, FSTs are widely used in areas such as natural
language processing, speech recognition, and
computational linguistics.
Key Features of Finite State Transducers:

1.States: Like finite automata, FSTs consist of a finite number of states. At


any given moment, the system is in one of these states.
2.Transitions: Each transition between states is labeled by both an input
symbol and an output symbol. When the FST reads an input symbol, it
produces a corresponding output symbol and transitions to another
state.
3.Input and Output Alphabet: FSTs operate over two alphabets—one for
inputs and one for outputs.
4.Start State: There is a designated start state where the FST begins.
5.Final State(s): Some states are designated as accepting or final states.
Once an input sequence is completely read, if the machine ends up in a
final state, the input-output relation is considered valid.
Finite-state transducers
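A toy sketch of the idea (not the textbook's exact machine): each transition maps an input symbol to an output symbol, so the surface form "cats" can be transduced to the lexical form "cat+N+PL". All state names, symbols and the transduce() helper below are illustrative assumptions.

# Each transition is keyed by (state, input symbol) and yields (next state, output).
TRANSITIONS = {
    ("q0", "c"): ("q1", "c"),
    ("q1", "a"): ("q2", "a"),
    ("q2", "t"): ("q3", "t"),
    ("q3", "s"): ("q4", "+N+PL"),   # the plural -s is transduced to morphological features
    ("q3", ""):  ("q5", "+N+SG"),   # epsilon transition for the singular form
}
ACCEPTING = {"q4", "q5"}

def transduce(surface):
    state, output = "q0", ""
    for ch in surface:
        if (state, ch) not in TRANSITIONS:
            return None                       # no valid path: input rejected
        state, out = TRANSITIONS[(state, ch)]
        output += out
    if state in ACCEPTING:
        return output
    # try a final epsilon move (e.g. "cat" -> "cat+N+SG")
    if (state, "") in TRANSITIONS:
        state, out = TRANSITIONS[(state, "")]
        if state in ACCEPTING:
            return output + out
    return None

print(transduce("cats"))  # cat+N+PL
print(transduce("cat"))   # cat+N+SG
print(transduce("dog"))   # None (not in this toy lexicon)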
Detecting and Correcting
Spelling Errors
• The detection and correction of spelling errors is an integral part of modern word-processors and search engines. It is also important in correcting errors in optical character recognition (OCR), the automatic recognition of machine- or hand-printed characters, and in on-line handwriting recognition, the recognition of human printed or cursive handwriting as the user is writing.
Spell correction:
– The user typed "graffe"
– Which is closest? graf, grail, giraffe
The word giraffe, which differs by only one letter from graffe, seems intuitively to be more similar than, say, grail or graf.
• We can distinguish three increasingly broader problems:

1. non-word error detection: detecting spelling errors that result in


non-words (like graffe for giraffe).

2. isolated-word error correction: correcting spelling errors that


result in nonwords, for example correcting graffe to giraffe, but
looking only at the word in isolation.

3. context-dependent error detection and correction: using the


context to help detect and correct spelling errors even if they
accidentally result in an actual word of English (real-word errors).
• Detecting non-word errors is generally done by marking any word that is not found in a dictionary.
• The dictionary must be chosen with care: for example, the rare words "wont" and "veery" are also common misspellings of won't and very, so a simple dictionary check would not flag them.
• An FST morphological parser can be turned into an even more efficient
FSA word recognizer by using the projection operation to extract the
lower-side language graph.
• A new stem can be added to the dictionary, and then all the inflected
forms are easily recognized. This makes FST dictionaries especially
powerful for spell-checking in morphologically rich languages where a
single stem can have tens or hundreds of possible surface forms.
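A minimal sketch of dictionary-based non-word detection with candidate suggestions (the tiny VOCABULARY set and the check_word() helper are illustrative assumptions; difflib's similarity matcher stands in here for the minimum edit distance described next):

from difflib import get_close_matches

VOCABULARY = {"giraffe", "grail", "graf", "apple", "banana", "very", "won't"}

def check_word(word):
    if word in VOCABULARY:
        return "OK"                                    # word found in the dictionary
    return get_close_matches(word, VOCABULARY, n=3)    # closest dictionary entries

print(check_word("graffe"))   # closest entries first, e.g. ['giraffe', 'graf']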
Minimum Edit Distance (Levenshtein Distance)
• It is a way of quantifying how dissimilar two strings (e.g., words) are
to one another by counting the minimum number of operations
required to transform one string into the other.
• It is widely used in computational linguistics and bioinformatics for
applications such as spell checking, DNA sequence alignment, and
natural language processing.
Operations Involved
• Insertion: Add a character. (i)
• Deletion: Remove a character. (d)
• Substitution: Replace one character with another. (s)

• The minimum edit distance between two strings is defined as the


minimum number of editing operations (insertion, deletion,
substitution) needed to transform one string into another.
For example, the minimum edit distance between the words "kitten" and "sitting" is 3:
• Replace 'k' with 's' (1 substitution)
• Replace 'e' with 'i' (1 substitution)
• Insert 'g' at the end (1 insertion)
• The minimum edit distance between INTENTION and EXECUTION can
be visualized using their alignment.
• Given two sequences, an alignment is a correspondence between
substrings of the two sequences.
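A standard dynamic-programming sketch of minimum edit distance with unit costs (the function name is an illustrative assumption; some textbooks charge 2 for a substitution, which gives 8 for INTENTION/EXECUTION instead of 5):

def min_edit_distance(source, target):
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # deletions
    for j in range(1, m + 1):
        D[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,              # deletion
                          D[i][j - 1] + 1,              # insertion
                          D[i - 1][j - 1] + sub_cost)   # substitution / match
    return D[n][m]

print(min_edit_distance("kitten", "sitting"))       # 3
print(min_edit_distance("intention", "execution"))  # 5 with unit costs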
H.W:
Find the Minimum Edit Distance for “fast “ &
“cats”
END OF MODULE-1

THANK YOU
IMPORTANT QUESTIONS

1. Explain origin and challenges of NLP


2. Explain components of NLP
3. Explain various language models
4. Explain Tokenization in detail
5. Write a regular expression for a mobile number that starts with 8 or 9 and has a total of 10 digits
6. Write a regular expression for an email id
7. Design a DFA for baa+!
8. Explain Transducers
9. Explain Automata
10. Difference between DFA & NFA
11. Demonstrate Minimum Edit Distance with an example
12. Find the minimum edit distance and alignment for "CATS" and "FAST"
