NLP-PT 1

NLP

Q)N-GRAM

An n-gram is a contiguous sequence of n items from a given sample of text or speech. In Natural Language Processing (NLP), n-grams are used to model the structure and patterns of language by capturing the relationships between words or characters in a sequence.

Unigram

A unigram is a single word or character from a text. It represents the most basic unit of analysis and focuses on individual elements without considering their context.

For example, in the sentence "The cat sleeps," the unigrams are "The," "cat,"
and "sleeps."

Unigrams are useful for understanding individual words' frequencies and their
standalone occurrences.

Bigram

A bigram consists of two consecutive words or characters. It captures the relationship between adjacent elements and provides a sense of the local context within a text.

For instance, in "The cat sleeps," the bigrams are "The cat" and "cat sleeps."

Bigrams are useful for modelling and predicting sequences based on the
immediate preceding item.

Trigram

A trigram is a sequence of three consecutive words or characters, offering a more detailed view of the context by considering the relationships among three adjacent elements.

For the sentence "The cat sleeps," the trigram is "The cat sleeps."

Trigrams help in capturing more context and can improve predictive models
by incorporating a broader range of surrounding information.

n-gram

An n-gram is a general term for a sequence of n consecutive words or characters. It provides flexibility to capture various levels of context depending on the value of n.

For example, a 4-gram in the text "The cat sleeps on the mat" would be "The
cat sleeps on."

N-grams are essential for modeling and analyzing language patterns by considering different lengths of sequences to understand and predict textual data.
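As a quick illustration, the following minimal Python sketch extracts word-level n-grams from a sentence using a sliding window (the function name `ngrams` is just a convenient label, not a standard API):

```python
def ngrams(text, n):
    """Return the list of word-level n-grams in `text`."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cat sleeps on the mat"
print(ngrams(sentence, 1))  # unigrams: ['The', 'cat', 'sleeps', ...]
print(ngrams(sentence, 2))  # bigrams: ['The cat', 'cat sleeps', ...]
print(ngrams(sentence, 4))  # 4-grams: ['The cat sleeps on', ...]
```

Character-level n-grams work the same way; simply iterate over the characters of the string instead of splitting it into words.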

Applications of N-grams

Language Modelling

N-grams predict the next word or character in a sentence based on the previous ones. For example, a bigram model could suggest "AI" as the next word after "I love" because it's a common phrase.
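This bigram prediction idea can be sketched by counting which words follow which; the tiny corpus below is made up purely for illustration:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count, for each word, how often each other word follows it."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            follows[w1][w2] += 1
    return follows

# Illustrative corpus (invented for this sketch).
corpus = ["i love ai", "i love nlp", "we love ai"]
model = train_bigram_model(corpus)

# The most common word following "love":
print(model["love"].most_common(1)[0][0])  # "ai"
```

A real language model would normalize these counts into probabilities and apply smoothing for unseen pairs, but the counting step above is the core of the approach.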

Text Classification

N-grams help in sorting texts, like categorizing emails as "spam" or "not spam," by identifying patterns in word sequences.

Machine Translation

N-grams are used to match and translate phrases between languages, helping to translate common phrases accurately.

Speech Recognition

N-grams improve speech-to-text accuracy by modeling how words or sounds typically follow each other in speech.

Information Retrieval

N-grams enhance search engines by better matching search queries with relevant documents based on word sequences.

**DO N-GRAM SUMS**

Q) Levels in NLP

Phonology

Phonology concerns how words are related to the sounds that realize them.
Phonemes are the basic units of sound that differentiate words in a language.
For example, the phoneme /k/ appears in both "skit" and "kit."

Morphology

Morphology deals with how words are constructed from basic meaning units
called morphemes. A morpheme is the smallest unit of meaning in a
language. For instance, the word "unhappiness" can be broken down into
three morphemes: the prefix "un-" (meaning "not"), the stem "happy," and the
suffix "-ness" (indicating a state of being).

Levels of NLP

Lexical Level

Lexical analysis in NLP focuses on the meaning and part-of-speech of words. It deals with lexemes, which are the fundamental units of lexical meaning. For example, in the sentence "She will park the car," the word "park" can function as a verb (to park a car) or a noun (a park with trees), depending on the context.

Syntax (Parsing)

Syntax concerns how words are arranged to form grammatically correct sentences, determining each word's structural role and how phrases relate to one another. For example, the sentence "The chicken is ready to eat" could mean the chicken is about to eat something, or it could mean the chicken is cooked and ready to be eaten.

Semantics

Semantics deals with the meanings of words and how they combine to form sentence meaning, focusing on context-independent meaning. For example, the phrase "cold fire" doesn't make sense semantically because fire is usually associated with heat.

Discourse

Discourse analysis involves how preceding sentences affect the interpretation of the next sentence, such as understanding pronouns and temporal aspects. For example, in the sentence "John went to the store. He bought some milk," the word "He" refers to John.

Pragmatics

Pragmatics studies how sentences are used in different situations and how context affects interpretation. For instance, the command "Get me a cup of coffee" versus "Shut the door" reflects different pragmatic uses.

World Knowledge

World knowledge includes general knowledge about the world that language users must share to understand each other's beliefs, goals, and communication effectively.

Q) explain morphological analysis / Role of FSA


Morphological analysis in NLP involves studying the structure of words to
understand how they are formed from morphemes, the smallest units of
meaning in a language.

This process includes breaking down words into their constituent parts—such
as roots, prefixes, and suffixes—to analyze their meaning and grammatical
properties.

For example, consider the word "unhappiness."

Morphological analysis would decompose this word into three morphemes: "un-" (a prefix meaning "not"), "happy" (the root or stem, which is a free morpheme as it can stand alone), and "-ness" (a suffix indicating a state or condition). By understanding these components, morphological analysis helps determine the word's meaning, grammatical role, and how it can be used in different contexts.

This analysis is crucial in various NLP tasks such as lemmatization, where words are reduced to their base or dictionary form (e.g., "running" to "run"), and in understanding how words relate to each other in terms of meaning and grammar, especially in languages with complex inflectional systems.

FSA:

Finite State Automata (FSA), also known as Finite State Machines (FSM),
play a crucial role in various Natural Language Processing (NLP) tasks,
particularly in morphological analysis, syntax parsing, and text processing.

An FSA is a computational model consisting of a finite number of states, transitions between those states, and actions based on input symbols.

1. Morphological Analysis:

In morphological analysis, FSAs can be used to model the structure of words and their valid forms.

For instance, an FSA can represent the different ways a verb can be
conjugated or how prefixes and suffixes can be attached to a root word.

This allows the system to recognize and generate correct word forms by
transitioning through states that represent valid morphological constructions.
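As a sketch of this idea, the toy FSA below accepts valid inflected forms of the regular verb "walk" by consuming morphemes one at a time. The state names and suffix set are illustrative, not a real morphological lexicon:

```python
# Transition table: (current state, input morpheme) -> next state.
TRANSITIONS = {
    ("start", "walk"): "stem",
    ("stem", "s"): "accept",
    ("stem", "ed"): "accept",
    ("stem", "ing"): "accept",
}
ACCEPTING = {"stem", "accept"}  # the bare stem "walk" is itself a valid word

def accepts(morphemes):
    """Run the FSA over a list of morphemes; True if it ends in an accepting state."""
    state = "start"
    for m in morphemes:
        state = TRANSITIONS.get((state, m))
        if state is None:        # no valid transition: reject
            return False
    return state in ACCEPTING

print(accepts(["walk"]))          # True  ("walk")
print(accepts(["walk", "ing"]))   # True  ("walking")
print(accepts(["walk", "ly"]))    # False (no such form)
```

A full morphological analyzer would have states for whole classes of stems and suffixes, but the mechanism — moving through states that represent valid constructions — is exactly this.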

2. Lexical Analysis:

FSAs are used to tokenize input text into words, recognize patterns, or
identify parts of speech.

For example, an FSA can be used to match regular expressions in text, such
as identifying all instances of a particular pattern (e.g., email addresses or
phone numbers) within a body of text.
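For instance, a regular expression (which compiles down to a finite automaton) can pull email-like strings out of text. The pattern below is deliberately simplified for illustration; robust email matching is considerably more involved:

```python
import re

text = "Contact alice@example.com or bob@test.org for details."

# Simplified email pattern: word chars/dots before and after an "@".
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['alice@example.com', 'bob@test.org']
```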

3. Syntax Parsing:

FSAs are sometimes employed in syntax parsing, especially in simpler models like regular grammars.

They can help to validate whether a sequence of words forms a syntactically correct sentence according to predefined rules.

For instance, an FSA might ensure that an article is followed by a noun or that a verb phrase follows a noun phrase.

4. Spell Checking and Correction:

FSAs can also be used in spell checkers to recognize valid word forms and
suggest corrections for misspelled words by transitioning through states that
represent valid word sequences.

5. Finite State Transducers (FSTs):

An extension of FSAs, called Finite State Transducers, is often used in tasks like morphological generation and speech recognition. FSTs map one sequence of symbols to another, which is useful in converting base forms of words into their inflected forms or transcribing spoken input into text.

Q) Porter Stemmer ALGORITHM

The Porter Stemmer Algorithm is a widely used algorithm in Natural Language Processing (NLP) for reducing words to their root or base form, known as stemming.

The algorithm removes common morphological and inflectional endings from words in English. It operates by applying a series of predefined rules that systematically strip suffixes from words.

Consonants and Vowels in the Algorithm

The algorithm distinguishes between consonants and vowels to define word structures. Here's how they are identified:

Vowels: The letters "a", "e", "i", "o", "u" are always considered vowels. The letter "y" is also treated as a vowel when it is preceded by a consonant (e.g., "cry," where "y" acts as a vowel).

Consonants: Any letter that is not a vowel is considered a consonant. The letter "y" is considered a consonant when it is at the beginning of a word or when it follows a vowel (e.g., "yes," where "y" acts as a consonant).

The Steps of the Porter Stemmer Algorithm

The Porter Stemmer works in several steps, each applying specific rules to
modify the word. Here's an outline:

Step 1a: Plural Removal

Replace "sses" with "ss" (e.g., "caresses" -> "caress").

Replace "ies" with "i" (e.g., "ponies" -> "poni").

Leave "ss" unchanged (e.g., "caress" -> "caress").

Remove a final "s" if the preceding part contains a vowel (e.g., "cats" -> "cat").
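The Step 1a rules as stated above can be sketched directly in Python. Note that this follows the simplified description given here; the full Porter algorithm adds further conditions and many more steps:

```python
VOWELS = set("aeiou")

def step1a(word):
    """Porter stemmer Step 1a (plural removal), per the rules above."""
    if word.endswith("sses"):
        return word[:-2]            # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]            # ponies -> poni
    if word.endswith("ss"):
        return word                 # caress -> caress (unchanged)
    if word.endswith("s") and any(c in VOWELS for c in word[:-1]):
        return word[:-1]            # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1a(w))
```

The rule order matters: "sses" and "ies" must be checked before the bare "s" rule, since longer suffixes take precedence.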

Step 1b: Ending Removal

If a word ends with "eed," and the part before "eed" contains a vowel, replace
"eed" with "ee" (e.g., "agreed" -> "agree").

If a word ends with "ed" or "ing" and the preceding part contains a vowel,
remove "ed" or "ing" (e.g., "hopping" -> "hop").

After removing "ed" or "ing," if the word ends with "at," "bl," or "iz," add "e" (e.g., "sized" -> "siz" -> "size").

If the word ends with a double consonant (except "l", "s", "z"), remove the last
consonant (e.g., "hopping" -> "hop").

If the word is in the form consonant-vowel-consonant (CVC) and ends with a consonant that is not "w," "x," or "y," add "e" (e.g., "hoping" -> "hop" -> "hope").

Step 1c: Y to I

If a word ends with "y" and the preceding part contains a vowel, replace "y"
with "i" (e.g., "happy" -> "happi").

Step 2: Double Suffix Removal

Replace common suffixes, such as "ational" with "ate" or "tional" with "tion" (e.g., "relational" -> "relate").

This step applies various rules to reduce longer suffixes to simpler forms.

Step 3: Final Suffix Removal

Simplifies suffixes such as "icate" to "ic" (e.g., "duplicate" -> "duplic").

The rule applies based on the structure of the word and the presence of
vowels in certain positions.

Step 4: "E" Removal

Remove a final "e" if the remaining stem contains more than one vowel-consonant sequence (e.g., "alike" -> "alik").

Keep the final "e" when the remaining stem is a single consonant-vowel-consonant group ending in a consonant other than "w," "x," or "y" (e.g., "rate" stays "rate").

Step 5: Clean Up

If the stem ends in a double "l" and contains more than one vowel-consonant sequence, remove the final letter (e.g., "controlling" -> "controll" -> "control").

Example

Let’s walk through an example with the word "relational":

Step 1a: No change needed (no plural "s").

Step 1b: No change needed (no "eed," "ed," or "ing").

Step 1c: No change needed (no ending "y").

Step 2: "ational" is converted to "ate" -> "relate."

Step 3: No change needed (no suffix to simplify).

Step 4: No change needed (no final "e" to remove).

Step 5: No change needed (word is already simplified).

Final stemmed form: "relate."

** also look at the example in the PDF


Q) POS Tagging and Its challenges

Parts of Speech (POS) Tagging is a fundamental task in Natural Language Processing (NLP) that involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, adverb, etc.

POS tagging is essential for various NLP tasks, including syntactic parsing,
sentiment analysis, and machine translation, as it provides crucial
grammatical information.

How POS Tagging Works

POS tagging typically involves two main approaches:

Rule-Based Tagging:

This approach uses a set of predefined linguistic rules to assign tags to words. For example, a rule might state that if a word ends in "ing," it is likely a verb (e.g., "running").

Rule-based systems rely heavily on handcrafted rules and lexicons, which can be effective but may struggle with new or uncommon word forms.
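A toy rule-based tagger might combine a small lexicon with suffix heuristics, as sketched below. The tag set, lexicon, and rules here are illustrative; real systems use far larger resources:

```python
# Tiny lexicon of known words and their tags (illustrative only).
LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN"}
# Suffix heuristics applied to unknown words, in priority order.
SUFFIX_RULES = [("ing", "VERB"), ("ed", "VERB"), ("ly", "ADV"), ("ness", "NOUN")]

def tag(word):
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for suffix, pos in SUFFIX_RULES:
        if w.endswith(suffix):
            return pos
    return "NOUN"  # default guess: unknown words are most often nouns

print([(w, tag(w)) for w in "the dog is running quickly".split()])
```

Note that "is" gets mis-tagged as a noun here, which illustrates exactly the weakness described above: handcrafted rules break down on words the lexicon and heuristics don't cover.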

Statistical and Machine Learning-Based Tagging:

This approach uses probabilistic models, such as Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), or neural networks, trained on annotated corpora to predict the most likely tag for each word based on its context.

These models learn patterns from large datasets and can generalize better to
unseen data compared to rule-based systems.

Challenges in POS Tagging

Ambiguity:
Words can belong to multiple parts of speech depending on the context. For
example, "book" can be a noun ("I read a book") or a verb ("I will book a
ticket"). Disambiguating the correct POS tag based on context is one of the
main challenges.

Out-of-Vocabulary (OOV) Words:

Words that are not present in the training data or lexicon, such as newly
coined terms, slang, or proper nouns, can be difficult to tag accurately. These
OOV words require the POS tagger to rely heavily on context or make
educated guesses, which may not always be accurate.

Complex Sentence Structures:

Sentences with complex structures, such as long, nested clauses, can confuse POS taggers. The tagger must accurately identify the role of each word in such sentences, which is challenging when dealing with complex syntax or unusual sentence constructions.

Morphologically Rich Languages:

Languages with rich morphology, where a single word can have many
inflected forms, pose a significant challenge for POS tagging. For example, in
languages like Finnish or Turkish, a word's form can change drastically based
on tense, case, number, or other grammatical features, making it difficult to
tag correctly.

Idiomatic Expressions and Phrasal Verbs:

Idioms and phrasal verbs (e.g., "give up," "look forward to") can complicate
POS tagging because the meaning of the entire phrase differs from the sum
of its parts. Identifying and correctly tagging these phrases require
understanding beyond individual word tags.

Inconsistent Tagging Schemes:

Different POS taggers or datasets might use different tagging schemes, which can lead to inconsistencies when combining or comparing data from various sources. For example, one scheme might tag "to" as a preposition, while another might tag it as a particle in specific contexts.

Multilingual POS Tagging:

Applying POS tagging to multiple languages introduces additional challenges, such as differences in syntax, word order, and language-specific tagging conventions. Models trained on one language often do not generalize well to another without significant retraining or adaptation.

Errors in Text:

Spelling errors, typos, and informal language (e.g., in social media text) can
make POS tagging more difficult, as these errors may result in
misinterpretation of words or sentences.

Q) Affixes and its Types

In Natural Language Processing (NLP), affixes are morphemes that are attached
to a word stem to modify its meaning or grammatical function. Affixes play a
crucial role in morphological analysis, which is the process of studying the
structure of words and how they are formed.

1. Prefixes

Prefixes are affixes added to the beginning of a root word to change its
meaning.

They often alter the word’s semantic value or grammatical category. For
example, in the word "unhappy," the prefix "un-" is added to the root word
"happy" to create its antonym, meaning "not happy."

Similarly, "pre-" in "preview" means "before," changing the meaning of "view" to refer to seeing something before its official release.

2. Suffixes

Suffixes are affixes attached to the end of a root word to modify its meaning or grammatical role. They can indicate tense, number, or part of speech. For instance, in the word "running," the suffix "-ing" changes the verb "run" into its present participle form.

Another example is "happiness," where the suffix "-ness" turns the adjective
"happy" into a noun representing the state of being happy.

3. Infixes

Infixes are affixes inserted within a root word rather than at the beginning or
end.

They are less common in English but play a significant role in some
languages.

For example, in the Tagalog language, the infix "-um-" can be inserted into
the root word "sulat" (write) to form "sumulat" (wrote). Infixes can alter the
meaning of the root word by changing its grammatical function.

4. Circumfixes

Circumfixes are affixes that surround a root word, with one part attached at
the beginning and the other at the end.

This type is relatively rare in English but is found in other languages.

For example, in the German language, the circumfix "ge-...-t" is used in the
past participle form of verbs, as in "gespielt" (played), where "ge-" is the
prefix and "-t" is the suffix added to the root "spiel" (play).

Q) Explain Open Class and Close class words

Open Class Words

Open class words, also known as content words, are categories of words that can freely accept new members and frequently change over time. These words typically carry significant meaning and contribute most of the content in a sentence. They include:

Nouns: Represent people, places, things, or concepts (e.g., "computer," "city," "happiness").

Verbs: Represent actions or states (e.g., "run," "think," "exist").

Adjectives: Describe or modify nouns (e.g., "beautiful," "large," "quick").

Adverbs: Modify verbs, adjectives, or other adverbs (e.g., "quickly," "very," "well").

Examples:

In the sentence "The quick brown fox jumps over the lazy dog," "fox,"
"jumps," "quick," and "lazy" are all open class words because they provide
core meaning and can be replaced or expanded with new words (e.g.,
"clever" instead of "quick," or "dog" instead of "fox").

Closed Class Words

Closed class words, also known as function words, belong to categories that
are generally fixed and do not readily accept new members. These words
primarily serve grammatical functions and help structure sentences rather
than providing substantial content. They include:

Prepositions: Indicate relationships between nouns and other words (e.g., "in," "on," "under").

Conjunctions: Connect words, phrases, or clauses (e.g., "and," "but," "because").

Pronouns: Replace nouns (e.g., "he," "she," "it").

Determiners: Specify nouns (e.g., "the," "a," "this").

Auxiliary Verbs: Help form various tenses or aspects of verbs (e.g., "is,"
"have," "will").

Examples:

In the sentence "She went to the store because she needed groceries," the words "she," "to," "the," and "because" are closed class words, while "went," "store," "needed," and "groceries" are open class. The closed class words primarily serve grammatical roles and do not change frequently or expand with new forms.

Q) Inflectional and Derivational Morphology

Inflectional Morphology

Inflectional morphology deals with modifying a word to express different grammatical categories without changing its core meaning or part of speech. Inflectional affixes are added to a word to indicate various grammatical features such as tense, number, case, mood, or comparison. This process maintains the base form of the word but alters its role in a sentence.

Examples of Inflectional Morphology:

Verb Tenses: Adding suffixes to verbs to indicate tense or aspect. For example:

"walk" (present) → "walked" (past)

"run" (present) → "running" (present participle)

Noun Plurals: Changing the form of nouns to indicate plurality:

"cat" (singular) → "cats" (plural)

Adjective Comparisons: Modifying adjectives to show degrees of comparison:

"big" (positive) → "bigger" (comparative) → "biggest" (superlative)

In inflectional morphology, the affix is typically added to the end of the word
(suffixes), and the word's syntactic category remains the same.

Derivational Morphology

Derivational morphology involves creating new words by adding affixes to a base or root word, which changes the word's meaning or part of speech. This process can result in a significant shift in the word's function, meaning, or both.

Examples of Derivational Morphology:

Changing Part of Speech: Adding affixes to change the word's grammatical category:

"happy" (adjective) → "happiness" (noun)

"care" (verb) → "careful" (adjective)

Adding Prefixes: Modifying the meaning of a word by adding a prefix:

"happy" → "unhappy" (negation of the adjective)

"likely" → "unlikely" (opposite meaning)

Creating New Words: Forming new words by combining roots with derivational affixes:

"educate" (verb) → "education" (noun)

"perform" (verb) → "performance" (noun)

In derivational morphology, the affix can be either a prefix (attached to the beginning of a word) or a suffix (attached to the end). This process often changes the word's meaning and sometimes its grammatical category.
