NLP-PT 1
Q) N-Grams
Unigram
A unigram is a single item (typically a single word or token) taken from a text.
For example, in the sentence "The cat sleeps," the unigrams are "The," "cat,"
and "sleeps."
Unigrams are useful for understanding individual words' frequencies and their
standalone occurrences.
Bigram
A bigram is a sequence of two adjacent items (words or tokens) from a text.
For instance, in "The cat sleeps," the bigrams are "The cat" and "cat sleeps."
Bigrams are useful for modelling and predicting sequences based on the
immediately preceding item.
Trigram
A trigram is a sequence of three adjacent items. For example, in "The cat
sleeps on the mat," the trigrams include "The cat sleeps" and "cat sleeps on."
Trigrams help in capturing more context and can improve predictive models
by incorporating a broader range of surrounding information.
n-gram
In general, an n-gram is a contiguous sequence of n items (words, characters,
or tokens) from a text; unigrams, bigrams, and trigrams are the cases n = 1, 2,
and 3. For example, a 4-gram in the text "The cat sleeps on the mat" would be
"The cat sleeps on."
Applications of N-grams
Language Modelling
Text Classification
Machine Translation
Speech Recognition
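As a rough illustration, the sketch below extracts unigrams, bigrams, and trigrams in Python (the whitespace tokenization and the example sentence are simplifying assumptions):

def ngrams(tokens, n):
    """Return all contiguous n-grams from a list of tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The cat sleeps on the mat".split()  # naive whitespace tokenization
print(ngrams(tokens, 1))  # unigrams: [('The',), ('cat',), ('sleeps',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('The', 'cat'), ('cat', 'sleeps'), ...]
print(ngrams(tokens, 3))  # trigrams: [('The', 'cat', 'sleeps'), ...]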
Q) Levels in NLP
Phonology
Phonology concerns how words are related to the sounds that realize them.
Phonemes are the basic units of sound that differentiate words in a language.
For example, the phoneme /k/ appears in both "skit" and "kit."
Morphology
Morphology deals with how words are constructed from basic meaning units
called morphemes. A morpheme is the smallest unit of meaning in a
language. For instance, the word "unhappiness" can be broken down into
three morphemes: the prefix "un-" (meaning "not"), the stem "happy," and the
suffix "-ness" (indicating a state of being).
Morphological analysis involves breaking words down into their constituent
parts, such as roots, prefixes, and suffixes, to analyze their meaning and
grammatical properties.
FSA:
Finite State Automata (FSA), also known as Finite State Machines (FSM),
play a crucial role in various Natural Language Processing (NLP) tasks,
particularly in morphological analysis, syntax parsing, and text processing.
1. Morphological Analysis:
For instance, an FSA can represent the different ways a verb can be
conjugated or how prefixes and suffixes can be attached to a root word.
This allows the system to recognize and generate correct word forms by
transitioning through states that represent valid morphological constructions.
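A minimal Python sketch of this idea is shown below; the toy root lexicon, suffix set, and the way the two transitions are simulated are illustrative assumptions rather than a full finite-state library:

# Simulate a two-transition automaton: start --root--> stem state --suffix--> accept.
ROOTS = {"walk", "talk", "jump"}      # assumed toy lexicon
SUFFIXES = {"", "s", "ed", "ing"}     # assumed inflectional suffixes

def accepts(word):
    """Return True if the word can be read as root + optional suffix."""
    for root in ROOTS:                # transition out of the start state on a valid root
        if word.startswith(root) and word[len(root):] in SUFFIXES:
            return True               # suffix transition reaches the accepting state
    return False

print(accepts("walked"))   # True
print(accepts("walkly"))   # False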
2. Lexical Analysis:
FSAs are used to tokenize input text into words, recognize patterns, or
identify parts of speech.
For example, an FSA can be used to match regular expressions in text, such
as identifying all instances of a particular pattern (e.g., email addresses or
phone numbers) within a body of text.
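For instance, a short sketch using Python's re module (the sample text and the simplified e-mail pattern are assumptions; real-world e-mail matching is more involved):

import re

text = "Contact alice@example.com or bob@test.org for details."
pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"      # simplified e-mail pattern
print(re.findall(pattern, text))          # ['alice@example.com', 'bob@test.org']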
3. Spell Checking:
FSAs can also be used in spell checkers to recognize valid word forms and
suggest corrections for misspelled words by transitioning through states that
represent valid word sequences.
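The sketch below illustrates this recognize-then-suggest idea with a toy word list and single-substitution candidates; it is a simplification rather than a true finite-state implementation:

import string

VALID_WORDS = {"cat", "cats", "hop", "hope", "mat"}   # assumed toy dictionary

def substitutions(word):
    """Yield every string obtained by replacing one letter of the word."""
    for i in range(len(word)):
        for c in string.ascii_lowercase:
            yield word[:i] + c + word[i + 1:]

def suggest(word):
    if word in VALID_WORDS:                 # already a valid word form
        return [word]
    return sorted(set(substitutions(word)) & VALID_WORDS)

print(suggest("cet"))   # ['cat']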
Q) Porter Stemmer
Vowels: The letters "a", "e", "i", "o", "u" are always considered vowels. The
letter "y" is also treated as a vowel when it is preceded by a consonant (e.g.,
"cry," where "y" acts as a vowel).
The Porter Stemmer works in several steps, each applying specific rules to
modify the word. Here's an outline:
Remove "s" to "" if the preceding part contains a vowel (e.g., "cats" -> "cat").
If a word ends with "eed," and the part before "eed" contains a vowel, replace
"eed" with "ee" (e.g., "agreed" -> "agree").
If a word ends with "ed" or "ing" and the preceding part contains a vowel,
remove "ed" or "ing" (e.g., "hopping" -> "hop").
After removing "ed" or "ing," if the word ends with "at," "bl," or "iz," add "e"
(e.g., "hopping" -> "hope").
If the word ends with a double consonant (except "l", "s", "z"), remove the last
consonant (e.g., "hopping" -> "hop").
Step 1c: Y to I
If a word ends with "y" and the preceding part contains a vowel, replace "y"
with "i" (e.g., "happy" -> "happi").
Step 2: Suffix Reduction
This step applies various rules to reduce longer suffixes to simpler forms (e.g.,
"-ational" -> "-ate," as in "relational" -> "relate").
Step 3: Final Suffix Removal
Further suffixes are stripped here; whether a rule applies depends on the
structure of the remaining stem, in particular on the vowel-consonant
sequences it contains.
Remove an "e" if the preceding part contains more than one consonant (e.g.,
"alike" -> "alik").
Do not remove "e" if the word ends with "le" (e.g., "single" -> "singl").
Step 5: Clean Up
If the word ends in a single "e" and the preceding part contains more than
one vowel, remove the "e" (e.g., "agree" -> "agre").
Example
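For instance, a minimal sketch using NLTK's PorterStemmer (assuming the nltk package is installed; NLTK applies a few extensions, so outputs can differ slightly from the textbook description):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "running", "hopping", "hoping", "happiness", "happy"]:
    print(word, "->", stemmer.stem(word))
# cats -> cat, running -> run, hopping -> hop,
# hoping -> hope, happiness -> happi, happy -> happi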
Q) POS Tagging
Part-of-Speech (POS) tagging is the process of assigning a grammatical
category (noun, verb, adjective, etc.) to each word in a sentence.
POS tagging is essential for various NLP tasks, including syntactic parsing,
sentiment analysis, and machine translation, as it provides crucial
grammatical information.
Rule-Based Tagging:
Rule-based taggers assign tags using hand-written rules and a lexicon of
possible tags for each word.
Statistical and Machine Learning-Based Tagging:
These models learn patterns from large annotated datasets and can
generalize better to unseen data compared to rule-based systems.
Challenges in POS Tagging
Ambiguity:
Words can belong to multiple parts of speech depending on the context. For
example, "book" can be a noun ("I read a book") or a verb ("I will book a
ticket"). Disambiguating the correct POS tag based on context is one of the
main challenges.
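The sketch below shows this with NLTK's default tagger (assuming nltk plus its tokenizer and tagger resources are installed; the sentences are the ones above):

import nltk  # needs the punkt tokenizer and averaged-perceptron tagger resources

for sentence in ["I read a book", "I will book a ticket"]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# "book" is typically tagged as a noun (NN) in the first sentence
# and as a verb (VB) in the second, based on the surrounding context.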
Out-of-Vocabulary (OOV) Words:
Words that are not present in the training data or lexicon, such as newly
coined terms, slang, or proper nouns, can be difficult to tag accurately. These
OOV words require the POS tagger to rely heavily on context or make
educated guesses, which may not always be accurate.
Morphologically Rich Languages:
Languages with rich morphology, where a single word can have many
inflected forms, pose a significant challenge for POS tagging. For example, in
languages like Finnish or Turkish, a word's form can change drastically based
on tense, case, number, or other grammatical features, making it difficult to
tag correctly.
Idioms and Phrasal Verbs:
Idioms and phrasal verbs (e.g., "give up," "look forward to") can complicate
POS tagging because the meaning of the entire phrase differs from the sum
of its parts. Identifying and correctly tagging these phrases requires
understanding beyond individual word tags.
Errors in Text:
Spelling errors, typos, and informal language (e.g., in social media text) can
make POS tagging more difficult, as these errors may result in
misinterpretation of words or sentences.
Q) Affixes
In Natural Language Processing (NLP), affixes are morphemes that are attached
to a word stem to modify its meaning or grammatical function. Affixes play a
crucial role in morphological analysis, which is the process of studying the
structure of words and how they are formed.
1. Prefixes
Prefixes are affixes added to the beginning of a root word to change its
meaning.
They often alter the word’s semantic value or grammatical category. For
example, in the word "unhappy," the prefix "un-" is added to the root word
"happy" to create its antonym, meaning "not happy."
2. Suffixes
Suffixes are affixes attached to the end of a root word to modify its meaning
or grammatical role.
They can indicate tense, number, or part of speech. For instance, in the word
"running," the suffix "-ing" changes the verb "run" into its present participle
form.
Another example is "happiness," where the suffix "-ness" turns the adjective
"happy" into a noun representing the state of being happy.
3. Infixes
Infixes are affixes inserted within a root word rather than at the beginning or
end.
They are less common in English but play a significant role in some
languages.
For example, in the Tagalog language, the infix "-um-" can be inserted into
the root word "sulat" (write) to form "sumulat" (wrote). Infixes can alter the
meaning of the root word by changing its grammatical function.
4. Circumfixes
Circumfixes are affixes that surround a root word, with one part attached at
the beginning and the other at the end.
For example, in the German language, the circumfix "ge-...-t" is used in the
past participle form of verbs, as in "gespielt" (played), where "ge-" is the
prefix and "-t" is the suffix added to the root "spiel" (play).
Q) Open Class and Closed Class Words
Open class words, also known as content words, are categories of words that
can freely accept new members and frequently change over time. These
words typically carry significant meaning and contribute most of the content
in a sentence. They include:
Nouns: Represent people, places, things, or concepts (e.g., "computer,"
"city," "happiness").
Verbs: Express actions, events, or states (e.g., "run," "jumps," "think").
Adjectives: Describe or qualify nouns (e.g., "quick," "lazy," "clever").
Adverbs: Modify verbs, adjectives, or other adverbs (e.g., "quickly," "very").
Examples:
In the sentence "The quick brown fox jumps over the lazy dog," "fox,"
"jumps," "quick," and "lazy" are all open class words because they provide
core meaning and can be replaced or expanded with new words (e.g.,
"clever" instead of "quick," or "dog" instead of "fox").
Closed class words, also known as function words, belong to categories that
are generally fixed and do not readily accept new members. These words
primarily serve grammatical functions and help structure sentences rather
than providing substantial content. They include:
Pronouns: Stand in for nouns (e.g., "she," "it," "they").
Determiners/Articles: Specify or limit nouns (e.g., "the," "a," "this").
Prepositions: Express relations of place, time, or direction (e.g., "to," "in,"
"over").
Conjunctions: Link words, phrases, or clauses (e.g., "because," "and," "but").
Auxiliary Verbs: Help form various tenses or aspects of verbs (e.g., "is,"
"have," "will").
Examples:
In the sentence "She went to the store because she needed groceries,"
"she," "to," "the," "because," and "needed" are closed class words. They
primarily serve grammatical roles and do not change frequently or expand
with new forms.
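A small sketch that separates the two classes by Penn Treebank tag prefixes (assuming nltk and its tagger resources are installed; mapping tag prefixes to open vs. closed class is a simplification):

import nltk  # needs the punkt tokenizer and averaged-perceptron tagger resources

OPEN_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")   # nouns, verbs, adjectives, adverbs

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
open_class = [word for word, tag in tagged if tag.startswith(OPEN_TAG_PREFIXES)]
closed_class = [word for word, tag in tagged if not tag.startswith(OPEN_TAG_PREFIXES)]
print("open:", open_class)      # e.g., ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
print("closed:", closed_class)  # e.g., ['The', 'over', 'the']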
Inflectional Morphology
Inflectional morphology modifies a word to express grammatical features
such as tense, number, or case. The affix is typically added to the end of the
word (a suffix), and the word's syntactic category remains the same (e.g.,
"cat" -> "cats," "walk" -> "walked").
Derivational Morphology
Derivational morphology creates new words from existing ones, often
changing the word's syntactic category or core meaning (e.g., "happy" ->
"happiness," "compute" -> "computation").