ML Module A7707 - Part 1
NATURAL LANGUAGE PROCESSING - NLP
MODULE 1: INTRODUCTION
MANISH CHHABRA
Course Outcomes
After the completion of the course, the student will be able to:
• A7707.1 Identify the structure of words and documents for text preprocessing.
• A7707.2 Choose an approach to parse the given text document.
• A7707.3 Make use of semantic parsing to capture the real meaning of text.
• A7707.4 Select a language model to predict the probability of a sequence of words.
• A7707.5 Examine the various applications of NLP.
Introduction: What is Natural Language Processing (NLP), Origins of NLP, The Challenges of NLP, Phases of NLP,
Language and Grammar. Finding the Structure of Words and Documents: Words and Their Components, Issues and
Challenges, Morphological Models. Finding the Structure of Documents: Introduction, Sentence Boundary Detection,
Topic Boundary Detection, Methods, Complexity of the Approaches, Performances of the Approaches, Features,
Processing Stages
Syntax: Parsing Natural Language, A Data-Driven Approach to Syntax, Stop words, Correcting Words, Stemming,
Lemmatization, Parts of Speech (POS) Tagging, Representation of Syntactic Structure, Parsing Algorithms, Models for
Ambiguity Resolution in Parsing. Semantic Parsing: Introduction,
Semantic Interpretation: Structural Ambiguity, Entity and Event Resolution, System Paradigms, Word Sense,
Predicate-Argument Structure, Meaning Representation
Language modeling: Introduction, n-Gram Models, Language Model Evaluation, Parameter Estimation, Types of
Language Models: Class-Based Language Models, MaxEnt Language Models, Neural Network Language Models,
Language-Specific Modeling Problems, Multilingual and Crosslingual Language Modeling.
Applications: Question Answering: History, Architectures, Question Analysis, Search and Candidate Extraction,
Automatic Summarization: Approaches to Summarization, Spoken Dialog Systems: Speech Recognition and
Understanding, Speech Generation, Dialog Manager, Voice User Interface, Information Retrieval: Document
Preprocessing, Monolingual Information Retrieval
Natural Language Processing (NLP)
is a field of artificial intelligence (AI) that focuses on the
interaction between computers and human language. NLP
enables computers to understand, interpret, and generate
human language in a way that is both meaningful and useful.
The origins of Natural Language Processing (NLP)
can be traced back to the mid-20th century. NLP emerged as a field at the intersection of computer science, linguistics, and artificial
intelligence (AI), and it has been shaped by many milestones and contributors over the decades.
Phonology: The study of how sounds are organized systematically in a language.
Morphology: The study of the formation and internal structure of words.
Morpheme: The primitive (smallest) unit of meaning in a language.
Syntax: The study of the formation and internal structure of sentences.
Semantics: The study of the meaning of sentences.
Pragmatics: Deals with using and understanding sentences in different situations,
and how the situation affects the interpretation of a sentence.
Discourse: Deals with how the immediately preceding sentence can affect the interpretation of the next sentence.
World Knowledge: The general knowledge about the world needed to interpret language.
Steps in NLP
There are five general steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis:
Lexical analysis involves tokenization and breaking down the input text
into individual words or tokens.
This step establishes the basic structure of words in the text and
removes any extraneous characters or symbols. It's the initial stage of
text processing.
Here is a simple example of lexical analysis.
Input Text:
"Natural language processing (NLP) is a subfield of artificial intelligence."
"Natural"
Token Type: Word
Explanation: The first word in the text.
"language"
Token Type: Word
Explanation: The second word in the text.
"processing"
Token Type: Word
Explanation: The third word in the text.
"("
Token Type: Punctuation
Explanation: An opening parenthesis.
"NLP"
Token Type: Word/Acronym
Explanation: An acronym representing "Natural Language Processing."
")"
Token Type: Punctuation
Explanation: A closing parenthesis.
"is"
Token Type: Word
Explanation: A common word indicating existence or a state.
"a"
Token Type: Word
Explanation: An article indicating an indefinite noun.
"subfield"
Token Type: Word
Explanation: A word describing a specialized area within a field.
"of"
Token Type: Word
Explanation: A common word indicating possession or association.
"artificial"
Token Type: Word
Explanation: A word describing something not occurring naturally.
"intelligence"
Token Type: Word
Explanation: A word referring to the capacity to think and learn.
import re
# Sample text
text = "Natural language processing (NLP) is a subfield of artificial intelligence."
# Tokenization function
def tokenize(text):
    # Use regular expressions to split text into lowercase words
    words = re.findall(r'\w+', text.lower())
    return words
tokens = tokenize(text)
print(tokens)
1. We import the re (regular expressions) module to help us split the text into words. The re
module is Python's regular-expression module, used for pattern matching and searching in strings.
2. The sample text is: "Natural language processing (NLP) is a subfield of artificial intelligence."
3. re.findall() returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned
left-to-right, and matches are returned in the order found.
4. Other useful functions in re:
re.search(): returns None if the pattern does not match anywhere in the string, or a match object for the first occurrence found.
re.match(): determines whether a pattern matches at the beginning of a string.
re.split(): splits a string into a list of substrings based on a specified pattern.
5. The tokens list contains the individual words extracted from the text.
r'\w+':
r prefix: This indicates that the string following it is a raw string,
which means that backslashes \ within the string are treated as
literal characters, making it easier to work with regular expressions.
'\w+': This is the regular expression pattern.
\w: In a regular expression, \w represents a word character, which
includes alphanumeric characters (letters and digits) and underscores.
It matches a single word character.
+: The + quantifier means "one or more occurrences of the preceding
pattern."
So, '\w+' matches one or more word characters in sequence,
effectively capturing words in the text.
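To make the difference between these re functions concrete, here is a minimal sketch (it reuses the same sample text as above; the other input strings are only illustrative):

import re

text = "Natural language processing (NLP) is a subfield of artificial intelligence."

# findall: every non-overlapping match, scanned left to right
print(re.findall(r'\w+', text.lower())[:3])        # ['natural', 'language', 'processing']

# search: first match anywhere in the string (None if there is no match)
print(re.search(r'\w+', "  hello world").group())  # 'hello'

# match: only succeeds if the pattern matches at the very beginning
print(re.match(r'\w+', "  hello world"))           # None, because the string starts with spaces

# split: break the string wherever the pattern matches
print(re.split(r'\W+', "hello, world!"))           # ['hello', 'world', '']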
Syntactic Analysis (Parsing):
Consider the sentence "The quick brown fox jumps over the lazy dog." We can break it down into its grammatical components:
"The" (Determiner): This word functions as a determiner, indicating that a specific noun is coming next.
"quick" (Adjective): It describes a quality of the noun "fox," namely its speed.
"brown" (Adjective): Similar to "quick," it describes the color of the noun "fox."
"fox" (Noun): The main subject of the sentence. It's the animal that is performing the action.
"jumps" (Verb): The action word in the sentence, indicating what the subject (fox) is doing.
"over" (Preposition): It shows the relationship between the action and the object.
"the" (Determiner): It indicates the specific noun that follows.
"lazy" (Adjective): It describes the noun "dog."
"dog" (Noun): The object of the action, which the fox is jumping over.
EXERCISE: "The cat sat on the mat."
ANSWER :
"The" (Determiner)
"cat" (Noun)
"sat" (Verb)
"on" (Preposition)
"the" (Determiner)
"mat" (Noun)
This analysis provides an understanding of the grammatical
structure and relationships between words in the sentence
"The cat sat on the mat."
Semantic Analysis:
Semantic analysis goes beyond syntax and aims to understand the
meaning of text.
It involves interpreting the relationships between words, phrases, and
sentences and determining the overall meaning of the text.
This step helps in capturing the semantics and intent behind the
language.
For example, let us do a semantic analysis of the previous example:
"The" (Determiner): This word indicates that a specific noun is coming next but doesn't provide specific meaning by itself.
"quick" (Adjective): It describes the fox, suggesting that the fox is fast or agile.
"brown" (Adjective): This word describes the color of the fox, indicating that the fox has a brown fur color.
"jumps" (Verb): The action word in the sentence, indicating that the fox is leaping or hopping.
"over" (Preposition): It shows the relationship between the action (jumps) and the object (dog).
"lazy" (Adjective): This word describes the dog, suggesting that the dog is not active or energetic.
The word choices and descriptions convey a mental image of a fast, agile brown fox
leaping over a lazy dog.
Discourse integration deals with understanding how sentences are connected in a larger context,
such as a conversation or a document. It focuses on coherence and cohesion in language and helps
in tracking references, pronouns, and discourse markers to establish context and continuity.
Discourse integration involves understanding how different words, phrases, and clauses in a
sentence relate to each other and to the broader context in which the sentence appears.
EXAMPLE - "The quick brown fox jumps over the lazy dog.”
Let's break down the elements of this sentence and discuss discourse integration:
1."The quick brown fox" - This part of the sentence introduces the subject of the sentence, which is the fox. It also
provides some descriptive information about the fox (i.e., it is quick and brown). Discourse integration in this case involves
recognizing that this noun phrase is the entity that will perform the action in the sentence.
2. "jumps over" - This is the verb phrase that indicates the action being performed. Discourse integration involves
understanding the relationship between the subject (the fox) and the action (jumping over) and how they fit together in
the context of the sentence.
3 ."the lazy dog" - This is the object of the action, specifying what the fox is jumping over.
Discourse integration involves recognizing that "the lazy dog" is the target or recipient of the fox's action.
Additionally, discourse integration may involve resolving ambiguities or references. For example, if there were
multiple dogs mentioned earlier in the text, discourse integration would require identifying which
specific "lazy dog" the sentence is referring to.
Overall, discourse integration in NLP is crucial for comprehending the relationships between various linguistic
elements within a sentence and how they contribute to the coherence and meaning of a larger text or
conversation.
Pragmatic Analysis:
Pragmatic analysis takes into account the pragmatic aspects of language, including implied
meaning, speech acts, and context-dependent interpretations.
It considers the social and situational context in which language is used and helps in understanding
the intentions and implications behind language use.
These steps are particularly relevant for deep NLP tasks where a deeper understanding of language is required,
such as question answering, machine translation, sentiment analysis, and chatbots.
In practical NLP applications, not all of these steps are always necessary or may be combined
depending on the specific task and goals. Additionally, modern NLP models, such as transformers,
have shown remarkable performance by jointly modeling multiple levels of linguistic information,
which can reduce the need for explicitly separating these steps.
A pragmatic analysis of the sentence "The quick brown fox jumps over the lazy dog."
involves examining how it conveys meaning beyond its literal interpretation by considering the context, speaker's
intentions, and implicatures.
Basic Information: This sentence tells us about a fast, brown-colored fox and a lazy dog.
Funny Contrast: It's a bit funny because it's describing the fox as fast (quick) and the dog as lazy.
These are opposites, so it's like saying the fox is very active while the dog is very relaxed.
Special Use: Sometimes, people use this sentence because it contains every letter of the
alphabet (it is a pangram).
Depends on the Situation: What this sentence means can change depending on the situation or the
conversation. It might just be telling a story about a fox and a dog, or it might
be used to show how all the letters of the alphabet can fit in a sentence.
So, its meaning depends on what's happening around it.
In simple terms, it's a sentence that talks about a fast fox and a lazy dog,
and it can be used for different things depending on when and where it's said.
Finding the Structure of Words and Documents is a crucial step in understanding and processing textual data.
This involves several tasks and techniques to uncover the hierarchical and semantic relationships within language:
Tokenization: Tokenization is the process of breaking text into individual words or tokens. It's the foundational step to find
the basic structure of words in a document. For example, in the sentence "I love NLP," tokenization would result in the
tokens "I," "love," and "NLP.“
Part-of-Speech (POS) Tagging: Part-of-speech tagging assigns grammatical categories (e.g., noun, verb, adjective) to each
word in a sentence. This helps in understanding the grammatical structure of sentences and their components.
Dependency Parsing: Dependency parsing analyzes the grammatical structure of a sentence by establishing syntactic
relationships between words. It identifies which words depend on others in a sentence. For example, in the sentence
"The cat chased the mouse," dependency parsing would reveal that "chased" depends on "cat," and "cat" depends
on "The.“
Named Entity Recognition (NER): NER identifies and classifies named entities in text, such as names of people,
organizations, locations, dates, and more. This helps in understanding the document's content and structure by identifying
important entities.
Document Structure Analysis: Analyzing the organization of a document, including sections, headings, and paragraphs, is
vital for understanding the hierarchy and flow of information within documents, which is essential for tasks like document
summarization and content extraction.
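Several of these analyses can be run at once with a pretrained pipeline. The sketch below assumes spaCy and its small English model (en_core_web_sm) are installed; it is illustrative rather than the only way to do this:

import spacy

nlp = spacy.load("en_core_web_sm")   # pretrained English pipeline
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Tokenization, POS tagging, and dependency parsing
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named Entity Recognition
for ent in doc.ents:
    print(ent.text, ent.label_)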
WORDS AND THEIR COMPONENTS
Orthography:
Explanation: Orthography refers to the standardized system of writing and spelling in a language. It deals with the correct visual
representation of words, including punctuation and capitalization rules.
Example: In English, orthography dictates that "apple" is spelled with two 'p's, and sentences should begin with capital letters and end
with punctuation marks like periods or question marks.
Graphemes:
Explanation: Graphemes are the smallest units of written language that represent individual sounds or characters. Each letter or
character in a written word is a grapheme.
Example: In the word "book," the graphemes are 'b,' 'o,' 'o,' and 'k.' Each of these represents a distinct sound or character.
Phonemes:
Explanation: Phonemes are the smallest distinctive sound units in spoken language. They are the basic building blocks of spoken
words and convey different meanings when substituted.
Example: In English, the words "pat" and "bat" differ only in the initial phoneme (/p/ and /b/, respectively), resulting in different
meanings.
Morphemes:
Explanation: Morphemes are the smallest meaningful units of language. They can be words themselves or parts of words
(prefixes, suffixes, roots) that carry meaning.
Example: In "unhappiness," there are three morphemes: "un-" (a prefix meaning "not"), "happy" (the root word), and "-ness" (a
suffix indicating a state or quality).
Lexicon:
Explanation: The lexicon is a mental or digital dictionary of words in a language, along with their meanings, usage, and
grammatical properties.
Example: The lexicon contains entries for words like "cat" (a noun referring to a feline animal) and "run" (a verb meaning to move
swiftly).
Lemmas:
Explanation: Lemmas are the base or dictionary forms of words. Lemmatization involves reducing words to their lemma form.
Example: The lemma of "running" is "run."
Word Frequency:
Explanation: Word frequency measures how often a word appears in a text or corpus.
Example: "The" is a high-frequency word in English.
Stop Words:
Explanation: Stop words are common words often filtered out in NLP tasks.
Example: "And," "the," and "is" are common stop words in English.
Word Embeddings:
Explanation: Word embeddings represent words as dense vectors capturing semantic relationships.
Example: Word2Vec might represent "king" and "queen" as vectors with similar directions in high-dimensional space.
These components and linguistic features of words are essential for various NLP tasks, including text analysis, information retrieval,
machine translation, and sentiment analysis, as they provide the foundation for understanding and processing natural language text
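As an illustration of word frequency and stop-word filtering, here is a minimal sketch; it assumes NLTK is installed and its stop-word list has been downloaded, and the toy sentence is only an example:

import nltk
from collections import Counter
nltk.download('stopwords')                     # one-time download of the stop-word list
from nltk.corpus import stopwords

text = "the cat sat on the mat and the dog sat on the rug"
tokens = text.split()

# Word frequency: count how often each token occurs
freq = Counter(tokens)
print(freq.most_common(3))                     # e.g. [('the', 4), ('sat', 2), ('on', 2)]

# Stop words: filter out very common function words
stops = set(stopwords.words('english'))
content_words = [w for w in tokens if w not in stops]
print(content_words)                           # ['cat', 'sat', 'mat', 'dog', 'sat', 'rug']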
Problem: Consider the word "unhappiness."
Identify and describe the following components of the word:
Orthography
Morphemes (including prefixes, roots, and suffixes)
Word Stem
Lemma
Part-of-Speech (POS)
Stop Word (if applicable)
Solution:
Orthography: The orthography of the word "unhappiness" consists of the letters and
their arrangement in the word.
Morphemes:
Prefix: "un-" (indicating negation)
Root: "happy"
Suffix: "-ness" (indicating a state)
Word Stem: The stem of the word "unhappiness" is "happi," the form left after removing the prefix and suffix (a stem need not be a valid dictionary word).
Lemma: The lemma is "unhappiness" itself, since the word is already in its dictionary form; its underlying base adjective is "happy."
Part-of-Speech (POS): It is a noun, denoting a state or feeling.
Stop Word (if applicable): "Unhappiness" is not a common stop word.
Problem: You are given the following sentence: "The quick brown fox jumped
over the lazy dog." Analyze the sentence to identify and describe the linguistic
components of the words in it. Provide explanations for each component:
orthography, morphology, lemma, and part-of-speech
Solution (hint):
Word: "The"
Orthography: The word is spelled as "The."
Morphology: It's a single word with no prefixes or suffixes.
Lemma: The lemma is "the" (the base form).
Part-of-Speech (POS): It's an article (DET), indicating specificity.
Word: "quick"
Orthography: The word is spelled as "quick."
Morphology: It's a single word with no prefixes or suffixes.
Lemma: The lemma is "quick."
Part-of-Speech (POS): It's an adjective (ADJ), describing the noun "fox."
(Continue in the same way for the remaining words: "brown," "fox," "jumped," "over," "the," "lazy," and "dog.")
Morphological Models in NLP –
Morphological models in Natural Language Processing (NLP) are used to analyze and generate the morphological
structure of words in a language. Morphology deals with the internal structure of words and how words are formed
through the combination of morphemes, which are the smallest units of meaning in a language.
Stemming Algorithms:
Stemming models aim to reduce words to their root or base form by removing prefixes and suffixes. Stemming is a
heuristic approach and may not always produce valid words, but it can be effective for information retrieval and text
analysis tasks.
Examples: Porter Stemmer, Snowball Stemmer (Porter2), Lancaster Stemmer
Lemmatization Models:
Lemmatization models determine the lemma or base form of a word by considering its grammatical properties and
context. Lemmatization results in valid words and is often used in linguistic analysis.
Examples: WordNet Lemmatizer, spaCy Lemmatizer.
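A minimal sketch contrasting stemming and lemmatization with NLTK (it assumes NLTK is installed and the WordNet data has been downloaded; the example words are only illustrative):

import nltk
nltk.download('wordnet')                         # data needed by the WordNet lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))                   # 'run'
print(stemmer.stem("unhappiness"))               # 'unhappi' - not a valid dictionary word
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' - a valid word, given the verb POS
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'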
Finite-State Morphology:
Finite-state morphology models are based on finite-state transducers (FSTs) and are used for morphological analysis
and generation in various languages. FSTs can represent regular morphological rules.
Examples: Xerox Finite-State Tools (XFST), HFST (Helsinki Finite-State Transducer Technology).
Statistical Models:
Statistical models for morphology use machine learning techniques to learn morphological patterns from data. They
can be applied to languages with limited linguistic resources.
Examples: Conditional Random Fields (CRFs), Hidden Markov Models (HMMs).
Introduction to Finding the Structure of Documents-
An introduction to finding the structure of documents in natural language processing (NLP) involves understanding
the fundamental concept of how text documents are organized and segmented into meaningful parts. This process
is essential for various NLP tasks and text analysis.
In NLP, a "document" refers to any piece of text, which could be as short as a single sentence or as long as an entire
book.
The "structure" of a document refers to how the text is organized into different components or sections, such as
paragraphs, sentences, headings, and more.
The structure of a document is crucial for several reasons:
It helps us understand the document's hierarchy and organization.
It enables efficient navigation and retrieval of specific information.
It forms the basis for many NLP tasks, including text summarization, sentiment analysis, and information
extraction.
Basic Elements of Document Structure:
Documents typically consist of several basic elements:
Paragraphs: Blocks of text that group related sentences together. Paragraphs are often used to introduce and discuss
different ideas or topics.
Sentences: The basic units of meaning within a document. Sentences convey complete thoughts and are often separated
by punctuation marks such as periods, question marks, and exclamation points.
Headings and Sections: In structured documents like reports, articles, or books, headings and sections are used to
organize content hierarchically. Headings indicate the topic or theme of a section, and sections group related content
together.
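To connect this to code, here is a minimal sketch of splitting a small document into paragraphs and sentences; it assumes NLTK is installed and its 'punkt' tokenizer data has been downloaded, and the sample document is only illustrative:

import nltk
nltk.download('punkt')                          # sentence-boundary detection models
from nltk.tokenize import sent_tokenize

document = ("Natural Language Processing (NLP) is a field of AI. "
            "It focuses on the interaction between computers and human languages.\n\n"
            "NLP has many applications. Sentiment analysis and information retrieval are two of them.")

# Paragraphs: split on blank lines
paragraphs = [p for p in document.split("\n\n") if p.strip()]
for i, para in enumerate(paragraphs, 1):
    print(f"Paragraph {i}:")
    # Sentences: detected with NLTK's sentence tokenizer
    for sent in sent_tokenize(para):
        print("  -", sent)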
Document Structure:
1. Introduction
- Definition of NLP
- Scope of NLP
2. Applications of NLP
- Sentiment analysis
- Information retrieval
3. Conclusion
- Future directions in NLP research
"""
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on
the interaction between computers and human languages. It encompasses a wide
range of tasks and applications, making it a diverse and dynamic field.