ML Module A7707 - Part1

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on interactions between computers and human language. NLP enables computers to understand, interpret, and generate human language in a way that is meaningful and useful. The origins of NLP can be traced back to the 1950s with early work on machine translation and AI research. NLP has evolved through several phases driven by advancements in technology, including rule-based approaches, statistical NLP, and modern deep learning models. Key applications of NLP include chatbots, machine translation, sentiment analysis, and question answering.

Uploaded by

Vijay Kumar
Copyright
© All Rights Reserved

NATURAL LANGUAGE PROCESSING - NLP
MODULE 1: INTRODUCTION
MANISH CHHABRA
Course Outcomes
After the completion of the course, the student will be able to:
• A77--.1 Identify the structure of words and documents for text preprocessing.
• A77__.2 Choose an approach to parse the given text document.
• A77--.3 Make use of semantic parsing to capture the real meaning of text.
• A77--.4 Select a language model to predict the probability of a sequence of words.
• A77--.5 Examine the various applications of NLP.
Introduction: What is Natural Language Processing (NLP), Origins of NLP, The Challenges of NLP, Phases of NLP,
Language and Grammar. Finding the Structure of Words and Documents: Words and Their Components, Issues and
Challenges, Morphological Models. Finding the Structure of Documents: Introduction, Sentence Boundary Detection,
Topic Boundary Detection, Methods, Complexity of the Approaches, Performances of the Approaches, Features,
Processing Stages

Syntax: Parsing Natural Language, A Data-Driven Approach to Syntax, Stop words, Correcting Words, Stemming,
Lemmatization, Parts of Speech (POS) Tagging, Representation of Syntactic Structure, Parsing Algorithms, Models for
Ambiguity Resolution in Parsing. Semantic Parsing: Introduction,

Semantic Interpretation: Structural Ambiguity, Entity and Event Resolution, System Paradigms, WordSense, Predicate-
Argument Structure, Meaning Representation

Language modeling: Introduction, n-Gram Models, Language Model Evaluation, Parameter Estimation, Types of
Language Models: Class-Based Language Models, MaxEnt Language Models, Neural Network Language Models,
Language-Specific Modeling Problems, Multilingual and Crosslingual Language Modeling.

Applications: Question Answering: History, Architectures, Question Analysis, Search and Candidate Extraction,
Automatic Summarization: Approaches to Summarization, Spoken Dialog Systems: Speech Recognition and
Understanding, Speech Generation, Dialog Manager, Voice User Interface, Information Retrieval: Document
Preprocessing, Monolingual Information Retrieval
Natural Language Processing (NLP)
is a field of artificial intelligence (AI) that focuses on the
interaction between computers and human language. NLP
enables computers to understand, interpret, and generate
human language in a way that is both meaningful and useful.
The origins of Natural Language Processing (NLP)
can be traced back to the mid-20th century. NLP emerged as a field at the intersection of computer science, linguistics, and artificial
intelligence (AI). Here are some key milestones and contributors in the development of NLP.

Machine Translation (1950s):


The field of NLP was heavily influenced by efforts to create machine translation systems. One of the earliest and most famous projects
in this area was the Georgetown-IBM experiment in 1954, which attempted to translate Russian sentences into English.

Early AI Research (1950s-1960s):


During the early years of AI research, there was significant interest in creating computer programs that could understand and generate
human language. Researchers like Allen Newell and Herbert A. Simon worked on early AI projects related to language understanding.

Chomsky's Linguistic Theory (1957):


Noam Chomsky's work on generative grammar and the formal structure of languages had a profound impact on both linguistics and
NLP. His ideas influenced the development of parsing algorithms for language analysis.

The ELIZA Chatbot (1960s):


Joseph Weizenbaum's ELIZA, developed in the mid-1960s, was one of the earliest chatbots. It simulated a Rogerian psychotherapist and
engaged in text-based conversations with users.
Transformational Grammar (1960s):
The development of transformational grammar by Noam Chomsky and others contributed to the formalization of grammar rules,
which played a role in early NLP systems.

Shakey the Robot (1960s-1970s):


Stanford Research Institute developed Shakey the Robot in the late 1960s and early 1970s. While not focused solely on language, it
was one of the early AI systems that interacted with the environment using natural language commands.

Pioneering NLP Systems (1970s):


The 1970s saw the development of some of the first NLP systems, including SHRDLU, a natural language understanding system for
manipulating blocks in a virtual world, and DIALOGUE, a system for understanding and generating English sentences.
Statistical NLP (1980s-1990s):
The statistical approach to NLP gained prominence in the 1980s and 1990s. Researchers began to use probabilistic models for
various NLP tasks, such as part-of-speech tagging and machine translation.

Word Embeddings (2000s-2010s):


Distributed word representations emerged in the 2000s, and word embeddings such as Word2Vec (2013) and GloVe (2014)
revolutionized how words are represented in NLP, capturing semantic relationships and context.

Deep Learning Revolution (2010s):


The advent of deep learning, particularly neural networks, led to significant advancements in NLP. Models like LSTM (Long Short-
Term Memory) and Transformer architectures, such as BERT and GPT, have had a profound impact on NLP tasks, including
language understanding, translation, and generation.
NLP continues to evolve rapidly, driven by advancements in deep learning, large-scale datasets, and computational resources. It now
plays a crucial role in various applications, including chatbots, virtual assistants, sentiment analysis, and machine translation, among
others.
Phases of NLP
Natural Language Processing (NLP) has evolved over several phases, driven by advancements in technology, linguistics, and artificial
intelligence. These phases represent the changing approaches and techniques used in NLP research and applications. While the
boundaries between phases are not rigid, they provide a broad overview of NLP's historical development.
Here are the main phases of NLP:
Rule-Based Approaches (1950s-1970s):
The early phase of NLP focused on rule-based approaches, where linguists and computer scientists manually crafted rules and grammars
to process and understand natural language. This era included projects like machine translation experiments and early chatbots.

Statistical NLP (1980s-1990s):


In this phase, statistical techniques gained prominence. Researchers began to use probabilistic models and machine learning algorithms
to automate the extraction of patterns from large corpora of text. Hidden Markov Models (HMMs) and statistical parsers became
essential tools for tasks like part-of-speech tagging, named entity recognition, and machine translation.

Knowledge-Based NLP (1980s-1990s):


Knowledge-based systems incorporated domain-specific knowledge and ontologies to improve language understanding. This phase aimed
to make NLP systems more context-aware by using semantic networks and expert knowledge.

Hybrid Systems (1990s-2000s):


This phase combined rule-based, statistical, and knowledge-based approaches to tackle various NLP tasks. Researchers integrated
multiple techniques to improve the accuracy and robustness of language processing systems.
Machine Learning and Deep Learning (2010s-Present):
The advent of deep learning, particularly neural networks, has revolutionized NLP. This phase witnessed the rise of neural network-based
models for a wide range of NLP tasks, including recurrent neural networks (RNNs), convolutional neural networks (CNNs), long short-
term memory networks (LSTMs), and the Transformer architecture. Models like BERT, GPT, and their variants have achieved state-of-the-
art performance in tasks such as sentiment analysis, machine translation, and question answering.

Pretrained Language Models (2018-Present):


A notable recent development is the use of large pretrained language models, such as BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pretrained Transformer). These models are pretrained on vast
text corpora and fine-tuned for specific NLP tasks. They have become the backbone of many NLP applications, enabling
efficient transfer learning and improved performance with less task-specific data.

Ethical and Responsible AI (Ongoing):


As NLP technologies advance, there is increasing attention on ethical and responsible AI practices. Researchers and organizations are
addressing issues related to bias, fairness, privacy, and the responsible deployment of NLP systems in real-world applications.
NLP continues to evolve, with ongoing research in areas like multilingual NLP, low-resource languages, explainable AI, and AI ethics. The
field is dynamic, and new phases may emerge as technology and understanding of language processing evolve further.
Natural Language Processing
 Humans communicate through some form of language, either text or speech.
 For computers to interact with humans, they need to understand the natural languages humans use.
 Natural language processing is all about making computers learn, understand, analyse, manipulate and
interpret natural (human) languages.
 NLP stands for Natural Language Processing, a field at the intersection of computer science, human
language (linguistics), and artificial intelligence.
 Processing of natural language is required when you want an intelligent system such as a robot to
perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical
expert system.
 The ability of machines to interpret human language is now at the core of many applications that we
use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice
assistants, and social language translators.
 The input and output of an NLP system can be speech or written text.
Components of NLP
 There are two components of NLP:

Natural Language Understanding (NLU)

Natural Language Generation (NLG)

 Natural Language Understanding (NLU) involves transforming human language into a
machine-readable format.
 It helps the machine understand and analyse human language by extracting information from
large data, such as keywords, emotions, relations, and semantics.

Natural Language Generation (NLG) acts as a translator that converts computerized data
into a natural language representation.

It mainly involves text planning, sentence planning, and text realization.

NLU is harder than NLG.
NLP Terminology

Phonology: The study of organizing sounds systematically.

Morphology: The study of the formation and internal structure of words.

Morpheme: The primitive unit of meaning in a language.

Syntax: The study of the formation and internal structure of sentences.

Semantics: The study of the meaning of sentences.

Pragmatics: Deals with using and understanding sentences in different situations
and how the interpretation of the sentence is affected.

Discourse: Deals with how the immediately preceding sentence can affect the
interpretation of the next sentence.

World Knowledge: The general knowledge about the world.
Steps in NLP
 There are five general steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis:

Lexical analysis involves tokenization and breaking down the input text
into individual words or tokens.

This step establishes the basic structure of words in the text and
removes any extraneous characters or symbols. It's the initial stage of
text processing.
Here's a simple example of lexical analysis.

Input Text:

"Natural language processing (NLP) is a subfield of artificial intelligence."

Lexical Analysis (Tokenization):


"Natural“
Token Type: Word
Explanation: The first word in the text.

"language“
Token Type: Word
Explanation: The second word in the text.

"processing“
Token Type: Word
Explanation: The third word in the text.

"(“
Token Type: Punctuation
Explanation: An opening parenthesis.
"NLP“
Token Type: Word/Acronym
Explanation: An acronym representing "Natural Language Processing.“

")“
Token Type: Punctuation
Explanation: A closing parenthesis.

"is“
Token Type: Word
Explanation: A common word indicating existence or a state.

"a“
Token Type: Word
Explanation: An article indicating an indefinite noun.

"subfield“
Token Type: Word
Explanation: A word describing a specialized area within a field.
"of“
Token Type: Word
Explanation: A common word indicating possession or association.

"artificial“
Token Type: Word
Explanation: A word describing something not occurring naturally.

"intelligence“
Token Type: Word
Explanation: A word referring to the capacity to think and learn.

This breakdown represents the lexical analysis of the input text,


where each word, acronym, or punctuation mark is identified
along with its type and a brief explanation.

Lexical analysis helps in segmenting the text into meaningful units ,,


which is a fundamental step in natural language processing.
SIMPLE EXAMPLE PYTHON CODE (LAB_TASK1) FOR TEXT/LEXICAL ANALYSIS
import re

# Sample text
text = "Natural language processing (NLP) is a subfield of artificial intelligence."

# Tokenization function
def tokenize(text):
    # Use regular expressions to split the text into lowercase words
    words = re.findall(r'\w+', text.lower())
    return words

# Tokenize the text
tokens = tokenize(text)

# Print the tokens
print(tokens)
In the above example:

1. We import the re (regular expressions) module to help us split the text into words. The re
module is Python's regular expressions module, which is used for pattern matching and searching in strings.

2. The sample text is: "Natural language processing (NLP) is a subfield of artificial intelligence."

3. The tokenize function uses regular expressions:

re.findall(): Returns all non-overlapping matches of a pattern in a string, as a list of strings. The
string is scanned left-to-right, and matches are returned in the order found.
re.search(): Scans through the string and returns the first match, or None if the pattern doesn't match.
re.match(): Determines if a pattern matches at the beginning of a string.
re.split(): Splits a string into a list of substrings based on a specified pattern.

4. It converts the text to lowercase to ensure uniformity.

5. The tokens list contains the individual words extracted from the text.
r'\w+':
r prefix: This indicates that the string following it is a raw string,
which means that backslashes \ within the string are treated as
literal characters, making it easier to work with regular expressions.
'\w+': This is the regular expression pattern.
\w: In a regular expression, \w represents a word character, which
includes alphanumeric characters (letters and digits) and underscores.
It matches a single word character.
+: The + quantifier means "one or more occurrences of the preceding
pattern."
So, '\w+' matches one or more word characters in sequence,
effectively capturing words in the text.
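The four re functions described above can be compared side by side on the same sample sentence. This is a minimal sketch using only Python's standard library:

```python
import re

text = "Natural language processing (NLP) is a subfield of artificial intelligence."

# findall: every non-overlapping match, scanned left to right
words = re.findall(r'\w+', text.lower())

# search: the first match anywhere in the string (or None if no match)
first_acronym = re.search(r'[A-Z]{2,}', text)

# match: succeeds only if the pattern matches at the very beginning
starts_with_word = re.match(r'\w+', text)

# split: break the string wherever the pattern (here, whitespace) matches
chunks = re.split(r'\s+', text)

print(words[:3])                 # ['natural', 'language', 'processing']
print(first_acronym.group())     # 'NLP' (first run of 2+ capitals)
print(starts_with_word.group())  # 'Natural'
print(len(chunks))               # 10 whitespace-separated chunks
```

Note that findall with `\w+` drops punctuation entirely, while split on whitespace keeps it attached to words like "(NLP)" — a small but important difference when choosing a tokenization strategy.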
Syntactic Analysis (Parsing):

Syntactic analysis, often referred to as parsing, is the process of


analyzing the grammatical structure of a sentence to understand how
words and phrases relate to one another.
This step helps in understanding the sentence's grammatical rules and
how words are organized.

A sentence such as "The school goes to boy" would be rejected by an English
syntactic analyzer.
For example, consider the input sentence:

"The quick brown fox jumps over the lazy dog."

Syntactic Analysis (Parsing): We can break down the sentence into its grammatical components:

"The" (Determiner): This word functions as a determiner, indicating that a specific noun is coming next.

"quick" (Adjective): It modifies the noun "fox" by describing its quality.

"brown" (Adjective): Similar to "quick," it describes the color of the noun "fox."

"fox" (Noun): The main subject of the sentence. It's the animal that is performing the action.

"jumps" (Verb): The action word in the sentence, indicating what the subject (fox) is doing.

over" (Preposition): It shows the relationship between the action and the object.

"the" (Determiner): Another determiner, specifying a particular object.

"lazy" (Adjective): An adjective describing the quality of the object (dog).

"dog" (Noun): The object of the action, which the fox is jumping over.
EXERCISE: "The cat sat on the mat."

ANSWER :

"The" (Determiner)
"cat" (Noun)
"sat" (Verb)
"on" (Preposition)
"the" (Determiner)
"mat" (Noun)
This analysis provides an understanding of the grammatical
structure and relationships between words in the sentence
"The cat sat on the mat."
Semantic Analysis:
Semantic analysis goes beyond syntax and aims to understand the
meaning of text.
It involves interpreting the relationships between words, phrases, and
sentences and determining the overall meaning of the text.

This step helps in capturing the semantics and intent behind the
language.
For example, let us do a semantic analysis of the previous example:

"The quick brown fox jumps over the lazy dog."

In this sentence, we can infer the following meanings and relationships between words and phrases:

"The" (Determiner): This word indicates that a specific noun is coming next but doesn't provide specific meaning by itself.

"quick" (Adjective): It describes the fox, suggesting that the fox is fast or agile.

"brown" (Adjective): This word describes the color of the fox, indicating that the fox has a brown fur color.

"fox" (Noun): The main subject of the sentence, representing an animal.

"jumps" (Verb): The action word in the sentence, indicating that the fox is leaping or hopping.

"over" (Preposition): It shows the relationship between the action (jumps) and the object (dog).

"the" (Determiner): Another determiner, specifying a particular object.

"lazy" (Adjective): This word describes the dog, suggesting that the dog is not active or energetic.

"dog" (Noun): The object of the action, representing another animal.


Semantic Relationships:
The sentence describes a scenario where a specific fox
(described as quick and brown) is performing an action (jumps) over a specific dog
(described as lazy).

The word choices and descriptions convey a mental image of a fast and agile fox
leaping over a brown, lazy dog.

The semantics of the sentence suggest an action of playfulness or perhaps


avoidance by the fox.

Semantic analysis helps us understand the deeper meaning and relationships


between words in a sentence, which goes beyond the surface-level grammatical
structure. It plays a crucial role in tasks like sentiment analysis,
machine translation, and natural language understanding, where understanding
context and meaning is essential.
Discourse Integration:

Discourse integration deals with understanding how sentences are connected in a larger context,
such as a conversation or a document. It focuses on coherence and cohesion in language and helps
in tracking references, pronouns, and discourse markers to establish context and continuity.

Discourse integration involves understanding how different words, phrases, and clauses in a
sentence relate to each other and to the broader context in which the sentence appears.

EXAMPLE - "The quick brown fox jumps over the lazy dog."

Let's break down the elements of this sentence and discuss discourse integration:
1."The quick brown fox" - This part of the sentence introduces the subject of the sentence, which is the fox. It also
provides some descriptive information about the fox (i.e., it is quick and brown). Discourse integration in this case involves
recognizing that this noun phrase is the entity that will perform the action in the sentence.

2. "jumps over" - This is the verb phrase that indicates the action being performed. Discourse integration involves
understanding the relationship between the subject (the fox) and the action (jumping over) and how they fit together in
the context of the sentence.
3 ."the lazy dog" - This is the object of the action, specifying what the fox is jumping over.

Discourse integration involves recognizing that "the lazy dog" is the target or recipient of the fox's action.

Additionally, discourse integration may involve resolving ambiguities or references. For example, if there were
multiple dogs mentioned earlier in the text, discourse integration would require identifying which
specific "lazy dog" the sentence is referring to.

Overall, discourse integration in NLP is crucial for comprehending the relationships between various linguistic
elements within a sentence and how they contribute to the coherence and meaning of a larger text or
conversation.
Pragmatic Analysis:

Pragmatic analysis takes into account the pragmatic aspects of language, including implied
meaning, speech acts, and context-dependent interpretations.
It considers the social and situational context in which language is used and helps in understanding
the intentions and implications behind language use.

These steps are particularly relevant for deep NLP tasks where a deeper understanding of language is
required,
such as in
question answering,
machine translation,
sentiment analysis, and
chatbots.
In practical NLP applications, not all of these steps are always necessary or may be combined
depending on the specific task and goals. Additionally, modern NLP models, such as transformers,
have shown remarkable performance by jointly modeling multiple levels of linguistic information,
which can reduce the need for explicitly separating these steps.
A pragmatic analysis of the sentence "The quick brown fox jumps over the lazy dog."
involves examining how it conveys meaning beyond its literal interpretation by considering the context, the
speaker's intentions, and implicatures.

Here's a pragmatic analysis:

Basic Information: This sentence tells us about a fast, brown-colored fox and a lazy dog.

Funny Contrast: It's a bit funny because it's describing the fox as fast (quick) and the dog as lazy.
These are opposites, so it's like saying the fox is very active while the dog is very relaxed.

Special Use: Sometimes, people use this sentence because it is a pangram: it contains
every letter of the alphabet, from A to Z.

It's a special sentence often used for testing writing or printing.

Depends on the Situation: What this sentence means can change depending on the situation or the
conversation. It might just be telling a story about a fox and a dog, or it might
be used to show how all the letters of the alphabet can fit in a sentence.
So, its meaning depends on what's happening around it.
In simple terms, it's a sentence that talks about a fast fox and a lazy dog,
and it can be used for different things depending on when and where it's said.
Finding the Structure of Words and Documents is a crucial step in understanding and processing textual data.

This involves several tasks and techniques to uncover the hierarchical and semantic relationships within language:

Tokenization: Tokenization is the process of breaking text into individual words or tokens. It's the foundational step to find
the basic structure of words in a document. For example, in the sentence "I love NLP," tokenization would result in the
tokens "I," "love," and "NLP."

Part-of-Speech (POS) Tagging: Part-of-speech tagging assigns grammatical categories (e.g., noun, verb, adjective) to each
word in a sentence. This helps in understanding the grammatical structure of sentences and their components.

Dependency Parsing: Dependency parsing analyzes the grammatical structure of a sentence by establishing syntactic
relationships between words. It identifies which words depend on others in a sentence. For example, in the sentence
"The cat chased the mouse," dependency parsing would reveal that "cat" depends on "chased" (as its subject), and "The"
depends on "cat."
Named Entity Recognition (NER): NER identifies and classifies named entities in text, such as names of people,
organizations, locations, dates, and more. This helps in understanding the document's content and structure by identifying
important entities.

Document Structure Analysis: Analyzing the organization of a document, including sections, headings, and paragraphs, is
vital for understanding the hierarchy and flow of information within documents, which is essential for tasks like document
summarization and content extraction.
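Of the tasks above, tokenization and sentence-level segmentation can be sketched with the standard library alone. The sentence-boundary rule below (split after ., ! or ? followed by a space) is a deliberate simplification; real systems must handle abbreviations like "Dr.", decimals, and quotations:

```python
import re

document = "I love NLP. The cat chased the mouse! Is NLP hard?"

# Naive sentence boundary detection: split after ., ! or ? followed by whitespace.
# This sketch ignores abbreviations, decimals, etc.
sentences = re.split(r'(?<=[.!?])\s+', document)

# Tokenize each sentence into word tokens.
tokenized = [re.findall(r'\w+', s) for s in sentences]

print(sentences)      # three sentences
print(tokenized[0])   # ['I', 'love', 'NLP']
```

POS tagging, dependency parsing, and NER, by contrast, generally require trained pipelines (e.g., spaCy or NLTK) rather than hand-written rules.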
WORDS AND THEIR COMPONENTS
Orthography:
Explanation: Orthography refers to the standardized system of writing and spelling in a language. It deals with the correct visual
representation of words, including punctuation and capitalization rules.
Example: In English, orthography dictates that "apple" is spelled with two 'p's, and sentences should begin with capital letters and end
with punctuation marks like periods or question marks.

Graphemes:
Explanation: Graphemes are the smallest units of written language that represent individual sounds or characters. Each letter or
character in a written word is a grapheme.
Example: In the word "book," the graphemes are 'b,' 'o,' 'o,' and 'k.' Each of these represents a distinct sound or character.

Phonemes:
Explanation: Phonemes are the smallest distinctive sound units in spoken language. They are the basic building blocks of spoken
words and convey different meanings when substituted.
Example: In English, the words "pat" and "bat" differ only in the initial phoneme (/p/ and /b/, respectively), resulting in different
meanings.
Morphemes:
Explanation: Morphemes are the smallest meaningful units of language. They can be words themselves or parts of words
(prefixes, suffixes, roots) that carry meaning.
Example: In "unhappiness," there are three morphemes: "un-" (a prefix meaning "not"), "happy" (the root word), and "-ness" (a
suffix indicating a state or quality).
Lexicon:
Explanation: The lexicon is a mental or digital dictionary of words in a language, along with their meanings, usage, and
grammatical properties.
Example: The lexicon contains entries for words like "cat" (a noun referring to a feline animal) and "run" (a verb meaning to move
swiftly).

Part-of-Speech (POS) Tags:


Explanation: POS tagging involves assigning grammatical categories (e.g., noun, verb, adjective) to words in a sentence, helping
identify their syntactic role.
Example: In the sentence "The cat sleeps," "cat" is tagged as a noun, and "sleeps" is tagged as a verb.

Lemmas:
Explanation: Lemmas are the base or dictionary forms of words. Lemmatization involves reducing words to their lemma form.
Example: The lemma of "running" is "run."

Word Sense Disambiguation (WSD):


Explanation: Some words have multiple meanings (senses). WSD is the task of determining the correct sense of a word in
context.
Example: In "He went to the bank," "bank" can mean a financial institution or the side of a river, and WSD determines the
intended meaning from context.
Inflection:
Explanation: Inflection refers to changes made to words to indicate grammatical features.
Example: "Walk" (base form) vs. "Walked" (past tense).

Word Frequency:
Explanation: Word frequency measures how often a word appears in a text or corpus.
Example: "The" is a high-frequency word in English.

Stop Words:
Explanation: Stop words are common words often filtered out in NLP tasks.
Example: "And," "the," and "is" are common stop words in English.

Word Embeddings:
Explanation: Word embeddings represent words as dense vectors capturing semantic relationships.
Example: Word2Vec might represent "king" and "queen" as vectors with similar directions in high-dimensional space.

These components and linguistic features of words are essential for various NLP tasks, including text analysis, information retrieval,
machine translation, and sentiment analysis, as they provide the foundation for understanding and processing natural language text.
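Of the components above, word embeddings are the most computational. A minimal sketch shows how similarity is read off embedding vectors via the cosine of the angle between them; the three-dimensional toy vectors are invented for illustration (real embeddings have hundreds of dimensions):

```python
import math

# Toy 3-d "embeddings", invented for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related words point in similar directions, so their cosine is higher.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["apple"]))  # True
```

This is the same computation libraries like Gensim perform when ranking a word's nearest neighbours.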
Problem: Consider the word "unhappiness."
Identify and describe the following components of the word:
Orthography
Morphemes (including prefixes, roots, and suffixes)
Word Stem
Lemma
Part-of-Speech (POS)
Stop Word (if applicable)

Solution:
Orthography: The orthography of the word "unhappiness" consists of the letters and
their arrangement in the word.

Morphemes:
Prefix: "un-" (indicating negation)
Root: "happy"
Suffix: "-ness" (indicating a state)
Word Stem: The stem of the word "unhappiness" is "happi," which is the root
form after removing prefixes and suffixes.

Lemma: The lemma of the word "unhappiness" is "happy," representing the


base or root form without inflections.

Part-of-Speech (POS): The part-of-speech of "unhappiness" is a noun (NN),


representing a state or quality.

Stop Word (if applicable): "Unhappiness" does not contain any common stop
words..
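The morpheme analysis above can be sketched with a toy affix-stripper. The prefix and suffix lists are illustrative assumptions; real morphological analyzers use full lexicons and rules:

```python
PREFIXES = ["un", "re", "dis"]    # illustrative, not exhaustive
SUFFIXES = ["ness", "ing", "ed"]  # illustrative, not exhaustive

def split_morphemes(word):
    """Split a word into (prefix, stem, suffix), any of which may be empty."""
    prefix = suffix = ""
    root = word
    for p in PREFIXES:
        if root.startswith(p):
            prefix, root = p, root[len(p):]
            break
    for s in SUFFIXES:
        if root.endswith(s):
            suffix, root = s, root[:-len(s)]
            break
    return prefix, root, suffix

print(split_morphemes("unhappiness"))  # ('un', 'happi', 'ness')
```

Note the middle element is the stem "happi" (after "y" changed to "i" before "-ness"), matching the solution above, not the lemma "happy"; recovering the lemma requires undoing spelling changes, which simple stripping cannot do.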
Problem: You are given the following sentence: "The quick brown fox jumped
over the lazy dog." Analyze the sentence to identify and describe the linguistic
components of the words in it. Provide explanations for each component-
orthography, morphology, lemma, and part-of-speech

Solution: HINT
Word: "The"
Orthography: The word is spelled as "The."
Morphology: It's a single word with no prefixes or suffixes.
Lemma: The lemma is "the" (the base form).
Part-of-Speech (POS): It's an article (DET), indicating specificity.

Word: "quick"
Orthography: The word is spelled as "quick."
Morphology: It's a single word with no prefixes or suffixes.
Lemma: The lemma is "quick."
Part-of-Speech (POS): It's an adjective (ADJ), describing the noun "fox."
(…and so on for the remaining words.)
Morphological Models in NLP
Morphological models in Natural Language Processing (NLP) are used to analyze and generate the morphological
structure of words in a language. Morphology deals with the internal structure of words and how words are formed
through the combination of morphemes, which are the smallest units of meaning in a language.

Stemming Algorithms:
Stemming models aim to reduce words to their root or base form by removing prefixes and suffixes. Stemming is a
heuristic approach and may not always produce valid words, but it can be effective for information retrieval and text
analysis tasks.
Examples: Porter Stemmer, Snowball Stemmer (Porter2), Lancaster Stemmer
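The heuristic idea behind these stemmers can be sketched with a toy suffix-stripper. This is a simplified illustration, not the real Porter algorithm (which applies ordered rule sets with measure conditions):

```python
# A toy suffix-stripping stemmer illustrating the heuristic behind
# Porter-style algorithms: strip a known suffix if enough of the
# word remains. Note the output need not be a valid word.

SUFFIXES = ["ness", "ing", "ed", "es", "s"]  # checked longest-first

def simple_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least 3 characters of stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("happiness"))  # happi  (a non-word: stemming is heuristic)
print(simple_stem("jumping"))    # jump
```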

Lemmatization Models:
Lemmatization models determine the lemma or base form of a word by considering its grammatical properties and
context. Lemmatization results in valid words and is often used in linguistic analysis.
Examples: WordNet Lemmatizer, spaCy Lemmatizer.

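The lookup idea behind lemmatization can be sketched with a small dictionary. The table below is a hypothetical stand-in for the large lexicons (and POS information) that real lemmatizers such as the WordNet Lemmatizer use:

```python
# A minimal dictionary-based lemmatizer sketch. The lookup table is
# illustrative; real lemmatizers consult full lexicons plus POS tags.

LEMMA_TABLE = {
    "unhappiness": "happy",
    "jumped": "jump",
    "better": "good",
    "mice": "mouse",
}

def lemmatize(word):
    # Fall back to the (lowercased) word itself when it is not in the lexicon.
    return LEMMA_TABLE.get(word.lower(), word.lower())

print(lemmatize("Jumped"))  # jump
print(lemmatize("mice"))    # mouse
```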
Finite-State Morphology:
Finite-state morphology models are based on finite-state transducers (FSTs) and are used for morphological analysis
and generation in various languages. FSTs can represent regular morphological rules.
Examples: Xerox Finite-State Tools (XFST), HFST (Helsinki Finite-State Transducer Technology).
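The transducer mechanism can be illustrated with a hand-rolled character-level FST encoding a single rule (regular English pluralization). Real toolkits such as XFST and HFST compose many such rules into optimized machines; this toy version only shows the state/transition idea:

```python
# A minimal character-level finite-state transducer sketch. Characters
# are copied through by default; on the end-of-word marker '#' the
# machine emits the plural suffix 's'.

TRANSITIONS = {
    ("q0", "#"): ("qf", "s"),  # end of word: output the suffix
}

def pluralize(word):
    state, output = "q0", []
    for ch in word + "#":  # append an explicit end-of-word marker
        # Default transition: copy the character and stay in the same state.
        state, out_ch = TRANSITIONS.get((state, ch), (state, ch))
        output.append(out_ch)
    return "".join(output)

print(pluralize("cat"))  # cats
```

This only handles the regular "-s" rule; irregular forms (e.g. "mouse") would need further transitions or a separate lexical transducer.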

Rule-Based Morphological Models:
Rule-based models use linguistic rules and patterns to perform morphological analysis and generation.
These models are designed by linguists and can handle complex morphological phenomena.
Examples: The Two-Level Morphology Model, Apertium Rule-Based Models.

Statistical Morphological Models:
Statistical models for morphology use machine learning techniques to learn morphological patterns from data. They
can be applied to languages with limited linguistic resources.
Examples: Conditional Random Fields (CRFs), Hidden Markov Models (HMMs).
Introduction to Finding the Structure of Documents
An introduction to finding the structure of documents in natural language processing (NLP) involves understanding
the fundamental concept of how text documents are organized and segmented into meaningful parts. This process
is essential for various NLP tasks and text analysis.

1. Understanding the Significance of Document Structure:

In NLP, a "document" refers to any piece of text, which could be as short as a single sentence or as long as an entire
book.
The "structure" of a document refers to how the text is organized into different components or sections, such as
paragraphs, sentences, headings, and more.
The structure of a document is crucial for several reasons:
It helps us understand the document's hierarchy and organization.
It enables efficient navigation and retrieval of specific information.
It forms the basis for many NLP tasks, including text summarization, sentiment analysis, and information
extraction.
2. Basic Elements of Document Structure:
Documents typically consist of several basic elements:
Paragraphs: Blocks of text that group related sentences together. Paragraphs are often used to introduce and discuss
different ideas or topics.
Sentences: The basic units of meaning within a document. Sentences convey complete thoughts and are often separated
by punctuation marks such as periods, question marks, and exclamation points.
Headings and Sections: In structured documents like reports, articles, or books, headings and sections are used to
organize content hierarchically. Headings indicate the topic or theme of a section, and sections group related content
together.

3. Importance in NLP Tasks:
The ability to identify and understand the structure of documents is foundational for many NLP tasks:
Information Retrieval: Document structure helps search engines retrieve relevant documents and rank them based on
the user's query.
Text Summarization: Summarization algorithms rely on document structure to extract key sentences or paragraphs for
creating concise summaries.
Sentiment Analysis: Understanding the structure of reviews or opinions in a document can help analyze sentiments more
effectively.
Named Entity Recognition (NER): Recognizing entities within documents often involves considering the document's
layout and structure.
# Sample text
document_text = """
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on the interaction between computers and human
languages. It encompasses a wide range of tasks and applications, making it a diverse and dynamic field.

Document Structure:
1. Introduction
- Definition of NLP
- Scope of NLP
2. Applications of NLP
- Sentiment analysis
- Information retrieval
3. Conclusion
- Future directions in NLP research
"""

# Split the document into paragraphs using double line breaks
paragraphs = document_text.split('\n\n')

# Print the paragraphs to understand the document's organization
for i, paragraph in enumerate(paragraphs):
    print(f"Paragraph {i + 1}:\n{paragraph}\n")
OUTPUT :
Paragraph 1:

Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on
the interaction between computers and human languages. It encompasses a wide
range of tasks and applications, making it a diverse and dynamic field.

Paragraph 2:
Document Structure:
1. Introduction
- Definition of NLP
- Scope of NLP
2. Applications of NLP
- Sentiment analysis
- Information retrieval
3. Conclusion
- Future directions in NLP research
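Continuing the example above, the sentences within a paragraph can be separated with a simple regex heuristic. This naive splitter mishandles abbreviations like "Dr." or "e.g.", which is why production systems use trained segmenters (e.g. NLTK's Punkt or spaCy's sentencizer):

```python
import re

def split_sentences(paragraph):
    # Split after '.', '?', or '!' when followed by whitespace.
    pieces = re.split(r"(?<=[.?!])\s+", paragraph)
    return [s.strip() for s in pieces if s.strip()]

text = ("Natural Language Processing (NLP) is a field of AI. "
        "It focuses on human language. Is it useful? Yes!")
print(split_sentences(text))
```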
