NLP-Unit-1-part1

NATURAL LANGUAGE PROCESSING

NLP
• Natural Language Processing (NLP) is a branch of artificial
intelligence that deals with the interaction between computers
and human languages. It involves developing algorithms and
models that allow machines to process, analyze, and produce
text or speech in a way that is meaningful to humans.

Some real-world examples of NLP
 Voice Assistants
 Machine Translation
 Chatbots
 Sentiment Analysis
 Text Summarization
 Spell Check and Grammar Correction
 Speech Recognition
 Text-Based Search Engines
 Autocorrect and Autocomplete
Applications of NLP

• 1. Question Answering
• 2. Spam Detection
• 3. Sentiment Analysis
• 4. Machine Translation
• 5. Spelling Correction
• 6. Speech Recognition
• Speech recognition is used for converting spoken words into text. It is used in applications such as mobile devices, home automation, video retrieval, dictation in Microsoft Word, voice biometrics, voice user interfaces, and so on.
• 7. Chatbot
Components of NLP
• There are two components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
• Natural Language Understanding (NLU) transforms human language into a machine-readable format.
• It helps the machine understand and analyze human language by extracting elements such as keywords, emotions, relations, and semantics from large volumes of text.
Components of NLP
• Natural Language Generation (NLG) acts as a translator that
converts the computerized data into natural language
representation.
• It mainly involves Text planning, Sentence planning, and
Text realization.
LEVELS OF NLP

• NLP operates at various levels, from understanding the basic components of language to deep semantic understanding. These levels can be categorized into the lexical, syntactic, semantic, discourse, and pragmatic levels.
LEXICAL LEVEL

• This phase scans the source code as a stream of characters and converts it into meaningful lexemes.

• It divides the whole text into paragraphs, sentences, and words.

• Lexeme: a sequence of characters that forms a syntactic unit in the language.

• Token: a pair consisting of a token name and an optional attribute value, which represents the lexeme in a structured form. Examples of tokens: keywords, identifiers, operators, literals, and punctuation symbols.

• Example: int x = 10;

• Lexical Analysis Steps:

• The code is read character by character: i, n, t, (space), x, (space), =, (space), 1, 0, ;.

• Characters are grouped into lexemes:


int (keyword)
x (identifier)
= (operator)
10 (literal)
; (symbol)
• Lexemes are converted into tokens:
<keyword, int>
<identifier, x>
<operator, =>
<literal, 10>
<symbol, ;>

• Real-World Analogy:

• Sentence: "The cat sat."

• Lexemes: "The", "cat", "sat", "."

• Tokens: Article(The), Noun(cat), Verb(sat), Punctuation(.).
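The grouping and tokenizing steps above can be sketched in a few lines of Python. This is a toy lexer for statements like int x = 10; only, not a full lexical analyzer; the pattern names and their ordering are my own illustrative choices.

```python
import re

# Toy lexer for statements like "int x = 10;" (illustrative only).
# Patterns are tried in this order at each position in the input.
TOKEN_SPEC = [
    ("keyword",    r"\bint\b"),
    ("literal",    r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"="),
    ("symbol",     r";"),
    ("skip",       r"\s+"),       # whitespace separates lexemes
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    """Group characters into lexemes and emit <token-name, lexeme> pairs."""
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(code)
            if m.lastgroup != "skip"]

print(tokenize("int x = 10;"))
# → [('keyword', 'int'), ('identifier', 'x'), ('operator', '='),
#    ('literal', '10'), ('symbol', ';')]
```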


SYNTACTIC LEVEL

• Syntactic analysis is used to check grammar and word arrangement, and it shows the relationships among the words.

• A sentence such as "The school goes to boy" is rejected by an English syntactic analyzer.
SEMANTIC LEVEL

• Semantic analysis is concerned with meaning representation.

• It mainly focuses on the literal meaning of words, phrases, and sentences.

• The semantic analyzer disregards sentences such as "hot ice-cream".

• Example: "The cat sat on the mat."

• Explanation at the semantic level:

• Syntax: the structure of the sentence is correct (Subject → Verb → Object).

• Semantics: the meaning is clear:
"cat" refers to an animal.
"sat" means the action of sitting.
"on the mat" gives the location.

• Now consider "The mat sat on the cat." The syntax is still correct, but the semantics are odd or illogical because mats do not "sit."
DISCOURSE LEVEL

• The discourse level refers to the way language is understood and processed beyond the sentence level, focusing on the connections between sentences and how they form coherent and meaningful texts.

• For example, consider these two sentences:

• "I went to the park."

• "It was a sunny day, so I brought a picnic."

• Interpreting the second sentence requires connecting it to the first: the sunny day explains why a picnic was brought along on the trip to the park.
PRAGMATIC LEVEL

• The pragmatic level refers to how language is used in real-world situations to communicate effectively. It goes beyond the literal meaning of the words (semantics) and focuses on understanding the context and intent behind them.

• Imagine you say, "Can you pass the salt?" Pragmatics is about recognizing that this is not just a question about your ability to pass the salt, but a polite request for someone to pass it.
NATURAL LANGUAGE PROCESSING WITH PYTHON'S NLTK
PACKAGE

• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

• A lot of the data that you could be analyzing is unstructured and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it.

• Now we are going to look at the kinds of text preprocessing tasks you can do with NLTK so that you'll be ready to apply them in future projects.
NLP pipeline
• Step 1: Sentence Segmentation
• Step 2: Word Tokenization
• Step 3: Stemming
• Step 4: Lemmatization
• Step 5: Identifying Stop Words
• Step 6: Dependency Parsing
• Step 7: POS tags
• Step 8: Named Entity Recognition (NER)
• Step 9: Chunking
Sentence Segmentation
• Sentence Segment is the first step for building the NLP pipeline. It breaks the
paragraph into separate sentences.
• Example: Consider the following paragraph -
• Input Text:
"Natural Language Processing is fascinating. It has many applications in AI."
• Segmented Output:
• Sentence 1: "Natural Language Processing is fascinating."
• Sentence 2: "It has many applications in AI."
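The example above can be reproduced with a few lines of Python. NLTK's sent_tokenize is the usual tool for this step; the sketch below uses a simple regular expression instead so it is self-contained, and it will miss cases (such as abbreviations) that NLTK handles.

```python
import re

def segment_sentences(text):
    """Naive segmenter: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Natural Language Processing is fascinating. It has many applications in AI."
for i, sentence in enumerate(segment_sentences(text), start=1):
    print(f"Sentence {i}: {sentence}")
```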
WORD TOKENIZATION
• Word tokenization is used to break a sentence into separate words or tokens.

• Input Sentence:
"Natural Language Processing is fascinating!"

• Tokenized Output:
['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
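NLTK provides word_tokenize for this step; as a self-contained sketch, the same output can be produced with one regular expression that treats punctuation as its own token.

```python
import re

def simple_word_tokenize(sentence):
    # A word is a run of letters/digits; any other non-space character
    # (like '!') becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_word_tokenize("Natural Language Processing is fascinating!"))
# → ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
```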
STEMMING
• Stemming is the process of reducing a word to its root or base form by removing suffixes.

• Example of Stemming: "studies" → "studi" (note that the stem need not be a real word).
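The idea can be sketched by stripping common suffixes. This is only an illustration; NLTK's PorterStemmer applies a much more careful rule cascade. Notice that the output can be a fragment rather than a real English word (e.g., "runn"), which is exactly the behavior discussed under lemmatization.

```python
def crude_stem(word):
    """Strip one common suffix, keeping at least a 3-letter stem (a sketch)."""
    for suffix in ("ies", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["running", "studies", "played"]:
    print(w, "->", crude_stem(w))
# running -> runn   studies -> stud   played -> play
```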
LEMMATIZATION
• Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

• A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.

• Example of Lemmatization. Input words:
['running', 'runner', 'flies', 'easily', 'better']

• Lemmatized output, using part-of-speech (POS) tagging:
['run', 'runner', 'fly', 'easily', 'good']
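Lemmatization consults a lexicon rather than stripping suffixes. The sketch below stands in for that idea with a tiny hand-written lookup table (hypothetical data); NLTK's WordNetLemmatizer does the same kind of (word, POS) lookup against the full WordNet lexicon.

```python
# Tiny hand-written (word, POS) -> lemma table; a hypothetical stand-in
# for a real lexicon such as WordNet.
LEMMAS = {
    ("running", "v"): "run",
    ("flies", "n"): "fly",
    ("better", "a"): "good",
}

def lemmatize(word, pos):
    """Return the lemma if known, otherwise the word unchanged."""
    return LEMMAS.get((word, pos), word)

print(lemmatize("running", "v"))  # → run
print(lemmatize("better", "a"))   # → good
print(lemmatize("easily", "r"))   # → easily (not in the table)
```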
EXAMPLE COMPARISON

Word      Stemmed   Lemmatized (POS = Verb)
running   run       run
easily    easily    easily
better    better    good
IDENTIFYING STOP WORDS

• Stop words are words that you want to ignore, so you filter them out of
your text when you’re processing it. Very common words like 'in', 'is',
and 'an' are often used as stop words since they don’t add a lot of
meaning to a text in and of themselves.

• Note: nltk.download("stopwords")
• Example

• Input Sentence:
"She is going to the market to buy fruits."

• Stop Words Identified:
["she", "is", "to", "the"]

• Remaining Words After Removal:
["going", "market", "buy", "fruits"]
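The removal step can be sketched as follows. The stop-word set here is a small hand-written stand-in; in practice you would load NLTK's full English list via nltk.corpus.stopwords.words("english") after the download noted above.

```python
# Hand-written stop-word set for illustration only.
STOP_WORDS = {"she", "is", "to", "the", "a", "an", "in", "of"}

def remove_stop_words(sentence):
    # Lowercase and strip trailing punctuation before comparing.
    tokens = [t.strip(".,!?").lower() for t in sentence.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(remove_stop_words("She is going to the market to buy fruits."))
# → ['going', 'market', 'buy', 'fruits']
```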
DEPENDENCY PARSING

• Dependency parsing is a process in Natural Language Processing (NLP) that involves analyzing the grammatical structure of a sentence. It establishes relationships between words based on their syntactic dependencies.
CONT..

• Example: Sentence: "She eats an apple."

• Dependencies: eats → She (subject), eats → apple (object)
POS TAGS

• POS stands for parts of speech, which include noun, verb, adverb, and adjective. A POS tag indicates how a word functions, in meaning as well as grammatically, within a sentence. A word can have one or more parts of speech based on the context in which it is used.
EXAMPLE: GOOGLE SOMETHING ON THE INTERNET.

• Google (Verb - VB): here, "Google" is used as a verb, meaning to search for something online. It represents an action.

• something (Pronoun - PRP): a pronoun used as a placeholder for an unspecified object or idea.

• on (Preposition - IN): a preposition that indicates the relationship between the action ("Google") and its location ("the Internet").

• the (Determiner - DT): a determiner specifying a particular instance of "Internet."

• Internet (Noun - NN): a noun referring to the global network.


NAMED ENTITY RECOGNITION
• Definition: NER is a subtask of Information Extraction (IE)
that classifies named entities in text into predefined
categories such as persons, organizations, locations, etc.

• Example: Sentence: "Barack Obama was born in Hawaii on August 4, 1961."

• NER Output:
Barack Obama → Person (PER)
Hawaii → Location (LOC)
August 4, 1961 → Date (DATE)
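A trained model is the normal way to do NER, but the idea can be illustrated with hand-written patterns. The toy rules below (my own, not from any library) catch dates and capitalized spans; unlike a real model, they cannot tell Person from Location, so both come out as a generic NAME.

```python
import re

# Toy rule-based NER: illustrative patterns only, not a trained model.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")

def toy_ner(sentence):
    entities = [(m.group(), "DATE") for m in DATE.finditer(sentence)]
    sentence = DATE.sub("", sentence)   # avoid re-matching "August" as a name
    entities += [(m.group(), "NAME") for m in NAME.finditer(sentence)]
    return entities

print(toy_ner("Barack Obama was born in Hawaii on August 4, 1961."))
# → [('August 4, 1961', 'DATE'), ('Barack Obama', 'NAME'), ('Hawaii', 'NAME')]
```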
CHUNKING

• Chunking is used to collect individual pieces of information and group them into larger units, such as phrases within a sentence.
NATURAL LANGUAGE PROCESSING CHALLENGES
• Ambiguity in Language: Words or sentences can have more than one
meaning depending on the context. For example, the word "bat" could
refer to a flying animal or a piece of sports equipment. NLP systems
need to figure out which meaning is being used.

• Handling Different Languages and Dialects: different languages, and even different dialects within the same language, can have unique rules and vocabulary. An NLP model needs to be able to understand and process text in many different languages and regions, which is challenging. For example, English follows Subject-Verb-Object (SVO) order: "She eats an apple." Japanese follows Subject-Object-Verb (SOV): "She an apple eats."
• Sarcasm: In some cases, people say the opposite of what they mean,
like saying "Great job!" when something went wrong. NLP systems
often struggle to detect sarcasm, which is difficult even for humans.

• Understanding Context: words or sentences change their meaning depending on the context. For example, the phrase "I'm going to the bank" could mean a financial institution or the side of a river. NLP systems must figure out which meaning applies in the given situation.
• Multimodal NLP: Sometimes, NLP systems need to understand both
text and other types of information, like images or audio. This is
challenging because combining these different types of data is
complex.

• Data Scarcity: NLP systems require a lot of data to learn. However, in some cases there may not be enough labeled data for specific languages or tasks. This makes it hard for the model to perform well on those tasks.
• Word Representation: Words are usually represented as numbers in
NLP models. The challenge is to find a way to represent words so that
the model can understand their meaning and relationships. Some
words may have different meanings based on how they are used.

• Named Entity Recognition (NER): this involves identifying names of people, places, organizations, etc., in text. The challenge is that names can be complex (e.g., "New York City" or "Eiffel Tower"), and there may be multiple possible names that need to be identified.
REAL-LIFE APPLICATION: SPELL AND GRAMMAR CHECKERS

•Tools like Microsoft Word, Grammarly, and Google Docs use NLP.

•They analyze your text to find and fix:

•Spelling mistakes (e.g., "teh" → "the").

•Grammar issues (e.g., "He go to school" → "He goes to school").


HOW DO THEY WORK?

• Tokenization: Breaks sentences into words.

• Error Detection: Identifies spelling or grammar errors.

• Suggestions: Provides corrections or alternatives.
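The suggestions step is often built on edit distance: the checker proposes the dictionary word closest to the misspelled one. A minimal sketch, with a toy dictionary of my own choosing:

```python
def edit_distance(a, b):
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

DICTIONARY = ["the", "cat", "hello", "school"]   # toy word list

def suggest(word):
    """Propose the closest dictionary word to a misspelling."""
    return min(DICTIONARY, key=lambda w: edit_distance(word, w))

print(suggest("teh"))  # → the
```

Real checkers refine this with transposition-aware distance (Damerau-Levenshtein) and word-frequency weighting, so common slips like "teh" → "the" win over other near words.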


BENEFITS OF SPELL AND GRAMMAR CHECKERS
• Examples of Popular Tools

• Grammarly, Google Docs, Microsoft Word

• LanguageTool
INFORMATION EXTRACTION
• Definition: a process of automatically extracting structured information from unstructured data (e.g., text).

• Goal: to convert text into a structured format like databases or tables.

• Example: extracting company names and their stock prices from news articles.
WHY IS IE IMPORTANT?
• Purpose: helps in making sense of large volumes of text data.

• Applications: search engines, chatbots, data analytics, sentiment analysis, etc.

• Benefits: saves time, improves decision-making, aids in knowledge discovery.
TYPES OF INFORMATION EXTRACTION
• Named Entity Recognition (NER): identifies entities like people, organizations, dates, and locations.

• Relationship Extraction: identifies relationships between entities. Example:
Sentence: "Steve Jobs founded Apple."
Relation: "Steve Jobs" → founded → "Apple"

• Event Extraction: detects events and their related details. Example:
Sentence: "An earthquake occurred in Japan on March 11, 2011."
Event: Earthquake, Location: Japan, Date: March 11, 2011
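The "founded" relation above can be pulled out with a single hand-written pattern, which is rule-based relationship extraction in miniature (the regex here is my own illustrative rule, not a general solution):

```python
import re

# One pattern for "<Entity> founded <Entity>", where an entity is a
# capitalized word sequence (illustrative rule only).
FOUNDED = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) founded ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

def extract_founded(sentence):
    m = FOUNDED.search(sentence)
    return (m.group(1), "founded", m.group(2)) if m else None

print(extract_founded("Steve Jobs founded Apple."))
# → ('Steve Jobs', 'founded', 'Apple')
```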
HOW DOES INFORMATION EXTRACTION WORK?
• Step-by-step process:

• Text Preprocessing: tokenization, stop-word removal, stemming/lemmatization.

• Entity Recognition: using models (e.g., rule-based or ML-based) to identify entities.

• Pattern Matching: identifying predefined patterns for relationships or events.

• Extraction and Structuring: extracting data and organizing it into structured forms.
TECHNIQUES IN INFORMATION EXTRACTION
• Rule-based Approaches: use predefined rules and patterns to extract information.

• Machine Learning Approaches: train models on labeled data (e.g., supervised learning with features like POS tags).

• Deep Learning Approaches: utilize neural networks and models like BERT and LSTM for better context understanding.
APPLICATIONS OF INFORMATION EXTRACTION
• Business Intelligence: extracting financial data and trends from reports.

• Healthcare: extracting patient information, drug interactions, and symptoms from clinical notes.

• Social Media Analysis: identifying popular trends and user sentiments.

• Legal Field: extracting case laws and related details from legal documents.
QUESTION ANSWERING

• Question Answering is a task in Natural Language Processing (NLP) where a system provides an answer to a question based on a given text or dataset. It is commonly used in search engines, virtual assistants, and customer support systems.
HOW DOES QA WORK?
• Understand the Question: The system figures out what you are asking.

• Example: Question: "When was India's independence?"
It identifies "India's independence" → looking for a date.

• Find the Answer: The system searches for the answer in:
Text Documents, Websites (Google, Wikipedia), Databases

• Example: Text: "India gained independence on August 15, 1947."
The system extracts: "August 15, 1947".

• Give the Answer: Finally, it shows the answer to the user.


TYPES OF QA SYSTEMS
• Closed-Domain QA: focuses on a specific topic.

• Example: if the topic is "Cricket", you ask: "Who won the 2011 Cricket World Cup?" Answer: "India."

• Open-Domain QA: can answer questions from any topic.

• Example: "What is the capital of France?" → Answer: "Paris."

• Reading Comprehension QA: the system answers questions based on a given passage. Example:
Passage: "The Eiffel Tower is located in Paris, France."
Question: "Where is the Eiffel Tower?"
Answer: "Paris, France."
STEPS IN A QA SYSTEM

• 1. Information Retrieval (IR): finds relevant documents or passages. Example:

• Question: "Who discovered gravity?"

• The system searches for articles containing keywords like "discovered" and "gravity".
CONT..

• 2. Answer Extraction

• Extracts the exact answer from the document.

• Example:
Passage: "Isaac Newton discovered gravity in 1687."
Extracted Answer: "Isaac Newton"
DIAGRAM: WORKFLOW OF A QA SYSTEM

• Question → Information Retrieval → Answer Extraction → Output Answer
• Input: Question: "What is the capital of France?"

• Processing: Tokenization: ["What", "is", "the", "capital", "of", "France"]

• Search: Searches knowledge base (e.g., Wikipedia).

• Output:
Answer: "Paris"
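The retrieval step of this workflow can be sketched as keyword overlap: score each sentence in a knowledge base by how many words it shares with the question and return the best match. Real systems add ranking and answer extraction on top; the corpus below is invented for illustration.

```python
# Toy knowledge base (invented for illustration).
CORPUS = [
    "Paris is the capital of France.",
    "India gained independence on August 15, 1947.",
    "Isaac Newton discovered gravity in 1687.",
]

def answer(question):
    """Return the corpus sentence sharing the most words with the question."""
    q_words = set(question.lower().rstrip("?").split())
    def overlap(sentence):
        return len(q_words & set(sentence.lower().rstrip(".").split()))
    return max(CORPUS, key=overlap)

print(answer("What is the capital of France?"))
# → Paris is the capital of France.
```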
WORD SEGMENTATION
• Word Segmentation in NLP (Natural Language Processing) is the
process of breaking down a continuous sequence of text into
individual words. It is especially important for languages that do not
use spaces between words, like Chinese, Japanese, or Thai.

• What is Word Segmentation?

• Imagine you have a sentence without spaces:
"thisisaneasyexample"

• Your task is to find the words and split them correctly:
"this is an easy example"
WHY IS WORD SEGMENTATION IMPORTANT?
• Many languages like English use spaces to separate words. But in some
languages, sentences look like this:

• Chinese: 我喜欢看书 → “我 / 喜欢 / 看书” → I / like / reading

• Thai: ฉันชอบอาหาร → “ฉัน / ชอบ / อาหาร” → I / like / food

• Without proper segmentation, the computer cannot understand where one word ends and another starts.
HOW DOES WORD SEGMENTATION WORK?
• Rule-Based Methods
Using dictionaries or grammar rules to split words.
Example: "thisisapen" → Use a dictionary to identify "this / is / a /
pen".

• Statistical Methods
Use probabilities to determine the most likely word splits.
Example: In "ilikesheep" → Is it "I like sheep" or "I like shee p"?
The system calculates which one is more likely.

• Machine Learning Methods: train a computer to recognize word boundaries from large amounts of data.
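The dictionary-based method above is classically implemented as greedy longest-match ("MaxMatch"): at each position, take the longest dictionary word that fits and move on. A sketch with a toy dictionary:

```python
# Greedy longest-match ("MaxMatch") segmentation against a toy dictionary.
DICTIONARY = {"this", "is", "a", "an", "easy", "example", "i", "like", "sheep"}

def max_match(text):
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):     # try longest candidate first
            if text[i:j] in DICTIONARY:
                words.append(text[i:j])
                i = j
                break
        else:                                 # no dictionary word fits here
            words.append(text[i])             # emit a single character
            i += 1
    return words

print(max_match("thisisaneasyexample"))
# → ['this', 'is', 'an', 'easy', 'example']
```

Greedy matching can go wrong when a long word swallows the start of the next one, which is one reason statistical and machine-learning methods are preferred for languages like Chinese.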
REAL-LIFE APPLICATIONS
• Search Engines: To break search queries into words.

• Machine Translation: Segmenting sentences for accurate translation.

• Speech Recognition: Detecting word boundaries in spoken language.


• Examples

• English Example:
Input: "itiseasytosegment"
Output: "it / is / easy / to / segment"

• Chinese Example:
Input: 我喜欢学习
Output: 我 / 喜欢 / 学习 → "I / like / studying"

• Thai Example:
Input: ฉันชอบแมว
Output: ฉัน / ชอบ / แมว → "I / like / cats"
