Natural Language Processing
Grammar Based LM
Grammar-based language models are a type of statistical language model
that uses formal grammars to represent the underlying structure of language.
Unlike n-gram models, which focus on the probability of sequences of words,
grammar-based models explicitly model the grammatical relationships
between words in a sentence.
Steps
-> Training
-> Parsing
-> Probability calculation (see the sketch after this list)
-> Best parse selection
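To make the probability-calculation step concrete, here is a minimal toy sketch in plain Python. The tiny grammar, its rules, and the probabilities are invented for illustration only; the probability of a parse is simply the product of the probabilities of the rules it uses.

# Toy probabilistic CFG: each entry maps (left-hand side, right-hand side) to a
# rule probability. All numbers are made up for illustration.
pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)):       0.4,
    ("VP", ("V", "NP")):  1.0,
    ("Det", ("the",)):    1.0,
    ("N",  ("dog",)):     0.5,
    ("N",  ("cat",)):     0.5,
    ("V",  ("chased",)):  1.0,
}

def parse_probability(rules_used):
    """Multiply the probabilities of every rule used in a parse tree."""
    prob = 1.0
    for lhs, rhs in rules_used:
        prob *= pcfg[(lhs, rhs)]
    return prob

# Rules used in one possible parse of "the dog chased the cat".
parse = [
    ("S",  ("NP", "VP")),
    ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("dog",)),
    ("VP", ("V", "NP")),  ("V", ("chased",)),
    ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("cat",)),
]
print(round(parse_probability(parse), 4))  # 0.6 * 0.5 * 0.6 * 0.5 = 0.09

The best-parse-selection step would compute this probability for every candidate parse of a sentence and keep the highest-scoring one.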
Advantages
-> Explicit modelling of structure
-> Robustness
Disadvantages
-> Complexity
-> Limited coverage
Applications
-> NLP, speech recognition, computational linguistics
Statistical based LM
Statistical language models (SLMs) are a cornerstone of natural language
processing (NLP), aiming to predict the likelihood of a sequence of words in a
given language. They do this by analyzing vast amounts of text data and
identifying statistical patterns in word usage.
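As a minimal illustration of the idea, the sketch below (plain Python; the tiny corpus is an invented example) estimates bigram probabilities from raw counts: P(w | prev) = Count(prev w) / Count(prev).

from collections import Counter

# A tiny invented corpus, just to show the counting step.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = Count(prev word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 0.25: "the" occurs 4 times, "the cat" once
print(bigram_prob("sat", "on"))   # 1.0: every "sat" in this corpus is followed by "on"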
Advantages
-> Simplicity, efficiency, widely used
Disadvantages
-> Data sparsity (as many word combinations may not appear frequently in the training data), Limited context, Lack of generalization.
Applications
-> Speech recognition, machine translation, text generation, information retrieval.
Regular Expression
A regular expression (regex) is a sequence of characters that defines a search
pattern. Here’s how to write regular expressions:
1. Write your pattern using special characters and literal characters.
2. Use the appropriate function or method to search for the pattern in a
string (see the example below).
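For instance, a short Python sketch using the standard re module (the pattern and sample text are illustrative, not from these notes):

import re

text = "Contact us at support@example.com or sales@example.org."

# A simple (not fully general) email pattern: name part, '@', domain, dot, suffix.
pattern = r"[\w.+-]+@[\w-]+\.\w+"

matches = re.findall(pattern, text)  # find every substring that matches the pattern
print(matches)  # ['support@example.com', 'sales@example.org']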
Finite Automata
● States of Automata: The conditions or configurations of the
machine.
English Morphology
In Natural Language Processing (NLP), morphology plays a crucial role in
understanding the structure and meaning of words. It involves analyzing
words into their constituent parts (morphemes) and understanding how
these parts contribute to the overall meaning.
Example: the word "unhappiness" can be broken into three morphemes: the prefix "un-", the root "happy", and the suffix "-ness".
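A very naive sketch of this idea in Python; the affix lists are tiny and invented, and real morphological analyzers are far more sophisticated (for instance, this sketch does not restore the spelling change from "happi" back to "happy"):

# Tiny, invented affix inventories for illustration only.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def split_morphemes(word):
    """Greedily strip one known prefix and one known suffix, if present."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p + "-")
            word = word[len(p):]
            break
    suffix = ""
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[: -len(s)]
            break
    return parts + [word] + ([suffix] if suffix else [])

print(split_morphemes("unhappiness"))  # ['un-', 'happi', '-ness']
print(split_morphemes("cats"))         # ['cat', '-s']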
Tokenization
Tokenization is a fundamental process in Natural Language Processing (NLP)
that involves breaking down a stream of text into smaller units called tokens.
These tokens can range from individual characters to full words or phrases,
depending on the level of granularity required. By converting text into these
manageable chunks, machines can more effectively analyze and understand
human language.
Types
1. Word Tokenization
This is the most common method, where text is divided into individual words.
It works well for languages with clear word boundaries, like English. For
example, "Machine learning is fascinating" becomes:
["Machine", "learning", "is", "fascinating"]
2. Character Tokenization
The text is split into individual characters, which is useful for languages without clear word boundaries and for handling misspellings or rare words.
3. Subword Tokenization
Words are split into smaller units (subwords), as in Byte Pair Encoding (BPE) or WordPiece, balancing vocabulary size against the ability to handle unseen words (see the sketch below).
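A small sketch of the first two types using only the Python standard library; subword tokenization usually relies on a learned vocabulary (e.g. BPE), so only an illustrative comment is given for it:

text = "Machine learning is fascinating"

# 1. Word tokenization: split on whitespace (real tokenizers also handle punctuation).
word_tokens = text.split()
print(word_tokens)        # ['Machine', 'learning', 'is', 'fascinating']

# 2. Character tokenization: every character becomes a token.
char_tokens = list(text.replace(" ", ""))
print(char_tokens[:7])    # ['M', 'a', 'c', 'h', 'i', 'n', 'e']

# 3. Subword tokenization would split rare words into learned pieces,
#    e.g. "fascinating" -> ["fascin", "##ating"] with a BPE/WordPiece vocabulary.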
Example
Input: "Teh cat sat on teh mat."
________________________________________________________________
Smoothing in NLP
Why Smoothing is Necessary
● Zero probabilities: Word sequences that never occur in the training data
receive zero probability, which leads to:
○ Underestimation of probabilities
○ Inability to handle unseen data
● Overfitting: Unsmoothed models tend to overfit the training
data, meaning they perform poorly on unseen text.
Example
Suppose a bigram model trained on some corpus has the following counts for bigrams beginning with "the":
● Count("the cat") = 10
● Count("the dog") = 5
● Count("the bird") = 0
Laplace Smoothing:
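A minimal worked sketch of Laplace (add-one) smoothing applied to the counts above. The total count of "the" and the vocabulary size V are assumed values chosen only to make the arithmetic concrete:

# Laplace (add-one) smoothing for bigrams starting with "the":
#   P(w | "the") = (Count("the" w) + 1) / (Count("the") + V)
# Count("the") = 20 and vocabulary size V = 1000 are assumed values for illustration.
counts = {"cat": 10, "dog": 5, "bird": 0}
count_the = 20
V = 1000

for word, c in counts.items():
    unsmoothed = c / count_the
    smoothed = (c + 1) / (count_the + V)
    print(f"P({word} | the): unsmoothed = {unsmoothed:.3f}, smoothed = {smoothed:.4f}")

# Note how "the bird" moves from probability 0 to a small non-zero value,
# while the frequent bigrams give up some probability mass.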
Impact of Smoothing
Example of POS Tagging
Consider the sentence: “The quick brown fox jumps over the lazy dog.” A tagger assigns each word a part-of-speech tag, for example: The/DET quick/ADJ brown/ADJ fox/NOUN jumps/VERB over/ADP the/DET lazy/ADJ dog/NOUN.
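As a sketch, NLTK's off-the-shelf tagger can produce such tags. This assumes the nltk package plus its 'punkt' and 'averaged_perceptron_tagger' resources are installed, and the exact Penn Treebank tags may differ slightly from the simplified labels above:

import nltk

# One-time downloads of the tokenizer and tagger models (uncomment if needed).
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Output is a list of (word, Penn Treebank tag) pairs, e.g. ('fox', 'NN'), ('jumps', 'VBZ').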
1. Rule-Based Tagging
Rule-based part-of-speech (POS) tagging involves assigning words their
respective parts of speech using predetermined rules, contrasting with
machine learning-based POS tagging that requires training on annotated text
corpora. In a rule-based system, POS tags are assigned based on specific
word characteristics and contextual cues.
Initial Tags:
Change the tag of “chased” from Verb (V) to Noun (N) because it follows the
determiner “the.”
Updated tags:
● “The” – Determiner (DET)
Challenges of POS Tagging
● Ambiguity: The inherent ambiguity of language makes POS tagging
difficult, since many words can act as different parts of speech depending on context.
● Unknown words: Out-of-vocabulary words can be problematic for POS tagging
systems, since they don’t always appear in the system’s lexicon or training data.
_____________________________________________________________________
Context-Free Grammar
A context-free grammar is a formal grammar that describes the syntax or structure of a formal
language. It is defined by a 4-tuple G = (V, T, P, S), where:
V - It is a set of variables (non-terminals).
T - It is a set of terminals.
P - It is a set of productions, each of the form A → α, where A is a single non-terminal
and α is a string of terminals and/or non-terminals.
S - It is the starting symbol.
The left-hand side of every production can only be a single non-terminal; this
restriction is what makes the grammar context-free.
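A small sketch using NLTK's grammar tools; the toy grammar below is an invented example, not one taken from these notes:

import nltk

# A toy context-free grammar: S is the start symbol, quoted items are terminals.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'dog' | 'cat'
    V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
sentence = "the dog chased the cat".split()
for tree in parser.parse(sentence):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))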
Grammar in NLP
Grammar in NLP is a set of rules for constructing well-formed sentences in a language.
Applying these rules includes identifying parts of speech such as nouns, verbs, and adjectives,
and describing how they combine into phrases and sentences.
Grammar is defined as the rules for forming well-structured sentences. For example, in the
C programming language, the precise grammar rules state how functions are defined, how
statements are written, and how expressions are combined.
Treebanks
In Natural Language Processing (NLP), treebanks are collections of text that have been
annotated with syntactic structure, usually represented as tree-like diagrams (parse trees).
Normal Forms
A normal form is a restricted form of a grammar that maintains the same language
generated by the original grammar. These restricted forms (such as Chomsky Normal Form)
simplify the analysis and processing of the grammar.
Dependency Grammar
Key Concepts
● Dependency: A directed link between two words, indicating that one word (the
head) governs the other word (the dependent).
● Dependency Tree: A tree representation of a sentence, where words are nodes
and the directed links represent dependencies.
How it Works
Example: Consider the sentence “The cat sat on the mat.”
A possible dependency tree for this sentence would look like this:
● sat is the root of the sentence, as it is the main verb.
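A sketch of how such a dependency analysis can be produced with spaCy. It assumes the spacy package and its small English model en_core_web_sm are installed; the exact relation labels depend on the model:

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a dependency parser
doc = nlp("The cat sat on the mat.")

for token in doc:
    # Each word points to its head together with a dependency relation label.
    print(f"{token.text:>5} --{token.dep_}--> {token.head.text}")
# e.g. "cat" --nsubj--> "sat", "sat" --ROOT--> "sat", "mat" --pobj--> "on"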
Advantages
● Flexibility: Suitable for analyzing languages with free word order, where the
position of words in a sentence can vary.
● Simplicity: Dependency trees can be more concise and easier to interpret than
phrase-structure (constituency) trees.
Applications
_____________________________________________________________________
Applications of NLP
Machine Translation
1. Text Analysis: The input text is broken down into smaller units like words,
phrases, or sentences.
2. Language Identification: The source language is identified.
3. Translation: The system uses various techniques like statistical,
rule-based, or neural machine translation to find the most appropriate
equivalent in the target language (see the sketch after this list).
4. Post-Editing: While modern MT systems are quite accurate, human
post-editing is often necessary to refine the translation, especially for
nuanced or complex texts.
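As an illustrative sketch of the translation step, a pretrained neural MT model can be called through the Hugging Face transformers library. This assumes transformers and a compatible backend such as PyTorch are installed; Helsinki-NLP/opus-mt-en-de, named here, is one publicly available English-to-German model:

from transformers import pipeline

# Load a pretrained English-to-German translation model (downloaded on first use).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation makes content accessible to a global audience.")
print(result[0]["translation_text"])  # the German translation produced by the model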
Benefits of Machine Translation in Intelligent Work Processors:
● Efficiency: Translating large volumes of text becomes significantly faster,
saving time and resources.
● Accessibility: Content can be made accessible to a wider global
audience, breaking down language barriers.
● Cost-Effectiveness: Automating translation reduces the need for human
translators, lowering costs.
● 24/7 Availability: MT systems can translate text anytime, anywhere,
providing on-demand access to information.
Examples of Machine Translation Systems:
● Google Translate: A widely used online translation service that supports
numerous languages.
● Microsoft Translator: Integrated into various Microsoft products, offering
real-time translation for text and speech.
● DeepL: A commercial translation service known for its high-quality output,
particularly for European languages.
Man-Machine Interface (MMI)
● Graphical User Interface (GUI): Uses visual elements like icons and
windows for interaction.
● Touchscreen Interface: Allows users to interact directly with the screen
using touch gestures.
● Voice User Interface (VUI): Enables interaction through voice commands.
● Gesture-Based Interface: Relies on body movements and gestures for
control.
● Software: This encompasses the programs and applications that interpret
user input and control the machine's behavior.
● Visual Display: This presents information to the operator, often through
screens, gauges, or indicators.
Types of MMIs:
Applications of MMI:
● Industrial Automation: MMIs are widely used in factories and plants to
control machinery, monitor processes, and manage production lines.
● Transportation: Vehicle dashboards, flight control systems, and traffic
management systems rely on MMIs.
● Healthcare: Medical devices like MRI machines and patient monitoring
systems use MMIs for control and data visualization.
● Consumer Electronics: Remote controls, smartphone interfaces, and
gaming consoles are examples of MMIs in everyday life.
Natural Language Query (NLQ)
How it works:
1. User Input: The user poses a question in natural language, such as "What
were the sales figures for California in Q3 2023?"
2. Natural Language Understanding (NLU): The system analyzes the
user's question to:
○ Identify the intent (e.g., "find sales figures")
○ Extract key entities (e.g., "California," "Q3 2023")
○ Determine the relationships between entities (e.g., "sales figures for
California")
3. Query Translation: The system translates the natural language question
into a formal query language (like SQL) that the database can understand
(see the sketch after this list).
4. Data Retrieval: The database executes the generated query and retrieves
the relevant data.
5. Answer Generation: The system processes the retrieved data and
presents it to the user in a clear and concise format, such as a table, chart,
or summary.
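A deliberately simplified sketch of steps 2 and 3. The intent keywords, entity patterns, and the sales table and columns are all hypothetical; real NLQ systems use far richer language understanding:

import re

def question_to_sql(question: str) -> str:
    """Very naive NLU: pull out a region and a quarter, then build a SQL query."""
    # Hypothetical entity patterns for this toy example.
    region  = re.search(r"for ([A-Z][a-z]+)", question)
    quarter = re.search(r"(Q[1-4])\s+(\d{4})", question)

    sql = "SELECT SUM(amount) FROM sales"  # hypothetical table and column
    conditions = []
    if region:
        conditions.append(f"region = '{region.group(1)}'")
    if quarter:
        conditions.append(f"quarter = '{quarter.group(1)}' AND year = {quarter.group(2)}")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(question_to_sql("What were the sales figures for California in Q3 2023?"))
# SELECT SUM(amount) FROM sales WHERE region = 'California' AND quarter = 'Q3' AND year = 2023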
Benefits of NLQ:
Applications of NLQ:
Speech Recognition
How it works:
● Speech Input: The user speaks into a microphone or other input device.
● Acoustic Analysis: The speech signal is converted into a digital
representation and analyzed to extract features like pitch, intensity, and
frequency.
● Feature Extraction: Key features of the speech signal are extracted and
represented in a numerical format.
● Pattern Recognition: The extracted features are compared to a database
of known speech patterns to identify the most likely words or phrases.
● Language Modeling: The system uses language models to predict the
most probable sequence of words based on the context and grammatical
rules.
● Output: The recognized text is displayed to the user (see the sketch below).
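A short sketch using the third-party SpeechRecognition package, assumed to be installed. The recognize_google backend sends audio to a web service, and "speech.wav" is a placeholder file name:

import speech_recognition as sr

recognizer = sr.Recognizer()

# Read a recorded utterance from a WAV file (a microphone source could be used instead).
with sr.AudioFile("speech.wav") as source:
    audio = recognizer.record(source)

try:
    # The library handles feature extraction and decoding behind this single call.
    text = recognizer.recognize_google(audio)
    print("Recognized text:", text)
except sr.UnknownValueError:
    print("Speech could not be understood.")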
Applications:
Challenges:
● Real-time Processing: Ensuring fast and accurate recognition for
real-time applications.
3. Finance:
● Fraud Detection: Identifying and preventing fraudulent activities by
analyzing patterns in financial documents (e.g., transactions, reports) and
detecting anomalies.
● Risk Assessment: Assessing credit risk by analyzing loan applications,
financial statements, and news articles.
● Investment Analysis: Analyzing news articles, financial reports, and
social media sentiment to inform investment decisions.
4. Healthcare:
5. E-commerce:
6. Legal:
● E-discovery: Analyzing large volumes of legal documents (emails,
contracts, legal briefs) to identify relevant information for legal proceedings.
● Contract Analysis: Automating the process of reviewing and analyzing
contracts to identify potential risks and inconsistencies.