

Module I:

UNIT-I
Introduction to NLP: Basics of Text Processing, Spelling Correction-Edit Distance, Weighted Edit Distance, Noisy Channel Model for Spelling Correction, N-Gram Language Models, Evaluation of Language Models and Basic Smoothing.

Basics of Text Processing:

• Text processing in Natural Language Processing (NLP) is the manipulation and analysis
of textual data to extract meaningful information and gain insights.

• It involves several essential steps to preprocess and transform raw text into a format that can be used for various NLP tasks like sentiment analysis, language translation, named entity recognition, and text classification.

The basics of text processing in NLP

 Tokenization: Tokenization is the process of breaking down a text into individual units, known as tokens. These tokens can be words, subwords, or characters, depending on the level of granularity needed for the specific task. Tokenization makes it easier to process and analyze text (for example, the sentence "John is a person." can be split into tokens).

 Whitespace Tokenization: splits text only on whitespace (spaces, tabs, newlines).
Example:
Text: "Hello, world! This is a test."
Tokens: ["Hello,", "world!", "This", "is", "a", "test."]
Advantages:
Simple and fast to implement.
No need for complex rules or libraries.
Disadvantages:
Punctuation remains attached to words (e.g., "world!" instead of "world").
Does not handle contractions or special characters well (e.g., "don't" remains as one
token).
Sensitive to variations in whitespace (e.g., multiple spaces or newlines).
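A minimal Python sketch of whitespace tokenization (illustrative, not from the slides; it reproduces the example above using the standard str.split()):

# Whitespace tokenization: split only on runs of whitespace.
text = "Hello, world! This is a test."
tokens = text.split()  # splits on spaces, tabs, and newlines
print(tokens)  # ['Hello,', 'world!', 'This', 'is', 'a', 'test.']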

 Simple Tokenization
A simple tokenizer, sometimes referred to as a word tokenizer,
breaks text into tokens based on both whitespace and punctuation.
This type of tokenizer typically removes punctuation and handles
basic cases more intelligently than a whitespace tokenizer.
Example:
Text: "Hello, world! This is a test."
Tokens: ["Hello", "world", "This", "is", "a", "test"]
Advantages:
More accurate tokenization as punctuation is handled separately.
Commonly used in many NLP applications.
Disadvantages:
Slightly more complex than whitespace tokenization.
Still may not handle all edge cases (e.g., contractions, hyphenated
words) without additional rules or customization.
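A small sketch of a simple word tokenizer using a regular expression; this is just one possible implementation and drops punctuation entirely:

import re

# Simple tokenization: keep runs of word characters, drop punctuation.
text = "Hello, world! This is a test."
tokens = re.findall(r"\w+", text)
print(tokens)  # ['Hello', 'world', 'This', 'is', 'a', 'test']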

 Lowercasing/Uppercasing: Converting all text to either lowercase or uppercase is a common step to ensure uniformity and avoid duplication of words based on capitalization.
Lowercasing: If the text is converted to lowercase before
tokenization, then uppercase and lowercase words are treated
as the same. ...
Applications: This approach is common in text classification,
sentiment analysis, and other tasks where the specific case of
the words is not critical.
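A tiny illustrative sketch of lowercasing before tokenization:

# Lowercase so "Apple" and "apple" map to the same token.
text = "Apple and apple are counted as one word after lowercasing."
tokens = text.lower().split()
print(tokens[:3])  # ['apple', 'and', 'apple']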

Stop Word Removal: Stop words are common words that do not carry significant meaning (e.g.,
"the," "is," "and"). Removing them can reduce noise and save computational resources during text
analysis.
Types of Stopwords
Common Stopwords: These are the most frequently occurring words in a language and are
often removed during text preprocessing. Examples include “the,” “is,” “in,” “for,” “where,” “when,”
“to,” “at,” etc.
Custom Stopwords: Depending on the specific task or domain, additional words may be
considered as stopwords. These could be domain-specific terms that don’t contribute much
to the overall meaning. For example, in a medical context, words like “patient” or “treatment”
might be considered as custom stopwords.
Numerical Stopwords: Numbers and numeric characters may be treated as stopwords in
certain cases, especially when the analysis is focused on the meaning of the text rather than
specific numerical values.
Single-Character Stopwords: Single characters, such as “a,” “I,” “s,” or “x,” may be considered
stopwords, particularly in cases where they don’t convey much meaning on their own.
Contextual Stopwords: Words that are stopwords in one context but meaningful in another
may be considered as contextual stopwords. For instance, the word “will” might be a stopword
in the context of general language processing but could be important in predicting future events.
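A short sketch of stop word removal. The stop word list is hard-coded here for illustration; in practice a library list such as NLTK's stopwords corpus is often used (which requires downloading the 'stopwords' data first):

# Remove common stop words from a token list.
stop_words = {"the", "is", "and", "in", "to", "a"}  # small illustrative list
tokens = ["the", "patient", "is", "responding", "to", "the", "treatment"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['patient', 'responding', 'treatment']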

The basics of text processing in NLP


 Special Character Removal: Eliminating punctuation marks, symbols, and other special
characters can simplify text and help focus on the essential content.
Noise Reduction: Punctuation and special characters often don't carry
significant semantic meaning on their own. Removing them can help reduce
the noise in the text and make it easier for NLP models to focus on the
meaningful words and phrases.
 Stemming and Lemmatization: These techniques aim to reduce words to their base or
root form. Stemming chops off suffixes (e.g., "running" to "run"), while lemmatization
transforms words to their base dictionary form (e.g., "better" to "good").
Stemming
It is just like cutting down the branches of a tree to its stems. For
example, the stem of the words eating, eats, eaten is eat.

The basics of text processing in NLP

 Lemmatization

Lemmatization is similar to stemming, but the output after lemmatization is called a 'lemma', which is a root word rather than a root stem (the output of stemming). After lemmatization, we get a valid dictionary word that means the same thing.
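A hedged sketch contrasting stemming and lemmatization with NLTK's PorterStemmer and WordNetLemmatizer (assumes NLTK and its 'wordnet' data are installed; exact outputs can vary slightly by version):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes; lemmatization maps to a dictionary form.
print(stemmer.stem("eating"))                   # typically 'eat'
print(lemmatizer.lemmatize("eating", pos="v"))  # typically 'eat'
print(lemmatizer.lemmatize("better", pos="a"))  # typically 'good'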

The basics of text processing in NLP

 Part-of-Speech (POS) Tagging: POS tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in the text. This information is useful for understanding the syntactic structure of the text and for subsequent analysis.

POS Word Relationship



 Workflow of POS Tagging in NLP


 The following are the processes in a typical natural language processing (NLP)
example of part-of-speech (POS) tagging:
 Tokenization: Divide the input text into discrete tokens, which are usually units of
words or subwords. The first stage in NLP tasks is tokenization.
 Loading Language Models: To utilize a library such as NLTK or SpaCy, be sure to
load the relevant language model. These models offer a foundation for
comprehending a language’s grammatical structure since they have been trained on
a vast amount of linguistic data.
 Text Processing: If required, preprocess the text to handle special characters,
convert it to lowercase, or eliminate superfluous information. Correct PoS labeling is
aided by clear text.
 Linguistic Analysis & Part-of-Speech Tagging: To determine the text’s
grammatical structure, use linguistic analysis. This entails understanding each word’s
purpose inside the sentence, including whether it is an adjective, verb, noun, or other.
 Results Analysis: Verify the accuracy and consistency of the PoS tagging findings
with the source text. Determine and correct any possible problems or mistagging.
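A minimal sketch of the workflow above using NLTK (illustrative; assumes the 'punkt' and 'averaged_perceptron_tagger' NLTK data have been downloaded):

import nltk

# Tokenize, then assign a part-of-speech tag to each token.
text = "John is reading a good book."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))  # e.g., [('John', 'NNP'), ('is', 'VBZ'), ('reading', 'VBG'), ...]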

The basics of text processing in NLP

 Named Entity Recognition (NER): NER identifies and classifies named entities in the text, such as names
of people, organizations, locations, dates, etc. This helps in extracting relevant information from unstructured
text.
 Step 1: Labelled data (10k M), Step 2: Training (sentence, algorithm, model), Step 3: Test (with the model)
Ambiguity in NER
For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in
classification. Let’s look at some ambiguous examples:
 England (Organization) won the 2019 world cup vs The 2019 world cup happened in England (Location).
 Washington (Location) is the capital of the US vs The first president of the US was Washington (Person).

Named Entity Recognition (NER) Methods


Lexicon-Based Method (dictionary lookup)
Rule-Based Method (set of predefined rules)
Machine Learning-Based Method (requires a lot of labelled data)
Deep Learning-Based Method (learns the semantic and syntactic relationships between words)
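A hedged sketch of NER with spaCy (assumes spaCy and the en_core_web_sm model are installed; the labels produced depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The 2019 world cup happened in England.")

# Each named entity is a text span with a label such as DATE, GPE, or EVENT.
for ent in doc.ents:
    print(ent.text, ent.label_)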

The basics of text processing in NLP


 Sentiment Analysis: Sentiment analysis determines the sentiment or emotion
expressed in a given piece of text, which can be positive, negative, or neutral. It
is widely used in social media monitoring and customer feedback analysis.

 Why sentiment analysis is important for business:


 Customer Feedback Analysis:
 Brand Reputation Management:
 Product Development and Innovation:
 Competitor Analysis:
 Marketing Campaign Effectiveness
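A small hedged sketch of rule-based sentiment scoring with NLTK's VADER (requires the 'vader_lexicon' data; the exact scores are illustrative):

from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# The 'compound' score ranges from -1 (most negative) to +1 (most positive).
print(analyzer.polarity_scores("The product is great and support was helpful."))
print(analyzer.polarity_scores("Terrible experience, I want a refund."))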

 What are the challenges in Sentiment Analysis?

 There are major challenges in the sentiment analysis approach:


 If the sentiment is carried by tone, it becomes really difficult to detect whether the comment is pessimistic or optimistic.
 If the data is in the form of emoji, then you need to detect whether it is good or bad.
 Detecting ironic, sarcastic, or comparative comments is really hard.
 Handling neutral statements is also a big task.

The basics of text processing in NLP

 Text Vectorization: In order to apply machine learning algorithms to text data, it needs to be converted into numerical vectors. Techniques like Bag-of-Words, TF-IDF (Term Frequency-Inverse Document Frequency), and word embeddings (e.g., Word2Vec, GloVe) are used for this purpose.
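A brief sketch of Bag-of-Words and TF-IDF vectorization with scikit-learn (illustrative; assumes scikit-learn is installed):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Bag-of-Words: raw term counts per document.
bow = CountVectorizer().fit_transform(docs)

# TF-IDF: counts re-weighted by how rare each term is across documents.
tfidf = TfidfVectorizer().fit_transform(docs)

print(bow.shape, tfidf.shape)  # both are (2, number_of_unique_terms)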

The basics of text processing in NLP

 Language Model: Language models, like GPT (Generative Pre-trained Transformer), are large neural networks trained on vast amounts of text data. They can understand and generate human-like text, making them useful for various NLP tasks.
 N-gram
 N-gram can be defined as the contiguous sequence of n items from a
given sample of text or speech. The items can be letters, words, or
base pairs according to the application. The N-grams typically are
collected from a text or speech corpus (A long text dataset).
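A tiny pure-Python sketch of extracting word n-grams from a token list:

def ngrams(tokens, n):
    # Contiguous sequences of n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a sample text".split()
print(ngrams(tokens, 2))  # [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'text')]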

Spelling Correction-Edit Distance:

 Spelling correction using edit distance is a common technique in Natural Language Processing (NLP) and is based on the concept of measuring the similarity between two words by counting the number of operations required to transform one word into the other.

 The edit distance is also known as Levenshtein distance, named after the Soviet
mathematician Vladimir Levenshtein.

 Edit distance measures the minimum number of insertions, deletions, or substitutions required to convert one word into another. Each operation has a cost of 1, and the goal is to find the sequence of operations with the lowest total cost.

Let's take an example to illustrate the concept:


 Original word: "kitten"

 Target word: "sitting"

 Step 1: Substitute 'k' with 's': "sitten" (cost: 1)

 Step 2: Substitute 'e' with 'i': "sittin" (cost: 1)

 Step 3: Insert 'g' at the end: "sitting" (cost: 1)

 Total cost = 1 + 1 + 1 = 3

 So, the edit distance between "kitten" and "sitting" is 3.
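A compact sketch of the standard dynamic-programming computation of Levenshtein distance, which confirms the value above (illustrative implementation):

def edit_distance(a, b):
    # dp[i][j] = minimum number of edits to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(b) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3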



 The edit distance alone may not always be sufficient for accurate spelling
correction, especially for longer words or words with multiple errors.

 In practice, more sophisticated methods, such as using language models and context-aware techniques, are often employed to improve the accuracy of spelling correction in NLP applications.

 These advanced approaches take into account the surrounding words and
the context of the sentence to make more informed suggestions for spelling
corrections.

Weighted Edit Distance:


 Weighted Edit Distance is an extension of the traditional edit distance (Levenshtein
distance) that assigns different costs or weights to insertion, deletion, and
substitution operations.

 In NLP, this technique is used to measure the similarity between two strings while
considering the relative importance or difficulty of different types of operations.

 The traditional edit distance assumes that all operations (insertion, deletion,
substitution) have equal costs (usually a cost of 1).

 However, in many real-world scenarios, certain operations might be more or less likely to occur, or some operations might be more harmful than others.

 For example, in some applications, a substitution error might be considered more severe than a simple insertion or deletion.

 Weighted edit distance introduces different costs for each operation.

 The weights can be based on various factors, such as the frequency of certain errors in a given dataset, the context of the application, or expert knowledge.

Let's illustrate this with an example:

 Original word: "intention"
 Target word: "execution"
 We have the following weighted costs:
 Insertion: 1.5
 Deletion: 1.0
 Substitution: 2.0
 Step 1: Delete 'i': "ntention" (cost: 1.0)
 Step 2: Substitute 'n' with 'e': "etention" (cost: 2.0)
 Step 3: Substitute 't' with 'x': "exention" (cost: 2.0)
 Step 4: Insert 'c' after "exe": "execntion" (cost: 1.5)
 Step 5: Substitute 'n' with 'u': "execution" (cost: 2.0)
 Total cost = 1.0 + 2.0 + 2.0 + 1.5 + 2.0 = 8.5
 So, the weighted edit distance between "intention" and "execution" under these costs is 8.5.
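The same dynamic program works with per-operation weights; a hedged sketch using the costs from the example above:

def weighted_edit_distance(a, b, ins=1.5, dele=1.0, sub=2.0):
    # dp[i][j] = minimum weighted cost to turn a[:i] into b[:j]
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * dele
    for j in range(1, len(b) + 1):
        dp[0][j] = j * ins
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub)
            dp[i][j] = min(dp[i - 1][j] + dele,  # deletion
                           dp[i][j - 1] + ins,   # insertion
                           diag)                 # substitution or match
    return dp[len(a)][len(b)]

print(weighted_edit_distance("intention", "execution"))  # 8.5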

 By introducing weighted costs, we can customize the edit distance calculation to better suit the specific needs of the application.

 This can lead to more accurate spelling correction, error detection, and similarity measurement in various NLP tasks.

 However, determining the appropriate weights might require domain knowledge or experimentation with the data to achieve the desired results.

Noisy Channel Model

 The noisy channel model is a framework that computers use to check spelling, answer questions, recognize speech, and perform machine translation.
 It aims to determine the correct word if you type its misspelled
version or mispronounce it.
 The noisy channel model can correct several typing mistakes,
including missing letters (changing “leter” to “letter”),
accidental letter additions (replacing “misstake” with
“mistake”), swapped letters (changing “recieve” to “receive”),
and replacing incorrect letters (replacing “fimite” with “finite”).
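Conceptually, the noisy channel picks the candidate word w that maximizes P(w | x), which is proportional to P(x | w) * P(w), where x is the observed (misspelled) word, P(x | w) is the error (channel) model and P(w) is the language model. A toy sketch with made-up probabilities (all numbers are purely illustrative):

# Observed (noisy) word and a few candidate corrections.
observed = "leter"

# P(w): language-model probabilities of the candidates -- illustrative values only.
prior = {"letter": 0.0006, "later": 0.0009, "liter": 0.0001}

# P(x | w): probability the channel corrupts w into "leter"
# (e.g., estimated from confusion matrices of typing errors) -- illustrative values only.
channel = {"letter": 0.02, "later": 0.005, "liter": 0.01}

# Choose the candidate with the highest P(x | w) * P(w).
best = max(prior, key=lambda w: channel[w] * prior[w])
print(best)  # 'letter'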
