Introduction to NLP Basics of Text Processing, Spelling Correction-Edit Distance, Weighted Edit Distance
Module I:
January 22, 2025
UNIT-I
Introduction to NLP: Basics of Text Processing, Spelling
Correction-Edit Distance, Weighted Edit Distance,
Noisy Channel Model for Spelling Correction-Gram
Language Models, Evaluation of Language Models and
Basic Smoothing.
• Text processing in Natural Language Processing (NLP) is the manipulation and analysis
of textual data to extract meaningful information and gain insights.
• It involves several essential steps to preprocess and transform raw text into a format
that can be used for various NLP tasks like sentiment analysis, language translation,
named entity recognition, and text classification.
Tokenization: Tokenization is the process of breaking down a text into individual units, known as
tokens. These tokens can be words, subwords, or characters, depending on the level of granularity
needed for the specific task. Tokenization makes it easier to process and analyze text (e.g., "John is
a person.").
Whitespace Tokenization
Example:
Text: "Hello, world! This is a test."
Tokens: ["Hello,", "world!", "This", "is", "a", "test."]
Advantages:
Simple and fast to implement.
No need for complex rules or libraries.
Disadvantages:
Punctuation remains attached to words (e.g., "world!" instead of "world").
Does not handle contractions or special characters well (e.g., "don't" remains as one
token).
Sensitive to variations in whitespace (e.g., multiple spaces or newlines).
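The behaviour above can be reproduced with a minimal sketch in Python. Note that punctuation stays attached to the neighbouring word, exactly as in the example tokens:

```python
# Whitespace tokenization: split the text on runs of whitespace only.
# Punctuation remains attached to words ("Hello," and "world!").
def whitespace_tokenize(text: str) -> list[str]:
    return text.split()

tokens = whitespace_tokenize("Hello, world! This is a test.")
print(tokens)  # ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
```

Python's bare `str.split()` collapses runs of whitespace; splitting on a single space (`split(" ")`) instead would produce empty tokens around double spaces, which is the whitespace-sensitivity the list above warns about.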
5 January 22, 2025
Simple Tokenization
A simple tokenizer, sometimes referred to as a word tokenizer,
breaks text into tokens based on both whitespace and punctuation.
This type of tokenizer typically removes punctuation and handles
basic cases more intelligently than a whitespace tokenizer.
Example:
Text: "Hello, world! This is a test."
Tokens: ["Hello", "world", "This", "is", "a", "test"]
Advantages:
More accurate tokenization as punctuation is handled separately.
Commonly used in many NLP applications.
Disadvantages:
Slightly more complex than whitespace tokenization.
Still may not handle all edge cases (e.g., contractions, hyphenated
words) without additional rules or customization.
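A simple word tokenizer of this kind can be sketched with a regular expression that keeps runs of word characters and drops punctuation (a sketch only; production tokenizers add rules for contractions, hyphenated words, and similar edge cases):

```python
import re

# Simple (word) tokenization: extract runs of word characters,
# so punctuation is separated from the tokens and discarded.
def simple_tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text)

print(simple_tokenize("Hello, world! This is a test."))
# ['Hello', 'world', 'This', 'is', 'a', 'test']
```

Compare the output with the whitespace tokenizer: "Hello," becomes "Hello", matching the token list in the example above.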
6 January 22, 2025
Stop Word Removal: Stop words are common words that do not carry significant meaning (e.g.,
"the," "is," "and"). Removing them can reduce noise and save computational resources during text
analysis.
Types of Stopwords
Common Stopwords: These are the most frequently occurring words in a language and are
often removed during text preprocessing. Examples include “the,” “is,” “in,” “for,” “where,” “when,”
“to,” “at,” etc.
Custom Stopwords: Depending on the specific task or domain, additional words may be
considered as stopwords. These could be domain-specific terms that don’t contribute much
to the overall meaning. For example, in a medical context, words like “patient” or “treatment”
might be considered as custom stopwords.
Numerical Stopwords: Numbers and numeric characters may be treated as stopwords in
certain cases, especially when the analysis is focused on the meaning of the text rather than
specific numerical values.
Single-Character Stopwords: Single characters, such as “a,” “I,” “s,” or “x,” may be considered
stopwords, particularly in cases where they don’t convey much meaning on their own.
Contextual Stopwords: Words that are stopwords in one context but meaningful in another
may be considered as contextual stopwords. For instance, the word “will” might be a stopword
in the context of general language processing but could be important in predicting future events.
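Stop word removal itself is a simple filter over the token list. The sketch below uses a small hand-picked stopword set for illustration; libraries such as NLTK ship much fuller lists, and custom stopwords can be added to the same set:

```python
# Stop word removal: drop tokens that appear in the stopword set.
# The set here is illustrative, not a complete stopword list.
STOPWORDS = {"the", "is", "and", "in", "for", "to", "at", "a"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    # Lowercase each token for the lookup so "The" matches "the".
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "patient", "is", "in", "the", "ward"]
print(remove_stopwords(tokens))  # ['patient', 'ward']
```

In a medical corpus, adding domain words such as "patient" to `STOPWORDS` would implement the custom-stopword idea described above.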
Lemmatization: Lemmatization reduces words to their dictionary base form, or lemma (e.g., "running" → "run", "better" → "good").
Named Entity Recognition (NER): NER identifies and classifies named entities in the text, such as names
of people, organizations, locations, dates, etc. This helps in extracting relevant information from unstructured
text.
Step 1: Labelled data (10k M); Step 2: Training (sentence, algorithm, model); Step 3: Testing (with the
trained model)
Ambiguity in NER
For a person, the category definition is intuitively quite clear, but for computers, there is some ambiguity in
classification. Let’s look at some ambiguous examples:
England (Organization) won the 2019 world cup vs The 2019 world cup happened in England (Location).
Washington (Location) is the capital of the US vs The first president of the US was Washington (Person).
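The ambiguity can be made concrete with a toy gazetteer lookup (the entity lists below are hypothetical, for illustration only). Because "Washington" appears in both lists, the word alone cannot decide the category; a real NER model must use the surrounding context:

```python
# Toy gazetteer: maps entity categories to known surface forms.
# "Washington" is deliberately listed under both categories.
GAZETTEER = {
    "PERSON":   {"Washington", "Levenshtein"},
    "LOCATION": {"Washington", "England", "US"},
}

def candidate_labels(token: str) -> list[str]:
    # Return every category whose list contains the token.
    return [label for label, names in GAZETTEER.items() if token in names]

print(candidate_labels("Washington"))  # ['PERSON', 'LOCATION']
print(candidate_labels("England"))     # ['LOCATION']
```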
Edit Distance: The edit distance between two strings is the minimum number of single-character
operations (insertions, deletions, substitutions) needed to transform one string into the other.
It is also known as the Levenshtein distance, named after the Soviet mathematician
Vladimir Levenshtein.
The edit distance alone may not always be sufficient for accurate spelling
correction, especially for longer words or words with multiple errors.
These advanced approaches take into account the surrounding words and
the context of the sentence to make more informed suggestions for spelling
corrections.
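The edit distance described above is usually computed with the classic dynamic-programming table, where each insertion, deletion, and substitution costs 1:

```python
# Levenshtein distance via dynamic programming.
# dp[i][j] holds the edit distance between s[:i] and t[:j].
def edit_distance(s: str, t: str) -> int:
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

"kitten" → "sitting" costs 3: substitute k→s, substitute e→i, and insert g.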
Weighted Edit Distance: In NLP, weighted edit distance measures the similarity between two strings
while assigning different costs to different types of operations, reflecting their relative likelihood
or difficulty.
The traditional edit distance assumes that all operations (insertion, deletion,
substitution) have equal cost (usually a cost of 1).
Assigning operation-specific weights instead can lead to more accurate spelling correction,
error detection, and similarity measurement in various NLP tasks.
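The weighted variant uses the same dynamic program, with per-operation costs as parameters. The costs below are illustrative assumptions; in a real spelling corrector they would be estimated from observed error data (e.g., keyboard confusion matrices):

```python
# Weighted edit distance: Levenshtein DP with per-operation costs.
# Costs are illustrative; in practice they are learned from error data.
def weighted_edit_distance(s: str, t: str,
                           ins_cost: float = 1.0,
                           del_cost: float = 1.0,
                           sub_cost: float = 2.0) -> float:
    m, n = len(s), len(t)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + del_cost,
                           dp[i][j - 1] + ins_cost,
                           dp[i - 1][j - 1] + sub)
    return dp[m][n]

# With substitution costing 2, turning "cat" into "car" costs 2.0 --
# one substitution, or equivalently one deletion plus one insertion.
print(weighted_edit_distance("cat", "car"))  # 2.0
```

Setting `ins_cost = del_cost = sub_cost = 1.0` recovers the standard Levenshtein distance, so the weighted form strictly generalizes it.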