NLP Q2 21SAL54 Scheme
PART A
PART B
1 a NLP applications:
• Email platforms, such as Gmail, Outlook, etc., use NLP extensively to
provide a range of product features, such as spam classification, priority
inbox, calendar event extraction, auto-complete, etc.
• Voice-based assistants, such as Apple Siri, Google Assistant, Microsoft
Cortana, and Amazon Alexa rely on a range of NLP techniques to
interact with the user, understand user commands, and respond
accordingly.
• Modern search engines, such as Google and Bing, which are the
cornerstone of today’s internet, use NLP heavily for various subtasks,
such as query understanding, query expansion, question answering,
information retrieval, and ranking and grouping of the results, to name
a few.
• Machine translation services, such as Google Translate, Bing Microsoft
Translator, and Amazon Translate are increasingly used in today’s
world to solve a wide range of scenarios and business use cases. These
services are direct applications of NLP.
1 b • Text Classification: Text classification, also known as text
categorization, is the process of assigning predefined categories or
labels to text documents based on their content. Machine learning
algorithms, particularly natural language processing (NLP) techniques,
are commonly used for text classification.
• Information Extraction: Information extraction (IE) is the process of
automatically extracting structured information from unstructured or
semi-structured text. This involves identifying specific entities (such as
names, dates, and locations) and their relationships from a given text.
• Information Retrieval: Information retrieval (IR) is the process of
retrieving relevant
information from a large dataset, typically a collection of documents, in
response to user queries or search terms. Search engines are a common
example of information retrieval systems, where users input keywords,
and the system returns a list of relevant documents.
• Language Modelling: This is the task of predicting what the next word
in a sentence will be based on the history of previous words. The goal
of this task is to learn the probability of a sequence of words appearing
in a given language. Language modeling is useful for building solutions
for a wide variety of problems, such as speech recognition, optical
character recognition, handwriting recognition, machine translation,
and spelling correction.
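The language modelling task described above can be illustrated with a toy bigram model that estimates the probability of the next word from counts in a tiny, made-up corpus (a minimal sketch; real language models are trained on far larger data):

```python
from collections import Counter

# Toy bigram language model: P(next word | previous word) estimated
# from bigram and unigram counts in a tiny illustrative corpus.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # every word that has a successor

def p_next(prev, word):
    # relative frequency: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("the", "cat"))  # 2 of the 3 occurrences of "the" are followed by "cat"
```

Here "the" is followed by "cat" in 2 of its 3 occurrences, so the model assigns P(cat | the) = 2/3.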
Naive Bayes
Naive Bayes is a classic algorithm for classification tasks that mainly relies
on Bayes’ theorem (as is evident from the name). Using Bayes’
theorem, it calculates the probability of observing a class label given the set of
features for the input data. A characteristic of this algorithm is that it assumes
each feature is independent of all other features. For the news classification
example mentioned earlier in this chapter, one way to represent the text
numerically is by using the count of domain-specific words, such as sport-
specific or politics-specific words, present in the text. We assume that these
word counts are not correlated to one another. If the assumption holds, we can
use Naive Bayes to classify news articles. While this is a strong assumption to
make in many cases, Naive Bayes is commonly used as a starting algorithm for
text classification. This is primarily because it is simple to understand and very
fast to train and run.
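The word-count idea described above can be sketched as a minimal multinomial Naive Bayes classifier in pure Python. The toy corpus and class labels below are invented for illustration; a real system would use a library implementation and a much larger training set:

```python
import math
from collections import Counter

# Tiny labeled corpus (illustrative only).
docs = [("sports", "goal match team win"), ("sports", "team score match"),
        ("politics", "vote election party"), ("politics", "party vote debate")]

class_counts = Counter(label for label, _ in docs)
word_counts = {c: Counter() for c in class_counts}
for label, text in docs:
    word_counts[label].update(text.split())
vocab = {w for _, text in docs for w in text.split()}

def predict(text):
    scores = {}
    for c in class_counts:
        # log prior, then add log likelihoods with add-one smoothing;
        # summing logs per word encodes the independence assumption.
        score = math.log(class_counts[c] / len(docs))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("team match win"))  # → "sports" on this toy data
```

Note how each word contributes an independent log-probability term; this is exactly the "features are independent given the class" assumption the text describes.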
3 a Stemming:
Definition: Stemming involves removing suffixes from words to obtain their
root form. The goal is to reduce words to a common base or stem.
Example: Consider the word "running." The stem of this word, obtained
through stemming, would be "run." Similarly, "jumps" would be stemmed to
"jump."
Output (e.g., NLTK's PorterStemmer applied to ['running', 'jumps',
'happily', 'better']): ['run', 'jump', 'happili', 'better']
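Stemming is usually done with a library such as NLTK's PorterStemmer; as a self-contained sketch, the toy suffix-stripping stemmer below shows the general idea. It is not the Porter algorithm, so its output differs on some words (it yields "happi" where Porter yields "happili"):

```python
def naive_stem(word):
    # Strip one common suffix (checked longest-first); a real stemmer
    # like Porter applies many ordered rules with extra conditions.
    for suffix in ("ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undouble a trailing consonant left by stripping, e.g. "runn" -> "run".
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print([naive_stem(w) for w in ["running", "jumps", "happily", "better"]])
# → ['run', 'jump', 'happi', 'better']
```

As the "happi" case shows, stems need not be valid dictionary words; this is the key contrast with lemmatization below.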
Lemmatization:
Definition: Lemmatization, like stemming, aims to reduce words to their base
form. However, lemmatization considers the word's meaning and context,
ensuring that the resulting lemma is a valid word.
Example: Consider the word "better." The lemma of this word, obtained
through lemmatization, would also be "better" (with the default noun POS;
tagged as an adjective, it maps to "good"). Similarly, "running" would be
lemmatized to "run" when tagged as a verb; without a POS tag it is left
unchanged.
Output (e.g., NLTK's WordNetLemmatizer with no POS tags on the same
words): ['running', 'jump', 'happily', 'better']
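A real lemmatizer consults a full dictionary such as WordNet; the sketch below uses a tiny hand-built lexicon (an assumption made purely for illustration) to show why the part-of-speech tag matters: without a verb tag, "running" falls through the lookup unchanged, matching the output above.

```python
# Toy lemma lexicon keyed by (word, POS); a real lemmatizer uses a
# full dictionary plus morphological rules.
LEXICON = {
    ("running", "v"): "run",
    ("jumps", "v"): "jump",
    ("better", "a"): "good",  # comparative adjective maps to its base form
}

def lemmatize(word, pos="n"):
    # Unknown (word, POS) pairs are returned unchanged, so every
    # output is either a dictionary lemma or the original word.
    return LEXICON.get((word, pos), word)

print([lemmatize(w) for w in ["running", "jumps", "happily", "better"]])
# → ['running', 'jumps', 'happily', 'better'] (no POS tags given)
print(lemmatize("running", "v"))  # → 'run'
```

Unlike a stemmer, the lemmatizer only ever returns valid words, at the cost of needing a dictionary and POS information.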
b • Overfitting on small datasets
• Few-shot learning and synthetic data generation
• Domain adaptation
• Cost
• Interpretable models
• Common sense and world knowledge
• On-device deployment
Explain at least four of the above.