NLP Q2 21SAL54 Scheme


SRINIVAS UNIVERSITY

INSTITUTE OF ENGINEERING & TECHNOLOGY


DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
5th Sem B. Tech End Semester Examination, Dec-2023
Subject: Natural Language Processing (21SAL54) - Scheme
Question No. | Question Particulars with Scheme

PART A

1 1. To classify text documents


2. Behavior tree
3. Analysis
4. All the Above
5. Text representation
6. Robot
7. The smallest units of sound in a language
8. Natural Language Toolkit
9. Image recognition
10. Common sense and world knowledge

PART B

1 a NLP applications:
• Email platforms, such as Gmail, Outlook, etc., use NLP extensively to
provide a range of product features, such as spam classification, priority
inbox, calendar event extraction, auto-complete, etc.
• Voice-based assistants, such as Apple Siri, Google Assistant, Microsoft
Cortana, and Amazon Alexa rely on a range of NLP techniques to
interact with the user, understand user commands, and respond
accordingly.
• Modern search engines, such as Google and Bing, which are the
cornerstone of today’s internet, use NLP heavily for various subtasks,
such as query understanding, query expansion, question answering,
information retrieval, and ranking and grouping of the results, to name
a few.
• Machine translation services, such as Google Translate, Bing Microsoft
Translator, and Amazon Translate are increasingly used in today’s
world to solve a wide range of scenarios and business use cases. These
services are direct applications of NLP.
1 b • Text Classification: Text classification, also known as text
categorization, is the process of assigning predefined categories or
labels to text documents based on their content. Machine learning
algorithms, particularly natural language processing (NLP) techniques,
are commonly used for text classification.
• Information Extraction: Information extraction (IE) is the process of
automatically extracting structured information from unstructured or
semi-structured text. This involves identifying specific entities (such as
names, dates, and locations) and their relationships from a given text.
• Information Retrieval: Information retrieval (IR) is the process of retrieving relevant
information from a large dataset, typically a collection of documents, in
response to user queries or search terms. Search engines are a common
example of information retrieval systems, where users input keywords,
and the system returns a list of relevant documents.
• Language Modelling: This is the task of predicting what the next word
in a sentence will be based on the history of previous words. The goal
of this task is to learn the probability of a sequence of words appearing
in a given language. Language modelling is useful for building solutions
for a wide variety of problems, such as speech recognition, optical
character recognition, handwriting recognition, machine translation,
and spelling correction (a count-based sketch follows this list).
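
A count-based bigram model (not part of the original scheme) illustrates the language-modelling idea; the toy corpus below reuses the "dog bites man" sentences from Q5, and the function name is an illustrative choice.

from collections import defaultdict, Counter

# Toy corpus; the sentences reuse the scheme's bag-of-words examples.
corpus = [
    "dog bites man",
    "man bites dog",
    "dog eats meat",
    "man eats food",
]

# Count how often each word follows each history word.
bigram_counts = defaultdict(Counter)
history_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1
        history_counts[prev] += 1

def next_word_probability(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev) from the bigram counts."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / history_counts[prev]

# "dog" is followed once by "bites" and once by "eats", so each continuation gets 0.5.
print(next_word_probability("dog", "bites"))  # 0.5
print(next_word_probability("dog", "eats"))   # 0.5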

2 a Explanation of each module:

2 b Support Vector Machine: The support vector machine (SVM) is another
popular classification algorithm. An SVM can learn both a linear and a nonlinear
decision boundary to separate data points belonging to different classes. A
linear decision boundary learns to represent the data in a way that the class
differences become apparent. An SVM learns an optimal decision boundary so
that the distance between points across classes is at its maximum.

Conditional Random Field: The conditional random field (CRF) is another algorithm used for sequential
data. Conceptually, a CRF essentially performs a classification task on each
element in the sequence. Imagine the same example of POS tagging, where a
CRF tags word by word, classifying each word into one of the parts of speech
from the pool of all POS tags. Since it takes the sequential input and the context
of tags into consideration, it becomes more expressive than the usual
classification methods and generally performs better. CRFs outperform HMMs
for tasks such as POS tagging, which rely on the sequential nature of language.

Hidden Markov Model: The hidden Markov model (HMM) is a statistical
model that assumes there is an underlying, unobservable process with hidden
states that generates the data, i.e., we can only observe the data once it is
generated. An HMM then tries to model the hidden states from this data.

Naive Bayes: Naive Bayes is a classic algorithm for classification tasks that
mainly relies on Bayes' theorem (as is evident from the name). Using Bayes'
theorem, it calculates the probability of observing a class label given the set of
features for the input data. A characteristic of this algorithm is that it assumes
each feature is independent of all other features. For the news classification
example mentioned earlier, one way to represent the text
numerically is by using the count of domain-specific words, such as sport-
specific or politics-specific words, present in the text. We assume that these
word counts are not correlated to one another. If the assumption holds, we can
use Naive Bayes to classify news articles. While this is a strong assumption to
make in many cases, Naive Bayes is commonly used as a starting algorithm for
text classification. This is primarily because it is simple to understand and very
fast to train and run.
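
As a concrete illustration (not from the scheme), the sketch below trains both a Naive Bayes classifier over word counts and a linear SVM with scikit-learn on a few made-up news headlines; the texts, labels, and variable names are assumptions for the example.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny, made-up training set for the sports/politics example.
texts = [
    "the team won the final match",
    "the striker scored a late goal",
    "parliament passed the new budget",
    "the minister announced election dates",
]
labels = ["sports", "sports", "politics", "politics"]

# Naive Bayes over word counts: assumes the counts are independent given the class.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(texts, labels)

# Linear SVM: learns a maximum-margin decision boundary between the classes.
svm_model = make_pipeline(CountVectorizer(), LinearSVC())
svm_model.fit(texts, labels)

print(nb_model.predict(["the team scored in the match"]))    # expected: ['sports']
print(svm_model.predict(["parliament debates the budget"]))  # expected: ['politics']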

3 a Stemming:
Definition: Stemming involves removing suffixes from words to obtain their
root form. The goal is to reduce words to a common base or stem.
Example: Consider the word "running." The stem of this word, obtained
through stemming, would be "run." Similarly, "jumps" would be stemmed to
"jump."
Output : ['run', 'jump', 'happili', 'better']

Lemmatization:
Definition: Lemmatization, like stemming, aims to reduce words to their base
form. However, lemmatization considers the word's meaning and context,
ensuring that the resulting lemma is a valid word.
Example: Consider the word "better." The lemma of this word, obtained
through lemmatization with the default noun part of speech, is still "better."
For "running," the result depends on the part-of-speech tag supplied: with the
default noun tag it stays "running," while with a verb tag it is lemmatized to "run."
Output: ['running', 'jump', 'happily', 'better']
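
The outputs quoted above are consistent with NLTK's Porter stemmer and WordNet lemmatizer applied to a word list such as ['running', 'jumps', 'happily', 'better']; the sketch below is written under that assumption (the input list itself is not stated in the scheme).

# Requires NLTK and the WordNet data: pip install nltk
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # data needed by the lemmatizer

words = ["running", "jumps", "happily", "better"]

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# ['run', 'jump', 'happili', 'better']

lemmatizer = WordNetLemmatizer()
# With no POS tag the lemmatizer treats every word as a noun, so "running" is unchanged.
print([lemmatizer.lemmatize(w) for w in words])
# ['running', 'jump', 'happily', 'better']

# Supplying the verb POS gives the base form mentioned in the definition: "running" -> "run".
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'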
b • Overfitting on small datasets
• Few-shot learning and synthetic data generation
• Domain adaptation
• Cost
• Interpretable models
• Common sense and world knowledge
• On-device deployment
Explanation of at least any four of the above.

4 a Feature Engineering with classical NLP:

b List and discuss the evaluation parameters.


5 a BoW maps each word to a unique integer ID between 1 and |V|. Each document in
the corpus is then converted into a vector of |V| dimensions, where the i-th
component of the vector (with i = w_id, the ID of word w) is simply the number of
times the word w occurs in the document, i.e., we simply score each word in V by its
occurrence count in the document.
Thus, for our toy corpus, where the word IDs are dog = 1, bites = 2, man = 3,
meat = 4, food = 5, eats = 6:
D1 becomes [1 1 1 0 0 0]. This is because the first three words in the vocabulary
appear exactly once in D1, and the last three do not appear at all.
D2 becomes [1 1 1 0 0 0]
D3 becomes [1 0 0 1 0 1]
D4 becomes [0 0 1 0 1 1]
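
A short Python sketch (not part of the scheme) that reproduces these vectors from the word IDs above; scikit-learn's CountVectorizer gives equivalent counts but orders the vocabulary alphabetically rather than by these hand-assigned IDs.

# Bag of words over the toy corpus, using the scheme's word IDs.
word_ids = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

documents = {
    "D1": "dog bites man",
    "D2": "man bites dog",
    "D3": "dog eats meat",
    "D4": "man eats food",
}

def bow_vector(text):
    """Count how many times each vocabulary word occurs in the document."""
    vector = [0] * len(word_ids)
    for word in text.split():
        vector[word_ids[word] - 1] += 1  # IDs are 1-based, list indices 0-based
    return vector

for name, text in documents.items():
    print(name, bow_vector(text))
# D1 [1, 1, 1, 0, 0, 0]
# D2 [1, 1, 1, 0, 0, 0]
# D3 [1, 0, 0, 1, 0, 1]
# D4 [0, 0, 1, 0, 1, 1]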
5 b One-Hot Encoding
D1: Dog bites man.
D2: Man bites dog.
D3: Dog eats meat.
D4: Man eats food.
First map each of the six words to a unique ID: dog = 1, bites = 2, man = 3,
meat = 4, food = 5, eats = 6.
Consider document D1, "dog bites man." Under this scheme, each word becomes a
six-dimensional vector. "Dog" is represented as [1 0 0 0 0 0], as the word
"dog" is mapped to ID 1; "bites" is represented as [0 1 0 0 0 0], and so on. Thus,
D1 is represented as [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]], and D4 ("man eats
food") is represented as [[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]].
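
A matching one-hot sketch (illustrative, not from the scheme) over the same word IDs:

# One-hot encoding: each word becomes a |V|-dimensional vector with a single 1.
word_ids = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    """Six-dimensional vector with a 1 at the position given by the word's ID."""
    vector = [0] * len(word_ids)
    vector[word_ids[word] - 1] = 1
    return vector

def encode(text):
    return [one_hot(word) for word in text.lower().split()]

print(encode("Dog bites man"))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
print(encode("Man eats food"))
# [[0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0]]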
