Midterm Exam
Text Data Mining (INLS 613)
October 16th, 2013
Answer all of the following questions. Each answer should be thorough, complete, and relevant.
Points will be deducted for irrelevant details. Use the back of the pages if you need more room
for your answer.
The points are a clue about how much time you should spend on each question. Plan your time
accordingly.
There is a 5-point bonus question. You lose nothing by attempting the bonus question.
However, your grade will not be more than 100%.
Good luck!
Question Points
1 15
2 15
3 20
4 20
5 15
6 15
Total 100
Bonus 5
1. Inter-annotator Agreement [15 points]
Predictive analysis of text often requires annotating data. In doing so, one important step is
verifying whether human annotators can reliably detect the phenomenon of interest (e.g.,
whether a product review is positive or negative).
Suppose that two annotators (denoted A and B) independently annotate 500 product reviews
and produce the following contingency matrix. Answer the following questions.
                                Annotator B
                           Positive    Negative
Annotator A   Positive        25           25
              Negative       100          350
(a) What is the inter-annotator agreement between A and B based on accuracy (i.e., the
percentage of times that both annotators agreed)? [7.5 points]
Accuracy = (# times both say Positive + # times both say Negative) / total = (25 + 350) / 500 = 375/500 = 0.75
(b) In this particular case, do you think that accuracy is a good measure of inter-annotator
agreement? Why or why not? [7.5 points]
Accuracy does not take into account the level of agreement due to random chance. Here,
we have two classes. Therefore, if both of these annotators were independently making
random guesses with uniform probability, we would expect them to agree 50% of the
time (on average).
Furthermore, the NEGATIVE class is clearly the majority class, and both annotators favored it:
Annotator A chose the NEGATIVE class 90% of the time (450/500) and Annotator B chose the
NEGATIVE class 75% of the time (375/500). Therefore, if these annotators were making random
guesses consistent with their individual biases (favoring the majority class), we would expect
them to agree more than 50% of the time. More specifically, their expected agreement due to
chance (considering their individual biases) would be 0.10 × 0.25 + 0.90 × 0.75 = 0.70, or 70%,
which is not far below the observed agreement of 75%.
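To make the arithmetic concrete, here is a minimal sketch in Python (not part of the original exam) that computes the observed agreement, the expected chance agreement given each annotator's individual bias, and, although the exam does not ask for it, Cohen's kappa, which corrects accuracy for chance agreement:

# Counts from the contingency matrix above (rows = Annotator A, columns = Annotator B)
both_pos = 25      # A = Positive, B = Positive
a_pos_b_neg = 25   # A = Positive, B = Negative
a_neg_b_pos = 100  # A = Negative, B = Positive
both_neg = 350     # A = Negative, B = Negative
total = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg   # 500

# Observed agreement (accuracy): proportion of reviews on which A and B agree
observed = (both_pos + both_neg) / total                  # 0.75

# Marginal probability of each annotator choosing Positive
p_a_pos = (both_pos + a_pos_b_neg) / total                # 50/500  = 0.10
p_b_pos = (both_pos + a_neg_b_pos) / total                # 125/500 = 0.25

# Expected agreement by chance, given the annotators' individual biases
expected = p_a_pos * p_b_pos + (1 - p_a_pos) * (1 - p_b_pos)   # 0.025 + 0.675 = 0.70

# Cohen's kappa: observed agreement corrected for chance agreement
kappa = (observed - expected) / (1 - expected)            # 0.05 / 0.30 ≈ 0.17

print(f"observed={observed:.2f}  expected={expected:.2f}  kappa={kappa:.2f}")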
2. Training and Testing [15 points]
The goal in predictive analysis is to use training data to learn a model that can make
predictions on new data. Answer the following questions.
(a) Suppose we increased the size of the training set. Would this likely improve or
deteriorate the performance of the model on new data? Why? [7.5 points]
Increasing the size of the training set is likely to improve the model’s performance on
new data. A larger training set exposes the model to more of the natural variability in the
data, which makes it easier to determine which features are truly correlated with the
target class and which are not. If we had only five training instances, for example, it
would be nearly impossible to tell which features are truly correlated with the class and
which merely appear correlated by chance.
(b) Suppose we reduced the feature representation to include only the features with the
highest mutual information with the target concept. Would this likely improve or
deteriorate the performance of the model on new data? Why? [7.5 points]
Reducing the feature representation to the features most correlated with the target class
(those with the highest mutual information) is usually a good thing. It reduces the risk
that the model will latch onto regularities in the training data that look important only
because the training data is limited (i.e., it reduces overfitting).
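As an illustration (not from the exam), here is a short Python sketch of this kind of feature selection using scikit-learn, assuming a bag-of-words representation; the toy reviews, labels, and the choice of k=3 are made-up values for demonstration only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical toy reviews and labels (1 = positive, 0 = negative)
docs = ["great movie, great acting", "terrible plot, terrible pacing",
        "fine but forgettable", "great fun from start to finish"]
labels = [1, 0, 0, 1]

# Build a term-frequency (bag-of-words) feature representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep only the 3 features with the highest estimated mutual information with the label
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_reduced = selector.fit_transform(X, labels)

kept = [term for term, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print("kept features:", kept)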
3. Evaluation Metrics [20 points]
Suppose we train a model to predict whether an email is Spam or Not Spam. After training
the model, we apply it to a test set of 500 new emails (also labeled) and the model produces
the following contingency table.
                                True Class
                            Spam      Not Spam
Predicted      Spam          70          30
Class          Not Spam      70         330
(a) Compute the precision of this model with respect to the Spam class [5 points]
Precision = (# correctly predicted Spam) / (# predicted Spam) = 70 / (70 + 30) = 0.70
(b) Compute the recall of this model with respect to the Spam class [5 points]
Recall = (# correctly predicted Spam) / (# truly Spam) = 70 / (70 + 70) = 0.50
Emily hates seeing spam emails in her inbox! However, she doesn’t mind
periodically checking the “Junk” directory for genuine emails incorrectly marked
as spam.
Simon doesn’t even know where the “Junk” directory is. He would much prefer to
see spam emails in his inbox than to miss genuine emails without knowing!
Which user is more likely to be satisfied with this classifier? Why? [10 points]
In order to answer this question, let’s think about what it means to have high precision
and low recall with respect to SPAM and, conversely, what it means to have high recall
and low precision with respect to SPAM.
High precision and low recall with respect to SPAM: whatever the model classifies as
SPAM is probably SPAM, but many emails that are truly SPAM are misclassified as
NOT SPAM. The user will see some SPAM messages in his/her inbox, but will rarely
have to go to the “Junk” directory to look for genuine messages incorrectly marked as
SPAM.
High recall and low precision with respect to SPAM: the model filters out nearly all of
the SPAM emails, but also incorrectly classifies some genuine emails as SPAM. The user
will rarely see SPAM emails in his/her inbox, but will have to periodically check the
“Junk” directory for genuine emails incorrectly marked as SPAM.
Because this classifier achieves higher precision (0.70) than recall (0.50) with respect to
SPAM, Simon is more likely to be satisfied with it.
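For completeness, a tiny Python sketch (not part of the exam) that recovers the precision and recall figures above directly from the contingency table:

tp = 70    # predicted Spam, truly Spam
fp = 30    # predicted Spam, truly Not Spam
fn = 70    # predicted Not Spam, truly Spam
tn = 330   # predicted Not Spam, truly Not Spam

precision = tp / (tp + fp)   # 70/100 = 0.70: how often a "Spam" prediction is correct
recall = tp / (tp + fn)      # 70/140 = 0.50: how much of the true Spam is caught

print(f"precision={precision:.2f}  recall={recall:.2f}")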
4. Instance-Based Classification [20 points]
A KNN classifier assigns a test instance the majority class associated with its K nearest
training instances. Distance between instances is measured using Euclidean distance.
Suppose we have the following training set of positive (+) and negative (-) instances and a
single test instance (o).
All instances are projected onto a vector space of two real-valued features (X and Y). Answer
the following questions. Assume “unweighted” KNN (every nearest neighbor contributes
equally to the final vote).
[Figure: scatter plot of the training instances in the X–Y feature space. The positive (+)
instances form one large cluster; the five negative (−) instances lie to its upper right; the
test instance (o, marked “test instance”) sits between the positive cluster and the negative
instances, immediately next to a negative instance.]
(a) What would be the class assigned to this test instance for K=1 [5 points]
KNN assigns a test instance the target class associated with the majority of the test
instance’s K nearest neighbors. For K=1, this test instance would be predicted negative
because its single nearest neighbor is negative.
(b) What would be the class assigned to this test instance for K=3 [5 points]
KNN assigns a test instance the target class associated with the majority of the test
instance’s K nearest neighbors. For K=3, this test instance would be predicted negative.
Out of its three nearest neighbors, two are negative and one is positive.
(c) What would be the class assigned to this test instance for K=5 [5 points]
KNN assigns a test instance the target class associated with the majority of the test
instance’s K nearest neighbors. For K=5, this test instance would be predicted positive:
out of its five nearest neighbors, two are negative and three are positive.
(d) Setting K to a large value seems like a good idea. We get more votes! Given this
particular training set, would you recommend setting K = 11? Why or why not?
[5 points]
No. There are only 5 negative instances in the training set, so for any value of K ≥ 11 the
majority of a test instance’s neighbors is guaranteed to be positive. The classifier would
therefore predict positive for every test instance, regardless of where it falls in the
feature space.
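To make the voting procedure concrete, here is a minimal Python sketch of unweighted KNN with Euclidean distance. The training points below are hypothetical; they are not the points from the figure, only a small configuration that happens to reproduce the same pattern of predictions (negative for K=1 and K=3, positive for K=5):

import math
from collections import Counter

def knn_predict(train, test_point, k):
    """Return the majority class among the k training points closest to test_point."""
    # Sort training instances by Euclidean distance to the test point
    by_distance = sorted(train, key=lambda item: math.dist(item[0], test_point))
    # Each of the k nearest neighbors casts one (unweighted) vote
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: ((X, Y) feature vector, class label)
train = [((1.0, 1.0), "+"), ((1.5, 1.2), "+"), ((2.0, 0.8), "+"),
         ((4.0, 4.0), "-"), ((4.2, 3.8), "-")]
test_point = (3.5, 3.5)

for k in (1, 3, 5):
    print(f"K={k}: predicted {knn_predict(train, test_point, k)}")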
5. Instance-Based Classification vs. Naïve Bayes [15 points]
(a) Suppose we have the following data, represented using two real-valued features (X and
Y, as in Question 4) and suppose that our goal is to randomly split this data into a training
set (90%) and a test set (10%) and to train and evaluate a model.
[Figure: scatter plot in the X–Y feature space. Positive (+) and negative (−) instances
alternate along a diagonal band, so every instance’s single nearest neighbor belongs to
the opposite class, yet a single diagonal line could separate the positives from the
negatives.]
Which classifier do you think would have a higher chance of doing well in terms of
accuracy: KNN (with K=1) or Naïve Bayes? Why? [7.5 points]
KNN (with K=1) is almost guaranteed to do terribly if we were to split this dataset into
90% for training and 10% for testing, because every data point’s single nearest neighbor
has the opposite class. Every positive test instance would be predicted negative, and
every negative test instance would be predicted positive.
Naïve Bayes is an example of a linear classifier: during training it effectively fits a line
(with two features) or a hyperplane (with more than two features) that separates the
positive and negative instances.
As the figure shows, there is a diagonal line that, if the model learned it, would classify
the data perfectly. Of course, this is not guaranteed to happen, but at least Naïve Bayes
has a non-zero chance of doing well; KNN with K=1 has essentially none.
(b) Suppose we have the following data, represented using two real-valued features (X and
Y, as in Question 4) and suppose that our goal is to randomly split this data into a training
set (90%) and a test set (10%) and to train and evaluate a model.
[Figure: scatter plot in the X–Y feature space. The instances form four well-separated
clusters in an XOR-like arrangement: positive (+) clusters in the upper-left and
lower-right corners, negative (−) clusters in the upper-right and lower-left corners.]
Which classifier do you think would have a higher chance of doing well in terms of
accuracy: KNN (with K=1) or Naïve Bayes? Why? [7.5 points]
Naïve Bayes is a linear classifier, and in this dataset there is no line we could draw that
perfectly separates the positive instances from the negative instances. So, if we were to
split the data into 90% for training and 10% for testing, Naïve Bayes is guaranteed to get
some test instances wrong.
KNN tends to do well when the data is neatly clustered, as it is here: there is a good
chance that a test instance’s single nearest neighbor belongs to the same cluster and
therefore has the same target class value.
Of course, things could go horribly wrong for KNN (with K=1) if the test set happened to
include ALL of the instances in one of these clusters. That entire cluster would be
misclassified, because the nearest neighbors of its instances would then come from a
neighboring cluster with the opposite class value. The question, however, is whether this
worst-case scenario for KNN (with K=1) is likely to happen with a random 90/10 split. It
is very unlikely (each cluster contains far more than 10% of the data), so KNN (with
K=1) has the higher chance of doing well.
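The contrast can be checked empirically. Below is a hedged Python sketch (not part of the exam) that generates synthetic data arranged roughly like part (b): four tight clusters in an XOR-like layout, with opposite corners sharing a class. Gaussian Naïve Bayes stands in for the linear classifier discussed above; the cluster centers, spread, split sizes, and random seed are all illustrative assumptions:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def cluster(center, n=25):
    """Draw n points tightly scattered around a 2-D center."""
    return np.asarray(center) + 0.1 * rng.standard_normal((n, 2))

# XOR-like layout: opposite corners share a class, so no single line separates the classes
X = np.vstack([cluster((0, 1)), cluster((1, 0)),    # positive clusters (upper left, lower right)
               cluster((1, 1)), cluster((0, 0))])   # negative clusters (upper right, lower left)
y = np.array([1] * 50 + [0] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)

for name, model in [("KNN (K=1)  ", KNeighborsClassifier(n_neighbors=1)),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))

# With a random 90/10 split, KNN (K=1) is typically near-perfect here, while Naive Bayes
# hovers near chance: each class's points average out to the same center, leaving it no
# useful decision boundary.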
6. Zipf’s law and Feature Representation [15 points]
Zipf’s law tells us that in most collections of text, a few terms will occur very frequently and
many terms will occur very infrequently. In other words, if we were to plot term frequency
(Y-axis) as a function of frequency-based rank (X-axis), we obtain the graph below
(with three general regions: A, B, and C).
[Figure: Zipf curve of term frequency (Y-axis) against frequency-based rank (X-axis),
divided into three regions: A (the most frequent terms), B (mid-frequency terms), and
C (the rarest terms).]
(a) What is the justification behind ignoring those terms from region A? [7.5 points]
The features in region A are very frequent. Because they are very frequent, they are not
likely to help discriminate between target class values. Put differently, they are likely to
co-occur with all target class values.
(b) What is the justification behind ignoring those terms from region C? [7.5 points]
The features in region C are very rare. Rare terms are problematic in two respects. First,
they don’t occur frequently enough for a model to determine whether they are really
correlated with a target class value or whether their level of co-occurrence is a statistical
anomaly. Second, because they are rare, they are not likely to occur in the test data.
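In practice, both cuts are often implemented with simple document-frequency thresholds. Here is a hedged Python sketch (not from the course materials) using scikit-learn's CountVectorizer, where max_df discards region-A terms that appear in nearly every document and min_df discards region-C terms that appear in almost none; the documents and the thresholds 0.9 and 2 are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the plot was terrible",
        "the acting was fine", "a great film with the terrible ending"]

# max_df=0.9 drops terms occurring in more than 90% of documents (region A, e.g., "the");
# min_df=2 drops terms occurring in fewer than 2 documents (region C, e.g., "plot")
vectorizer = CountVectorizer(max_df=0.9, min_df=2)
X = vectorizer.fit_transform(docs)
print("kept vocabulary:", sorted(vectorizer.vocabulary_))   # ['great', 'terrible', 'was']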
BONUS: Naïve Bayes [5 points]
Suppose you have the following toy training set of positive (+) and negative (-) movie
reviews. There are only 5 training instances and 3 features.
If we trained a Naïve Bayes model with no smoothing using this training set, what would be
the label assigned to a movie review that just says: “terrible!”? Explain your answer.
Hint: notice that this test instance would have the following feature values: great=0, fine=0,
and terrible=1.
The trick to solving this question quickly is to notice that all of the P(word={0,1}|class={+,-})
statistics estimated from this training set have a non-zero value except for one:
P(terrible=1|+) = 0. P(terrible=1|+) = 0 because not a single positive training instance contains
the term “terrible”. With no smoothing, this zero probability drives the score for the positive
class, P(+) × P(great=0|+) × P(fine=0|+) × P(terrible=1|+), to zero, while all of the
corresponding probabilities for the negative class are non-zero. The model would therefore
label the review “terrible!” as negative.
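Since the exam’s training-set table is not reproduced here, the following Python sketch uses a made-up 5-instance training set chosen only so that, as in the exam, no positive review contains “terrible”; it shows how an unsmoothed zero probability forces the prediction:

# Each instance: (binary feature values for great/fine/terrible, class label).
# HYPOTHETICAL data: not the exam's actual training set.
train = [({"great": 1, "fine": 0, "terrible": 0}, "+"),
         ({"great": 1, "fine": 1, "terrible": 0}, "+"),
         ({"great": 0, "fine": 1, "terrible": 0}, "+"),
         ({"great": 0, "fine": 0, "terrible": 1}, "-"),
         ({"great": 0, "fine": 1, "terrible": 1}, "-")]

def class_score(instance, label):
    """Unsmoothed Naive Bayes score: P(label) * product over features of P(feature=value | label)."""
    in_class = [feats for feats, y in train if y == label]
    score = len(in_class) / len(train)                    # prior P(label)
    for feature, value in instance.items():
        matches = sum(1 for feats in in_class if feats[feature] == value)
        score *= matches / len(in_class)                  # P(feature=value | label), no smoothing
    return score

test = {"great": 0, "fine": 0, "terrible": 1}             # the review "terrible!"
for label in ("+", "-"):
    print(label, class_score(test, label))
# P(terrible=1 | +) = 0 drives the "+" score to zero, so the review is labelled "-".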