Midterm Exam
Text Data Mining (INLS 613)
October 16th, 2013
Answer all of the following questions. Each answer should be thorough, complete, and relevant.
Points will be deducted for irrelevant details. Use the back of the pages if you need more room
for your answer.
The points are a clue about how much time you should spend on each question. Plan your time
accordingly.
There is a 5-point bonus question. You lose nothing by attempting the bonus question.
However, your grade will not be more than 100%.
Good luck!
Question Points
1 15
2 15
3 20
4 20
5 15
6 15
Total 100
Bonus 5
1. Inter-annotator Agreement [15 points]
Predictive analysis of text often requires annotating data. In doing so, one important step is
verifying whether human annotators can reliably detect the phenomenon of interest (e.g.,
whether a product review is positive or negative).
Suppose that two annotators (denoted A and B) independently annotate 500 product reviews
and produce the following contingency matrix. Answer the following questions.
                                Annotator B
                           Positive    Negative
Annotator A   Positive        25           25
              Negative       100          350
(a) What is the inter-annotator agreement between A and B based on accuracy (i.e., the
percentage of times that both annotators agreed)? [7.5 points]
Accuracy = (# times both say Positive + # times both say Negative) / total = (25 + 350) / 500 = 375/500 = 0.75
(b) In this particular case, do you think that accuracy is a good measure of inter-annotator
agreement? Why or why not? [7.5 points]
Accuracy does not take into account the level of agreement due to random chance. Here,
we have two classes. Therefore, if both of these annotators were independently making
random guesses with uniform probability, we would expect them to agree 50% of the
time (on average).
Furthermore, the NEGATIVE class is clearly the majority class, and both annotators favored it:
Annotator A chose the NEGATIVE class 90% of the time (450/500) and Annotator B chose the
NEGATIVE class 75% of the time (375/500). Therefore, if these annotators were making random
guesses consistent with their individual biases (favoring the majority class), we would expect
them to agree more than 50% of the time. More specifically, their expected agreement due to
chance (considering their individual biases) would be 0.10 × 0.25 + 0.90 × 0.75 = 0.70, or 70%,
which is not far below the observed agreement of 75%.
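To make the arithmetic concrete, here is a minimal sketch in Python (not part of the original exam) that computes the observed agreement, the expected chance agreement given each annotator's individual bias, and, although the exam does not ask for it, Cohen's kappa, which corrects accuracy for chance agreement:

# Counts from the contingency matrix above (rows = Annotator A, columns = Annotator B)
both_pos = 25      # A = Positive, B = Positive
a_pos_b_neg = 25   # A = Positive, B = Negative
a_neg_b_pos = 100  # A = Negative, B = Positive
both_neg = 350     # A = Negative, B = Negative
total = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg   # 500

# Observed agreement (accuracy): proportion of reviews on which A and B agree
observed = (both_pos + both_neg) / total                  # 0.75

# Marginal probability of each annotator choosing Positive
p_a_pos = (both_pos + a_pos_b_neg) / total                # 50/500  = 0.10
p_b_pos = (both_pos + a_neg_b_pos) / total                # 125/500 = 0.25

# Expected agreement by chance, given the annotators' individual biases
expected = p_a_pos * p_b_pos + (1 - p_a_pos) * (1 - p_b_pos)   # 0.025 + 0.675 = 0.70

# Cohen's kappa: observed agreement corrected for chance agreement
kappa = (observed - expected) / (1 - expected)            # 0.05 / 0.30 ≈ 0.17

print(f"observed={observed:.2f}  expected={expected:.2f}  kappa={kappa:.2f}")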
2. Training and Testing [15 points]
The goal in predictive analysis is to use training data to learn a model that can make
predictions on new data. Answer the following questions.
(a) Suppose we increased the size of the training set. Would this likely improve or
deteriorate the performance of the model on new data? Why? [7.5 points]
Increasing the size of the training set is likely to improve the model’s performance on
new data. A larger training set exposes the model to more of the natural variability in the
data, which makes it easier to determine which features are truly correlated with the
target class and which are not. If we had only five training instances, for example, it
would be nearly impossible to tell which features are truly correlated with the class and
which merely appear correlated by chance.
(b) Suppose we reduced the feature representation to include only the features with the
highest mutual information with the target concept. Would this likely improve or
deteriorate the performance of the model on new data? Why? [7.5 points]
Reducing the feature representation to the features most correlated with the target class
(those with the highest mutual information) is usually a good thing. It reduces the risk
that the model will latch onto regularities in the training data that look important only
because the training data is limited (i.e., it reduces overfitting).
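As an illustration (not from the exam), here is a short Python sketch of this kind of feature selection using scikit-learn, assuming a bag-of-words representation; the toy reviews, labels, and the choice of k=3 are made-up values for demonstration only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical toy reviews and labels (1 = positive, 0 = negative)
docs = ["great movie, great acting", "terrible plot, terrible pacing",
        "fine but forgettable", "great fun from start to finish"]
labels = [1, 0, 0, 1]

# Build a term-frequency (bag-of-words) feature representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep only the 3 features with the highest estimated mutual information with the label
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_reduced = selector.fit_transform(X, labels)

kept = [term for term, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print("kept features:", kept)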
3. Evaluation Metrics [20 points]
Suppose we train a model to predict whether an email is Spam or Not Spam. After training
the model, we apply it to a test set of 500 new emails (also labeled) and the model produces
the following contingency table.
                                True Class
                            Spam      Not Spam
Predicted      Spam          70          30
Class          Not Spam      70         330
(a) Compute the precision of this model with respect to the Spam class [5 points]
Precision = (# correctly predicted Spam) / (# predicted Spam) = 70 / (70 + 30) = 0.70
(b) Compute the recall of this model with respect to the Spam class [5 points]
Recall = (# correctly predicted Spam) / (# truly Spam) = 70 / (70 + 70) = 0.50
Emily hates seeing spam emails in her inbox! However, she doesn’t mind
periodically checking the “Junk” directory for genuine emails incorrectly marked
as spam.
Simon doesn’t even know where the “Junk” directory is. He would much prefer to
see spam emails in his inbox than to miss genuine emails without knowing!
Which user is more likely to be satisfied with this classifier? Why? [10 points]
In order to answer this question, let’s think about what it means to have high precision
and low recall with respect to SPAM and, conversely, what it means to have high recall
and low precision with respect to SPAM.
High precision and low recall with respect to SPAM: whatever the model classifies as
SPAM is probably SPAM, but many emails that are truly SPAM are misclassified as
NOT SPAM. The user will see some SPAM messages in his/her inbox, but will rarely
have to go to the “Junk” directory to look for genuine messages incorrectly marked as
SPAM.
High recall and low precision with respect to SPAM: the model filters out nearly all of
the SPAM emails, but also incorrectly classifies some genuine emails as SPAM. The user
will rarely see SPAM emails in his/her inbox, but will have to periodically check the
“Junk” directory for genuine emails incorrectly marked as SPAM.
Because this classifier achieves higher precision (0.70) than recall (0.50) with respect to
SPAM, Simon is more likely to be satisfied with it.
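For completeness, a tiny Python sketch (not part of the exam) that recovers the precision and recall figures above directly from the contingency table:

tp = 70    # predicted Spam, truly Spam
fp = 30    # predicted Spam, truly Not Spam
fn = 70    # predicted Not Spam, truly Spam
tn = 330   # predicted Not Spam, truly Not Spam

precision = tp / (tp + fp)   # 70/100 = 0.70: how often a "Spam" prediction is correct
recall = tp / (tp + fn)      # 70/140 = 0.50: how much of the true Spam is caught

print(f"precision={precision:.2f}  recall={recall:.2f}")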
4. Instance-Based Classification [20 points]
A KNN classifier assigns a test instance the majority class associated with its K nearest
training instances. Distance between instances is measured using Euclidean distance.
Suppose we have the following training set of positive (+) and negative (-) instances and a
single test instance (o).
All instances are projected onto a vector space of two real-valued features (X and Y). Answer
the following questions. Assume “unweighted” KNN (every nearest neighbor contributes
equally to the final vote).
[Figure: scatter plot of the training instances in the X–Y feature space. The positive (+)
instances form one large cluster; the five negative (−) instances lie to its upper right; the
test instance (o, marked “test instance”) sits between the positive cluster and the negative
instances, immediately next to a negative instance.]
(a) What would be the class assigned to this test instance for K=1 [5 points]
KNN assigns a test instance the target class associated with the majority of the test
instance’s K nearest neighbors. For K=1, this test instance would be predicted negative
because its single nearest neighbor is negative.
(b) What would be the class assigned to this test instance for K=3 [5 points]
KNN assigns a test instance the target class associated with the majority of the test
instance’s K nearest neighbors. For K=3, this test instance would be predicted negative.
Out of its three nearest neighbors, two are negative and one is positive.
(c) What would be the class assigned to this test instance for K=5 [5 points]
KNN assigns a test instance the target class associated with the majority of the test
instance’s K nearest neighbors. For K=5, this test instance would be predicted positive:
out of its five nearest neighbors, two are negative and three are positive.
(d) Setting K to a large value seems like a good idea. We get more votes! Given this
particular training set, would you recommend setting K = 11? Why or why not?
[5 points]
No. There are only 5 negative instances in the training set, so for any value of K ≥ 11 the
majority of a test instance’s neighbors is guaranteed to be positive. The classifier would
therefore predict positive for every test instance, regardless of where it falls in the
feature space.
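To make the voting procedure concrete, here is a minimal Python sketch of unweighted KNN with Euclidean distance. The training points below are hypothetical; they are not the points from the figure, only a small configuration that happens to reproduce the same pattern of predictions (negative for K=1 and K=3, positive for K=5):

import math
from collections import Counter

def knn_predict(train, test_point, k):
    """Return the majority class among the k training points closest to test_point."""
    # Sort training instances by Euclidean distance to the test point
    by_distance = sorted(train, key=lambda item: math.dist(item[0], test_point))
    # Each of the k nearest neighbors casts one (unweighted) vote
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: ((X, Y) feature vector, class label)
train = [((1.0, 1.0), "+"), ((1.5, 1.2), "+"), ((2.0, 0.8), "+"),
         ((4.0, 4.0), "-"), ((4.2, 3.8), "-")]
test_point = (3.5, 3.5)

for k in (1, 3, 5):
    print(f"K={k}: predicted {knn_predict(train, test_point, k)}")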
5. Instance-Based Classification vs. Naïve Bayes [15 points]
(a) Suppose we have the following data, represented using two real-valued features (X and
Y, as in Question 4) and suppose that our goal is to randomly split this data into a training
set (90%) and a test set (10%) and to train and evaluate a model.
[Figure: scatter plot in the X–Y feature space. Positive (+) and negative (−) instances
alternate along a diagonal band, so every instance’s single nearest neighbor belongs to
the opposite class, yet a single diagonal line could separate the positives from the
negatives.]
Which classifier do you think would have a higher chance of doing well in terms of
accuracy: KNN (with K=1) or Naïve Bayes? Why? [7.5 points]
KNN (with K=1) is almost guaranteed to do terribly if we were to split this dataset into
90% for training and 10% for testing, because every data point’s single nearest neighbor
has the opposite class. Every positive test instance would be predicted negative, and
every negative test instance would be predicted positive.
Naïve Bayes is an example of a linear classifier: during training it effectively fits a line
(with two features) or a hyperplane (with more than two features) that separates the
positive and negative instances.
As the figure shows, there is a diagonal line that, if the model learned it, would classify
the data perfectly. Of course, this is not guaranteed to happen, but at least Naïve Bayes
has a non-zero chance of doing well; KNN with K=1 has essentially none.
(b) Suppose we have the following data, represented using two real-valued features (X and
Y, as in Question 4) and suppose that our goal is to randomly split this data into a training
set (90%) and a test set (10%) and to train and evaluate a model.
[Figure: scatter plot in the X–Y feature space. The instances form four well-separated
clusters in an XOR-like arrangement: positive (+) clusters in the upper-left and
lower-right corners, negative (−) clusters in the upper-right and lower-left corners.]
Which classifier do you think would have a higher chance of doing well in terms of
accuracy: KNN (with K=1) or Naïve Bayes? Why? [7.5 points]
Naïve Bayes is a linear classifier, and in this dataset there is no line we could draw that
perfectly separates the positive instances from the negative instances. So, if we were to
split the data into 90% for training and 10% for testing, Naïve Bayes is guaranteed to get
some test instances wrong.
KNN tends to do well when the data is neatly clustered, as it is here: there is a good
chance that a test instance’s single nearest neighbor belongs to the same cluster and
therefore has the same target class value.
Of course, things could go horribly wrong for KNN (with K=1) if the test set happened to
include ALL of the instances in one of these clusters. That entire cluster would be
misclassified, because the nearest neighbors of its instances would then come from a
neighboring cluster with the opposite class value. The question, however, is whether this
worst-case scenario for KNN (with K=1) is likely to happen with a random 90/10 split. It
is very unlikely (each cluster contains far more than 10% of the data), so KNN (with
K=1) has the higher chance of doing well.
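The contrast can be checked empirically. Below is a hedged Python sketch (not part of the exam) that generates synthetic data arranged roughly like part (b): four tight clusters in an XOR-like layout, with opposite corners sharing a class. Gaussian Naïve Bayes stands in for the linear classifier discussed above; the cluster centers, spread, split sizes, and random seed are all illustrative assumptions:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def cluster(center, n=25):
    """Draw n points tightly scattered around a 2-D center."""
    return np.asarray(center) + 0.1 * rng.standard_normal((n, 2))

# XOR-like layout: opposite corners share a class, so no single line separates the classes
X = np.vstack([cluster((0, 1)), cluster((1, 0)),    # positive clusters (upper left, lower right)
               cluster((1, 1)), cluster((0, 0))])   # negative clusters (upper right, lower left)
y = np.array([1] * 50 + [0] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)

for name, model in [("KNN (K=1)  ", KNeighborsClassifier(n_neighbors=1)),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))

# With a random 90/10 split, KNN (K=1) is typically near-perfect here, while Naive Bayes
# hovers near chance: each class's points average out to the same center, leaving it no
# useful decision boundary.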
6. Zipf’s law and Feature Representation [15 points]
Zipf’s law tells us that in most collections of text, a few terms will occur very frequently and
many terms will occur very infrequently. In other words, if we were to plot term frequency
(Y-axis) as a function of frequency-based rank (X-axis), we obtain the graph below
(with three general regions: A, B, and C).
[Figure: Zipf curve of term frequency (Y-axis) against frequency-based rank (X-axis),
divided into three regions: A (the most frequent terms), B (mid-frequency terms), and
C (the rarest terms).]
(a) What is the justification behind ignoring those terms from region A? [7.5 points]
The features in region A are very frequent. Because they are very frequent, they are not
likely to help discriminate between target class values. Put differently, they are likely to
co-occur with all target class values.
(b) What is the justification behind ignoring those terms from region C? [7.5 points]
The features in region C are very rare. Rare terms are problematic in two respects. First,
they don’t occur frequently enough for a model to determine whether they are really
correlated with a target class value or whether their level of co-occurrence is a statistical
anomaly. Second, because they are rare, they are not likely to occur in the test data.
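In practice, both cuts are often implemented with simple document-frequency thresholds. Here is a hedged Python sketch (not from the course materials) using scikit-learn's CountVectorizer, where max_df discards region-A terms that appear in nearly every document and min_df discards region-C terms that appear in almost none; the documents and the thresholds 0.9 and 2 are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the plot was terrible",
        "the acting was fine", "a great film with the terrible ending"]

# max_df=0.9 drops terms occurring in more than 90% of documents (region A, e.g., "the");
# min_df=2 drops terms occurring in fewer than 2 documents (region C, e.g., "plot")
vectorizer = CountVectorizer(max_df=0.9, min_df=2)
X = vectorizer.fit_transform(docs)
print("kept vocabulary:", sorted(vectorizer.vocabulary_))   # ['great', 'terrible', 'was']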
BONUS: Naïve Bayes [5 points]
Suppose you have the following toy training set of positive (+) and negative (-) movie
reviews. There are only 5 training instances and 3 features.
If we trained a Naïve Bayes model with no smoothing using this training set, what would be
the label assigned to a movie review that just says: “terrible!”? Explain your answer.
Hint: notice that this test instance would have the following feature values: great=0, fine=0,
and terrible=1.
The trick to solving this question quickly is to notice that all of the P(word={0,1}|class={+,-})
statistics estimated from this training set have a non-zero value except for one:
P(terrible=1|+) = 0. P(terrible=1|+) = 0 because not a single positive training instance contains
the term “terrible”. With no smoothing, this zero probability drives the score for the positive
class, P(+) × P(great=0|+) × P(fine=0|+) × P(terrible=1|+), to zero, while all of the
corresponding probabilities for the negative class are non-zero. The model would therefore
label the review “terrible!” as negative.
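Since the exam’s training-set table is not reproduced here, the following Python sketch uses a made-up 5-instance training set chosen only so that, as in the exam, no positive review contains “terrible”; it shows how an unsmoothed zero probability forces the prediction:

# Each instance: (binary feature values for great/fine/terrible, class label).
# HYPOTHETICAL data: not the exam's actual training set.
train = [({"great": 1, "fine": 0, "terrible": 0}, "+"),
         ({"great": 1, "fine": 1, "terrible": 0}, "+"),
         ({"great": 0, "fine": 1, "terrible": 0}, "+"),
         ({"great": 0, "fine": 0, "terrible": 1}, "-"),
         ({"great": 0, "fine": 1, "terrible": 1}, "-")]

def class_score(instance, label):
    """Unsmoothed Naive Bayes score: P(label) * product over features of P(feature=value | label)."""
    in_class = [feats for feats, y in train if y == label]
    score = len(in_class) / len(train)                    # prior P(label)
    for feature, value in instance.items():
        matches = sum(1 for feats in in_class if feats[feature] == value)
        score *= matches / len(in_class)                  # P(feature=value | label), no smoothing
    return score

test = {"great": 0, "fine": 0, "terrible": 1}             # the review "terrible!"
for label in ("+", "-"):
    print(label, class_score(test, label))
# P(terrible=1 | +) = 0 drives the "+" score to zero, so the review is labelled "-".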