hw1
(a) In Table 1 you are given the data from all 800 planets surveyed so far. The features
observed by telescope are Size (“Big” or “Small”), and Orbit (“Near” or “Far”). Each
row indicates the values of the features and habitability, and how many times that set
of values was observed. So, for example, there were 20 “Big” planets “Near” their star
that were habitable. Derive and draw the decision tree learned by ID3 on this data
(use the maximum information gain criterion for splits, don’t do any pruning). Make
sure to clearly mark at each node what attribute you are splitting on, and which value
corresponds to which branch. Next to each leaf node of the tree, write the number of
habitable and uninhabitable planets in the training data (i.e. the data in Table 1) that
belong to that node.
(b) For just 9 of the planets, a third feature, Temperature (in Kelvin), has been measured, as
shown in Table 2. Redo all the steps from part (a) on this data using all three features.
For the Temperature feature, in each iteration you must maximize over all possible
binary threshold splits (such as T ≤ 250 vs. T > 250). According to
your decision tree, would a planet with the features (Big, Near, 280) be predicted to be
habitable or not habitable? (A computational check of these split scores is sketched after Table 2.)
Table 2: Planet size, orbit, and temperature vs. habitability.
Size    Orbit   Temperature (K)   Habitable
Big     Far     205               No
Big     Near    205               No
Big     Near    260               Yes
Big     Near    380               Yes
Small   Far     205               No
Small   Far     260               Yes
Small   Near    260               Yes
Small   Near    380               No
Small   Near    380               No
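If you want to sanity-check the by-hand entropy and information-gain calculations, here is a rough Python sketch that scores candidate splits on the Table 2 data (it is only a check, not part of the required answer). The entropy and gain helpers apply equally to the Table 1 counts in part (a); the midpoint thresholds used below are just one convention, since any threshold strictly between two consecutive temperatures produces the same split.

import math

# The nine planets from Table 2: (size, orbit, temperature in K, habitable)
PLANETS = [
    ("Big",   "Far",  205, False),
    ("Big",   "Near", 205, False),
    ("Big",   "Near", 260, True),
    ("Big",   "Near", 380, True),
    ("Small", "Far",  205, False),
    ("Small", "Far",  260, True),
    ("Small", "Near", 260, True),
    ("Small", "Near", 380, False),
    ("Small", "Near", 380, False),
]

def entropy(rows):
    """Binary entropy of the Habitable label over a list of rows."""
    if not rows:
        return 0.0
    p = sum(r[3] for r in rows) / len(rows)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def gain(rows, partition):
    """Information gain of splitting `rows` into the groups in `partition`."""
    weighted = sum(len(g) / len(rows) * entropy(g) for g in partition)
    return entropy(rows) - weighted

def split_on_value(rows, index, value):
    """Binary split on a categorical feature (column `index`) equal to `value`."""
    return [r for r in rows if r[index] == value], [r for r in rows if r[index] != value]

def split_on_threshold(rows, t):
    """Binary split on Temperature <= t versus Temperature > t."""
    return [r for r in rows if r[2] <= t], [r for r in rows if r[2] > t]

# Score the categorical splits and every candidate temperature threshold.
print("Size  :", gain(PLANETS, split_on_value(PLANETS, 0, "Big")))
print("Orbit :", gain(PLANETS, split_on_value(PLANETS, 1, "Near")))
temps = sorted({r[2] for r in PLANETS})
for t in ((a + b) / 2 for a, b in zip(temps, temps[1:])):
    print(f"T <= {t}:", gain(PLANETS, split_on_threshold(PLANETS, t)))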
2. [15 points] In this problem you’ll see why simple feature-wise (i.e. coordinate-wise) splitting
of the data isn’t always the best approach to classification. Throughout the problem, assume
that each feature can be used for splitting the data multiple times in a decision tree. Suppose
you are given n non-overlapping points in the unit square [0, 1] × [0, 1], each labeled either +
or −.
(a) Prove that there exists a decision tree of depth at most log2 n that correctly labels all
n points. At each node the decision tree should only perform a binary threshold split
on a single coordinate. (Note that a binary tree of depth log2 n can have as many as
2^(log2 n) = n leaves, and hence up to n − 1 internal nodes, i.e. splits.)
(b) Describe (either mathematically, or in a few concise sentences) a set of n points in
[0, 1] × [0, 1], along with corresponding + or − labels, so that the smallest decision tree
that correctly labels them all has at least n − 1 splits. (Hint: if you can do it with n = 3,
you can do it with arbitrary n.)
(c) Describe n points and corresponding labels that, as in part (b), can only be correctly
labeled by a tree with at least n − 1 splits, with the additional condition that the points
labeled + and the points labeled − must be separable by a straight line. In other words,
there must exist a line segment splitting the unit square in two (not necessarily parallel
to either axis), so that all points labeled + are in one part, and all points labeled − are
in the other. (You will soon see classifiers that would have had a much easier time with
this type of data.)
y = c1 x1 + c2 x2 + ε                                                    (1)
In other words, having n measurements in hand is equivalent to having n equations of the
following form: yj = c1 xj1 + c2 xj2 + εj , for j = 1 . . . n. The goal is to estimate c1 , c2 from those
measurements by maximizing conditional log-likelihood given the input, under different assumptions
for the noise. Specifically:
1. [10 points] Assume that the εi for i = 1 . . . n are iid Gaussian random variables with zero
mean and variance σ². (A sketch of the corresponding log-likelihood is given after these two cases.)
2. [10 points] Assume that the εi for i = 1 . . . n are independent Gaussian random variables
with zero mean and variance Var(εi ) = σi².
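To make the objective concrete, here is a sketch for the first case only, under the stated iid zero-mean, variance-σ² assumption (ℓ is simply notation for the conditional log-likelihood to be maximized over c1, c2):

    \ell(c_1, c_2) = \sum_{j=1}^{n} \log p(y_j \mid x_{j1}, x_{j2})
                   = -\frac{n}{2}\log(2\pi\sigma^2)
                     - \frac{1}{2\sigma^2}\sum_{j=1}^{n} (y_j - c_1 x_{j1} - c_2 x_{j2})^2 .

The second case differs only in that the common variance σ² is replaced by the per-measurement variance Var(εj) inside the sum and in the normalization term.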
1. [3 points] Provide descriptions of Naive Bayes and Logistic Regression algorithms for the
dataset above, deriving
(a) P(Y = A | X1 , . . . , X16 ) and P(Y = B | X1 , . . . , X16 )
(b) how to classify a new example (i.e. the classification rule)
(c) how to estimate the model parameters
Note: you only need to derive the equations; there is no need to plug in actual values. (The standard posterior forms are recalled below for reference.)
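For reference only (the derivation for this dataset is still the exercise), the two models posit the following forms for the posterior. The class-conditional factors P(Xi | Y) depend on how the features are modeled, and w0, . . . , w16 is notation introduced here for the logistic regression weights:

    P(Y = A \mid X_1, \ldots, X_{16}) \;\propto\; P(Y = A) \prod_{i=1}^{16} P(X_i \mid Y = A)
        \qquad \text{(Naive Bayes factorization)}

    P(Y = A \mid X_1, \ldots, X_{16}) \;=\; \frac{1}{1 + \exp\!\bigl(-w_0 - \sum_{i=1}^{16} w_i X_i\bigr)}
        \qquad \text{(logistic regression)}

In the logistic case P(Y = B | X1 , . . . , X16 ) = 1 − P(Y = A | X1 , . . . , X16 ); in the Naive Bayes case the posterior is obtained by normalizing over the two classes.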
(a) Asymptotically (as the number of training examples grows toward infinity), do you think
Logistic Regression and Gaussian Naive Bayes will converge toward identical classifiers?
Comment on why or why not.
(b) Naive Bayes relies on the assumption of conditional independence and may not work well
when the data violates this assumption. Do you think Logistic Regression also faces this
problem? If not, why not?
4. [10 points] Implement Logistic Regression and Naive Bayes for the dataset above. Use add-
one smoothing when estimating the parameters of your Naive Bayes classifier. For logistic
regression, use a step size of around 0.0001. To train and test, follow these steps (a sketch
of this evaluation loop is given after this problem):
(a) Randomly split dataset into 2/3 training set, 1/3 testing set.
(b) Choose a random subset of the training data of size m to train on, for training sizes m
from 2 to 200 (in increments of 1, or close to 1).
(c) After training on each subset, test against the held-out testing set. Calculate the classifi-
cation error as the ratio of incorrectly classified examples to the total testing set size.
(d) Repeat steps (a)–(c) 100 times from the beginning, averaging the classification error over the 100 runs.
(e) Plot the average error vs. the training sizes m, comparing Logistic Regression and Naive
Bayes.
Submit your code online. Submit your printed plot along with your homework.
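As a rough numpy/matplotlib sketch of the evaluation loop in steps (a)–(e): train_logreg, train_nb, and predict are placeholder names for your own training and prediction routines (they are not provided here), and X, y are assumed to be numpy arrays holding the features and labels of the full dataset. This is one reasonable reading of the protocol, not a required structure.

import numpy as np
import matplotlib.pyplot as plt

def average_error(X, y, train_fn, predict_fn, sizes, n_runs=100, seed=None):
    """Average held-out classification error for each training-set size in `sizes`."""
    rng = np.random.default_rng(seed)
    errors = np.zeros((n_runs, len(sizes)))
    n = len(y)
    for run in range(n_runs):
        # (a) random 2/3 train, 1/3 test split
        perm = rng.permutation(n)
        train_idx, test_idx = perm[: 2 * n // 3], perm[2 * n // 3 :]
        for k, m in enumerate(sizes):
            # (b) random subset of the training data of size m
            sub = rng.choice(train_idx, size=m, replace=False)
            model = train_fn(X[sub], y[sub])
            # (c) error = fraction of the held-out test set classified incorrectly
            preds = predict_fn(model, X[test_idx])
            errors[run, k] = np.mean(preds != y[test_idx])
    # (d) average the error over the runs
    return errors.mean(axis=0)

# (e) plot average error vs. training size m for both classifiers, for example:
# sizes = np.arange(2, 201)
# plt.plot(sizes, average_error(X, y, train_logreg, predict, sizes), label="Logistic Regression")
# plt.plot(sizes, average_error(X, y, train_nb, predict, sizes), label="Naive Bayes")
# plt.xlabel("training set size m"); plt.ylabel("average test error"); plt.legend(); plt.show()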