Udacity Machine Learning Analysis Supervised Learning
Numpy Documentation
Pandas Documentation
Confusion Matrix
From PCA Lessons
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
Regression Metrics
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score
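A minimal sketch of these four metrics in use (assumes scikit-learn; y_true and y_pred are illustrative values, not course data):

from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, explained_variance_score)

y_true = [3.0, -0.5, 2.0, 7.0]   # hypothetical targets
y_pred = [2.5, 0.0, 2.0, 8.0]    # hypothetical predictions

print mean_absolute_error(y_true, y_pred)      # average |error|
print mean_squared_error(y_true, y_pred)       # average squared error
print r2_score(y_true, y_pred)                 # 1.0 is a perfect fit
print explained_variance_score(y_true, y_pred)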
8. Causes of Error
Bias and Variance
http://scott.fortmann-roe.com/docs/BiasVariance.html
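The decomposition from the linked article, for squared error: Err(x) = Bias^2 + Variance + irreducible error.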
9. Nature of Data & Model Building
10. Training & Testing
#!/usr/bin/python
""" see http://scikit-learn.org/stable/modules/cross_validation.html """
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
features = iris.data
labels = iris.target
###############################################################
### YOUR CODE HERE
###############################################################
### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test
###############################################################
### a possible completion (the course's older sklearn imported
### train_test_split from sklearn.cross_validation instead):
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.4, random_state=0)
clf = SVC(kernel="linear")
clf.fit(features_train, labels_train)
###############################################################
def submitAcc():
    return clf.score(features_test, labels_test)
https://www.python.org/download/releases/2.7/
http://www.numpy.org/
http://scikit-learn.org/stable/
http://matplotlib.org/
http://ipython.org/notebook.html
https://archive.ics.uci.edu/ml/datasets/Housing
https://review.udacity.com/#!/projects/5415419142/rubric
https://www.udacity.com/me
http://discussions.udacity.com/
https://discussions.udacity.com/c/nd009-model-evaluation-validation
4. Supervised Learning:
1. Supervised Learning Intro
2. Decision Trees
Regression or classification based on output (continuous or discrete)
Attributes A1, A2, A3
Number of nodes
Important: need to understand information gain and entropy (a measure of randomness); see the sketch below
S = the training set, A = an attribute
Low entropy vs. high entropy
Choose the split with maximum information gain
Inductive bias
Restriction bias
Preference bias
Which decision trees does ID3 prefer? (its preference bias)
PDF file
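A small sketch of entropy and information gain as ID3 uses them (my own illustration, not course code; labels and split are hypothetical):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum p_i * log2(p_i) over the class proportions
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(labels).values())

def information_gain(labels, split):
    # gain = H(S) minus the size-weighted entropy of each subset
    total = float(len(labels))
    remainder = sum(len(s) / total * entropy(s) for s in split)
    return entropy(labels) - remainder

# e.g. a perfect split of a 50/50 set gives the maximum gain, 1.0
print information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])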
3. More Decision Trees
#!/usr/bin/python
""" lecture and example code for decision tree unit """
import sys
from class_vis import prettyPicture, output_image
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
from classifyDT import classify
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
#################################################################################
### a possible completion: create, train, and score the classifier
### (features_train, labels_train, features_test, labels_test come from
### makeTerrainData and a train/test split, as in the course quizzes)
from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier()
clf.fit(features_train, labels_train)
acc = accuracy_score(clf.predict(features_test), labels_test)

def submitAccuracies():
    return {"acc": round(acc, 3)}
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
### a possible completion: compare two min_samples_split settings
clf2 = tree.DecisionTreeClassifier(min_samples_split=2)
clf2.fit(features_train, labels_train)
acc_min_samples_split_2 = accuracy_score(clf2.predict(features_test), labels_test)

clf50 = tree.DecisionTreeClassifier(min_samples_split=50)
clf50.fit(features_train, labels_train)
acc_min_samples_split_50 = accuracy_score(clf50.predict(features_test), labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}
In between
Projections (linear algebra)
#!/usr/bin/python
import numpy
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from class_vis import output_image
### assumed setup for the plotting below: training/test data from
### ageNetWorthData() (defined further down) and a fitted regression
### from the studentReg() quiz function, e.g.
# ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()
# reg = studentReg(ages_train, net_worths_train)
plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")
plt.savefig("test.png")
output_image("test.png", "png", open("test.png", "rb").read())
def studentReg(ages_train, net_worths_train):
    ### import the sklearn regression module, create, and train your regression
    ### name your regression reg (a possible completion below)
    from sklearn.linear_model import LinearRegression
    reg = LinearRegression()
    reg.fit(ages_train, net_worths_train)
    return reg
Very important idea: wrap the regression creation and training in a function.
import numpy
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

### assumes ages_train, ages_test, net_worths_train, net_worths_test
### from ageNetWorthData() below
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader
    # (a possible completion; the predicted age below is illustrative)
    km_net_worth = reg.predict([[27]])[0][0]   # predicted net worth at one age
    slope = reg.coef_[0][0]
    intercept = reg.intercept_[0]
    test_score = reg.score(ages_test, net_worths_test)
    training_score = reg.score(ages_train, net_worths_train)
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}
import numpy
import random

def ageNetWorthData():
    random.seed(42)
    numpy.random.seed(42)
    ages = []
    for ii in range(100):
        ages.append(random.randint(20, 65))
    net_worths = [ii * 6.25 + numpy.random.normal(scale=40.) for ii in ages]
    ### need to massage the lists into 2d numpy arrays for LinearRegression
    ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))
    return ages, net_worths   # the course version also split these into train/test sets
For continuous functions you can add a hidden layer to the network to map the output from
the first layer to match the continuous function.
Even arbitrary functions can be modeled by adding a second hidden layer to jump around.
Since there is not much restriction going on here, neural networks are prone to overfitting:
use cross-validation to measure performance and pick the correct complexity (e.g. number and
size of hidden layers), as in the sketch below.
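A minimal sketch of that model-selection loop (my illustration, not course code; assumes scikit-learn >= 0.18 for MLPClassifier and uses a synthetic dataset):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)
# compare hidden-layer sizes by cross-validated accuracy; the best
# score picks the complexity, rather than trusting training accuracy
for size in [(2,), (10,), (50,)]:
    clf = MLPClassifier(hidden_layer_sizes=size, max_iter=2000, random_state=42)
    print size, cross_val_score(clf, X, y, cv=5).mean()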
Preference bias
Note: considering gradient descent over the perceptron training rule for the notes below.
In general, we prefer low complexity in our neural networks: smaller weights, fewer hidden
layers, and smaller hidden layers.
Choosing small, random values for the initial weights helps us avoid local minima and ensures
that subsequent runs of the algorithm don't fall into the same traps. Smaller weight values
also help avoid the overfitting that large values are prone to (since larger values allow a
wider range of functions to be represented).
7.5 Neural Nets Mini-project
1.Build a Perceptron.py
#-----------------------------------
#
# In this exercise you will put the finishing touches on a perceptron class
#
# Finish writing the activate() method by using numpy.dot and adding in the thresholded
# activation function
import numpy
class Perceptron:
    weights = [1]
    threshold = 0

    def activate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''
        # a possible completion: dot product, then threshold
        strength = numpy.dot(values, self.weights)
        result = 1 if strength > self.threshold else 0
        return result

    def __init__(self, weights=None, threshold=None):
        if weights:
            self.weights = weights
        if threshold:
            self.threshold = threshold
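A quick check of the completed unit (hypothetical values):

p = Perceptron(weights=[1, 2], threshold=0.75)
print p.activate([1, -1])   # strength = 1*1 + 2*(-1) = -1 <= 0.75, so expect 0
print p.activate([2, 1])    # strength = 1*2 + 2*1  =  4 >  0.75, so expect 1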
#-----------------------------------
#
# In this exercise we write a perceptron class
# which can update its weights
#
# Your job is to finish the train method so that it implements the perceptron update rule
import numpy as np
class Perceptron:
    weights = [1]
    threshold = 0

    def activate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''
        strength = np.dot(values, self.weights)
        if strength > self.threshold:
            result = 1
        else:
            result = 0
        return result

    def update(self, values, train, eta=.1):
        '''Takes in a 2D array @param values and a 1D array @param train,
        consisting of expected outputs for the inputs in values.
        Updates internal weights according to the perceptron training rule
        using these values and an optional learning rate, @param eta.
        '''
        # a possible completion of the perceptron update rule:
        # weight_j += eta * (expected - actual) * input_j, per training example
        for ii in range(len(values)):
            prediction = self.activate(values[ii])
            error = train[ii] - prediction
            for jj in range(len(self.weights)):
                self.weights[jj] += eta * error * values[ii][jj]

    def __init__(self, weights=None, threshold=None):
        if weights:
            self.weights = weights
        if threshold:
            self.threshold = threshold
#
# In this exercise, you will create a network of perceptrons which
# represent the xor function use the same network structure you used
# in the previous quizzes.
#
# You will need to do two things:
# First, create a network of perceptrons with the correct weights
# Second, define a procedure EvalNet() which takes in a list of
# inputs and outputs the value of this network.
import numpy as np
class Perceptron:
    weights = [1]
    threshold = 0

    def evaluate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''
        # threshold activation, as in the earlier perceptron quizzes
        strength = np.dot(values, self.weights)
        result = 1 if strength > self.threshold else 0
        return result

    def __init__(self, weights=None, threshold=None):
        if weights:
            self.weights = weights
        if threshold:
            self.threshold = threshold

Network = [
    # input layer: one possible choice is an OR-like and an AND-like unit
    [Perceptron([1, 1], 0.5), Perceptron([1, 1], 1.5)],
    # output node: fires for OR but not AND, which is exactly XOR
    [Perceptron([1, -2], 0.5)]
]

def EvalNet(inputValues, Network):
    # feed the inputs forward through each layer in turn
    for layer in Network:
        inputValues = [node.evaluate(inputValues) for node in layer]
    return inputValues[0]
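A quick sanity check of the network sketch above over all four inputs:

for a in [0, 1]:
    for b in [0, 1]:
        print a, b, EvalNet([a, b], Network)
# expect: 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0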
#
# Python Neural Networks code originally by Szabo Roland and used by permission
#
# Modifications, comments, and exercise breakdowns by Mitchell Owen, (c) Udacity
#
# Retrieved originally from http://rolisz.ro/2013/04/18/neural-networks-in-python/
#
#
# Neural Network Sandbox
#
# Define an activation function activate(), which takes in a number and returns a number.
# Using test run you can see the performance of a neural network running with that
# activation function.
#
import numpy as np
def activate(strength):
return np.power(strength,2)
#
# As with the perceptron exercise, you will modify the
# last functions of this sigmoid unit class
#
# There are two functions for you to finish:
# First, in activate(), write the sigmoid activation function
#
# Second, in train(), write the gradient descent update rule
#
# NOTE: the following exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!
import numpy as np
class Sigmoid:
    weights = [1]
    last_input = 0

    def activate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the logistic activation of this unit's
        signal strength, given values as inputs.
        '''
        # a possible completion: logistic function of the signal strength
        strength = self.strength(values)
        self.last_input = strength
        result = 1.0 / (1.0 + np.exp(-strength))
        return result

    def strength(self, values):
        strength = np.dot(values, self.weights)
        return strength

    def update(self, values, train, eta=.1):
        '''
        Updates the sigmoid unit with expected return
        values @param train and learning rate @param eta
        '''
        # gradient descent update; note the extra result*(1-result) factor
        # (the sigmoid derivative), which the original snippet was missing --
        # with it, the expected output below checks out
        result = self.activate(values)
        for i in range(0, len(values)):
            self.weights[i] += eta * (train[0] - result) * result * (1 - result) * values[i]

    def __init__(self, weights=None):
        if weights:
            self.weights = weights
unit = Sigmoid(weights=[3,-2,1])
unit.update([1,2,3],[0])
print unit.weights
#Expected: [2.99075, -2.0185, .97225]
#
# In the following exercises we will complete several functions for a
# simple implementation of neural networks based on code by Roland
# Szabo.
#
# In this exercise, we will write a function, predict(),
# which will predict the value of given inputs based on a constructed
# network.
#
# Note that we are not using the Sigmoid class we implemented earlier
# to be able to compute more efficiently.
#
# NOTE: the following exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!
import numpy as np
def logistic(x):
return 1/(1 + np.exp(-x))
def logistic_derivative(x):
return logistic(x)*(1-logistic(x))
class NeuralNetwork:
    def __init__(self, layers):
        self.weights = []
        # randomly initialize weights in [-0.25, 0.25]
        for i in range(1, len(layers) - 1):
            self.weights.append((2*np.random.random((layers[i - 1] + 1, layers[i] + 1))-1)*0.25)
        self.weights.append((2*np.random.random((layers[i] + 1, layers[i + 1]))-1)*0.25)
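A sketch of the predict() this exercise asks for (my reading of the Szabo-style forward pass: prepend a bias input, then apply logistic(dot(...)) layer by layer):

    def predict(self, x):
        # forward pass: bias term plus inputs, pushed through each weight matrix
        a = np.concatenate((np.ones(1), np.array(x)))
        for l in range(0, len(self.weights)):
            a = logistic(np.dot(a, self.weights[l]))
        return a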
#
# In the following exercises we will complete several functions for
# a simple implementation of neural networks based on code by Roland
# Szabo.
#
# In this exercise, we will begin by writing a function, deltas(),
# which will compute and store delta factors for each node in a
# layer, given the deltas for the previous layer.
#
# Recall that the delta value associated to an output node is the
# activation_derivative
# of the node's last_input multiplied by the difference of its expected output minus
# its actual output
#
# The delta value associated to a hidden node is the activation_derivative of the
# node's last_input times the sum over the next layer of the products of each nodes
# delta value times weight from the current node
#
# NOTE: the following exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!
import numpy as np
def logistic(x):
return 1/(1 + np.exp(-x))
def logistic_derivative(x):
return logistic(x)*(1-logistic(x))
class Sigmoid:
    def activate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the logistic activation of this unit's
        signal strength, given values as inputs.
        '''
        # a possible completion, mirroring the completed snippet further below
        strength = self.strength(values)
        self.last_input = strength
        result = logistic(strength)
        return result

    def strength(self, values):
        # Formats inputs to easily compute a dot product
        local = np.atleast_2d(self.weights)
        values = np.transpose(np.atleast_2d(values))
        strength = np.dot(local, values)
        return float(strength)

    def __init__(self, weights=None):
        if type(weights) in [type([]), type(np.array([]))]:
            self.weights = weights
class NeuralNetwork:
    def __init__(self, layers):
        self.nodes = [[]]
        # input nodes
        for j in range(0, layers[0]):
            self.nodes[0].append(Sigmoid())
        # randomly initialize weights
        for i in range(1, len(layers)-1):
            self.nodes.append([])
            for j in range(0, layers[i]+1):
                self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i - 1]+1)-1)*.25))
        self.nodes.append([])
        for j in range(0, layers[i+1]):
            self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i]+1)-1)*.25))

    def deltas(self, expected, outputs, layer):
        '''
        :param expected: an array of expected outputs (in the case of an output layer) or deltas from the
        previous layer (in the case of an input layer)
        :param outputs: an array of actual outputs from the layer
        :param layer: which layer of the network to update.
        sets the delta values for the units in the layer
        :returns: a list of the delta values for use in the next previous layer
        '''
        # see the completed implementation in the next snippet below
import numpy as np
def logistic(x):
return 1/(1 + np.exp(-x))
def logistic_derivative(x):
return logistic(x)*(1-logistic(x))
class Sigmoid:
    def activate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the logistic activation of this unit's
        signal strength, given values as inputs.
        '''
        strength = self.strength(values)
        self.last_input = strength
        result = logistic(strength)
        return result

    def strength(self, values):
        # Formats inputs to easily compute a dot product
        local = np.atleast_2d(self.weights)
        values = np.transpose(np.atleast_2d(values))
        strength = np.dot(local, values)
        return float(strength)

    def __init__(self, weights=None):
        # default added so the bare Sigmoid() calls below work
        if type(weights) in [type([]), type(np.array([]))]:
            self.weights = weights

class NeuralNetwork:
    def __init__(self, layers):
        self.nodes = [[]]
        # input nodes
        for j in range(0, layers[0]):
            self.nodes[0].append(Sigmoid())
        # randomly initialize weights
        for i in range(1, len(layers)-1):
            self.nodes.append([])
            for j in range(0, layers[i]+1):
                self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i - 1]+1)-1)*.25))
        self.nodes.append([])
        for j in range(0, layers[i+1]):
            self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i]+1)-1)*.25))
# for x in self.nodes:
# print len(x),"Layer",len(x[-1].weights)
#to train on each example, we will first need to evaluate the example from X
#storing the signal strength at each node before the activation is applied.
#Then compare the outputs in y to our outputs, and scale them by the
#activation_derivative(strength) at the signal strengths for each of the output
#nodes.
#Iterate backwards over the layers, using the deltas method below to associate a
#rate of change to each node
#then modify each of the (non-input) node's weights by the learning rate times
#the current node's delta times the previous node's last input.
    def deltas(self, y, outputs, layer):
        '''
        :param y: an array of expected outputs
        :param outputs: an array of actual outputs from the layer
        :param layer: which layer of the network to update. Use -1 for output layer.
        sets the delta values for the units in the layer
        :returns null:
        '''
        if layer == -1:
            final = [y[i] - outputs[i] for i in range(0, len(y))]
        else:
            final = []
            for i in range(0, len(self.nodes[layer])):
                sum = 0
                for j in range(0, len(self.nodes[layer+1])):
                    sum += self.nodes[layer+1][j].weights[i] * self.nodes[layer+1][j].delta
                final.append(sum)
        for i in range(0, len(self.nodes[layer])):
            self.nodes[layer][i].delta = logistic_derivative(outputs[i]) * final[i]
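A standalone sketch of the whole training loop those comments describe (my own minimal numpy version, separate from the class above): one hidden layer of logistic units learning XOR with the same delta rules.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_xor(eta=0.5, epochs=20000, seed=42):
    # inputs with a bias column of ones appended
    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)
    rng = np.random.RandomState(seed)
    W1 = rng.uniform(-0.25, 0.25, (3, 4))    # input(+bias) -> 4 hidden units
    W2 = rng.uniform(-0.25, 0.25, (5, 1))    # hidden(+bias) -> 1 output unit
    for _ in range(epochs):
        hidden = logistic(np.dot(X, W1))
        hidden_b = np.hstack([hidden, np.ones((len(X), 1))])
        output = logistic(np.dot(hidden_b, W2))
        # output delta: activation derivative times (expected - actual)
        delta_out = (y - output) * output * (1 - output)
        # hidden delta: derivative times the next layer's deltas, weighted back
        delta_hid = hidden * (1 - hidden) * np.dot(delta_out, W2[:4].T)
        # weight update: learning rate * delta * the node's last input
        W2 += eta * np.dot(hidden_b.T, delta_out)
        W1 += eta * np.dot(X.T, delta_hid)
    return output

print train_xor().round(2)   # should approach [[0], [1], [1], [0]]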
8. Kernel Methods & SVMs
9. Kernel - Georgia Tech - Machine Learning
9. SVM
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
########################## SVM #################################
### we handle the import statement and SVC creation for you here
from sklearn.svm import SVC
clf = SVC(kernel="linear")

### a possible completion: fit the classifier and compute its accuracy
from sklearn.metrics import accuracy_score
clf.fit(features_train, labels_train)
acc = accuracy_score(clf.predict(features_test), labels_test)

def submitAccuracy():
    return acc
Kernel trick: mapping the data onto a new z-axis (a projection) separates points with small feature values from those with large ones, making them linearly separable.
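A small illustration of that idea (my own sketch, not course code): data separated by a circle defeats a linear SVM but not an RBF kernel, whose implicit projection roughly adds a distance-from-origin axis.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (np.sqrt((X ** 2).sum(axis=1)) < 1.0).astype(int)  # inside vs. outside a circle

print SVC(kernel="linear").fit(X, y).score(X, y)   # noticeably below 1.0
print SVC(kernel="rbf").fit(X, y).score(X, y)      # close to 1.0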
10. Instance Based Learning
11. Naive Bayes
#!/usr/bin/python
import numpy as np
import pylab as pl
### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]
# You will need to complete this function imported from the ClassifyNB script.
# Be sure to change to that code tab to complete this quiz.
clf = classify(features_train, labels_train)
### draw the decision boundary with the text points overlaid
prettyPicture(clf, features_test, labels_test)
output_image("test.png", "png", open("test.png", "rb").read())
#!/usr/bin/python
#import numpy as np
#import matplotlib.pyplot as plt
#plt.ioff()
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
h = .01  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
plt.savefig("test.png")
import base64
import json
import subprocess
import random

def makeTerrainData(n_points=1000):
    ###############################################################################
    ### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0, n_points)]
    bumpy = [random.random() for ii in range(0, n_points)]
    error = [random.random() for ii in range(0, n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0, n_points)]
    for ii in range(0, len(y)):
        if grade[ii] > 0.8 or bumpy[ii] > 0.8:
            y[ii] = 1.0
    # (the course version went on to split these into train/test sets and return them)
### use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
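The quiz imports NBAccuracy from a ClassifyNB script; a possible version of that function (Gaussian Naive Bayes, as used in the course):

def NBAccuracy(features_train, labels_train, features_test, labels_test):
    ### create, train, and score a Gaussian Naive Bayes classifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score
    clf = GaussianNB()
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    accuracy = accuracy_score(pred, labels_test)
    return accuracy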
You should save maybe 10% of your data and use it as your testing set. That's what you use
to actually tell how much progress you're making; report the test results to your client,
because they give a much fairer understanding of how well you are doing than your training
data alone.
11.5 Bayes NLP Mini Project
12. Bayesian Learning
we're trying to learn the most likely or the most probable hypothesis
given the data and whatever domain knowledge we bring to bear. You buy that?
it's the hypothesis
that we think is most likely, given the data that we've seen.
Given the training set and given whatever domain knowledge that we bring to
most likely, that is, most probable, or most probably the correct one.
It's the hypothesis class that has the highest probability given the data.
Chain rule: basically the definition of conditional probability for conjunctions.
hypothesis given the data. But what do all these other terms mean?
probability of the data given the hypothesis right?
likelihood that
domain knowledge.
the probability of the hypothesis given
labeling the data, then also your probability of the hypothesis will go up.
I guess the probability of the data going
the probability of the data given the hypothesis times the probability
of the hypothesis divided by the probability of the data.
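In symbols, Bayes' rule: P(h|D) = P(D|h) * P(h) / P(D).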
you have any reason to believe We have strong evidence that someone
might have some condition, then it makes sense to test them for it.
that Bayes' rule actually gives you some information. It actually helps
the data W which we know is equal to the probability of the data given
because it's not the case that the probability of h given d equals, it's the
so, in fact, it's probably better to say that I'm going to approximate
data given the hypothesis times the probability of the hypothesis and just go
priors.
that often it's just as hard
a strong prior.
that we're going to be given a bunch of labeled
training data, which I'm writing here as x sub i and d sub i, so x sub i
is whatever the input space is, and d sub i are these labels. And let's say, it
that down because I think it's important. They're noise-free examples. Okay.
>> That's right, for all xi. So, the second assumption, is that the true
more likely than any other. And so, we have a uniform prior over our hypotheses.
>> So it's like the one thing we know is that we don't know anything.
>> That's right. So, sometimes people called this an uninformative prior
because you don't know anything. Except of course I've always thought that's
you something that all hypothesis are equally likely. But that's
>> Is it? So its just an ignorant prior is what you're telling me.
these are our assumptions. We've got
>> Exactly right, uniform means Exactly that. Okay so we've got one of our
terms, good job. let's pick another term. How about the probability of
noise free so it's always, so they're always going to be zeros and ones.
And multiply all that by one over the size of the hypothesis space,
probability of the data given the hypothesis, which we know is one for all those
is the size of the version space over the size of the hypothesis space which,
or the right one, is simply uniform over all of the hypotheses that are
in the version space. That is, are consistent with the data that we see.
>> Nice.
>> It
for all the hypotheses. Now this is exactly the algorithm that
to pick one over the other from the version space. They're
>> Yeah,
that follows.
anything at all about exactly what the labels were other than
that they were labels of some sort. The strongest assumption that we
you've got noise free data, you have to find that hypothesis space, and
>> Where this is the hypothesis that actually matters. We're saying that X
comes in, the hypothesis spits that same X out. And then this noise process
one over two to the k. So, the probability that that would happen from
this hypothesis, for the very first data item, the one mapping to five, would be
1/32nd. That's the probability that a one would produce a five by this process.
>> Okay.
>> Uh-hm.
tripled,
>> Uh-hm.
like the first one, so that will be one thirty second as well,
>> Mm-hm.
probability that all these things would happen is exactly the product.
>> Right.
is 65,536. So it should be 1 over, oh you already wrote it. 65,536. Yea that.
>> Yes that's absolutely correct Michael. Well done. Okay so,
there a more generic Is there a general form that we could write down?
>> And it was then the product of, of that quantity for all of
the data elements, so all the i's. So product over all the i's of that.
>> Okay.
>> Right.
i is equal to zero and this formula holds and it's zero otherwise.
>> Exactly.
>> Okay. Sounds good. Okay, great Michael. So that's right and that
was exactly the right way of thinking about it. And now, what we're
going to do next, is we're going to take what we've just gone through.
This sort of process of thinking about, how to generate data labels. for,
think you will find will be a pretty cool derivation. Sound good?
>> Awesome!
>> Excellent.
8. Return to Bayesian Learning - Georgia Tech - Machine Learning >> see the TXT subtitles file
the hypothesis.
all of the data together to each of the
of seeing one particular P sub i, given that we're in a world where H is true.
>> So okay, given that H is true that means whatever the corresponding xi
just need an expression for saying how likely it is that we get that much error.
Our goal here is, given all of this training
by, the noise model. And the noise is a Gaussian. So, we can
just think more generally about gradient descent, right? The way gradient descent
>> Yes, you get all of the stuff that people have been doing.
things like gradient descent and linear regression, all of the stuff we
were talking about before and we have an argument. For why it's
because of the specific assumptions that we've made. So what were the
assumptions that we made? We assumed that there was some True deterministic
fact assuming that the data that you have has been corrupted by
actually not trying to model a deterministic function of this sort. And then
you are in fact, possibly, in fact most likely doing the wrong thing.
know, you couldn't do all these cute tricks with natural logs but
yes, you would end up with something different. And one question
model was not the right one, what sort of bad things might
happen? Here let me give you an example, let's imagine that we're
looking at this here, and our X's are, I don't know measurements
>> Mm-hm.
let's make it even simpler than that. Let's imagine that our
>> Because if the x's are noisy, then this is not a valid assumption.
>> I see.
when it will work, but it's not clear that this particular assumption
an error term inside the f along with the x and f is say linear.
>> Mm-hm.
part of the noise term and, and it all still goes through.
Like I feel lines are still pretty happy even with that.
linear, I mean linear functions are very nicely behaved in that way.
>> Yeah.
So your measuring device that gives you an error for your height
would also have to give you an independent normal error for the weight.
>>
>> Yeah.
>> Mm-hm.
>> Okay good. So let's move on to the next thing Michael. Let's try one
more example of this and, and then I hope that means you got it, okay?
>> Sure.
>> Beautiful.
here for you is, are a maximum a posteriori equation, right?
going to do the log of both sides here. But this time I'm going to do
notion of entropy, that the optimal code for some event with probability P has
length minus log base 2 of P. So, that just comes straight out of information
theory. That's where all the entropy stuff comes from. Okay. So, if we
as lengths.
>> So my question to you is, given that this definition over here, that an
>> Mm-hm.
>> And the length of the hypothesis, or the probability of the hypothesis.
>> Yep.
>> So, you said you don't know what that means. But, let's think about
that out loud for a moment. What does it mean to have a length of
>> Okay.
>> The number of bits that we need to represent the hypothesis is, I guess,
We are taking all the different hypotheses and writing them out. The ones
prior. And those are going to have smaller lengths than the optimal code.
And the ones that are less common are going to have longer codes.
trees.
that we prefer smaller trees
minus one in front But these terms actually have meanings in information theory,
the best hypothesis, the hypothesis with the maximum a posteriori probability is
the one that minimizes error and the size of your hypothesis.
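In symbols: h_MAP = argmax_h P(D|h) P(h) = argmin_h [ -lg P(D|h) - lg P(h) ], i.e. minimize length(D|h) + length(h): the minimum description length idea.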
Or I can have a little bit of error for a smaller hypothesis. But this is the
hypothesis that still explains your data, that is, minimizes your error.
But, you do have some
real issues here about for example units. So, I don't know if
the units of the size of the hypothesis are directly comparable to
some way of translating between them... And some way of making the
decision whether you would rather minimize this or you'd rather minimize that
right like you actually had to write it down and transmit it, it
makes a lot of sense. But then I was thinking about neural networks.
And, and, and given that a fixed neural network architecture it's always
the same number of weights and they're just numbers. So you just
that those weights, if they get really you're going to need more
bits to express those big weights. And in fact that's exactly when
get too big. So like this gives a really nice story for understanding
directly but in what you need to represent the value of the parameters.
>> Wow.
>> Yeah, but the point here, Michael, I want to wrap this
up. The point here is we've now used Bayesian learning to derive
>> Neat.
>> Okay, good. Now one more thing, Michael, I'm going to show you.
>> Well, so, okay, I guess. The here, so here's what I'm seeing.
So I'm, what I'm seeing is that hypothesis one is the most likely hypothesis.
>> It's not just the most likely, it's the most a posteriori.
>> Yes.
>> But, if we're saying what's the
is, is, we have to actually look over all the hypotheses and in a sense,
let them vote. So the probability that the label is minus is actually 0.6, which
little tricky thing here for you Michael. You've been complaining about
>> Ohhh.
>> Because in fact the problem here, we've been talking about all along
is, what's the best hypothesis. But here. I ask you what's the best label?
is the best one, and then simply output max. That's how you find the best
hypothesis, but that's not how you find the best label. The way you find the
>> Okay.
>> So the best, if you can only output hypothesis and use that hypothesis,
in fact, you would say plus. But if you asked everyone
did effectively with KNN and all these other kind of. Weighted
data, and I think the probability laws would tell us that's equal
label given the data, which is, like, the probability of the
one that maximizes this expression. And this follows directly from
lie, in that I've led you down this path that somehow,
the truth is, finding the value is what we actually care about.
to find the best actual label or the best value for it.
>> We learned Bayes rule. We even learned how to derive Bayes rule.
we're swapping causes and effects. Sort of mathematically when we think about
>> ML, right. The maximum likelihood hypothesis. > Right. And what's the
>> It's the MAP that you get when the prior is uniform.
>> Yeah, that was pretty, I really liked that. So, we basically der, we
>> Mm-hm.
>> And then finally,
you told me that was all a lie, and you said that really what you want to do
all these Bayesian equations lead us to the question of how we actually infer
>> Thanks.
>> Bye.
13. Bayesian Inference
https://towardsdatascience.com/a-gentle-introduction-to-maximum-likelihood-estimation-and-maximum-a-posteriori-estimation-d7c318f9d22d
https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php
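In symbols, from the links above: h_MAP = argmax_h P(h|D) = argmax_h P(D|h) P(h), while h_ML = argmax_h P(D|h); maximum likelihood is just MAP with a uniform prior.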
what I want you to do
there's a storm and there's lightning at the same time? So, what do you think?
> Yeah, random day at 2 PM. And we
>> It's fairly high at 2 PM. Let's say it happens a quarter of the time.
>> Mm-hm.
>> Wow. Alright. Now what's the probability that you look
>> And what's the probability that you look out and
>> So, 6/13ths is obviously 0.4615. And there you go. Is that right?
>> Yes. That was perfect. Yeah, so usually when there's a
storm, it's not lightningy. It's less than half the time. That makes sense.
>> It does, because otherwise lightning would be happening all the time.
every time it's storming, because otherwise it would be lightning all the time
>> Right.
>> And often there's breaks between lighting. In fact, most of the time
one of the things you should notice here is that each time we add
>> Not just, not just two more, but like, twice as many. And so
and it's got, I don't know, a hundred variables, that's going to be a lot.
>> That's, that's, I can't even, I can't even think about that.
>> There's only like four numbers, one, two, three, many, and too many.
>> Right and so if we had, add variable like that it's going
being the truth, which is the idea that instead of representing all, so,
so, in this case, there's eight numbers. Instead of representing them as eight
the case that for all possible values, little x, little y and
it's the case that the probability that big X, the random variable
>> Fewer.
>> Fair enough.
equals that. Oh, that means that P(x) times P(y) equals
from the chain rule point of view it's like saying the probability of
it looks just like the equation you wrote down for conditional independence.
>> Right, the only thing that we added is this notion that it might
be the case that we don't have such a strong property as this where
it's always the case that you can write the probability of x given y
>> Okay, that's pretty cool. That means more powerful or something.
let's apply this content back to what we were talking about before.
>> Okay.
So the concept of a belief network, sometimes also known
but it's the same idea over and over again. And
also true that that you can figure out what the probability
And these are numbers that you can just get by marginalizing
So, what I'd like you to do is actually fill in these boxes as a quiz. And to
and lightning is true, so that would be, point, that's all the
>> Alright and why are we looking at the case where storm is true?
>> Why are we doing it? Because it's conditionally independent of storm.
chance that you don't hear thunder when you hear lightning.
where we have thunder and there's not lightning. So we find that row.
>> Right and we do the same trick we did before and we get,
.04 over .4. Which I think we did last time, actually, and we get .1.
>> Right.
basically another edge. Here, and what that represents is that thunder, to work
storm and lightning, all the joint combinations of those to make it work.
>> And that grows exponentially as you add more and more data.
And that's right, and that's something that threw me when I started to look
at this, because the picture looks a lot like a neural net. Right? In
a neural net, you've got these nodes, you've got arrows going into the nodes,
and when you have a bunch of arrows going into the same node,
animal. In particular, now, what we're really saying is, to work out the value
you have more variables coming into the node. Higher in degree.
Though it's not exactly a tree. Doesn't have to be a tree so the parents
>> Hm.
>> So if you erase the red line between storm and thunder,
>> Oh okay.
just because there is an arrow between them. These arrows are just telling
>> Okay so let me make sure I understand, what you are saying is, it
It's really just talking about the fact that we can derive numbers from other
numbers, and not that You know things cause other things. So yeah, that's a
really good point. It seems like that was an easy place to get slipped up.
>> Mm-hm.
>> Maybe in this case but I would think that that wouldn't be generally true.
>> There we go. Yeah, that's what I was looking for. So, topological sort.
>> Right, and so this a standard thing that you can do with a graph, and it's
>> Let's see. Topological only makes sense if you really can
variable depends on other variables. But they all, it ultimately has to bottom
>> Mm-hm.
>> But we're talking only about the directed ones here. So, the directed
ones yeah, it'd have to be acyclic for the, for the probability distribution
to be meaningful.
>> I'm sure we could make something up, but this is, typically
this is how it's done. It's, it's, we constrain ourselves to acyclic graphs.
of A. Well, maybe that's like probability of what, what A was one time step
that you want, then you could solve very, very hard problems efficiently
using that idea. So it's, it's cute, but it's kind of takes us
a little bit off our path, so I'm not going to get into that.
heart, and now we've got some good arguments for why it actually is.
Did you get it?
>> Good.
>> So it has to be the one on the sec, the second and just to make sure
>> Yeah. So this is actually just one way you could just read this
network is to say what is this node x with an arrow coming into it?
That is the probability of x. But, the, the things pointing into it are what's
>> Right. So this, this, so this makes sense to me. This is why when
>> Well the arrows are a form of dependence but it's not a causal
>> Hm.
>> And the last of these three equations just Bayes rule,
this time written correctly where the denominator has to be the probability
>> Excellent.
All right. So let's put some of these rules into play
have gone through an exercise where you actually use these ideas.
0.2. And that it's not udacity given that it's spam. Is
So this is the case when things, when it is spam, and if it's not spam, we
the relative probabilities between it being spam and not spam. So then I'm a
big fan of normalization, but of course this makes me think about, since it's
>> Oh we do?
machine learning. But but first let's write a general form of this formula.
>> Okay.
like this, when you have all these little bristly things coming down.
That we observe the attribute values and we can infer the class.
the most likely class given the, the data that you've seen. You can just take
an arg max over all the different possible values of that, that root node of
in the number of variables, it's just linear. There's, two probabilities for
can actually estimate these probabilities. So so far, we've only been talking
a setting where we just write down what all the numbers are.
times you had that class at all, and that gives you the
infinite data this is actually going to give you exactly the right
number. It also connects this notion of inference that we've been
talking about with classification. Which is mostly what this, this mini
course has been about. So, that's really great to have a connection,
like instead of only generating what the labels are, we can actually
any of these directions. And it turns out it's wildly successful empirically.
classification in what they do. If you have enough data you can estimate
these values really well, and Naive Bayes is just remarkably good. So yeah
so it's like unclear why we'd even have any other algorithms, right Charles?
>> Well, there's no free lunch. But I, I gotta say I, I you know
there's this as a famous man once said it works in practice but doesn't work
in theory. And I'm trying to figure out how this can possibly work.
So I noticed it's called Naive Bayes. And, I think I know why now.
>> Alright.
question I have. I have two, we'll save the second one though. One question
that this works in practice? Hm, that's a good question. It does. Moving on.
>> No?
>> Alright.
>> Now,
now that I yelled at you, why don't I, why don't I give it a guess.
>> [LAUGH]
so here, here write this down. So let's imagine there are four
actually you can use the network that you have up there okay
>> Good.
>> So let's say that the first attribute, I'm just going to call it A
and the second attribute I'm going to call B, and let's say we're really, we're
>> The third attribute is the first one, the fourth attribute is
the second one. There's no way around that. And so you'd think
Naive Bayes would fail. But, actually, looking at your equation right below
there where you're doing counting, I actually think, it'll work just fine.
>> Why?
inner relationships or, you know, you have enough attributes and,
>> All right and did you have other issues with it?
equation you wrote there. So it's really nice and neat that you
just doing counting. But, I don't have an infinite amount of data, right?
>> Right.
>> So.
be enough to veto.
people often do. People will often, what they call smooth
that, then you're believing your data too much. You're kind of overfitting.
>> Oh, oh, it's okay, okay so, so, so, so, so bear with
but whatever. If you, you'd think that by being smooth, then you're making
>> Good.
>> Huh.
>> Nice.
So I was thinking of talking to you more
>> Sure, I can help you with that. We covered Bayesian Inference [LAUGH]
I'm sorry.
>> I'm going to choose not to pay attention to that. Instead, write
Bayesian Networks. We talked about the
>> Right. We did a lot of examples of how to actually do inference with networks.
>> Well first we did say that, that in general it's hard
>> Mm-hm.
link between all this Bayesian stuff. The Bayesian rabbit hole we
is this nice idea that we had a gold standard, right? We had a sort of way
of talking about what the right hypothesis was
you can't actually do the for loop that requires you compute
was all very cool. But what you've done here when you
have to worry about just figuring out the most likely label
you can compute anything from it, you could try to ask
well what's the likelihood that I see some particular attribute or set
on all those kind of things that you could do. With the
Bayesian learning. So inference gives us this power to not just
>> Cool. Yeah, well said. The, the For, and another thing,
really well. So whereas things like, oh. You know, decision trees
the decision tree where you need to know that attribute value you're
stuck. Whereas in this Naive Bayes setting, you can still do the
by probabilities.
>> Nice.
the homework problems. But I think that's enough for Bayesian inference.
And I think that actually wraps up classification and regression more generally.
>> Right. So we're done with supervised learning. Well, one's never done with
supervised learning. But we're at least done with this part of the course.
>> [LAUGH]
>> And your input will be the exam, and then we'll give you a label back.
>> This has been fun. I will see you in the second mini course.
>> Bye.
14. Ensemble B&B
hypothesis. That's the specific hypothesis that our learner
>> Yeah
but I'm not seeing why that's different from number of mismatches in the
how we would get our testing set anyway. Then I would think that would
be you know if it's large enough a pretty good approximation of this value.
mean I'm just going to put four dots on the the screen.
>> Hm.
then I'm going to tell you this particular learner output a hypothesis.
the first one and the third one right, but gets the
second and the fourth one wrong. So what's the error here?
>> Mm.
out that you get the first and the third one right,
and you get the second and the fourth one wrong.
just to be clear here's the question again. What happens to the distribution
over a particular example i when the hypothesis ht that was output by the
Need to understand
So that ties together this, what constructed E does
sub T. So the final hypothesis is just the s g n function of the weighted sum of
all of the rules of thumb, all of the weak classifiers that you've been picking
the alpha sub T is one half of the natural log of one minus epsilon T over
epsilon T. That is to say, it's a measure of how
well you're doing with respect to underlining error. So, you get more
you get less weight. So what does this look like to you?
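In symbols: H(x) = sgn( sum_t alpha_t * h_t(x) ), where alpha_t = (1/2) ln( (1 - epsilon_t) / epsilon_t ) and epsilon_t is the weak learner's weighted error.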
how well each of the individual hypotheses are doing and then you
you say you know what? Negative and if its above zero you
say you know what? Positive and if its zero you just throw
up your hands and And return zero. In other words, you return literally
the sign of the number. So you are throwing away information there, and
to the next lesson it's going to turn out that that little bit of
bit of a teaser. We'll get back to that there. Okay so, this
is boosting, Michael. There's really nothing else to it. You have a very
>> Thanks
>> You're welcome. I'm about helping others Michael you know that.
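The weighted vote of weak learners described above is what scikit-learn's AdaBoostClassifier implements; a minimal sketch (assumes the features_train/labels_train split from the earlier terrain quizzes; the default weak learner is a decision stump):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

clf = AdaBoostClassifier(n_estimators=50)
clf.fit(features_train, labels_train)
print accuracy_score(clf.predict(features_test), labels_test)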
See the transcript (~10 min); talks about good answers.
as we create
more and more of these hypotheses, which you would think would
make something more and more complicated, it turns out that you
less complicated. So the reason boosting tends to do well and tends to avoid
overfitting even as you add more and more learners is that you're increasing
the margin. And there you go. And if you look in the reading that
>> Cool.
I mean, the story that you told says that it's going to try
>> Hm. Okay, well you know, maybe, maybe it's worthwhile to
>> All right. Well, let me start off with what I think the
answer isn't. So, the last one, boosting tends to overfit, if boosting trains
too long. You just told me a story about that not being true.
>> Okay.
in fact if a whole lot of data included all of the data, and you actually
>> because it'll work on the testing data as well, because it's in there.
>> Right.
>> All right. Weak learner uses artificial neural network with
of parameters.
>> Sure.
might be the right answer, but you want to think about it some more?
>> And that is, in fact, correct. So let me give you an example
through the loop again, you will just call the same
So every time you call the learner, you'll get zero training error, but you will
just get the same neural network over and over and over again. And a weighted
>> Interesting.
with you for a moment, Michael. You used the word strongest at
some point, when you were talking about using the weakest output. And
I just want to point out that, that doesn't really mean anything.
a strong learner?
>> Because anything that does better than a half is still doing better
>> No.
not how people define weak people. They define weak people, by saying they
can't lift more than, not that they can lift at least as much.
>> I see. So it's this piece of terminology that boosting uses that is in
>> That's one interpretation. It's not the one that I would use, but
it's one interpretation. When you say something like a strong learner, I mean,
it's very difficult to sort of pin down. So don't get too caught
>> Good point yeah, also, also that this whole notion that strong
if you have something that's not a weak learner that it's, then
just throw one more thing in here and then we can stop
talking about this. There's another, a couple of other cases where boosting
>> [LAUGH]
>> I'm sorry. There's no way for that to be obvious from what we've
>> [LAUGH] No. Although I did recently see, see them in concert. But
that's a whole other conversation. Okay, so pink noise just means uniform noise.
>> No, white noise is Gaussian noise. Okay, so pink noise is uniform
noise and white noise is Gaussian noise. This is why, Michael, by the way,
if you ever try to set up a studio or a cool stereo system in your house, you
all the frequencies equally, not just the white noise.
>> Hm.
>> But boosting tends to overfit in those sorts of circumstances. And you
can read more about it in the notes if you want to. But
the one that I want I really want people to get is, that
for boosting to overcome that. Because fundamentally you've already done all of
your overfitting and it's, there's really not much for those things to do.
>> Excellent. It all ties back into margins, and it's all one