Udacity Machine Learning Analysis: Supervised Learning

The document provides an overview of various machine learning concepts and algorithms, including: 1. NumPy and Pandas tutorials and documentation links; 2. Scikit-learn tutorials on classification, regression, and evaluation metrics; 3. Decision tree algorithms and examples in Python using scikit-learn; 4. Linear regression examples in Python using scikit-learn to predict net worth from age; 5. Neural network concepts such as the Heaviside step function and techniques such as momentum; 6. Links to further resources on model evaluation, bias and variance, data preparation, and more.

Start from model evaluation and validation

Start >> 5. NumPy and Pandas Tutorials

NumPy Documentation

Pandas Documentation

Pandas >> DataFrame
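A minimal sketch of building and inspecting a DataFrame (the column names and values below are illustrative, not from the course data):

import pandas as pd

# build a small DataFrame from a dict of columns (illustrative data)
df = pd.DataFrame({"age": [25, 40, 33], "net_worth": [156.0, 250.0, 206.0]})
print df.head()        # first rows
print df.describe()    # summary statistics per column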


6. Scikit-learn Tutorial
7. Evaluation Metrics
Data leakage
accuracy
Definition >> Classification and regression

Confusion Matrix
From PCA Lessons
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
Regression Metrics

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html#sklearn.metrics.explained_variance_score
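A minimal usage sketch of the metrics linked above (the y values are illustrative):

from sklearn.metrics import f1_score, confusion_matrix
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score

# classification metrics
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

# regression metrics
true_vals = [2.5, 0.0, 2.0, 8.0]
predictions = [3.0, -0.5, 2.0, 7.0]
mae = mean_absolute_error(true_vals, predictions)
mse = mean_squared_error(true_vals, predictions)
r2 = r2_score(true_vals, predictions)
evs = explained_variance_score(true_vals, predictions)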
8. Causes of Error
Bias and Variance
http://scott.fortmann-roe.com/docs/BiasVariance.html
9. Nature of Data & Model Building
10. Training & Testing
#!/usr/bin/python

""" this example borrows heavily from the example


shown on the sklearn documentation:

http://scikit-learn.org/stable/modules/cross_validation.html

"""
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
features = iris.data
labels = iris.target

###############################################################
### YOUR CODE HERE
###############################################################

### import the relevant code and make your train/test split
### name the output datasets features_train, features_test,
### labels_train, and labels_test

### set the random_state to 0 and the test_size to 0.4 so


### we can exactly check your result

###############################################################

clf = SVC(kernel="linear", C=1.)


clf.fit(features_train, labels_train)

print clf.score(features_test, labels_test)

##############################################################
def submitAcc():
    return clf.score(features_test, labels_test)
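One possible fill-in for the "YOUR CODE HERE" section above, following the comments (test_size=0.4, random_state=0); note that newer scikit-learn versions provide train_test_split in sklearn.model_selection rather than sklearn.cross_validation:

from sklearn.cross_validation import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.4, random_state=0)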

11. Cross Validation


Cross-validation lets us revisit the train/test data tradeoff
http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
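A minimal GridSearchCV sketch following the linked documentation (the parameter grid is illustrative, and features_train/labels_train are assumed to come from a split like the one above; newer scikit-learn versions import GridSearchCV from sklearn.model_selection):

from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = SVC()
clf = GridSearchCV(svr, parameters)   # tries every kernel/C combination with cross-validation
clf.fit(features_train, labels_train)
print clf.best_params_                # best combination found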
12. Representative Power of a Model
13. Learning Curves and Model Complexity
Need to understand this better
14. Project Prep
https://github.com/udacity/machine-learning

https://www.python.org/download/releases/2.7/

http://www.numpy.org/
http://scikit-learn.org/stable/

http://matplotlib.org/

http://ipython.org/notebook.html
https://archive.ics.uci.edu/ml/datasets/Housing
https://review.udacity.com/#!/projects/5415419142/rubric

https://www.udacity.com/me

http://discussions.udacity.com/
https://discussions.udacity.com/c/nd009-model-evaluation-validation
4. Supervised Learning:
1. Supervised Learning Intro
2. Decision Trees
Regression or classification based on the output (continuous or discrete)
Attributes A1, A2, A3
Number of nodes
Important - need to understand
Information gain + entropy (a measure of randomness)
S >> training set
A >> attribute
Low entropy vs. high entropy
Split on the attribute with maximum information gain
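The standard ID3 quantities behind these notes (S is the training set, A an attribute, S_v the subset of S where A takes value v):

Entropy(S) = - \sum_v p_v \log_2 p_v

Gain(S, A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} Entropy(S_v)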
Inductive bias
Restriction bias
Preference bias
Which decision trees does ID3 prefer?
PDF file
3. More Decision Trees
#!/usr/bin/python

""" lecture and example code for decision tree unit """

import sys
from class_vis import prettyPicture, output_image
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
from classifyDT import classify

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the classify() function in classifyDT is where the magic


### happens--it's your job to fill this in!
clf = classify(features_train, labels_train)

#### grader code, do not modify below this line

prettyPicture(clf, features_test, labels_test)


output_image("test.png", "png", open("test.png", "rb").read())

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

#################################################################################

########################## DECISION TREE #################################

#### your code goes here

acc = ### you fill this in!


### be sure to compute the accuracy on the test set

def submitAccuracies():
    return {"acc": round(acc, 3)}
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## DECISION TREE #################################

### your code goes here--now create 2 decision tree classifiers,


### one with min_samples_split=2 and one with min_samples_split=50
### compute the accuracies on the testing data and store
### the accuracy numbers to acc_min_samples_split_2 and
### acc_min_samples_split_50, respectively

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}

Data impurity and entropy

Very important: information gain
4. Regression & Classification

In between
Projections (linear algebra )
#!/usr/bin/python

import numpy
import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt


from studentRegression import studentReg
from class_vis import prettyPicture, output_image

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()

reg = studentReg(ages_train, net_worths_train)

plt.clf()
plt.scatter(ages_train, net_worths_train, color="b", label="train data")
plt.scatter(ages_test, net_worths_test, color="r", label="test data")
plt.plot(ages_test, reg.predict(ages_test), color="black")
plt.legend(loc=2)
plt.xlabel("ages")
plt.ylabel("net worths")
plt.savefig("test.png")
output_image("test.png", "png", open("test.png", "rb").read())
def studentReg(ages_train, net_worths_train):
    ### import the sklearn regression module, create, and train your regression
    ### name your regression reg

    ### your code goes here!

    return reg
Very important idea: wrap the regression in its own function, as above
import numpy
import matplotlib.pyplot as plt

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

### get Katie's net worth (she's 27)


### sklearn predictions are returned in an array, so you'll want to index into
### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
### exact syntax, the point is the [0] at the end). In addition, make sure the
### argument to your prediction function is in the expected format - if you get
### a warning about needing a 2d array for your data, a list of lists will be
### interpreted by sklearn as such (e.g. [[27]]).
km_net_worth = 1.0 ### fill in the line of code to get the right value

### get the slope


### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = 0. ### fill in the line of code to get the right value

### get the intercept


### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = 0. ### fill in the line of code to get the right value

### get the score on test data


test_score = 0. ### fill in the line of code to get the right value

### get the score on the training data


training_score = 0. ### fill in the line of code to get the right value

def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader.
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}

import numpy
import random

def ageNetWorthData():

    random.seed(42)
    numpy.random.seed(42)

    ages = []
    for ii in range(100):
        ages.append(random.randint(20, 65))
    net_worths = [ii * 6.25 + numpy.random.normal(scale=40.) for ii in ages]
    ### need to massage the lists into 2d numpy arrays to get them to work in LinearRegression
    ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))

    from sklearn.cross_validation import train_test_split

    ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths)

    return ages_train, ages_test, net_worths_train, net_worths_test


Very important note
2. Parametric regression
Suppose we care about a specific query point: k-nearest neighbors
7. Neural Networks
https://en.wikipedia.org/wiki/Heaviside_step_function
Differentiability of the activation function matters (the Heaviside step is not differentiable; the sigmoid is)
Momentum
Preference and restriction biases for supervised learning algorithms
https://downey.io/notes/omscs/cs7641/restriction-and-preference-bias-supervised-learning/

artificial neural networks (ann)


restriction bias
Neural networks don't restrict much at all. At their most basic, you can represent boolean functions with a single-layer network of threshold perceptrons.

For continuous functions you can add a hidden layer to the network to map the output from the first layer to match the continuous function.

Even arbitrary functions can be modeled by adding a second hidden layer to "jump around."

Since there is not much restriction going on here, neural networks are prone to overfitting. Use cross-validation to measure performance and pick the correct complexity (e.g. number and size of hidden layers).

preference bias
Note: Considering Gradient Descent over the perceptron training rule for the notes below.

In general, we prefer low complexity in our neural networks: smaller weights, fewer hidden layers, and smaller hidden layers.

This is accomplished by:

- Choosing small, random values for the initial input weights. This helps us avoid local minima and ensures that when the algorithm is run subsequent times it doesn't fall into the same traps.
- Smaller values for weights also help avoid the overfitting that large values are prone to (since larger values allow a wider range of weights that can be applied).
7.5 Neural Nets Mini-project
1.Build a Perceptron.py
#-----------------------------------

#
# In this exercise you will put the finishing touches on a perceptron class
#
# Finish writing the activate() method by using numpy.dot and adding in the thresholded
# activation function

import numpy

class Perceptron:

    weights = [1]
    threshold = 0

    def activate(self, values):
        '''Takes in @param values, a list of numbers.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''

        # YOUR CODE HERE

        # TODO: calculate the strength with which the perceptron fires

        # TODO: return 0 or 1 based on the threshold

        return result

    def __init__(self, weights=None, threshold=None):
        if weights:
            self.weights = weights
        if threshold:
            self.threshold = threshold
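One possible fill-in for activate(), written here as a standalone helper (activate_sketch is a hypothetical name) so it can be tested on its own; it uses numpy.dot as hinted in the comments:

import numpy

def activate_sketch(values, weights, threshold):
    # strength: weighted sum of the inputs
    strength = numpy.dot(values, weights)
    # fire (1) only if the strength exceeds the threshold
    return 1 if strength > threshold else 0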

#-----------------------------------

#
# In this exercise we write a perceptron class
# which can update its weights
#
# Your job is to finish the train method so that it implements the perceptron update rule

import numpy as np

class Perceptron:

    weights = [1]
    threshold = 0

    def activate(self, values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''

        # First calculate the strength with which the perceptron fires
        strength = np.dot(values, self.weights)

        if strength > self.threshold:
            result = 1
        else:
            result = 0

        return result

    def update(self, values, train, eta=.1):
        '''Takes in a 2D array @param values and a 1D array @param train,
        consisting of expected outputs for the inputs in values.
        Updates internal weights according to the perceptron training rule
        using these values and an optional learning rate, @param eta.
        '''
        # YOUR CODE HERE
        # update self.weights based on the training data

    def __init__(self, weights=None, threshold=None):
        if weights:
            self.weights = weights
        if threshold:
            self.threshold = threshold
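One possible sketch of the perceptron training rule for update(), written as a standalone helper (update_sketch is a hypothetical name; it assumes values holds one example per row and train the matching target outputs):

import numpy as np

def update_sketch(weights, threshold, values, train, eta=.1):
    weights = list(weights)
    for example, target in zip(values, train):
        # current prediction with the threshold activation
        prediction = 1 if np.dot(example, weights) > threshold else 0
        error = target - prediction
        # perceptron rule: w_i <- w_i + eta * (t - o) * x_i
        for i in range(len(weights)):
            weights[i] += eta * error * example[i]
    return weights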

#
# In this exercise, you will create a network of perceptrons which
# represent the xor function use the same network structure you used
# in the previous quizzes.
#
# You will need to do two things:
# First, create a network of perceptrons with the correct weights
# Second, define a procedure EvalNet() which takes in a list of
# inputs and ouputs the value of this network.

import numpy as np

class Perceptron:

    weights = [1]
    threshold = 0

    def evaluate(self, values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''

        # First calculate the strength with which the perceptron fires
        # (the original snippet indexed values[i]/self.weights[i] with an
        # undefined i; the dot product over the full vectors is intended)
        strength = np.dot(values, self.weights)

        # Then evaluate the return value of the perceptron
        if strength >= self.threshold:
            result = 1
        else:
            result = 0

        return result

    def __init__(self, weights=None, threshold=None):
        if weights:
            self.weights = weights
        if threshold:
            self.threshold = threshold

Network = [
    # input layer, declare perceptrons here
    [ ... ],
    # output node, declare one perceptron here
    [ ... ]
]

def EvalNetwork(inputValues, Network):

    # Be sure your output values are single numbers

    return OutputValues
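One possible wiring for the XOR quiz, assuming the evaluate() above (which fires when strength >= threshold): the hidden layer computes OR and AND of the two inputs, and the output fires only when OR is on and AND is off.

Network = [
    # hidden layer: an OR unit and an AND unit
    [Perceptron([1, 1], 1), Perceptron([1, 1], 2)],
    # output: OR - 2*AND >= 1 exactly when the inputs differ (XOR)
    [Perceptron([1, -2], 1)]
]

def EvalNetwork(inputValues, Network):
    hidden = [p.evaluate(inputValues) for p in Network[0]]
    # a single number, as required
    return Network[1][0].evaluate(hidden)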

#
# Python Neural Networks code originally by Szabo Roland and used by permission
#
# Modifications, comments, and exercise breakdowns by Mitchell Owen, (c) Udacity
#
# Retrieved originally from http://rolisz.ro/2013/04/18/neural-networks-in-python/
#
#
# Neural Network Sandbox
#
# Define an activation function activate(), which takes in a number and returns a number.
# Using test run you can see the performance of a neural network running with that
# activation function.
#
import numpy as np

def activate(strength):
    return np.power(strength, 2)

def activation_derivative(activate, strength):
    # numerically approximate
    return (activate(strength+1e-5) - activate(strength-1e-5)) / (2e-5)

#
# As with the perceptron exercise, you will modify the
# last functions of this sigmoid unit class
#
# There are two functions for you to finish:
# First, in activate(), write the sigmoid activation function
#
# Second, in train(), write the gradient descent update rule
#
# NOTE: the following exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!

import numpy as np

class Sigmoid:

    weights = [1]
    last_input = 0

    def activate(self, values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''

        # First calculate the strength with which the perceptron fires
        strength = self.strength(values)
        self.last_input = strength

        # YOUR CODE HERE

        # modify strength using the sigmoid activation function

        return result

    def strength(self, values):
        strength = np.dot(values, self.weights)
        return strength

    def update(self, values, train, eta=.1):
        '''
        Updates the sigmoid unit with expected return
        values @param train and learning rate @param eta

        By modifying the weights according to the gradient descent rule
        '''

        # YOUR CODE HERE

        # modify the perceptron training rule to a gradient descent
        # training rule; you will need to use the derivative of the
        # logistic function evaluated at the last input value.
        # Recall: d/dx logistic(x) = logistic(x)*(1-logistic(x))

        result = self.activate(values)
        for i in range(0, len(values)):
            self.weights[i] += eta*(train - result)*values[i]

    def __init__(self, weights=None):
        if weights:
            self.weights = weights

unit = Sigmoid(weights=[3,-2,1])
unit.update([1,2,3],[0])
print unit.weights
#Expected: [2.99075, -2.0185, .97225]
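One possible completion of the two YOUR CODE HERE sections above. It assumes a helper logistic(x) = 1/(1 + np.exp(-x)); the gradient-descent rule scales the perceptron rule by the logistic derivative at the last input (with train taken as the one-element list from the example call), and it reproduces the expected weights.

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# in activate(): modify strength with the sigmoid
#     result = logistic(strength)

# in update(): gradient descent instead of the plain perceptron rule
#     result = self.activate(values)
#     grad = logistic(self.last_input) * (1 - logistic(self.last_input))
#     for i in range(len(values)):
#         self.weights[i] += eta * (train[0] - result) * grad * values[i]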

#
# In the following exercises we will complete several functions for a
# simple implementation of neural networks based on code by Roland
# Szabo.
#
# In this exercise, we will will write a function, predict(),
# which will predict the value of given inputs based on a constructed
# network.
#
# Note that we are not using the Sigmoid class we implemented earlier
# to be able to compute more efficiently.
#
# NOTE: the following exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!

import numpy as np

#choose a seed for testing in the exercise


#np.random.seed(1)

def logistic(x):
    return 1/(1 + np.exp(-x))

def logistic_derivative(x):
    return logistic(x)*(1-logistic(x))

class NeuralNetwork:

    def __init__(self, layers):
        """
        :param layers: A list containing the number of units in each
        layer. Should be at least two values
        """
        self.activation = logistic
        self.activation_deriv = logistic_derivative

        self.weights = []
        # randomly initialize weights
        for i in range(1, len(layers) - 1):
            self.weights.append((2*np.random.random((layers[i - 1] + 1, layers[i] + 1))-1)*0.25)
        self.weights.append((2*np.random.random((layers[i] + 1, layers[i + 1]))-1)*0.25)

    def predict(self, x):
        """
        :param x: a 1D ndarray of input values
        :return: a 1D ndarray of values of output nodes
        """

        # YOUR CODE HERE

        # our neural network is a numpy array self.weights
        # its first dimension is layers; self.weights[0] is the first
        # (input) layer.
        # its second dimension is nodes; self.weights[1][3] is the 4th
        # node in the second (hidden) layer.
        # its third dimension is weights; self.weights[1][3][2] will be
        # the weight assigned to the input from the third node on the
        # first layer.

        # for each layer, evaluate the nodes in that layer
        # by taking the dot product of the output of the previous layer
        # (or the input in the case of the first layer)
        # with the weights for that node, then applying the activation
        # function, self.activation()

        # also make sure to add a constant dummy value to the input by
        # appending 1 to it

        # return the output vector from the last layer.
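A minimal sketch of the predict() body following the comments above, meant to sit inside the NeuralNetwork class (append a constant 1 to the input, then for each layer take the dot product with that layer's weight matrix and apply self.activation):

def predict(self, x):
    """
    :param x: a 1D ndarray of input values
    :return: a 1D ndarray of values of output nodes
    """
    x = np.array(x)
    a = np.ones(x.shape[0] + 1)
    a[0:-1] = x                        # constant dummy input appended
    for W in self.weights:             # one weight matrix per layer
        a = self.activation(np.dot(a, W))
    return a                           # output of the last layer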

#
# In the following exercises we will complete several functions for
# a simple implementation of neural networks based on code by Roland
# Szabo.
#
# In this exercise, we will begin by writing a function, deltas(),
# which will compute and store delta factors for each node in a
# layer, given the deltas for the previous layer.
#
# Recall that the delta value associated to an output node is the
# activation_derivative
# of the node's last_input multiplied by the difference of its expected output minus
# its actual output
#
# The delta value associated to a hidden node is the activation_derivative of the
# node's last_input times the sum over the next layer of the products of each nodes
# delta value times weight from the current node
#
# NOTE: the following exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!

import numpy as np

def logistic(x):
    return 1/(1 + np.exp(-x))

def logistic_derivative(x):
    return logistic(x)*(1-logistic(x))

class Sigmoid:

    # default weights for an input node, usually changed when initialized
    weights = [1]
    # keeps track of previous input strengths for backpropagation
    last_input = 0
    # space to keep track of deltas for backpropagation
    delta = 0

    def activate(self, values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''

        # First calculate the strength with which the perceptron fires
        strength = self.strength(values)
        self.last_input = strength
        result = logistic(strength)

        return result

    def strength(self, values):
        # Formats inputs to easily compute a dot product
        local = np.atleast_2d(self.weights)
        values = np.transpose(np.atleast_2d(values))
        strength = np.dot(local, values)
        return float(strength)

    def __init__(self, weights=None):
        if type(weights) in [type([]), type(np.array([]))]:
            self.weights = weights

class NeuralNetwork:

    def __init__(self, layers):
        """
        :param layers: A list containing the number of units in each layer. Should be at least two values
        """

        self.nodes = [[]]
        # input nodes
        for j in range(0, layers[0]):
            self.nodes[0].append(Sigmoid())
        # randomly initialize weights
        for i in range(1, len(layers)-1):
            self.nodes.append([])
            for j in range(0, layers[i]+1):
                self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i - 1]+1)-1)*.25))
        self.nodes.append([])
        for j in range(0, layers[i+1]):
            self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i]+1)-1)*.25))

    def predict(self, x):
        """
        :param x: a 1D ndarray of input values
        :return: a 1D ndarray of values of output nodes
        """
        x = np.array(x)
        a = np.ones(x.shape[0]+1)
        a[0:-1] = x
        for l in range(1, len(self.nodes)):
            a = [node.activate(a) for node in self.nodes[l]]
        return a

    def deltas(self, expected, outputs, layer):
        '''
        :param expected: an array of expected outputs (in the case of an output layer) or deltas from the
        previous layer (in the case of an input layer)
        :param outputs: an array of actual outputs from the layer
        :param layer: which layer of the network to update.
        sets the delta values for the units in the layer
        :returns: a list of the delta values for use in the next previous layer
        '''

        # YOUR CODE HERE
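In symbols, the delta factors described in the comments above (f is the activation function and net the node's last input):

\delta_o = f'(net_o) (y_o - o_o)   for an output node o

\delta_h = f'(net_h) \sum_{k \in \text{next layer}} w_{kh} \delta_k   for a hidden node h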


21. BackPropagation.py
#
# In the following exercises we will complete several functions for a simple
# implementation of neural networks based on code by Roland Szabo.
#
# In this exercise, we will begin writing a function, fit(), which will train our
# network on data that we provide.
#
# Special thanks to Roland Szabo for the use of his code as a basis for this and
# preceding exercises. His original code can be found at
# http://rolisz.ro/2013/04/18/neural-networks-in-python/
#
# NOTE: this and preceding exercises creating classes for functioning
# neural networks are HARD, and are not efficient implementations.
# Consider them an extra challenge, not a requirement!

import numpy as np

#choose a seed for testing in the exercise


np.random.seed(1)

def logistic(x):
    return 1/(1 + np.exp(-x))

def logistic_derivative(x):
    return logistic(x)*(1-logistic(x))

class Sigmoid:

    # default weights for an input node, usually changed when initialized
    weights = [1]
    # keeps track of previous input strengths for backpropagation
    last_input = 0
    # space to keep track of deltas for backpropagation
    delta = 0

    def activate(self, values):
        '''Takes in @param values, @param weights lists of numbers
        and @param threshold a single number.
        @return the output of a threshold perceptron with
        given weights and threshold, given values as inputs.
        '''

        # First calculate the strength with which the perceptron fires
        strength = self.strength(values)
        self.last_input = strength

        result = logistic(strength)

        return result

    def strength(self, values):
        # Formats inputs to easily compute a dot product
        local = np.atleast_2d(self.weights)
        values = np.transpose(np.atleast_2d(values))
        strength = np.dot(local, values)
        return float(strength)

    def __init__(self, weights):
        if type(weights) in [type([]), type(np.array([]))]:
            self.weights = weights

class NeuralNetwork:

    def __init__(self, layers):
        """
        :param layers: A list containing the number of units in each layer. Should be
        at least two values
        """

        self.nodes = [[]]
        # input nodes
        for j in range(0, layers[0]):
            self.nodes[0].append(Sigmoid())
        # randomly initialize weights
        for i in range(1, len(layers)-1):
            self.nodes.append([])
            for j in range(0, layers[i]+1):
                self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i - 1]+1)-1)*.25))
        self.nodes.append([])
        for j in range(0, layers[i+1]):
            self.nodes[-1].append(Sigmoid((2*np.random.random(layers[i]+1)-1)*.25))

        # for x in self.nodes:
        #     print len(x), "Layer", len(x[-1].weights)

    def predict(self, x):
        """
        :param x: a 1D ndarray of input values
        :return: a 1D ndarray of values of output nodes
        """
        a = np.ones(x.shape[0]+1)
        a[0:-1] = x
        for l in range(1, len(self.nodes)):
            a = [node.activate(a) for node in self.nodes[l]]
            # print a
        return a

    def BackPropagation(self, X, y, learning_rate=0.2, epochs=3000):
        """
        :param X: a 2D ndarray of many input values
        :param y: a 2D ndarray of corresponding desired output vectors
        :param learning_rate: controls the learning rate (optional)
        :param epochs: controls the number of training iterations (optional)
        """

        # YOUR CODE HERE

        # In each epoch, we will choose and train on an example from X, y

        # to train on each example, we will first need to evaluate the example from X,
        # storing the signal strength at each node before the activation is applied.

        # Then compare the outputs in y to our outputs, and scale them by the
        # activation_derivative(strength) at the signal strengths for each of the output
        # nodes.

        # Iterate backwards over the layers, using the deltas method below to associate a
        # rate of change to each node

        # then modify each of the (non-input) node's weights by the learning rate times
        # the current node's delta times the previous node's last input.

    def deltas(self, y, outputs, layer):
        '''
        :param y: an array of expected outputs
        :param outputs: an array of actual outputs from the layer
        :param layer: which layer of the network to update. Use -1 for output layer.
        sets the delta values for the units in the layer
        :returns null:
        '''

        if layer == -1:
            final = [y[i]-outputs[i] for i in range(0, len(y))]
        else:
            final = []
            for i in range(0, len(self.nodes[layer])):
                sum = 0
                for j in range(0, len(self.nodes[layer+1])):
                    sum += self.nodes[layer+1][j].weights[i] * self.nodes[layer+1][j].delta
                final.append(sum)
        for i in range(0, len(self.nodes[layer])):
            self.nodes[layer][i].delta = logistic_derivative(outputs[i])*final[i]
8. Kernel Methods & SVMs
9. Kernel - Georgia Tech - Machine Learning
9. SVM
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt


import copy
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## SVM #################################
### we handle the import statement and SVC creation for you here
from sklearn.svm import SVC
clf = SVC(kernel="linear")

#### now your job is to fit the classifier


#### using the training features/labels, and to
#### make a set of predictions on the test data

#### store your predictions in a list named pred


from sklearn.metrics import accuracy_score
acc = accuracy_score(pred, labels_test)

def submitAccuracy():
    return acc
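One possible fill-in for the fit/predict step above:

#### fit on the training set, then predict on the test set
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)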
Kernel idea: mapping onto a z-axis or projection >> separates nearby/small values from large values
10. Instance Based Learning
11. Naive Bayes
#!/usr/bin/python

""" Complete the code in ClassifyNB.py with the sklearn


Naive Bayes classifier to classify the terrain data.

The objective of this exercise is to recreate the decision


boundary found in the lesson video, and make a plot that
visually shows the decision boundary """

from prep_terrain_data import makeTerrainData


from class_vis import prettyPicture, output_image
from ClassifyNB import classify

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

### the training data (features_train, labels_train) have both "fast" and "slow" points mixed
### in together--separate them so we can give them different colors in the scatterplot,
### and visually identify them
grade_fast = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==0]
bumpy_fast = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==0]
grade_slow = [features_train[ii][0] for ii in range(0, len(features_train)) if labels_train[ii]==1]
bumpy_slow = [features_train[ii][1] for ii in range(0, len(features_train)) if labels_train[ii]==1]

# You will need to complete this function imported from the ClassifyNB script.
# Be sure to change to that code tab to complete this quiz.
clf = classify(features_train, labels_train)
### draw the decision boundary with the text points overlaid
prettyPicture(clf, features_test, labels_test)
output_image("test.png", "png", open("test.png", "rb").read())

#!/usr/bin/python

#from udacityplots import *


import matplotlib
matplotlib.use('agg')

import matplotlib.pyplot as plt


import pylab as pl
import numpy as np

#import numpy as np
#import matplotlib.pyplot as plt
#plt.ioff()

def prettyPicture(clf, X_test, y_test):
    x_min = 0.0; x_max = 1.0
    y_min = 0.0; y_max = 1.0

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .01  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())

    plt.pcolormesh(xx, yy, Z, cmap=pl.cm.seismic)

    # Plot also the test points
    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    plt.scatter(grade_sig, bumpy_sig, color="b", label="fast")
    plt.scatter(grade_bkg, bumpy_bkg, color="r", label="slow")
    plt.legend()
    plt.xlabel("bumpiness")
    plt.ylabel("grade")

    plt.savefig("test.png")

import base64
import json
import subprocess

def output_image(name, format, bytes):
    image_start = "BEGIN_IMAGE_f9825uweof8jw9fj4r8"
    image_end = "END_IMAGE_0238jfw08fjsiufhw8frs"
    data = {}
    data['name'] = name
    data['format'] = format
    data['bytes'] = base64.encodestring(bytes)
    print image_start+json.dumps(data)+image_end
#!/usr/bin/python
import random

def makeTerrainData(n_points=1000):
###############################################################################
    ### make the toy dataset
    random.seed(42)
    grade = [random.random() for ii in range(0, n_points)]
    bumpy = [random.random() for ii in range(0, n_points)]
    error = [random.random() for ii in range(0, n_points)]
    y = [round(grade[ii]*bumpy[ii]+0.3+0.1*error[ii]) for ii in range(0, n_points)]
    for ii in range(0, len(y)):
        if grade[ii]>0.8 or bumpy[ii]>0.8:
            y[ii] = 1.0

    ### split into train/test sets
    X = [[gg, ss] for gg, ss in zip(grade, bumpy)]
    split = int(0.75*n_points)
    X_train = X[0:split]
    X_test = X[split:]
    y_train = y[0:split]
    y_test = y[split:]

    grade_sig = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==0]
    bumpy_sig = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==0]
    grade_bkg = [X_train[ii][0] for ii in range(0, len(X_train)) if y_train[ii]==1]
    bumpy_bkg = [X_train[ii][1] for ii in range(0, len(X_train)) if y_train[ii]==1]

    # training_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
    #                 , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    grade_sig = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==0]
    bumpy_sig = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==0]
    grade_bkg = [X_test[ii][0] for ii in range(0, len(X_test)) if y_test[ii]==1]
    bumpy_bkg = [X_test[ii][1] for ii in range(0, len(X_test)) if y_test[ii]==1]

    test_data = {"fast":{"grade":grade_sig, "bumpiness":bumpy_sig}
                , "slow":{"grade":grade_bkg, "bumpiness":bumpy_bkg}}

    return X_train, y_train, X_test, y_test


# return training_data, test_data
def classify(features_train, labels_train):
    ### import the sklearn module for GaussianNB
    ### create classifier
    ### fit the classifier on the training features and labels
    ### return the fit classifier

    ### your code goes here!


def NBAccuracy(features_train, labels_train, features_test, labels_test):
    """ compute the accuracy of your Naive Bayes classifier """
    ### import the sklearn module for GaussianNB
    from sklearn.naive_bayes import GaussianNB

    ### create classifier
    clf = #TODO

    ### fit the classifier on the training features and labels
    #TODO

    ### use the trained classifier to predict labels for the test features
    pred = #TODO

    ### calculate and return the accuracy on the test data
    ### this is slightly different than the example,
    ### where we just print the accuracy
    ### you might need to import an sklearn module
    accuracy = #TODO
    return accuracy
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
from classify import NBAccuracy
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

def submitAccuracy():
    accuracy = NBAccuracy(features_train, labels_train, features_test, labels_test)
    return accuracy
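One possible fill-in for the classify()/NBAccuracy() TODOs above:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)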

is you should save maybe 10% of your data and you use it as your testing set.

And that's what you use to actually tell you how much progress you're

making in terms of learning your patterns in your data.


>> So, when you report the results to yourself and to your boss or

your client, use the test results, because it's a much better,

fairer kind of understanding of how well you are doing with your training data.
11.5 Bayes NLP Mini Project
12. Bayesian Learning

we're trying to do. We are trying to learn

the best hypothesis we can given some data and

some domain knowledge. Do you buy that as an assertion?


of extra domain knowledge that comes into play for example when you pick

a like a similarity metric for something like k nearest neighbors

we're trying to learn the most likely or the most probable hypothesis

given the data and whatever domain knowledge we bring to bear. You buy that?
it's the hypothesis

that we think is most likely, given the data that we've seen.

Given the training set and given whatever domain knowledge that we bring to

bear on the problem, the best hypothesis is the one that is

most likely, that is, most probable. Or, most probably, the correct one.
It's the

probability of, some particular hypothesis h, drawn from

some hypothesis class. Given some amount of data

which I'm just going to refer to as D

So we want to find the argmax, of h, drawn from Your hypothesis class.

That is we want to find the hypothesis drawn from

the hypothesis class that has the highest probability given the data.
chain rule basically the definition of conditional probability in conjunctions

hypothesis given the data. But what do all these other terms mean?
probability of the data given the hypothesis right?
likelihood that

we would see some data given that we were in a world


where some hypothesis, h, is true

returns true, in exactly the cases where some input number X, is


greater than or equal to 10, and it returns false otherwise. Okay?
Let's say that our data was made up of exactly one

point. And that value set x equal to 7. Okay? What is

the probability that the label associated with 7. Would be true.


really about, given a set of x's, what's

the probability that I would see some particular

label. Now, what's nice about that is, is,

as you point out, is that, it's as

if we're running the hypothesis. Well, given a hypothesis,

D is the prior on the data, this is in

fact your prior on the hypothesis. So, just like

the probability of D is a prior on the data.

The probability of H is a prior on a particular


hypothesis drawn from the hypothesis space. So in other words,

in encapsulates our prior belief that one hypothesis is likely

or unlikely compared to other hypotheses. So in fact what's

really neat about this from a sort of AI point

of view is that the prior ,as its called, is in

fact our domain knowledge.


features might be important, so we care about high

information gain and decision trees, or our belief

about the, the structure of a neural network. Those

are prior beliefs, those are, that represents the

domain knowledge.
The probability of the hypothesis given

the data, what could make that combined quantity

go up, so one is looking at the

right hand side, the probability of the hypothesis,

so, so if you have a hypothesis that has

a higher prior, has, is more likely to be

a good one. Before you see the data then

that would raise it after you see the data too.


hypothesis that does a better job of

labeling the data, then also your probability of the hypothesis will go up.
I guess the probability of the data going

down. But that's not really a change from the hypothesis.

>> Right. But it is true that if those

goes down, then the probability in the hypothesis can and

the data will go up. But as you point

out, it's not connected to the hypothesis directly.


the probablity of us seeing, some labels

on some data, given hypothesis. Times the probability of the hypothesis,

even without any data whatsoever, normalized by the

probability of the data. So let's play around with

Bayes' rules a little bit and make certain that

we all, we all kind of get it. Okay?


So Bayes' Rule, is everyone recall, is

the probability of the hypothesis given the data is equal to

the probability of the data given the hypothesis times the probability
of the hypothesis divided by the probability of the data.
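In symbols:

P(h \mid D) = \frac{P(D \mid h) \, P(h)}{P(D)}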

>> So, which number is bigger?

>> The one that has the larger significant digit.


Because if the prior probability is low, then

the test isn't very useful. On the other hand, as soon as

you have any reason to believe We have strong evidence that someone

might have some condition, then it makes sense to test them for it.
that Bayes' rule actually gives you some information. It actually helps

you make a decision.


For each H in H, that is, each candidate hypothesis

in our in our hypothesis space, simply

calculate the probability of that hypothesis given

the data W which we know is equal to the probability of the data given

that hypothesis times the prior probability of

the hypothesis, divided by the probability of

the data. And then simply output whichever

hypothesis has maximum probability. Does that make sense?


all we care about is computing the argmax, as before,
can ignore it for the purposes of finding the maximal hypothesis.
>> So the place you removed it from, it seems like that's not actually valid,

because it's not the case that the probability of h given d equals, it's the

probability of d given h times the probability of h. It just means that we


don't care what the probability is when

we go to compute the argmax. That's right,

so, in fact, it's probably better to say that I'm going to approximate

the probability hypothesis given the data

by just calculating the probability of the

data given the hypothesis times the probability of the hypothesis and just go

ahead and ignore the denominator. Precisely

because it doesn't change the maximal h.

makes sense, it's the biggest posterior given all of your

priors.
that often it's just as hard

to say anything particular about your prior over the hypothesis

as it is to say something about your prior of the

data and, so it is very common to drop that. And,


in dropping that, we're actually computing the argmax over the probability

of the data given the hypothesis. And,

that is known as the maximum likelihood hypothesis.
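In symbols, with P(D) dropped (and, for the maximum likelihood case, the prior P(h) dropped as well):

h_{MAP} = argmax_{h \in H} P(D \mid h) \, P(h)

h_{ML} = argmax_{h \in H} P(D \mid h)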

our prior belief is that

all hypotheses are equally likely. So we

have a uniform prior that is, the probability

of any given hypothesis is exactly the same

as the probability as any other given hypothesis.


Computing the probability of

you seeing the data labels given a particular

hypothesis and it turns out that those are

effectively the same thing if you don't have

a strong prior.
that we're going to be given a bunch of labeled

training data, which I'm writing here as x sub i and d sub i, so x sub i

is whatever the input space is, and d sub i are these labels. And let's say, it

doesn't actually even matter what the labels are, but

let's say that the labels are classification labels. Okay?


>> So I'm going to say, in fact, let me write

that down because I think it's important. They're noise-free examples. Okay.

>> Like di equals c of xi.

>> That's right, for all xi. So, the second assumption, is that the true

concept c, is actually in our hypothesis

space, whatever that hypothesis space is. And finally,

we have no reason to believe that

any particular hypothesis in our hypothesis space is

more likely than any other. And so, we have a uniform prior over our hypotheses.
>> So it's like the one thing we know is that we don't know anything.

>> That's right. So, sometimes people called this an uninformative prior

because you don't know anything. Except of course I've always thought that's

a terrible name because its a completely informative prior. In fact

its equally as informative as every other prior in that it tells

you something that all hypothesis are equally likely. But that's

>> I thought it was called an uninformed prior.

>> Is it? So its just an ignorant prior is what you're telling me.
these are our assumptions. We've got

a bunch of data, it's noise free, the concept

is actually in the hypothesis space we care

about and we have a uniform prior. So we

need to compute the best hypothesis. So given

that we want to somehow compute the probability of

some hypothesis given the data, right? That's just

Bayes' Rule. So, Michael, you've got the problem, right?


Let's try the prior probability.

So Michael, what's the prior probability on H?

>> Did we say that it was a finite hypothesis class?

>> It is a finite hypothesis class.


>> Then it's like, one over the

size of that hypothesis class because it's uniform.

>> Exactly right, uniform means Exactly that. Okay so we've got one of our

terms, good job. let's pick another term. How about the probability of

data given the hypothesis. What's that?


>> The probability, so I guess noise free, and we know that it's

noise free so it's always, so they're always going to be zeros and ones.

the probability that I would see data with these labels in

a universe where H is actually true. Which is different

from saying that H is true or H is false. It's

really a common about the labels that you see on

a data. In a universe, where H happens to be true.


So we can write the

probability of the data as being, basically, a marginalized version

of the probability of the data given each of the hypotheses

times the probability of the hypotheses. Now, this is only

true in a world where our hypotheses are mutually exclusive.


for every hypothesis that is in the

version space of the hypothesis space given the

labels that we've got. Okay? How's that count?


So rather than having to come

up with an indicator function, I'm just going to

define vs as the subset of all

those hypotheses that are consistent with the data.

>> Yeah exactly

>> Okay, and so whats the probability of those?

>> One It's one and it's zero otherwise, so then,

we can simplify the sum and it's simply what? ?

>> The sum of the one, ooh! The

one of each doesn't even depend on the hypothesis.


>> I see, wait, I don't see... oh yes I do, I do: it's one over the size
of the version space. No, it's the size of the

version space over the size of the hypothesis space.

>> That's exactly right.

Basically for every single hypothesis in the version space we're

going to add one and how many of those are?

Well the size of the version space number of those.

And multiply all that by one over the size hypothesis space,

and so the probability of the data is that term. So

now we can just substitute all of that, into our

handy dandy equation up there, and let's just do that.

So the probability of the hypothesis given the data, is the

probability of the data given the hypothesis Which we know is one for all those

that are consistent, zero otherwise. The probability

of the prior probability over the hypothesis is


just one over the size of the hypothesis space, and the probability of the data

is the size of the version space over the size of the hypothesis space which,

when we divide everything out, is simply this. Got it?
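Substituting the three quantities just derived (likelihood 1 for consistent hypotheses, prior 1/|H|, evidence |VS_{H,D}|/|H|):

P(h \mid D) = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}   for h consistent with D, and 0 otherwise.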


Answer:
>> So, what does that all say? It says that, given a bunch of data, your

probability of a particular hypothesis being correct, or being the best one

or the right one, is simply uniform over all of the hypotheses that are

in the version space. That is, are consistent with the data that we see.

>> Nice.

>> It

is kind of nice. And by the way, if it's

not consistent with it, then it's zero. So, this is

only true for hypotheses that are still in

version space and zero otherwise. Now notice that all of


this sort of works out only in a world

where you really do have noise free examples, and you

know that the concept is actually in your hypothesis space

and, just as crucially that you have a uniform prior

for all the hypotheses. Now this is exactly the algorithm that

we talked about before right. We talked about before what would

we do. To kind of decide whether a hypothesis was good

enough in this sort of noise-free world. And the answer we came

up with is you should just pick one of them that's

in the version space. And what this says is there's no reason

to pick one over the other from the version space. They're

all equally as good or rather equally as likely to be correct.

>> Yeah,

that follows.

>> Yeah. So there you go. So it turns out you

can actually do something with this. Notice by the way that we

did not pick a particular hypothesis space, we did not pick

a particular form of our instance space, we did not actually say

anything at all about exactly what the labels were other than

that they were labels of some sort. The strongest assumption that we

made was a uniform prior, so this is always the right thing

to do. At least in a Bayesian sense, in a world where

you've got noise-free data, the true concept is in your hypothesis space, and

you have uniform priors. Just pick


something from the consistent set of hypotheses.

> Where this is the hypothesis that actually matters. We're saying that X

comes in, the hypothesis spits that same X out. And then this noise process

causes it to become a multiple. And the probability of a multiple is this

one over two to the case. So, the probability that that would happen from

this hypothesis. for the very first data item. The one to five, would be

1/32. That's the probability that a one would produce a five by this process.

>> Okay. How do you, how'd you figure that out?

>> Cause the k that we would need

the multiplier would have to be five. And so

the probability for that multiplier is exactly one over

two to the five, which is one thirty-second.

>> Okay.

>> And so then I would use that same thought


process on the next one which says that it is doubled and the way that

this particular process would have produced a doubling

would be if with, with probability a quarter.

>> Uh-hm.

>> And, the next data element would have

been produced by this process with probability at

half, because it's k will be 1, and 1 over 2 to the k would be half,

>> Okay, I like this.

>> Right? The next one will be an 8th, because its

tripled,

>> Uh-hm.

>> And the last one is also a multiplier of 5, just

like the first one, so that will be one thirty second as well,

>> Mm-hm.

>> Alright but now we need to assign a probability

to the whole data set, and because you told me it

was okay to think about these things happening independently, the

probability that all these things would happen is exactly the product.

>> Right.

>> So I'll multiply a 32nd and a quarter and

1/2 and an 8th and a 32nd, so that's like a factor of 5 plus 2

is 7 plus 1 is 8. Plus another

3 is 11 plus another 5 is 16 and 2^16

is 65,536. So it should be 1 over, oh you already wrote it. 65,536. Yea that.
>> Yes that's absolutely correct Michael. Well done. Okay so,

that's right, but you did it with a bunch of specific numbers. Is

there a more generic, general form that we could write down?

>> Yeah, I think so, we're doing something pretty regular

once I fell into a pattern. So, I took the D,

and divided by X, so D over X tells me

that the multiplier that was used, so that's like, the K.

>> So. D over x gave you the k.

>> And it was one

over 2 to the that.

>> Okay, so one over 2 to the that.

>> And it was then the product of, of that quantity for all of

the data elements, so all the i's. So product over all the i's of that.

>> Okay.

>> But we have to be careful because If it

was the case that for any of our xi's the

d wasn't a multiple of it, that can't happen under

this hypothesis and the whole probability needs to go to zero.

>> Right.

>> So they all have to

be divisible otherwise all bets are off.

>> Okay, so in other words if d of i mod x of

i is equal to zero, then this formula holds, and it's zero otherwise.

>> Exactly.
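In symbols, the general form Michael just described (x_i are the inputs, d_i the observed outputs):

P(D \mid h) = \prod_i \left(\frac{1}{2}\right)^{d_i / x_i}   if d_i \bmod x_i = 0 for every i, and 0 otherwise.

For the example above: \frac{1}{32} \cdot \frac{1}{4} \cdot \frac{1}{2} \cdot \frac{1}{8} \cdot \frac{1}{32} = \left(\frac{1}{2}\right)^{16} = \frac{1}{65536}.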
>> Okay. Sounds good. Okay, great Michael. So that's right and that

was exactly the right way of thinking about it. And now, what we're

going to do next, is we're going to take what we've just gone through.

This sort of process of thinking about, how to generate data labels. for,

you know, noisy cases and we're going to apply to it what I

think you will find will be a pretty cool derivation. Sound good?

>> Awesome!

>> Excellent.
8. Return to Bayesian Learning - Georgia Tech - Machine Learning >> return for TXT subtitles File

We know how to find the maximum likelihood hypothesis, at

least we know an equation for it. The maximum


likelihood hypothesis is simply the one that maximizes this expression.

>> Right. That was when we assumed a uniform

prior on the hypotheses.

>> Exactly. And so we, this is sort of the easiest

case to think about Where it turns out that finding the

hypothesis that best fits the data is the same as finding

a hypothesis that describes the data the best. If you make an

assumption about a uniform distribution, or a uniform prior. Okay, so.

This is all we have to do now is figure out

what we're going to do to expand this expression. So what do

you think we should do first? The probability of the data given

the hypothesis.
all of the data together to each of the

individual training data that we see. So what

do we do next? What is the probability

of seeing one particular d sub i, given that we're in a world where H is true.

>> So okay, given that H is true that means whatever the corresponding xi

is, if we push that through the f function, then the di is going to

be F of XI plus some error term so I guess if we took di minus

F X I, that would tell us what the error term is and the we

just need an expression for saying how likely it is that we get that much error.
Our goal here is, given all of this training

data, let's recover what the true f of x is.

And that's what our H is. Each of our hypotheses a

guess about what the true underlying deterministic function F is.


So, if we have some particular labels, some particular value D

sub I that is at variance with that. What's the probability of us

seeing something that far away from the

true underlying F. Well, it's completely determined

by, the noise model. And the noise is a Gaussian. So, we can

actually write down Gaussian. Do you remember

what the equation for a Gaussian is?

Take the log >> taking the log helps you solve it

>> Well how would you get rid of a minus sign?

>> So the max of a negative is the min. Right, so we can

get rid of the minus sign by

just simply minimizing instead of maximizing that expression.

We end up with this expression.
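Written out, the derivation being summarized (assuming independent, zero-mean Gaussian noise with variance sigma^2):

h_{ML} = argmax_h \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} = argmin_h \sum_i \big(d_i - h(x_i)\big)^2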


>> Nice. That's much simpler than where we started. The e is gone.

>> It's much simpler. We got rid of a bunch

of e's. We got rid of a bunch of turns

out extraneous constants. We got rid of multiplication.

what we're saying is we can do linear regression,

because linear regression is exactly about minimizing the sum

of the squares, right? So linear regression comes popping out

of this kind of Bayesian perspective just like that, so

is, is that part of what makes it so cool?

>> That is part of what makes it cool, but I

just think more generally about gradient descent right? The way gradient descend

works is you take a derivative by stepping in this, in

this space of the error function, which is sum of squared error.


>> I see, so you get gradient descend too.

>> Yes, you get all of the stuff that people have been doing.

Now, there's a piece of beauty there, which is that we derived

things like gradient descend and linear regression, all of the stuff we

were talking about before, and we have an argument for why it's

the right thing to do at least in a Bayesian sense. But

there's an even deeper beauty here, which is tied in with ugliness,

which is the reason this is the right thing to do, is

because of the specific assumptions that we've made. So what were the

assumptions that we made? We assumed that there was some True deterministic

function that was mapping our x's to our in

this case our d's and that they were corrupted

say transmission error or line noise or however you

want to think about it. They are corrupted by some

noise that has a very particular form. Uncorrelated, independently

drawn, Gaussian noise, with mean zero. So the less pretty

way of thinking about it is. Whenever you're trying

to minimize the sum of squared error, you are in

fact assuming that the data that you have has been corrupted by

Gaussian noise. And if it's corrupted by some other noise, or you're

actually not trying to model a deterministic function of this sort. And then

you are in fact, possibly, in fact most likely doing the wrong thing.

>> I mean are there other noise models

that lead to some other kinds of learning.


>> Sure, pick any other model in here that doesn't look

Gaussian at all, and you would end up with something else.

I don't know what you would end up with because. You

know, you couldn't do all these cute tricks with natural logs but

yes, you would end up with something different. And one question

you might ask yourself is well, if I try to do minimizing

the sum of the squared errors, or something for which this

model was not the right one, what sort of bad things might

happen? Here let me give you an example, let's imagine that we're

looking at this here, and our X's are, I don't know measurements

of people. Okay? So height and weight. Something like that.

>> Mm-hm.

>> And in fact let's make it, let's make it

let's make it even simpler than that. Let's imagine that our

x is our height. And our outputs, our d's, are

say weight. And what we're trying to learn is some kind of

function from height to weight. Now, this doesn't make a

lot of sense to have a true [INAUDIBLE], but I'm trying

to make a point here. So what we're saying here is

that we, we measure our height and then we measure weight.

That there's some simple relationship between

them that's captured by f. But, when

we measure the weight, we get a sort of noisy version of that

weight. Okay? That seems reasonable. But


what's not reasonable is we're saying our measurement of the weight is noisy, but our measurement of height is not.

>> Because if the x's are noisy, then this is not a valid assumption.

>> I see.

>> So, it seems to work

a lot of the time and we have an argument for

when it will work, but it's not clear that this particular assumption

actually makes a lot of sense in the real world. Even

though in practice it seems to do just fine. Okay, got it?

>> I think so though I feel like if the error if you put

an error term inside the f along with the x and f is say linear.

>> Mm-hm.

>> Then maybe it pops out and it just becomes another

part of the noise term and, and it all still goes through.

Like I feel lines are still pretty happy even with that.

>> No I think you're right. Lines would be happy here because

linear, I mean linear functions are very nicely behaved in that way.

But of course, they'd have to be the same noise model in

order for it to work the way you want it to work.

>> Yeah.

>> They'd have to both be Gaussian. They have to both have

zero mean, right? And they'd have to be independent of one another.

So your measuring device that gives you an error for your height

would also have to give you an independent normal error for the weight.
>> Yeah. Though I feel like my scale and my yardstick actually are fairly independent. And they're Gaussian?

>> Oh mine is clearly Gaussian.

>> Yeah.

>> Yeah. Well at least they're normal.

>> They normally are.

>> Mm-hm.
>> Okay good. So let's move on to the next thing Michael. Let's try one
more example of this and, and then I hope that means you got it, okay?
>> Sure.
>> Beautiful.
What I have here for you is a maximum a posteriori equation, right?

So the best hypothesis is the one that maximizes this expression.


The log is a monotonic function and so it doesn't change the argmax. So, I'm going to take the log of both sides here, but this time I'm going to do log base 2, for no particular reason other than it will turn out to be convenient. And, using the trick that you used before, I'm going to change my max into a min by simply multiplying everything by minus 1.


That's true. But we know from information theory, based exactly on this

notion of entropy, that the optimal code for some event with probability P has

length minus log base 2 of P. So, that just comes straight out of information

theory. That's where all the entropy stuff comes from. Okay. So, if we

have some event that has some particular probability P of happening,

the best code for it has this structure, minus log of P.


If we apply it here, what is this actually saying? This is saying that, in order to find the maximum a posteriori hypothesis, we want to somehow minimize two terms that can be described as lengths.
>> So my question to you is, given that this definition over here, that an

event with probability P has some length minus

log P, what is this the length of?

>> So that would be the length of

the probability of the data given the hypothesis.

>> Mm-hm.

>> And the length of the hypothesis, or the probability of the hypothesis.

>> Well no, it's just the length of that hypothesis.

>> Oh, because the event is what has the length.

Oh, I see. So it's the length of the data,

given the hypothesis, and the length of the hypothesis.

>> Right. So let's write that out.

>> But I was just doing, like, pattern matching


there. It's not clear to me what a length

of a hypothesis is. Hypotheses are functions. I don't

know how to take a tape measure to a function.

>> That's fair. So this is the length of the hypothesis. Right?

>> Yep.

>> So, you said you don't know what that means. But, let's think about

that out loud for a moment. What does it mean to have a length of

a hypothesis? That's really sort of the number of

bits you need to describe a particular hypothesis, right?

>> Okay.

>> Okay. And in fact, that's exactly what it means.

That's why we use log base 2. So, if we want

to minimize the length of a hypothesis, what does that mean,

the number of bits that we need to represent the hypothesis?

>> The number of bits that we need to represent the hypothesis is, I guess,

in some representation, or, so in this case

I guess it would be some optimal representation.

We are taking all the different hypotheses and writing them out. The ones that are more likely have a higher P of H, because that's the prior, and those are going to have smaller lengths under the optimal code. And the ones that are less common are going to have longer codes.

>> Well, let's make it more concrete.


Which of these two decision trees is smaller?

>> [LAUGH] The one on the right is smaller.


has fewer nodes, so smaller decision trees, trees with

fewer nodes, less depth, whatever you need to

make it smaller, have smaller lengths than bigger decision

trees.
So the fact that we prefer smaller trees to bigger trees, this is kind of a Bayesian argument for Occam's razor.

Now, what about this over here? What does it mean to

talk about the length of the data given a particular hypothesis.

>> Uh...I could think of one interpretation there. So

like, if the hypothesis generates the data really well, then


But let's

imagine that the hypothesis gets all of the data

labels wrong. Then when you send the hypothesis over to this person, this sort of person we're making up who is trying to understand the data and hypothesis, you would also need to communicate where the hypothesis gets things wrong. So, what this really is, is a notion of misclassification

error, or just error in general. If we're thinking about

regression. So, basically, what we're saying is, if we're trying

to find the maximum a posteriori Hypothesis. We want to maximize this

expression. We want to find the h that maximizes this expression.

That's the same as finding the h that

maximizes the log of that expression, which gives you

this. Which is the same as minimizing this expression,

which is just maximizing this expression but throwing a minus one in front. But these terms actually have meanings in information theory: the best hypothesis, the hypothesis with the maximum a posteriori probability, is the one that minimizes error and the size of your hypothesis.

You want the simplest hypothesis that minimizes your error. That is pretty much literally Occam's razor.
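(A compact restatement of the chain of reasoning above, using the optimal code lengths from information theory:)

h_{MAP} = \arg\max_h P(D \mid h)\,P(h)
        = \arg\min_h \left[ -\lg P(D \mid h) - \lg P(h) \right]
        = \arg\min_h \left[ \text{length}(D \mid h) + \text{length}(h) \right]

The last line is the minimum description length idea: error plus hypothesis size.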

What is important here, in reality, is that these are often traded off against one another.

If I give a more complicated or bigger

hypothesis, I can typically drive down my error.

Or I can have a little bit of error for a smaller hypothesis. But this is the

sort of fundamental tradeoff here. You want to find the simplest hypothesis that still explains your data, that is, minimizes your error.
But, you do have some

real issues here about for example units. So, I don't know if
the units of the size of the hypothesis are directly comparable to

the counts of errors or you know sum of squared errors

or something like that and so you have to come up with

some way of translating between them... And some way of making the

decision whether you would rather minimize this or you'd rather minimize that

if you were forced to make a decision.


So the best hypothesis is the one that minimizes error without paying too much of a price for the complexity of the hypothesis.


this notion of length feels... Like you could translate it directly into bits

right like you actually had to write it down and transmit it, it

makes a lot of sense. But then I was thinking about neural networks.

And given a fixed neural network architecture, it's always the same number of weights, and they're just numbers. So you just

transmit those numbers. So I thought, hmmm, this isn't really helping us understand anything, and then it occurred to me that those weights, if they get really big, you're going to need more bits to express those big weights. And in fact that's exactly when we get overfitting with neural nets, if we let the weights get too big. So this gives a really nice story for understanding

neural nets as well.

>> Right. That the complexity is not in the number of parameters

directly but in what you need to represent the value of the parameters.

>> Wow.

>> So I could have ten parameters that are all


binary, in which case I need ten bits. Or they

could be arbitrary real numbers, in which case I might

need, well, an arbitrary number of bits. That's really weird.

>> Yeah, but the point here, Michael, I want to wrap this

up. The point here is we've now used Bayesian learning to derive

a bunch of different things that we've actually been using

all along, and so again the beauty of Bayesian learning is

that it gives you a sort of handle on why

you might be making some of the decisions that you're making.

>> It seems like this raises the theory question

that you threw at me in a previous unit. Right.

Which is like well so if it doesn't really tell

us anything we didn't already know, how important is it?

>> Well in this case, I think it is important

because it told us something that we were thinking and tells

us in fact we were right. So now we can

comfortably go out in the world minimizing the sum of squared

error when we're in a world where there is some

kind of Gaussian transmission noise. We can go about trying to believe Occam's Razor because Bayes told us so. [LAUGH] Thanks

to Shannon. And so on and so forth. We can

do these things and know that in some sense, they're

the right things to do, at least in a Bayesian sense.

>> Neat.
>> Okay, good. Now one more thing, Michael, I'm going to show you.

Which is that everything I've told you so far is a lie. [SOUND]


And we're back. What's the answer, Michael?

>> Okay, so it depends.

>> What does it depend on?

I've given you everything. This is straightforward.

>> Well, so, okay, I guess. The here, so here's what I'm seeing.

So I'm, what I'm seeing is that hypothesis one is the most likely hypothesis.

>> It's not just the most likely, it's the maximum a posteriori.

>> Well, that's what I mean by likely. Right, it's the MAP hypothesis. It's the maximum a posteriori hypothesis. So if we say, what is the

label according to the map hypothesis? Boom, it's plus.

>> Yes.
>> But, if we're saying what's the

most likely label. So the most likely label

is, is, we have to actually look over all the hypotheses and in a sense,

let them vote. So the probability that the label is minus is actually 0.6, which

is greater than 0.4, so if I had to pick, I would go with minus.

>> And you would be correct. So I did a

little tricky thing here for you Michael. You've been complaining about

my titles, because everyone said Bayesian learning and

I changed the title here to Bayesian Classification.

>> Ohhh.

>> Because in fact the problem here, we've been talking about all along

is, what's the best hypothesis. But here. I ask you what's the best label?

>> Hm. And exactly as you point out, finding the

best hypothesis is a, is a very simple algorithm. Here I'll

write it for you because we did it before. For

every H in hypothesis set, simply compute the probability that it

is the best one, and then simply output max. That's how you find the best

hypothesis, but that's not how you find the best label. The way you find the

best label is you basically do a

weighted vote for every single hypothesis in the

hypothesis set, according to the weight being

the probability of that hypothesis given the data.

>> Okay.

>> So if you can only output one hypothesis and use that hypothesis, in fact, you would say plus. But if you asked everyone

to vote, just like we did with boosting, just like we did effectively with KNN and all these other kinds of weighted regression techniques we've used before, you need to do the voting.

>> And I, and I feel like I could probably derive

that using rules of probability. Right, because really what we want is

we're trying to maximize the probability of the label, given the

data, and I think the probability laws would tell us that's equal

to the sum over all hypotheses of the hypothesis and the

label given the data, which is, like, the probability of the

hypothesis given the data, times the probability of the label

given the hypothesis, and that's what we did, we summed up.

You know, the probability of the label given the hypothesis

is either one or zero. That's your left column. And then

we're summing up the probabilities that corresponding to the pluses. And

we're summing up the probabilities corresponding to the minuses and choosing

the largest one.

>> So, this is what you just said, written down as an equation. Basically, the most likely value is the one that maximizes this expression. And this follows directly from Bayes' rule, where now instead of trying to maximize the hypothesis given the data, you're trying to maximize the value given the data. And I think it's pretty straightforward to

derive that but I'd like to leave it up to


the students to do it on their own. Okay, so Michael,

in some sense everything that I've told you before is a

lie, in that I've led you down this path that somehow,

finding the best hypothesis is the right thing to do. But

the truth is, finding the value is what we actually care about.

Finding a hypothesis is just a means to an end. And

if we have a way of actually computing the probabilities for all the hypotheses, then we should let them vote in order to find the best actual label or the best value for it.

>> Got it.

>> All right. Good.
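As a concrete illustration of the voting idea, here is a minimal Python sketch. The three posteriors and per-hypothesis labels are hypothetical numbers chosen to match the situation in the quiz (the MAP hypothesis says plus, but the weighted vote says minus); they are not read off the slide.

# Bayes optimal classification by weighted voting (hypothetical numbers).
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}   # P(h | D) for each hypothesis
labels     = {'h1': '+', 'h2': '-', 'h3': '-'}   # label each hypothesis assigns to x

# MAP hypothesis: pick the single most probable hypothesis and use its label.
map_h = max(posteriors, key=posteriors.get)
print("MAP hypothesis:", map_h, "-> label", labels[map_h])    # h1 -> '+'

# Bayes optimal label: every hypothesis votes, weighted by its posterior.
votes = {}
for h, p in posteriors.items():
    votes[labels[h]] = votes.get(labels[h], 0.0) + p
print("Vote totals:", votes)                                   # {'+': 0.4, '-': 0.6}
print("Bayes optimal label:", max(votes, key=votes.get))       # '-'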

Okay Michael so this wraps up all this

Bayesian Learning stuff. What have we learned today?

>> We did Bayes rule.

>> We learned Bayes rule. We even learned how to derive Bayes rule.

>> And it was super useful because it lets you swap, kind of, causes and effects.

>> So I like the way you put that, Michael, that we're swapping causes and effects. Mathematically, when we think about Bayes rule, what that really lets us do is, instead of having to compute the probability of a hypothesis given the data, we instead get to compute the probability of the data given the hypothesis, which is typically a


much easier thing to do. And what makes it of course Bayes rule in general is

that you weight that by the prior probability

over the hypothesis. Which in fact is one

of the important things that we learned which

is that priors matter. So anything else we learned?

>> Yep, we did the MAP hypothesis, Maximum a posteriori. Right.

We learned about HMap, and we also learned about HML.

>> ML, right. The maximum likelihood hypothesis. >> Right. And what's the

maximum likelihood hypothesis? How's it relate

to the maximum a posteriori hypothesis?

>> It's the MAP that you get when the prior is uniform.
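(In symbols, the distinction being recalled here:)

h_{MAP} = \arg\max_h P(D \mid h)\,P(h), \qquad h_{ML} = \arg\max_h P(D \mid h)

With a uniform prior, the P(h) factor is the same for every hypothesis, so the two coincide.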

>> Right. Alright. And we, oh, we

connected up maximum a posteriori and least squares.

>> Yeah, that was pretty, I really liked that. So, we basically derived a bunch of things we'd been doing before and sort of showed that there's actually a good argument for them, at least if you're Bayesian. There are good arguments for doing sum of squares. There are good arguments for Occam's Razor. We're actually able to give real justification for doing them other than, well, sure, it seems to make sense.

>> Right so that includes the minimum description length story.

>> Mm-hm.
>> And then finally,

you told me that was all a lie, and you said that really what you want to do

is this other kind of way of picking that

actually factors in the probability of all the different hypotheses

and having them essentially vote. Right. What we really care about is classification. We're learning in the end, and so we also learned about Bayes classifiers. So in fact, what we described before, which is voting of hypotheses, turns out to be the Bayes optimal classifier. I

didn't say that, but it is very important to note.

In fact, what you should be noting there is

not only is it the Bayes optimal classifier, it's the

Bayes optimal classifier. And what that means is that

on average you cannot do any better than basically doing

a weighted vote of all the hypotheses according to

the probability of the hypothesis given the data. You cannot

do any better than this on average. So

again, what Bayesian learning gives us and what Bayesian

classification gives us is a way of talking

about optimality and gold standards. What'd you think, Michael?

>> That's really neat.

>> I like it. I mean, I have to tell

you, I really think that this stuff is kind of cool.

It's always nice to be able to take things that actually


work and explain them according to some framework, some underlying theory.

>> I wonder though, it seems like

all these Bayesian equations lead us to the question of how we actually infer

probabilities from various different quantities and observations.

So is there a way to do that?

>> So I think the answer is yes. And maybe you should

go figure it out and then tell me about it next time.

>> Okay. [LAUGH] All right, as you wish.

>> As you wish.

>> Stay tuned.

>> Anyway, this has been a lot of fun, Michael.

I will talk to you later.

>> Thanks.

>> Bye.
13. Bayesian Inference
https://towardsdatascience.com/a-gentle-introduction-to-maximum-likelihood-estimation-and-maximum-a-posteriori-estimation-d7c318f9d22d
https://www.probabilitycourse.com/chapter9/9_1_2_MAP_estimation.php
what I want you to do

is say, what fraction of the time,

is, is each of these different possible combinations of things happening? So, for example, what's

the probability that you look out and

there's a storm and there's lightning at the same time? So, what do you think?
>> Yeah, random day at 2 PM. And we

can be in Atlanta since that's what you're familiar with.

>> Is it summer? Because that happens more often in the summer.

>> Sure, let's say summer.

>> It's fairly high at 2 PM. Let's say it happens a quarter of the time.

>> Wow, that's a rainy summer.

>> Mm-hm.

>> Alright. Now, that's not the only possibility though. It


could also be that there's a storm but no lightning.

>> Right. That happens more often at 2 PM

in the summer in Atlanta. Let's say it's mm, .4.

>> Wow. Alright. Now what's the probability that you look

at the window and there's no storm but there is lightning.

>> Maybe 5%.

>> And what's the probability that you look out and

there's, you know, it's nice clear there's no storm no lightning.

>> Coincidentally I picked numbers that made it easier

for me to subtract from one. So, it's 0.3.


>> So the correct answer would be 0.25 divided by 0.65. Which is, some number. 5/13?

>> Yeah. It's 5/13. And, though, I'd rather that people fill it in as a fraction.

>> As a, wait. 5/13 is a fraction.

>> Good point. As a point something something. A decimal.

>> So, 5/13 is obviously 0.3846. And there you go. Is that right?

>> Yes. That was perfect. Yeah, so usually when there's a storm, it's not lightningy. It's less than half the time. That makes sense.

>> It does, because otherwise lightning would be happening all the time.

>> Well, when it's storming. It could be that it's very likely when it's storming.

>> It is likely when it's storming, but it wouldn't be happening every time it's storming, because otherwise it would be lightning all the time when it's storming.

>> Right.

>> And often there are breaks between lightning. In fact, most of the time there's not lightning, at least outside my window. At 2 PM. In the summer.
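A quick Python sketch of the calculation just worked through, using the four joint probabilities from the quiz (0.25, 0.40, 0.05, 0.30); the quantity computed is P(lightning | storm):

# Joint distribution over (storm, lightning) from the quiz.
joint = {
    (True,  True):  0.25,   # storm and lightning
    (True,  False): 0.40,   # storm, no lightning
    (False, True):  0.05,   # no storm, lightning
    (False, False): 0.30,   # no storm, no lightning
}

# P(lightning | storm) = P(storm and lightning) / P(storm)
p_storm = sum(p for (storm, _), p in joint.items() if storm)
print(round(joint[(True, True)] / p_storm, 4))   # 0.25 / 0.65 = 0.3846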


>> Alright, so that wasn't so bad. You are able

to compute some probabilities from this joint distribution. So let's

see what happens when we start talking about more variables.


More propositions that could be true or false. What I did

is I filled in thunder as another variable and thunder

can be true or false in each of these cases. And

I wrote down what the probabilities could be from my

experience in Atlanta in the summer. I was, I was around


over last summer, and in 2004, so let's, so I'm an expert obviously, so

I'm able to estimate these probabilities to

the nearest percent. Anyway the point is, that

one of the things you should notice here is that each time we add

one variable what happens to the number

of probabilities that we have to write down?

>> Well in a world where it's binary it goes up by two.

>> A factor of two, right?

>> A factor of two.

>> Not just, not just two more, but like, twice as many. And so

if we have a complicated scenario that we want to be able to reason about,

and it's got, I don't know, a hundred variables, that's going to be a lot.

>> That's, that's, I can't even, I can't even think about that.

>> Yeah, it's like two to the hundred is.

>> That's, that's not even a real number.

>> It's technically a real number, but

it's an, it's an unimaginably large number.

>> There's only like four numbers, one, two, three, many, and too many.

>> So it's going to be really inconvenient as we start adding more

of these and especially if we add

variables like, you know, remember the restaurant

example that we worked on when we were doing decision trees.

>> Oh yeah those were the days.

>> Then there was variables like food type,


and what was the deal with food type?

>> It had lots of values that it could take on.

>> Yeah, yeah like five or something like that.

>> Thai, American, and Italian.

>> Right, and so if we add a variable like that, it's going to multiply the number of probabilities that we need by five. So this is

going to get really big really fast. So would it be nice if

we had a more convenient way of writing out this distribution?

>> Yeah, it would be nice.

>> So it turns out that we can factor it.

>> But I thought we already had a factor of two?

>> Well that was a joke but it actually is pretty close to

being the truth, which is the idea that instead of representing all, so,

so, in this case, there's eight numbers. Instead of representing them as eight

numbers, we're going to represent it by you know, 2 times 2 time 2.

So we really are going to essentially factor it. putting,

putting things into pieces that we can recombine, smaller pieces

that we can recombine into, into larger pieces. And it,

yeah, it turns out that actually works out really well.


Alright, I'm going to hit you with a definition first.

>> Hit me.


>> So, conditional independence is this idea that goes

like this. We're going to say that some variable that

makes up the joint distribution is conditionally independent of

some other variable, Y, given Z, if it's the case that the probability distribution governing X, so the probabilities associated with the values of this variable X, is independent of the value of y given the value of z. So if I tell you what

z is, then you can figure out what the probability of x

is without having to look at y. So that is, if it's

the case that for all possible values, little x, little y and

little z for the variables big x, big y, and big z, If

it's the case that the probability that big X, the random variable

big X, equals, takes on the value of little x, given that

big Y takes on the value of little y and big Z

takes on the value of little z, equals the probability that big X

takes on the value of x given big Z takes

on the value of z. If those are equal for all

possible ways of filling in the values of the variables,

then we say that x is conditionally independent of y given

z. Right, so you see we dropped Y from the

right-hand side of the probability expression. Okay, so it's sort of

less things we have to worry about, if it's the

case that we really didn't need it in the first place.

>> Fewer.
>> Fair enough.

>> So that's pretty similar

to normal independence. Okay, so what's normal independence?

>> So normal independence, we say the probability of x and y

is equal to the probability of x times the probability of y.

>> That's right.

>> Which means if we think about the chain rule, we

also know that the probability of x and y is equal

to the probability of x given y times the probability of

y. So that means that the probability of x given y is

equal to the probability of x, for all values of x and y.

>> So this is actually implying. So [INAUDIBLE] if it

equals that. Oh, that means that P(x) times P(y) equals

P(x given y) times P(y). If we cancel those, we get P(x) equals P(x given y). Okay. That's what you wanted to say.

>> Right. So what independence means, right, is that the joint distribution between two variables is equal to the product of their marginals. That just, you know, comes from basic probability theory, and so if you think about what that means from the chain rule point of view, it's like saying the probability of x given y is equal to the probability of x. So,

it looks just like the equation you wrote down for conditional independence.

>> Right, the only thing that we added is this notion that it might
be the case that we don't have such a strong property as this where

it's always the case that you can write the probability of x given y

just with the probability of x. But in the context of some, of knowing

some value z, it might be true. And that's what conditional

independence gives us. As long as there is some z that we

stick in here, that gives us that property, that's great, we can

essentially ignore y, when we are talking about the probability of x.

>> Okay, that's pretty cool. That means more powerful or something.

>> Yeah, and in fact, if you remember, you mentioned the word factoring. You can see here that we are writing down a probability as the product of two other things. We are factoring that probability distribution. That's what independence lets us do. And conditional independence lets us do that in more general circumstances. So

let's apply this content back to what we were talking about before.

>> Okay.
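(The two definitions just discussed, written compactly:)

Independence: P(X, Y) = P(X)\,P(Y), equivalently P(X \mid Y) = P(X).

Conditional independence: X is conditionally independent of Y given Z if, for all values x, y, z,

P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z).

Both are statements about how a joint distribution factors into smaller pieces.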
So the concept of a belief network, sometimes also known

as Bayes Net. Sometimes also known as Bayesian Network. Sometimes

also known as a graphical model. And there's other names,

but it's the same idea over and over again. And

the, and the idea is that what we're going to do

is we're going to represent the

conditional independence relationships between all

the variables in the joint distribution graphically, in terms of a little picture like this, where there are nodes corresponding

to all the variables. And, edges corresponding to dependencies that

need to be explicitly represented. So, the way that this

works is, what we can do is we can fill

in the prior probability of storm, which we can get by


just marginalizing out. So we've, we've already done an exercise

like this. So this is a number you should be able

to figure out. Then, similarly, it's also true that you can figure out what the probability

of lightning is, given storm and also given not storm.

And these are numbers that you can just get by marginalizing

out. Finally, the probability of thunder, normally you'd have to

condition that on both storm and lightning. But as we already

talked about, it's actually conditionally independent of storm given lightning.

So, all we need to figure out is the probability of

thunder given lightning, and the probability of thunder given not

lightning. And once we have these, in this case five numbers,

that's enough to work out any probability we want

in the joint, just by multiplying corresponding components together.

So, what I'd like you to do is actually fill in these boxes as a quiz. And to

help you out we copied the numbers over from

the previous slides so that you actually have the

[LAUGH] values that you need to fill in this

table. because otherwise that would have been kind of mean.


>> Okay, that makes sense. Okay, good. So, now we

do the same trick again with thunder. Except now, instead of looking at L and S, we look at thunder and lightning, so we need to look at a case where thunder is true and lightning is true. That's all the cases where lightning is true, so it would be .2 divided by .25.

>> Alright and why are we looking at the case where storm is true?

>> Why are we doing it? Because it's conditionally independent of storm.

>> It doesn't matter.

>> [CROSSTALK] Information, so it doesn't matter which rows we look at.

What matters is we look at a case where thunder and lightning are both true, and we compare that to thunder is false and lightning is true. So that's this number. Those add up to the 0.25; we get 0.2 over the 0.25, which is 0.8. Right.

>> So it's very likely to hear thunder if you see lightning.

>> That makes sense. And there's only a 20%

chance that you don't hear thunder when you hear lightning.

>> It's lightning not thunder, yup. Mmhmm.

>> And so we do the same thing in the case

where we have thunder and there's not lightning. So we find that row.

>> Okay. Not lightning and there is thunder. There's one.

>> Right and we do the same trick we did before and we get,

.04 over .4. Which I think we did last time, actually, and we get .1.

>> We did. So, if there's not lightning out, it's very unlikely to hear thunder. Alright.

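A sketch in Python of how the five numbers in the belief network come out of the joint by marginalizing. The storm=True rows use the values spoken in the walkthrough (0.2, 0.05, 0.04, 0.36); the storm=False rows are reconstructed here under the conditional-independence assumption just discussed, so treat them as illustrative rather than the slide's exact numbers.

# Hypothetical joint over (storm, lightning, thunder); the storm=False thunder
# split is reconstructed, not taken from the slide.
joint = {
    (True,  True,  True):  0.20, (True,  True,  False): 0.05,
    (True,  False, True):  0.04, (True,  False, False): 0.36,
    (False, True,  True):  0.04, (False, True,  False): 0.01,
    (False, False, True):  0.03, (False, False, False): 0.27,
}

def prob(pred):
    """Sum the joint over all assignments (s, l, t) satisfying pred."""
    return sum(p for (s, l, t), p in joint.items() if pred(s, l, t))

p_storm = prob(lambda s, l, t: s)
p_light_given_storm    = prob(lambda s, l, t: s and l) / p_storm
p_light_given_no_storm = prob(lambda s, l, t: not s and l) / (1 - p_storm)
p_thunder_given_light    = prob(lambda s, l, t: l and t) / prob(lambda s, l, t: l)
p_thunder_given_no_light = prob(lambda s, l, t: not l and t) / prob(lambda s, l, t: not l)

print(p_storm)                                          # 0.65
print(p_light_given_storm, p_light_given_no_storm)      # ~0.385, ~0.143
print(p_thunder_given_light, p_thunder_given_no_light)  # 0.8, 0.1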
>> Alright and just to drive this point home.

That was great. Just to drive this point home.

What if it was the case that it mattered what value storm had? How would we fill in this table?

>> Well we'd have to look at a lot more rows.

>> Well in particular we couldn't draw this kind

of belief network if that were the case, right?

>> Right.

>> Because it wouldn't be conditionally independent. So we'd have to draw

basically another edge. Here, and what that represents is that thunder, to work

out to what the probability of thunder is, you have to look at

storm and lightning, all the joint combinations of those to make it work.
>> And that grows exponentially as you add more and more incoming variables.

And that's right, and that's something that threw me when I started to look

at this, because the picture looks a lot like a neural net. Right? In

a neural net, you've got these nodes, you've got arrows going into the nodes,

and when you have a bunch of arrows going into the same node,

you just end up like adding all

those different influences together, weighted by what's,

what it has on the weight. This

belief network representation is an entirely different

animal. In particular, now, what we're really saying is, to work out the value

of this node, you need to know what's going on

in all combinations of what the inputs are. And so,

as you pointed out, so astutely, that grows exponentially as

you have more variables coming into the node. Higher in degree.

>> Hm. So this is not just a network. It's

a graph. And so we can talk about parents and children

right? So, basically, the number of numbers you have to

keep track of is exponential in your number of parents.

>> I mean it's a, yes.

Though it's not exactly a tree. Doesn't have to be a tree so the parents

relationships are kind of weird. Like in particular,

if you use parent terminology in this graph,

what you're saying is that lightning has

one parent which is storm and thunder has


two parents, which are storm and lightning. So storm is both a parent and a grandparent of thunder.

>> So let me ask you a quick question, Michael. So earlier on when

you were describing this, this graph, I

noticed you used the word dependencies. You said

we're going to capture the dependencies.

>> Hm.

>> So if you erase the red line between storm and thunder,

>> I'd be happy to.

>> So you erased that, should I read

this as storms cause lightning, and lightning causes thunder.

>> You can do that, but you would be wrong.

>> Oh okay.

>> You cannot infer that there is a causal relationship just because there is an arrow between them. These arrows are just telling us about the relationship between the probabilities and not anything about the physical processes that underlie them.

>> Okay so let me make sure I understand, what you are saying is, it

would be very natural to look at a belief network or a Bayesian net or a

Bayes Nets or graphical model. And read

the arrows as causes, and therefore read them

as talking about dependencies. But actually what's happening

here is that these things represent conditional independencies.

So, it is not true that lightning is dependent on storm and thunder is dependent on lightning, so much as it is the case that storm and thunder are conditionally independent given lightning.

>> That's, that is a good point. I guess I never really realized that. Dependence, you use the word dependence. Sometimes it means a physical dependence, like in the real world it's dependent. Here I'm just talking about statistical dependence. It's really just talking about the fact that we can derive numbers from other numbers, and not that, you know, things cause other things. So yeah, that's a really good point. It seems like that was an easy place to get tripped up.

>> Okay. Cool.


>> All right Charles, so, so, what do you think the answer is here?

>> Actually I don't know what you're looking for here.

>> Oh, okay. Well, so one thing that's true. We


had to sample the, the variables from A to E.

>> Mm-hm.

>> And that's alphabetical order. So do

you think that's what I was looking for?

>> Maybe in this case but I would think that that wouldn't be generally true.

>> True. Right. So, yeah, alphabetical is not what I

was looking for. So, there's a graph-theoretic property

that says we want to basically put the nodes

in order, so that you always put the things

that have incoming links that haven't been visited yet

after the ones where you, they have been visited.

>> Oh, so it is a lot like alphabetical

or a lot like lexo-, lexicographic, but it's topological.

>> There we go. Yeah, that's what I was looking for. So, topological sort.

>> Which makes perfect sense.

>> Right, and so this is a standard thing that you can do with a graph, and it's

very quick to, to actually compute one of

these. It does depend on a particular property, though.

>> Let's see. Topological only makes sense if you really can

go from no parents to parents. So, it cannot be cyclical. You

can't have arrows that take you back. So, E can't be a

parent of A and also have A be one of its parents.

>> That's right.

>> So it must be acyclic.


>> Must be acyclic, right. And that's going to

be true in these cases, because we're always going

to set it up so that, in a Bayes net, each variable depends on other variables, but it all ultimately has to bottom out. There can't be cyclic dependencies. So, it is a directed acyclic graph.
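A minimal sketch of the ordering idea, assuming the network is given as a dictionary from each node to its parents. The node names A through E match the quiz, but the particular edges here are made up for illustration; the point is just that every node comes after its parents, which is the order you would sample in.

# Topological ordering of a DAG given as {node: [parents]}.
# Hypothetical edges: A -> B, A -> C, B -> D, C -> D, D -> E.
parents = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C'], 'E': ['D']}

def topological_order(parents):
    """Return the nodes so that every node appears after all of its parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        progressed = False
        for node, pars in parents.items():
            if node not in placed and all(p in placed for p in pars):
                order.append(node)
                placed.add(node)
                progressed = True
        if not progressed:
            raise ValueError("the graph has a cycle, so no topological order exists")
    return order

print(topological_order(parents))   # one valid order: ['A', 'B', 'C', 'D', 'E']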

>> So, what would it mean if there were cycles?

>> I don't know. I don't know what to do with such a graph.

>> It just doesn't mean anything at all, I guess.

>> Yeah, I mean, there, there is a family of undirected models.

>> Mm-hm.

>> But we're talking only about the directed ones here. So, the directed

ones yeah, it'd have to be acyclic for the, for the probability distribution

to be meaningful.

>> Well, that makes sense.

>> I'm sure we could make something up, but this is, typically

this is how it's done. It's, it's, we constrain ourselves to acyclic graphs.

>> Well, if a Bayesian network is

supposed to capture conditional independencies, then if you

add cycles, that's like saying there are none,

right? I'm not even sure what that means.

>> I could make it mean something. So here,

we, we want the probability of A, conditioned on probability

of A. Well, maybe that's like probability of what, what A was one time step

ago. Or it could mean that it, you


know, that we're actually putting constraints on

the joint assignment to all the variables.

But, yeah, it's not really, it doesn't really,

it makes things more complicated and that's not

the model that, that is the typical one.

>> Okay, fair enough.


But in the real world,

there are perhaps hundreds and hundreds of variables

with complicated relationships and conditional independencies that aren't necessarily intuitive just by looking at the graph. And

so picking one conditional probability table and looking at

it isn't going to tell you much. But by

sampling I get real examples that are concrete that,

as a human being, I can understand without having

to, you know, really grok all the 25 different

conditional probability tables. Does that sound right? Is that. [CROSSTALK]

>> Yeah, yeah.

>> What you're trying to say?

>> That's exactly right. Thanks.


>> Okay.

>> I want to draw your attention to this, this

word here for a moment. This notion of approximate inference. Now

generally we don't like approximations when we can do things, things

exactly. So why are, why are we not doing things exactly?

>> Because it's hard.

>> It's hard, that's exactly right. Or, even if it weren't hard, it may be in some cases faster. So I'm not going to do it now, but I'd be happy, if there's a groundswell of support among the students, to go through the argument as to why this inference is hard. There's a nice little reduction to NP-complete problems like satisfiability. But it turns out roughly

that if you could do inference exactly on any belief net

that you want, then you could solve very, very hard problems efficiently

using that idea. So it's, it's cute, but it's kind of takes us

a little bit off our path, so I'm not going to get into that.

>> Okay, so sampling is useful, Michael, which I always suspected in my

heart, and now we've got some good arguments for why it actually is.
Did you get it?

>> Yeah I did actually. so, so this one

I think I understand completely. So we know that from

the last discussion we had about how you would recover

the joint, that what you're saying on the right of

this equation probability y times probability of x given y means that

the probability of y, the variable y doesn't depend on

anything. So, between those two graphs the one on the

right is the one where you're saying that. You don't

need to know the value of any other

variable in order to determine the probability of y.

>> Good.

>> So it has to be the one on the sec, the second and just to make sure

if you look at the second product the probability

of x given y, the second multiplicand? Is it multiplicand?

>> Hm, factor.

>> Factor? Let's say factor. The second factor,

this says that while you determine the probability

of x given the value of y and there is an arrow from y to x

so, the second one is in fact correct.

>> Yeah. So this is actually just one way you could just read this

network is to say what is this node x with an arrow coming into it?

That is the probability of x. But, the, the things pointing into it are what's

exactly being given. What it's being conditioned


on. So that's exactly right, the second one.

>> Right. So this, this, so this makes sense to me. This is why when

you look at a network, a Bayesian network, it's very

hard not to think of them as dependencies.

Even though they're not dependencies, they're conditional independencies.

>> Well the arrows are a form of dependence but it's not a causal

dependence necessarily, it's it's again it's just

the way the probabilities are being decomposed.

>> Hm.

>> And the last of these three equations is just Bayes rule, this time written correctly, where the denominator has to be the probability of x, and we've gone over this a couple of times. I don't need to describe it again, but what I would like to bring to your attention is that these three together turn out to be kind of our, you know, three musketeers in working out the probability of various kinds of events.

>> Excellent.
All right. So let's put some of these rules into play

by actually doing some inference by hand. Ultimately, we're going to derive

some algorithms that can do this so you don't have to

think about it so hard. But understanding those algorithms, it's helpful to

have gone through an exercise where you actually use these ideas.



have to choose the row or choose how to

distribute the likelihood over the row. So all I

really need to know is, what's the probability of

me being in box one and being in box two.


>> So then that means that the first quantity there

is actually a product of each of those conditional probabilities.

>> Yeah, so this is a really convenient structure.

Because it really just decomposes into all these separate

helpful quantities. So in particular, we can actually derive

this by applying the chain rule. But what we end

up with is that this joint probability over these

three variables decomposes into a product of three conditional probabilities. The probability that it contains viagra given that it's spam, which we have; that number is 0.3. The probability that prince doesn't appear in it given that it's spam, that is, that it doesn't contain prince given that it is spam; that should be 0.8, because it's 1 minus the 0.2. And the probability that it's not udacity given that it's spam is going to be 1 minus this 0.0001, which should be 0.9999. All right.

So this is the case when things, when it is spam, and if it's not spam, we

can do this same thing and get a product,

and that we can normalize, to get what the,

the relative probabilities between it being spam and not spam. So then I'm a

big fan of normalization, but of course this makes me think about, since it's

sort of a classification problem, we only

really care about knowing which one's more


likely. We don't really care about the

probability, right? Do we have to normalize?

>> Yeah, yeah because we do care about the probability.

>> Oh we do?

>> Yeah, because I asked, what is the probability of spam given these other quantities?

>> Oh, I see.

>> But you're right. So the observation

that you're making is a really good one. Which is that we

can do probability calculations in this

setting, and that's actually going to give

us answers to classification problems. And we're going to connect this back to

machine learning. But but first let's write a general form of this formula.

>> Okay.
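Before moving to the general form, here is the specific calculation as a short Python sketch. The spam-side conditionals (0.3, 0.2, 0.0001) are the ones spoken above; the prior P(spam) and the not-spam conditionals are hypothetical placeholders, since those numbers are not in these notes.

# Naive Bayes posterior for spam given (viagra=True, prince=False, udacity=False).
prior = {'spam': 0.4, 'not_spam': 0.6}                      # hypothetical prior
cond = {                                                     # P(word appears | class)
    'spam':     {'viagra': 0.3,   'prince': 0.2, 'udacity': 0.0001},
    'not_spam': {'viagra': 0.001, 'prince': 0.1, 'udacity': 0.1},   # hypothetical
}
evidence = {'viagra': True, 'prince': False, 'udacity': False}

def score(label):
    """P(label) times the product of P(attribute value | label)."""
    s = prior[label]
    for word, observed in evidence.items():
        p_appears = cond[label][word]
        s *= p_appears if observed else (1 - p_appears)
    return s

scores = {label: score(label) for label in prior}
total = sum(scores.values())                                 # normalize
print({label: round(s / total, 4) for label, s in scores.items()})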

>> Because this this seems a little bit specific. Alright so

the general form for this, is that if we're trying to figure

out the probability of, of some kind of a a root node

like this, when you have all these little bristly things coming down.

You can think of it as a probability of a

value given a bunch of attributes. And that's going to be equal

to the product of the probability that each of those attributes would be generated by that underlying v, this label or underlying class, times the prior probability of v, and then we just normalize by summing this quantity over all the possible values of v. So this is one

way of actually getting a very general kind of inference done,

and there's, as you were pointing out, Charles, there's a. There's

a really nice reason to think about things in this form,

because it does let you do a kind of classification. So

essentially if you think of, of this top node as being

the class, this is what was playing the role of V

here, and these are all a bunch of attributes, then even

if, if we have a way of generating attribute values from classes.

What this lets us do is to go the other way.

That we observe the attribute values and we can infer the class.

>> Nice, so what's the equation for that?

>> Right, so the maximum a posteriori class, if you're just trying to find what's the most likely class given the data that you've seen. You can just take

an arg max over all the different possible values of that, that root node of

the prob, its probability times the product

of all the attribute values given that class.

So this would actually let us, if you've been paying attention,

we could, in this particular case, compute map spam. Which is a palindrome.

>> Wow. That is spectacular.

>> You did not see that coming did you?

>> No I did not.

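(The general form being described, written out; V is the class node and a_1 through a_n are the attributes:)

P(V \mid a_1, \ldots, a_n) = \frac{P(V) \prod_i P(a_i \mid V)}{\sum_v P(v) \prod_i P(a_i \mid v)}, \qquad
v_{MAP} = \arg\max_v P(v) \prod_i P(a_i \mid v)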

So this idea of Naive Bayes, where you have
a network that has a label producing or, or

conditionally producing a bunch of attribute values, is just

a really cool and powerful idea. So one of the,

one of the issues is that, even though inference in general is a very difficult problem (it's NP-hard), working out what these probabilities are when you have a Naive Bayes structure is cheap. It's the formula that we had on the previous slide. The

number of parameters that you need to write down, again even if

you have a very large number of variables, it's not exponential

in the number of variables, it's just linear. There's, two probabilities for

each of the attributes and one probability for the class. We

can actually estimate these probabilities. So so far, we've only been talking

about Bayes Nets in, in not in a learning setting, but in

a setting where we just write down what all the numbers are.

We can actually very easily estimate these parameters. How would we

do that? Well, the easy way to do it is you count. When you're trying to estimate the probability of a particular

attribute value given a class, it's really just in your, in your

labeled data. How often do you have an example that has an

attribute value in that class, and then divide by the number of

times you had that class at all, and that gives you the

conditional probability. So this is, you know in, in the case of

infinite data this is actually going to give you exactly the right
number. It also connects this notion of inference that we've been

talking about with classification. Which is mostly what this, this mini

course has been about. So, that's really great to have a connection,

it actually allows us to do all kinds of interesting things

like instead of only generating what the labels are, we can actually

generate what attributes are. We can do inference on, in, in

any of these directions. And it turns out it's wildly successful empirically.

So, my understanding is that Google uses a tremendous amount of Naive Bayes

classification in what they do. If you have enough data you can estimate

these values really well, and Naive Bayes is just remarkably good. So yeah

so it's like unclear why we'd even have any other algorithms, right Charles?
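A sketch of the counting estimate described above; the tiny labeled dataset is made up purely for illustration.

# Estimating P(attribute value | class) by counting, on a made-up dataset.
data = [
    # (contains_viagra, label)
    (True,  'spam'), (True, 'spam'), (False, 'spam'),
    (False, 'ham'),  (False, 'ham'), (True,  'ham'),
]

def estimate(attr_value, label):
    """Count(attribute == value and class == label) / Count(class == label)."""
    matching = sum(1 for a, y in data if y == label and a == attr_value)
    total    = sum(1 for _, y in data if y == label)
    return matching / total

print(estimate(True, 'spam'))   # 2/3
print(estimate(True, 'ham'))    # 1/3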

>> Well, there's no free lunch. But I, I gotta say I, I you know

there's this as a famous man once said it works in practice but doesn't work

in theory. And I'm trying to figure out how this can possibly work.

So I noticed it's called Naive Bayes. And, I think I know why now.

>> Alright.

>> One is that it's well it's naive and

in fact painfully ridiculous to believe that the bayesian

net that you wrote up there in the upper

right-hand corner represents the real world most of the time.

>> Hm, I see, and why is that?

>> Well because

what the, what the network says is that all

of the attributes are conditionally independent given that you know the label, and that just can't be true. We talked

about this before where we were using Bayesian inference to,

to derive the sum of squared errors that it

makes a very strong assumption about where your errors come

from and an even stronger assumption about where your errors

don't come from. So you're not modeling any of the

interrelationships, between, the different attributes and

that just doesn't seem right. So, one

question I have. I have two, we'll save the second one though. One question

I have is, how in the world can it possibly be the case

that this works in practice?

>> Hm, that's a good question. It does. Moving on.

>> [LAUGH] No, that's not satisfying.

>> No?

>> How about, how about I give it a guess? Okay?

>> Alright.

>> Now,

now that I yelled at you, why don't I, why don't I give it a guess.

>> [LAUGH]

>> I think it comes back to one of

the conversation we had in the previous slide. When

I was saying well we don't have to care.

We don't care about probabilities. And you said we

do care about probabilities because of the question you're asking, and that was fair. But once we're down to classification, the probabilities really don't matter. Right, all that matters is that you get the right answers. So it's okay, I guess, if the probabilities you get are wrong, so long as they're sort of in the right direction, right? That you end

up getting the, the right label as a result.

>> Yeah, that's a good point. That in fact

we're introducing this idea in the context of, of

Bayesian Inference it might actually not be so good

at that even if it is particularly good at classification.

>> Oh, oh actually I think I have a good example so,

so here, here, write this down. So let's imagine there are four attributes. Actually, you can use the network that you have up there, okay?

>> Good.

>> So let's say that the first attribute, I'm just going to call it A

and the second attribute I'm going to call B, and let's say we're really, we're

really lucky and our naïve assumption is

right and they really are conditionally independent. But

let's say the third attribute, is actually

just another way of writing down A, and

the fourth attribute is just another way of writing down

B. So, clearly there are interrelationships between the attributes, right?

>> The third attribute is the first one, the fourth attribute is the second one. There's no way around that. And so you'd think
Naive Bayes would fail. But, actually, looking at your equation right below

there where you're doing counting, I actually think, it'll work just fine.

>> Why?

>> Because all you're really doing

is double counting the sort of weight of

attribute A, but you're also double counting the

weight of attribute B and they'll cancel each

other out. And you'll get the right answer.

>> When you do the arg max, but these

>> When you do the arg max

>> You get bad probabilities. The probabilities end up being kind of squared relative to what they're supposed to be.

But that's okay because the ordering is preserved.

>> Right, exactly. And so, even if you're unlucky and

the fourth attribute wasn't B but it was something else, C.

It doesn't matter if you double count A as

long as it still gives you the right label.

And you can imagine that if you have weak

interrelationships or, you know, you have enough attributes and,

and so on that you would still get the

right, you know, yes this is the correct label, even

if you've got the probabilities wildly wrong. Okay, so

I'm willing to believe that that could happen in practice.


>> Okay.

>> So in fact, my guess is that Naive Bayes believes

its answer too much. But it doesn't matter if it happens to be right.

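A tiny numeric check of the point about duplicated attributes, with made-up numbers. Duplicating the attributes squares each class's unnormalized score, which makes the posterior more extreme (overconfident) but leaves the argmax, and therefore the predicted label, unchanged:

from math import prod

# Hypothetical per-class likelihoods for two attributes, A and B.
p = {
    'spam': {'A': 0.9, 'B': 0.7},
    'ham':  {'A': 0.2, 'B': 0.6},
}
prior = {'spam': 0.5, 'ham': 0.5}

def posterior(attrs):
    """Naive Bayes posterior over the classes, multiplying in the listed attributes."""
    scores = {c: prior[c] * prod(p[c][a] for a in attrs) for c in prior}
    z = sum(scores.values())
    return {c: round(s / z, 3) for c, s in scores.items()}

print(posterior(['A', 'B']))               # spam ~ 0.84
print(posterior(['A', 'B', 'A', 'B']))     # duplicated: spam ~ 0.96, same winner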
>> All right and did you have other issues with it?

>> So the second problem I have actually boils down to that

equation you wrote there. So it's really nice and neat that you

can compute the probabilities of seeing an attribute, given a value by

just doing counting. But, I don't have an infinite amount of data, right?

>> Not on a bad day, no.

>> No. Or even on a good day I usually

don't have an infinite amount of data. So what if

I'm unlucky enough that for some particular attribute value,

I have never seen it paired with that label, V.

>> Right. So then, that means this numerator will be zero

>> Right.

>> So.

>> Well that numerator is zero, but since

the computation involves a product by just having

one attribute value that I've never seen before.

I'm going to end up saying well the probability

of that entire product of seeing that value given

a set of attributes is also going to be zero. So

one unseen attribute, basically says it doesn't matter what else

is going on. Which seems a little weird, right? You,


you, you'd think that you, if all the other

attributes are screaming yes, yes, yes, yes, it should be

positive. But just because you haven't happened to have seen

any examples of some other one single attribute, that shouldn't

be enough to veto it.

>> Good point, so in fact that's not what

people often do. People will often, what they call smooth

the probabilities, by essentially initializing the count, so that

nothing is zero, everything has a tiny little non-zero value

in it. And there's, there's smarter and less smart

ways of doing that, but no, you're absolutely right. That,

that is, that zeroing out problem is a real

thing and you have to be a little bit careful.
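A sketch of the smoothing idea just mentioned: start every count at a small non-zero value (here 1, the usual Laplace / add-one choice) so an attribute value never seen with a class still gets a small probability instead of zeroing out the whole product. The tiny dataset and the two-valued attribute are made up for illustration.

# Laplace (add-one) smoothing for P(attribute value | class).
data = [(True, 'spam'), (True, 'spam'), (False, 'ham'), (False, 'ham')]

def smoothed_estimate(attr_value, label, k=1, num_values=2):
    """(count + k) / (class total + k * number of possible attribute values)."""
    matching = sum(1 for a, y in data if y == label and a == attr_value)
    total    = sum(1 for _, y in data if y == label)
    return (matching + k) / (total + k * num_values)

print(smoothed_estimate(False, 'spam'))   # 1/4 instead of 0
print(smoothed_estimate(True,  'spam'))   # 3/4 instead of 1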

>> Hey, hey I just had a thought. So,

if you, you have to do that, because if you don't do

that, then you're believing your data too much. You're kind of overfitting.

>> Ooh. Overfitting comes up again.

>> Oh, oh, it's okay, okay so, so, so, so, so bear with

me on this Michael. So if you're overfitting by believing the data, and you're fixing it by smoothing, I usually spell it with a V,

but whatever. You'd think that by smoothing, you're making an assumption. There's a kind of inductive bias, right? You're saying that I go in with the assumption that all things are at least mildly possible.

>> Good.

>> Huh.

>> Yea, that's, that's right.

>> Okay, Naive Bayes is cool, you've convinced me.

>> Nice.
So I was thinking of talking to you more

about sampling, but it seems like it might work out

best to just have some hands on experience with it

so we're going to put those things on the homework. So

given that we're actually in a position now to, to

kind of wrap up the whole Bayes net inference piece

that we were talking about. So do you want to help

remind me, Charles, what were the things that we covered?

>> Sure, I can help you with that. We covered Bayesian Inference [LAUGH]

I'm sorry.

I'm punch drunk.

>> I'm going to choose not to pay attention to that. Instead, I'll write Bayesian Networks. We talked about the Bayesian Network representation of joint probability distributions.

>> Right. We did a lot of examples of how to actually do inference with networks.

You know, exactly how do we, do we

compute probabilities of particular values. We mentioned sampling.

>> That's right.

>> And then we did a Naive Bayes.

>> Well first we did say that, that in general it's hard

to do exact inference. It's actually hard to do even approximate inference.

>> Mm-hm.

>> But we talked about a special

case of bayesian networks, that was called naive

bayes with the naive part being, that we're

assuming that attributes are independent of one another.

>> Conditioned on the label.

>> Right. And this was actually helping us make a

link between all this bayesian stuff. The bayesian rabbit hole we

went down. And classification, which is the core machine learning

topic that we've been spending a lot of time on.

>> So the other thing that I really

liked about this notion, this link to classification, Michael,

is that when I was talking about Bayesian

learning, what we ended up with at the end

is this nice idea that we had a gold standard, right? We had a sort of way
of talking about what the right hypothesis was

and, ultimately, what the right classification was by computing

these probabilities. And sometimes, we couldn't do it because, typically,

you can't actually do the for loop that requires you to compute conditional probabilities of hypotheses given data over, say, an infinite number of hypotheses, but at least we kind of knew what the

right thing was, and if we made the right assumptions we could do

things like derive, oh I don't know, a sum of squared

errors or various other things that you might do and that

was all very cool. But what you've done here when you

do inference is, at least in the Naive Bayes case,

you've shown us a way that we can do classification

using these things, that actually is tractable, and is the

right thing to do under certain assumptions. I really like

that. And the other thing that I think is worth

mentioning is that not only does it link this Bayesian

learning to classification. But it connects classification back to this

general notion of Bayesian learning, Bayesian inference where, you don't

have to worry about just figuring out the most likely label

given a bunch of attributes. But because it's a Bayes network and

you can compute anything from it, you could try to ask

well what's the likelihood that I see some particular attribute or set

of attributes, given a label or given a subset of attributes,

and all those kinds of things that you could do with
Bayesian learning. So inference gives us this power to not just

do classification, but to do a larger set of things beyond classification.

I think that's kind of cool.
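
A hedged, minimal scikit-learn sketch of the point above: Naive Bayes gives a tractable classifier, and because it is a Bayes net you can also read off full posteriors rather than just the most likely label. The iris data and the split parameters here are illustrative, not from the lecture (in older scikit-learn versions the train/test split lives in sklearn.cross_validation instead of sklearn.model_selection).

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# "Naive" part: each attribute is treated as independent given the label.
clf = GaussianNB()
clf.fit(features_train, labels_train)

print(clf.score(features_test, labels_test))    # classification accuracy on held-out data
print(clf.predict_proba(features_test[:1]))     # posterior over all labels, not just the argmax
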

>> Cool. Yeah, well said. And another thing,

kind of in that same space is that it handles missing attributes

really well. So whereas things like, oh. You know, decision trees

and so forth, if you give me an example that doesn't have

one of the attribute values and you've hit that part of

the decision tree where you need to know that attribute value you're

stuck. Whereas in this Naive Bayes setting, you can still do the

probabilistic inference over the missing attributes

because all the things are linked

by probabilities.

>> Nice.

>> All right. So I think, you know, you'll, you'll get

a much stronger handle on this when you go through the,

the homework problems. But I think that's enough for Bayesian inference.

And I think that actually wraps up classification and regression more generally.

>> Right. So we're done with supervised learning. Well, one's never done with

supervised learning. But we're at least done with this part of the course.

>> Because there's always more to supervise learn.

>> That's right. And in particular you'll get

a nice example of this, because you'll be taking an exam.

>> [LAUGH]
>> And your input will be the exam, and then we'll give you a label back.

>> [LAUGH] I guess that's one way to think about it.

>> Well and then they'll get to generalize beyond

that for the next time they take the exam.

>> Very good! All right. Well, well thanks

very much, this has been fun. Thanks Charles.

>> This has been fun. I will see you in the second mini course.

>> All right.

>> Bye.
14. Ensemble B&B
hypothesis. That's the specific hypothesis that our learner

has output. That's what we think is the

true concept, and C is whatever the true underlying

concept is. So I'm going to define error as

the probability, given the underlying distribution, that I

will disagree with the true concept on some

particular instance X. Does that make sense for you?

>> Yeah

but I'm not seeing why that's different from number of mismatches in the

sense that if we count mismatches on a sample drawn from D, which is

how we would get our testing set anyway, then I would think that would
be, you know, if it's large enough, a pretty good approximation of this value.

>> So here Michael, let me give you a specific example.

I'm going to draw four, four possible values of X. And when

I say I'm going to draw four possible values of X, I

mean I'm just going to put four dots on the screen.

>> Hm.

>> Okay? And

then I'm going to tell you this particular learner output a hypothesis.

Output you know, a, a potential function that ends up getting

the first one and the third one right, but gets the

second and the fourth one wrong. So what's the error here?

>> Mm.

>> So let's just make sure that, that

everybody's with us. Let's do this as a quiz.

>> Okay, so let's ask the students what they

think. So here's the question again. You've output some hypothesis

over the four possible values of x, and it turns

out that you get the first and the third one right,

and you get the second and the fourth one wrong.

If I look at it like this, what's the error rate?
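
One way to see the distinction Charles is drawing, as a minimal sketch (the distributions, labels, and names below are made up for illustration and are not the quiz answer): the error is the probability mass, under D, of the places where the hypothesis disagrees with the true concept, so the same two mismatches can give different error values under different distributions.

def weighted_error(D, h, c, xs):
    """Sum of D(x) over the examples x where hypothesis h disagrees with concept c."""
    return sum(d for d, x in zip(D, xs) if h(x) != c(x))

xs = [1, 2, 3, 4]
h = lambda x: x in (1, 3)      # hypothesis gets the 1st and 3rd point right
c = lambda x: True             # (made-up) true concept labels every point positive

print(weighted_error([0.25, 0.25, 0.25, 0.25], h, c, xs))  # 0.5 under a uniform distribution
print(weighted_error([0.4, 0.1, 0.4, 0.1], h, c, xs))      # 0.2 if D rarely shows the missed points
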


So, Michael wants us to do a quiz.

Because Michael likes quizzes cause he thinks you like

quizzes. And so, I want you to answer

this question before Michael gets a chance to. So

just to be clear here's the question again. What happens to the distribution

over a particular example i when the hypothesis h-sub-t that was output by the

learner agrees with the particular label, y-sub-i.

Okay, so we have four possibilities when they agree.

One is the probability of you seeing that particular

example increases. That is, you increase the value of


D-sub-t on i. Or the probability of you seeing that

example decreases. That is, the number d of t

of i goes down, or it stays the same

when they agree. Or, well, it depends on exactly

what's going on with the old value of d and

alpha and all these other things. So, you can't

really give just one of those other three answers.

So those are your possibilities. They're radio buttons,

[LAUGH] only one of them is right. And go.

Need to understand
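For reference, the standard AdaBoost weight update that this quiz is about, written in the lecture's notation (this is the textbook form, not something read off the screen here):

    D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t

where Z_t is whatever normalization constant makes D_{t+1} sum to one. When h_t(x_i) agrees with y_i the product y_i * h_t(x_i) is +1, and when it disagrees it is -1, so whether a particular D_{t+1}(i) goes up, down, or stays put also depends on the normalizer Z_t.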
So that ties together this, what constructing D does

for you, and connecting it to the hardest examples.

So now, that gets us to a nice little

trick where we can talk about how we actually output

our final hypothesis. So, the way you construct your

final hypothesis, the way you do that combination in the

last step, is basically by doing a weighted average. And

the weight is going to be based upon this alpha

sub T. So the final hypothesis is just the sgn function of the weighted sum of

all of the rules of thumb, all of the weak classifiers that you've been picking

up over all of these time steps, where

they're weighted by the alpha sub T's. And remember,

the alpha sub T is one half of the natural log of one minus epsilon T over
epsilon T. That is to say, it's a measure of how

well you're doing with respect to the underlying error. So, you get

more weight if you do well, and less weight if you

do less well. So what does this look like to you?

Well, it's a weighted average based on how well you're doing, or

how well each of the individual hypotheses is doing, and then you

pass it through a thresholding function where, if it's below zero,

you say, you know what? Negative. And if it's above zero, you

say, you know what? Positive. And if it's zero, you just throw

up your hands and return zero. In other words, you return literally

the sign of the number. So you are throwing away information there, and

I'm not going to tell you what it is now, but when we go

to the next lesson it's going to turn out that that little bit of

information you throw away is actually

pretty important. But that's just a little

bit of a teaser. We'll get back to that there. Okay so, this

is boosting, Michael. There's really nothing else to it. You have a very

simple algorithm, which can be written down in a couple

of lines. The hardest parts are constructing the distribution, which I

show you how to do over here, and then simply bringing

everything together, which I show you how to do over here.

>> Alright yeah, I think it doesn't seem so bad

and I feel like I could code this up, but I

would be a little happier if I had a handle


on what the, why alpha is the way that it is.

>> Well there's two answers. The first answer

is: you use natural logs because you're using

exponentials and that's always a cute thing to

do. And of course, you're using the error term

as a way of measuring how good the hypothesis

is. And the second answer is, it's in the

reading you were supposed to have done. [LAUGH] So,

go back and read the paper now that you've

listened to this and you will have a much

better understanding of what it's trying to tell you.

>> Thanks

>> You're welcome. I'm all about helping others, Michael, you know that.
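
A minimal sketch of the two formulas just described, with function names of my own choosing; this is the standard AdaBoost form, not code from the course.

from math import log

def alpha(epsilon_t):
    """alpha_t = 1/2 * ln((1 - epsilon_t) / epsilon_t): low-error rounds get big weight."""
    return 0.5 * log((1.0 - epsilon_t) / epsilon_t)

def sgn(v):
    """The thresholding step: +1, -1, or 0 (the magnitude thrown away here is the teaser above)."""
    return 1 if v > 0 else (-1 if v < 0 else 0)

def final_hypothesis(weighted_weak_learners, x):
    """Sign of the alpha-weighted vote. weighted_weak_learners is a list of
    (alpha_t, h_t) pairs, where each h_t(x) returns +1 or -1."""
    return sgn(sum(a * h(x) for a, h in weighted_weak_learners))

print(alpha(0.4), alpha(0.1))   # ~0.20 vs ~1.10: the better round gets the bigger vote
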
See translate for 10 min Talks about Good Answers
as we create

more and more of these hypotheses, which you would think would

make something more and more complicated, it turns out that you

end up with something smoother, less likely to overfit and ultimately,

less complicated. So the reason boosting tends to do well and tends to avoid

overfitting even as you add more and more learners is that you're increasing

the margin. And there you go. And if you look in the reading that

we gave the students there's actually a

detailed description about this in a proof.

>> Cool.
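
For reference, the quantity usually meant by "margin" in the boosting literature (the reading referred to above develops this in detail): for a labeled example (x, y) with y in {-1, +1}, the normalized margin of the boosted vote is

    margin(x, y) = y * (sum_t alpha_t * h_t(x)) / (sum_t |alpha_t|)

a number in [-1, +1] that measures not just whether the vote lands on the correct side of zero but how confident it is. Adding more weak learners can keep pushing this margin up even after the training error has already hit zero, which is the sense in which the combined hypothesis gets "smoother" rather than more overfit.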

>> Okay. So, there you go, Michael.

Do you think, then, that boosting never overfits?

>> [SOUND] Never seems like such a strong word.

I mean, the story that you told says that it's going to try

to separate those things out, but I guess I guess it doesn't have

to be able to do that. I mean, it could be that

for example all the weak learners

are, I dunno, very unconfident, very inconsistent.

>> Hm. Okay, well you know, maybe, maybe it's worthwhile to

take a little diversion here to take a five second quiz.

>> I think it's worth the time.


>> All right, Michael. What's the answer?

>> All right. Well, let me start off with what I think the

answer isn't. So, the last one, boosting tends to overfit, if boosting trains

too long. You just told me a story about that not being true.

So I'm going to eliminate that

one from consideration. Boosting training too long.

>> Oh, nice to know you were listening.

>> [LAUGH] Boosting training too long, seems like

not a good reason for it to overfit.

>> You're correct.

>> All right. Boosting tends to overfit if it's

a nonlinear problem. So, that doesn't seem right. I

mean I guess, no, this one just doesn't seem right

at all. Like I don't see why, why the problem


being linear or nonlinear, has anything to do with overfitting.

>> Okay.

>> A whole lot of data is the opposite of what tends

to cause overfitting. If there's lots of data then you'd think that it

would actually do a pretty reasonable job of, you know, there's a

lot to fit. There's a lot going on there. It's unlikely to overfit.

>> Right, and

in fact if a whole lot of data included all of the data, and you actually

could get zero training error over it, then

you know you have zero training error and zero generalization error.

>> because it'll work on the testing data as well, because it's in there.

>> Right.

>> All right. Weak learner uses artificial neural network with

many layers and nodes. So I'm guessing that you wanted

me to think about that being something that, on its

own, is prone to overfitting, because it's got a lot

of parameters.

>> Sure.

>> So, if, and now we're doing boosting

over that. So we fit a neural net, and

then we fit another neural net, and we

fit another neural net. And we're combining all the

outputs together in the correct, weighted way. It's

not obvious to me that that should be a


good thing to do. I'm not sure it would overfit, but it seems like it sure could.

>> OK, so you're, you're, so for now let's

put a little question mark to it. You think that

might be the right answer, but you want to think about it some more?

>> Yeah let me, let me look at the first

one. Weak learner chooses the weakest output. Well, I mean

boosting is supposed to work as long as we have

a weak learner. And it doesn't matter if it

chooses the weakest or the strongest. All that matters is

it does significantly better than a half. So, like I

feel like the only one, the only one of these

choices that is likely to be true is the second one.

>> And that is, in fact, correct. So let me give you an example

of when that would be correct. So let's imagine

I have a big powerful neural network that could represent

any arbitrary functions. Okay, got lots of layers and lots

of nodes. So, boosting calls it, and it perfectly fits

the training data, but of course overfits. So then it

returns, and it's got no error, which means all of

the examples will have equal weight. And when you go

through the loop again, you will just call the same

learner, which will use the same neural

network, and will return the same neural network.

So every time you call the learner, you'll get zero training error, but you will
just get the same neural network over and over and over again. And a weighted

sum of the same function is just that

function. So if it overfit, boosting will overfit.

>> Interesting. And not only will it overfit, but it'll

just, it'll be stuck in a horrible loop of error.

>> Right. So that's why this

is the sort of situation where you can imagine

boosting overfitting. If the underlying learners all

overfit and you can never get them to stop

overfitting, then there's really not much you can do.

>> Interesting.
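
As a practical, hedged illustration of that point: scikit-learn's AdaBoostClassifier boosts shallow decision stumps by default, which are genuinely weak learners. The data set and parameter values below are illustrative only; handing boosting a base learner that already overfits on its own is exactly the failure case just described.

from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# The default base learner is a depth-1 decision tree (a stump).
# The base learner can be swapped out (the keyword argument for it differs
# across scikit-learn versions), but replacing it with a big, already
# overfitting model undermines the margin story told above.
clf = AdaBoostClassifier(n_estimators=50, random_state=42)
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))
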

>> Now, I do want to have a little semantic argument

with you for a moment, Michael. You used the word strongest at

some point, when you were talking about using the weakest output. And

I just want to point out that, that doesn't really mean anything.

>> What do you mean, it doesn't mean anything?

>> Well, so what's a strong, what would you call

a strong learner?

>> One that is far away from it. If a

weak learner just has to do a little bit better

than a half, it seems like a strong learner would

be something that would be very close to being accurate.

>> Right. Of course, on the other hand,

by that definition all strong learners are also weak learners.


>> Sure.

>> Because anything that does better than a half is still doing better

than a half, which is all it requires to be a weak learner.

>> Yeah, but that's kind of true of people

too. Like a strong person is also a weak person.

>> No.

>> Well it depends how you define it. So,

if you say a weak person is someone who can

at least lift their own arms, then strong people are

also weak people in that they can lift their arms.

>> Yes if you define it that way and if I define

blue to be purple, then I can say blue is purple. But that's

not how people define weak people. They define weak people, by saying they

can't lift more than, not that they can lift at least as much.

>> I see. So it's this piece of terminology that boosting uses that is in

error, not me.

>> That's one interpretation. It's not the one that I would use, but

it's one interpretation. When you say something like a strong learner, I mean,

it makes sense to use that kind of term, and sort of throw

it around, and say, well, by a strong learner I mean someone who's,

or a learner that's going to overfit, or is going to always do

really well on the training data. But in kind of a technical definition

it's very difficult to sort of pin down. So don't get too caught

up in what a strong learner means if you want to write a proof.


Seems fair?

>> Good point yeah, also, also that this whole notion that strong

is sometimes defined as not weak. And it is not the case that

if you have something that's not a weak learner that it's, then

it's a strong learner. In fact, it's no lear, no learner at all.

>> Exactly. So, a weak learner's just defined in a way that

basically says, it gives me at least some information. Good. Let me

just throw one more thing in here and then we can stop

talking about this. There's another, a couple of other cases where boosting

tends to overfit. The one that matters the most, or

comes up the most, is in the case of pink noise.

>> Did you say, peak noise?

>> I said, pink noise. I even wrote it in red, which

looks like pink. It's a strong pink as opposed to a weak pink.

>> [LAUGH]

>> I'm sorry. There's no way for that to be obvious from what we've

talked about, but as a practical matter,

pink noise tends to, cause boosting overfit.

>> Okay, but this is not a term I'm familiar with

unless you're critiquing the musical stylings of a particular performer.

>> [LAUGH] No. Although I did recently see, see them in concert. But

that's a whole other conversation. Okay, so pink noise just means uniform noise.

>> I thought white noise was uniform noise.

>> No, white noise is Gaussian noise. Okay, so pink noise is uniform
noise and white noise is Gaussian noise. This is why, Michael, by the way,

if you ever try to set up a studio or a cool stereo system in your house, you

want a pink noise generator. So that it covers

all the frequencies equally, not just the white noise.

>> Hm.

>> But boosting tends to overfit in those sorts of circumstances. And you

can read more about it in the notes if you want to. But

the one that I really want people to get is that

if you have an underlying weak learner that overfits, then it is difficult

for boosting to overcome that. Because fundamentally you've already done all of

your overfitting and it's, there's really not much for those things to do.

>> Okay. Got it?

>> Got it.

>> Excellent. It all ties back into margins, and it's all one

big story, which I think is the lesson of all of machine learning.
