Machine Learning Essentials
Lecture II: Supervised Machine Learning
What is Machine Learning?
Machine learning (ML) is the study of computer algorithms that improve automatically through experience.
Machine learning approaches are traditionally divided into:
• Supervised learning: The computer is presented with example inputs and their desired outputs, given
by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data)
or a means towards an end (feature learning).
• Reinforcement learning: A computer program interacts with a dynamic environment in which it must
perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it
navigates its problem space, the program is provided feedback that's analogous to rewards, which it
tries to maximize.
What is ML – cont
Tom Mitchell's definition of ML: A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves
with experience E.
Modern-day machine learning has two objectives: one is to classify data based on models that have been
developed; the other is to make predictions about future outcomes based on these models.
What is ML – cont: Relations of ML to AI and Other Fields
Relation to Data Mining:
Machine learning and data mining often employ the same methods and overlap significantly, but while
machine learning focuses on prediction, based on known properties learned from the training data, data
mining focuses on the discovery of (previously) unknown properties in the data.
Relation to Optimization:
Machine learning also has ties to optimization: many learning problems are formulated as minimization
of some loss function on a training set of examples. Loss functions express the discrepancy between
the predictions of the model being trained and the actual problem instances.
Relation to Statistics:
Statistics draws population inferences from a sample, while machine learning finds generalizable
predictive patterns.
Machine Learning in Python
This lecture focuses on practical aspects of machine learning, primarily using Python’s Scikit-Learn package
• In particular:
We introduce the fundamental vocabulary and concepts of machine learning.
We introduce the Scikit-Learn API and show some examples of its use.
We take a deeper dive into the details of several of the most important machine learning approaches, and
develop an intuition into how they work and when and where they are applicable.
We experiment with supervised ML methods.
We experiment with unsupervised ML methods.
Examples of ML Applications
Classification:
You are given a set of labeled points and want to use these to classify
some unlabeled points. The training set is shown on the right.
The ML algorithm's learning result is shown below: the line separating
the two sets can now be used to classify unknown data points.
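A minimal sketch of such a classifier in scikit-learn (the synthetic dataset and the choice of a linear support vector classifier here are illustrative, not the exact example from the figure):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two labeled clusters of points (the "training set")
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
model = SVC(kernel='linear')        # a linear separating boundary
model.fit(X, y)                     # learn the rule from the labeled points
print(model.predict([[0.0, 3.0]]))  # classify a new, unlabeled point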
Examples of ML – cont.
Regression - Predicting Continuous labels
The training set: data with continuous labels (the colors in the picture).
The ML result: the whole space is colored based on the training data,
and this fitted model can now be used to label unknown data.
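A minimal regression sketch along the same lines (the synthetic data and coefficients are illustrative only):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = 10 * rng.rand(40)                 # training inputs
y = 3 * x + 2 + rng.randn(40)         # continuous labels with noise
model = LinearRegression().fit(x[:, np.newaxis], y)
print(model.predict([[5.0]]))         # predict the label of an unknown point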
Examples of ML – cont.
Clustering: Inferring labels on unlabeled data.
The unlabeled data on the right:
The ML algorithm result: dividing the points into groups
of points close to each other (k-means algorithm).
Each group was given a different color for illustration
purposes.
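A minimal k-means sketch of this idea (the synthetic blobs stand in for the figure's data):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # labels are discarded
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
print(kmeans.labels_[:10])        # the inferred group of the first ten points
print(kmeans.cluster_centers_)    # the four group centers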
Examples of ML – cont.
Dimensionality Reduction: Inferring structure of unlabeled data.
The input data:
Result of the ML algorithm
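A minimal dimensionality-reduction sketch using PCA (illustrative only; the slide's figure is not reproduced here):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                   # 4 features per flower
X2 = PCA(n_components=2).fit_transform(X)
print(X.shape, '->', X2.shape)         # (150, 4) -> (150, 2)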
Hands on with Python Scikit-Learn
To install: pip install -U scikit-learn
pip install seaborn # for later use
To check installation:
python -m pip show scikit-learn # to see which version and where scikit-learn is installed
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for classification and
the diabetes dataset for regression.
To load these datasets use:
from sklearn import datasets
iris = datasets.load_iris() # database of flowers
digits = datasets.load_digits() # database of the digits 0-9
diabetes = datasets.load_diabetes() # diabetes database
# print(iris.data); print(iris.target); print(digits.images[0]); etc
Scikit-Learn - cont
import seaborn as sns # using the seaborn library
iris = sns.load_dataset('iris')
iris.head()
Output:
sepal_length sepal_width petal_length petal_width species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
…
The data consists of a table of features (columns) describing each measurement,
plus a target array containing the labels (the species).
For plotting:
import seaborn as sns; sns.set()
sns.pairplot(iris, hue='species', height=1.5)
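For use with Scikit-Learn, the feature table and the target array are typically separated; a short sketch, assuming the seaborn iris DataFrame loaded above:
X_iris = iris.drop('species', axis=1)   # features matrix, shape (150, 4)
y_iris = iris['species']                # target array, shape (150,)
print(X_iris.shape, y_iris.shape)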
Frequentist (classical) vs. Bayesian Method
Up to now we looked at frequentist methods. The assumptions were:
P1: Probability refers to limiting relative frequency. Probabilities are objective properties of the real world.
P2: Parameters are fixed, unknown constants, and no probability statements can be made about them.
P3: Statistical procedures should be designed to have well-defined long-run frequency properties. For example,
a 95% confidence interval should trap the true value of a parameter with limiting frequency of at least 95%.
Another approach is Bayesian Inference. Its postulates are:
B1: Probability describes degree of belief, not a limiting frequency.
B2: We can make probability statements about parameters.
B3: We make inference about a parameter $\theta$ by producing a probability distribution for it. Inferences such as point
estimates and interval estimates can be derived from it.
Bayesian Method
1. We choose a probability density $f(\theta)$ – called the prior distribution
– this is our belief about the parameter $\theta$ before we see data.
2. We choose a statistical model $f(x \mid \theta)$ that reflects our beliefs about $x$ given $\theta$.
3. After observing the data $X_1, \dots, X_n$ we update our beliefs and calculate the posterior distribution $f(\theta \mid X_1, \dots, X_n)$.
In the case of a single data point:
$$f(\theta \mid x) = \frac{f(x \mid \theta)\, f(\theta)}{\int f(x \mid u)\, f(u)\, du}$$
In the case of $n$ IID observations, $f(x \mid \theta)$ is replaced by the likelihood $\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$.
So $f(\theta \mid x^n) = \mathcal{L}_n(\theta)\, f(\theta) / c_n \propto \mathcal{L}_n(\theta)\, f(\theta)$, where $c_n = \int \mathcal{L}_n(u)\, f(u)\, du$.
Posterior is proportional to Likelihood times Prior.
Bayesian Method – estimation
We use the posterior $f(\theta \mid x^n)$ to make estimates as follows:
Point estimation – typically we take the posterior mean $\bar{\theta}_n = \int \theta\, f(\theta \mid x^n)\, d\theta$.
Interval estimation: find a set $C = (a, b)$ such that $P(\theta \in C \mid x^n) = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$.
Taking $a, b$ such that $\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_{b}^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$ produces the desired result.
Hence, $C$ is a $1-\alpha$ posterior interval.
Example: Let $X_1, \dots, X_n \sim N(\theta, \sigma^2)$ and let $\sigma$ be assumed known. Take the prior $\theta \sim N(a, b^2)$. After some
calculation we get $\theta \mid x^n \sim N(\bar{\theta}, \tau^2)$, where
$\bar{\theta} = w \bar{X} + (1 - w) a$, $\; w = \dfrac{1/\mathrm{se}^2}{1/\mathrm{se}^2 + 1/b^2}$, $\; \mathrm{se} = \sigma/\sqrt{n}$, $\; \dfrac{1}{\tau^2} = \dfrac{1}{\mathrm{se}^2} + \dfrac{1}{b^2}$.
Note that $w \to 1$ and $\tau/\mathrm{se} \to 1$ as $n \to \infty$.
Bayesian Method – interval estimation
Find $c, d$ such that $P(c < \theta < d \mid x^n) = 0.95$. This can be done by choosing $c, d$ such that
$P(\theta < c \mid x^n) = 0.025$ and $P(\theta > d \mid x^n) = 0.025$.
We know $\theta \mid x^n \sim N(\bar{\theta}, \tau^2)$, so $P(\theta < c \mid x^n) = P\!\left(Z < \frac{c - \bar{\theta}}{\tau}\right) = 0.025$, or equivalently $c = \bar{\theta} - 1.96\,\tau$.
Similarly, $d = \bar{\theta} + 1.96\,\tau$.
The 95 percent Bayesian interval is $\bar{\theta} \pm 1.96\,\tau$.
Since $\bar{\theta} \approx \bar{X}$ and $\tau \approx \mathrm{se}$ for large $n$, this is approximately the same as the frequentist confidence interval $\bar{X} \pm 1.96\,\mathrm{se}$.
Bayesian Method – Posterior by simulation
Suppose we draw $\theta_1, \dots, \theta_B \sim f(\theta \mid x^n)$.
The histogram of $\theta_1, \dots, \theta_B$ approximates the posterior density $f(\theta \mid x^n)$.
An approximation to the posterior mean $\bar{\theta}_n = E(\theta \mid x^n)$ is $B^{-1} \sum_{j=1}^{B} \theta_j$.
The posterior $1-\alpha$ interval can be approximated by $(\theta_{\alpha/2}, \theta_{1-\alpha/2})$,
where $\theta_{\alpha/2}$ is the $\alpha/2$ sample quantile of $\theta_1, \dots, \theta_B$.
Once we have a sample $\theta_1, \dots, \theta_B$ from $f(\theta \mid x^n)$, let $\tau_i = g(\theta_i)$; then
$\tau_1, \dots, \tau_B$ is a sample from $f(\tau \mid x^n)$ for the transformed parameter $\tau = g(\theta)$.
This avoids the need to do any analytical calculations.
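A sketch of this simulation recipe applied to the Normal example above, where the posterior is itself Normal so we can draw from it directly; the true mean, the prior parameters, and the sample sizes below are made-up illustrations:
import numpy as np

rng = np.random.RandomState(0)
sigma, a, b = 1.0, 0.0, 2.0                  # known sigma, prior N(a, b^2)
x = rng.normal(1.5, sigma, size=50)          # observed data (true theta = 1.5)

se2 = sigma**2 / len(x)                      # se^2 = sigma^2 / n
w = (1 / se2) / (1 / se2 + 1 / b**2)
post_mean = w * x.mean() + (1 - w) * a       # theta_bar
post_sd = np.sqrt(1 / (1 / se2 + 1 / b**2))  # tau

theta = rng.normal(post_mean, post_sd, size=10000)   # theta_1, ..., theta_B
print(theta.mean())                                  # approximates the posterior mean
print(np.percentile(theta, [2.5, 97.5]))             # approximate 95% posterior interval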
Simulation Methods – for Bayesian Inference
In Bayesian inference we need to calculate certain integrals.
Let the prior be $f(\theta)$ and let the data be $X_1, \dots, X_n$ with likelihood $\mathcal{L}_n(\theta)$.
The posterior is $f(\theta \mid x^n) = \mathcal{L}_n(\theta)\, f(\theta) / c$, where $c = \int \mathcal{L}_n(u)\, f(u)\, du$.
The posterior mean is $\bar{\theta}_n = \int \theta\, f(\theta \mid x^n)\, d\theta = \frac{1}{c} \int \theta\, \mathcal{L}_n(\theta)\, f(\theta)\, d\theta$.
The integrals above, or their corresponding sums in the case of a discrete space, are calculated using simulation,
which is explained next.
Simulation Methods – for calculating integrals
Monte Carlo Integration:
We want to calculate some integral $I = \int_a^b h(x)\, dx$.
We write it as $I = \int_a^b w(x)\, f(x)\, dx$,
where $w(x) = h(x)(b - a)$ and $f(x) = 1/(b-a)$ is the Uniform$(a, b)$ density.
Then it is an expectation: $I = E_f\big(w(X)\big)$.
We generate $X_1, \dots, X_N \sim \mathrm{Uniform}(a, b)$; then $\hat{I} = N^{-1} \sum_{i=1}^{N} w(X_i) \to I$ by the law of large numbers.
The standard error is $\widehat{\mathrm{se}} = s/\sqrt{N}$, where $s^2 = \frac{\sum_{i=1}^{N} \big(w(X_i) - \hat{I}\big)^2}{N - 1}$.
A $1-\alpha$ confidence interval is $\hat{I} \pm z_{\alpha/2}\, \widehat{\mathrm{se}}$. We can take $N$ large and make the interval as small as we wish.
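A minimal sketch of this recipe for a simple integral, say $\int_0^2 e^{-x^2}\, dx$ (the integrand is just an illustration, chosen because it has no closed form; the exact value is about 0.882):
import numpy as np

a, b, N = 0.0, 2.0, 100000
rng = np.random.RandomState(1)
X = rng.uniform(a, b, N)               # X_1, ..., X_N ~ Uniform(a, b)
w = np.exp(-X**2) * (b - a)            # w(x) = h(x) (b - a)
I_hat = w.mean()                       # estimate of the integral
se = w.std(ddof=1) / np.sqrt(N)        # standard error
print(I_hat, '+/-', 1.96 * se)         # 95% confidence interval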
Simulation Methods – another example
We start with the standard Normal PDF $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
and want to calculate its CDF $F(x) = \int_{-\infty}^{x} f(s)\, ds$.
We write it as $F(x) = \int h(s)\, f(s)\, ds$, where $h(s) = 1$ if $s \le x$ and $h(s) = 0$ otherwise.
Then we generate $Z_1, \dots, Z_N \sim N(0, 1)$ and have $\hat{F}(x) = N^{-1} \sum_{i=1}^{N} h(Z_i)$, the fraction of draws that fall below $x$.
For example, with $x = 2$, the true value is 0.9772; the Monte Carlo estimate with $N = 10{,}000$ is 0.9751, and with
$N = 100{,}000$ we get 0.9771.
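A sketch reproducing this estimate (the exact number printed will vary with the random seed):
import numpy as np

rng = np.random.RandomState(0)
x = 2.0
Z = rng.standard_normal(100000)   # Z_1, ..., Z_N ~ N(0, 1)
F_hat = np.mean(Z <= x)           # fraction of draws below x
print(F_hat)                      # should be close to the true value 0.9772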
Statistics in Data Science – Assignment II
An example where statistical tools are used in Data Science.
Say we are trying to understand complex data. It is very high-dimensional, so we try to come up with
features that can characterize it.
We invent many features, and then we want to choose a subset of them that is useful in classification.
If we are given labeled data, we can sample two classes, calculate their means over samples of size n,
estimate the variances, and test the hypothesis that the two samples came from the same population. If the
hypothesis is rejected, we keep the feature; otherwise we drop it. A sketch of such a test follows the assignment below.
Assignment II: Go over some datasets in the UCI repository, especially those with many features, and test whether all
the columns they give are really useful. Do it by picking a sample size n, estimating the mean and variance
of the population based on the sample, and using the theorem about the distribution of sample means
(remember the sqrt(n)!!) to determine whether the two samples came from the same population.
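A sketch of what the per-feature test might look like, using a two-sample z statistic based on the theorem about the distribution of sample means; the samples, the 1.96 threshold (5% level), and the helper name keep_feature are placeholders to be adapted to your chosen UCI dataset:
import numpy as np

def keep_feature(class_a, class_b):
    """Two-sample z-test on one feature; keep it if the class means differ significantly."""
    n1, n2 = len(class_a), len(class_b)
    mean1, mean2 = class_a.mean(), class_b.mean()
    var1, var2 = class_a.var(ddof=1), class_b.var(ddof=1)
    # standard error of the difference of sample means -- note the sqrt(n)!
    se = np.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2) / se
    return abs(z) > 1.96          # reject "same population" at the 5% level

# illustrative samples of one candidate feature from two classes
rng = np.random.RandomState(0)
sample_a = rng.normal(0.0, 1.0, 100)
sample_b = rng.normal(0.3, 1.0, 100)
print(keep_feature(sample_a, sample_b))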
Naive Bayes Classification
Naive Bayes models are a group of extremely fast and simple classification algorithms that are often suitable
for very high-dimensional datasets. They are fast to train and have very few tunable parameters.
We focus on an intuitive explanation of how naive Bayes classifiers work, with a couple of examples.
Let $L$ denote a label and $F$ denote an observed feature (a column in the dataset). Bayes' theorem gives
$$P(L \mid F) = \frac{P(F \mid L)\, P(L)}{P(F)}$$
If we are trying to decide between two labels – call them $L_1$ and $L_2$ – then one way
to make this decision is to compute the ratio
$$\frac{P(L_1 \mid F)}{P(L_2 \mid F)} = \frac{P(F \mid L_1)\, P(L_1)}{P(F \mid L_2)\, P(L_2)}$$
Now we need a way to calculate $P(F \mid L)$.
Gaussian Naive Bayes – an example
The assumption: The data from each label is drawn from a simple Gaussian distribution
One simple model assumes that the data is described by a Gaussian distribution with no covariance between
dimensions. We can fit this model by simply finding the mean and standard deviation of the points within each
label, which is all you need to define such a distribution.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)  # 100 points, 2 features, 2 labels
model = GaussianNB()
model.fit(X, y)
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)   # 2000 new points spread over the plane
ynew = model.predict(Xnew)                        # predicted labels for the new points
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='RdBu', alpha=0.1)
plt.axis(lim); plt.show()  # shows the boundary of the classification
yprob = model.predict_proba(Xnew) # predicting probabilities for each label
yprob[-8:].round(2)
Multinomial Naive Bayes
The assumption: The features are assumed to be generated from a simple multinomial distribution. The
multinomial distribution describes the probability of observing counts among a number of categories, and thus
multinomial naive Bayes is most appropriate for features that represent counts or count rates.
The idea is precisely the same as before, except that instead of modeling the data distribution with the
best-fit Gaussian, we model the data distribution with a best-fit multinomial distribution.
Examples: Text classification. (in TA class)
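A minimal sketch of this idea on a toy corpus (the sentences and labels below are made up for illustration; the TA class uses a real dataset):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_text = ["the match ended in a draw", "the team scored a late goal",
              "the election results were announced", "parliament passed the new law"]
train_labels = ["sports", "sports", "politics", "politics"]

model = make_pipeline(CountVectorizer(), MultinomialNB())  # word counts -> multinomial NB
model.fit(train_text, train_labels)
print(model.predict(["the team won the match"]))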
When to Use Naive Bayes
Because naive Bayesian classifiers make such stringent assumptions about data, they will
generally not perform as well as a more complicated model. That said, they have several
advantages:
•They are extremely fast for both training and prediction
• They provide straightforward probabilistic prediction
• They are often very easily interpretable
• They have very few (if any) tunable parameters
Linear Regression
Linear regression models are a good starting point for regression tasks. They can be fit very quickly and are
very interpretable.
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit); plt.show()
print("Model slope: ", model.coef_[0])
print("Model intercept:", model.intercept_)
# Output:
# Model slope:  2.02720881036
# Model intercept: -4.99857708555
Basis Function Regression
Use polynomial basis functions instead of a plain linear function:
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.linear_model import LinearRegression
def GetPolyData(x, n):
    # noisy samples of sin(x)
    return np.sin(x) + np.random.uniform(-.2, .2, n)

n = 250                                # number of elements
x = np.array(range(n)) / 100           # x values in [0, 2.5)
y = GetPolyData(x, n)
train_x = np.array(x)
train_y = np.array(y)
polyModel = PolynomialFeatures(degree=4)
xpol = polyModel.fit_transform(train_x.reshape(-1, 1))   # columns 1, x, x^2, x^3, x^4
linearModel = LinearRegression(fit_intercept=True)
linearModel.fit(xpol, train_y[:, np.newaxis])
polyfit = linearModel.predict(xpol)    # fitted values on the training points
plt.scatter(train_x, train_y)
plt.plot(train_x, polyfit, color='red')
plt.show()
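The same fit can be written more compactly with a scikit-learn pipeline; a sketch of that variant (not on the original slide, reusing train_x and train_y from above):
from sklearn.pipeline import make_pipeline

poly_model = make_pipeline(PolynomialFeatures(degree=4),
                           LinearRegression(fit_intercept=True))
poly_model.fit(train_x.reshape(-1, 1), train_y)
polyfit = poly_model.predict(train_x.reshape(-1, 1))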
Classification using Support Vector Machines
An example with two sets and several separating lines.
A new point x will be classified differently depending on which separating line is chosen.
Support Vector Machines: Maximizing the Margin
The SVM chooses the separating line that maximizes the margin – the distance from the line to the nearest training points (the support vectors).
Support Vector Machines (SVM) - example
from sklearn.datasets import make_blobs
from sklearn.svm import SVC  # "Support vector classifier"
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

def plot_svc_decision_function(model, ax=None, plot_support=True):
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')

model = SVC(kernel='linear', C=1E10)
model.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model)
plt.show()

print(model.support_vectors_)
# array([[ 0.44359863,  3.11530945],
#        [ 2.33812285,  3.43116792],
#        [ 2.06156753,  1.96918596]])
Kernel SVM - example
from sklearn.datasets import make_circles
from sklearn.svm import SVC  # "Support vector classifier"
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

def plot_svc_decision_function(model, ax=None, plot_support=True):
    if ax is None:
        ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # create grid to evaluate model
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)
    xy = np.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    # plot decision boundary and margins
    ax.contour(X, Y, P, colors='k',
               levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    # plot support vectors
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

X, y = make_circles(100, factor=.1, noise=.1)   # two concentric circles, not linearly separable
clf = SVC(kernel='rbf', C=1E6)                  # rbf = radial basis function kernel
clf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=300, lw=1, facecolors='none')
plt.show()
SVM – Face Recognition
from sklearn.svm import SVC
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.decomposition import PCA as RandomizedPCA
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.target_names)
print(faces.images.shape)

# let's plot them
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[], xlabel=faces.target_names[faces.target[i]])
plt.show()

pca = RandomizedPCA(n_components=150, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)

# define training and test sets, then tune the SVM with a grid search
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target, random_state=42)
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)
print(grid.fit(Xtrain, ytrain).best_params_)
model = grid.best_estimator_
yfit = model.predict(Xtest)

# show the test images with their predicted names
fig, ax = plt.subplots(4, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14)
plt.show()

print(classification_report(ytest, yfit, target_names=faces.target_names))

# confusion classes
mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()
SVM – Face Recognition results
Figures: predictions by the SVM, and the confusion matrix between classes.
Partial classification report:
precision recall f1-score support
Ariel Sharon 0.65 0.73 0.69 15
Colin Powell 0.81 0.87 0.84 68
Donald Rumsfeld 0.75 0.87 0.81 31
George W Bush 0.93 0.83 0.88 126
Gerhard Schroeder 0.86 0.78 0.82 23
Hugo Chavez 0.93 0.70 0.80 20
Junichiro Koizumi 0.80 1.00 0.89 12
Tony Blair 0.83 0.93 0.88 42
avg / total 0.85 0.85 0.85 337
Decision Trees
Decision trees are extremely intuitive ways to classify or label objects: you simply
ask a series of questions designed to zero in on the classification.
# Creating a decision tree
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow')  # the data
tree = DecisionTreeClassifier().fit(X, y)
Visualization of the decision tree splitting of the data
Decision Trees - cont
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt

def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()), zorder=1)
    ax.set(xlim=xlim, ylim=ylim)
    plt.show()

# creating the data
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow')

# computing the tree classifier
tree = DecisionTreeClassifier().fit(X, y)

# visualizing the classifier
visualize_classifier(DecisionTreeClassifier(), X, y)
Random Forests
A random forest is an ensemble method: many decision trees, each trained on a random subset of the data, vote on the final classification.
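A minimal sketch of this bagging-of-trees idea built by hand (the synthetic data, the 100 trees, and the 80% subsample size are illustrative), before using RandomForestClassifier directly on the next slide:
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
# 100 trees, each trained on a random 80% subsample of the data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        max_samples=0.8, random_state=1)
bag.fit(X, y)
print(bag.score(X, y))   # training accuracy of the ensemble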
Random Forests – digits
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

# get the data
digits = load_digits()
digits.keys()

# visualize some digits
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))
plt.show()

# define training data
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=0)

# create RF model
model = RandomForestClassifier(n_estimators=1000)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

print(metrics.classification_report(ypred, ytest))

# confusion matrix
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()
Random Forests – digits results
Figures: the 8x8 digit images, the classification report, and the confusion matrix.
We find that a simple, untuned random forest results in a very accurate classification
of the digits data.
Thank you for listening