ML Lectures Summary 2
Supervised Learning
• Given a set of data and labels, learn a model which will predict a label for new data
• D = {Xi, Yi}, learn F: Xk -> Yk
• Used to automate manual labour

Unsupervised Learning
• Discover patterns in data; group the data into Y classes using a model or function
• D = {Xi}, F: Xi -> Yj
• Discovering trending topics, grouping data into clusters, outlier detection

Reinforcement Learning
• Reasoning under uncertainty to make optimal decisions (what actions to be taken to maximize reward)
• D = {environment(e), actions(a), rewards(r)}, learn policy and utility functions
• Policy: F1: {e, r} -> a
• Utility: F2: {a, e} -> r
Machine Learning Workflow Terminology
Instance: Pikachu
Label/Class: Mouse
Features/Attributes: Abilities, Weight,
Legendary
Feature Values: lightning rod, 2, yes
Feature Vector: (lightning rod, 2, yes)
A model: an equation that links the values of some features to the predicted value of the
target variable;
• finding the equation (and coefficients in it) is called ‘building a model’ (see also
‘fitting a model’).
Score functions/Fit statistics/Score metrics – measures of how well the model fits the data.
Feature selection – reducing the number of predictors by selecting the important ones
(dimensionality reduction).
Feature extraction – reducing the number of predictors by means of a
mathematical operation (e.g., PCA).
Types of Data
• Images: 2D arrays of numbers (RGB values for each pixel)
• Text: words/letters need to be converted into a format understandable to computers
Preprocessing
Robust scaler = Same as Standard Scaler, but with median instead of mean and
interquartile range instead of standard deviation.
- Better for skewed data
- Deals better with outliers
- After robust scaling, the median of the scaled data is around 0
MinMax Scaler= shifts data to an interval set by xmin and xmax
- when data has a bounded range/ distribution is not Gaussian (ex: image processing: 0-255)
Normalizer= Does not work by feature (column) but by row (sample)
- Each row of data is rescaled so that its norm becomes 1
- Compute the norm of the vector (square root of the sum of the squared elements)
- Divide each element by the norm
- Used only when the direction of data matters
- Helpful for histograms
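A minimal sketch (not from the slides) comparing the four scalers in scikit-learn; the tiny array X is an invented example with an outlier in the second column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # second feature has an outlier

X_std = StandardScaler().fit_transform(X)    # mean 0, unit variance per column
X_rob = RobustScaler().fit_transform(X)      # median 0, scaled by the interquartile range
X_mm = MinMaxScaler().fit_transform(X)       # each column mapped to [0, 1]
X_norm = Normalizer().fit_transform(X)       # each ROW rescaled to unit (L2) norm

print(np.linalg.norm(X_norm, axis=1))        # ~[1. 1. 1.]
```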
Univariate Transformations
Guiding Principles in ML
To avoid over-fitting:
Build a classifier using the training set and evaluate it using the test
set.
Cross Validation- To evaluate (test) your model’s ability
to predict new data
- Detect overfitting or selection bias
- Techniques:
o K-fold cross validation
o Leave one out (K-fold cross validation to the extreme)
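A hedged sketch of k-fold and leave-one-out cross-validation with scikit-learn; the k-NN classifier and the iris dataset are just stand-ins for "your model" and "your data":

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# k-fold: the data is split into 5 folds; each fold is used once as the test set
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# leave-one-out: k = N, one test point per fold
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```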
Feature Selection Strategies
• Get the best fit for a particular model
• Ideally: exhaustive search over all possible feature combinations (GridSearch)
• Exhaustive search is infeasible (and has multiple testing issues)
• Use heuristics in practice

Model-Based Selection
• Build a model, select the features most important to the model
• Lasso, other linear models, tree-based models
• Multivariate – linear models assume a linear relation
Categorical variables
Data often has categorical (or discrete) features.
Remember measurement levels:
• Categorical- entities divided into distinct categories; no particular relationship
between them; no order, no interval
• Ordinal- logical order
• Interval-equal intervals on the variable represent equal differences in the measured
property; no true 0
• Ratio- same as interval, but the ratios of scores on the scale must also make sense and
have true 0 value
Often necessary to represent categorical features as numbers.
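A small illustrative sketch (my addition) of representing a categorical feature as numbers via one-hot encoding; the "color" column is a made-up example:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Option 1: pandas
dummies = pd.get_dummies(df["color"])               # three 0/1 columns

# Option 2: scikit-learn (fits into pipelines)
enc = OneHotEncoder()
onehot = enc.fit_transform(df[["color"]]).toarray()
print(enc.categories_, onehot.shape)                # ['blue', 'green', 'red'], (4, 3)
```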
Working with images
Digital images
• The values are all discrete and integers
• Can be considered as a large array of discrete dots
• Each dot has a brightness associated with it
• These dots are called picture elements – pixels

Arrays and Images
• Images are represented as matrices (e.g. numpy arrays)
• Can be written as a function f(x,y) -> pixel intensity
• Types of images: Binary Images, Grayscale Images and Color Images
Multi-channel Images
• Such an image is a stack of multiple matrices, representing the multiple channel values for each pixel
• E.g. an RGB color is described by the amount of red, green and blue in it.
Measures for segmentation
• A segmentation result can be measured if the ground truth is known
• Empirical Measures:
• Accuracy, Precision and Recall
• F-score, Jaccard Index
Text data- can consist of words, sentences, or entire documents, often varying in length
High variability: punctuation, different forms for the same word, typos, Capitalized letters,
etc ...
(a) Representing Text Data
- A sentence can be broken down into individual words.
- Each word is represented as a categorical variable (e.g., using one-hot encoding)
- One-hot encoded vectors are possibly:
o very large (10,000s of words in a general vocabulary)
o very sparse (vectors are all “0”s with a single “1”)
- Problems
o Concatenating all word vectors results in massive vectors.
o Sentences have unequal length, which is unsuitable for most ML methods.
Bag-of-Words representation
- Most common technique to numerically represent text is Bag of
Words.
- Represents each sentence or document as a vector with a value
for each word in the vocabulary.
o Binary: word present or absent in the document
o Count: how often the word appears in the document
o Popular approach: Term Frequency x Inverse
Document Frequency (TF-IDF)
- Use a single vector of length equal to the size of the vocabulary.
- Each component is the number of times that word appears in the sample (e.g., in a
sentence or document).
- Useful for sentiment analysis (e.g. recognizing words such as “great”, “terrible” ...)
Term Frequency (TF) = (Number of times term t appears in a document)/(Number of terms
in the document)
Inverse Document Frequency (IDF) = log(N/n), where,
- N= number of documents and
- n= number of documents a term t has appeared in
- IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low.
Thus having the effect of highlighting words that are distinct.
- We calculate TF-IDF value of a term = TF * IDF
Advantages:
• Highly interpretable. Each word is an independent feature.
• Simple method.
• Fairly effective approach for some applications
Limitations:
• All structure is lost!
- Crucial information may be lost. e.g. “I passed the IML exam, but I failed the
Computational Linguistics exam”
• Misspellings:
- e.g. “machine” and “machnie” will be counted as a different word
• Some expressions consist of different (multiple) words
- e.g. product review with the word “worth” -> What if the review said “not worth” vs
“definitely worth”
Tackling expressions with multiple words with n-grams: instead of using individual words as tokens, use groups of n consecutive words.
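An illustrative sketch of bag-of-words, TF-IDF and n-grams with scikit-learn's vectorizers; the two toy "reviews" are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["definitely worth the price", "not worth the price"]

counts = CountVectorizer().fit(docs)                      # plain word counts per document
tfidf = TfidfVectorizer().fit(docs)                       # TF * IDF weighting
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)   # unigrams + bigrams

# the bigram vocabulary keeps "not worth" and "definitely worth" as separate tokens
print(sorted(bigrams.vocabulary_))
```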
(b) Text Data Preprocessing
• Tokenization — convert sentences to words
- Process of breaking a stream of textual content up into words, terms,
symbols, or some other meaningful elements called tokens.
- The list of tokens turns into input for additional processing including parsing or text
mining.
- Tokenization can swap out sensitive data
• E.g. Typically payment card or bank account numbers—with a randomized number
in the same format
• Restricting the Vocabulary
- Removing unnecessary punctuation, tags
- Removing stop words— frequent words such as ”the”, ”is”, etc. that have low
semantic content
- Removing infrequent words
o Words that appear only once or twice might not be
helpful
o Restrict vocabulary size to only most frequent words
(for less features)
• Stemming —words are reduced to a root by removing inflection
by dropping unnecessary characters, usually a suffix.
- Ex: studies -> studi; studying- > study
• Lemmatization —Another approach to remove inflection by determining the part of speech
and utilizing detailed database of the language.
- Ex: studies -> study; studying -> study
Nearest-neighbour Classifiers (1-NN)
= Given a set of labeled instances (training set), new instances (test set)
are classified according to their nearest labeled neighbour
• For classifiers, we take numerical data
• Convert unstructured data to numerical
K-NN classifier
3-NN:
• K= hyper-parameter, represents the number of labeled neighbors to
consider
• Test points are assigned the majority label of k nearest neighbours
• Special cases:
o k = N: since all datapoints are considered, the predicted label for a
test point will always be the majority label of all datapoints.
Equivalent to a majority classifier.
o Ties: in case of a tie between predicted labels, there are different
possibilities. The most common one is random selection from the tied labels.
• more neighbours=> Less complex decision boundary
Weights in k-NN= extension of the basic algorithm: not all neighbours get an equal vote
Ex: rating of best restaurant in town, opinion of people living closer to the city count more
Distance-weighting= each neighbour has a weight which is based on its distance to the data
point to be classified (closer a point is to the center of the cell being estimated, the more
influence, or weight, it has in the averaging process)
• Inverse distance weighting – each point has a weight equal to the inverse of its distance to
the point to be classified (neighboring points have a higher vote)
• Inverse of the square of the distance
• Kernel functions (Gaussian kernel, tricube kernel)
• If we change the distance function, the results will change.
• Implication: with distance weighting, k=n is no longer equivalent to a majority based
classifier.
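A minimal k-NN sketch with invented data, showing the k hyper-parameter and distance weighting via scikit-learn's weights="distance" option:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]

uniform = KNeighborsClassifier(n_neighbors=3)                       # plain majority vote
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance")  # closer neighbours vote more

for clf in (uniform, weighted):
    clf.fit(X_train, y_train)
    print(clf.predict([[4, 4]]))
```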
Computing distance in k-NN
• Euclidean distance (straight line)
• Manhattan distance (distance between projections on the axes)

Nearest centroid classifier
= take the nearest centroid and compare how close you are to it
• find the mean of the clusters and compare how close the test data is to them. The class whose centroid it is closest to, in squared distance, is the predicted class for the new sample
KNN Regression
= use k-nn to fit your data
• takes the distances and fits the best line on it
• can weight your regressor based on distance; tries to fit your outliers as well
Advantages
• The cost of the learning process is zero
• No assumptions about the characteristics of the concepts to learn have to be made
• Complex concepts can be learned by local approximation using simple procedures

Disadvantages
• The model cannot be interpreted (there is no description of the learned concepts)
• It is computationally expensive to find the k nearest neighbours when the dataset is very large
• Performance depends on the number of dimensions that we have (curse of dimensionality)
Curse of Dimensionality and Overfitting
Classification task: CATS versus DOGS- 10 instances (images of cats and dogs)
Feature 1: average amount of red color in image
• More information is needed for classification, therefore we add a
second feature
Feature 2: average amount of green color in image
• Even more information is needed for classification, therefore we
add a third feature
Feature 3: average amount of blue color in image
Regression
In machine learning, supervised learning to predict continuous outputs (y) is called
regression.
Needed to predict outputs:
• Continuous or categorical input features (x);
• Training examples: many x for which y is known (e.g. many people of whom we know the
height, predicting the housing prices);
• A model, a function that represents the relationship between x and y;
• A cost function, which tells us how well our model approximates the training examples;
• Optimization, a way of finding parameters for the model while minimizing the loss
function.
Bias/Intercept
If the line does not pass through the origin:
• Introduce a bias (intercept) term (b)
• Parameter b captures the average difference between the values for y and their estimates (m*x); adding it increases model complexity (one more degree of freedom)
Finding m and b
Using Scikit-Learn:
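A hedged sketch of "finding m and b" with scikit-learn's LinearRegression; x and y are toy values, not the lecture's data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0]])   # one feature, shape (n_samples, 1)
y = np.array([3.1, 4.9, 7.2, 8.8])           # roughly y = 2x + 1

reg = LinearRegression().fit(x, y)
m, b = reg.coef_[0], reg.intercept_          # slope m and bias (intercept) b
print(m, b, reg.predict([[5.0]]))
```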
Multi-dimensional Inputs
• The model can be extended for additional input dimensions
• y = ax^3 + bx^2 + cx + d (third degree polynomial) – fits a curve to 4 points
Best fit vs perfect fit
• Perfect fit – Goes through all points
in the data.
• The best fit may not be the perfect fit.
• The best fit should give the best
predictive value.
Overfitting -> fitting a high degree polynomial
Measuring Fit
• Fit – accuracy of a predictive model; the extent to which predicted values of a target
variable are close to the observed values of that variable
• For regression models, the fit can be expressed as R2 (the percentage of variance
explained by the model)
• Residual variance: SSR = Σi (yi – ŷi)^2, where ŷi is the predicted value for yi
• Total variance: SST = Σi (yi – ȳ)^2, where ȳ is the mean of the observed values
• R^2 = 1 – (SSR / SST)
Overfitting vs Underfitting
• Machine learning is so effective in finding the best fit that it is likely to construct a complex
model that would never generalize to unseen data.
• However, a complex model that reduces prediction error and yields a better fit also models
noise.
• The relation between the complexity of the induced model and underfitting and overfitting
is a crucial notion in data mining
Underfitting:
• The induced model is not complex (flexible) enough to model the data
• Performs badly both on training and validation set

Overfitting:
• The induced model is too complex to model the data (tries to fit noise)
• Performs better on training set than on validation set
Validation Set
Tuning hyper-parameters:
• Never use test data for tuning the hyper-parameters
• We can divide the set of training examples into two
disjoint sets: training and validation
• Use the first set (i.e., training) to estimate the
coefficients m for different values of hyperparameter(s) (degree of the polynomial)
• Use the second set (i.e., validation) to estimate the best degree of the polynomial, by
evaluating how well the classifier does on this second set
• Then, test how well it generalizes to unseen data
Methods
Ridge Regression Lasso Regression Elastic Net Regression
Gradient Descent
- m is a parameter, e.g. one of your weights or biases. Notice that we only update a single parameter, i.e. we could update a single weight.
- η is the learning rate (eta) => but also sometimes alpha α or gamma γ is used.
- J is formally known as objective function, but most often it's called cost
function or loss function.
=> We take each parameter m and update it by taking the original parameter m and subtracting the learning rate η times the gradient: m := m – η · ∂J/∂m
Minimize the loss function J(m), by moving the parameters m in the opposite direction of the gradient of J(m).
Learning rate: describes the speed of the descent.
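A hand-written numpy sketch of the update rule m := m − η·∂J/∂m on a 1-D least-squares problem (the data and learning rate are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
m, b = 0.0, 0.0          # parameters to learn
eta = 0.05               # learning rate

for _ in range(2000):
    y_hat = m * x + b
    # gradients of the mean squared error J with respect to m and b
    dm = (2.0 / len(x)) * np.sum((y_hat - y) * x)
    db = (2.0 / len(x)) * np.sum(y_hat - y)
    m -= eta * dm        # step in the opposite direction of the gradient
    b -= eta * db

print(m, b)              # should approach 2 and 1
```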
Lecture 5- Logistic Regression
Linear Regression (recap L4)
can be expressed as ŷ = w·x + b (a weighted sum of the input features plus a bias)
ML terms:
Parameters : variables learned (found) during training, e.g weights (w)
Hyperparameters: variables whose value is set before the training process begins (e.g regularization parameters (alpha),
number of neighbors (k))
Loss function (or error): what you are trying to minimize for a single training example to achieve your objective (e.g. squared loss (y – ŷ)^2)
Cost function: average of your loss functions over the entire training set (e.g. mean squared error)
Objective function: Any function that you optimize during training (e.g. maximum likelihood, divergence between classes)
• A loss function is a part of a cost function which is a type of an objective function.
Gradient of cost function: The direction of your steps to achieve your objective
Logistic Regression
• is a classifier, not a regressor
Regression for Classification
• In some cases, we can use linear regression to determine an appropriate boundary
• Linear regression : output is a linear function of features
• Linear classification
o decision boundary is a linear function of the input
o Logistic Regression (classifier)
o Linear Support Vector machines
Sigmoid Function= Assumes a particular functional form (a sigmoid) is applied to the linear
function of the data
• Output is a smooth and differentiable function of the inputs and the weights
Logistic Regression
Assumes a particular functional form (a sigmoid) is applied to the linear function of the data
• One parameter per data dimension (feature) and the bias
• Features can be discrete or continuous
• Output of the model between 0 and 1
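A small sketch showing the sigmoid applied to a linear function of the data and the matching scikit-learn classifier; the 1-D dataset is invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # squashes w*x + b into (0, 1)

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0, 0], clf.intercept_[0]
print(sigmoid(w * 2.0 + b))             # class-1 probability for x = 2.0
print(clf.predict_proba([[2.0]]))       # same value appears in the second column
```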
ML Terms:
Decision Boundary (for classification): Single line/contour which separate data points into regions
• What is the output label at the boundary?- undetermined?
Probabilistic Interpretation- can be used to model class probability
Two classes => C=class X=instance
Sum of probabilities= 1
Loss Functions
Our goal in training is to find the best set of weights and biases that minimizes the loss function.
• Sum of squared loss for regression: L = (y – ŷ)^2
• Cross-entropy loss for classification (binary): L = –[y log(ŷ) + (1 – y) log(1 – ŷ)]
Regularization
= Regularization is any modification to a learning algorithm that is intended to reduce its generalization error but not its
training error
- A way to cope with the excessive degrees of freedom to limit your parameter space; put a penalty on having
extreme values on the weights to be optimized => narrow range, better fit on the data
• Similar to other data estimation problems, we may not have enough samples to learn good models for logistic regression
classification
• One way to overcome this is to ‘regularize’ the model, impose additional constraints on the parameters we are fitting
• By adding a prior on the weights w
L1 vs L2 Regularization
Squared L2 (Ridge):
- Encourages small weights
- adds “squared magnitude” of coefficient as penalty term
- if 𝛼 is zero, we get back the original Linear Regression
- if 𝛼 is very large, too much weight is added to the penalty and it will lead to
under-fitting
- The parameter α is the importance of the regularization, and it is a hyper-parameter
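An illustrative sketch of L2 (Ridge) and L1 (Lasso) regularization; the dataset and alpha values are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks weights towards zero
lasso = Lasso(alpha=1.0).fit(X, y)   # L1: can set unimportant weights exactly to zero

print(sum(ridge.coef_ == 0), sum(lasso.coef_ == 0))   # Lasso typically zeroes many coefficients
```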
Multi-class Classification
• Prediction: class with the highest score
• Prediction: label samples based on decision areas
Logistic regression with multiple classes
Logistic regression can be used to classify data from more than 2 classes. Assume we have k classes then:
Disadvantages
• Linear decision boundary
Lecture 6- Neural Networks
Conventional ML
Feature Extraction
Nonlinear classifiers
Goal: To construct non-linear discriminative classifiers that utilize functions of input variables
Neural Network approach
• Use a large number of simpler (activation) functions
• Functions are fixed (Gaussian, sigmoid, polynomial basis functions)
• Optimization involves linear combinations of these fixed functions
Layers: Input layer(1), Hidden layers (multiple), Output layer(1)
The primate brain
Inspiration: The brain – ~10^11 neurons
- Each neuron communicates with ~10^4 other neurons
Artificial Neural Networks
• Neural networks define functions of the inputs (hidden features), computed by
neurons
• Artificial neurons are called units
Representational power
Neural network with at least one hidden layer is a universal
approximator (can represent any function).
The capacity of the network increases with more hidden units
and more hidden layers
Neural Network Components
• Input layer: x, Independent variable
• An arbitrary amount of hidden layers
• Output layer: y , dependent variable
• A set of weights (coefficients) and biases at each layer (W and b)
• A choice of activation functions for each layer σ.
Activation functions
• Applied on the hidden units
• Achieve nonlinearity
• Popular activation functions:
Backpropagation
Loss can be measured
• Propagating the error back and updating (adjusting) our weights and
biases to minimize loss.
• How do we find the appropriate amount to adjust?
• Compute the derivative of the loss function with respect to weights and
biases
The derivative of the function is the slope of the function
Gradient descent: updating the weights and biases by increasing or
reducing it.
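A hand-rolled numpy sketch of one forward pass through a network with a single hidden layer, just to make the components (W, b, σ, loss) concrete; the weights are random, and backpropagation is only indicated in a comment:

```python
import numpy as np
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(4,))                        # input layer: 4 features
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)    # weights/biases of the hidden layer (3 units)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # weights/biases of the output layer

h = sigmoid(W1 @ x + b1)                         # hidden activations (non-linearity)
y_hat = sigmoid(W2 @ h + b2)                     # network output in (0, 1)
loss = (y_hat - 1.0) ** 2                        # squared loss against a target of 1
# backpropagation would now compute d(loss)/dW and adjust W1, b1, W2, b2 by gradient descent
print(y_hat, loss)
```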
Deep Learning
Deep Neural Network (Deep Learning)
- A multilayer perceptron, or neural network is typically considered deep when it has multiple
(hidden) layers
o can learn a hierarchical feature representation:
o first layer- extract some simple features
o more abstract, high-level representations as you go deeper into the layers
o the extracted features are used in the end for classification
- The algorithm extracts features relevant to the target on its own, and performs the classification by itself
Activation map
Convolution Layer- Connect each hidden unit to a small input patch and share the weight across
space. This is called a convolution layer and the network is a convolutional network
- Filters its input to reveal patterns
- A filter (convolutional kernel) slides over the entire image and extracts these numbers (a higher result => the filter pattern exists at that particular location)
Fully Connected NN
Algorithms that can be developed with CNNs:
Segmentation
Classification
Detection
Linear support vector machines
• Focus on boundary points instead of fitting all the points
• Goal: learn a boundary that leads to the largest margin (buffer) from points on
both sides
• Support vectors = the subset of vectors that support (determine) the boundary
o After training an SVM, a support vector is any instance located on the margin
o The decision boundary is entirely determined by the support vectors.
o Any points that are not a support vector have no influence whatsoever; you could remove them, add more
points, or move them around, and as long as they stay off the street they won’t affect the decision
boundary.
o Computing SVM predictions only involves the support vectors, not the whole training set
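A minimal sketch of a linear SVM and its support vectors on toy data (C is chosen arbitrarily):

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)        # only these points determine the boundary
print(svm.decision_function([[2, 2]]), svm.predict([[2, 2]]))
```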
Decision Function- by computing it => Linear SVM classifier predicts the class of a new instance x
- Takes a dataset as input, gives a decision as output
Hard Margin SVMs -> Define t(i) = –1 for negative instances (if y(i) = 0) and t(i) = 1 for positive instances (if y(i) = 1); then we can express this constraint as t(i)(w^T x(i) + b) ≥ 1 for all instances
Issues
Soft Margin SVM
- Data is not linearly separable =>
o Introduce slack variable
o Allow error e(j) in classification
o Based on the output of the discriminant function w^T x + b
o e(j) approximates the number of misclassified samples
- e(j) measures how much the jth instance is allowed to violate the margin
• When an observation has a distance of 1.5 from the boundary (outcome +1), it lies outside the margin and incurs no cost
• Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur a cost between 0 and 1
Kernel SVMs
One way to make a linear model more flexible is by adding more features.
EX: by adding interactions or polynomials of the input features. (Expand the set of input features, say by also
adding feature1 ** 2, the square of the second feature, as a new feature)
• When we transform back this line to original plane, it maps to ellipse boundary. These transformations are
called kernels.
• As a function of the original features, the linear SVM model is not actually linear anymore
• Soft margin SVM with 2 features: the threshold is a line, the blue lines are the margins
• Kernel functions transform data to higher dimensions
Introducing Kernels
• Kernels work best for “small” n_samples
• Long runtime for “large” datasets (100k samples)
• Real power in infinite-dimensional spaces: rbf!
• Rbf is “universal kernel” - can learn (aka overfit)
anything
Common Kernels:
• Polynomial kernel: K(x, x') = (γ x^T x' + r)^d
• RBF (Gaussian) kernel: K(x, x') = exp(–γ ‖x – x'‖^2)
SVM hyperparameters:
• Linear SVM hyperparameters- C (regularization)
• Polynomial SVM hyperparameters- C (regularization),
d (polynomial degree)
• RBF SVM hyperparameters- C (regularization), gamma
(width of kernel)
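A short sketch of kernel SVMs and their main hyper-parameters; the two-moons data and the C / gamma / degree values are illustrative only:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # not linearly separable

poly = SVC(kernel="poly", degree=3, C=1.0).fit(X, y)          # polynomial kernel: C and degree d
rbf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)           # RBF kernel: C and gamma

print(poly.score(X, y), rbf.score(X, y))   # training accuracy only, for illustration
```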
SVM Regression
Just like in the classification setting, in the regression setting we may also have
that not all observations fit our
requirement.
Hard-Margin SVM- Not all observations may fit our requirement-> Solved by
introducing variables ε:
One-Class Classification
• Learning from examples of just one class, e.g. positive examples?
• Desirable if there are many types of negative examples.
• "Outlier/Novelty Detection" problems can be formulated as one-class classification
Summary- SVMs
• A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane.
• Given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which
categorizes new examples.
• In two dimensional space this hyperplane is a line dividing a plane in two parts where in each class lay in
either side.
Kernel SVMs
• The important parameters
• the regularization parameter C,
• the choice of the kernel, and the kernel-specific parameters.
• The RBF kernel has only one parameter, gamma, which is the inverse of the width of the Gaussian kernel.
gamma and C both control the complexity of the model,
• C and gamma should be adjusted together.
Linear SVMs
Advantages
• Accuracy
• Works well on smaller, cleaner datasets
• Can be more efficient because it uses a subset of training points
Disadvantages
• Not suited to larger datasets as the training time with SVMs can be high
• Less effective on noisier datasets with overlapping classes
• Linear SVMs have a linear decision boundary
• Originally designed as a two-class classifier

Kernel SVMs
Advantages
• Allow for complex decision boundaries, even if the data has only a few features
• Work well on low-dimensional and high-dimensional data (i.e., few and many features)
Disadvantages
• Do not scale very well with the number of samples. Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
• Require careful preprocessing of the data and tuning of the parameters.
• SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.
• Still, it might be worth trying SVMs, particularly if all of your features represent measurements in similar units (e.g., all are pixel intensities) and they are on similar scales.
Probability distributions
1. 0 ≤ P(A) ≤ 1
2. P(true) = 1, P(false) = 0
3. Set Theory: P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
Conditional Probability
• P(A = 1 | B = 1): the fraction of cases where A is true given B is true
• In some cases, given knowledge of one or more random variables, we can improve our
prior belief of another random variable
• p(slept in movie) = 0.5,
• p(slept in movie | liked movie) = 1/4
• p(didn’t sleep in movie | liked movie) = 3/4
Chain Rule
• The joint distribution can be specified in terms of conditional probability: p(x,y) = p(x|y) p(y)
• Together with Bayes rule (which is actually derived from it) this is one of the most powerful rules in
probabilistic reasoning
p(slept in movie, liked movie) = 1/8;
p(slept in movie | liked movie) = 1/4
p(liked movie) = 1/2;
Bayes Rule: P(c|x) = P(x|c) P(c) / P(x)
Classification Problem
X= observations (features)
Y= class (label)
Types of Classifiers
• Instance-based classifiers: use observations directly without models (e.g. k nearest neighbours)
• Generative: build a generative statistical model (e.g. Bayes classifiers)
• Discriminative: directly estimate a decision rule/boundary (e.g. decision trees)
Probability that you will pass the exam if your teacher is Larry
P(x) = P(Larry) = 28/80
P(c) = P(yes) = 56/80
P(x|c) = P(Larry|yes) = 19/56
P(c|x) = 0.34 * 0.7 / 0.35 = 0.68 => 68%
Bernoulli Naïve Bayes Problem
Classify whether a person passes or fails an exam based on various features.
Instance X: Confident = Yes, Studied = Yes, Sick = No
• Gaussian Naive Bayes classifier (GaussianNB()): assumes that features follow a normal distribution
• Multinomial Naive Bayes (MultinomialNB()): assumes count data, i.e. each feature represents an integer count of something, like how often a word appears in a sentence
• Bernoulli Naive Bayes (BernoulliNB()): assumes your feature vectors are binary (i.e. zeros and ones) or continuous values which can be precisely split (binarized) with a predefined threshold
Naïve Bayes Classifiers
Advantages
• Simple (doing a bunch of counts)
• Works well with a small amount of training data
• The class with the highest probability is considered the most likely class
Disadvantages
• Parameter estimation (determination of p(x|y) when there are not enough cases of a y label)
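A compact sketch of the three Naive Bayes variants in scikit-learn; X_cont, X_count and X_bin are invented placeholders for continuous, count and binary feature matrices:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])
X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 7.2], [4.8, 6.9]])  # continuous features
X_count = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])     # word counts
X_bin = (X_count > 0).astype(int)                                    # binarized features

print(GaussianNB().fit(X_cont, y).predict([[5.0, 7.0]]))
print(MultinomialNB().fit(X_count, y).predict([[0, 3, 2]]))
print(BernoulliNB().fit(X_bin, y).predict([[0, 1, 1]]))
```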
Decision Trees
Decision Trees take one feature at a time and test a binary condition
For instance: is the feature larger than 0.5?
If the answer is YES, grow a node to the left
If the answer is NO, grow a node to the right
• Internal nodes correspond to attributes
(features)
• Leafs correspond to classification outcome
(output)
• Branching is determined by the best attribute
value
• Decision Tree grows with each level of
questions
• Each node (box) of the decision tree tests a condition on a feature. “Questions”= thresholds on single features
• All decision boundaries are perpendicular to the feature axes, because at each node a decision is made about a
single feature
Algorithm
• Choose an attribute on which to descend at each level
• Condition on earlier (higher) choices
• Generally, restrict only one dimension at a time
• Declare an output value when you get to the bottom

If an attribute is continuous
• Suppose we have a training set with the attribute “weight” where Weight = 50, 54, 65, 75, 82, 95
• The algorithm will consider the following possible splits:
Weight <= 54 AND Weight > 50
Weight <= 65 AND Weight > 54
Weight <= 75 AND Weight > 65
Weight <= 82 AND Weight > 75
Weight <= 95 AND Weight > 82
To construct a useful decision tree
1. Start from an empty decision tree
2. Split on the next best attribute
3. Repeat
• What is the best attribute?-> Can be determined using Information Theory, Gini Coefficient, Conditional
Entropy
• Similar to the game of “20 questions”
• The order of features is important
• “Guess the food”: it is better to start with the question “Is it a dessert?”, rather than with “Is it friet met satésaus (fries with satay sauce)?”
• For each split: exhaustive search over all features and thresholds
Conditional Entropy
Ex: what is the entropy of cloudiness Y, given that it is raining? What is the entropy of cloudiness Y, if we know
whether it is raining or not raining?
Properties:
• H is always non-negative
• Chain rule: H(X, Y ) = H(X|Y ) + H(Y ) = H(Y |X ) + H(X )
• If X and Y independent, then X doesn’t tell us anything about Y: H(Y |X ) = H(Y )
• But Y tells us everything about Y : H(Y |Y ) = 0
• By knowing X , we can only decrease uncertainty about Y : H(Y|X ) ≤ H(Y )
Information Gain
• How much information about cloudiness do we get by discovering whether it is raining?
IG(Y |X ) = H(Y ) − H(Y |X ) ≈ 1 – 0.75 ≈ 0.25 bits
• Also called information gain in Y due to X
• If X is completely uninformative about Y : IG (Y |X ) = 0
• If X is completely informative about Y : IG (Y |X ) = H(Y )
• We use this to construct our decision tree
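A hand-rolled sketch of entropy and information gain (it does not reproduce the exact cloudy/rainy numbers from the slides, only the computation):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x):
    """IG(Y|X) = H(Y) - H(Y|X) for a discrete feature x."""
    h_y = entropy(y)
    h_y_given_x = sum(np.mean(x == v) * entropy(y[x == v]) for v in np.unique(x))
    return h_y - h_y_given_x

y = np.array(["cloudy", "cloudy", "clear", "clear"])   # target
x = np.array(["rain", "rain", "dry", "dry"])           # perfectly informative feature
print(entropy(y), information_gain(y, x))              # H(Y) = 1 bit, IG = 1 bit
```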
Model Complexity
• The complexity of the model induced by a decision tree is determined by the depth of the tree
• Increasing the depth of the tree increases the number of decision boundaries and may lead to overfitting
• Pre-pruning and post-pruning
• Limit tree size (pick one):
• max_depth
• max_leaf_nodes
• min_samples_split
• (and more)
Advantages
• Suitable for multi-class classification
• Model is easily interpretable (as it’s a series of if-else conditions)
• Can handle numerical and categorical data
• Non-linear
• Can tolerate missing values

Disadvantages
• Prone to overfitting without pruning
• Weak learners: a single decision tree does not make great predictions; multiple trees can be combined to create stronger ensemble models
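A small sketch of a decision tree with pre-pruning via max_depth; iris is a stand-in dataset and export_text just prints the if-else structure:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)  # limit depth to reduce overfitting

# each internal node is a threshold test on a single feature
print(export_text(tree, feature_names=load_iris().feature_names))
```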
Lecture 9- Ensemble Models
Bias and Variance
Estimating Generalization Error
Generalization error = bias + variance + noise.
Bias and variance typically trade off in relation to model complexity.
Ensemble methods try to reduce bias and/or variance of weak (single) models by combining several of them
together to achieve better performances
Simple (weak/base) models:
• Logistic Regression
• Naïve Bayes
• kNN
• (Shallow) Decision Trees
• Kernel SVMs
These base models may not perform well by themselves because they
• have a high bias (e.g. low degree of freedom models), or
• have too much variance to be robust (e.g. high degree of freedom models).
Voting
Build different models
• Classifiers that are most “sure” will vote with more conviction
• Classifiers will be most “sure” about a particular part of the space
• Average the result
More models are better – if they are not correlated.
Also works with neural networks
You can average any models as long as they provide calibrated (“good”) probabilities.
Scikit-learn: VotingClassifier
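A minimal sketch of soft voting over three different base models (the choice of models is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB()),
                ("knn", KNeighborsClassifier())],
    voting="soft",   # average the predicted probabilities
)
print(vote.fit(X, y).score(X, y))
```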
Bagging (Bootstrap Aggregation)
• Generic way to build “slightly different” models
• Draw bootstrap samples from the dataset (as many samples as there are in the dataset, with replacement)
• Implemented in BaggingClassifier, BaggingRegressor
Bagging- fits several independent models and “average” their predictions to obtain a model with a lower
variance.
• Fitting fully independent models require too much data
• Relies on bootstrapping samples
• Creates multiple bootstrap samples
• Each new bootstrap sample will act as another independent dataset
• Fit weak learners for each sample,
• Aggregate them (average the output)
• Regression : simple average
• Classification problem:
• simple majority vote (hard voting)
• highest average probability (soft voting)
Random Forests
• Strong learners composed of multiple trees can be called “forests”.
• Trees can be shallow (small depth) or deep (large depth, if not fully grown).
o Shallow trees: less variance but higher bias, a better choice for sequential (boosting) methods
o Deep trees: low bias but high variance, a better choice for the bagging method, which mainly focuses on reducing variance.
• Random forest approach is a bagging method where deep trees, fitted on bootstrap samples, are combined to
produce an output with lower variance
o Randomize in 2 ways to reduce correlation:
§ For each tree- Pick bootstrap sample of data
• Samples over the observations in the dataset to generate a bootstrap sample
§ For each split- Pick random sample of features
• Samples over features and keep only a random subset of them to build the tree
o More trees are always better
Classification and regression with RDFs
Classification: the mode of the classes outputted by the trees.
Regression: the mean of the values outputted by the trees.
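An illustrative sketch of random forests for classification and regression; the hyper-parameter values are arbitrary:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

Xc, yc = make_classification(n_samples=300, random_state=0)
Xr, yr = make_regression(n_samples=300, noise=10, random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0).fit(Xc, yc)
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xr, yr)

print(clf.predict(Xc[:3]))   # mode of the trees' predicted classes
print(reg.predict(Xr[:3]))   # mean of the trees' predicted values
```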
Boosting
• Fits sequentially multiple weak learners
adaptively
• Each model in the sequence is fitted giving
more importance to observations in the
dataset that were badly handled by the
previous models in the sequence.
Ada Boost
• Adaptive Boosting (AdaBoost)
• Add the weak learners one by one, looking at each iteration for the best possible pair (coefficient, weak
learner) to add to the current ensemble model.
• Updates observation weights in the dataset and train a new weak learner with a special focus given to the
observations misclassified by the current ensemble model.
• Adds the weak learner to the weighted sum according to an update coefficient that expresses the
performances of this weak model: the better a weak learner performs, the more it contributes to the strong
learner.
At the very beginning of the algorithm (first model of the
sequence), all the observations have the same weights 1/N.
Then, we repeat L times (for the L learners in the sequence)
the following steps:
• Fit the best possible weak model with the current
observations weights
• Compute the value of the update coefficient that is some
kind of scalar evaluation metric of the weak learner that
indicates how much this weak learner should be taken into
account into the ensemble model
• Update the strong learner by adding the new weak learner
multiplied by its update coefficient
• Compute new observation weights that express which observations we would like to focus on at the next iteration (weights of observations wrongly predicted by the aggregated model increase and weights of the correctly predicted observations decrease)
AdaBoost
• Increases the predictive accuracy by assigning weights to the observations at the end of every tree and weights (scores) to every classifier.
• Every classifier has a different weight on the final prediction.
• Boosting occurs sequentially
Recap - Random forest
• Parallel operations
• All trees are assigned equal weights.
Gradient Boosting- Combines weak learners to form a strong learner.
• Residual of the current classifier becomes the input for the next
consecutive classifier on which the trees are built (sequential
model)
• The residuals are captured in a step-by-step manner by the classifiers, in order to capture the maximum variance within the data.
• Done by introducing the learning rate to the classifiers.
• Many shallow trees
• learning_rate ↔ n_estimators
• Serial: slower to train than Random Forests, but much faster to predict
• Small model size
• Uses one-vs-rest for multi-class!
Gradient Boosting regression
1) Make an initial prediction by computing the average of the target values (1st prediction): (32.1+18.5+46.6+24+18)/5 = 27.84
2) Compute the residuals (called pseudo residuals)
3) Combine predicted value with residuals (scaled by a learning rate)
• For a learning rate of 0.1, the predicted value for 1st row is 27.84 + 0.1 * 4.26 = 28.266
4) Compute new residuals
5) Compute the new predicted values
• New predicted values for the first row: 27.84 + (0.1*4.26) + (0.1*3.83) = 28.649
Gradient Boosting- Start by setting the pseudo-residuals equal to the observation values. Then, repeat L times (for the L models of the sequence) the following steps:
• fit the best possible weak learner to pseudo-residuals (approximate the opposite of the gradient with respect to
the current strong learner)
• compute the value of the optimal step size that defines by how much we update the ensemble model in the
direction of the new weak learner
• update the ensemble model by adding the new weak learner multiplied by the step size (make a step of
gradient descent)
• compute new pseudo-residuals that indicate, for each observation, in which direction we would like to update the ensemble model predictions next
Tuning Gradient Boosting
• Pick n_estimators, tune learning rate
• Can also tune max_features
• Typically strong pruning via max_depth
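An illustrative sketch of gradient boosting regression with the learning_rate / n_estimators / max_depth knobs discussed above; the values are arbitrary and the data is synthetic, not the worked example's table:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, noise=10, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=200,    # number of sequential shallow trees
    learning_rate=0.1,   # scales each tree's contribution (as in the 0.1 steps above)
    max_depth=3,         # strong pruning: many shallow trees
    random_state=0,
).fit(X, y)
print(gbr.predict(X[:3]))
```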
AdaBoost vs Gradient Boosting

AdaBoost
• Additive model
• Misclassifications of previous models are identified by increasing the weights of the misclassified observations
• Trees are usually grown as decision stumps
• Each classifier has a different weight assigned to the final prediction based on its performance
• Both classifiers and observations are weighted to capture the maximum variance

Gradient Boosting
• Additive model
• Misclassifications of previous models are identified by the gradient
• Trees are grown to a greater depth (more than 1)
• Classifiers are weighted equally; predictive capacity is restricted by a learning rate
• Builds trees on the residuals of the previous classifier – to capture the variance in the data
Stacking
• Bagging and Boosting mainly consider homogeneous weak learners.
• Stacking
o learns several different (heterogeneous) weak learners
o combines with base models by training a meta model
to output predictions based on multiple predictions
returned by weak models
• Classification problem example
o Choose some weak learners: kNN classifiers, logistic regression, and an SVM
o Choose a neural network as a meta model
o Output of the 3 weak learners = input to neural network
o Output of neural network = Final prediction
Fitting a stacking ensemble
Steps:
• Split the training data in two folds
• Choose L weak learners and fit them to data of the first fold
• For each of the L weak learners, make predictions for observations in the second fold
• Fit the meta-model on the second fold, using predictions made by the weak learners as inputs
Limitation: Only half of the data to train the base models and half of the data to train the meta-model.
Solution: “k-fold cross-training” approach (similar to what is done in k-fold cross-validation) such that all the
observations can be used to train the meta-model.
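A sketch of stacking with heterogeneous base models; note that scikit-learn's StackingClassifier does the k-fold "cross-training" internally, and a logistic regression is used here as the meta-model instead of the neural network from the lecture example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(),   # meta-model trained on the base models' predictions
    cv=5,                                   # k-fold cross-training of the base models
)
print(stack.fit(X, y).score(X, y))
```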
Poor man’s Stacking- build multiple models; train model on probabilities/ scores produced
Hold-out estimates of probabilities
• Split 1 produces probabilities for Fold 1, split2 for Fold 2 etc.
• Get a probability estimate for each data point!
• Unbiased estimates (like on the test set) for the whole training set!
Summary
Ensemble learning
• multiple models (often called weak learners or base models) are trained to solve the same problem and
combined to get better performances
• the main hypothesis is that if we combine the weak learners the right way we can obtain more accurate and/or
robust models
Voting
• Build different models.
• Classifiers that are most “sure” will vote with more conviction
• Average the result
Bagging methods (an ensemble model with a lower variance)
• several instances of the same base model are trained in parallel (independently from each other) on different bootstrap samples and then aggregated in some kind of “averaging” process
• an ensemble model with a lower variance
Boosting methods (an ensemble model with a lower bias)
• several instances of the same base model are trained sequentially. At each iteration, the way to train the current weak learner depends on the previous weak learners and especially on how they are performing on the data
Stacking method
• Different weak learners are fitted independently from each other, and a meta-model is trained on top of them to predict outputs based on the outputs returned by the base models
When to use Tree-based models
• Model non-linear relationships
• Doesn’t care about scaling, no need for feature engineering!
• Single tree: very interpretable (if small)
• Random forests: very robust, good benchmark
• Gradient boosting: often best performance with careful tuning
Lecture 10- Model Evaluation, Learning with imbalanced data
Generalization performance
• A model should always be evaluated on independent test data.
• A model’s performance on unseen data will give us the generalization performance of the model.
• Focus on supervised learning methods (evaluation of unsupervised methods is more qualitative)
Benefits of cross-validation
• Leaves less to luck: if we get a very good or bad training set by chance, this will show in the results → performance will be an outlier
• Shows how sensitive the model is to the training data set. High variance means high sensitivity to the training data.
Disadvantages
• Increased computational cost
• Simple cross-validation can result in class imbalance between training and test sets
Stratified Cross-Validation-> makes sure there is no class imbalance in the different folds
Leave-one-out cross validation (LOO)-> k-fold cross-validation, where k=N and N=the number of items in the
dataset
• Very time consuming
• Generates predictions given the maximal available data
• Can be useful to find out which items are regular and irregular from the point-of-view of the dataset.
Binary Classification
Goal setting!
What do I want? What do I care about?
• (precision, recall, something else)
Can I assign costs to the confusion matrix?
• (i.e. a false positive costs me $10, a false negative $100)
What guarantees do we want to give?
Multi-class classification metrics
Macro-average F1:
Average F1 scores over classes (“all classes are equally important”)
Weighted F1: Mean of the per-class f-scores, weighted by their support.
(“bigger classes are important”)
Micro-average F1: Make one binary confusion matrix over all classes, then
compute recall, precision once (“all samples are equally important”)
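A short sketch of the three F1 averaging strategies; y_true and y_pred are invented labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(f1_score(y_true, y_pred, average="macro"))     # all classes equally important
print(f1_score(y_true, y_pred, average="weighted"))  # classes weighted by support
print(f1_score(y_true, y_pred, average="micro"))     # all samples equally important
```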
Imbalanced Data
SUMMARY
LECTURE 11- DIMENSIONALITY REDUCTION
Supervised Learning
• Given a set of data and labels, learn a model which will predict a label for new data.
• Given D = {Xi,Yi} learn a model (or function) F: Xk -> Yk
Unsupervised Learning
• Discover patterns in data
• Given D = {Xi} group the data into Y classes using a model (or function)
• F: Xi -> Yj
Ex:
• Discovering trending topics on Twitter or in the news
• Grouping data into clusters for easier analysis
• Outlier detection (e.g. Fraud detection and security systems)
• A simple classification model in a
high-dimensional space (e.g., a linear
decision boundary (plane) in 3D)
• Corresponds to a complex classification
model in low-dimensional space (e.g.,
non-linear decision boundaries in 2D)
• Overfitting is associated with (too)
complex models
=> too many features may lead to overfitting too
Informative features: We want to increase the number of features to put all the relevant information in the
classifier
Curse of dimensionality: We want to decrease the number of features to
avoid the curse of dimensionality
Machine learning algorithms should optimize the trade-off between informative features and curse of
dimensionality by means of dimensionality reduction techniques
Dimensionality Reduction
Benefits of applying dimensionality reduction to a dataset:
• Reduced space required to store the data as the number of dimensions comes down
• Less dimensions lead to less computation/training time
• Some algorithms do not perform well when we have a large number of dimensions (e.g. kNN)
• Takes care of correlations by removing redundant features. (E.g. you have two highly correlated variables – ‘time spent on treadmill in minutes’ and ‘calories burnt’. The more time you spend running on a treadmill, the more calories you will burn.)
=> there is no point in storing both, as just one of them does what you require
• Visualizing data
2 different ways:
Feature selection:
• Keeping the most relevant variables from the original dataset
• E.g. Random Forests, Decision Trees
• Removing features with too many missing values

Dimensionality reduction:
• Finding a smaller set of new variables, each being a combination of the input variables, containing basically the same information as the input variables
• Unsupervised approach
• E.g. Principal Component Analysis, Non-negative Matrix Factorization, t-SNE (t-distributed Stochastic Neighbor Embedding)
Principal Component Analysis
An orthogonal linear combination of the original variables
• First principal component explains maximum variance in the dataset
• Second principal component-> explain the remaining variance in the dataset; uncorrelated to the first principal
component
• Third principal component-> explain the variance which is not explained by the first two principal components, etc
Computing PCA
Taking the whole dataset ignoring the class labels
• Compute the mean and covariance
• Center the data (subtract mean). (In practice: scale to unit variance)
Construction of the projection matrix (W) that will be used to transform the data
• Projection matrix = matrix of our concatenated top k eigenvectors
Higher Dimensions
• For datasets >2 features, PCA rotates the coordinate system in such a way that:
• the projection of the data on the first principal component (new axis) has the largest variance
• the projection of the data on the second principal component (new axis) has the one-but-largest variance, etc
• If the variation in the data is associated with relevance for classification (or regression), the most relevant
features are captured by the first principal components (and the rest captures noise)
• Retaining the first principal components and throwing away the rest
effectively reduces the dimensionality
How many features (principal components) to keep?
No fixed rule that defines how many features should be used in a
classification problem; depends on:
• the amount of training data available,
• the complexity of the decision boundaries, and
• the type of classifier used
Total Explained Variance: can be used to decide the number of
features.
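A minimal PCA sketch showing the explained variance per component and the total explained variance; iris is only an example dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # centre and scale to unit variance first

pca = PCA(n_components=2).fit(X_scaled)
X_2d = pca.transform(X_scaled)                 # projection on the first two principal components

print(pca.explained_variance_ratio_)           # variance explained per component
print(pca.explained_variance_ratio_.sum())     # total explained variance with 2 components
```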
Non-negative Matrix Factorization
Dimensionality Reduction for Data Visualization: Manifold learning
Manifold Learning- allow for much more complex mappings and often provide better
visualizations
Learn underlying “manifold” structure, use for dimensionality reduction
Pro: pretty pictures
Cons: Visualisation only, axes don’t correspond to anything in the input space, often can’t
transform new data
t-SNE
• Starts with a random embedding
• Iteratively updates points to make “close”
points close.
• Global distances are less important,
neighbourhood counts.
• Good for getting coarse view of topology.
• Can be good for finding interesting data
points.
Tuning Parameters
• n_components (default: 2): Dimension of the
embedded space.
• perplexity (default: 30): The perplexity is related to
the number of nearest neighbors that are used in other
manifold learning algorithms. Consider selecting a
value between 5 and 50.
• early_exaggeration (default: 12.0): Controls how
tight natural clusters in the original space are in the embedded space and how much space will be between them.
• learning_rate (default: 200.0): The learning rate for t-SNE is usually in the range (10.0, 1000.0).
• n_iter (default: 1000): Maximum number of iterations for the optimization. Should be at least 250.
LECTURE 12: CLUSTERING
GOALS
• Data exploration: Are there coherent groups? How many groups are there?
• Data partitioning= divide data by group before further processing.
• Unsupervised feature extraction= Derive features from clusters or cluster distances
E.g. Clustering techniques
• K-means, Hierarchical Clustering, Density Based Techniques, Gaussian Mixtures Models
ALGORITHM
1. Choose the number of clusters, K
2. Randomly choose initial positions of K centroids
3. Assign each of the points to the "nearest centroid" (depending on the distance measure)
4. Recompute centroid positions
5. If solution converges -> Stop, else go the step 3.
• New data points can be assigned cluster membership based on existing clusters.
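A minimal sketch of k-means and mini-batch k-means; K = 3 and the blob data are illustrative:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # recomputed centroid positions after convergence
print(km.predict(X[:5]))     # new points get the label of the nearest centroid
```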
Computational Properties:
• By default K-means in sklearn does 10 random restarts with
different initializations.
• For large datasets, K-means initialization may take much longer than
clustering.
• Consider using init='random', in particular for MiniBatchKMeans
• Uses mini-batches to reduce the computation time, while still
attempting to optimise the same objective function (partial_fit)
MiniBatchKMeans
Mini-batches= subsets of the input data (rather than the whole data), randomly sampled in each training iteration.
Algorithm
1. Draw samples randomly from the dataset, to form a mini-batch. Assign to nearest centroid
2. Update the centroids by using a convex combination of the average of the samples and the previous samples
assigned to that centroid.
3. Perform 1 and 2 until convergence or for a fixed number of iterations,
Hierarchical Clustering
= a series of partitions from a single cluster containing all the data points to N clusters
containing 1 data point each.
Hierarchical clustering is heavily influenced by:
- distance metric (Euclidian, Manhattan, Maximum)
- linkage criterion- determines how clusters are merged
• the distance between 2 clusters is a function of the pairwise
distance between each point
• clusters that minimize this function are then combined
• linkage criteria are broadly equivalent for single point clusters
• Complete linkage: maximum distance between points in each
cluster
• Single linkage: minimum distance between points in each
cluster
• Average Linkage: average distance between points in each cluster
• Centroid Linkage: distance between cluster centroids
• Ward linkage: based on cluster variance (merges the two clusters that give the smallest increase in total within-cluster variance)
DBSCAN
= Finds core samples of high density and expands clusters from them
• A sample is a “core sample” if at least min_samples samples lie within distance epsilon of it (a “dense region”)
• Allows complex cluster shapes
• Can detect outliers
• Needs two parameters to adjust; epsilon is hard to pick (can be done based on the number of clusters though).
• Can learn arbitrary cluster shapes
• Limitations
• Varying densities
• High-dimensional data
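A minimal DBSCAN sketch; eps and min_samples are the two parameters mentioned above, and the two-moons data illustrates non-convex cluster shapes:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster labels; -1 marks points considered outliers/noise
```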
Mixture Models
Silhouette Coefficient
• A function S that measures the separation between two clusters, c1 and c2.
• How can we measure the goodness of a clustering C = c1, ... cl, using the separation function S?
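A short sketch of using the silhouette coefficient to compare different numbers of clusters K on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher = better separated clusters
```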