ML Lectures Summary 2

Lecture 1- Intro and Preprocessing

ML = teaching a computer to do a task using data


- Methods that extract knowledge from data; closely related to stats and optimization
- Focused on prediction

Example: height difference between two dog breeds

• If height > 25, more likely greyhound
• If height < 25, more likely labrador
- We can add more features
- Decision boundary = cut-off threshold

Structured data: highly organized, made up mostly of tables with rows and columns that define their meaning (e.g. Excel spreadsheets and relational databases).
Unstructured data: everything else, e.g.:
• Email messages, text messages, ...
• Text files, including Word documents; audio files of music, voicemails, ...
• Video files that include movies, personal videos, YouTube uploads, ...
• Images: pictures, illustrations, memes, ...
Why ML?
- Volume of data collected grows daily
- Data production is 44 times greater in 2020 than in 2009.
- Every day, 2.5 quintillion bytes of data are created.
- 90% of the data in the world was created within the past two years
- Data = cheap and abundant
- Knowledge = expensive and scarce
- To make sense of all the unstructured data: we need knowledge discovery
- Machine Learning: computers learn from data to aid knowledge discovery

Supervised Learning: given a set of data and labels, learn a model which will predict a label for new data. D = {Xi, Yi}, learn F: Xk -> Yk
• Used to automate manual labour
Unsupervised Learning: discover patterns in data. D = {Xi}, group the data into Y classes using a model or function F: Xi -> Yj
• Discovering trending topics
• Grouping data into clusters
• Outlier detection
Reinforcement Learning: reasoning under uncertainty to make optimal decisions (what actions to be taken to maximize reward). D = {environment (e), actions (a), rewards (r)}, learn policy and utility functions
• Policy: F1: {e, r} -> a
• Utility: F2: {a, e} -> r

Other kinds of learning: Semi-supervised Learning; Active Learning

Machine Learning Workflow Terminology
Instance: Pikachu
Label/Class: Mouse
Features/Attributes: Abilities, Weight,
Legendary
Feature Values: lightning rod, 2, yes
Feature Vector: (lightning rod, 2, yes)

A model: an equation that links the values of some features to the predicted value of the
target variable;
• finding the equation (and coefficients in it) is called ‘building a model’ (see also
‘fitting a model’).
Score functions/Fit statistics/Score metrics – measures of how well the model fits the data.
Feature selection – reducing the number of predictors by selecting the important ones
(dimensionality reduction).
Feature extraction – reducing the number of predictors by means of a
mathematical operation (e.g., PCA).

Two main types of Supervised Learning


Classification
• Discrete output
• Ex: color, gender, yes/no, class membership
• Will you pass this course?
Regression
• Continuous output
• Ex: temperature, age, distance, salary
• How many points will you get in the exam?

Dummy Classifier & Dummy Regressor


• Do not generate any insight about the data
• Serves as a simple baseline to compare against other more complex classifiers/
regressors; if performance is negative => worse than random/dummy
Dummy Classifier: classifies the given data using only simple strategies: most-frequent, uniform, constant
Dummy Regressor: makes predictions using simple strategies: mean, median
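A minimal scikit-learn sketch of this baseline idea; the data here is synthetic and only for illustration:

```python
# Baseline sketch: DummyClassifier / DummyRegressor on made-up data.
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

X = np.random.rand(100, 3)             # 100 instances, 3 features
y = np.random.randint(0, 2, size=100)  # binary labels

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
print("dummy classifier accuracy:", clf.score(X, y))

y_cont = np.random.rand(100)           # continuous target
reg = DummyRegressor(strategy="mean").fit(X, y_cont)
print("dummy regressor R^2:", reg.score(X, y_cont))
```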

Types of Data

Images: 2D arrays of numbers (RGB values for each pixel)
Text: words/letters need to be converted into a format understandable to computers

Preprocessing

Scaling data = multiplying all instances of a variable by a constant to change the variable's range
- With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales

Standard scaler = z-scores (standard scores): mean of 0 and standard


deviation of 1
- common method in data normalization (good for non-skewed data)
- data is scaled with standard scaler if mean of scaled data is around 0

Robust scaler = Same as Standard Scaler, but with median instead of mean and
interquartile range instead of standard deviation.
- Better for skewed data
- Deals better with outliers
- data is scaled with the robust scaler if the median of the scaled data is around 0
MinMax Scaler= shifts data to an interval set by xmin and xmax
- when data has a bounded range/ distribution is not Gaussian (ex: image processing: 0-255)
Normalizer= Does not work by feature (column) but by row (sample)
- Each row of data is rescaled so that its norm becomes 1
- Compute the norm of the vector (square root of the squared elements)
- Divide each element by the norm
- Used only when the direction of data matters
- Helpful for histograms
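A sketch comparing the four scalers described above on a small made-up matrix (the second column deliberately has a different scale):

```python
# Scaler comparison sketch: each transform prints the rescaled matrix.
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 4000.0]])   # second feature has a very different range / an outlier

print(StandardScaler().fit_transform(X))  # per column: mean 0, std 1
print(RobustScaler().fit_transform(X))    # per column: median 0, scaled by IQR
print(MinMaxScaler().fit_transform(X))    # per column: range [0, 1]
print(Normalizer().fit_transform(X))      # per row: unit (L2) norm
```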

Univariate Transformations

Accounts for one variable’s effect on a DV


Ex: logarithmic, geometric, power ...
- Most ML models perform best with Gaussian
distributed data (bell curve)
- Methods to transform data to Gaussian include Box-Cox
transform and Yeo-Johnson transform
o Parameters can be automatically estimated to
minimize skewness and stabilize the variance.

Transforming to log scale:

Binning = separate feature values into n categories (e.g., equally spaced over the range of values)
- Replace all values within a category by a single value, e.g., the mean.
- Effective for models with few parameters, such as regression
- Not effective for models with many parameters, such as decision trees

Guiding Principles in ML

Measuring Classification Success


- How “predictive” are the models we have learnt?
- New data is probably not exactly the same as the
training data
• What happens if we overfit our data?

To avoid over-fitting:
Build a classifier using the training set and evaluate it using the test
set.

Cross Validation- To evaluate (test) your model’s ability
to predict new data
- Detect overfitting or selection bias
- Techniques:
o K-fold cross validation
o Leave one out (K-fold cross validation to the extreme)

Machine Learning Pipelines = workflows to execute a sequence of tasks
- Data normalization (scaling)
- Imputation of missing values
- Dimensionality Reduction
- Classification
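A sketch of such a pipeline (imputation, scaling, dimensionality reduction, classification) evaluated with k-fold cross-validation; the dataset and hyperparameter values are placeholders:

```python
# Pipeline + cross-validation sketch on synthetic data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=5)),
    ("clf", KNeighborsClassifier(n_neighbors=5)),
])
scores = cross_val_score(pipe, X, y, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())
```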

Lecture 2- Preprocessing & Feature Engineering

Missing values imputation (Preprocessing)

= missing values have no standard encoding (0, blank, NA, NaN, ...)
Imputation: replacing a missing value with an estimate for that value
• Mean/median
• KNN (k = 1: replace with the value of the nearest neighbour)
• Model-driven
• Iterative
Feature = column
Instance = row
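A sketch of the mean/median and KNN imputation strategies listed above; the small matrix is made up:

```python
# Imputation sketch: compare simple and KNN-based strategies.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 9.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))    # column mean
print(SimpleImputer(strategy="median").fit_transform(X))  # column median
print(KNNImputer(n_neighbors=1).fit_transform(X))         # value of the nearest neighbour
```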
Feature Selection

Why select features?
• Avoid overfitting
• Faster prediction and training
• Less storage for model and dataset
Strategies
• Univariate statistics (f_regression, f_classif)
• Model-based selection (Lasso)
• Iterative selection (random forests)
Univariate Statistics
• Look at each feature individually
• Features will be removed if they do not have a significant relationship with the target
• Features that are significant only in combination with another feature (interaction) will be
removed.
• Selecting features with the highest confidence is related to ANOVA (from statistics)
Pick statistic, check p-values!
• f_regression, f_classif, chi2 in scikit-learn
Mutual information is also univariate, but does not assume a linear model (as the F statistics do). It measures the reduction in uncertainty for one variable given a known value of the other variable. It can be used with SelectKBest etc.
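A sketch of univariate selection with the scoring functions named above, on a built-in scikit-learn dataset:

```python
# Univariate feature selection sketch: ANOVA F-statistic vs. mutual information.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=10)   # ANOVA F-statistic
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)

# mutual information: univariate, but does not assume a linear relationship
mi = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print(mi.get_support())   # boolean mask of the selected features
```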

Model-Based selection
• Build a model, select the features most important to the model
• Lasso, other linear models, tree-based models
• Multivariate – linear models assume a linear relation
• Get the best fit for a particular model
• Ideally: exhaustive search over all possible combinations (GridSearch)
• Exhaustive search is infeasible (and has multiple-testing issues)
• Use heuristics in practice

Iterative Model-Based Selection


• Forwards: Start with single feature, find
most important feature, add, iterate
• Backwards: Fit model, find least
important feature, remove, iterate
• Computationally expensive

RFE: Recursive Feature Elimination and selection
To identify a dataset's key features: at each iteration, find the least significant feature and remove it, until the desired number of features is reached.
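A sketch of RFE with a random forest as the underlying model (the choice of estimator and number of features is an assumption for illustration):

```python
# Recursive feature elimination sketch.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=5,   # desired number of features
          step=1)                   # remove one (least important) feature per iteration
rfe.fit(X, y)
print(rfe.support_)   # mask of kept features
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```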

Categorical variables
Data often has categorical (or discrete) features.
Remember measurement levels:
• Categorical- entities divided into distinct categories; no particular relationship
between them; no order, no interval
• Ordinal- logical order
• Interval-equal intervals on the variable represent equal differences in the measured
property; no true 0
• Ratio- same as interval, but the ratios of scores on the scale must also make sense and
have true 0 value
Often necessary to represent categorical features as numbers.

One-Hot Encoding – for features, not labels (y_train)
• It is important that the math used by machine learning models is not affected by the encoding → impossible to use 1, 2, 3, ...
• Add one feature for each category (the feature encodes whether a sample belongs to this category or not)
• Creates a binary column for each category
→ all colours are equally distant

Count-based encoding
• For high-cardinality categorical features (e.g. countries)
• Instead of 50 one-hot variables, replace the label with the value of a variable combined over that label
• For regression: "people in this state have an average response of y"
• Binary classification: "people in this state have likelihood p for class 1"
• Multiclass: one feature per class: probability distribution
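A sketch of one-hot encoding a single categorical feature; the colour column is made up:

```python
# One-hot encoding sketch: one binary column per category.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colours = np.array([["red"], ["green"], ["blue"], ["green"]])
enc = OneHotEncoder(handle_unknown="ignore")
print(enc.fit_transform(colours).toarray())     # all colours are equally distant
print(enc.get_feature_names_out(["colour"]))
```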

Working with images
Digital images
• The values are all discrete and integers
• Can be considered as a large array of discrete dots
• Each dot has a brightness associated with it
• These dots are called picture elements – pixels
Arrays and Images
• Images are represented as matrices (e.g. numpy arrays)
• Can be written as a function f(x, y) -> pixel intensity
• Types of images: Binary Images, Grayscale Images and Color Images

Binary images (1-bit images)
• Each pixel is either black or white
• Only two possible values for each pixel (0, 1)
• Only need one bit per pixel
Grayscale images
• Each pixel is a shade of gray
• Normally from 0 (black) to 255 (white); each pixel can be represented by 8 bits, or exactly one byte
• Other grayscale ranges are used, but generally are a power of 2 (e.g. 2^2 = 4, 2^6 = 64)

Multi-channel Images
• Such an image is a stack of multiple matrices,
representing the multiple channel values for each
pixel
• E.g RGB color is described by the amount of
red, green and blue in it.

Machine Learning with Images

Measures for segmentation
• A segmentation result can be measured if the ground truth is known
• Empirical Measures:
• Accuracy, Precision and Recall
• F-score, Jaccard Index

Precision: what proportion of positive identifications was actually correct? (no FP => precision = 1)
Recall: what proportion of actual positives was identified correctly? (no FN => recall = 1)
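A sketch computing these measures from made-up ground-truth and predicted labels:

```python
# Evaluation-metrics sketch: accuracy, precision, recall, F-score, Jaccard index.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, jaccard_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))
print("Jaccard  :", jaccard_score(y_true, y_pred))
```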

Transforming Text Data

Text data- can consist of words, sentences, or entire documents, often varying in length
High variability: punctuation, different forms for the same word, typos, Capitalized letters,
etc ...
(a) Representing Text Data
- A sentence can be broken down into individual words.
- Each word is represented as a categorical variable (e.g., using one-hot encoding)
- One-hot encoded vectors are possibly:
o very large (10,000s of words in a general vocabulary)
o very sparse (vectors are all “0”s with a single “1”)
- Problems
o Concatenating all word vectors results in massive vectors.
o Sentences have unequal length, which is unsuitable for most ML methods.

Bag-of-Words representation
- Most common technique to numerically represent text is Bag of
Words.
- Represents each sentence or document as a vector with a value
for each word in the vocabulary.
o Binary: word present or absent in the document
o Count: how often the word appears in the document
o Popular approach: Term Frequency x Inverse
Document Frequency (TF-IDF)
- Use a single vector of length equal to the size of the vocabulary.
- Each component is the number of times that word appears in the sample (e.g., in a
sentence or document).
- Useful for sentiment analysis (e.g.recognizing words such as “great”, “terrible” ...)

Term Frequency (TF) = (Number of times term t appears in a document)/(Number of terms
in the document)
Inverse Document Frequency (IDF) = log(N/n), where,
- N= number of documents and
- n= number of documents a term t has appeared in
- IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low.
Thus having the effect of highlighting words that are distinct.
- We calculate TF-IDF value of a term = TF * IDF

Advantages:
• Highly interpretable. Each word is an independent feature.
• Simple method.
• Fairly effective approach for some applications

Limitations:
• All structure is lost!
- Crucial information may be lost. e.g. “I passed the IML exam, but I failed the
Computational Linguistics exam”
• Misspellings:
- e.g. “machine” and “machnie” will be counted as a different word
• Some expressions consist of different (multiple) words
- e.g. product review with the word “worth” -> What if the review said “not worth” vs
“definitely worth”
Tackling expressions with multiple words with n-grams: instead of using individual words as tokens, use groups of n consecutive words.
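A sketch of bag-of-words and TF-IDF representations, using word 2-grams so that expressions such as "not worth" stay together; the two reviews are made up:

```python
# Bag-of-words / TF-IDF sketch.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["definitely worth the money", "not worth the money at all"]

counts = CountVectorizer()                  # raw word counts
print(counts.fit_transform(docs).toarray())
print(counts.get_feature_names_out())

tfidf = TfidfVectorizer(ngram_range=(1, 2)) # unigrams + bigrams, TF-IDF weighted
X = tfidf.fit_transform(docs)
print(X.shape)
```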

(b) Text Data Preprocessing
• Tokenization — convert sentences to words
- Process of breaking a stream of textual content up into words, terms,
symbols, or some other meaningful elements called tokens.
- The list of tokens turns into input for additional processing including parsing or text
mining.
- Tokenization can swap out sensitive data
• E.g. Typically payment card or bank account numbers—with a randomized number
in the same format
• Restricting the Vocabulary
- Removing unnecessary punctuation, tags
- Removing stop words— frequent words such as ”the”, ”is”, etc. that have low
semantic content
- Removing infrequent words
o Words that appear only once or twice might not be
helpful
o Restrict vocabulary size to only most frequent words
(for less features)
• Stemming —words are reduced to a root by removing inflection
by dropping unnecessary characters, usually a suffix.
- Ex: studies -> studi; studying -> study
• Lemmatization —Another approach to remove inflection by determining the part of speech
and utilizing detailed database of the language.
- Ex: studies -> study; studying -> study

Lecture 3- K-nearest neighbours


Unsupervised learning- to find patterns in the data (and hope they make sense)
Supervised learning- for prediction
• Classification (discrete output, discrete label)
• Regression (continuous output)
Classifiers- try to draw a boundary (straight/wiggled line)
• Decision boundary learned from data
• The class of a point on the decision boundary is ambiguous
• trained on the dataset (labeled data points) and automatically “draw” a decision boundary
between the two classes
• The decision boundary can be a straight line (“stiff”) or a wiggly line (“flexible”)
o considered to be a model of the separation between the two classes
o The model is induced from the data
o The complexity of the model is proportional to the wiggliness of the decision boundary

Classification: training a model to separate the data into two or multiple classes
Regression: fitting the data to describe the relationship between two features or between a feature and the label
• Ex: linear relationship

Nearest-neighbour Classifiers (1-NN)
= Given a set of labeled instances (training set), new instances (test set)
are classified according to their nearest labeled neighbour
• For classifiers, we take numerical data
• Convert unstructured data to numerical

K-NN classifier
3-NN:
• K= hyper-parameter, represents the number of labeled neighbors to
consider
• Test points are assigned the majority label of k nearest neighbours
• Special cases:
o k = N: since all datapoints are considered, the predicted label for a
test point will always be the majority label of all datapoints.
Equivalent to a majority classifier.
o Ties: in case of a tie between predicted labels, there are different
possibilities. The most common one is random selection from the tied labels.
• more neighbours=> Less complex decision boundary

Chair distance rules


• You can move to an adjacent chair in the same row;
• From the chairs at the ends of each row, you can move to the
corresponding chair in the row above or below;
• For any pair of chairs (i,j), the chair distance between chair i and chair j
is equal to the number of moves you need to make to move from chair i to
chair j.

K-Nearest Neighbours Classification

Weights in k-NN= extension of the basic algorithm: not all neighbours get an equal vote
Ex: rating of best restaurant in town, opinion of people living closer to the city count more
Distance-weighting= each neighbour has a weight which is based on its distance to the data
point to be classified (closer a point is to the center of the cell being estimated, the more
influence, or weight, it has in the averaging process)
• Inverse distance weighting – each point has a weight equal to the inverse of its distance to
the point to be classified (neighboring points have a higher vote)
• Inverse of the square of the distance
• Kernel functions (Gaussian kernel, tricube kernel)
• If we change the distance function, the results will change.

• Implication: with distance weighting, k=n is no longer equivalent to a majority based
classifier.
Computing distance in k-NN
Euclidean distance (straight line); Manhattan distance (distance between projections on the axes)

K determines model complexity


• The model in k-NN is the decision boundary that
separates the classes (In regression, the model is the line
that fits the data)
• Smaller k leads to more complex decision boundaries
• k too low -> danger of overfitting, high complexity
• k too high -> danger of underfitting, low complexity
Start with the simplest model (large k in k-NN), and increase complexity (smaller k)
How to choose k:
• Typically odd for an even number of classes (e.g., 1,
3, 5, 7, ...)
• As you decrease k, accuracy might increase, but so
does computational complexity
• In other words, a small value of k is likely to lead to
overfitting (fitting “noise”)
• A rule of thumb: k = √n (n = number of training samples)
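A sketch of k-NN classification with uniform vs. inverse-distance weighting, as discussed above; the dataset and k are placeholders:

```python
# k-NN classification sketch: uniform vs. distance-weighted voting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for weights in ("uniform", "distance"):   # "distance" = inverse-distance weighting
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights, metric="euclidean")
    knn.fit(X_train, y_train)
    print(weights, knn.score(X_test, y_test))
```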

Overfitting the test set:

Nearest Centroid classifier

= take the nearest centroid and compare how close you are to it
• find the mean of the clusters and compare how close the test data is
from them. The class whose centroid it is closest to, in squared
distance, is the predicted class for the new sample

Nearest shrunken centroid classification


• "shrinks" each of the class centroids toward the overall centroid for all classes by an amount
we call the threshold . This shrinkage consists of moving the centroid towards zero by
threshold, setting it equal to zero if it hits zero.
• Ex: if threshold was 2.0, a centroid of 3.2 would be shrunk to 1.2, a centroid of -3.4
would be shrunk to -1.4, and a centroid of 1.2 would be shrunk to zero.
• After shrinking the centroids, the new sample is classified by the usual nearest centroid
rule, but using the shrunken class centroids.
Advantages: 1) it can make the classifier more accurate by reducing the effect of noisy genes (features),
2) it does automatic gene (feature) selection

KNN Regression
= use k-nn to fit your data
• takes the distances and fits the best line on it
• can weight your regressor based on distance; tries to fit your outliers as well

• k-NN classification combines the discrete predictions of k-neighbours,


• k-NN regression combines continuous predictions; fits the best line between the neighbors

KNN for Missing value imputation

k-NN Advantages
• The cost of the learning process is zero
• No assumptions about the characteristics of the concepts to learn have to be made
• Complex concepts can be learned by local approximation using simple procedures
k-NN Disadvantages
• The model cannot be interpreted (there is no description of the learned concepts)
• It is computationally expensive to find the k nearest neighbours when the dataset is very large
• Performance depends on the number of dimensions that we have (curse of dimensionality)

Curse of Dimensionality and Overfitting
Classification task: CATS versus DOGS- 10 instances (images of cats and dogs)
Feature 1: average amount of red color in image
• More information is needed for classification, therefore we add a
second feature
Feature 2: average amount of green color in image
• Even more information is needed for classification, therefore we
add a third feature
Feature 3: average amount of blue color in image

• In three dimensions (= three features), perfect separation of CATS


and DOGS is possible with a decision boundary (plane)

This example suggests that by adding (informative) features, classification is improved.


This is often the case; however, adding new features also increases the volume of the feature space exponentially.
For instance:
• 1 feature: 10 possible feature values
• 2 features: 100 possible feature values
• 3 features: 1000 possible feature values

Lecture 4- Linear Regression

Regression
In machine learning, supervised learning to predict continuous outputs (y) is called
regression.
Needed to predict outputs:
• Continuous or categorical input features (x);
• Training examples: many x for which y is known (e.g. many people of whom we know the
height, predicting the housing prices);
• A model, a function that represents the relationship between x and y;
• A cost function, which tells us how well our model approximates the training examples;
• Optimization, a way of finding parameters for the model while minimizing the loss
function.

Linear regression (ordinary least squares)


• Given an input feature x we would like to predict an output y
• In linear regression we assume that x and y are related by the equation y = m·x + e, where m is a parameter and e represents measurement or other noise
• Goal: estimate m from training data for x and y
• Most common approach: minimize the least squares error (LSE): Σᵢ (yᵢ − m·xᵢ)²
• Least squares minimizes the squared distance between measurements and the regression line and is easy to compute

Solving linear regression with least squares minimization:


Approach: find the minimum of the curve to minimize the cost function
• Take the derivative of the sum of squared deviations with respect to m
• Set the first derivative to 0 and solve for the coefficient m

Bias/Intercept
If the line does not pass through the origin:
• Introduce a bias (intercept) term (b)
• Parameter b is estimated from the differences between the values of y and their estimates (m·x); adding it increases model complexity (more degrees of freedom)

Finding m and b

Fitting a line instead of a polynomial to a quadratic function

MSE of the fit

Using Scikit-Learn:
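The slide code itself is not reproduced here; a minimal equivalent sketch on a synthetic noisy line:

```python
# Ordinary least squares sketch with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.rand(100, 1)                              # one input feature
y = 3.0 * x[:, 0] + 2.0 + 0.1 * rng.randn(100)    # y = m*x + b + noise

model = LinearRegression().fit(x, y)
print(model.coef_, model.intercept_)              # estimates of m and b
print(model.predict([[0.5]]))
```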

Multi-dimensional Inputs
• The model can be extended for additional input dimensions

• Output y is a linear function of the input features x; Can be a line, plane or


hyperplane
Calibration: Linear vs Polynomial fit
• Curve fitting: finding a mathematical
function (constructing a curve) that has the
best fit on a series of data points
• “smoothing” – not looking for an exact fit
but for a curve that fits the data
approximately
• Linear: y = ax + b (first-degree polynomial) – fits a curve to 2 points
• y = ax^2 + bx + c (second-degree polynomial) – fits a curve to 3 points
• y = ax^3 + bx^2 + cx + d (third-degree polynomial) – fits a curve to 4 points
Best fit vs perfect fit
• Perfect fit – Goes through all points
in the data.
• The best fit may not be the perfect fit.
• The best fit should give the best
predictive value.
Overfitting -> fitting a high degree polynomial

Measuring Fit
• Fit – accuracy of a predictive model; the extent to which predicted values of a target
variable are close to the observed values of that variable
• For regression models, the fit can be expressed as R2 (the percentage of variance
explained by the model)
• Residual variance: SSR = Σᵢ (yᵢ − ŷᵢ)², where ŷᵢ is the predicted value for yᵢ
• Total variance: SST = Σᵢ (yᵢ − ȳ)²
• R² = 1 – (SSR / SST)

Overfitting vs Underfitting
• Machine learning is so effective in finding the best fit that it is likely to construct a complex
model that would never generalize to unseen data.
• However, a complex model that reduces prediction error and yields a better fit also models
noise.
• The relation between the complexity of the induced model and underfitting and overfitting
is a crucial notion in data mining
Underfitting:
• The induced model is not complex (flexible) enough to model the data
• Performs badly both on the training and the validation set
Overfitting:
• The induced model is too complex to model the data (tries to fit noise)
• Performs better on the training set than on the validation set
Validation Set
Tuning hyper-parameters:
• Never use test data for tuning the hyper-parameters
• We can divide the set of training examples into two
disjoint sets: training and validation
• Use the first set (i.e., training) to estimate the
coefficients m for different values of hyperparameter(s) (degree of the polynomial)
• Use the second set (i.e., validation) to estimate the best degree of the polynomial, by
evaluating how well the classifier does on this second set
• Then, test how well it generalizes to unseen data

To overcome overfitting in regression: Reduce the model complexity; Regularization


Regularization= reduction of the magnitude (strength of
association) of the coefficients
• We do not want overfitting => we limit the variation of the
parameters to prevent extreme fits to the training set.
• In a way, we limit the contribution of not effective parameters to make our function simpler.

Methods
Ridge Regression
• Reduces model complexity by coefficient shrinkage
• Penalty term controlled by alpha: the higher the value of alpha, the bigger the penalty, and therefore the more the magnitude of the coefficients is reduced
Lasso Regression
• LASSO (Least Absolute Shrinkage and Selection Operator)
• Magnitude of coefficients reduced even at small values of alpha
• Lasso reduces some of the coefficients to zero; this property is known as feature selection, which is absent in the case of ridge regression
Elastic Net Regression
• Combines the L1 (Lasso) and L2 (Ridge) penalties

Ridge Regression examples:

Lasso Regression Example:
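The slide examples themselves are not reproduced; a minimal combined sketch on a built-in dataset (the alpha values are arbitrary), showing that Lasso sets some coefficients exactly to zero while Ridge only shrinks them:

```python
# Ridge vs. Lasso sketch.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha controls the penalty strength
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)   # several are exactly 0
```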

Gradient Descent

- m is a parameter, e.g. your weights, biases and activations. Notice that we only update a single parameter, i.e. we could update a single weight.
- η is the learning rate (eta); sometimes alpha α or gamma γ is used.
- J is formally known as the objective function, but most often it is called the cost function or loss function.
=> Update rule: m := m − η · ∂J(m)/∂m, i.e. we take each parameter m and subtract the learning rate η times the gradient of the cost function with respect to m.

Minimize the loss function J(m) by moving the parameters m in the opposite direction of the gradient of J(m).
Learning rate: describes the speed (step size) of the descent.
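A sketch of gradient descent for simple linear regression (one parameter m, no bias), matching the update rule m := m − η·∂J/∂m above; the data and learning rate are made up:

```python
# Gradient descent sketch for y = m*x with a mean-squared-error cost.
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
y = 3.0 * x + 0.05 * rng.randn(100)   # true slope is 3

m, eta = 0.0, 0.1                      # initial parameter and learning rate
for _ in range(1000):
    y_hat = m * x
    grad = (2.0 / len(x)) * np.sum((y_hat - y) * x)   # dJ/dm for J = MSE
    m -= eta * grad                                   # step against the gradient
print(m)   # close to 3
```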

Lecture 5- Logistic Regression
Linear Regression (recap L4)
can be expressed as y = w₀ + w₁·x₁
If the number of features increases to n => y = w₀ + w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ

ML terms:
Parameters : variables learned (found) during training, e.g weights (w)
Hyperparameters: variables whose value is set before the training process begins (e.g regularization parameters (alpha),
number of neighbors (k))
Loss function (or error): what you are trying to minimize for a single training example to achieve your objective (e.g. squared loss (y − ŷ)²)
Cost function: average of your loss functions over the entire training set (e.g. mean squared error)
Objective function: Any function that you optimize during training (e.g. maximum likelihood, divergence between classes)
• A loss function is a part of a cost function which is a type of an objective function.

Gradient Descent (recap L4)


• To find optimal values for w
• iterative optimization algorithm that operates over a loss landscape (cost function)
• Follow the slope of the gradient W to reach the minimum cost
For Linear Regression
Cost function: average of your loss functions over the entire training set (e.g. mean square error)

Gradient of cost function: The direction of your steps to achieve your objective

Learning rate: the size of the steps taken in any direction

Logistic Regression
• is a classifier, not a regressor
Regression for Classification
• In some cases, we can use linear regression to determine an appropriate boundary
• Linear regression : output is a linear function of features
• Linear classification
o decision boundary is a linear function of the input
o Logistic Regression (classifier)
o Linear Support Vector machines

Logistic Regression (Classifier)


• Linear classifier which uses calculated logits (score) to predict the target class.
• Replace the sign (.) in a linear function with a sigmoid or logistic function

Linear Regression Linear Classification

Sigmoid Function: σ(z) = 1 / (1 + e^(−z)); assumes a particular functional form (a sigmoid) applied to the linear function of the data
• Output is a smooth and differentiable function of the inputs and the weights

Logistic Regression
Assumes a particular functional form (a sigmoid) is applied to the linear function of the data
• One parameter per data dimension (feature) and the bias
• Features can be discrete or continuous
• Output of the model between 0 and 1
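A sketch of logistic regression as a linear classifier with probabilistic output, on a built-in dataset (the max_iter value is just to ensure convergence):

```python
# Logistic regression sketch: accuracy, class probabilities, one weight per feature.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.score(X_test, y_test))          # accuracy
print(clf.predict_proba(X_test[:3]))      # outputs between 0 and 1 per class
print(clf.coef_.shape)                    # one parameter per feature (+ intercept_)
```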

ML Terms:
Decision Boundary (for classification): Single line/contour which separate data points into regions
• What is the output label at the boundary?- undetermined?
Probabilistic Interpretation- can be used to model class probability
Two classes => C=class X=instance

Sum of probabilities= 1

Loss Functions
Our goal in training is to find the best set of weights and biases that minimizes the loss function.
Sum of Squared loss for regression Cross entropy loss for classification

Entropy = amount of information, level of uncertainty
For a distribution p(x), the entropy is H(X) = −Σₓ p(x) log p(x)

Loss Function for Logistic Regression


Why don't we use the sum of squares error as our cost function in logistic regression?
• We could still use it, but it is no longer convex because of the sigmoid function.
Rather, we use a logarithmic (cross-entropy) loss:
For a single sample: L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
For many samples (vector notation): J(w) = −(1/N) Σᵢ [yᵢ·log(ŷᵢ) + (1 − yᵢ)·log(1 − ŷᵢ)]

Cross Entropy

• Also known as logarithmic loss, log loss or logistic loss


• Predicted class probability compared to actual class for output of 0 or 1
• Score calculated penalizes probability based on how far it is from actual value
• Penalty is logarithmic in nature.
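A sketch computing the cross-entropy (log) loss by hand and with scikit-learn on made-up predictions:

```python
# Cross-entropy (log loss) sketch.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
p_hat = np.array([0.9, 0.2, 0.6, 0.4])   # predicted probability of class 1

manual = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(manual)
print(log_loss(y_true, p_hat))           # same value
```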

Algorithm for Logistic Regression

Gradient Descent for Logistic Regression

Regularization
= Regularization is any modification to a learning algorithm that is intended to reduce its generalization error but not its
training error
- A way to cope with the excessive degrees of freedom to limit your parameter space; put a penalty on having
extreme values on the weights to be optimized => narrow range, better fit on the data
• Similar to other data estimation problems, we may not have enough samples to learn good models for logistic regression
classification
• One way to overcome this is to ‘regularize’ the model, impose additional constraints on the parameters we are fitting
• By adding a prior w

L1 vs L2 Regularization

L1 (Lasso- Least Absolute Shrinkage and Selection Operator)


- Encourages sparsity
- adds “absolute value of magnitude” of coefficient as penalty term
- if 𝛼 is zero, we get back the original Linear Regression
- if 𝛼 is very large, almost all the coefficients are zero, which will lead to under-fitting

Squared L2 (Ridge):
- Encourages small weights
- adds “squared magnitude” of coefficient as penalty term
- if 𝛼 is zero, we get back the original Linear Regression
- if 𝛼 is very large, too much weight is added to the penalty and it will lead to
under-fitting

Ex: a Gaussian prior with zero mean and identity covariance
- the log likelihood then gains an extra penalty term on the weights
- this prior pushes the parameters/coefficients (m) towards 0
- when we include this prior, the gradient changes accordingly
- The parameter α is the importance of the regularization, and it is a hyper-parameter

Multi-class Classification

Prediction: class with highest score Prediction: label samples based on areas
Logistic regression with multiple classes
Logistic regression can be used to classify data from more than 2 classes. Assume we have k classes then:

In scikit-learn: LogisticRegression(multi_class='multinomial')

Logistic Regression Classifier


Advantages:
• Easily extended to multiple classes
• Probability distributions available
• Quick to train
• Good accuracy for many simple datasets
• Resistant to overfitting
• Can interpret model coefficients

Disadvantages:
• Linear decision boundary

Lecture 6- Neural Networks
Conventional ML
Feature Extraction

• Extracting features without knowing their relevance


for the target variable?
• More Features -> more information; but: Curse of
dimensionality
o Increased complexity, but not performance

Limitations of Linear Classifiers


• Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features
• Many decisions involve non-linear functions of the input
Ex: classes cannot be separated by a single or a straight line or a plane (positive and negative cases)

Non-linear models with complex features

Highly nonlinear functions:


- Speech recognition
- Computer vision

Nonlinear classifiers
Goal: To construct non-linear discriminative classifiers that utilize functions of input variables
Neural Network approach
• Use a large number of simpler (activation) functions
• Functions are fixed (Gaussian, sigmoid, polynomial basis functions)
• Optimization involves linear combinations of these fixed functions
Layers: Input layer(1), Hidden layers (multiple), Output layer(1)
The primate brain
Inspiration: the brain – ~10^11 neurons
- Each neuron communicates with ~10^4 other neurons

Artificial Neural
Networks
• Neural networks define functions of the inputs (hidden features), computed by
neurons
• Artificial neurons are called units

Neural Network Architecture (Multi-Layer Perceptron)


Each unit computes its value based on linear combination of values of units that
point into it, and an activation function
Two different
visualizations of a 2-layer neural network. In this example: 3
input units, 4 hidden units, and 2 output units

Naming conventions; an N-layer neural network has:


• N - 1 layers of hidden units
• One output layer

Representational power
Neural network with at least one hidden layer is a universal
approximator (can represent any function).
The capacity of the network increases with more hidden units
and more hidden layers

Neural Network Components
• Input layer: x, Independent variable
• An arbitrary amount of hidden layers
• Output layer: y , dependent variable
• A set of weights (coefficients) and biases at each layer (W and b)
• A choice of activation functions for each layer σ.

Training Neural Networks


The output of a simple 2-layer Neural Network is ŷ = σ(W₂ · σ(W₁·x + b₁) + b₂)

• Only the weights and biases (W and b) affect the output


• Each iteration consists of
• Feedforward : Calculating the predicted output (y)
• Backpropagation: Updating the weights and biases

Activation functions
• Applied on the hidden units
• Achieve nonlinearity
• Popular activation functions:

Forward pass: performs inference


Backward pass: performs learning
• Change weights and biases to reduce the error
• A routine to compute gradient
• Use chain rule of derivative of the loss function

Backpropagation
Loss can be measured

• Propagating the error back and updating (adjusting) our weights and
biases to minimize loss.
• How do we find the appropriate amount to adjust?
• Compute the derivative of the loss function with respect to weights and
biases
The derivative of the function is the slope of the function
Gradient descent: updating the weights and biases by increasing or
reducing it.

Back-propagation: an efficient method for computing gradients


needed to perform gradient-based optimization of the weights in
a multi-layer network
Given any error function E, we just need to derive the gradients
of the activation functions
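A minimal NumPy sketch of the 2-layer network above: a feedforward pass and backpropagation steps via the chain rule, assuming sigmoid activations and a squared-error loss (the tiny dataset is made up):

```python
# Feedforward + backpropagation sketch for a 2-layer network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.rand(4, 3)                      # 4 samples, 3 input units
y = np.array([[0.], [1.], [1.], [0.]])  # 1 output unit

W1, b1 = rng.randn(3, 4), np.zeros(4)   # 4 hidden units
W2, b2 = rng.randn(4, 1), np.zeros(1)
eta = 0.5                               # learning rate

for _ in range(1000):
    # feedforward
    h = sigmoid(X @ W1 + b1)            # hidden layer
    y_hat = sigmoid(h @ W2 + b2)        # output layer
    # backpropagation of the squared-error loss
    d_out = (y_hat - y) * y_hat * (1 - y_hat)     # dL/d(pre-activation of output)
    d_hid = (d_out @ W2.T) * h * (1 - h)          # chain rule back to the hidden layer
    W2 -= eta * h.T @ d_out;  b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_hid;  b1 -= eta * d_hid.sum(axis=0)

print(np.round(y_hat, 2))               # predictions approach the targets
```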

Deep Learning
Deep Neural Network (Deep Learning)
- A multilayer perceptron, or neural network is typically considered deep when it has multiple
(hidden) layers
o can learn a hierarchical feature representation:
o first layer- extract some simple features
o more abstract, high-level representations as you go deeper into the layers
o the extracted features are used in the end for classification

CNN for Computer Vision- image = input

MLP vs Convolutional Neural Network

- Algorithm extracts relevant features to the target on its own, and makes the
classification by itself

Activation map
Convolution Layer- Connect each hidden unit to a small input patch and share the weight across
space. This is called a convolution layer and the network is a convolutional network
- Filters its input to reveal patterns
- Filter (convolutional kernel) sliding over the entire image and extracts these numbers (result is
higher => the filter pattern exists at that particular location)

- At each layer this is repeated (using the activation map from the previous layer) => abstraction increases
- Each kernel has its own bias term (used to calculate the output)
- At the end, the output is used as input for the next layer

Pooling Layer: by "pooling" (e.g., taking the max of) filter responses at different locations we gain robustness to the exact spatial location of features.
- Max Pooling layer: finds the maximum locally, which indicates the maximum response of the filtering from the previous layer
- Scales down the representation => the region that you are looking at gets bigger (the same kernel is looking at a bigger area)
- Another advantage: decreases the number of computations needed

Fully Connected NN
Algorithms that can be developed with CNNs:
Segmentation

Classification

Detection

Lecture 7- Support Vector Machines


Parameters and Hyperparameters
For k-NN there are no learned parameters (k is a hyperparameter); k-NN does not do any training
Validation set
- to tune the hyperparameters; never tune on test set
Classification problem- irregular boundaries
• Irregular distribution
• Imbalanced training sizes
• Outliers

SVM with 1 Feature: Max-margin classification (Hard Margin SVM)
- Focus on the observations at the edges of each cluster and take the midpoint between them as the threshold (maximal margin)
• A better way to choose a threshold

Linear support vector machines
• Focus on boundary points instead of fitting all the points
• Goal: learn a boundary that leads to the largest margin (buffer) from points on
both sides
• Support vectors= subset of vectors that support (determine the boundary)
o After training an SVM, a support vector is any instance located on the margin
o The decision boundary is entirely determined by the support vectors.
o Any points that are not a support vector have no influence whatsoever; you could remove them, add more
points, or move them around, and as long as they stay off the street they won’t affect the decision
boundary.
o Computing an SVM's predictions involves only the support vectors, not the whole training set
Decision Function- by computing it => Linear SVM classifier predicts the class of a new instance x
- Takes a dataset as input, gives a decision as output

- Result > 0 => predicted class (y) is 1 (positive)


- Else => -1 (negative)
- Inputs between the margins are of unknown class

Linear SVM Classifier- Decision Function


• For a dataset of 2 features, decision function is a 2D plane.
• Decision boundary – set of points where decision function = 0
• Dashed lines represent points where decision function is 1 or
–1:
• They are parallel and at equal distance to the decision
boundary, forming a margin around it.

Training a Margin-Based Classifier


• Decision function of SVMs : maximize the margin between
the data points and the hyperplane
• We can search for the optimal parameters (w and b) by finding a solution that:
1. Correctly classifies the training examples
2. Maximizes the margin
3. Can be found through optimisation via projective gradient descent, etc.
• The slope of the decision function equals the norm of the weight vector
• The smaller the weight vector w, the larger the margin
• If we divide the slope by 2, the points where the decision function is equal to ±1 will be twice as far away from the decision boundary

Hard Margin SVMs -> Define t⁽ⁱ⁾ = –1 for negative instances (if y⁽ⁱ⁾ = 0) and t⁽ⁱ⁾ = 1 for positive instances (if y⁽ⁱ⁾ = 1); then we can express the constraint as t⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + b) ≥ 1 for all training instances

Hard Margin Classification with Outliers


- Hard margin Classification is very
sensitive to outliers

Issues

Soft Margin SVM
- Data is not linearly separable =>
o Introduce slack variables
o Allow an error e⁽ʲ⁾ in classification
o Based on the output of the discriminant function wᵀx + b
o e⁽ʲ⁾ approximates the number of misclassified samples
- e⁽ʲ⁾ measures how much the j-th instance is allowed to violate the margin

- Soft margin SVMs have two conflicting objectives:


1. Making the slack variables as small as possible to reduce the margin violations,
2. Making wT · w as small as possible to increase the margin
- C Hyperparameter allows us to define the tradeoff between these two objectives
Hinge Loss = loss function that incorporates a margin of distance from the classification boundary into the cost calculation
Linear Soft-Margin SVM loss function:
• An observation that is located directly on the boundary would incur a loss of 1, regardless of whether the real outcome was +1 or -1.
• When an observation has a distance of 1.5 from the boundary (outcome +1)
• Observations that fall on the correct side of the decision boundary (hyperplane) but are within the margin incur a cost between 0 and 1.

Summary of the Hinge Loss


• The hinge loss is a special type of cost function that penalizes misclassified samples and correctly classified
ones that are within a defined margin from the decision boundary.
• All observations that end up on the wrong side of the hyperplane will incur a loss that is greater than 1 and
increases linearly.
• If the actual outcome was 1 and the classifier predicted 0.5, the corresponding loss would be 0.5 even though
the classification is correct.

Linear SVM cost function


• The score of correct category should be greater than sum of scores of
all incorrect categories by some safety margin (usually one).
• Although not differentiable, it’s a convex function which makes it
easy to work with usual convex optimizers used in machine learning
domain.

Kernel SVMs

One way to make a linear model more flexible is by adding more features.
EX: by adding interactions or polynomials of the input features. (Expand the set of input features, say by also
adding feature1 ** 2, the square of the second feature, as a new feature)
• When we transform back this line to original plane, it maps to ellipse boundary. These transformations are
called kernels.
• As a function of the original features, the linear SVM model is not actually linear anymore

Handling Overlapping Classifications


- Where should we put the classifier?

Soft Margin SVMs with 2 Features
- The threshold is a line; the blue lines are the margins
Kernel functions transform data to higher dimensions
- E.g. adding a square of the dosages as a new feature

The main idea


• start with data in a relatively low dimension (in this example one
dimension dosage in mg)
• add higher dimensions (in this example from one to two dimensions)
• find a Support Vector Classifier that separates the higher dimensional
data into two groups

Introducing Kernels
• Kernels work best for “small” n_samples
• Long runtime for “large” datasets (100k samples)
• Real power in infinite-dimensional spaces: rbf!
• Rbf is “universal kernel” - can learn (aka overfit)
anything

Common Kernels:
Polynomial Kernel:

SVM- Similarity function


- Another technique to tackle nonlinear problems is to add features computed using a similarity function that
measures how much each instance resembles a particular landmark.

Parameters for Gaussian RBF Kernels


Gamma= bandwidth; relates to scaling of data

SVM hyperparameters:
• Linear SVM hyperparameters- C (regularization)
• Polynomial SVM hyperparameters- C (regularization),
d (polynomial degree)
• RBF SVM hyperparameters- C (regularization), gamma
(width of kernel)
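A sketch of an RBF-kernel SVM; the C and gamma values are arbitrary placeholders that should be tuned together, and the data is scaled first because SVMs are sensitive to feature scales:

```python
# Kernel SVM sketch: scaling + RBF SVC in one pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.1))
rbf_svm.fit(X_train, y_train)
print(rbf_svm.score(X_test, y_test))
```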

SVM Regression
Just like in the classification setting, in the regression setting we may also have
that not all observations fit our
requirement.
Hard-Margin SVM- Not all observations may fit our requirement-> Solved by
introducing variables ε:

SVM for Regression


• The goal is to find a function f(x) that has at most ε deviation from the actually obtained targets yi for all the training data, and at the same time is as flat as possible.
• In other words, we do not care about errors as long as they are less than ε, but we will not accept any deviation larger than this.
• Slack variables εi, εi*

• Fix epsilon based on application/outliers


• Linear kernel → robust linear regression
• Poly/rbf kernel → robust non-linear regression
To tackle nonlinear regression tasks, you can use a kernelized SVM model (code in slides)
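The slide code is not reproduced here; a hedged sketch of kernelized SVM regression on synthetic non-linear data (C and epsilon are arbitrary):

```python
# SVM regression sketch with an RBF kernel.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)        # non-linear target with noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)       # epsilon = width of the tube
svr.fit(X, y)
print(svr.predict([[1.5]]), np.sin(1.5))
```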
Multi-Class Classification

→ Class with highest score

Cost function for SVM


Multi class SVM Loss

Computing Hinge Loss


• Predicted score of the j-th class for the i-th data point: sⱼ = f(xᵢ, W)ⱼ
• Hinge loss function: Lᵢ = Σ_{j≠yᵢ} max(0, sⱼ − s_{yᵢ} + 1)
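A sketch of this multi-class hinge loss for a single data point, assuming three class scores and a margin of 1 (the numbers are made up):

```python
# Multi-class hinge loss sketch for one sample.
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # predicted scores for 3 classes
y_i = 0                               # index of the correct class

margins = np.maximum(0, scores - scores[y_i] + 1.0)
margins[y_i] = 0                      # the correct class does not contribute
loss = margins.sum()
print(loss)                           # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```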

One-Class Classification
• Learning from examples of just one class, e.g. positive examples?
• Desirable if there are many types of negative examples.
• "Outlier/Novelty Detection" problems can be formulated as one-class classification

One class SVMs


• “One-Class SVM” (OC-SVM). Proposed by [Scholkopf et al., 2001]
• Find a max-margin hyperplane separating positives from origin (representing negatives)

Summary- SVMs
• A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane.
• Given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which
categorizes new examples.
• In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

Kernel SVMs
• The important parameters
• the regularization parameter C,
• the choice of the kernel, and the kernel-specific parameters.
• The RBF kernel has only one parameter, gamma, which is the inverse of the width of the Gaussian kernel.
gamma and C both control the complexity of the model,
• C and gamma should be adjusted together.

Linear SVMs
Advantages
• Accuracy
• Works well on smaller, cleaner datasets
• Can be more efficient because it uses a subset of training points
Disadvantages
• Not suited to larger datasets as the training time with SVMs can be high
• Less effective on noisier datasets with overlapping classes
• Linear SVMs have a linear decision boundary
• Originally designed as a two-class classifier

Kernel SVMs
Advantages
• Allow for complex decision boundaries, even if the data has only a few features
• Work well on low-dimensional and high-dimensional data (i.e., few and many features)
Disadvantages
• Do not scale very well with the number of samples. Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
• Require careful preprocessing of the data and tuning of the parameters.
• SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a nonexpert.
• Still, it might be worth trying SVMs, particularly if all of your features represent measurements in similar units (e.g., all are pixel intensities) and they are on similar scales.

Lecture 8- Naive Bayes and Decision Trees


Brief Introduction to Probability
Random variable= an element / event whose status is unknown:
• A = “it will rain tomorrow”
Domain= The set of values a random variable can take:
• A = The stock market will go up this year : Binary
• A = Number of times that Netherlands qualified for the World Cup : Discrete
• A = % change in a stock price : Continuous

Probability distributions
1. 0 ≤ P(A) ≤ 1
2. P(true) = 1, P(false) = 0
3. Set Theory: P(A ∪ B) = P(A) + P(B) – P(A ∩ B)

Priors= Degree of belief in an event in the absence of any other information


• E.g.
• P(rain tomorrow) = 0.8
• P(no rain tomorrow) = 0.2

Conditional Probability
• P(A = 1 | B = 1): the fraction of cases where A is true given that B is true
• In some cases, given knowledge of one or more random variables, we can improve our
prior belief of another random variable
• p(slept in movie) = 0.5,
• p(slept in movie | liked movie) = 1/4
• p(didn’t sleep in movie | liked movie) = 3/4

Chain Rule
• The joint distribution can be specified in terms of conditional probability: p(x, y) = p(x|y) p(y)
• Together with Bayes rule (which is actually derived from it) this is one of the most powerful rules in
probabilistic reasoning
p(slept in movie, liked movie) = 1/8;
p(slept in movie | liked movie) = 1/4
p(liked movie) = 1/2;
Bayes Rule: P(A|B) = P(B|A) P(A) / P(B)

Classification Problem

X= observations (features)
Y= class (label)

Types of Classifiers
Instance-based classifiers
• Use observations directly, without models
• e.g. k nearest neighbours
Generative classifiers
• Build a generative statistical model
• e.g. Bayes classifiers
Discriminative classifiers
• Directly estimate a decision rule/boundary
• e.g. decision trees

Naïve Bayes Classifiers

Probability that you will pass the exam if your teacher is Larry:
P(x) = P(Larry) = 28/80 = 0.35
P(c) = P(pass) = 56/80 = 0.70
P(x|c) = P(Larry|pass) = 19/56 ≈ 0.34
P(c|x) = P(x|c)·P(c) / P(x) = 0.34 · 0.7 / 0.35 = 0.68 => 68%

• Given class variable y and dependent feature vector x1 through xn, Bayes' theorem gives
P(y | x1, ..., xn) = P(y) · P(x1, ..., xn | y) / P(x1, ..., xn)
• Assuming conditional independence between every pair of features given the class,
P(xi | y, x1, ..., xi−1, xi+1, ..., xn) = P(xi | y)
• we get
P(y | x1, ..., xn) ∝ P(y) · Πᵢ P(xi | y)
• Decision: take the class with the highest posterior probability p(y|x)

Naïve Bayes classifiers for continuous values


In many cases the data contains continuous features:
• Height, weight
• Levels of genes in cells
• Brain activity
For these types of data we often use a Gaussian model (normal distribution)
In this model we assume that the observed input vector X is generated from a normal distribution

Bernoulli Naïve Bayes
- Uses discrete data; features are binary: x = 0 or x = 1 (only 2 values)
- Bernoulli distribution: p = probability of success, q = 1 − p = probability of failure

Bernoulli Naïve Bayes Problem
Classify whether a person passes or fails an exam based on various features.
Instance X: Confident = Yes, Studied = Yes, Sick = No

• Prior class probabilities p(y): P(Pass) = 3/5, P(Fail) = 2/5
• P(x) = P(Confident = Yes) * P(Studied = Yes) * P(Sick = No) = (3/5) * (3/5) * (2/5) = 0.144
• Likelihoods (probability with respect to each feature):
P(Confident=Yes | Result=Pass) = 2/3     P(Confident=Yes | Result=Fail) = 1/2
P(Studied=Yes | Result=Pass) = 2/3       P(Studied=Yes | Result=Fail) = 1/2
P(Sick=No | Result=Pass) = 1/3           P(Sick=No | Result=Fail) = 1/2
P(X|Result=Pass) x P(Result=Pass) =      P(X|Result=Fail) x P(Result=Fail) =
(2/3) * (2/3) * (1/3) * (3/5) ≈ 0.089    (1/2) * (1/2) * (1/2) * (2/5) = 0.05

P(Result=Pass|X) = 0.089/0.144 ≈ 0.62 > P(Result=Fail|X) = 0.05/0.144 ≈ 0.35

Gaussian Naive Bayes classifier (GaussianNB())
• assumes that features follow a normal distribution
Multinomial Naive Bayes (MultinomialNB())
• assumes count data: each feature represents an integer count of something, like how often a word appears in a sentence
Bernoulli Naive Bayes (BernoulliNB())
• assumes your feature vectors are binary (i.e. zeros and ones), or continuous values which can be precisely split (binarized) with a predefined threshold
Naïve Bayes Classifiers
Advantages
• Simple (doing a bunch of counts)
• Works well with a small amount of training data
• The class with the highest probability is considered the most likely class
Disadvantages
• Parameter estimation: determining p(x|y) is unreliable when there are not enough cases of a label y
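A sketch of the three Naive Bayes variants listed above on matching (synthetic) data types:

```python
# Naive Bayes sketch: Gaussian, multinomial, and Bernoulli variants.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.random.randint(0, 2, 100)

# continuous features -> GaussianNB
X_cont = np.random.randn(100, 4)
print(GaussianNB().fit(X_cont, y).predict(X_cont[:3]))

# count features (e.g., word counts) -> MultinomialNB
X_counts = np.random.randint(0, 10, size=(100, 4))
print(MultinomialNB().fit(X_counts, y).predict(X_counts[:3]))

# binary features -> BernoulliNB
X_bin = np.random.randint(0, 2, size=(100, 4))
print(BernoulliNB().fit(X_bin, y).predict(X_bin[:3]))
```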

Decision Trees
Decision Trees take one feature at a time and test a binary condition
For instance: is the feature larger than 0.5?
If the answer is YES, grow a node to the left
If the answer is NO, grow a node to the right
• Internal nodes correspond to attributes
(features)
• Leafs correspond to classification outcome
(output)
• Branching is determined by the best attribute
value
• Decision Tree grows with each level of
questions
• Each node (box) of the decision tree tests a condition on a feature. “Questions”= thresholds on single features
• All decision boundaries are perpendicular to the feature axes, because at each node a decision is made about a
single feature
Algorithm
• Choose an attribute on which to descend at each level
• Condition on earlier (higher) choices
• Generally, restrict only one dimension at a time
• Declare an output value when you get to the bottom
If an attribute is continuous:
• Suppose we have a training set with the attribute "weight" where Weight = 50, 54, 65, 75, 82, 95
• The algorithm will consider the following possible splits:
  Weight <= 54 AND Weight > 50
  Weight <= 65 AND Weight > 54
  Weight <= 75 AND Weight > 65
  Weight <= 82 AND Weight > 75
  Weight <= 95 AND Weight > 82
To construct a useful decision tree
1. Start from an empty decision tree
2. Split on the next best attribute
3. Repeat

• What is the best attribute?-> Can be determined using Information Theory, Gini Coefficient, Conditional
Entropy
• Similar to the game of “20 questions”
• The order of features is important
• "Guess the food": it is better to start with the question "Is it a dessert?", rather than with "Is it friet met satésaus?"
• For each split: exhaustive search over all features and thresholds

Criteria for classification:
Entropy – level of uncertainty: H(X) = −Σₓ p(x) log₂ p(x)

Entropy of a Joint Distribution


Ex: X= {Raining, Not raining}, Y= {Cloudy, Not cloudy}

Conditional Entropy
Ex: what is the entropy of cloudiness Y, given that it is raining? What is the entropy of cloudiness Y, if we know
whether it is raining or not raining?

Properties:
• H is always non-negative
• Chain rule: H(X, Y ) = H(X|Y ) + H(Y ) = H(Y |X ) + H(X )
• If X and Y independent, then X doesn’t tell us anything about Y: H(Y |X ) = H(Y )
• But Y tells us everything about Y : H(Y |Y ) = 0
• By knowing X , we can only decrease uncertainty about Y : H(Y|X ) ≤ H(Y )

Information Gain
• How much information about cloudiness do we get by discovering whether it is raining?
IG(Y |X ) = H(Y ) − H(Y |X ) ≈ 1 – 0.75 ≈ 0.25 bits
• Also called information gain in Y due to X
• If X is completely uninformative about Y : IG (Y |X ) = 0
• If X is completely informative about Y : IG (Y |X ) = H(Y )
• We use this to construct our decision tree
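A sketch computing entropy and information gain following IG(Y|X) = H(Y) − H(Y|X); the joint distribution here is made up, so the numbers differ slightly from the ≈0.25 bits quoted above:

```python
# Entropy / information gain sketch.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Y = cloudy?, X = raining?  (made-up distributions)
H_Y = entropy([0.5, 0.5])                       # H(Y) = 1 bit
H_Y_given_rain = entropy([0.9, 0.1])            # entropy of Y when it rains
H_Y_given_dry = entropy([0.4, 0.6])             # entropy of Y when it does not rain
H_Y_given_X = 0.5 * H_Y_given_rain + 0.5 * H_Y_given_dry   # weighted by p(X)

print("IG(Y|X) =", H_Y - H_Y_given_X)
```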

Model Complexity
• The complexity of the model induced by a decision tree is determined by the depth of the tree
• Increasing the depth of the tree increases the number of decision boundaries and may lead to overfitting
• Pre-pruning and post-pruning

• Limit tree size (pick one):
• max_depth
• max_leaf_nodes
• min_samples_split
• (and more)
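A sketch of pre-pruning with one of the size limits listed above (max_depth=3 is an arbitrary example value):

```python
# Decision tree pre-pruning sketch: full tree vs. depth-limited tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("full tree  :", full.get_depth(), full.score(X_test, y_test))
print("max_depth=3:", pruned.get_depth(), pruned.score(X_test, y_test))
```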

Decision Tree Regression


Regression vs Classification
• Both partition the data
• Criterion for classification: Gini, Cross Entropy, Information Gain
• Criterion for regression: Weighted mean square error

Advantages
• Suitable for multi-class classification
• Model is easily interpretable (as it's a series of if-else conditions)
• Can handle numerical and categorical data
• Non-linear
• Can tolerate missing values
Disadvantages
• Prone to overfitting without pruning
• Weak learners: a single decision tree does not make great predictions; multiple trees can be combined to create stronger ensemble models

Lecture 9- Ensemble Models
Bias and Variance
Estimating Generalization Error
Generalization error = bias + variance + noise.
Bias and variance typically trade off in relation to model complexity.

Low bias + low variance = the 2 most fundamental features expected from a model

Ensemble methods try to reduce the bias and/or variance of weak (single) models by combining several of them together to achieve better performance.

Simple (weak/base) models:
• Logistic Regression
• Naïve Bayes
• kNN
• (Shallow) Decision Trees
• Kernel SVMs
These base models do not perform so well by themselves, either because
• they have a high bias (e.g. low degree of freedom models), or
• they have too much variance to be robust (e.g. high degree of freedom models).

Ensemble methods = simple models used as building blocks for designing more complex models by combining several of them through:
• Voting
• Bagging: train many models on bootstrapped data, then take the average (e.g. Random Forests)
• Boosting: given a weak model, run it multiple times on (reweighted) training data, then let the learned classifiers vote (e.g. Gradient Boosting)
• Stacking: trains many models in parallel and combines them by training a meta-model to output a prediction based on the different weak models' predictions

Voting
Build different models
• Classifiers that are most “sure” will vote with more conviction
• Classifiers will be most “sure” about a particular part of the space
• Average the result
More models are better – if they are not correlated.
Also works with neural networks
You can average any models as long as they provide calibrated (“good”) probabilities.
Scikit-learn: VotingClassifier
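A soft-voting sketch with scikit-learn (the base models and X_train/y_train are placeholders; soft voting assumes the base models output reasonably calibrated probabilities):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Soft voting averages the predicted class probabilities of the base models.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)
# voting.fit(X_train, y_train); voting.score(X_test, y_test)
```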

Bagging (Bootstrap Aggregation)
• Generic way to build “slightly different” models
• Draw bootstrap samples from dataset (as many as there are in the dataset, with repetition)
• Implemented in BaggingClassifier, BaggingRegressor

Bootstrapping = generating samples of size B (called bootstrap samples) from an initial dataset of size N by randomly drawing B observations with replacement.
- Can be used to evaluate the variance or confidence intervals of an estimator
- The bootstrap samples can be considered “almost-representative” and “almost-independent” (almost i.i.d. samples); they allow us to approximate the variance of the estimator by evaluating its value on each of them.

Bagging- fits several independent models and “averages” their predictions to obtain a model with a lower variance.
• Fitting fully independent models would require too much data
• Relies on bootstrap samples
• Creates multiple bootstrap samples
• Each new bootstrap sample acts as another (almost) independent dataset
• Fit a weak learner on each sample
• Aggregate them (average the outputs)
• Regression: simple average
• Classification problem:
  • simple majority vote (hard voting)
  • highest average probability (soft voting)
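A bagging sketch with scikit-learn (the tree base model and parameter values are placeholders; older scikit-learn versions call the first argument base_estimator instead of estimator):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each tree is fit on a bootstrap sample (drawn with replacement);
# predictions are aggregated by voting / averaged probabilities.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,     # sample observations with replacement
    random_state=0,
)
# bagging.fit(X_train, y_train)
```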

Review Decision Trees


Gini Index vs Entropy

Random Forests
• Strong learners composed of multiple trees can be called “forests”.
• Trees can be shallow (few levels) or deep (many levels, if not fully grown).
o Shallow trees - less variance but higher bias; a better choice for sequential (boosting) methods
o Deep trees - low bias but high variance; a better choice for bagging methods, which mainly focus on reducing variance.
• Random forest approach is a bagging method where deep trees, fitted on bootstrap samples, are combined to
produce an output with lower variance
o Randomize in 2 ways to reduce correlation:
§ For each tree- Pick bootstrap sample of data
• Samples over the observations in the dataset to generate a bootstrap sample
§ For each split- Pick random sample of features
• Samples over features and keep only a random subset of them to build the tree
o More trees are always better

Classification and regression with random forests
Classification: the mode of the classes outputted by the trees.
Regression: the mean of the values outputted by the trees.

Tuning Random Forests


Main parameter: max_features
• around sqrt(n_features) for classification
• around n_features for regression
n_estimators > 100
• Pre-pruning might help, and definitely helps with model size!
• max_depth, max_leaf_nodes, min_samples_split again
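A tuning sketch along these lines (the values are placeholders, not recommendations for a specific dataset):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# n_estimators > 100; max_features controls the random feature subset per split.
rf_clf = RandomForestClassifier(n_estimators=200, max_features="sqrt",  # ~ sqrt(n_features)
                                n_jobs=-1, random_state=0)
rf_reg = RandomForestRegressor(n_estimators=200, max_features=1.0,      # ~ all features
                               n_jobs=-1, random_state=0)
# rf_clf.fit(X_train, y_train); rf_clf.predict(X_test)
```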

Boosting
• Fits sequentially multiple weak learners
adaptively
• Each model in the sequence is fitted giving
more importance to observations in the
dataset that were badly handled by the
previous models in the sequence.

Bagging: aims to reduce variance
Boosting: aims to reduce bias

• “Meta-algorithm” to create strong learners from weak learners (e.g. AdaBoost, GentleBoost, ...)
• Trees work best for Boosting
• Gradient Boosting is often the best of the bunch
• Many specialized algorithms (ranking etc.)

Ada Boost
• Adaptive Boosting (AdaBoost)
• Add the weak learners one by one, looking at each iteration for the best possible pair (coefficient, weak
learner) to add to the current ensemble model.
• Updates observation weights in the dataset and train a new weak learner with a special focus given to the
observations misclassified by the current ensemble model.
• Adds the weak learner to the weighted sum according to an update coefficient that expresses the
performances of this weak model: the better a weak learner performs, the more it contributes to the strong
learner.
At the very beginning of the algorithm (first model of the
sequence), all the observations have the same weights 1/N.
Then, we repeat L times (for the L learners in the sequence)
the following steps:
• Fit the best possible weak model with the current
observations weights
• Compute the value of the update coefficient that is some
kind of scalar evaluation metric of the weak learner that
indicates how much this weak learner should be taken into
account into the ensemble model
• Update the strong learner by adding the new weak learner
multiplied by its update coefficient
• Compute new observation weights that express which observations we would like to focus on at the next iteration (weights of observations wrongly predicted by the aggregated model increase, and weights of correctly predicted observations decrease)

AdaBoost
• Increases the predictive accuracy by assigning weights to the observations at the end of every tree and weights (scores) to every classifier.
• Every classifier has a different weight on the final prediction.
• Boosting occurs sequentially
Recap - Random forest
• Parallel operations
• All trees are assigned equal weights.
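A minimal AdaBoost sketch with decision stumps in scikit-learn (parameter values are placeholders; older scikit-learn versions use base_estimator instead of estimator):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Each new stump focuses on observations misclassified by the current ensemble,
# and each stump gets its own weight in the final (sequential) prediction.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump
    n_estimators=100,
    learning_rate=0.5,
    random_state=0,
)
# ada.fit(X_train, y_train)
```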

Gradient Boosting- Combines weak learners to form a strong learner.
• Residual of the current classifier becomes the input for the next
consecutive classifier on which the trees are built (sequential
model)
• The residuals are captured in a step-by-step manner by the classifiers, in order to capture the maximum variance within the data.
• Done by introducing a learning rate for the classifiers.
• Many shallow trees
• learning_rate ↔ n_estimators trade-off
• Serial: slower to train than Random Forests, but much faster to predict
• Small model size
• Uses one-vs-rest for multi-class!
Gradient Boosting regression
1) Make an initial guess for every sample by computing the average of the target values (1st prediction):
(32.1 + 18.5 + 46.6 + 24 + 18) / 5 = 27.84
2) Compute the residuals (called pseudo residuals)
3) Combine predicted value with residuals (scaled by a learning rate)
• For a learning rate of 0.1, the predicted value for 1st row is 27.84 + 0.1 * 4.26 = 28.266
4) Compute new residuals
5) Compute the new predicted values
• New predicted values for the first row: 27.84 + (0.1*4.26) + (0.1*3.83) = 28.649
(Figure: initial leaf node, Tree 1, Tree 2)
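The arithmetic above, traced for the first row in plain Python (the residual 3.83 predicted by the second tree is taken from the slides as given):

```python
targets = [32.1, 18.5, 46.6, 24.0, 18.0]
learning_rate = 0.1

# Step 1: initial prediction = average of the targets
pred0 = sum(targets) / len(targets)                        # 27.84

# Step 2: pseudo-residual for the first row
residual_1 = targets[0] - pred0                            # 32.1 - 27.84 = 4.26

# Step 3: prediction after tree 1 (residual scaled by the learning rate)
pred_after_tree1 = pred0 + learning_rate * residual_1      # 28.266

# Steps 4-5: tree 2 predicts the new residual (3.83), giving the new prediction
pred_after_tree2 = pred0 + learning_rate * (4.26 + 3.83)   # 28.649

print(pred0, pred_after_tree1, pred_after_tree2)
```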

Gradient Boosting with Decision Trees


Gradient boosting casts the problem as gradient descent
• At each iteration, fit a tree to the opposite of the gradient of the current fitting error
• Classification: log loss
• Regression: square loss
• Discount each update by the learning rate
• Pseudo-residuals are computed over the training set; the ensemble is a weighted sum of the base learners’ results

Gradient Boosting- Start by setting the pseudo-residuals equal to the observation values. Then, repeat L times (for the L models of the sequence) the following steps:
• fit the best possible weak learner to pseudo-residuals (approximate the opposite of the gradient with respect to
the current strong learner)
• compute the value of the optimal step size that defines by how much we update the ensemble model in the
direction of the new weak learner
• update the ensemble model by adding the new weak learner multiplied by the step size (make a step of
gradient descent)
• compute new pseudo-residuals that indicate, for each result, in which direction we would like to update the ensemble model predictions next
Tuning Gradient Boosting
• Pick n_estimators, tune learning rate
• Can also tune max_features
• Typically strong pruning via max_depth
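A tuning sketch reflecting these guidelines (placeholder values):

```python
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(
    n_estimators=200,     # pick n_estimators ...
    learning_rate=0.1,    # ... then tune the learning rate against it
    max_depth=3,          # strong pruning: many shallow trees
    max_features="sqrt",  # can also be tuned
    random_state=0,
)
# gbc.fit(X_train, y_train)
```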
AdaBoost:
• Additive model
• Misclassifications of previous models are identified by increased observation weights
• Trees are usually grown as decision stumps
• Each classifier has a different weight assigned to the final prediction, based on its performance
• Both classifiers and observations are weighted to capture the maximum variance
Gradient Boosting:
• Additive model
• Misclassifications of previous models are identified by the gradient
• Trees are grown to a greater depth (more than 1)
• Classifiers are weighted equally; predictive capacity is restricted by a learning rate
• Builds trees on the residuals of the previous classifier, to capture the variance in the data
Stacking
• Bagging and Boosting considers mainly homogeneous
weak learners.
• Stacking
o learns several different (heterogeneous) weak learners
o combines the base models by training a meta-model to output predictions based on the multiple predictions returned by the weak models
• Classification problem example
o Choose some weak learners: a KNN classifier, logistic regression, and an SVM
o Choose a neural network as the meta-model
o Output of the 3 weak learners = input to the neural network
o Output of the neural network = final prediction
Fitting a stacking ensemble
Steps:
• Split the training data in two folds
• Choose L weak learners and fit them to data of the first fold
• For each of the L weak learners, make predictions for observations in the second fold
• Fit the meta-model on the second fold, using predictions made by the weak learners as inputs

Limitation: Only half of the data to train the base models and half of the data to train the meta-model.
Solution: “k-fold cross-training” approach (similar to what is done in k-fold cross-validation) such that all the
observations can be used to train the meta-model.
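A stacking sketch with scikit-learn's StackingClassifier, which implements this k-fold cross-training internally via its cv parameter (the base models and meta-model mirror the example above and are otherwise placeholders):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),
    ],
    final_estimator=MLPClassifier(max_iter=1000),  # meta-model
    cv=5,  # meta-model is fit on out-of-fold predictions of the base models
)
# stack.fit(X_train, y_train)
```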

Poor man’s Stacking- build multiple models; train a model on the probabilities/scores they produce
Hold-out estimates of probabilities
• Split 1 produces probabilities for Fold 1, split2 for Fold 2 etc.
• Get a probability estimate for each data point!
• Unbiased estimates (like on the test set) for the whole training set!

Summary
Ensemble learning
• multiple models (often called weak learners or base models) are trained to solve the same problem and
combined to get better performances
• the main hypothesis is that if we combine the weak learners the right way we can obtain more accurate and/or
robust models
Voting
• Build different models.
• Classifiers that are most “sure” will vote with more conviction
• Average the result
Bagging methods (an ensemble model with a lower variance)
• several instances of the same base model are trained in parallel (independently from each other) on different bootstrap samples and then aggregated in some kind of “averaging” process
Boosting methods (an ensemble model with a lower bias)
• several instances of the same base model are trained sequentially; at each iteration, the way the current weak learner is trained depends on the previous weak learners, and especially on how they perform on the data
Stacking method
• different weak learners are fitted independently from each other, and a meta-model is trained on top of them to predict outputs based on the outputs returned by the base models

When to use Tree-based models
• Model non-linear relationships
• Doesn’t care about scaling, no need for feature engineering!
• Single tree: very interpretable (if small)
• Random forests: very robust, good benchmark
• Gradient boosting: often best performance with careful tuning
Lecture 10- Model Evaluation, Learning with imbalanced data

Generalization performance
• A model should always be evaluated on independent test data.
• A model’s performance on unseen data will give us the generalization performance of the model.
• Focus on supervised learning methods (evaluation of unsupervised methods is more qualitative)

Evaluating using a training set + test set


• Easiest evaluation process
• Split the data
o a training set to fit the model on
o a test set to evaluate the fitted model ← evaluates generalization performance
• Problems
o How big should the training and test set be?
o How do we know if the test set is exceptionally different from the training data?
o How do we know whether the model is overfitting the data?

Sources of error: Bias and variance


Suppose we train a model on a random sample of the training
data multiple times, and then look at the performance on the
test set.
• Bias: how far, on average, are the model’s predictions from the correct value?
• Variance: how far apart are the model’s predictions?

The error of a model can be decomposed into


• Irreducible error (how noisy is the data itself): What is the variance of the target around its true mean?
• Reducible error
• Bias²: How much does the average of the estimate deviate from the true mean?
• Variance: What is the deviation of the estimates around their mean?
Validation and Tuning
Cross-Validation-> Splitting the data in a training and test set multiple times
• Most commonly used version is k-fold cross-validation
• k is the number of partitions of the data
• Each partition serves as the test set once, while all other partitions serve as training set

Benefits:
• Leaves less to luck: If we get a very good or bad training set by chance, this will show in the results → performance will be an outlier
• Shows how sensitive the model is to the training data set. High variance means high sensitivity to the training data.
Disadvantages:
• Increased computational cost
• Simple cross-validation can result in class imbalance between training and test sets
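A k-fold cross-validation sketch (the dataset and model are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each partition serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean(), scores.std())
```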

Stratified Cross-Validation-> makes sure there is no class imbalance in the different folds

Leave-one-out cross validation (LOO)-> k-fold cross-validation, where k=N and N=the number of items in the
dataset
• Very time consuming
• Generates predictions given the maximal available data
• Can be useful to find out which items are regular and irregular from the point-of-view of the dataset.

Shuffle-split cross-validation -> Controls test size, training size, and number of iterations
• A stratified variant is also available

Cross-validation with Groups


In cases where groups in the data are relevant to the learning problem
• E.g. emotion recognition: If the goal is to classify emotions from unknown persons, then it is best to split the
data so that the persons occurring in training and test sets are different.
• Common in medical applications where generalization to new patients is important
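A sketch of these splitting strategies in scikit-learn (the data, labels and groups below are random placeholders; `groups` would e.g. encode which person each sample comes from):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GroupKFold, LeaveOneOut, ShuffleSplit,
                                     StratifiedKFold, cross_val_score)

X = np.random.randn(60, 4)                  # placeholder features
y = np.random.randint(0, 2, size=60)        # placeholder labels
groups = np.repeat(np.arange(10), 6)        # e.g. 10 persons, 6 samples each

model = LogisticRegression(max_iter=1000)
cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))              # no class imbalance per fold
cross_val_score(model, X, y, cv=LeaveOneOut())                            # k = N
cross_val_score(model, X, y, cv=ShuffleSplit(n_splits=10, test_size=0.2)) # control sizes / iterations
cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)    # persons never cross folds
```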

Tuning = improving the model’s generalization performance by adjusting parameter values
• Simple grid search: try all possible combinations of chosen parameter values
• Grid search with cross-validation: use cross-validation to evaluate the
performance of each combination of parameter values.
Danger! Optimizing on test-set means that the test-set is not independent
anymore → requires another final test-set
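A grid-search sketch that keeps a final, untouched test set (X and y are assumed to be a feature matrix and labels; the grid values are placeholders):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Keep a final test set that is never used during tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "gamma": [0.001, 0.01, 0.1, 1]}

grid = GridSearchCV(SVC(), param_grid, cv=5)  # grid search with cross-validation
grid.fit(X_trainval, y_trainval)
print(grid.best_params_, grid.best_score_)
print("final test score:", grid.score(X_test, y_test))
```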

Binary Classification

Goal setting!
What do I want? What do I care about?
• (precision, recall, something else)
Can I assign costs to the confusion matrix?
• (i.e. a false positive costs me $10, a false negative $100)
What guarantees do we want to give?

Multi-class classification metrics
Macro-average F1:
Average F1 scores over classes (“all classes are equally important”)
Weighted F1: Mean of the per-class f-scores, weighted by their support.
(“bigger classes are important”)
Micro-average F1: Make one binary confusion matrix over all classes, then
compute recall, precision once (“all samples are equally important”)
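A short sketch of the three averaging options (the labels and predictions are made-up placeholders):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]   # placeholder multi-class labels
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 1, 0]   # placeholder predictions

print("macro F1:   ", f1_score(y_true, y_pred, average="macro"))     # all classes equally important
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # weighted by class support
print("micro F1:   ", f1_score(y_true, y_pred, average="micro"))     # all samples equally important
```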

Imbalanced Data

SUMMARY

LECTURE 11- DIMENSIONALITY REDUCTION

Supervised Learning
• Given a set of data and labels, learn a model which will predict a label for new data.
• Given D = {Xi,Yi} learn a model (or function) F: Xk -> Yk

Often used to automate manual labor.


E.g., you might annotate part of a dataset manually, then learn a machine learning model
from these annotations, and use the model to annotate the rest of your data
• Given a satellite image, what is the terrain in the image?
Xi = pixels (image regions), Yi = terrain type
• Given some test results from a patient, how likely is the patient to have diabetes?
Xi = test results, Yi = diabetes/no diabetes

Unsupervised Learning
• Discover patterns in data
• Given D = {Xi} group the data into Y classes using a model (or function)
• F: Xi -> Yj
Ex:
• Discovering trending topics on Twitter or in the news
• Grouping data into clusters for easier analysis
• Outlier detection (e.g. Fraud detection and security systems)

The Curse of Dimensionality

In 3 dimensions (3 features), perfect separation of cats and dogs is possible with a decision boundary (a plane)

This example suggests that by adding (informative) features, classification is improved.
• This is often the case, but adding new features increases the volume of the feature space exponentially
• For instance: 1 feature has 10 different values, 2 features: 100, 3 features: 1000

• A simple classification model in a high-dimensional space (e.g., a linear decision boundary (plane) in 3D)
• corresponds to a complex classification model in a low-dimensional space (e.g., non-linear decision boundaries in 2D)
• Overfitting is associated with (too) complex models
⇒ too many features may lead to overfitting too

Informative features: We want to increase the number of features to put all the relevant information in the
classifier
Curse of dimensionality: We want to decrease the number of features to
avoid the curse of dimensionality
Machine learning algorithms should optimize the trade-off between informative features and curse of
dimensionality by means of dimensionality reduction techniques

Dimensionality Reduction
Benefits of applying dimensionality reduction to a dataset:
• Reduced space required to store the data as the number of dimensions comes down
• Less dimensions lead to less computation/training time
• Some algorithms do not perform well when we have a large number of dimensions (e.g. KNN)
• Takes care of correlations by removing redundant features. (E.g. you have two highly correlated variables – ‘time spent on treadmill in minutes’ and ‘calories burnt’: the more time you spend running on a treadmill, the more calories you will burn.)
⇒ there is no point in storing both, as just one of them does what you require
• Visualizing data

2 different ways:
Feature selection:
• Keeping the most relevant variables from the original dataset
• E.g. Random Forests, Decision Trees
• Removing features with too many missing values
Dimensionality reduction:
• Finding a smaller set of new variables, each being a combination of the input variables, containing basically the same information as the input variables
• Unsupervised approach
• E.g. Principal Component Analysis, Non-negative Matrix Factorization, t-SNE (t-distributed Stochastic Neighbor Embedding)
Principal Component Analysis

- operates on features without labels (UNSUPERVISED LEARNING)
- each feature = an axis of an n-dimensional coordinate system
- 2 features: XY coordinate system
- PCA rotates the XY coordinate system
- performs dimensionality reduction (e.g. reduces 2 dims/features to 1)
- ALWAYS scale (otherwise the feature with the largest scale would always dominate the first principal component)

An orthogonal linear combination of the original variables
• The first principal component explains the maximum variance in the dataset
• The second principal component explains the largest part of the remaining variance in the dataset and is uncorrelated with the first principal component
• The third principal component explains the variance not explained by the first two principal components, etc.

OBJECTIVE: Find directions of maximum variance:


• Find projection (onto one vector) that maximizes the variance observed in
the data.
• Subtract projection onto PC1, iterate to find more components.
• Only well-defined up to sign / direction of arrow!
• PCA performs well if total variance explained increases exponentially
to 1

Computing PCA
Taking the whole dataset ignoring the class labels
• Compute the mean and covariance
• Center the data (subtract mean). (In practice: scale to unit variance)

Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition.
• To project the feature space via PCA onto a smaller subspace, where
the eigenvectors will form the axes of this new feature subspace.

Sort eigenvalues in descending order

Choose the k eigenvectors that correspond to the k largest eigenvalues
(k = number of dimensions of the new feature subspace, k ≤ d; d = original number of dimensions)

Construction of the projection matrix (W) that will be used to transform the data
• Projection matrix = matrix of our concatenated top k eigenvectors
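The steps above as a NumPy sketch (the data matrix is a random placeholder):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 5)              # placeholder data (n_samples x d)

# 1) Center (and, in practice, scale) the data
X_std = StandardScaler().fit_transform(X)

# 2) Covariance matrix and its eigen-decomposition
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: the covariance matrix is symmetric

# 3) Sort eigenvalues (and eigenvectors) in descending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Projection matrix W = top-k eigenvectors
k = 2
W = eigvecs[:, :k]

# 5) Project the data onto the new k-dimensional subspace
X_pca = X_std @ W
print("explained variance ratio:", eigvals[:k] / eigvals.sum())
```

In practice, sklearn.decomposition.PCA(n_components=k).fit_transform(X_std) gives the same projection (up to sign).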

Dimensionality reduction is color (class) blind (unsupervised) – in the example, PC2 separates the classes while PC1 doesn’t

Higher Dimensions
• For datasets >2 features, PCA rotates the coordinate system in such a way that:
• the projection of the data on the first principal component (new axis) has the largest variance
• the projection of the data on the second principal component (new axis) has the one-but-largest variance, etc
• If the variation in the data is associated with relevance for classification (or regression), the most relevant
features are captured by the first principal components (and the rest captures noise)
• Retaining the first principal components and throwing away the rest
effectively reduces the dimensionality
How many features (principal components) to keep?
No fixed rule that defines how many features should be used in a
classification problem; depends on:
• the amount of training data available,
• the complexity of the decision boundaries, and
• the type of classifier used
Total Explained Variance: can be used to decide the number of
features.

Non-negative Matrix Factorization

Latent space - contains a hidden, compressed representation of the data
It contains a simpler representation of our images than the pixel space

Dimensionality Reduction for Data Visualization: Manifold learning
Manifold Learning- allow for much more complex mappings and often provide better
visualizations
Learn underlying “manifold” structure, use for dimensionality reduction
Pro: pretty pictures
Cons: Visualisation only, axes don’t correspond to anything in the input space, often can’t
transform new data

t-SNE
• Starts with a random embedding
• Iteratively updates points to make “close”
points close.
• Global distances are less important,
neighbourhood counts.
• Good for getting coarse view of topology.
• Can be good for finding interesting data
points.

Tuning Parameters
• n_components (default: 2): Dimension of the
embedded space.
• perplexity (default: 30): The perplexity is related to
the number of nearest neighbors that are used in other
manifold learning algorithms. Consider selecting a
value between 5 and 50.
• early_exaggeration (default: 12.0): Controls how
tight natural clusters in the original space are in the embedded space and how much space will be between them.
• learning_rate (default: 200.0): The learning rate for t-SNE is usually in the range (10.0, 1000.0).
• n_iter (default: 1000): Maximum number of iterations for the optimization. Should be at least 250.
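A t-SNE sketch with these parameters (X is an assumed feature matrix; in very recent scikit-learn versions n_iter is being renamed to max_iter):

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,
    early_exaggeration=12.0,
    learning_rate=200.0,
    n_iter=1000,
    random_state=0,
)
X_embedded = tsne.fit_transform(X)  # t-SNE cannot transform new data later
```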

LECTURE 12: CLUSTERING
GOALS
• Data exploration: Are there coherent groups? How many groups are there?
• Data partitioning= divide data by group before further processing.
• Unsupervised feature extraction= Derive features from clusters or cluster distances
E.g. Clustering techniques
• K-means, Hierarchical Clustering, Density Based Techniques, Gaussian Mixtures Models

K-means Clustering= separate samples in k groups of equal variance


• Requires number of clusters to be specified

ALGORITHM
1. Choose the number of clusters, K
2. Randomly choose initial positions of K centroids
3. Assign each of the points to the "nearest centroid" (depending on the distance measure)
4. Recompute centroid positions
5. If solution converges -> Stop, else go the step 3.

Objective function for K-means: minimize the within-cluster sum of squared distances, Σ_i min_k ||x_i − μ_k||²

• The algorithm finds a local minimum of this objective

• New data points can be assigned cluster membership based on the existing cluster centroids.
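A K-means sketch (X and X_new are assumed feature matrices):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0)  # 10 random restarts
labels = km.fit_predict(X)         # cluster membership of the training points
centers = km.cluster_centers_
print(km.inertia_)                 # sum of squared distances to the nearest centroid

new_labels = km.predict(X_new)     # new points go to the nearest existing centroid
```

MiniBatchKMeans has the same interface and additionally supports partial_fit on mini-batches.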

Restrictions on cluster shapes:
• Cluster regions form the Voronoi diagram of the centers = a partition of the plane into regions closest to a given set of centers
• Clusters are always convex in space
Limitations:
• Only simple cluster shapes
• Cluster boundaries are determined by the midpoints between the centres
• Can’t model covariances well in anisotropically distributed clusters (exhibiting properties with different values when measured in different directions)

Computational Properties:
• By default K-means in sklearn does 10 random restarts with
different initializations.
• For large datasets, K-means initialization may take much longer than
clustering.
• Consider using random initialization (init='random'), in particular for MiniBatchKMeans
• Uses mini-batches to reduce the computation time, while still
attempting to optimise the same objective function (partial_fit)

MiniBatchKMeans
Mini-batches= subsets of the input data (rather than the whole data), randomly sampled in each training iteration.
Algorithm
1. Draw samples randomly from the dataset, to form a mini-batch. Assign to nearest centroid
2. Update the centroids by using a convex combination of the average of the samples and the previous samples
assigned to that centroid.
3. Perform 1 and 2 until convergence or for a fixed number of iterations,

Hierarchical Clustering

= a series of partitions from a single cluster containing all the data points to N clusters
containing 1 data point each.

• Start with N independent clusters: {P1 }, {P2 },...,{PN}


• Find the two closest (most similar) clusters, and join them
• Repeat step 2 until all points belong to the same cluster

Hierarchical clustering is heavily influenced by:
- distance metric (Euclidian, Manhattan, Maximum)
- linkage criterion- determines how clusters are merged
• the distance between 2 clusters is a function of the pairwise
distance between each point
• clusters that minimize this function are then combined
• linkage criteria are broadly equivalent for single point clusters
• Complete linkage: maximum distance between points in each
cluster
• Single linkage: minimum distance between points in each
cluster
• Average Linkage: average distance between points in each cluster
• Centroid Linkage: distance between cluster centroids
• Ward linkage: cluster variance (select clusters that maximize
decrease in variance)
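An agglomerative-clustering sketch (X is an assumed feature matrix; older scikit-learn versions call the metric parameter affinity):

```python
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(
    n_clusters=3,
    metric="euclidean",   # distance metric
    linkage="ward",       # or "complete", "single", "average"
)
labels = agg.fit_predict(X)
```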

Density-Based Clustering Methods


DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
2 important hyperparameters: epsilon (distance) and minPts
- The first point is picked at random
- Select the next point randomly from the unvisited points
- Count the number of points in its epsilon-neighbourhood (more than minPts => a new cluster is formed)
- Count the number of points in each unvisited neighbour’s neighbourhood (> minPts => add them to the cluster)
- Repeat
- If < minPts => the point is marked as noise

Density = number of sample points within a specified radius r (epsilon)


Core point: sample with more than a specified number of points (min_samples) within epsilon (includes samples
inside the cluster)
Border point has fewer than min_samples within epsilon, but is in the neighborhood of a core point
Noise point : any point that is not a core point or a border point.

Finds core samples of high density and expands clusters from them.
A sample is a “core sample” if more than min_samples points lie within epsilon of it (a “dense region”).
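A DBSCAN sketch (X is an assumed feature matrix; eps and min_samples are placeholders):

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)   # eps = epsilon radius, min_samples = minPts
labels = db.fit_predict(X)            # label -1 marks noise points
core_indices = db.core_sample_indices_
print("number of clusters:", len(set(labels) - {-1}))
```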

• Allows complex cluster shapes
• Can detect outliers
• Needs two parameters to adjust; epsilon is hard to pick (though it can be chosen based on the desired number of clusters)
• Can learn arbitrary cluster shapes
• Limitations:
  • Varying densities
  • High-dimensional data

Mixture Models

Evaluation of the clustering result


Elbow Plot

Silhouette Coefficient
• A function S that measures the separation between two clusters, c1 and c2.
• How can we measure the goodness of a clustering C = c1, ... cl, using the separation function S?
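A sketch of using the silhouette coefficient to compare clusterings with different k (X is an assumed feature matrix):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher = better-separated clusters
```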

