ML 01
Machine Learning: learning by a machine, computer, or other artificial device.
What is Machine Learning good for?
An example: digit recognition
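As a purely illustrative sketch (not prescribed by the course), the classic handwritten-digit dataset shipped with scikit-learn shows the structure described next, examples plus an associated answer:

    from sklearn.datasets import load_digits

    # Each example is an 8x8 grayscale digit image, flattened into a
    # 64-dimensional feature vector; the answer is the digit 0-9.
    digits = load_digits()
    X, y = digits.data, digits.target
    print(X.shape, y.shape)   # (1797, 64) (1797,)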
1. Data. A random sample collected from the problem we want to model, where each example is described by a set of attributes and an associated answer.
2. Models. Descriptions of how data are generated or behave in general, expressed in a specific language, for instance logical formulas, mathematical functions, or probability distributions.
3. Learning. The process by which concrete models are found so that they (1) explain the observed data and (2) can predict unseen data.
On data
▶ Data is tabular: rows are examples (objects, instances, or data samples) and columns are the attributes (features, ...) describing the examples.
▶ Features can be numerical (continuous range of values) or categorical (discrete set of values).
▶ One special column corresponds to the supervised answer (numerical or categorical).
▶ So each example is a d-dimensional vector x_i, and a dataset is a set of labelled examples (input-output pairs): D = {(x_1, y_1), ..., (x_n, y_n)}.
▶ It is convenient to place all the input features into a matrix X ∈ R^{n×d}, and all the labels into a vector y.
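A minimal NumPy sketch of this convention (the numbers are made up):

    import numpy as np

    # Three examples (n = 3), each described by d = 2 numerical features.
    x1 = np.array([1.0, 2.0])
    x2 = np.array([0.5, 1.5])
    x3 = np.array([2.0, 0.1])

    X = np.stack([x1, x2, x3])      # feature matrix X with shape (n, d) = (3, 2)
    y = np.array([3.0, 2.5, 1.9])   # label vector y with shape (n,)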
On data pre-processing
Each problem requires a different approach to data cleaning and preparation. This pre-processing can have a deep impact on performance, and it can easily take a significant amount of time. Typical steps (see the sketch after this list):
1. Attribute coding (discretization, encoding)
2. Normalization (range, distribution)
3. Missing values (imputation)
4. Outliers
5. Feature selection
6. Feature extraction (feature engineering)
7. Dimensionality reduction and transformations
Non-tabular data (images, audio, text, time-series, graphs, . . . ) may need ad-hoc
treatments and are beyond the scope of this course.
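As a minimal sketch of steps 1-3 on tabular data, assuming scikit-learn and made-up column roles (none of these choices are prescribed by the course):

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Columns 0-1 numerical, column 2 categorical (an assumption for the example).
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # 3. missing values
        ("scale", StandardScaler()),                 # 2. normalization
    ])
    pre = ColumnTransformer([
        ("num", numeric, [0, 1]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),  # 1. attribute coding
    ])

    X = np.array([[1.0, 2.0, 0.0],
                  [np.nan, 1.5, 1.0],
                  [2.0, 0.1, 0.0]])
    X_clean = pre.fit_transform(X)  # imputed, scaled, and one-hot encoded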
Models
Models are the artifacts by which we describe the input data; they can be understood as a compression mechanism with predictive abilities. They define how the learning is approached. In this course, we focus on two main groups of models: functions and probability distributions.
1. Models as functions, i.e. functions mapping input examples to target values:
▶ f : R^d → {C_1, ..., C_K} for classification
▶ f : R^d → R for regression, for example the linear model
f(x) = w^T x + w_0
2. Models as probability distributions, e.g. the joint distribution p(x, y) of inputs and answers.
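A tiny runnable sketch of the linear model from item 1, with made-up weights:

    import numpy as np

    def f(x, w, w0):
        # Linear model f(x) = w^T x + w0, mapping R^d to R.
        return w @ x + w0

    w = np.array([0.5, -1.0])   # illustrative weights (d = 2)
    x = np.array([2.0, 1.0])
    print(f(x, w, w0=0.3))      # 0.5*2.0 + (-1.0)*1.0 + 0.3 = 0.3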
Learning is the process of finding good models from (finite) input data.
good model = a model that predicts well on unseen data (this is the
generalization ability of a model)
Is learning possible?
Suppose that the learning process has a candidate model f . How can we assess its
quality?
We have several notions of error that we can compute, or at least estimate.
Assume that the input dataset is given by {(x_1, y_1), ..., (x_n, y_n)}; then we denote:
▶ ŷ = f(x), the prediction on object x by model f
▶ l(y, ŷ), the error function (or loss function), which measures how far off predictions are from the true values
We call the true error of a model f (i.e. its generalization error, also called the expected risk) the expected error that the model will make on a random, possibly unseen example (x, y) drawn from the distribution p(x, y):

E_true(f) = E_{x,y}[l(y, f(x))] = ∫ l(y, f(x)) p(x, y) dx dy
On error, cont.
Empirical error / training error
Through a finite dataset we only see a partial view of the process we are modelling, so we cannot compute the true error directly; we therefore resort to estimates/approximations of it.
Assuming that the examples are independent and identically distributed (iid), the empirical mean of the loss is a good estimate of the population loss. So we define the empirical error (also called the empirical risk):

E_emp(f, X, y) = (1/n) Σ_{i=1}^{n} l(y_i, f(x_i))
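A small sketch of this computation, assuming a squared loss (the loss choice and the toy data are illustrative):

    import numpy as np

    def squared_loss(y, y_hat):
        # l(y, ŷ) = (y - ŷ)^2, one common choice for regression.
        return (y - y_hat) ** 2

    def empirical_error(f, X, y):
        # Mean loss of model f over the n examples in (X, y).
        y_hat = np.array([f(x) for x in X])
        return np.mean(squared_loss(y, y_hat))

    w, w0 = np.array([0.5, -1.0]), 0.3          # made-up linear model
    f = lambda x: w @ x + w0
    X = np.array([[1.0, 2.0], [0.5, 1.5], [2.0, 0.1]])
    y = np.array([3.0, 2.5, 1.9])
    print(empirical_error(f, X, y))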
On error, cont.
Regularization
Minimizing the training error excessively may lead to the famous notion of overfitting. This is particularly dangerous for complex f's, so the natural fix is to limit or penalize complexity (this is called regularization). So the learning process should now seek an f that minimizes this regularized empirical risk instead:
E_reg(f, X, y) = (1/n) Σ_{i=1}^{n} l(y_i, f(x_i)) + λ|f|
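Continuing the sketch above, under the assumption that the complexity |f| of a linear model is measured by ‖w‖² (a ridge-style penalty; other choices exist):

    def regularized_error(f, X, y, w, lam):
        # Empirical error plus a ridge-style penalty lam * ||w||^2;
        # this penalty is one illustrative choice of |f|, not the only one.
        return empirical_error(f, X, y) + lam * np.sum(w ** 2)

    print(regularized_error(f, X, y, w, lam=0.1))  # reuses the toy model above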
We will see all of these concepts and more in the context of linear regression.
Please refresh concepts from linear algebra, vector calculus, probability theory and
statistics.
The book Mathematics for Machine Learning contains good coverage of these topics:
▶ Eigendecomposition and the SVD (chapter 4)
▶ Partial differentiation, gradients of vector-valued functions (chapter 5)
▶ Probability and Distributions (chapter 6)