STAT 451: Introduction to Machine Learning
Lecture Notes
Sebastian Raschka
Department of Statistics
University of Wisconsin–Madison
http://stat.wisc.edu/~sraschka/teaching/stat451-fs2020/
Fall 2020
Contents
1 L01: What is Machine Learning? An Overview
1.1 Machine Learning – The Big Picture
1.2 Applications of Machine Learning
1.3 Overview of the Categories of Machine Learning
1.3.1 Supervised Learning
1.3.2 Unsupervised Learning
1.3.3 Reinforcement Learning
1.3.4 Semi-supervised Learning
1.4 Introduction to Supervised Learning
1.4.1 Statistical Learning Notation
1.5 Data Representation and Mathematical Notation
1.6 Hypothesis Space
1.7 Classes of Machine Learning Algorithms
1.7.1 Algorithm Categorization Schemes
1.7.2 Pedro Domingos's 5 Tribes of Machine Learning
1.8 Components of Machine Learning Algorithms
1.8.1 Training
1.8.2 Evaluation
1.9 Different Motivations for Studying Machine Learning
1.10 On Black Boxes & Interpretability
1.11 The Relationship between Machine Learning and Other Fields
1.11.1 Machine Learning and Data Mining
1.11.2 Machine Learning, AI, and Deep Learning
1.12 Software
1.13 Glossary
1 L01: What is Machine Learning? An Overview

1.1 Machine Learning – The Big Picture
One of the main motivations for developing (computer) programs is to automate various
kinds of (often tedious) processes. Originally, machine learning was developed as a subfield
of Artificial Intelligence (AI), and one of the goals behind machine learning was to replace
the need for developing computer programs “manually.” Considering that programs are
being developed to automate processes, we can think of machine learning as the process
of “automating automation.” In other words, machine learning lets computers “create”
programs (often, the intent for developing these programs is making predictions) themselves.
We can say that machine learning is the process of turning data into programs (Figure 1).
In the machine learning community, it is broadly accepted that the term machine learning was first coined by Arthur Lee Samuel, a pioneer in the AI field, in 1959¹. One quotation that almost every introductory machine learning resource cites is the following, which summarizes the concept behind machine learning nicely and concisely:

Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.² — Arthur L. Samuel, AI pioneer, 1959
Now, before we introduce machine learning more formally, here is what some other people
said about the field:
The field of machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience.
— Tom Mitchell, Professor of Machine Learning at Carnegie Mellon University and author of the popular "Machine Learning" textbook
¹ Arthur L Samuel. "Some studies in machine learning using the game of checkers". In: IBM Journal of Research and Development 3.3 (1959), pp. 210–229.
² Note that this popular quotation does not appear verbatim in Samuel's paper; the closest passage states that a computer programmed to "learn from experience should eventually eliminate the need for much of this detailed programming effort."
Figure 1: Machine learning vs. "classic" programming. In classic programming, inputs and a hand-written computer program produce the outputs; in machine learning, inputs and outputs are given, and the learning process produces the program.

A bit more concrete is Tom Mitchell's description from his Machine Learning book³:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
To illustrate this quote with an example, consider the problem of recognizing handwritten digits (Figure 2): the task T is classifying images of handwritten digits, the performance measure P is the fraction of digits classified correctly, and the experience E is a dataset of labeled digit images, such as the MNIST dataset⁴.

³ Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
⁴ Yann LeCun et al. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
1.2 Applications of Machine Learning

More than half a century after the field of machine learning was "founded," we can now find applications of machine learning in almost every aspect of our lives. Popular applications of machine learning include the following:
• Drug design
• Medical diagnoses
• ...
While we go over some of these applications in class, it is a good exercise to think about how machine learning could be applied in the problem areas and tasks listed above.
1.3 Overview of the Categories of Machine Learning

The three broad categories of machine learning are summarized in Figure 3: (1) supervised learning, (2) unsupervised learning, and (3) reinforcement learning. Note that in this class, we will primarily focus on supervised learning, which is the "most developed" branch of machine learning. While we will also cover various unsupervised learning algorithms, reinforcement learning will be out of the scope of this class.
Figure 3: Categories of machine learning: supervised learning (labeled data, direct feedback, predict outcome/future), unsupervised learning (no labels/targets, no feedback, find hidden structure in data), and reinforcement learning (decision process, reward system, learn a series of actions). (Source: Raschka & Mirjalili: Python Machine Learning, 3rd Ed.)
1.3.1 Supervised Learning

Supervised learning is the subcategory of machine learning that focuses on learning a classification model (Figure 4) or a regression model (Figure 5), that is, learning from labeled training data (i.e., inputs that also contain the desired outputs or targets; basically, "examples" of what we want to predict).
Figure 4: Illustration of a binary classification problem (plus and minus signs denote class labels) with two feature variables, x1 and x2. (Source: Raschka & Mirjalili: Python Machine Learning, 3rd Ed.)
Figure 5: Illustration of a linear regression model with one feature variable (x1) and the target variable y. The dashed line indicates the functional form of the linear regression model. (Source: Raschka & Mirjalili: Python Machine Learning, 3rd Ed.)
1.3.2 Unsupervised Learning

In unsupervised learning, in contrast, we work with unlabeled data, and the goal is to find hidden structure in the data – for instance, by grouping data points into clusters (Figure 6).

Figure 6: Illustration of clustering, where the dashed lines indicate potential group membership assignments of unlabeled data points. (Source: Raschka & Mirjalili: Python Machine Learning, 3rd Ed.)
1.3.3 Reinforcement Learning

Reinforcement learning is the process of learning from rewards while performing a series of actions. In reinforcement learning, we do not tell the learner or agent (for example, a robot) which action to take but merely assign a reward to each action and/or the overall outcome. Instead of having "correct/false" labels for each step, the learner must discover or learn a behavior that maximizes the reward for a series of actions. In that sense, it is not a supervised setting.
RL is somewhat related to unsupervised learning; however, reinforcement learning really is
its own category of machine learning. Reinforcement learning will not be covered further in
this class. However, for those who are interested, Dr. Mirjalili and I wrote an introduction
to reinforcement learning for the 3rd edition of “Python Machine Learning.”
Typical applications of reinforcement learning involve playing games (chess, Go, Atari video games) and some forms of robots, e.g., drones, warehouse robots, and more recently self-driving cars.
Figure 7: Illustration of reinforcement learning: an agent chooses actions in an environment and receives a reward signal and state information in return. (Source: Raschka & Mirjalili: Python Machine Learning, 3rd Ed.)
1.3.4 Semi-supervised Learning

Loosely speaking, semi-supervised learning can be described as a mix between supervised and unsupervised learning. In semi-supervised learning tasks, some training examples contain outputs, but some do not. We can then use the labeled training subset to infer labels for the unlabeled portion of the training set, which we then also utilize for model training.
Unless noted otherwise, we will focus on supervised learning and classification, the most
prevalent form of machine learning. However, we will also see some regression examples
throughout the lectures, and there will be lectures on unsupervised learning later in this
course.
1.4 Introduction to Supervised Learning

In supervised learning, we are given a labeled training dataset from which a machine learning algorithm can learn a model. The learned (or trained) model can be used to predict labels
of unlabeled data points. These unlabeled data points could be either test data points (for
which we actually have labels but we withheld them for testing purposes) or unlabeled data
that we already collected or will collect in the future. For example, given a corpus of spam
and non-spam email, a supervised learning task would be to learn a model that predicts
to which class (spam or non-spam) new emails belong. Of course, this all rests on the assumption that the training dataset and the unlabeled data points used for prediction have been sampled from the same probability distribution – after all, we cannot expect the model to make reliable predictions on fundamentally different data. Figures 8 and 9 provide a
simplified and more detailed overview of a typical machine learning workflow.
More formally, we define h as the “hypothesis,” a function that we use to approximate some
unknown function
f(x) = y, (1)
where x is a vector of input features associated with a training example or dataset instance
(for example, the pixel values of an image) and y is the outcome we want to predict (e.g.,
what class of object we see in an image). In other words, h(x) is a function that predicts y.
In classification, we define the hypothesis function as

h : X → Y, (2)

where X = R^m and Y = {1, ..., k}, with k being the number of class labels. In the special case of binary classification, we have Y = {0, 1} (alternatively, we may use Y = {−1, 1}).
And in regression, the task is to learn a function

h : R^m → R. (3)
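To make these definitions concrete, here is a minimal sketch in Python (not part of the original notes; the synthetic data and the scikit-learn estimators are illustrative assumptions) that fits one hypothesis h for binary classification and one for regression:

```python
# Minimal sketch (illustrative): a classification hypothesis h: R^m -> {0, 1}
# and a regression hypothesis h: R^m -> R, fit with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(123)
X = rng.randn(100, 3)                            # 100 examples with m = 3 features

# Classification: Y = {0, 1}
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic binary labels
h_clf = LogisticRegression().fit(X, y_class)
print(h_clf.predict(X[:5]))                      # predicted class labels

# Regression: Y = R
y_reg = 2.0 * X[:, 0] - X[:, 2] + 0.1 * rng.randn(100)  # synthetic continuous target
h_reg = LinearRegression().fit(X, y_reg)
print(h_reg.predict(X[:5]))                      # predicted real values
```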
Figure 8: Simplified supervised learning workflow: a machine learning algorithm fits a model on the labeled training data, and the resulting final model is used to predict labels for new data.

Figure 9: More detailed supervised learning workflow, additionally showing the training dataset, model selection, cross-validation, performance metrics, and hyperparameter optimization.
A crucial assumption we make in supervised learning is that the training examples have the
same distribution as the test (future) examples. In real-world applications, this assumption
is oftentimes violated, which is one of the common challenges in the field.
1.4.1 Statistical Learning Notation

Please note that the notation varies wildly across the literature, since machine learning is a field that is popular across so many disciplines. For example, in the context of statistical learning theory, we can think of the dataset D = {⟨x^[1], y^[1]⟩, ⟨x^[2], y^[2]⟩, ..., ⟨x^[n], y^[n]⟩} as a sample from the population of all possible input-target pairs, where x^[i] and y^[i] are instances of two random variables X and Y, and the training pairs are drawn from a joint distribution P(X, Y) = P(X)P(Y|X).
Given an error term ε, we can then formulate the following relationship:

Y = f(X) + ε. (5)
The goal in statistical learning is then to estimate f, which can then be used to predict Y:

f̂(X) = Ŷ. (6)
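As a small illustration of this view, the following sketch (a toy example; the choice of f, the noise level, and the least-squares estimator are assumptions) simulates data from Y = f(X) + ε and estimates f̂ from the sample:

```python
# Toy simulation (assumed f and noise): generate Y = f(X) + eps,
# then estimate f-hat from the sample and use it to predict Y.
import numpy as np

rng = np.random.RandomState(0)
n = 200
X = rng.uniform(-3, 3, size=n)
f = lambda x: 0.5 * x + 1.0                  # the "unknown" true function (assumed)
eps = rng.normal(scale=0.5, size=n)          # irreducible error term
Y = f(X) + eps

slope, intercept = np.polyfit(X, Y, deg=1)   # least-squares estimate of f
Y_hat = slope * X + intercept                # f-hat(X) = Y-hat
print(f"f-hat(x) = {slope:.2f} * x + {intercept:.2f}")
```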
1.5 Data Representation and Mathematical Notation

In the previous section, we referred to the ith pair in a labeled training set D as ⟨x^[i], y^[i]⟩.
We will adopt the convention to use italics for scalars, boldface characters for vectors, and
uppercase boldface fonts for matrices.
• x: A scalar denoting a single training example with 1 feature (e.g., the height of a person);
• x: A training example with m features (e.g., with m = 3 we could represent the height, weight, and age of a person), represented as a column vector (i.e., a matrix with 1 column, x ∈ R^m):
x = [x_1, x_2, ..., x_m]^T. (7)
(Note that most programming languages, incl. Python, start indexing at 0!)
Analogously, we collect all n training examples in an n × m matrix X, whose rows are the transposed training examples:

X = [x_1, x_2, ..., x_n]^T ∈ R^{n×m}. (8)
Note that in order to distinguish the feature index and the training example index, we will use a square-bracket superscript notation to refer to the ith training example and a regular subscript notation to refer to the jth feature; for example, x_j^[i] denotes the jth feature of the ith training example.
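The following NumPy sketch (the height/weight/age values are made up for illustration) shows how this notation maps onto array indexing:

```python
# Notation in NumPy (made-up values): X is the n-by-m matrix of training
# examples; X[i] is the (i+1)-th training example x^[i+1], since Python
# indexing starts at 0, and X[i, j] is its (j+1)-th feature.
import numpy as np

X = np.array([[170.0, 60.0, 30.0],   # x^[1]: height, weight, age
              [181.0, 75.0, 45.0],   # x^[2]
              [162.0, 52.0, 22.0],   # x^[3]
              [174.0, 68.0, 31.0]])  # x^[4]

n, m = X.shape          # n = 4 training examples, m = 3 features
x2 = X[1]               # the 2nd training example, x^[2]
x2_3 = X[1, 2]          # its 3rd feature, x_3^[2] (here: age = 45.0)
print(n, m, x2, x2_3)
```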
1.6 Hypothesis Space

In the previous section on supervised learning, we defined the hypothesis h(x) to predict a target y. Machine learning algorithms sample from a hypothesis space that is usually smaller than the entire space of all possible hypotheses H – an exhaustive search covering all h ∈ H would be computationally infeasible, since H grows exponentially with the size (dimensionality) of the training examples. This is illustrated in the following paragraph.
Assume we are given a dataset with 4 features and 3 class labels, y ∈ {Setosa, Versicolor, Virginica}. Also, assume all features are binary. Given 4 features with binary values (True, False), we have 2^4 = 16 different feature combinations (see Table 1). Now, for each of the 16 rows, we have three classes to choose from (Setosa, Versicolor, Virginica). Hence, there are 3^16 = 43,046,721 potential ways of assigning class labels to the 16 rules (this is the size of the hypothesis space, |H| = 43,046,721)!
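We can verify this counting argument with a few lines of Python:

```python
# Verifying the counting argument above.
n_rows = 2 ** 4              # 16 combinations of 4 binary features
n_hypotheses = 3 ** n_rows   # one of 3 class labels for each of the 16 rows
print(n_rows, n_hypotheses)  # 16 43046721
```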
Table 1: Example of decision rules for the Iris flower dataset.

sepal length < 5 cm | sepal width < 5 cm | petal length < 5 cm | petal width < 5 cm | Class label
True  | True  | True  | True  | Setosa
True  | True  | True  | False | Versicolor
True  | True  | False | True  | Setosa
...   | ...   | ...   | ...   | ...
Now, imagine the features are not binary but real-valued. The hypothesis space would become so big that it would be impossible to evaluate it exhaustively. Hence, we use machine learning to reduce the search space within the hypothesis space (Figure 10). As a side note, a neural network with a single hidden layer, a finite number of neurons, and non-linear activation functions such as sigmoid units has been proven to be a universal function approximator⁵. However, the concept of universal function approximation⁶ does not imply practicality, usefulness, or adequate performance on practical problems.
Figure 10: Illustration of hypothesis spaces: the hypothesis space a particular learning algorithm category has access to, the hypothesis space a particular learning algorithm can sample, and a particular hypothesis (i.e., a model/classifier).
Typically, the number of training examples required is proportional to the flexibility of the learning algorithm; i.e., we need more training data (labeled examples) for models with a larger hypothesis space. (As a rule of thumb: the more parameters to fit and/or hyperparameters to tune, the larger the set of hypotheses to choose from.)
⁵ George Cybenko. "Approximation by superpositions of a sigmoidal function". In: Mathematics of Control, Signals and Systems 2 (1989), pp. 303–314.
⁶ Balázs Csanád Csáji. "Approximation with artificial neural networks". Faculty of Sciences, Eötvös Loránd University, Hungary (2001).
1.7 Classes of Machine Learning Algorithms

1.7.1 Algorithm Categorization Schemes

To aid our conceptual understanding, machine learning algorithms can be grouped into various categories:
• eager vs. lazy;
• batch vs. online;
• parametric vs. nonparametric;
• discriminative vs. generative.
These concepts or categorizations will become clearer once we have discussed a few of the different algorithms. However, below are brief descriptions of the various categorizations listed above.
Eager vs. lazy learners. Eager learners are algorithms that process training data immediately, whereas lazy learners defer the processing step until the prediction. In fact, lazy learners do not have an explicit training step other than storing the training data. A popular example of a lazy learner is the nearest neighbor algorithm, which we will discuss in the next lecture.
Batch vs. online learning. Batch learning refers to the fact that the model is learned on the entire set of training examples at once. Online learners, in contrast, learn from one training example at a time. It is not uncommon, in practical applications, to learn a model via batch learning and then update it later using online learning (a minimal sketch follows below).
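A minimal sketch of the contrast, assuming scikit-learn and a synthetic dataset (both are illustrative choices, not part of the original notes):

```python
# Batch vs. online learning with scikit-learn (synthetic data assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

rng = np.random.RandomState(1)
X = rng.randn(100, 2)
y = (X[:, 0] > 0).astype(int)

# Batch learning: the model is fit on the entire training set at once.
batch_model = LogisticRegression().fit(X, y)

# Online learning: a linear model trained via stochastic gradient descent,
# updated one training example at a time with partial_fit().
online_model = SGDClassifier(random_state=1)
for x_i, y_i in zip(X, y):
    online_model.partial_fit(x_i.reshape(1, -1), [y_i], classes=np.array([0, 1]))

print(batch_model.predict(X[:5]), online_model.predict(X[:5]))
```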
Parametric vs. nonparametric models. Parametric models are "fixed" models, where we assume a certain functional form for f(x) = y. For example, linear regression can be considered a parametric model with h(x) = w_1 x_1 + ... + w_m x_m + b. Nonparametric models are more "flexible" and do not have a pre-specified number of parameters. In fact, the number of parameters typically grows with the size of the training set. For example, a decision tree would be an example of a nonparametric model, where each decision node (e.g., a binary "True/False" assertion) can be regarded as a parameter.
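The following sketch (with an assumed synthetic dataset) illustrates the contrast: the linear model's parameter count stays fixed at m + 1, while the decision tree's node count grows with the training set size:

```python
# Parametric vs. nonparametric (synthetic data assumed): the linear model
# always has m + 1 parameters, while the tree's node count grows with n.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
for n in (50, 500, 5000):
    X = rng.randn(n, 3)
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(n)

    linear = LinearRegression().fit(X, y)
    tree = DecisionTreeRegressor().fit(X, y)
    n_linear_params = linear.coef_.size + 1      # m weights + 1 bias, fixed
    print(n, n_linear_params, tree.tree_.node_count)
```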
Discriminative vs. generative. Generative models (classically) describe methods that model the joint distribution P(X, Y) = P(Y)P(X|Y) = P(X)P(Y|X) for training pairs ⟨x^[i], y^[i]⟩⁷. Discriminative models take a more "direct" approach, modeling P(Y|X) directly. While generative models typically provide more insight and allow sampling from the joint distribution, discriminative models are typically easier to compute and produce more accurate predictions. A helpful analogy for understanding discriminative models is the following: discriminative modeling is like trying to extract information from text in a foreign language without learning that language.
1.7.2 Pedro Domingos's 5 Tribes of Machine Learning

Another useful way to think about different machine learning algorithms is Pedro Domingos's categorization of machine learning algorithms into five tribes (Figure 11), which he defined in his book "The Master Algorithm"⁸. These five tribes are the symbolists (logic- and rule-based learning), the connectionists (neural networks), the evolutionaries (evolutionary and genetic algorithms), the Bayesians (probabilistic inference), and the analogizers (similarity-based methods such as nearest neighbors and kernel machines).

⁷ More recently, the term "generative model" also refers to models that learn an approximation of X and sample training examples x ∼ X. Examples of such models are Generative Adversarial Networks and Variational Autoencoders; deep learning models that are not covered in this class.
⁸ Pedro Domingos. The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books, 2015.
1.8 Components of Machine Learning Algorithms

As already indicated in the figure of Pedro Domingos's five tribes of machine learning (Figure 11), machine learning algorithms consist of several different components. (Note that the components listed below differ somewhat from those in Figure 11.)
Representation. The first component is the “representation,” i.e., which hypotheses we
can represent given a certain algorithm class.
Optimization. The second component is the optimization metric that we use to fit the
model.
Evaluation. The evaluation component is the step where we evaluate the performance of
the model after model fitting.
To extend this list slightly, these are the five steps that we want to think about when approaching a machine learning application:
1. Define the problem to be solved.
2. Collect (labeled) data.
3. Choose an algorithm class.
4. Choose an optimization metric for learning the model.
5. Choose a metric for evaluating the model.
Note that optimization and evaluation measures are usually not the same in practice. For example, the optimization objective of the logistic regression algorithm is to minimize the negative log-likelihood (or binary cross-entropy), whereas the evaluation metric could be the classification accuracy or misclassification error (see the sketch below).
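A minimal sketch of this distinction (the dataset is a synthetic assumption): the model is fit by minimizing the binary cross-entropy, but we can evaluate it with either the cross-entropy or the accuracy:

```python
# Optimization metric vs. evaluation metric (synthetic data assumed):
# logistic regression is *fit* by minimizing the negative log-likelihood
# (binary cross-entropy), but can be *evaluated* by, e.g., accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

rng = np.random.RandomState(2)
X = rng.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)  # optimizes the log-likelihood
print("cross-entropy:", log_loss(y, model.predict_proba(X)))
print("accuracy:", accuracy_score(y, model.predict(X)))
```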
Also, while the list above suggests a linear workflow, in practice we often jump back to previous steps and, e.g., collect more data, try out different algorithms, and/or tune the "knobs" (i.e., "hyperparameters") of learning algorithms.
The following two subsections will provide a short overview of the optimization (training)
and evaluation parts.
1.8.1 Training

Learning a model from data typically amounts to solving an optimization problem. Common categories of optimization problems include:
• Combinatorial search, greedy search (e.g., decision trees: greedy over nodes, not within nodes);
• Unconstrained convex optimization (e.g., logistic regression);
• Constrained convex optimization (e.g., SVM);
• Nonconvex optimization (e.g., neural networks), here typically via backpropagation (chain rule, reverse-mode autodiff);
• Constrained nonconvex optimization (e.g., semi-adversarial networks⁹; not covered in this course).
⁹ Vahid Mirjalili et al. "Semi-Adversarial Networks: Convolutional Autoencoders for Imparting Privacy to Face Images". In: 2018 International Conference on Biometrics (ICB). IEEE, 2018.
There exist a number of different algorithms for each optimization task category (for example, gradient descent, conjugate gradient, and quasi-Newton methods for convex optimization problems). Also, the objective functions that we optimize can take different forms. Below are some examples (the information-gain objective is sketched in code after the list):
• Maximize information gain/minimize child node impurities (CART decision tree classification);
• Minimize a mean squared error cost (or loss) function (CART decision tree regression, linear regression, adaptive linear neurons, ...).
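As promised above, here is a small sketch of the information-gain objective (the split and label values are made-up examples):

```python
# Entropy-based information gain of a candidate decision-tree split
# (label arrays are made-up example values).
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 4 vs. 4 -> entropy = 1 bit
left = np.array([0, 0, 0, 1])                # left child after the split
right = np.array([0, 1, 1, 1])               # right child after the split

weighted = (left.size * entropy(left) + right.size * entropy(right)) / parent.size
info_gain = entropy(parent) - weighted
print(round(info_gain, 4))                   # ~0.1887 bits for this split
```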
1.8.2 Evaluation
Intuition. There are several different evaluation metrics to assess the performance of a model, and the most common ones will be discussed in future lectures. Unless noted otherwise, though, we will focus on the classification accuracy (ACC) or the misclassification error (ERR = 1 − ACC).
The classification accuracy of an algorithm is usually evaluated empirically by counting the fraction of correctly classified instances among all instances that the model attempted to classify. For instance, if a model classified 7,000 out of 10,000 test instances correctly, then we say that the model has a 70% accuracy on that dataset. In practice, we are often interested in the generalization performance of a model, which is the performance on new, unseen data that has the same distribution as the training data. The simplest way to estimate the generalization accuracy is to compute the accuracy on a reasonably sized unseen dataset (e.g., the test dataset that we set aside). However, there are several different techniques for estimating the generalization performance, which have different strengths and weaknesses. Being such an important topic, we will devote a separate lecture to model evaluation.
More formally, we can define the misclassification error via the 0-1 loss,

L(ŷ, y) = 0 if ŷ = y, and 1 if ŷ ≠ y, (11)

where ŷ is the class label predicted by a given hypothesis h(x) = ŷ, and y is the ground truth (correct class label). The prediction error can then be defined as the expected value

ERR = E[L(Ŷ, Y)], (12)

which we estimate as the average loss over a test dataset D_test consisting of n examples:

ERR_Dtest = (1/n) Σ_{i=1}^{n} L(ŷ^[i], y^[i]). (13)
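In code, Equations (11)–(13) amount to a comparison and an average; the label arrays below are made up for illustration:

```python
# Equations (11)-(13) in NumPy (made-up label arrays).
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0])  # ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])  # predictions y-hat = h(x)

losses = (y_pred != y_true).astype(int)  # 0-1 loss per example, Eq. (11)
err = losses.mean()                      # test-set error, Eq. (13)
print(err, 1.0 - err)                    # 0.2 0.8 -> ERR and ACC
```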
Common metrics used for optimization and/or evaluation include:
• Accuracy (1 − Error)
• ROC AUC
• Precision
• Recall
• (Cross) Entropy
• Likelihood
• Squared Error/MSE
• L-norms
• Utility
• Fitness
• ...
1.9 Different Motivations for Studying Machine Learning

There are several different motivations or approaches we take when studying machine learning. While the following bullet points attempt an overall categorization, there are many exceptions:
• Engineers: focusing on developing systems with high predictive performance for real-
world problem solving
• Mathematicians, computer scientists, and statisticians: understanding properties of
predictive models and modeling approaches
• Neuroscientists: understanding and modeling how the brain and intelligence work
Note that machine learning was originally inspired by neuroscience: the first attempt at an artificial neuron, the McCulloch-Pitts neuron¹⁰, was modeled after a biological neuron and later led to the popular perceptron algorithm by Frank Rosenblatt¹¹ (but this is a topic for Stat 453).
¹⁰ Warren S McCulloch and Walter Pitts. "A logical calculus of the ideas immanent in nervous activity". In: The Bulletin of Mathematical Biophysics 5.4 (1943), pp. 115–133.
¹¹ Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton. Cornell Aeronautical Laboratory, 1957.
Figure 12: An evolved antenna, designed via evolutionary algorithms and used on a 2006 NASA spacecraft (Source: https://en.wikipedia.org/wiki/Evolved_antenna).

1.10 On Black Boxes & Interpretability

Back in 2001, Leo Breiman wrote an interesting, highly recommended article titled "Statistical Modeling: The Two Cultures"¹², in which he contrasted two different approaches with respect to the two different goals "information" and "prediction." He referred to the two approaches as the "data modeling culture" and the "algorithmic modeling culture": the data modeling culture assumes that the data are generated by a given stochastic data model (e.g., linear regression, logistic regression, or the Cox model), whereas the algorithmic modeling culture treats the data-generating mechanism as unknown and uses algorithmic models (e.g., decision trees or neural networks) that are validated by their predictive accuracy.

Figure 13: Three screenshots from Breiman's "Statistical Modeling: The Two Cultures" paper. (A) The two overall motivations or goals in analyzing data. (B) The so-called "data modeling culture." (C) The so-called "algorithmic modeling culture."

The moral of the story is that whether to choose a "data model" or an "algorithmic model" really depends on the problem we want to solve, and it's best to use the appropriate tool for the task. In this course, we will of course not be restricted to one kind or the other. We will cover techniques of what would fit into the "data modeling culture" (Bayes optimal classifiers, Bayesian networks, naive Bayes, logistic regression, etc.) as well as "algorithmic approaches" (k-nearest neighbors, decision trees, support vector machines, etc.).

¹² Leo Breiman et al. "Statistical modeling: The two cultures (with comments and a rejoinder by the author)". In: Statistical Science 16.3 (2001), pp. 199–231.
Further, Breiman mentions three lessons learned in the statistical modeling and machine
learning communities, which I summarized below:
• Rashomon effect: the multiplicity of good models. Often we have multiple good models that fit the data well. If we have different models that all fit the data well, which one should we pick?
• Occam's razor. While we prefer simple models, there is usually a conflict between accuracy and simplicity to varying degrees (in later lectures, we will learn about techniques for selecting models within a "sweet spot" of this trade-off).
• Bellman and the "curse of dimensionality." Usually, having more data is considered a good thing (i.e., more information). However, more feature variables can be harmful to a model and make it more prone to overfitting (fitting the training data too closely, including its noise, and not generalizing well to new data that was not seen during training). Note that the curse of dimensionality refers to an increasing number of feature variables given a fixed number of training examples. Some models have smart workarounds for dealing with large feature sets, e.g., Breiman's random forest algorithm, which draws random subsets of the features to fit individual decision trees that are later joined into an ensemble – but more on that in future lectures.
Also note that there is a "No Free Lunch" theorem for machine learning¹³, meaning that there is no single best algorithm that works well across different problem domains.

¹³ David H Wolpert. "The lack of a priori distinctions between learning algorithms". In: Neural Computation 8.7 (1996), pp. 1341–1390.
1.11 The Relationship between Machine Learning and Other Fields

1.11.1 Machine Learning and Data Mining

Data mining focuses on the discovery of patterns in datasets or "gaining knowledge and insights" from data – often, this involves a heavy focus on computational techniques, working with databases, etc. (nowadays, the term is more or less synonymous with "data science"). We can then think of machine learning algorithms as tools within a data mining project. Data mining is not "just" machine learning, however, but also emphasizes data processing, visualization, and tasks that are traditionally not categorized as "machine learning" (for example, association rule mining).
1.11.2 Machine Learning, AI, and Deep Learning

Artificial intelligence (AI) was created as a subfield of computer science focusing on solving tasks that humans are good at (for example, natural language processing and image recognition). In other words, the goal of AI is to mimic human intelligence.
There are two subtypes of AI: artificial general intelligence (AGI) and narrow AI. AGI refers to an intelligence that equals humans in several tasks, i.e., multi-purpose AI. In contrast, narrow AI focuses on solving a particular task that humans are traditionally good at (e.g., playing a game or driving a car – I would not go so far as to refer to "image classification" as AI).
In general, AI can be approached in many ways. One approach is to write a computer
program that implements a set of rules devised by domain experts. Now, hand-crafting rules
can be very laborious and time consuming. The field of machine learning then emerged as
a subfield of AI – it was concerned with the development of algorithms so that computers
can automatically learn (predictive) models from data.
Assume we want to develop a program that can recognize handwritten digits from images. One approach would be to look at all of these images and come up with a set of (nested) if-this-then-that rules to determine which digit is displayed in a particular image (for instance, by looking at the relative locations of pixels). Another approach would be to use a machine learning algorithm, which can fit a predictive model based on thousands of labeled image samples that we may have collected in a database (a minimal sketch follows below).
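As an illustrative sketch of the second approach (the dataset and model choice are assumptions; scikit-learn's small built-in digits dataset stands in for such a database):

```python
# Learning to recognize handwritten digits instead of hand-crafting rules
# (scikit-learn's built-in 8x8 digits dataset is used as a stand-in).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # 1,797 images as 64 pixel features each
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```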
Now, there is also deep learning, which in turn is a subfield of machine learning, referring
to a particular subset of models that are particularly good at certain tasks such as image
recognition and natural language processing.
In short, machine learning (and deep learning) can definitely be helpful for developing "AI"; however, AI does not necessarily have to be developed using machine learning – although machine learning makes "AI" much more convenient.
Figure 14: Relationship between machine learning, deep learning, and artificial intelligence: deep learning is a subfield of machine learning, which in turn is a subfield of AI. Note that there is also overlap between machine learning and data mining, data science, statistics, etc. (not shown).
1.12 Software
We will talk more about software in upcoming lectures, but at this point, I want to provide
a brief overview of the “Python for scientific computing” landscape.
Figure 15: Scientific Python packages, some of which we will discuss in class. (Image by Jake VanderPlas; Source: https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8). The graphic is structured in terms of "low level" to "high level." For example, NumPy is a numerical array library for Python; SciPy is a package with scientific computing functions that extends/depends on NumPy; and scikit-learn is a machine learning library that uses both NumPy and SciPy.
The Python topics we will make use of in this course center on this ecosystem, in particular NumPy, SciPy, and scikit-learn.
1.13 Glossary
Machine learning borrows concepts from many other fields and redefines what has been known in other fields under different names. Below is a small glossary of machine learning-specific terms along with some key concepts to help navigate the machine learning literature.