UNDERSTANDING MACHINE LEARNING
From Theory to Algorithms
Shai Shalev-Shwartz
The Hebrew University, Jerusalem
Shai Ben-David
University of Waterloo, Canada
www.cambridge.org
Information on this title: www.cambridge.org/9781107057135
© Shai Shalev-Shwartz and Shai Ben-David 2014
A catalog record for this publication is available from the British Library.
Contents
Preface
1 Introduction
1.1 What Is Learning?
1.2 When Do We Need Machine Learning?
1.3 Types of Learning
1.4 Relations to Other Fields
1.5 How to Read This Book
1.6 Notation
Part 1 Foundations
2 A Gentle Start
2.1 A Formal Model – The Statistical Learning Framework
2.2 Empirical Risk Minimization
2.3 Empirical Risk Minimization with Inductive Bias
2.4 Exercises
6 The VC-Dimension
6.1 Infinite-Size Classes Can Be Learnable
6.2 The VC-Dimension
6.3 Examples
6.4 The Fundamental Theorem of PAC Learning
6.5 Proof of Theorem 6.7
6.6 Summary
6.7 Bibliographic Remarks
6.8 Exercises
7 Nonuniform Learnability
7.1 Nonuniform Learnability
7.2 Structural Risk Minimization
7.3 Minimum Description Length and Occam’s Razor
7.4 Other Notions of Learnability – Consistency
7.5 Discussing the Different Notions of Learnability
7.6 Summary
7.7 Bibliographic Remarks
7.8 Exercises
10 Boosting
10.1 Weak Learnability
10.2 AdaBoost
10.3 Linear Combinations of Base Hypotheses
10.4 AdaBoost for Face Recognition
10.5 Summary
10.6 Bibliographic Remarks
10.7 Exercises
22 Clustering
22.1 Linkage-Based Clustering Algorithms
22.2 k-Means and Other Cost Minimization Clusterings
22.3 Spectral Clustering
22.4 Information Bottleneck*
22.5 A High-Level View of Clustering
22.6 Summary
22.7 Bibliographic Remarks
22.8 Exercises
31 PAC-Bayes
31.1 PAC-Bayes Bounds
31.2 Bibliographic Remarks
31.3 Exercises
References
Index
Preface
The term machine learning refers to the automated detection of meaningful patterns
in data. In the past couple of decades it has become a common tool in almost any
task that requires information extraction from large data sets. We are surrounded
by machine learning–based technology: Search engines learn how to bring us the
best results (while placing profitable ads), antispam software learns to filter our
e-mail messages, and credit card transactions are secured by software that learns
how to detect fraud. Digital cameras learn to detect faces, and intelligent personal
assistant applications on smartphones learn to recognize voice commands. Cars
are equipped with accident-prevention systems that are built using machine learning
algorithms. Machine learning is also widely used in scientific applications such as
bioinformatics, medicine, and astronomy.
One common feature of all of these applications is that, in contrast to more
traditional uses of computers, the complexity of the patterns that need to be
detected prevents a human programmer from providing an explicit, fine-detailed
specification of how such tasks should be executed. Taking a cue from intelligent
beings, many of our skills are acquired or refined by learning from experience
(rather than by following explicit instructions). Machine learning tools are
concerned with endowing programs with the ability to “learn” and adapt.
The first goal of this book is to provide a rigorous, yet easy-to-follow, introduction
to the main concepts underlying machine learning: What is learning? How can a
machine learn? How do we quantify the resources needed to learn a given concept?
Is learning always possible? Can we know whether the learning process succeeded or
failed?
The second goal of this book is to present several key machine learning algo-
rithms. We chose to present algorithms that, on the one hand, are successfully used
in practice and, on the other hand, represent a wide spectrum of different learning
techniques. Additionally, we pay particular attention to algorithms appropriate for
large-scale learning (a.k.a. “Big Data”), since in recent years, our world has become
increasingly “digitized” and the amount of data available for learning is dramati-
cally increasing. As a result, in many applications data is plentiful and computation
time is the main bottleneck. We therefore explicitly quantify both the amount of
data and the amount of computation time needed to learn a given concept.
The book is divided into four parts. The first part aims at giving an initial rigor-
ous answer to the fundamental questions of learning. We describe a generalization
of Valiant’s Probably Approximately Correct (PAC) learning model, which is a first
solid answer to the question “What is learning?” We describe the Empirical Risk
Minimization (ERM), Structural Risk Minimization (SRM), and Minimum Descrip-
tion Length (MDL) learning rules, which show “how a machine can learn.” We
quantify the amount of data needed for learning using the ERM, SRM, and MDL
rules and show how learning might fail by deriving a “no-free-lunch” theorem. We
also discuss how much computation time is required for learning. In the second part
of the book we describe various learning algorithms. For some of the algorithms,
we first present a more general learning principle and then show how the algorithm
follows the principle. While the first two parts of the book focus on the PAC model,
the third part extends the scope by presenting a wider variety of learning models.
Finally, the last part of the book is devoted to advanced theory.
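As a small concrete illustration of the ERM rule mentioned above, the following
Python sketch implements ERM for a finite hypothesis class. The threshold class,
the sample, and every name in it (empirical_risk, erm, and so on) are hypothetical
choices made for this sketch, not code from the book.

# A minimal sketch of Empirical Risk Minimization (ERM) over a finite
# hypothesis class: return a hypothesis making the fewest mistakes on
# the training sample. The threshold class and the data below are
# hypothetical, chosen only to make the sketch runnable.

def empirical_risk(h, sample):
    """Fraction of labeled examples (x, y) that hypothesis h gets wrong."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypothesis_class, sample):
    """The ERM rule: pick any hypothesis minimizing the empirical risk."""
    return min(hypothesis_class, key=lambda h: empirical_risk(h, sample))

# A finite class of threshold predictors: h_t(x) = 1 if x >= t, else 0.
thresholds = [t / 10 for t in range(101)]
hypothesis_class = [lambda x, t=t: int(x >= t) for t in thresholds]

# A small, realizable training sample of (x, label) pairs.
sample = [(1.0, 0), (2.5, 0), (4.0, 1), (7.2, 1), (9.3, 1)]

h_erm = erm(hypothesis_class, sample)
print(empirical_risk(h_erm, sample))  # prints 0.0: the sample is realizable

The same erm function works for any finite class; the theory developed in Part 1
asks how large the sample must be before the empirical risk of the minimizer is a
faithful estimate of its true risk.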
We made an attempt to keep the book as self-contained as possible. However,
the reader is assumed to be comfortable with basic notions of probability, linear
algebra, analysis, and algorithms. The first three parts of the book are intended
for first-year graduate students in computer science, engineering, mathematics, or
statistics. They are also accessible to undergraduate students with an adequate
background. The more advanced chapters can be used by researchers seeking a
deeper theoretical understanding.
ACKNOWLEDGMENTS
The book is based on Introduction to Machine Learning courses taught by Shai
Shalev-Shwartz at Hebrew University and by Shai Ben-David at the University of
Waterloo. The first draft of the book grew out of the lecture notes for the course
that was taught at Hebrew University by Shai Shalev-Shwartz during 2010–2013.
We greatly appreciate the help of Ohad Shamir, who served as a teaching assistant
for the course in 2010, and of Alon Gonen, who served as TA for the course in
2011–2013. Ohad and Alon prepared a few lecture notes and many of the exercises.
Alon, to whom we are indebted for his help throughout the entire making of the
book, has also prepared a solution manual.
We are deeply grateful for the most valuable work of Dana Rubinstein. Dana
has scientifically proofread and edited the manuscript, transforming it from lecture-
based chapters into fluent and coherent text.
Special thanks to Amit Daniely, who helped us with a careful read of the
advanced part of the book and wrote the advanced chapter on multiclass learnabil-
ity. We are also grateful to the members of a book reading club in Jerusalem who
have carefully read and constructively criticized every line of the manuscript. The
members of the reading club are Maya Alroy, Yossi Arjevani, Aharon Birnbaum,
Alon Cohen, Alon Gonen, Roi Livni, Ofer Meshi, Dan Rosenbaum, Dana Rubin-
stein, Shahar Somin, Alon Vinnikov, and Yoav Wald. We would also like to thank
Gal Elidan, Amir Globerson, Nika Haghtalab, Shie Mannor, Amnon Shashua, Nati
Srebro, and Ruth Urner for helpful discussions.