
DATA SCIENCE PROJECT
AN INDUCTIVE LEARNING APPROACH

FILIPE A. N. VERRI

Version 0.1 “Audacious Hatchling” December 16, 2024


Cite this book as:

@misc{verri2024datascienceproject,
author = {Verri, Filipe Alves Neto},
title = {Data Science Project: An Inductive Learning Approach},
year = 2024,
publisher = {Leanpub},
version = {v0.1.0},
doi = {10.5281/zenodo.14498011},
url = {https://leanpub.com/dsp}
}

F. A. N. Verri (2024). Data Science Project: An Inductive Learning Approach. Version v0.1.0.
doi: 10.5281/zenodo.14498011. url: https://leanpub.com/dsp.

Disclaimer: This version is a work in progress. Proofreading and professional editing are
still pending.

The book is typeset with XeTeX using the Memoir class. All figures are original and cre-
ated with TikZ. Proudly written in Neovim with the assistance of GitHub Copilot. The
book cover image was created with the assistance of Gemini and DALL·E 2. We use the
beautiful STIX fonts for text and math. Some icons are from Font Awesome 5 by Dave
Gandy.

Scripture quotations are from The ESV® Bible (The Holy Bible, English Standard Version®),
copyright © 2001 by Crossway, a publishing ministry of Good News Publishers. Used by
permission. All rights reserved.

Data Science Project: An Inductive Learning Approach © 2023–2024 by Filipe A. N. Verri
is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International. To view
a copy of this license, visit creativecommons.org/licenses/by-nc-nd/4.0.
To my wife, for inspiring and supporting me in writing this book.

Above all, God be praised.


Contents

Contents v

Foreword ix

Preface xi

1 A brief history of data science 1


1.1 The term “data science” . . . . . . . . . . . . . . . . . . 3
1.2 Timeline and historical markers . . . . . . . . . . . . . 6
1.2.1 Timeline of data handling . . . . . . . . . . . . 6
1.2.2 Timeline of data analysis . . . . . . . . . . . . . 12

2 Fundamental concepts 19
2.1 Data science definition . . . . . . . . . . . . . . . . . . 21
2.2 The data science continuum . . . . . . . . . . . . . . . 24
2.3 Fundamental data theory . . . . . . . . . . . . . . . . . 26
2.3.1 Phenomena . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Measurements . . . . . . . . . . . . . . . . . . . 28
2.3.3 Knowledge extraction . . . . . . . . . . . . . . . 29

3 Data science project 31


3.1 CRISP-DM . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 ZM approach . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Roles of the ZM approach . . . . . . . . . . . . . 35
3.2.2 Processes of the ZM approach . . . . . . . . . . 36
3.2.3 Limitations of the ZM approach . . . . . . . . . 37
3.3 Agile methodology . . . . . . . . . . . . . . . . . . . . . 37
3.4 Scrum framework . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Scrum roles . . . . . . . . . . . . . . . . . . . . 38
3.4.2 Sprints and backlog . . . . . . . . . . . . . . . . 39


3.4.3 Scrum for data science projects . . . . . . . . . . 41


3.5 Our approach . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 The roles of our approach . . . . . . . . . . . . . 43
3.5.2 The principles of our approach . . . . . . . . . . 45
3.5.3 Proposed workflow . . . . . . . . . . . . . . . . 49

4 Structured data 55
4.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Database normalization . . . . . . . . . . . . . . . . . . 58
4.2.1 Relational algebra . . . . . . . . . . . . . . . . . 59
4.2.2 Normal forms . . . . . . . . . . . . . . . . . . . 60
4.3 Tidy data . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Common messy datasets . . . . . . . . . . . . . 66
4.4 Bridging normalization, tidiness, and data theory . . . . 73
4.4.1 Tidy or not tidy? . . . . . . . . . . . . . . . . . . 73
4.4.2 Change of observational unit . . . . . . . . . . . 76
4.5 Data semantics and interpretation . . . . . . . . . . . . 78
4.6 Unstructured data . . . . . . . . . . . . . . . . . . . . . 79

5 Data handling 81
5.1 Formal structured data . . . . . . . . . . . . . . . . . . 83
5.1.1 Splitting and binding . . . . . . . . . . . . . . . 84
5.1.2 Split invariance . . . . . . . . . . . . . . . . . . 86
5.1.3 Illustrative example . . . . . . . . . . . . . . . . 87
5.2 Data handling pipelines . . . . . . . . . . . . . . . . . . 90
5.3 Split-invariant operations . . . . . . . . . . . . . . . . . 91
5.3.1 Tagged splitting and binding . . . . . . . . . . . 92
5.3.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Joining . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.4 Selecting . . . . . . . . . . . . . . . . . . . . . . 98
5.3.5 Filtering . . . . . . . . . . . . . . . . . . . . . . 99
5.3.6 Mutating . . . . . . . . . . . . . . . . . . . . . . 101
5.3.7 Aggregating . . . . . . . . . . . . . . . . . . . . 102
5.3.8 Ungrouping . . . . . . . . . . . . . . . . . . . . 102
5.4 Other operations . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Projecting or grouping . . . . . . . . . . . . . . 105
5.4.2 Grouped and arranged operations . . . . . . . . 107
5.5 An algebra for data handling . . . . . . . . . . . . . . . 109

6 Learning from data 111


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 113

6.2 The learning problem . . . . . . . . . . . . . . . . . . . 115


6.2.1 Learning tasks . . . . . . . . . . . . . . . . . . . 115
6.2.2 A few remarks . . . . . . . . . . . . . . . . . . . 117
6.3 Optimal solutions . . . . . . . . . . . . . . . . . . . . . 118
6.3.1 Bayes classifier . . . . . . . . . . . . . . . . . . . 118
6.3.2 Regression function . . . . . . . . . . . . . . . . 120
6.4 ERM inductive principle . . . . . . . . . . . . . . . . . 122
6.4.1 Consistency of the learning process . . . . . . . 122
6.4.2 Rate of convergence . . . . . . . . . . . . . . . . 123
6.4.3 VC entropy . . . . . . . . . . . . . . . . . . . . . 123
6.4.4 Growing function and VC dimension . . . . . . 124
6.5 SRM inductive principle . . . . . . . . . . . . . . . . . . 126
6.5.1 Bias-variance trade-off . . . . . . . . . . . . . 129
6.5.2 Regularization . . . . . . . . . . . . . . . . . . . 131
6.6 Linear problems . . . . . . . . . . . . . . . . . . . . . . 132
6.6.1 Perceptron . . . . . . . . . . . . . . . . . . . . . 132
6.6.2 Maximal margin classifier . . . . . . . . . . . . 137
6.7 Closing remarks . . . . . . . . . . . . . . . . . . . . . . 139

7 Data preprocessing 141


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.1 Formal definition . . . . . . . . . . . . . . . . . 144
7.1.2 Degeneration . . . . . . . . . . . . . . . . . . . 144
7.1.3 Data preprocessing tasks . . . . . . . . . . . . . 145
7.2 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.1 Treating inconsistent data . . . . . . . . . . . . 146
7.2.2 Outlier detection . . . . . . . . . . . . . . . . . 148
7.2.3 Treating missing data . . . . . . . . . . . . . . . 149
7.3 Data sampling . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.1 Random sampling . . . . . . . . . . . . . . . . . 152
7.3.2 Scope filtering . . . . . . . . . . . . . . . . . . . 152
7.3.3 Class balancing . . . . . . . . . . . . . . . . . . 153
7.4 Data transformation . . . . . . . . . . . . . . . . . . . . 154
7.4.1 Type conversion . . . . . . . . . . . . . . . . . . 155
7.4.2 Normalization . . . . . . . . . . . . . . . . . . . 157
7.4.3 Dimensionality reduction . . . . . . . . . . . . . 158
7.4.4 Data enhancement . . . . . . . . . . . . . . . . 159
7.4.5 Comments on unstructured data . . . . . . . . . 160

8 Solution validation 161


8.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1.1 Binary classification evaluation . . . . . . . . . 164
8.1.2 Regression estimation evaluation . . . . . . . . 169
8.1.3 Probabilistic classification evaluation . . . . . . 172
8.2 An experimental plan for data science . . . . . . . . . . 175
8.2.1 Sampling strategy . . . . . . . . . . . . . . . . . 176
8.2.2 Collecting evidence . . . . . . . . . . . . . . . . 177
8.2.3 Estimating expected performance . . . . . . . . 180
8.2.4 Comparing strategies . . . . . . . . . . . . . . . 182
8.2.5 About nesting experiments . . . . . . . . . . . . 183
8.3 Final remarks . . . . . . . . . . . . . . . . . . . . . . . 184

A Mathematical foundations 185


A.1 Algorithms and data structures . . . . . . . . . . . . . . 187
A.1.1 Computational complexity . . . . . . . . . . . . 187
A.1.2 Algorithmic paradigms . . . . . . . . . . . . . . 188
A.1.3 Data structures . . . . . . . . . . . . . . . . . . 192
A.2 Set theory . . . . . . . . . . . . . . . . . . . . . . . . . . 195
A.2.1 Set operations . . . . . . . . . . . . . . . . . . . 196
A.2.2 Set operations properties . . . . . . . . . . . . . 196
A.2.3 Relation to Boolean algebra . . . . . . . . . . . . 197
A.3 Linear algebra . . . . . . . . . . . . . . . . . . . . . . . 197
A.3.1 Operations . . . . . . . . . . . . . . . . . . . . . 198
A.3.2 Systems of linear equations . . . . . . . . . . . . 200
A.3.3 Eigenvalues and eigenvectors . . . . . . . . . . 200
A.4 Probability . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.4.1 Axioms of probability and main concepts . . . . 201
A.4.2 Random variables . . . . . . . . . . . . . . . . . 202
A.4.3 Expectation and moments . . . . . . . . . . . . 203
A.4.4 Common probability distributions . . . . . . . . 205
A.4.5 Permutations and combinations . . . . . . . . . 208

B Topics on learning machines 209


B.1 Multi-layer perceptron . . . . . . . . . . . . . . . . . . . 210
B.2 Decision trees . . . . . . . . . . . . . . . . . . . . . . . 212

Bibliography 215

Glossary 221
Foreword

by Ana Carolina Lorena

Data is now a ubiquitous presence, collected at all times and ev-
erywhere. However, the real challenge lies in harnessing this data to
generate actionable insights that guide decision-making and drive in-
novation. This is the essence of data science, a multidisciplinary field
that leverages mathematical, statistical, and computational techniques
to analyse data and solve complex problems.
The book “Data Science Project: An Inductive Learning Approach”
by F.A.N. Verri provides readers with a structured and insightful explo-
ration of the entire data science pipeline, from the initial stages of data
collection to the final step of model deployment. The book effectively
balances theory and practice, focusing on the inductive principles un-
derpinning predictive analytics and machine learning.
While other texts focus solely on machine learning algorithms or
delve deeply into tool-specific details, this book provides a holistic view
of every stage of a data science project. It emphasises the importance of
robust data handling, sound statistical learning principles, and meticu-
lous model evaluation. The author thoughtfully integrates the mathe-
matical foundations and practical considerations needed to design and
execute successful data science projects.
Beyond the technical mechanics, this book challenges readers to crit-
ically evaluate their models’ strengths and limitations. It underscores
the importance of semantics in data handling, equipping readers with
the skills to interpret and transform data meaningfully.
Whether you are a student embarking on your first data science
project or a professional data scientist seeking to expand and refine your
skills, this book’s clarity, rigour, and practical focus make it a guide that
will serve you well in tackling the complex challenges of data-driven

decision-making. The book will expand your understanding and inspire
you to approach data science projects with a commitment to creating re-
sponsible and impactful solutions.
Preface

Dear reader,

This book is based on the lecture notes from my course PO-235 Data
Science Project, which I teach to graduate students at both the Aero-
nautics Institute of Technology (ITA) and the Federal University of São
Paulo (UNIFESP) in Brazil. I have been teaching this subject since 2021,
and I have continually updated the material each year.
Also, I was the coordinator of the Data Science Specialization Pro-
gram (CEDS) at ITA. That experience, which included a great deal of
administrative work, as well as teaching and supervising professionals
in the course, has helped me to understand the needs of the market and
the students.
Moreover, parts of the project development methodology presented
here came from my experience as a lead data scientist in R&D projects
for the Brazilian Air Force, which included projects in areas such as im-
age processing, natural language processing, and spatio-temporal data
analysis.
The literature provides us with a wide range of excellent theoretical ma-
terial on machine learning and statistics, and highly regarded practical
books on data science tools. However, I missed something that could
provide a solid foundation on data science, covering all steps in a data
science project, including its software engineering aspects.
My goal is to provide a book that serves as a textbook for a course on
data science projects or as a reference for professionals working in the
field. I strive to maintain a formal tone while preserving the practical
aspects of the subject. I do not focus on a specific tool or programming
language, but rather seek to explain the semantics of data science tasks
that can be implemented in any programming language.
Also, instead of teaching specific machine learning algorithms, I try
to explain why machine learning works, thereby increasing awareness
of its pitfalls and limitations. For this purpose, I assume you have a
strong mathematical and statistical foundation.
One important artificial constraint I have imposed on the material
(for the sake of the course) is that I only consider predictive methods,
more specifically inductive ones. I do not address topics such as cluster-
ing, association rule mining, transductive learning, anomaly detection,
time series forecasting, reinforcement learning, etc.
I expect my approach to the subject to provide an understanding of all
steps in a data science project, including a deeper focus on correct eval-
uation and validation of data science solutions.
Note that, in this book, I openly express my opinions and beliefs. On
several occasions they may sound controversial. I am not trying to be rude
or to demean any researcher or practitioner in the field; rather, I aim to
be honest and transparent.

I’d rather be bold and straightforward than be timid about my beliefs.

I hope you enjoy reading.



I intend to make this book forever free and
open-source. You can find the source code at
github.com/verri/dsp-book. Derivatives are not
allowed, but you can contribute to the book. Contrib-
utors will be acknowledged here.

However, if you like this book, consider purchasing
the e-book version at leanpub.com/dsp. Any amount
you pay will help me keep this book updated and write
new books.

If you have suggestions or questions, please open
or join a discussion at github.com/verri/dsp-
book/discussions. Feel free to ask anything.
Theoretical discussions and practical advice are
also welcome.

If you want to support me, you can buy me a coffee at
buymeacoffee.com/verri. I will greatly appreciate it.
I have a number of ideas for new books and courses,
and financial support will enable me to make them a
reality.

Contributors
I would like to thank the following contributors for their help in improv-
ing this book:

• Johnny C. Marques
• Manoel V. Machado (aka ryukinix)
• Vitor V. Curtis

All contributors have freely waived their rights to the content they
contributed to this book.
1 A brief history of data science
“Begin at the beginning,” the King said gravely, “and go on till
you come to the end: then stop.”
— Lewis Carroll, Alice in Wonderland

There are many points of view regarding the origin of data science. For
the sake of contextualization, I separate the topic into two approaches:
the history of the term itself and a broad timeline of data-driven sciences,
highlighting the important figures in each age.
I believe that the history of the term is important for understand-
ing the context of the discipline. Over the years, the term has been em-
ployed to label quite different fields of study. Before presenting my view
on the term, I present the views of two influential figures in the history
of data science: Peter Naur and William Cleveland.
Moreover, studying the key facts and figures in the history of data-
driven sciences enables us to comprehend the evolution of the field and
hopefully guide us towards evolving it further. Besides, history also
teaches us ways to avoid repeating the same mistakes.
Most of the significant theories and methods in data science have
been developed simultaneously across different fields, such as statistics,
computer science, and engineering. The history of data-driven sciences
is long and rich. I present a timeline of the ages of data handling and
the most important milestones of data analysis.
I do not intend to provide a comprehensive history of data science.
I aim to provide enough context to support the development of the ma-
terial in the following chapters, sometimes avoiding directions that are
not relevant in the context of inductive learning.


Chapter remarks

Contents
1.1 The term “data science” . . . . . . . . . . . . . . . . . . . 3
1.2 Timeline and historical markers . . . . . . . . . . . . . . 6
1.2.1 Timeline of data handling . . . . . . . . . . . . . 6
1.2.2 Timeline of data analysis . . . . . . . . . . . . . . 12

Context

• The term “data science” is recent and has been used to label rather
different fields.
• The history of data-driven sciences is long and rich.
• Many important theories and methods in data science have been
developed simultaneously in different fields.
• The history of data-driven sciences is important to understand the
evolution of the field.

Objectives

• Understand the history of the term “data science.”


• Recognize the major milestones in the history of data-driven sci-
ences.
• Identify important figures in the field of data-driven sciences.

Takeaways

• We have evolved both in terms of theory and application of data-
driven sciences.
• There is no consensus on the definition of data science (including
which fields it encompasses).
• However, there is sufficient evidence to support data science as a
distinct science.

1.1 The term “data science”


The term data science is relatively recent and has been used to label
rather different fields of study. In the following, I emphasize the history
of a few notable usages of the term.

Peter Naur (1928 – 2016) The term “data science” itself was coined in
the 1960s by Peter Naur (/naʊə/). Naur was a Danish computer scientist
and mathematician who made significant contributions to the field of
computer science, including his work on the development of program-
ming languages1. His ideas and concepts laid the groundwork for the
way we think about programming and data processing today.
Naur disliked the term computer science and suggested it be called
datalogy or data science. In the 1960s, the subject was practised in Den-
mark under Peter Naur’s term datalogy, which means the science of data
and data processes.
He coined this term to emphasize the importance of data as a fun-
damental component of computer science and to encourage a broader
perspective on the field that included data-related aspects. At that time,
the field was primarily centered on programming techniques, but Naur’s
concept broadened the scope to recognize the intrinsic role of data in
computation.
In his book2, “Concise Survey of Computer Methods”, he starts from
the concept that data is “a representation of facts or ideas in a formalised
manner capable of being communicated or manipulated by some pro-
cess.”3 Note however that his view of the science only “deals with data
[…] while the relation of data to what they represent is delegated to other
fields and sciences.”
It is interesting to see the central role he gave to data in the field of
computer science. His view that the relation of data to what they rep-
resent is delegated to other fields and sciences is still debatable today.
Some scientists argue that data science should focus on the techniques
to deal with data, while others argue that data science should encom-
pass the whole business domain. A depiction of Naur’s view of data
science is shown in fig. 1.1.

1He is best remembered as a contributor, with John Backus, to the Backus–Naur form
(BNF) notation used in describing the syntax for most programming languages.
2P. Naur (1974). Concise Survey of Computer Methods. Lund, Sweden: Studentlitter-
atur. isbn: 91-44-07881-1. url: http://www.naur.com/Conc.Surv.html.
3I. H. Gould (ed.): ‘IFIP guide to concepts and terms in data processing’, North-
Holland Publ. Co., Amsterdam, 1971.

Figure 1.1: Naur’s view of data science. [Diagram labels: Data science; Computer science; Domain expertise.]
For Naur, data science studies the techniques to deal with data,
but he delegates the meaning of data to other fields.

William Cleveland (born 1943) In 2001, a prominent statistician
used the term “data science” in his work to describe a new discipline
that comes from his “plan to enlarge the major areas of technical work
of the field of statistics4.” In 2014, that work was republished5. He ad-
vocates the expansion of statistics beyond theory into technical areas,
significantly changing statistics. Thus, it warranted a new name.
As a result, William Swain Cleveland II is credited with defining data
science as it is most used today. He is a highly influential figure in the
fields of statistics, machine learning, data visualization, data analysis
for multidisciplinary studies, and high performance computing for deep
data analysis.
In his view, data science is the “modern” statistics, where it is en-
larged by computer science methods and domain expertise. An illus-
tration of Cleveland’s view of data science is shown in fig. 1.2. It is im-
portant to note that Cleveland never defined an explicit list of computer
science fields and business domains that should be included in the new
discipline. The main idea is that statistics should rely on computational

4W. S. Cleveland (2001). “Data Science: An Action Plan for Expanding the Technical
Areas of the Field of Statistics”. In: ISI Review. Vol. 69, pp. 21–26.
5W. S. Cleveland. Data Science: An Action Plan for the Field of Statistics. Statistical
Analysis and Data Mining, 7:414–417, 2014. reprinting of 2001 article in ISI Review, Vol
69.

Figure 1.2: Cleveland’s view of data science. [Diagram labels: Data science; Domain expertise; Statistics; Computer science.]
For Cleveland, data science is the “modern” statistics, where it is
enlarged by computer science and domain expertise.

methods and that the domain expertise should be considered in the anal-
ysis.

Buzzword or a new science? Be aware that scientific literature has
no consensus on the definition of data science, and it is still considered
by some to be a buzzword6.
Most of the usages of the term in literature and in the media are
either a rough reference to a set of data-driven techniques or a marketing
strategy. Naur (fig. 1.1) and Cleveland (fig. 1.2) are among the few that
try to carefully define the term. However, neither of them sees data
science as an independent field of study, but rather as an enlarged scope
of an existing science. I disagree; the social and economic demand for
data-driven solutions has led to an evolution in our understanding of
the challenges we are facing. As a result, we see many “data scientists”
being hired and many “data science degree” programs emerging.
In chapter 2, I dare to provide (yet another) definition for the term.
I argue that its object of study can be precisely established to support it
as a new science.

6Press, Gil. “Data Science: What’s The Half-Life of a Buzzword?”. Forbes. Available
at forbes.com/sites/gilpress/2013/08/19/data-science-whats-the-half-life-of-a-buzzword.

1.2 Timeline and historical markers


Kelleher and Tierney (2018)7 provide an interesting timeline of data-
driven methods and influential figures in the field. I reproduce it here
with some changes, including some omissions and additions. On the
subject of data analysis, I include some exceptional remarks from V. N.
Vapnik (1999)8.
I first address data handling — which includes data sources, collec-
tion, organization, storage, and transformation —, and then data analy-
sis and knowledge extraction.

1.2.1 Timeline of data handling


The importance of collecting and organizing data goes without saying.
Data fuels analysis and decision making. In the following, I present
some of the most important milestones in the history of data handling.

Figure 1.3: Timeline of the ages of data handling. [Pre-digital: 3,800 BC – 18th c.; Digital: 1890 – 1960; Formal: 1970s; Integrated: 1980 – 1990; Ubiquitous: 2000 –.]

Figure 1.3 illustrates the proposed timeline. Ages have no absolute
boundaries; rather, they mark periods in which important events happened.
Also, observe that the timescale is not linear. The Pre-digital Age is the
longest period, and one could divide it into smaller periods. My choices
of ages and their boundaries are motivated by didactic reasons.

Pre-digital age
We can consider the earliest records of data collection to be the notches
on sticks and bones (probably) used to keep track of the passing of time.
The Lebombo bone, a baboon fibula with notches, is one of the earliest
known mathematical objects. It was found in the Lebombo Mountains
located between South Africa and Eswatini. It is estimated to be more
than 40,000 years old. It is conjectured to be a tally stick, but its exact

7J. D. Kelleher and B. Tierney (2018). Data science. The MIT Press.
8V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-Verlag
New York, Inc. isbn: 978-1-4419-3160-3.

purpose is unknown. Its 29 notches suggest that it may have been used
as a lunar phase counter. However, since it is broken at one end, the 29
notches may or may not be the total number9.
Another milestone in the history of data collection is the record of
demographic data. One of the first known censuses was conducted in
3,800 BC in the Babylonian Empire. It was ordered to assess the popu-
lation and resources of the empire. Records were stored on clay tiles10.
Since the early forms of writing, humanity’s ability to register data
and events has increased significantly. The first known written records date
back to around 3,500 BC, the Sumerian archaic (pre-cuneiform) writ-
ing. This writing system was used to represent commodities using clay
tokens and to record transactions11.
“Data storage” was also a challenge. Some important devices that
increased our capacity to register textual information are the Sumerian
clay tablets (3,500 BC), the Egyptian papyrus (3,000 BC), the Roman wax
tablets (100 BC), the codex (100 AD), the Chinese paper (200 AD), the
printing press (1440), and the typewriter (1868).
Besides those improvements in unstructured data storage, at least
in the Western and Middle Eastern world, there were no significant ad-
vances in structured data collection until the 19th century. (An Eastern
timeline research seems much harder to perform. Unfortunately, I left
it out in this book.)
I consider a major influential figure in the history of data collection
to be Florence Nightingale (1820 – 1910). She was a passionate statis-
tician and probably the first person to use statistics to influence pub-
lic and official opinion. The meticulous records she kept during the
Crimean War (1853 – 1856) were the evidence that saved lives — part of
the mortality came from lack of sanitation. She was also the first to use
statistical graphics to present data in a way that was easy to understand.
She is credited with developing a form of the pie chart now known as the
polar area diagram. She also reformed healthcare in the United King-
dom and is considered the founder of modern nursing, a role in which a great
part of the work was to collect data in a standardized way to quickly
draw conclusions12.

9P. B. Beaumont and R. G. Bednarik (2013). In: Rock Art Research 30.1, pp. 33–54.
doi: 10.3316/informit.488018706238392.
10C. G. Grajalez et al. (2013). “Great moments in statistics”. In: Significance 10.6,
pp. 21–28. doi: 10.1111/j.1740-9713.2013.00706.x.
11G. Ifrah (1998). The Universal History of Numbers, from Prehistory to the Invention of
the Computer. First published in French, 1994. London: Harvill. isbn: 1 86046 324 x.
12C. G. Grajalez et al. (2013). “Great moments in statistics”. In: Significance 10.6, pp. 21–28. doi: 10.1111/j.1740-9713.2013.00706.x.

Digital age
In the modern period, several devices were developed to store digital13
information. One particular device that is important for data collection
is the punched card. It is a piece of stiff paper that contains digital infor-
mation represented by the presence or absence of holes in predefined
positions. The information can be read by a mechanical or electrical de-
vice called a card reader. The earliest famous usage of punched cards
was by Basile Bouchon (1725) to control looms. Most of the advances un-
til the 1880s were about the automation of machines (data processing)
using hand-punched cards, and not particularly specialized for data col-
lection.
However, the 1890 census in the United States was the first to use
machine-readable punched cards to tabulate data. Processing 1880 cen-
sus data took eight years, so the Census Bureau contracted Herman
Hollerith (1860 – 1929) to design and build a tabulating machine. He
founded the Tabulating Machine Company in 1896, which later merged
with other companies to become International Business Machines Cor-
poration (IBM) in 1924. Later models of the tabulating machine were
widely used for business applications such as accounting and inventory
control. Punched card technology remained a prevalent method of data
processing for several decades until more advanced electronic comput-
ers were developed in the mid-20th century.
The invention of the digital computer is responsible for a revolution
in data handling. The amount of information we can capture and store
increased exponentially. ENIAC (1945) was the first electronic general-
purpose computer. It was Turing-complete, digital, and capable of being
reprogrammed to solve a full range of computing problems. It had 20
words of internal memory, each capable of storing a 10-digit decimal
integer number. Programs and data were entered by setting switches
and inserting punched cards.
For the 1950 census, the United States Census Bureau used the UNI-
VAC I (Universal Automatic Computer I), the first commercially pro-
duced computer in the United States14.
It goes without saying that digital computers have become much
more powerful and sophisticated since then. The data collection process

13Digital means the representation of information in (finite) discrete form. The term
comes from the Latin digitus, meaning finger, because it is the natural way to count using
fingers. Digital here does not mean electronic.
14Read more in https://www.census.gov/history/www/innovations/.

has been easily automated and scaled to a level that was unimaginable
before. However, “where” to store data is not the only challenge. “How”
to store it is also a challenge. The next period of history addresses
this issue.

Formal age
In 1970, Edgar Frank Codd (1923 – 2003), a British computer scientist,
published a paper entitled “A Relational Model of Data for Large Shared
Data Banks”15. In this paper, he introduced the concept of a relational
model for database management.
A relational model organizes data in tables (relations) where each
row represents a record and each column represents an attribute of the
record. The tables are related by common fields. Codd showed means to
organize the tables of a relational database to minimize data redundancy
and improve data integrity. Section 4.2 provides more details on the
topic.
His work was a breakthrough in the field of data management. The
standardization of relational databases led to the development of struc-
tured query language (SQL) in 1974. SQL is a domain-specific language
used in programming and designed for managing data held in a rela-
tional database management system (RDBMS).
As a result, a new challenge rapidly emerged: how to aggregate data
from different sources. Once data is stored in a relational database, it is
easy to query and manage it. However, data is usually stored in different
databases, and it is not always possible to directly combine them.

Integrated age
The solution to this problem was the development of the extract, trans-
form, load (ETL) process. ETL is a process in data warehousing responsi-
ble for extracting data from several sources, transforming it into a format
that can be analyzed, and loading it into a data warehouse.
The concept of data warehousing dates back to the late 1980s when
IBM researchers Barry Devlin and Paul Murphy developed the “busi-
ness data warehouse.”
Two major figures in the history of ETL are Ralph Kimball (born
1944) and Bill Inmon (born 1945), both American computer scientists.

15E. F. Codd (1970). “A Relational Model of Data for Large Shared Data Banks”. In:
Commun. ACM 13.6, pp. 377–387. issn: 0001-0782. doi: 10.1145/362384.362685.

Although they differ in their approaches, they both agree that data ware-
housing is the foundation for business intelligence (BI) and analytics,
and that data warehouses should be designed to be easy to understand
and fast to query for business users.
A famous debate between Kimball and Inmon is the top-down ver-
sus bottom-up approach to data warehousing. Inmon’s approach is top-
down, where the data warehouse is designed first and then the data
marts16 are created from the data warehouse. Kimball’s approach is
bottom-up, where the data marts are created first and then the data ware-
house is created from the data marts.
One of the earliest and most famous case studies of the implemen-
tation of a data warehouse is that of Walmart. In the early 1990s, Wal-
mart faced the challenge of managing and analyzing vast amounts of
data from its stores across the United States. The company needed a so-
lution that would enable comprehensive reporting and analysis to sup-
port decision-making processes. The solution was to implement a data
warehouse that would integrate data from various sources and provide
a single source of truth for the organization.

Ubiquitous age
The last and current period of history is the ubiquitous age. It is charac-
terized by the proliferation of data sources.
The ubiquity of data generation and the evolution of data-centric
technologies have been made possible by a multitude of figures across
various domains.

• Vinton Gray Cerf (born 1943) and Robert Elliot Kahn (born 1938),
often referred to as the “Fathers of the Internet,” developed the
TCP/IP protocols, which are fundamental to internet communi-
cation.
• Tim Berners-Lee (born 1955), credited with inventing the World
Wide Web, laid the foundation for the massive data flow on the
internet.
• Steven Paul Jobs (1955 – 2011) and Stephen Wozniak (born 1950),
from Apple Inc., and William Henry Gates III (born 1955), from
Microsoft Corporation, were responsible for the introduction of

16A data mart is a specialized subset of a data warehouse that is designed to serve the
needs of a specific business unit, department, or functional area within an organization.

personal computers, leading to the democratization of data gener-
ation.
• Lawrence Edward Page (born 1973) and Sergey Mikhailovich Brin
(born 1973), the founders of Google, transformed how we access
and search for information.
• Mark Elliot Zuckerberg (born 1984), the co-founder of Facebook,
played a crucial role in the rise of social media and the generation
of vast amounts of user-generated content.

In terms of data handling, this change of landscape has brought
about the development of new technologies and techniques for data stor-
age and processing, especially NoSQL databases
and distributed computing frameworks.
NoSQL databases are non-relational databases that can store and
process large volumes of unstructured, semi-structured, and structured
data. They are highly scalable and flexible, making them ideal for big
data applications.
Some authors argue that the rise of big data is characterized by the
five V’s of big data: Volume, Velocity, Variety, Veracity, and Value. The
amount of data generated is massive, the speed at which data is gener-
ated is high, the types of data generated are diverse, the quality of data
generated is questionable, and the value of data generated is high.
Once massive amounts of unstructured data became available, the
need for new data processing techniques arose. The development of
distributed computing frameworks such as Apache Hadoop and Apache
Spark enabled the processing of massive amounts of data in a distributed
manner.
Douglass Read Cutting and Michael Cafarella, the developers of the
software Apache Hadoop, proposed both the Hadoop distributed file sys-
tem (HDFS) and MapReduce, which are the cornerstones of the Hadoop
framework, in 2006. Hadoop’s distributed storage and processing capa-
bilities enabled organizations to handle and analyze massive volumes
of data.
Currently, Google holds a patent for MapReduce17. However, their
framework inherits from the architecture proposed in Hillis’s (1985)18
17J. Dean and S. Ghemawat (Jan. 2008). “MapReduce: simplified data processing on
large clusters”. In: Commun. ACM 51.1, pp. 107–113. issn: 0001-0782. doi: 10.1145/
1327452.1327492.
18W. D. Hillis (1985). “The Connection Machine”. Hillis, W.D.: The Connection Ma-
chine. PhD thesis, MIT (1985). Cambridge, MA, USA: Massachusetts Institute of Tech-
nology. url: http://hdl.handle.net/1721.1/14719.

thesis. MapReduce is not particularly novel, but its simplicity and scalabil-
ity made it popular.
Nowadays, another important topic is internet of things (IoT). IoT
is a system of interrelated computing devices that communicate with
each other over the internet. The devices range from cell-
phones, coffee makers, washing machines, headphones, lamps, and wear-
able devices to almost anything else you can think of. The reality of
IoT increased the challenges of data handling, especially in terms of data
storage and processing.
In summary, we currently live in a world where data is ubiquitous
and comes in many different forms. The challenge is to collect, store,
and process this data in a way that is meaningful and useful, while re-
specting privacy and security.

1.2.2 Timeline of data analysis


The way we think about data and knowledge extraction has evolved sig-
nificantly over the years. In the following, I present some of the most
important milestones in the history of data analysis and knowledge ex-
traction.

Figure 1.4: Timeline of the ages of data analysis. [Summary statistics: 3,800 BC – 16th c.; Probability advent: 17th c. – 19th c.; Learning from data: 20th c. – present.]

Figure 1.4 illustrates the proposed timeline. I consider changes of
ages to be smooth transitions, not strict boundaries. Theoretical
advances are slower than technological ones, and the latter influence
data handling more than data analysis, so not much has changed
since the beginning of the 20th century.

Summary statistics
The earliest known records of systematic data analysis date back to the
first censuses. The term statistics itself refers to the analysis of data about
the state, including demographics and economics. That early (and sim-
plest) form of statistical analysis is called summary statistics, which con-
sists of describing data in terms of its central tendencies (e.g., arithmetic
mean) and variability (e.g., range).

Probability advent
However, after the 17th century, the foundations of modern probability
theory were laid out. Important figures for developing that theory are
Blaise Pascal (1623 – 1662), Pierre de Fermat (1601 – 1665), Christiaan
Huygens (1629 – 1695), and Jacob Bernoulli (1655 – 1705).
These foundational methods gave rise to the field of statistical infer-
ence. In the following years, important results were achieved.

Bayes’ rule Reverend Thomas Bayes (1701 – 1761) was an English
statistician, philosopher, and Presbyterian minister. He is known for
formulating a specific case of the theorem that bears his name: Bayes’
theorem. The theorem is used to calculate conditional probabilities us-
ing an algorithm (his Proposition 9, published in 1763) that uses evi-
dence to calculate limits on an unknown parameter.
Bayes’ rule is the foundation of learning from evidence, since it
allows us to calculate the probability of an event based on prior knowl-
edge of conditions that might be related to the event. Classifiers based
on Naïve Bayes — the application of Bayes’ theorem with strong inde-
pendence assumptions between known variables — are likely to have
been used since the second half of the eighteenth century.
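
For reference, the rule in its usual modern form, with notation introduced here only for illustration, states that for events $A$ and $B$ with $P(B) > 0$,

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\]

where $P(A)$ encodes the prior knowledge about $A$ and $P(A \mid B)$ is its updated probability after observing the evidence $B$.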

Gauss’ method of least squares Johann Carl Friedrich Gauss (1777
– 1855) was a German mathematician and physicist who made signifi-
cant contributions to many fields in mathematics and sciences. Circa
1794, he developed the method of least squares for calculating the orbit
of Ceres to minimize the impact of measurement error19.
The method of least squares marked the beginning of the field of
regression analysis. It marked a shift toward finding the solution of systems of
equations, especially overdetermined systems, using data instead
of theoretical models.
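
In modern notation, introduced here only as an illustrative sketch, given observations $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, of an overdetermined linear system, the method chooses the coefficients that minimize the sum of squared residuals,

\[
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \mathbf{x}_i^{\top} \beta \right)^2 ,
\]

which, when $X^{\top} X$ is invertible for the design matrix $X$, admits the closed-form solution $\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$.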

Playfair’s data visualization Another change in the way we analyze
data was the development of data visualization. Data visualization is the
graphical representation of information and data.
William Playfair (1759 – 1823) was a secret agent on behalf of Great
Britain during its war with France in the 1790s. He invented several
types of diagrams between the 1780s and 1800s, such as the line, area,
19The method was first published by Adrien-Marie Legendre (1752 – 1833) in 1805,
but Gauss claimed in 1809 that he had been using it since circa 1794.

and bar chart of economic data, and the pie chart and circle graph to
show proportions.

Learning from data


In the twentieth century and beyond, new advances were made in the
field of statistics. The development of learning machines enabled
new techniques for data analysis.
The recent advances in computation and data storage are crucial for
the large-scale application of these techniques.
This era is characterized by a change of focus from trying to fit data
to a theoretical model to trying to extract knowledge from data. The
main goal is to develop algorithms that can learn from data with mini-
mal human intervention.

Fisher’s discriminant analysis In the 1930s, Sir Ronald A. Fisher
(1890 – 1962), a British polymath, developed discriminant analysis20,
which was initially used to find linear functions to solve the problem of
separating two or more classes of objects21.
The method is based on the so-called Fisher discriminant, which is
a linear combination of variables. The method can be used not only for
classification but also for dimensionality reduction.
By tackling the question of how much each variable matters for a partic-
ular task, Fisher’s work increased the understanding of the importance
of feature selection in data analysis.
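
As an illustrative sketch in modern notation, for two classes the Fisher discriminant is the direction $\mathbf{w}$ that maximizes the ratio of between-class to within-class scatter,

\[
J(\mathbf{w}) = \frac{\left( \mathbf{w}^{\top} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \right)^2}{\mathbf{w}^{\top} \left( \mathbf{S}_1 + \mathbf{S}_2 \right) \mathbf{w}},
\]

where $\boldsymbol{\mu}_k$ and $\mathbf{S}_k$ denote the mean vector and the scatter matrix of class $k$; projecting the data onto $\mathbf{w}$ yields the linear combination of variables mentioned above.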

Shannon’s information theory The field, which studies the quantifica-
tion, storage, and communication of information, was originally es-
tablished by the works of Harry Nyquist (1889 – 1976) and Ralph Hart-
ley (1888 – 1970) in the 1920s, and Claude Shannon (1916 – 2001) in
the 1940s. Information theory brought many important concepts to the
field of data analysis, such as entropy, mutual information, and infor-
mation gain. This theory is the foundation of several machine learning
algorithms.
Information theory sees data as a sequence of symbols that can be
compressed and transmitted. The theory is used to quantify the amount
of information in a data set. It also changed dominant paradigms in

20https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf
21After Rosenblatt’s work, however, it was used to solve inductive inference (classifi-
cation) as well. For curiosity, Fisher’s paper introduced the famous Iris data set.

the field of statistics, such as the use of likelihood functions and the
Bayesian approach.
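
As a concrete instance of these concepts, stated here in its standard form, the Shannon entropy of a discrete random variable $X$ taking values with probabilities $p_1, \dots, p_n$ is

\[
H(X) = - \sum_{i=1}^{n} p_i \log_2 p_i ,
\]

measured in bits. It quantifies the average amount of information conveyed by observing $X$, and quantities such as mutual information and information gain are derived from it.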

K-Nearest Neighbors In 1951, Evelyn Fix (1904 – 1965) and Joseph
L. Hodges Jr. (1922 – 2000) wrote a technical report entitled “Discrim-
inatory Analysis, Nonparametric Discrimination: Consistency Proper-
ties.” In this paper, they proposed the k-nearest neighbors algorithm,
which is a non-parametric method used for classification and regression.
The algorithm marks a shift from traditional parametric methods — and
purely statistical ones — to non-parametric methods.
It also shows how intuitive models can be used to solve complex
problems. The k-nearest neighbors algorithm is based on the idea that
objects that are similar are likely to be in the same class.
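
As a sketch of the decision rule in modern notation, given a distance function and a query point $\mathbf{x}$, let $N_k(\mathbf{x})$ be the set of indices of the $k$ training points closest to $\mathbf{x}$; the predicted class is the majority vote

\[
\hat{y}(\mathbf{x}) = \operatorname*{arg\,max}_{c} \sum_{i \in N_k(\mathbf{x})} \mathbb{1}\!\left[ y_i = c \right],
\]

while, for regression, the average of the neighbors’ responses is used instead.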

Rosenblatt’s perceptron In the 1960s, a psychologist called Frank
Rosenblatt (1928 – 1971) developed the perceptron, the first model of
a learning machine. While the idea of a mathematical neuron was not
new, he was the first to describe the model as a program, showing the
ability of the perceptron to learn simple tasks such as the logical opera-
tions AND and OR.
This work was the foundation of the field of artificial neural net-
works. The “training” of the perceptron was a breakthrough in the field
of learning machines, drawing attention to the field of artificial intelli-
gence.
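
As a sketch in modern notation of what this learning machine computes, the perceptron outputs

\[
\hat{y} = \operatorname{sign}\!\left( \mathbf{w}^{\top} \mathbf{x} + b \right),
\]

and training updates the parameters after each misclassified example $(\mathbf{x}, y)$, with $y \in \{-1, +1\}$, as $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y\, \mathbf{x}$ and $b \leftarrow b + \eta\, y$ for some learning rate $\eta > 0$. Logical AND and OR are linearly separable, which is why a single perceptron can learn them.
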
A few years later, the book “Perceptrons: an introduction to com-
putational geometry” by Marvin Minsky and Seymour Papert in 1969
drew attention to the limitations of the perceptron22. They showed that
a single-layer perceptron was limited to linearly separable problems, a
fact that led to a decline in the interest in neural networks. Consult sec-
tion 6.6.1 for more details about the technique.
This fact contributed to the first AI winter, resulting in funding cuts
for neural network research.

Hunt inducing trees In 1966, Hunt, Marin, and Stone’s book23 de-
scribed a way to induce decision trees from data. The algorithm is based

22Although Rosenblatt was aware of the limitations of the perceptron and was probably
working on solutions, he died in 1971.
23E. B. Hunt, J. Marin, and P. J. Stone (1966). Experiments in Induction. New York, NY,
USA: Academic Press.

on the concept of information entropy and is a precursor of Quin-
lan’s ID3 algorithm24 and its variations. These algorithms gave rise to
the field of decision trees, which is a popular method for classification
and regression.
Trees are intuitive models that can be easily interpreted by humans.
They are based on symbolic rules that can be used to explain their inter-
nal decision-making process.

Empirical risk minimization principle Although many learning
machines were developed up to the 1960s, they did not significantly
advance the understanding of the general problem of learning from data.
Between the 1960s and 1986, before the backpropagation algorithm
was proposed, the field of practical data analysis was basically stag-
nant. The main reason for that was the lack of a theoretical framework
to support the development of new learning machines.
However, these years were not completely unfruitful. As early as
1968, Vladimir Vapnik (born 1936) and Alexey Chervonenkis (1938 –
2014) developed the fundamental concepts of VC entropy and VC di-
mension for data classification problems. As a result, a novel inductive
principle was proposed: the empirical risk minimization (ERM) princi-
ple. This principle is the foundation of statistical learning theory.
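
As a sketch of the principle in modern notation, given a loss function $L$, a sample $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$, and a fixed class of functions $\mathcal{F}$, ERM selects the function in $\mathcal{F}$ that minimizes the empirical risk

\[
R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} L\!\left( f(\mathbf{x}_i), y_i \right),
\]

as a computable surrogate for the true expected risk, which is unavailable because the underlying data distribution is unknown.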

Resurgence of neural networks In 1986, researchers independently
developed a method to optimize the coefficients of a multi-layer neural
network25. The method is called backpropagation and is the founda-
tion of the resurgence of neural networks. The technique enabled the
training of artificial networks that can solve nonlinearly separable prob-
lems.
This rebirth of neural networks happened in a scenario very differ-
ent from the 1960s. The availability of data and computational power
fueled a new approach to the problem of learning from data. The new
approach preferred the use of simple algorithms and intuitive models
over theoretical models, fueling areas such as bio-inspired computing
and evolutionary computation.
24J. R. Quinlan (1986). “Induction of Decision Trees”. In: Machine Learning 1, pp. 81–
106. url: https://api.semanticscholar.org/CorpusID:13252401.
25Y. Le Cun (1986). “Learning Process in an Asymmetric Threshold Network”. In:
Disordered Systems and Biological Organization. Berlin, Heidelberg: Springer Berlin Hei-
delberg, pp. 233–240. isbn: 978-3-642-82657-3; D. E. Rumelhart, G. E. Hinton, and
R. J. Williams (1986). “Learning representations by back-propagating errors”. In: Nature
323.6088, pp. 533–536. doi: 10.1038/323533a0.

Ensembles Following the new approach, ensemble methods were de-
veloped. Based on ideas of boosting26 and bagging27, ensemble methods
combine multiple learning machines to improve the performance of the
individual machines.
The difference between boosting and bagging is the way the ensem-
ble is built. In boosting, the ensemble is built sequentially, where each
new model tries to correct the errors of the previous models. In bagging,
the ensemble is built in parallel, where each model is trained indepen-
dently with small changes in the data. The most famous bagging ensem-
ble method is the random forest28, while XGBoost, a gradient boosting
method29, has been extensively used in machine learning competitions.

Support vector machines In 1995, Cortes and V. N. Vapnik (1995)30
proposed the support vector machine (SVM) algorithm, a learning ma-
chine based on the VC theory and the ERM principle. Based on Cover’s
theorem31, they developed a method that finds the optimal hyperplane
that separates two classes of data in a high-dimensional space with the
maximum margin. The resulting method led to practical and efficient
learning machines.
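
As an illustrative sketch in modern notation, the hard-margin variant solves

\[
\min_{\mathbf{w},\, b} \; \tfrac{1}{2} \lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1, \quad i = 1, \dots, n,
\]

whose solution is the separating hyperplane with the largest margin; soft-margin and kernelized variants extend the idea to noisy and nonlinearly separable data.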

Deep learning Although the idea of neural networks with multiple
layers had been around since the 1960s, only in the late 2000s did the field of
deep learning catch the attention of the scientific community by achiev-
ing state-of-the-art results in computer vision and natural language pro-
cessing. Yoshua Bengio, Geoffrey Hinton and Yann LeCun are recog-
nized for their conceptual and engineering breakthroughs in the field,
winning the 2018 Turing Award32.

26R. E. Schapire (1990). “The strength of weak learnability”. In: Machine Learning 5.2,
pp. 197–227. doi: 10.1007/BF00116037.
27L. Breiman (1996). “Bagging predictors”. In: Machine Learning 24.2, pp. 123–140.
doi: 10.1007/BF00058655.
28T. K. Ho (1995). “Random decision forests”. In: Proceedings of 3rd International
Conference on Document Analysis and Recognition. Vol. 1, 278–282 vol.1. doi: 10.1109/
ICDAR.1995.598994.
29J. H. Friedman (2001). “Greedy function approximation: A gradient boosting ma-
chine.” In: The Annals of Statistics 29.5, pp. 1189–1232. doi: 10.1214/aos/1013203451.
30C. Cortes and V. N. Vapnik (1995). “Support-vector networks”. In: Machine Learning
20.3, pp. 273–297. doi: 10.1007/BF00994018.
31T. M. Cover (1965). “Geometrical and Statistical Properties of Systems of Linear In-
equalities with Applications in Pattern Recognition”. In: IEEE Transactions on Electronic
Computers EC-14.3, pp. 326–334. doi: 10.1109/PGEC.1965.264137.
32https://awards.acm.org/about/2018-turing

LUSI learning theory In the 2010s, the researchers V. N. Vapnik
and Izmailov (2015)33 proposed the learning using statistical invariants
(LUSI) principle, which is an extension of statistical learning theory.
The LUSI theory is based on the concept of statistical invariants, which
are properties of the data that are preserved under transformations. The
theory is the foundation of the learning from intelligent teachers para-
digm. They regard the LUSI theory as the next step in the evolution of
learning theory, calling it the “complete statistical theory of learning.”

33V. N. Vapnik and R. Izmailov (2015). “Learning with Intelligent Teacher: Similarity
Control and Knowledge Transfer”. In: Statistical Learning and Data Sciences. Ed. by A.
Gammerman, V. Vovk, and H. Papadopoulos. Cham: Springer International Publishing,
pp. 3–32. isbn: 978-3-319-17091-6.
2 Fundamental concepts
The simple believes everything,
but the prudent gives thought to his steps.
— Proverbs 14:15 (ESV)

A useful starting point for someone studying data science is a definition
of the term itself. In this chapter, I discuss some common definitions
and provide a definition of my own. As discussed in chapter 1, there is
no consensus on the definition of data science. However, the existing definitions agree
that data science is cross-disciplinary and a very important field of study.
Another important discussion is the evidence that data science is ac-
tually a new science. I argue that a “new science” is not a subject whose
basis is built from the ground up1, but a subject that has a particular
object of study and that meets some criteria.
Once we establish that data science is a new science, we need to
understand one core concept: data. In this book, I focus on structured
data, which are data that are organized in a tabular format. I discuss
the importance of understanding the nature of the data we are working
with and how we represent them.
Finally, I discuss two important concepts in data science: database
normalization and tidy data. Database normalization is mainly focused
on data storage. Tidy data is mainly focused on the requirements of
data for analysis. Both concepts interact with each other and have their
mathematical foundations. I bridge the gap between the two concepts
by discussing their common mathematical foundations.

1As it would be as unproductive as creating a “new math” for each new application.
All “sciences” rely on each other in some way.


Chapter remarks

Contents
2.1 Data science definition . . . . . . . . . . . . . . . . . . . 21
2.2 The data science continuum . . . . . . . . . . . . . . . . 24
2.3 Fundamental data theory . . . . . . . . . . . . . . . . . . 26
2.3.1 Phenomena . . . . . . . . . . . . . . . . . . . . . 26
2.3.2 Measurements . . . . . . . . . . . . . . . . . . . . 28
2.3.3 Knowledge extraction . . . . . . . . . . . . . . . . 29

Context

• There is no consensus on the definition of data science.


• Understanding the nature of data is important to extract knowl-
edge from it.
• Structured data are data that are organized in a tabular format.

Objectives

• Define data science.


• Present the main concepts about data theory.

Takeaways

• Data science is a new science that studies knowledge extraction
from measurable phenomena using computational methods.
• Database normalization and tidy data are complementary concepts
that interact with each other.

2.1 Data science definition


In the literature, we can find many definitions and descriptions of data science.
For Zumel and Mount (2019)2, “data science is a cross-disciplinary
practice that draws on methods from data engineering, descriptive statis-
tics, data mining, machine learning, and predictive analytics.” They com-
pare the area with operations research, stating that data science focuses
on implementing data-driven decisions and managing consequences of
these decisions.
Wickham, Çetinkaya-Rundel, and Grolemund (2023)3 declare that
“data science is an exciting discipline that allows you to transform raw
data into understanding, insight, and knowledge.”
Hayashi (1998)4 says that data science “is not only a synthetic con-
cept to unify statistics, data analysis, and their related methods, but also
comprises its results” and that it “intends to analyze and understand
actual phenomena with ‘data.’”
I find the first definition too restrictive, since new methods and techniques are always under development. We never know when new “data-related” methods will become obsolete or merely a trend. Also, Zumel and
Mount’s view gives the impression that data science is an operations
research subfield. Although I do not try to prove otherwise, I think it is
much more useful to see it as an independent field of study. Obviously,
there are many intersections between both areas (and many other areas
as well). Because of such intersections, I try my best to keep definitions
and terms standardized throughout chapters, sometimes avoiding pop-
ular terms that may generate ambiguities or confusion.
The second one is not really a definition. However, it states clearly
what data science enables us to do. The terms “understanding,” “in-
sight,” and “knowledge” are very important in the context of data sci-
ence. They are the goals of a data science project.
The third definition brings an important aspect behind the data: the
phenomena from which they come. Data science is not only about data,
but about understanding the phenomena they represent.

2N. Zumel and J. Mount (2019). Practical Data Science with R. 2nd ed. Shelter Island,
NY, USA: Manning.
3H. Wickham, M. Çetinkaya-Rundel, and G. Grolemund (2023). R for Data Science:
Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.
4C. Hayashi (1998). “What is Data Science? Fundamental Concepts and a Heuristic
Example”. In: Data Science, Classification, and Related Methods. Ed. by C. Hayashi et al.
Tokyo, Japan: Springer Japan, pp. 40–51. isbn: 978-4-431-65950-1.

Note that these definitions do not contradict each other. However, they do not emphasize the “science” aspect of the field. From these ideas, let us define the term.

Definition 2.1: (Data science)

Data science is the study of knowledge extraction from measur-


able phenomena using computational methods.

I want to highlight the meaning of some terms in this definition.


Computational methods means that data science methods use comput-
ers to handle data and perform the calculations. Knowledge means in-
formation that humans can understand and/or apply to solve problems.
Measurable phenomena are events or processes where raw data can be
quantified in some way5. Raw data are data collected directly from some
source and that have not been subject to any transformation by a soft-
ware program or a human expert. Data are any pieces of information that can be digitally stored.
Kelleher and Tierney (2018)6 summarize very well the challenges
of data science: “extracting non-obvious and useful patterns from large
data sets […]; capturing, cleaning, and transforming […] data; [storing
and processing] big […] data sets; and questions related to data ethics
and regulation.”
The naming of data science contrasts with that of conventional sciences. Usually,
a “science” is named after its object of study. Biology is the study of
life, Earth science studies the planet Earth, and so on. I argue that data
science does not study data itself, but how we can use it to understand
a certain phenomenon.
One similar example is “computer science.” Computer science is
not the study of computers themselves, but the study of computing and
computer systems. Similarly, one could state that data science studies
knowledge extraction7 and data systems8.
Moreover, the conventional scientific paradigm is essentially model-
driven: we observe a phenomenon related to the object of study, we rea-
5Non-measurable phenomena are related to metaphysics and are not the object of
study in data science. They are the object of study in other sciences, such as philosophy,
theology, etc. However, many concepts from metaphysics are borrowed to explain data
science concepts.
6J. D. Kelleher and B. Tierney (2018). Data science. The MIT Press.
7Related to data analysis, see section 1.2.2.
8Related to data handling, see section 1.2.1.

son about the possible explanation (the model or hypothesis), and we


validate our hypothesis (most of the time using data, though). In data
science, however, we extract knowledge directly and primarily from the
data. Expert knowledge and reasoning may be taken into account, but
we give data the opportunity to surprise us.
Thus, while the objects of study in conventional sciences are the
phenomena themselves and the models that can explain them, the ob-
jects of study in data science are the means (computational methods and
processes) that can extract reliable and ethical knowledge from data ac-
quired from any measurable phenomenon — and, of course, their con-
sequences.

Figure 2.1: My view of data science.

[Diagram: data science drawing on statistics, computer science, and philosophy / domain expertise.]

Data science is an entirely new science. Being a new science does


not mean that its basis is built from the ground up. Most of the
subjects in data science come from other sciences, but its object
of study (computational methods to extract knowledge from mea-
surable phenomena) is particular enough to unfold new scientific
questions – such as data ethics, data collection, etc. Note that I
emphasize philosophy over domain expertise because, in terms of
scientific knowledge, the former is more general than the latter.

Figure 2.1 shows my view of data science. Data science is an entirely



new science that incorporates concepts from other sciences. In the next section, I present the reasons to regard data science as a new science.

2.2 The data science continuum


In the previous section, I argued that data science is a new science by defining its object of study. This is just the first step in establishing a new science,
especially because the object of study in data science is not new. Com-
puter science, statistics, and other sciences have been studying methods
to process data for a long time.
One key aspect of the establishment of a new science is the social
demand and the importance of the object of study in our society. Many
say that “data is the new oil.” This is because the generation, storage,
and processing of data has increased exponentially in the last decades.
As a consequence, whoever holds the data and can effectively extract
knowledge from them has a competitive advantage.
As a consequence of the demand, a set of methods are developed and
then experiments are designed to assess their effectiveness. If the meth-
ods are effective, they gain credibility, are widely accepted, and become
the foundation of a new scientific discipline.
Usually, a practical consequence of academic recognition is the cre-
ation of new courses and programs in universities. This is the case of
data science. Many universities have created data science programs in
the last few years.
Once efforts to develop the subject increase, it is natural that some
methodologies and questions not particularly related to any other sci-
ence evolve. This effect produces what I call the “data science contin-
uum.”
In a continuum, the subject is not a new science yet. It is a set of
methods and techniques borrowed from other sciences. However, some
principles emerge that are connected with more than one already es-
tablished science. (For instance, a traditional computational method
adapted to assume statistical properties of the data.) With time, the
premises and hypotheses of new methods become distinctive. The par-
ticular properties of the methods lead to the inception of methodologies
to validate them. While validating the methods, new questions arise.
The data science continuum is an instance of this process; see fig. 2.2.
At first glance, data science seems like just a combination of computer
science, statistics, linear algebra, etc. However, the principles and prior-
ities of data science are not the same as those in these disciplines. Simi-

Figure 2.2: The data science continuum.

[Diagram: established sciences (statistics, computer science, philosophy, and others) lead to the emergence of principles, then to unique methods, then to validation and new challenges, and finally to data science.]

The data science continuum is the process of development of data


science as a new science. It began by borrowing methods and tech-
niques from established sciences. Over time, distinct principles
emerged that spanned multiple disciplines. As these principles
developed, new methods and their premises became unique. This
uniqueness led to the creation of specific methodologies for vali-
dating these methods. During the validation process, new ques-
tions and challenges arose, further distinguishing data science
from its parent disciplines.

larly, the accepted methodologies in data science differ and keep evolv-
ing from those in other sciences. New questions arise, such as:

• How can we guarantee that the data we are using is reliable?


• How can we collect data in a way that does not bias our conclu-
sions?
• How can we guarantee that the data we are using is ethical?
• How can we present our results in a way that is understandable to
non-experts?

2.3 Fundamental data theory


As I stated, data science is not an isolated science. It incorporates several
concepts from other fields and sciences. In this section, I explain the
basis of each component of definition 2.1.

2.3.1 Phenomena
A phenomenon is any observable event or process. Phenomena are the sources we use to understand the world around us. In general, we use our senses to perceive phenomena. To make sense of them, we use our knowledge and reasoning.
Philosophy is the study of knowledge and reasoning. It is a very
broad field of study that has been divided into many subfields. One pos-
sible starting point is ontology, which is the study of being, existence,
and reality. Ontology studies what exists and how we can classify it. In
particular, ontology describes the nature of categories, properties, and
relations.
Aristotle (384 – 322 BC) was one of the first philosophers to study ontology. In Κατηγορίαι9, he proposed a classification of the world into ten categories. Substance, or οὐσία, is the most important one. It is the category of being. The other nine categories are quantity, quality, relation, place, time, position, state, action, and passion.
Although rudimentary10, Aristotle’s categories served as a basis for
the development of logical reasoning and scientific classification, espe-
9For Portuguese readers, I suggest Aristotle (2019). Categorias (Κατηγορίαι). Greek
and Portuguese. Trans. by J. V. T. da Mata. São Paulo, Brasil: Editora Unesp. isbn: 978-
85-393-0785-2.
10Most historians agree that Categories was written before Aristotle’s other works.
Many concepts are further developed in his later works.

cially in the Western world. The categories are still used in many appli-
cations, including computer systems and data systems.
Aristotle marked a rupture with many previous philosophers. While Heraclitus (6th century – 5th century BC) held that everything is in a constant state of flux and Plato (c. 427 – 348 BC) held that only the perfect can be known, Aristotle focused on the world we can perceive and understand. His practical view also opposed Antisthenes’ (c. 446 – 366 BC) view that the predicate determines the object, which leads to the impossibility of negation and, consequently, of contradiction.
What is the importance of ontology for data science? Describing,
which is basically reducing the complexity of the world to simple, small
pieces, is the first step to understand any phenomenon. Drawing a sim-
plistic parallel, phenomena are like the substance category, and the data
we collect are like the other categories, which describe the properties, re-
lations, and states of the substance. A person who can easily organize
their thoughts to identify the entities and their properties in a problem
is more likely to collect relevant data. Also, the understanding of logical
and grammatical limitations — such as univocal and equivocal terms —
is important to avoid errors in data science applications11.
Another important field in Philosophy is epistemology, which is the
study of knowledge. Epistemology elaborates on how we can acquire
knowledge and how we can distinguish between knowledge and opin-
ion. In particular, epistemology studies the nature of knowledge, justi-
fication, and the rationality of belief.
Finally, logic is the study of reasoning. It studies the nature of rea-
soning and argumentation. In particular, logic studies the nature of in-
ference, validity, and fallacies.
I further discuss knowledge and reasoning in section 2.3.3.
In the context of a data science project, we usually focus on phe-
nomena from a particular domain of expertise. For example, we may be
interested in phenomena related to the stock market, or related to the
weather, or related to human health. Thus, we need to understand the
nature of the phenomena we are studying.
Fully understanding the phenomena we are tackling requires both
general knowledge of epistemology, ontology, and logic, and particular
knowledge of the domain of expertise.

11It is very common to see data scientists reducing the meaning of the columns in a
dataset to a single word. Or even worse, they assume that different columns with the same
name have the same meaning. This is a common source of errors in data science projects.

Observe as well that we do not restrict ourselves to the intellectual


understanding of philosophy. There are several computational meth-
ods that implement the concepts of epistemology, ontology, and logic.
For example, we can use a computer to perform deductive reasoning, to
classify objects, or to validate an argument. Also, we have strong mathe-
matical foundations and computational tools to organize categories, re-
lations, and properties.
The reason we need to understand the nature of the phenomena we
are studying is that we need to guarantee that the data we are collecting
are relevant to the problem we are trying to solve. An incorrect per-
ception of the phenomena may lead to incorrect data collection, which
certainly leads to incorrect conclusions.

2.3.2 Measurements
Similarly to Aristotle’s work, data scientists focus on the world we can
perceive with our senses (or using external sensors). In a more restric-
tive way, we focus on the world we can measure12. Measurable phe-
nomena are those that we can quantify in some way. For example, the
temperature of a room is a measurable phenomenon because we can
measure it using a thermometer. The number of people in a room is
also a measurable phenomenon because we can count them.
When we quantify a phenomenon, we perform data collection. Data
collection is the process of gathering data on a targeted phenomenon in
an established systematic way. Systematic means that we have a plan to
collect the data and we understand the consequences of the plan, includ-
ing the sampling bias. Sampling bias is the influence that the method
of collecting the data has on the conclusions we can draw from them.
Once we have collected the data, we need to store them. Data storage is
the process of storing data in a computer.
To perform those tasks, we need to understand the nature of data. Data are any pieces of information that can be digitally stored. Data can be stored in many different formats. For example, we can store data in a spreadsheet, in a database, or in a text file. We can also represent data with many different types. For example, we can store data as numbers, strings, or dates.
In data science, studying data types is important because they need
to correctly reflect the nature of the source phenomenon and be com-

12Some phenomena might be knowable but not measurable. For example, the exis-
tence of God is a knowable phenomenon, but it is not measurable.

patible with the computational methods we are using. Data types also
restrict the operations we can perform on the data.
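To make this point concrete, the small sketch below (in Python, chosen here only for illustration; the date value is made up) shows how the type used to store a measurement restricts the operations we can meaningfully perform on it.

    # Sketch: the same measurement stored as plain text versus a proper date type.
    from datetime import date

    as_text = "2024-12-16"          # stored as a string
    as_date = date(2024, 12, 16)    # stored as a date type

    # Arithmetic is only available on the typed representation:
    print(as_date - date(2024, 1, 1))  # elapsed time, a valid operation on dates
    # as_text - "2024-01-01"           # would raise a TypeError: strings do not subtract

    # Conversely, some operations allowed by the storage type are meaningless
    # for the phenomenon being represented:
    print(as_text.upper())             # legal on strings, but nonsense for a date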
The foundation and tools to understand data types come from com-
puter science. Among the subfields, I highlight:

• Algorithms and data structures: the study of data types and the
computational methods to manipulate them.
• Databases: the study of storing and retrieving data.

The fundamental concepts are the same regardless of the programming language, hardware architecture, or relational database management system (RDBMS) we are using. As a consequence, in this book, I focus on the concepts and not on the tools.

2.3.3 Knowledge extraction


As discussed before, knowledge and reasoning are important aspects
of data science. Philosophical and mathematical foundations from epis-
temology and logic provide us with ways to obtain knowledge from a set
of premises and known (and accepted) facts13.
Deductive reasoning is the process of deriving a conclusion (or new knowledge) from a set of previously established knowledge. Deduction, thus, enables us to infer specific conclusions from known general rules.
Important figures that bridged the gap between the subjects of phi-
losophy and mathematics are René Descartes (1596 – 1650) and Got-
tfried Wilhelm Leibniz (1646 – 1716). Descartes was the first to use
algebra to solve knowledge problems, effectively creating methods to
mechanize reasoning. Leibniz, after Descartes, envisioned a universal
algebraic language that would encompass logical principles and meth-
ods. Their work influenced the development of calculus, Boolean alge-
bra, and many other fields.
Once we have collected and stored the data, the next step is to extract
knowledge from them. Knowledge extraction is the process of obtaining
knowledge from data. The reasoning principle here is inductive reason-
ing. Inductive reasoning is the process of deriving generalization rules
from specific observations. Inductive reasoning and data analysis are
closely related. Refer to section 1.2.2 for a timeline of the development
of data analysis.

13In mathematics, axioms are the premises and accepted facts. Corollaries, lemmas,
and theorems are the results of the reasoning process.

In data science, we use computational methods to extract knowledge


from data. These computational methods may come from many differ-
ent fields. In particular, I highlight:

• Statistics: the study of data collection, organization, analysis, in-


terpretation, and presentation.
• Machine learning: the study of computational methods that can
automatically learn from data. It is a branch of artificial intelli-
gence.
• Operations research: the study of computational methods to opti-
mize decisions.

Also, many other fields contribute to the development of domain-


specific computational methods to extract knowledge from data. For
example, in the field of biology, we have bioinformatics, which is the
study of computational methods to analyze biological data. Earth sci-
ences have geoinformatics, which is the study of computational meth-
ods to analyze geographical data. And so on.
Each method has its own assumptions and limitations. Thus, we
need to understand the nature of the methods we are using. In par-
ticular, we need to understand their expected inputs and outputs.
Whenever the available data do not match the requirements of the tech-
nique, we may perform data preprocessing14.
Data preprocessing mainly includes data cleaning, data transforma-
tion, and data enhancement. Data cleaning is the process of detecting
and correcting (or removing) corrupt or inaccurate pieces of data. Data
transformation is the process of converting data from one format or type
to another. Data enhancement is the process of adding additional infor-
mation to the data, usually by integrating data from different sources
into a single, unified view.
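As a rough illustration of these three steps (the column names, the plausibility rule, and the use of the pandas library are assumptions made for this sketch, not a prescription), a preprocessing script might look like the following.

    # Sketch: data cleaning, transformation, and enhancement with pandas.
    import pandas as pd

    measurements = pd.DataFrame({
        "sensor_id": [1, 2, 2, 3],
        "temp_c": [21.5, None, 19.0, 250.0],  # a missing value and an implausible reading
        "read_at": ["2024-06-01", "2024-06-01", "2024-06-02", "2024-06-02"],
    })
    sensors = pd.DataFrame({"sensor_id": [1, 2, 3], "room": ["lab", "office", "lab"]})

    # Cleaning: remove implausible or missing readings.
    clean = measurements[measurements["temp_c"].between(-40, 60)]

    # Transformation: convert formats and types.
    clean = clean.assign(read_at=pd.to_datetime(clean["read_at"]),
                         temp_f=clean["temp_c"] * 9 / 5 + 32)

    # Enhancement: integrate another source into a single, unified view.
    unified = clean.merge(sensors, on="sensor_id", how="left")
    print(unified)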

14It is important to highlight that it is expected that some of the method’s assumptions
are not fully met. These methods are usually robust enough to extract valuable knowledge
even when data contain imperfections, errors, and noise. However, it is still useful to
perform data preprocessing to adjust data as much as possible.
3 Data science project
with contributions from Johnny C. Marques

Figured I could throw myself a pity party or go back to school and learn the computers.
— Don Carlton, Monsters University (2013)

Once we have established what data science is, we can now discuss how
to conduct a data science project. First of all, a data science project is
a software project. The difference between a data science software and
a traditional software is that some components of the former are con-
structed from data. This means that part of the solution is not designed
from the knowledge of the domain expert.
One example of a project is a spam filter that classifies emails into
two categories: spam and non-spam. A traditional approach is to design
a set of rules that are known to be effective. However, the effectiveness
of the filters is limited by the knowledge of the designer and is cumber-
some to maintain. A data science approach automatically learns the fil-
ters from a set of emails that are already classified as spam or non-spam.
Another important difference in data science projects is that tradi-
tional testing methods, such as unit tests, are not sufficient. The solu-
tion inferred from the data must be validated considering the stochastic
nature of the data.
In this chapter, we discuss common methodologies for data science
projects. We also present the concept of agile methodologies and the
Scrum framework. We finally propose an extension to Scrum adapted
for data science projects.


Chapter remarks

Contents
3.1 CRISP-DM . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 ZM approach . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Roles of the ZM approach . . . . . . . . . . . . . . 35
3.2.2 Processes of the ZM approach . . . . . . . . . . . 36
3.2.3 Limitations of the ZM approach . . . . . . . . . . 37
3.3 Agile methodology . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Scrum framework . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Scrum roles . . . . . . . . . . . . . . . . . . . . . 38
3.4.2 Sprints and backlog . . . . . . . . . . . . . . . . . 39
3.4.3 Scrum for data science projects . . . . . . . . . . . 41
3.5 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 The roles of our approach . . . . . . . . . . . . . . 43
3.5.2 The principles of our approach . . . . . . . . . . . 45
3.5.3 Proposed workflow . . . . . . . . . . . . . . . . . 49

Context

• A data science project is a software project.


• Data science methodologies focus on the data analysis process.
• The industry demands not only data analysis but also software de-
velopment.

Objectives

• Explore common methodologies for data science projects.


• Understand agile methodologies and the Scrum framework.
• Propose an extension to Scrum adapted for data science projects.

Takeaways

• Modern data science methodologies take into account the software


development aspects of the project.
• Scrum is a good framework for software development, and it can
be adapted for data science projects.

3.1 CRISP-DM
CRISP-DM1 is a methodology for data mining projects. The name is an acronym for Cross Industry Standard Process for Data Mining. The methodology was developed in the 1990s by a consortium of industry partners (its official guide is currently distributed by IBM), and it is still widely used today.
CRISP-DM is a cyclic process. The process is composed of six phases:

1. Business understanding: this is the phase where the project objec-


tives are defined. The objectives must be defined in a way that is
measurable. The phase also includes the definition of the project
plan.
2. Data understanding: this is the phase where the data is collected
and explored. The data is collected from the data sources, and it
is explored to understand its characteristics. The phase also in-
cludes the definition of the data quality requirements.
3. Data preparation: this is the phase where the data is prepared for
the modeling phase. The data is cleaned, transformed, and ag-
gregated. The phase also includes the definition of the modeling
requirements.
4. Modeling: this is the phase where the model is trained and validated. The model is trained using the prepared data and validated using held-out validation data. The phase also includes the definition of the evaluation requirements.
5. Evaluation: this is the phase where the model is assessed against the project objectives using the evaluation data. The phase also includes the definition of the deployment requirements.
6. Deployment: this is the phase where the model is put into production according to the deployment requirements. The phase also includes the definition of the monitoring requirements.

CRISP-DM is cyclic and completely focused on the data. However,


it does not address the software development aspects of the project. The
“product” of the project is both models and findings, not the full software
solution. As a result, aspects such as user interface, communication,
and integration are not addressed by the methodology.

1Official guide available at https://www.ibm.com/docs/it/SS3RA7_18.3.0/pdf/ModelerCRISPDM.pdf.

Figure 3.1: Diagram of the CRISP-DM process.

[Diagram: business understanding and data understanding (iterating with each other), then data preparation, modeling, and evaluation, and finally deployment, with data at the center of the cycle.]

Each block represents a phase of the CRISP-DM process. Data is


the central element of the process. Arrows represent the transi-
tions between the phases.

Figure 3.1 shows a diagram of the CRISP-DM process. The double


arrow between the business understanding and the data understanding
phases represents the iterative nature of these steps. Once we are satis-
fied with the data understanding, we can proceed to the data prepara-
tion phase. The same iteration is possible between the data preparation
and the modeling phases, since modeling methods can require different
data preparation methods. Finally, an evaluation is performed. If the
model is satisfactory, we proceed to the deployment phase. Otherwise,
we return to the business understanding phase. The idea is to revisit the
project objectives and, if necessary, the project plan.
The CRISP-DM methodology is a good starting point for data science
projects. However, that does not mean it must be followed strictly.
The process is flexible, and adaptations are possible at any stage.

3.2 ZM approach
Zumel and Mount (2019)2 also propose a methodology for data science
projects — which we call here the ZM approach. Besides describing
each step in a data science project, they further address the roles of each
individual involved in the project. They state that data science projects
are always collaborative, as they require domain expertise, data exper-
tise, and software expertise.
Requirements of a data science project are dynamic, and we need to
perform many exploratory phases. Unlike traditional software projects,
we should expect significant changes in the initial requirements and
goals of the project.
Usually, projects based on data are urgent, and they must be com-
pleted in a short time — not only due to the business requirements, but
also because the data changes over time. The authors state that agile
methodologies are suitable (and necessary) for data science projects.

3.2.1 Roles of the ZM approach


In their approach, they define five roles. The roles are:

Project sponsor It is the main stakeholder of the project, the one that
needs the results of the project. He represents the business interests and
champions the project. The project is considered successful if the spon-
sor is satisfied. Note that, ideally, the sponsor cannot be the data scien-
tist, but someone who is not involved in the development of the project.
However, he needs to be able to express quantitatively the business goals
and participate actively in the project.

Client The client is the domain expert. He represents the end users’
interests. In a small project, he is usually the sponsor. He translates the
daily activities of the business into the technical requirements of the
software.

Data scientist The data scientist is the one that sets and executes the
analytic strategy. He is the one that communicates with the sponsor and
the client, effectively connecting all the roles. In small projects, he can
also act as the developer of the software. However, in large projects, he is
2N. Zumel and J. Mount (2019). Practical Data Science with R. 2nd ed. Shelter Island,
NY, USA: Manning.

usually the project manager. Although it is not required to be a domain


expert, the data scientist must be able to understand the domain of the
problem. He must be able to understand the business goals and the
client’s requirements. Most importantly, he must be able to define and
solve the right tasks.

Data architect The data architect is the one who manages data and data storage. He is usually involved in more than one project, so he is not an active participant in any single one. He is the one who receives instructions to adapt the data storage and the means to collect data.

Operations The operations role is the one that manages infrastruc-


ture and deploys final project results. He is responsible for defining
requirements such as response time, programming language, and the
infrastructure to run the software.

3.2.2 Processes of the ZM approach


Zumel and Mount’s model is similar to CRISP-DM, but emphasizes that
back-and-forth is possible at any stage of the process. Figure 3.2 shows
a diagram of the process. The phases and the answers we are looking
for in each phase are:

• Define the goal: what problem are we trying to solve?


• Collect and manage data: what information do we need?
• Build the model: what patterns in the data may solve the problem?
• Evaluate the model: is the model good enough to solve the prob-
lem?
• Present results and document: how did we solve the problem?
• Deploy the model: how to use the solution?

The step “Present results and document” is a differentiator from the


other approaches, like CRISP-DM. In the ZM approach, result presen-
tation is essential; data scientists must be able to communicate their
results effectively to the client/sponsor. This phase is also emphasized
in the view of Wickham, Çetinkaya-Rundel, and Grolemund (2023).

Figure 3.2: Diagram of the data science process proposed by Zumel and Mount (2019).

[Diagram: define the goal, collect and manage data, build the model, evaluate the model, present results, and deploy the model, arranged in a cycle with back-and-forth transitions.]

Each block represents a phase of the data science process. The


emphasis is on the cyclic nature of the process. Arrows represent
the transitions between the phases, which can be back-and-forth.

3.2.3 Limitations of the ZM approach


The ZM approach is particularly interesting in consulting projects, as
the data scientist is not part of the organization. Note that tasks be-
yond deployment, such as maintenance and monitoring, are not directly
addressed by the methodology but are delegated to the operations role.
Like CRISP-DM, the ZM approach does not address the software devel-
opment aspects of the project.

3.3 Agile methodology


Agile is a methodology for software development and an alternative to the waterfall methodology, a sequential design where each phase must be completed before the next phase can begin.
The methodology is based on the four values of the Agile Manifesto3:

• Individuals and interactions over processes and tools;


3https://agilemanifesto.org/

• Working software over comprehensive documentation;


• Customer collaboration over contract negotiation;
• Responding to change over following a plan.

Note that the manifesto does not discard the items on the right, but
rather values the items on the left more. For example, comprehensive
documentation is important, but working software is more important.

3.4 Scrum framework


The Scrum framework is one of the most widely adopted agile method-
ologies. It is an iterative, incremental process that enables teams to de-
liver high-value products efficiently while adapting to changing require-
ments. Scrum emphasizes teamwork, accountability, and continuous
improvement. Developed by Schwaber and Sutherland (2020)4 in the
early 1990s, Scrum is based on three pillars: transparency, inspection,
and adaptation. These principles help teams to navigate complex, adap-
tive problems and deliver productive outcomes.
Scrum organizes work into cycles called sprints, and involves de-
fined roles, ceremonies, and artifacts that ensure the progress and qual-
ity of the product. Key events such as daily stand-ups, retrospectives,
and sprint reviews create regular opportunities to inspect and adapt,
making Scrum a responsive and resilient framework5.

3.4.1 Scrum roles


Scrum defines three critical roles: the product owner, the Scrum master,
and the development team. Each of these roles has specific responsibil-
ities, and together they ensure that the Scrum process runs smoothly
and that the product development aligns with the overall business ob-
jectives.

Product owner This role is responsible for defining the product vi-
sion and maximizing the value of the work done by the team. The
4Latest version of the Scrum Guide available at K. Schwaber and J. Sutherland (2020).
Scrum Guide: The Definitive Guide to Scrum: The Rules of the Game. Scrum.org. url:
https://scrumguides.org/docs/scrumguide/v2020/2020-Scrum-Guide-US.pdf.
5S. Denning (2016). “Why Agile Works: Understanding the Importance of Scrum
in Modern Software Development”. In: Forbes. url: https://www.forbes.com/sites/stevedenning/2016/08/10/why-agile-works/.

product owner manages the product backlog, ensuring that it is visible,


transparent, and prioritized according to business needs. They serve as
the main point of contact between stakeholders and the development
team6. The product owner must constantly balance the requirements
of the business and the technical capabilities of the team, ensuring that
the highest-value items are worked on first.

Scrum master The Scrum master acts as a facilitator and coach for
the Scrum team, ensuring that the team adheres to the Scrum frame-
work and agile principles. Unlike a traditional project manager, the
Scrum master is not responsible for managing the team directly but for
enabling them to perform optimally by removing impediments and fos-
tering a self-organizing culture7.

Development team It is a cross-functional group of professionals re-


sponsible for delivering potentially shippable product increments at the
end of each sprint. The team is self-managing, meaning they decide
how to achieve their goals within the sprint. The team is small enough
to remain nimble but large enough to complete meaningful work8. The
development team works collaboratively and takes collective responsi-
bility for the outcome of the sprint.

3.4.2 Sprints and backlog


A sprint is the basic unit of work in Scrum, as presented in fig. 3.3.
Sprints are time-boxed iterations, usually lasting between one to four
weeks, during which a defined amount of work is completed. The goal
is to deliver a potentially shippable product increment at the end of each
sprint. Sprints are continuous, with no breaks in between, fostering a
regular, predictable rhythm of work.
Before a sprint starts, the Scrum team holds a sprint planning meet-
ing to decide what will be worked on. The work for the sprint is se-
lected from the product backlog, which is a prioritized list of features,
enhancements, bug fixes, and other deliverables necessary for the prod-
uct. The product backlog is dynamic and evolves as new requirements
6J. Smith (2019). “Understanding Scrum Roles: Product Owner, Scrum Master, and
Development Team”. In: Open Agile Journal 12, pp. 22–28.
7C. G. Cobb (2015). The Project Manager’s Guide to Mastering Agile: Principles and
Practices for an Adaptive Approach. John Wiley & Sons.
8S. Rubin (2012). “Scrum for Teams: Maximizing Efficiency in Short Iterations”. In:
Agile Processes Journal 8, pp. 45–52.

Figure 3.3: Scrum framework overview.

[Diagram: the product owner, Scrum master, and development team around the sprint, with the product backlog, sprint planning, the sprint backlog, the daily scrum, the sprint review, the sprint retrospective, and the incremental version.]

The development centers around the sprints, which are time-


boxed iterations. Roles, other ceremonies, and artifacts support
the sprints and ensure the progress and quality of the product.

and changes emerge. The items selected for the sprint become part of
the sprint backlog, a subset of the product backlog that the development
team commits to completing during the sprint.
At the heart of the Scrum process is the daily scrum (or stand-up
meeting), a brief meeting where the team discusses progress toward the
sprint goal, any obstacles they are facing, and their plans for the next
day. This daily inspection ensures that everyone stays aligned and can
quickly adapt to any changes or challenges.
The burn down/up chart is a visual tool used during the sprint to
track the Scrum team’s progress against the planned tasks. It displays
the amount of remaining work (in hours or story points) over time, al-
lowing the team and the product owner to monitor whether the work
is on pace to be completed by the end of the sprint. In a burn-down chart, the line decreases as tasks are finished (in a burn-up chart, it rises toward the total scope), providing a clear indicator of potential delays or blockers. If progress is falling behind, the team adjusts the
approach during the sprint by re-prioritizing tasks or removing impedi-
ments. Thus, this chart provides real-time visibility into the team’s effi-
ciency and contributes to more agile and proactive work management.
At the end of each sprint, the team holds a sprint review, during

which they demonstrate the work completed during the sprint. The
sprint review is an opportunity for stakeholders to see progress and pro-
vide feedback, which may lead to adjustments in the product backlog.
Following the review, the team conducts a sprint retrospective to dis-
cuss what went well, what did not, and how they can improve their pro-
cesses moving forward. These continuous improvement cycles are key
to Scrum’s success, allowing teams to adapt both their work and their
working methods iteratively.
The sprint retrospective is a crucial event in the Scrum framework,
held at the end of each sprint. Its primary purpose is to provide the
Scrum team with an opportunity to reflect on the sprint that just con-
cluded and identify areas for improvement. During the retrospective,
the team discusses what went well, what challenges they encountered,
and how they can enhance their processes for future sprints. This con-
tinuous improvement focus allows the team to adapt their workflow and
collaboration methods, fostering a more efficient and effective develop-
ment cycle. By encouraging open and honest feedback, the retrospective
plays a vital role in maintaining team cohesion and driving productivity
over time.

3.4.3 Scrum for data science projects


Some consider that Scrum is not adequate for data science projects. The
main reason is that Scrum is designed for projects where the require-
ments are known in advance. Also, data-driven exploratory phases are
not well supported by Scrum.
I argue that this view is wrong. Scrum is a framework, and it is de-
signed to be adapted to the needs of the project; Scrum is not a rigid
process. In the following, I propose an extension to Scrum that makes
it suitable for data science projects9.
One of my major concerns in the proposal of the extension is that
data science projects usually involve “data scientists” who are not pri-
marily developers, but statisticians or domain experts. They usually do

9Note that many other adaptations to Scrum have been described in literature. For ex-
ample, J. Saltz and A. Sutherland (2019). “SKI: An Agile Framework for Data Science”. In:
2019 IEEE International Conference on Big Data (Big Data), pp. 3468–3476. doi: 10.1109/
BigData47090.2019.9005591; J. Baijens, R. Helms, and D. Iren (2020). “Applying Scrum
in Data Science Projects”. In: 2020 IEEE 22nd Conference on Business Informatics (CBI).
vol. 1, pp. 30–38. doi: 10.1109/CBI49978.2020.00011; N. Kraut and F. Transchel (2022).
“On the Application of SCRUM in Data Science Projects”. In: 2022 7th International Con-
ference on Big Data Analytics (ICBDA), pp. 1–9. doi: 10.1109/ICBDA55095.2022.9760341.

not possess “hacking-level” skills, and they often do not know good prac-
tices of software development.
Scrum is a good starting point for a compromise between the need
for autonomy (required in dynamic and exploratory projects) and the
need for a detailed plan to guide the project (required to avoid bad prac-
tices and low-quality software). A good project methodology is needed
to ensure that the project is completed on time and within budget.

3.5 Our approach


The previously mentioned methodologies lack focus on the software de-
velopment aspects of the data science project. For instance, CRISP-DM
defines the stages only of the data mining process, i.e., it does not ex-
plicitly address user interface or data collection. Zumel and Mount’s
approach addresses data collection and presentation of results, but del-
egates software development to the operations role, barely mentioning
it. Scrum is a good framework for software development, but it is not
designed for data science projects. For instance, it lacks the exploratory
phases of data science projects.
For the sake of this book, we focus on the inductive approach to data
science projects. Inductive reasoning is the process of making general-
izations based on individual observations. In our case, it means that we
develop components (generalization) of the software based on the data
(individual observations). Such a component is the model. For a deeper
discussion on the inductive approach, refer to section 2.3.3 and chap-
ter 6.
Thus, we propose an extension to Scrum that makes it suitable for
data science projects. The extension is based on the following observa-
tions:

• Data science projects have exploratory phases;


• Data itself is a component of the solution;
• The solution is usually modularized, with parts constructed from
data and other parts constructed like traditional software;
• The solution is usually deployed as a service that must be main-
tained and monitored;
• Team members not familiar with software development practices
must be guided to produce high-quality software.

Moreover, we add two other values besides the Agile Manifesto val-
ues. They are:

• Confidence and understanding of the model over performance;


• Code version control over interactive environments.

The first value is based on the observation that the model perfor-
mance is not the most important aspect of the model. The most impor-
tant aspect is being sure that the model behaves as expected (and some-
times why it behaves as expected). It is not uncommon to find models
that seem to perform well during evaluation steps10, but that are not
suitable for production.
The second value is based on the observation that interactive en-
vironments are not suitable for the development of consistent and re-
producible software solutions. Interactive environments help in the ex-
ploratory phases, but the final version of the code must be version con-
trolled. Often, we hear stories that models cannot be reproduced be-
cause the code that generated them is not runnable anymore. This is
a serious problem, and it is not acceptable for maintaining a software
solution.
As in the Agile manifesto, the values on the right are not discarded,
but the values on the left are more important. We do not discard the
importance of model performance or the convenience of interactive en-
vironments, but they are not the most important aspects of the project.
These observations and values are the basis of our approach. The
roles and principles of our approach are described in the following sec-
tions.

3.5.1 The roles of our approach


Although the roles of the ZM approach consider people potentially from
different organizations and the roles of Scrum focus on the development
team, we can associate the responsibilities between them.
Part of the responsibilities of the product owner (who represents the stakeholders) in a Scrum project is divided between the sponsor and the client in the ZM approach. The data scientist is the one who leads the project, much as the Scrum master does.
In our approach, we consider four roles that cover the responsibil-
ities in a data science project that also involves software development:

10Of course, when evaluation is not performed correctly.



Table 3.1: Roles and responsibilities of our approach.

Our approach           Scrum                     ZM approach
Business spokesman     Stakeholders (customer)   Sponsor and client
Lead data scientist    Product owner             Data scientist
Scrum master           Scrum master              –
Data science team      Development team          Data architect and operations

The roles of Scrum are associated with the roles defined by Zumel
and Mount (2019). Note that the association is not exact. In our
approach, the data scientist leads the development team and in-
teracts with the business spokesman. The development team in-
cludes people with both data science and software engineering
expertise.

the business spokesman, the lead data scientist, the Scrum master, and
the data science team. An association between the roles of our proposal,
Scrum, and the ZM approach is shown in table 3.1.

Business spokesman It is the main stakeholder of the project. He is


the one who needs the results of the project and understands the domain
of the problem. Like the “client” in the ZM approach, he must be able to
translate the business goals into technical requirements. In a sense, the
business spokesman evaluates the main functionalities and usability of
the software. Like the “sponsor” in the ZM approach, he must be able to
express quantitatively the business goals. If possible, his participation in
sprint reviews provides valuable feedback and direction to the project.

Lead data scientist The lead data scientist, like the product owner,
is the one who represents the interests of the stakeholder. She must
be able to understand the application domain and the business goals.
We decide to call her “lead data scientist” to make it clear that she also
has data science expertise. The reason is that mathematical and statis-
tical expertise is essential to understand the data and the models. Cor-
rectly interpreting the results and communicating them to the business

spokesman are essential tasks of the lead data scientist. All other re-
sponsibilities of the traditional product owner are also delegated to her.

Scrum master No extra responsibilities are added to the Scrum master role; however, some data science expertise may be helpful. For example, the Scrum master must ensure that the data science team follows good practices not only of software development but also of data science.

Data science team The data science team is the development team.
It includes people with expertise in data science, database management,
software engineering, and any other domain-specific expertise that is
required for the project.

3.5.2 The principles of our approach


Before describing the functioning of our approach, we present the prin-
ciples that guide it.

Modularize the solution


Data science projects usually contain four main modules: a front-end,
a back-end, a dataset, and a “solution search system.” The front-end is
the user interface, i.e. the part of the software that the client interacts
with. The back-end is the server-side code which usually contains the
preprocessor11 and the model. The dataset is the curated data that is
used to train the model. Sometimes, the dataset is not static, but actually
scripts and queries that produce a dynamic dataset. The solution search
system is the software that employs data preprocessing and machine
learning techniques, usually in a hyper-parameter12 optimization loop,
to find the best solution, i.e. the combination of preprocessor and model
that best solves the problem.

Version control everything


This includes the code, the data, and the documentation. The most used
tool for code version control is Git13. For datasets, extensions to Git exist,
11Preprocessor is a fitted chain of data handling operations that make the necessary
adjustments to the data before it is fed to the model. For more details, consult chapter 7.
12Hyper-parameters are parameters that are not fitted or learned from the data, but
rather set by the user.
13https://git-scm.com/

such as DVC14. One important aspect is to version control the solution


search code. Interactive environments, such as Jupyter Notebooks or R
Markdown, are not suitable for this purpose. They can be used to draft
the code, but the final version must be version controlled.
Note that the preprocessors and the models themselves15 are arti-
facts that result from a particular version of the dataset and the solution
search code. The same occurs with reports generated from exploratory
analysis and validation steps. They must be cached for important ver-
sions of the dataset and the solution search code.

Continuous integration and continuous deployment

The code should be automatically (or at least semi-automatically) tested


and deployed. The back-end and front-end are tested using unit tests.
The dataset is not exactly tested, but the exploratory analysis report is
updated for each version of the dataset. A human must validate the
reports, but the reports must be generated automatically. The solution
search code is tested using validation methods such as cross-validation
and Bayesian analysis — refer to chapter 8 — on the discovered models.
Usually the solution search code is computationally intensive, and
it is not feasible to run it on every commit. Instead, it is usually run peri-
odically, for example once a day. If the cloud infrastructure required to
run the solution search code is not available to automate validation and
deployment, at least make sure that the code is easily runnable. This means that both the code and the required infrastructure must be well documented. Also, aggregate commands using a Makefile16 or a similar tool.
Pay attention to the dependencies between the dataset and model
training. If the dataset changes significantly, not only must the deployed preprocessor and model be retrained, but the whole model search algorithm may also need to be rethought.
Finally, since both the solution search and validation methods are
stochastic, one must guarantee that the results are reproducible. Make
sure you are using a good framework in which you can set the random
seed.
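As a minimal sketch of this principle (the synthetic dataset, the model, and the use of scikit-learn are placeholders chosen for illustration), a seeded validation run that can be re-executed by an automated job might look like this.

    # Sketch: reproducible cross-validation of a candidate solution.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    SEED = 42  # fixed random seed so that every run yields the same result

    X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)
    model = RandomForestClassifier(random_state=SEED)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"F1 per fold: {scores.round(3)}; mean = {scores.mean():.3f}")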

14https://dvc.org/
15The learned solution that is the result of the application of a learning algorithm to
the dataset.
16https://www.gnu.org/software/make/manual/make.html

Reports as deliverables
During sprints, the deliverables of phases like exploratory analysis and
solution search are not only the source code, but also the reports gen-
erated. That is probably the reason why interactive environments are
so popular in data science projects. However, the data scientist must
guarantee that the reports are version controlled and reproducible. The
reports must be generated in a way that is understandable by the busi-
ness spokesperson.

Set up quantitative goals


Do not fall into the trap of forever improving the model. Instead, set
up quantitative goals for the model performance or behavior in general.
For example, the model must have a precision of at least 90% and recall
of at least 80%. Once you reach the goal, prioritize other tasks.
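A small sketch of such a gate (the threshold values and the placeholder labels below are illustrative, not prescribed) can make the goals explicit in the validation step.

    # Sketch: check the agreed quantitative goals after validation.
    from sklearn.metrics import precision_score, recall_score

    # y_true and y_pred would come from the validation step; placeholders here.
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

    goals = {"precision": 0.90, "recall": 0.80}  # agreed with the business spokesman
    results = {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }

    for metric, target in goals.items():
        status = "met" if results[metric] >= target else "NOT met"
        print(f"{metric}: {results[metric]:.2f} (goal {target:.2f}) -> {status}")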

Measure exactly what you want


During model validation, if needed, use custom metrics based on the
project goals. Usually, more than one metric is needed, and they might
be conflicting. Use strategies to balance the metrics, such as Pareto op-
timization.
Beware of the metrics that are most commonly used in the literature; they might not be suitable for your project. It is important to know their meanings, properties, and limitations. For example, the accuracy metric is not suitable for imbalanced datasets.
Notice that during model training, some methods are limited to the
loss functions that they can optimize. However, if possible, choose a
method that can optimize the loss function that you want. Otherwise, even if you are not explicitly optimizing the desired metric in the solution search loop, you might still find a model that performs well on that metric. This is why a good experimental design is important: so you can
identify which strategies are more likely to find a good solution (prepro-
cessor and model).
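As an example of a custom, project-specific metric (the cost values and the use of scikit-learn are invented for this sketch), one can score models by the cost of their errors instead of a generic metric such as accuracy.

    # Sketch: a custom cost-based metric wrapped as a scorer.
    from sklearn.metrics import confusion_matrix, make_scorer

    def error_cost(y_true, y_pred, fp_cost=1.0, fn_cost=10.0):
        """Total cost of mistakes; a missed positive is assumed ten times worse."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return fp * fp_cost + fn * fn_cost

    # greater_is_better=False tells the framework that a lower cost is better.
    cost_scorer = make_scorer(error_cost, greater_is_better=False)

    print(error_cost([1, 0, 1, 0], [0, 0, 1, 1]))  # one FN and one FP -> 11.0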

Report model stability and performance variance


Understanding the limitations and properties of the generated model
is more important than the model’s performance. For example, if the
model’s expected performance is high, but the validation results showed

instability, it is not suitable for production. Also, in some scenarios, in-


terpretability is more important than performance.
This does not mean that performance is not important. But we only
consider optimizing the expected performance once we trust the solu-
tion search method. If increasing the average performance of the mod-
els generated by the solution search method results in other goals not
being met, it is not worth it.

In the user interface, mask data-science-specific terminology


Usually, data science software gives the user the option to choose differ-
ent solutions. For instance, a regressor that predicts the probability of a
binary outcome yields different classifiers depending on a threshold. In
this scenario, the user must be able to choose the threshold in terms of
expected performance of (usually conflicting) metrics of interest.
In order to avoid confusion, the user interface must mask data sci-
ence terminology, preferring domain-specific terms that are more un-
derstandable by the client. This helps non-experts use the software in an informed way.
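A small sketch of this idea (the numbers, the fraud-detection wording, and the threshold values are illustrative assumptions) translates the choice of threshold into domain language.

    # Sketch: present the classification threshold to the user in domain terms.
    # Predicted probabilities and true outcomes would come from the validation data.
    probabilities = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
    is_fraud      = [1,    1,    0,    1,    0,    0]

    def summarize(threshold):
        flagged = [p >= threshold for p in probabilities]
        caught = sum(1 for f, y in zip(flagged, is_fraud) if f and y)
        false_alarms = sum(1 for f, y in zip(flagged, is_fraud) if f and not y)
        total_fraud = sum(is_fraud)
        # Domain wording instead of "recall" and "false positives":
        return (f"Sensitivity setting {threshold:.2f}: "
                f"{caught}/{total_fraud} frauds caught, {false_alarms} false alarm(s)")

    for threshold in (0.3, 0.5, 0.9):
        print(summarize(threshold))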

Monitor model performance in production


Good data science products do not end with the deployment of the model. There should be means to monitor the model’s behavior in production. If possible, set up feedback from the user interface. A history
of usage is usually kept in a database.
In most cases, the model loses performance over time, usually be-
cause of changes in the data distribution. This is called concept drift.
If that happens with considerable effect on the model performance, the
model must be retrained. The retraining can be automated, but it must
be monitored by robust validation methods. Sometimes, the solution
must be rethought, restarting the project from the beginning.
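A minimal monitoring sketch (the window size, the alert threshold, and the logging function are arbitrary choices made for illustration) could flag such degradation from the recorded usage history.

    # Sketch: flag possible concept drift from logged predictions and observed outcomes.
    from collections import deque

    WINDOW = 100          # number of recent predictions to consider
    ALERT_BELOW = 0.85    # investigate or retrain when rolling accuracy drops below this

    recent = deque(maxlen=WINDOW)

    def log_outcome(predicted, observed):
        """Called whenever the true outcome of a past prediction becomes known."""
        recent.append(predicted == observed)
        if len(recent) == WINDOW and sum(recent) / WINDOW < ALERT_BELOW:
            print(f"Possible concept drift: rolling accuracy = {sum(recent) / WINDOW:.2f}")

    # Example usage with made-up values: the data distribution shifts after step 100.
    for step in range(120):
        log_outcome(predicted=1, observed=1 if step < 100 else 0)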

Use the appropriate infrastructure


Many data science projects require a significant amount of computa-
tional resources during development. The solution search method is
usually computationally intensive. The dataset can be large, and the
model can be complex. The infrastructure must be able to handle the
computational requirements of the project.
Unless the application is known to be challenging, projects should
start with the simplest methods and a small dataset. Then, as the project

evolves, the complexity of the methods and the size of the dataset can
be increased. Nowadays, cloud services are a good option for scalability.
Finally, during development, the requirements of the deployment
infrastructure must be considered. The deployment infrastructure must
be able to handle the expected usage of the system. For instance, in the
back-end, one may need to consider the response time of the system,
the programming language, and the infrastructure to run the software.
The choice of communication between the front-end and the back-end
is also important. For instance, one may choose between a REST API or
a WebSocket. A REST API is more suitable for stateless, request-based
interactions, while a WebSocket is more suitable for persistent, stateful
connections. For example, if the user interface must be updated in real
time, a WebSocket is more suitable; if the user interface is used to submit
batch requests, a REST API is enough.

3.5.3 Proposed workflow


The proposed workflow is based on the principles described above. Our
approach adapts the Scrum framework by establishing three kinds of
sprints: data sprints, solution sprints, and application sprints. We also
describe where exploratory and reporting phases fit into the workflow.

Product backlog
In the data science methodologies described in this chapter, the problem
definition is the first step in a data science project. In our methodology,
this task is dealt with in the product backlog. The product backlog is a
list of all desired work on the project. The product backlog is dynamic,
and it is continuously updated by the lead data scientist.
Each item in the product backlog reflects a requirement or a feature
the business spokesperson wants to be implemented. Like in traditional
Scrum, the items are ordered by priority. Here, however, they are clas-
sified into three types: data tasks, solution search tasks, and application
tasks.

Sprints
The sprints are divided into three types: data sprints, solution sprints,
and application sprints. Sprints occur sequentially, but it is possible to
have multiple sprints of the same type in sequence. Like in traditional
Scrum, the sprint review is performed at the end of each sprint. Data

sprints comprise only data tasks, solution sprints comprise only solution
search tasks, and so on.

Data sprint The data sprint is focused on the database-related tasks:


collection, integration, tidying, and exploration. Chapter 4 covers the
subjects of collection, integration (database normalization and joins)
and tidying. Part of data exploration is to semantically describe the vari-
ables, especially in terms of data types and their properties; which is also
discussed in that chapter. All these tasks aim to prepare a dataset that
represents the data that the solution will see in production.
Exploration also refers to exploratory data analysis, which is not cov-
ered in this book. For a good introduction to exploratory data analy-
sis, which includes both understanding and identifying issues in the
data through descriptive statistics and visualization, consult chapter 3
of Zumel and Mount (2019)17.
The products (deliverables) of the data sprint are the exploratory
analysis report and the data itself. The exploratory analysis report is
a document that describes the main characteristics of the data, such as
the distribution of the variables, the presence of missing values, and the
presence of outliers. In our context, this report should be generated au-
tomatically, and it should be version controlled. The data is the curated
data that is used to train the model. The source code that generates
the data — usually scripts and queries that combine data from different
places — must be version controlled. At this point of the project, the
data scientist must guarantee that the data is of high quality and that
it represents the data that the solution will “see” in production. It is
very important that, at this stage, no transformation based on the values
of the dataset is performed. Failing to do so leads to data leakage in the
validation phase.

Solution sprint The solution sprint is focused on the solution search


tasks: data preprocessing, machine learning, and validation. Chapter 7
covers the subjects of data preprocessing — adjustments to the data to be
used by specific machine learning models —, chapter 6 covers the sub-
jects of learning machines — methods that can estimate an unknown
function based on input and output examples —, and chapter 8 cov-
ers the subjects of validation — through evaluation we estimate the
expected performance for unseen data, i.e., the probability distribution
of the performance in production.

17N. Zumel and J. Mount (2019). Practical Data Science with R. 2nd ed. Shelter Island,
NY, USA: Manning.

Definition 3.1: (Data leakage)

During the validation of the solution, we simulate the production
environment by leaving some data out of the training set — consult
chapter 8. The remaining dataset, called the test set, emulates unseen
data. Data leakage is the situation where information from the test set
is used to transform the training set in any way or to train the model.
As a result, the validation performance of the solution is overestimated.
Sometimes, even an excess of data exploration can lead to indirect
leakage or bias in the validation phase.
The products of the solution sprint are the validation report and the
solution itself, which is the preprocessor and the model. The validation
report is a document that describes the expected performance of the so-
lution in the real-world. The script or program that searches for the best
pair of preprocessor and model must be version-controlled. This algo-
rithm is usually computationally intensive, and it should run every time
the dataset is updated.
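
As an illustration of how a solution (preprocessor and model) can be fit
and validated without data leakage, the sketch below uses a scikit-learn
pipeline with a held-out test set; the dataset and the estimator are
placeholders, not a recommendation.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

solution = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
solution.fit(X_train, y_train)   # scaler statistics come from X_train only
print("estimated performance:", solution.score(X_test, y_test))
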

Application sprint The application sprint is focused on the applica-


tion itself. The tasks in this sprint focus on the development of the user
interface, the communication between the front-end and the back-end,
and the monitoring of the solution in production. Software documenta-
tion and unit tests are also part of this sprint.

Figure 3.4 shows the tasks and results for each sprint type and their
relationships. Every time a data sprint results in modifications in the
dataset, the solution search algorithm must be re-executed. The same
occurs when the solution search algorithm is modified: the application
must be updated to use the new preprocessor and model.

Choice and order of sprints


Sprints are sequential. Ordinarily, a data sprint is followed by a solution
sprint, and a solution sprint is followed by an application sprint. How-
ever, it is possible to have multiple sprints of the same type in sequence.
Moreover, like the back-and-forth property of the CRISP-DM and ZM
approach, it is possible to go back to a previous sprint type, especially
when new requirements or problems are identified.

Figure 3.4: Tasks and results for each sprint type and their relationships.

[Diagram: the data sprint (data tasks: collect, integrate, tidy, and explore
data) produces the exploratory analysis and the data; the solution sprint
(solution search tasks: data preprocessing, machine learning, and
validation) produces the validation report and the preprocessor and
model; the application sprint (application tasks: user interface,
communication, and monitoring) produces the application. Each sprint
ends with a sprint review.]

Summary of the tasks and results for each sprint type and their
relationships.

Figure 3.5: Example of sprints of a data science project.

[Diagram: a sequence of sprints, e.g., data, data, solution, then data,
solution, application, and so on.]

Each loop represents a sprint of a different kind: data sprint, solution
sprint, and application sprint. The arrows represent the transitions
between the sprints.

Figure 3.5 shows an example of a sequence of sprints of a data sci-


ence project. The figure shows that the sprints are sequential and their
types can appear in any order.
We do not advise having mixed sprints, for instance, a sprint that
contains both data and solution tasks. This may lead to a split focus,
which may result in the team acting as multiple teams. We argue that,
independently of the skill set of the team members, all members must
be aware of all parts of the project. This is important to guarantee that
the solution is coherent and that the team members can help each other.

Sprint reviews
A proper continuous integration/continuous deployment (CI/CD) pipe-
line guarantees that by the end of the sprint, exploratory analysis, per-
formance reports, and the working software are ready for review. The
sprint review is a meeting where the team presents the results of the
sprint to the business spokesperson. The business spokesperson must
approve the results of the sprint. (The lead data scientist approves the
results in the absence of the business spokesperson.) It is important that
the reports use the terminology of the client and that the software is easy
for the domain expert to use.

Relationship with other methodologies


Our approach covers all the phases of the CRISP-DM and the ZM ap-
proach. Moreover, it includes aspects of software development that are
not covered by the other methodologies. Table 3.2 relates the phases of
the CRISP-DM and the ZM approach with the sprint types and other
artifacts and ceremonies of our approach.

Table 3.2: Relationship between coverage of other methodologies


and our approach.

CRISP-DM ZM approach Our approach


Bus. understanding Define the goal Product backlog
Data understanding Collect/manage data Data sprint
Data preparation Collect/manage data Data/solution sprint
Modeling Build the model Solution sprint
Evaluation Evaluate the model Solution sprint
Present results Sprint reviews
Deployment Deploy the model Application sprint

The phases of the CRISP-DM and the ZM approach are compared


with the components of our approach. The table shows that our
approach covers all the phases of the other methodologies.

One highlight is that we approach CRISP-DM’s “data preparation”


and ZM’s “collect/manage data” tasks more carefully. Data handling
operations that are not parametrized by the data values themselves are
performed in the data sprint. On the other hand, operations that use
statistics/information of a sampling of the data are performed together
with modeling — consult chapter 5.
This approach not only improves the reliability of the solution vali-
dation, but also improves the maintenance of the solution. (Not surpris-
ingly, frameworks like scikit-learn and tidymodels have an object that
allows the user to combine data preprocessing and model training.)
Structured data
4
Like families, tidy datasets are all alike, but every messy
dataset is messy in its own way.
— Hadley Wickham, Tidy Data

As one expects, when we measure a phenomenon, the resulting data


come in many different formats. For example, we can measure the tem-
perature of a room using a thermometer. The resulting data are num-
bers. We can assess English proficiency using an essay test. The result-
ing data are texts. We can register relationships between proteins and
their functions. The resulting data are graphs. Thus, it is essential to
understand the nature of the data we are working with.
The most common data format is the structured data. Structured
data refers to information that is organized in a tabular format. We re-
strict the kind of information we store in each cell, i.e., the data type of
each measurement. The data type restricts the operations we can per-
form on the data. For example, we can perform arithmetic operations
on numbers, but not on text. All cells in the same column must share
the same data type.
In this chapter, I discuss the most common data types and the most
common data formats. More specifically, we are interested in how the
semantics of the data are encoded in the data format. Database normal-
ization and tidy data are two concepts that are crucial for the under-
standing of structured data.
As a result, the reader will be equipped with the mindset to perform
data tasks — collection, integration, tidying, and exploration — well.


Chapter remarks

Contents
4.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Database normalization . . . . . . . . . . . . . . . . . . . 58
4.2.1 Relational algebra . . . . . . . . . . . . . . . . . . 59
4.2.2 Normal forms . . . . . . . . . . . . . . . . . . . . 60
4.3 Tidy data . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Common messy datasets . . . . . . . . . . . . . . 66
4.4 Bridging normalization, tidiness, and data theory . . . . . 73
4.4.1 Tidy or not tidy? . . . . . . . . . . . . . . . . . . . 73
4.4.2 Change of observational unit . . . . . . . . . . . . 76
4.5 Data semantics and interpretation . . . . . . . . . . . . . 78
4.6 Unstructured data . . . . . . . . . . . . . . . . . . . . . . 79

Context

• Data comes in many different formats.


• Good data analysis requires understanding the data types and their
meanings.

Objectives

• Understand the most common data types and formats.


• Enable the reader to perform data tasks well by associating data
format and semantics.

Takeaways

• The choice of the observational unit is not always straightforward.


• Format and types must reflect the information the solution will
“see” in production.

4.1 Data types


The most common classification of data types is Stevens’ types: nominal,
ordinal, interval, and ratio. Nominal data are data that can be classified
into categories. Ordinal data are data that can be classified into cate-
gories and ordered. Interval data are data that can be classified into cat-
egories, ordered, and measured in fixed units. Ratio data are data that
can be classified into categories, ordered, measured in fixed units, and
have a true zero. In practice, they differ on the logical and arithmetic
operations we can perform on them.

Table 4.1: Stevens’ types.

Data type Operations


Nominal =
Ordinal =, <
Interval =, <, +, −
Ratio =, <, +, −, ×, ÷

Stevens’ types are a classification of data types based on the operations
we can perform on them.

Table 4.1 summarizes the allowed operations for each of Stevens’


types. All types enable equality comparison. Ordinal data can also be
tested in terms of their order, but they do not allow quantitative differ-
ence. Interval data, on the other hand, allow addition and subtraction.
Finally, the true zero of ratio data enables us to calculate relative differ-
ences (multiplication and division).
For example, colors are nominal data. We can classify colors into cat-
egories, but we cannot order them. A categorical variable that classifies
sizes into small, medium, and large is ordinal data. We can order the
sizes, but we cannot say that the difference between small and medium
is the same as the difference between medium and large. Temperature
in Celsius is interval data. We can order temperatures, and we can say
that the difference between 10 and 20 degrees is the same as the differ-
ence between 20 and 30 degrees. However, we cannot say that 20 de-
grees is twice as hot as 10 degrees. Finally, weight is ratio data. We can
order weights, we can say that the difference between 10 and 20 kilo-
grams is the same as the difference between 20 and 30 kilograms, and
58 CHAPTER 4. STRUCTURED DATA

we can say that 20 kilograms is twice as heavy as 10 kilograms.
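
A sketch of how these distinctions can be made explicit in pandas follows;
the column names and categories are illustrative.

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red"],           # nominal
    "size": ["small", "large", "medium"],      # ordinal
    "temp_celsius": [10.0, 20.0, 30.0],        # interval
    "weight_kg": [10.0, 20.0, 40.0],           # ratio
})

df["color"] = df["color"].astype("category")
df["size"] = pd.Categorical(df["size"],
                            categories=["small", "medium", "large"],
                            ordered=True)

print(df["size"] < "large")                     # order comparisons are allowed
print(df["weight_kg"] / df["weight_kg"].min())  # ratios only make sense for ratio data
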


Nonetheless, Stevens’ types do not exhaust all possibilities for data
types. For example, probabilities are bounded at both ends, and thus
do not tolerate arbitrary scale shifts. Velleman and Wilkinson (1993)1
provide interesting insights about data types. Although I do not agree
with all of their points, I think it is a good read. In particular, I agree with
their criticism of statements that data types are evident from the data
independently of the questions asked.
different ways depending on the context and the goals of the analysis.
However, I do not agree with the idea that good data analysis does
not assume data types. I think that data scientists should be aware of the
data types they are working with and how they affect the analysis. With
no bias, there is no learning. There is no such thing as a “bias-free” anal-
ysis; the amount of possible combinations of assumptions easily grows
out of control. The data scientist must take responsibility for the conse-
quences of their assumptions. Good assumptions and hypotheses are a
key part of the data science methodology.
When we work with structured data, two concepts are very impor-
tant: database normalization and tidy data. Database normalization is
mainly focused on the data storage. Tidy data is mainly focused on the
requirements of data for analysis. Both concepts have their mathemati-
cal and logical foundations and tools for data handling.

4.2 Database normalization

Database normalization is the process of organizing the columns and
tables of a relational database to minimize data redundancy and improve
data integrity. The need for database normalization comes from the fact
that the same data can be stored in many different ways.
Normal form is a state of a database that is free of certain types of
data redundancy. Before studying normal forms, we need to understand
basic concepts in database theory and the basic operations in relational
algebra.

1P. F. Velleman and L. Wilkinson (1993). “Nominal, Ordinal, Interval, and Ratio Ty-
pologies are Misleading”. In: The American Statistician 47.1, pp. 65–72. doi: 10.1080/
00031305.1993.10475938.

4.2.1 Relational algebra


Relational algebra is a theory that uses algebraic structures to manipu-
late relations. Consider the following concepts.

Relation A relation is a table with rows and columns that represent


an entity. Each row, or tuple, is assumed to appear only once in the
relation. Each column, or attribute, is assumed to have a unique name.

Projection The projection of a relation is the operation that returns a


relation with only the columns specified in the projection. For example,
if we have a relation 𝑋[𝐴, 𝐵, 𝐶] and we perform the projection 𝜋𝐴,𝐶 (𝑋),
we get a relation with only the columns 𝐴 and 𝐶, i.e., 𝑋[𝐴, 𝐶]. The num-
ber of rows in the resulting relation might be less than the number of
rows in the original relation because of repeated rows.

Join The (natural) join of two relations is the operation that returns
a relation with the columns of both relations. For example, if we have
two relations 𝑆[𝑈 ∪ 𝑉] and 𝑇[𝑈 ∪ 𝑊 ], where 𝑈 is the common set of
attributes, the join 𝑆 ⋈ 𝑇 of 𝑆 and 𝑇 is the relation with tuples (𝑢, 𝑣, 𝑤)
such that (𝑢, 𝑣) ∈ 𝑆 and (𝑢, 𝑤) ∈ 𝑇. The generalized join is built up
out of binary joins: ⋈ {𝑅1 , 𝑅2 , … , 𝑅𝑛 } = 𝑅1 ⋈ 𝑅2 ⋈ ⋯ ⋈ 𝑅𝑛 . Since
the join operation is associative and commutative, we can parenthesize
however we want.
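
The sketch below mimics projection and natural join with pandas; the toy
tables are not part of the formal definitions above.

import pandas as pd

S = pd.DataFrame({"course": ["Math", "Physics"], "credits": [4, 3]})
T = pd.DataFrame({"student": ["Alice", "Alice", "Bob"],
                  "course": ["Math", "Physics", "Math"],
                  "grade": ["A", "B", "B"]})

# Projection: keep selected columns and drop repeated rows.
projection = T[["student", "course"]].drop_duplicates()

# Natural join: merge on the common attribute(s).
joined = pd.merge(T, S, on="course")
print(projection)
print(joined)
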

Functional dependency A functional dependency is a constraint be-


tween two sets of attributes in a relation. It is a statement that if two
tuples agree on certain attributes, then they must agree on another at-
tribute. Specifically, the functional dependency 𝑈 → 𝑉 holds in 𝑅 if and
only if for every pair of tuples 𝑡1 and 𝑡2 in 𝑅 such that 𝑡1 [𝑈] = 𝑡2 [𝑈], it
is also true that 𝑡1 [𝑉 ] = 𝑡2 [𝑉].
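
A simple way to test a candidate functional dependency on a dataset is
sketched below; the helper function and the toy relation are illustrative.

import pandas as pd

def holds_fd(df, U, V):
    # U -> V holds if each combination of U values maps to a single V value.
    return (df.groupby(U, dropna=False)[V].nunique() <= 1).all().all()

R = pd.DataFrame({"course": ["Math", "Math", "Physics"],
                  "credits": [4, 4, 3],
                  "grade": ["A", "B", "A"]})

print(holds_fd(R, ["course"], ["credits"]))  # True: course -> credits
print(holds_fd(R, ["course"], ["grade"]))    # False: course does not determine grade
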

Multi-valued dependency A multi-valued dependency constrains


two sets of attributes in a relation. The multi-valued dependency 𝑈 ↠ 𝑉
holds in 𝑅 if and only if 𝑅 = 𝑅[𝑈𝑉 ] ⋈ 𝑅[𝑈𝑊 ], where 𝑊 are the re-
maining attributes. Note, however, that unlike functional dependen-
cies, multi-valued dependencies are not simple to interpret, so we re-
strict our discussion to its mathematical properties2.
2In fact, one might naively think that a multi-valued dependency is a functional de-
pendency between many attributes. However, this is not the case.

Join dependency A join dependency is a constraint between subsets


of attributes (not necessarily disjoint) in a relation. 𝑅 obeys the join
dependency ∗ {𝑋1 , 𝑋2 , … , 𝑋𝑛 } if 𝑅 = ⋈ {𝑅[𝑋1 ], 𝑅[𝑋2 ], … , 𝑅[𝑋𝑛 ]}.

4.2.2 Normal forms


The normal forms are a series of progressive conditions that a relation
must satisfy to be considered normalized. The normal forms are cumu-
lative, i.e., a relation that is in 𝑛-th normal form is also in (𝑛 − 1)-th
normal form. The normal forms are a way to reduce redundancy and
improve data integrity.

First normal form (1NF) A relation is in 1NF if and only if all at-
tributes are atomic. An attribute is atomic if it is not a set of attributes.
For example, the relation 𝑅[𝐴, 𝐵, 𝐶] is in 1NF if and only if 𝐴, 𝐵, and 𝐶
are atomic.

Second normal form (2NF) A relation is in 2NF if and only if it is in


1NF and every non-prime attribute is fully functionally dependent on
the primary key. A non-prime attribute is an attribute that is not part
of the primary key. A primary key is a set of attributes that uniquely
identifies a tuple. A non-prime attribute is fully functionally dependent
on the primary key if it is functionally dependent on the primary key and
not on any subset of the primary key. For example, the relation 𝑅[𝑈 ∪𝑉 ]
is in 2NF if and only if 𝑈 → 𝑋, ∀𝑋 ∈ 𝑉 and there is no 𝑊 ⊂ 𝑈 such
that 𝑊 → 𝑋, ∀𝑋 ∈ 𝑉 .

Third normal form (3NF) A relation is in 3NF if and only if it is


in 2NF and every non-prime attribute is non-transitively dependent on
the primary key. A non-prime attribute is non-transitively dependent
on the primary key if it is not functionally dependent on another non-
prime attribute. For example, the relation 𝑅[𝑈 ∪ 𝑉 ] is in 3NF if and only
if 𝑈 is the primary key and there is no 𝑋 ∈ 𝑉 such that 𝑋 → 𝑌 , ∀𝑌 ∈ 𝑉 .

Boyce-Codd normal form (BCNF) A relation 𝑅 with attributes 𝑋 is


in BCNF if and only if it is in 2NF and for each nontrivial functional
dependency 𝑈 → 𝑉 in 𝑅, the functional dependency 𝑈 → 𝑋 is in 𝑅. In
other words, a relation is in BCNF if and only if every functional depen-
dency is the result of keys.

Fourth normal form (4NF) A relation 𝑅 with attributes 𝑋 is in 4NF


if and only if it is in 2NF and for each nontrivial multi-valued depen-
dency 𝑈 ↠ 𝑉 in 𝑅, the functional dependency 𝑈 → 𝑋 is in 𝑅. In other
words, a relation is in 4NF if and only if every multi-valued dependency
is the result of keys.

Projection join normal form (PJNF) A relation 𝑅 with attributes


𝑋 is in PJNF3 if and only if it is in 2NF and the set of key dependen-
cies4 of 𝑅 implies each join dependency of 𝑅. The PJNF guarantees that
the table cannot be decomposed without losing information (except by
decompositions based on keys).
The idea behind the definition of BCNF and 4NF is slightly differ-
ent from the PJNF. In fact, if we consider that for each key dependency
implies a join dependency, the relation is in the so-called overstrong
projection-join normal form5. Such a level of normalization does not
improve data storage or eliminate inconsistencies. In practice, it means
that if a relation is in PJNF, careless joins — i.e., those that violate a join
dependency — produce inconsistent results.

Simple example Consider the 2NF relation 𝑅[𝐴, 𝐵, 𝐶, 𝐷] with func-


tional dependencies 𝐴 → 𝐵, 𝐵 → 𝐶, 𝐶 → 𝐷. The relation is not in
3NF because 𝐶 is transitively dependent on 𝐴. To normalize it, we can
decompose it into the relations 𝑅1 [𝐴, 𝐵, 𝐶] and 𝑅2 [𝐶, 𝐷]. Now, 𝑅2 is in
3NF and 𝑅1 is in 2NF, but not in 3NF. We can decompose 𝑅1 into the re-
lations 𝑅3 [𝐴, 𝐵] and 𝑅4 [𝐵, 𝐶]. The original relation can be reconstructed
by ⋈ {𝑅2 , 𝑅3 , 𝑅4 }.
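
The decomposition above can be reproduced with pandas, as sketched
below with made-up values; the natural joins reconstruct the original
relation without loss.

import pandas as pd

R = pd.DataFrame({"A": [1, 2, 3], "B": [10, 10, 20],
                  "C": [100, 100, 200], "D": ["x", "x", "y"]})

R3 = R[["A", "B"]].drop_duplicates()   # A -> B
R4 = R[["B", "C"]].drop_duplicates()   # B -> C
R2 = R[["C", "D"]].drop_duplicates()   # C -> D

reconstructed = R3.merge(R4, on="B").merge(R2, on="C")
print(reconstructed.sort_values("A").reset_index(drop=True).equals(R))  # True
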

Illustrative example of data integrity Consider a relation of
students and their grades. The relation contains the attributes “student”,
“course”, “course credits”, and “grade”. The primary key is the composite
of “student” and “course”. The functional dependencies are “student”
and “course” determine “grade”, and “course” determines “course cred-
its”. The relation is in 2NF but not 3NF.
3Also known as fifth normal form (5NF). The authors themselves prefer the term
PJNF because it emphasizes the operations to which the normal form applies.
4Key dependency is a functional dependency in the form 𝐾 → 𝑋, where 𝑋 encom-
passes all attributes of the relation.
5R. Fagin (1979). “Normal forms and relational database operators”. In: Proceedings
of the 1979 ACM SIGMOD International Conference on Management of Data. SIGMOD
’79. Boston, Massachusetts: Association for Computing Machinery, pp. 153–160. isbn:
089791001X. doi: 10.1145/582095.582120.

Table 4.2: Student grade relation.

Student Course Course credits Grade


Alice Math 4 A
Alice Physics 3 B
Bob Math 4 B
Bob Physics 3 A

An example of a relation of students and their grades in 2NF.

Table 4.2 shows an example of possible values of the relation. If we


decide to change the course credits of the course “Math” to 5, we must
update the two rows; otherwise, the relation will be inconsistent. A 3NF
relation (see table 4.3) would have the attributes “course” and “course
credits” in a separate relation, avoiding the possibility of data inconsis-
tency. If needed, the relation would be reconstructed by a join opera-
tion.

Table 4.3: Student grade relation in 3NF.

Course    Course credits
Math      4
Physics   3

Student   Course    Grade
Alice     Math      A
Alice     Physics   B
Bob       Math      B
Bob       Physics   A

An example of a relation of students and their grades in 3NF.

Invalid join example Consider the 2NF relation 𝑅[𝐴𝐵𝐶]6 such that
the primary key is the composite of 𝐴, 𝐵, and 𝐶. The relation is thus in
the 4NF, as no column is a determinant of another column. Suppose,
however, the following constraint: if (𝑎, 𝑏, 𝑐′ ), (𝑎, 𝑏′ , 𝑐), and (𝑎′ , 𝑏, 𝑐) are
in 𝑅, then (𝑎, 𝑏, 𝑐) is also in 𝑅. This can be illustrated if we consider 𝐴
as an agent, 𝐵 as a product, and 𝐶 as a company. If an agent 𝑎 represents
companies 𝑐 and 𝑐′, and product 𝑏 is in his portfolio, then, assuming both
companies make 𝑏, 𝑎 must offer 𝑏 from both companies.

6Here we abbreviate 𝐴, 𝐵, 𝐶 as 𝐴𝐵𝐶.
The relation is not in PJNF, as the join dependency ∗ {𝐴𝐵, 𝐴𝐶, 𝐵𝐶}
is not implied by the primary key. (In fact, the only functional depen-
dency is the trivial 𝐴𝐵𝐶 → 𝐴𝐵𝐶.) In this case, to avoid redundancies
and inconsistencies, we must split the relation into the relations 𝑅1 [𝐴𝐵],
𝑅2 [𝐴𝐶], and 𝑅3 [𝐵𝐶].
It is interesting to notice that in this case, the relation 𝑅1 ⋈ 𝑅2 might
contain tuples that do not make sense in the context of the original rela-
tion. For example, if 𝑅1 contains (𝑎, 𝑏) and 𝑅2 contains (𝑎, 𝑐′ ), the join
contains (𝑎, 𝑏, 𝑐′ ), which might not be a valid tuple in the original rela-
tion if (𝑏, 𝑐′ ) is not in 𝑅3 .

Important note on PJNF

This is very important to notice, as it is a common mistake to assume


that the join of the decomposed relations always contains valid tu-
ples.

Valid joins example Consider the 2NF relation 𝑅[𝐴, 𝐵, 𝐶, 𝐷, 𝐸] with


the functional dependencies 𝐴 → 𝐷, 𝐴𝐵 → 𝐶, and 𝐵 → 𝐸. To make it
PJNF, we can decompose it into the relations 𝑅1 [𝐴, 𝐷], 𝑅2 [𝐴, 𝐵, 𝐶], and
𝑅3 [𝐵, 𝐸]. The original relation can be reconstructed by ⋈ {𝑅1 , 𝑅2 , 𝑅3 }.
However, unlike the previous example, the join of the decomposed rela-
tions always contains valid tuples — excluding degenerate joins, where
there are no common attributes. The reason is that all join dependencies
implied by the key dependencies are trivial when reduced7.

4.3 Tidy data


It is estimated that 80% of the time spent on data analysis is spent on
data preparation. Usually, the same process is repeated many times in
different datasets. The idea is that organized data carries the meaning
of the data, reducing the time spent on handling the data to get it into
the right format for analysis.

7I am investigating a formal proof based on M. W. Vincent (1997). “A corrected 5NF
definition for relational database design”. In: Theoretical Computer Science 185.2. Theo-
retical Computer Science in Australia and New Zealand, pp. 379–391. issn: 0304-3975.
doi: 10.1016/S0304-3975(97)00050-9.
Tidy data, proposed by Wickham (2014)8, is a data format that pro-
vides a standardized way to organize data values within a dataset. The
main advantage of tidy data is that it provides clear semantics with a
focus on only one view of the data.
Many data formats might be ideal for particular tasks, such as raw
data, dense tensors, or normalized databases. However, most statistical
and machine learning methods require a particular data format. Tidy
data is a data format that is suitable for those tasks.
In an unrestricted table, the meaning of rows and columns is not
fixed. In a tidy table, the meaning of rows and columns is fixed. The
semantics are more restrictive than usually required for general tabular
data.

Table 4.4: Example of same data in different formats.

Cases (2019) Cases (2020)


Brazil 100 200
USA 400

Brazil USA
Cases (2019) 100
Cases (2020) 200 400

The same data in different formats. Both are considered messy by


Wickham.

Table 4.4 shows an example of the same data in different formats.


Although they emphasize different aspects (especially for visualization)
of the data, both contain the same amount of information. They are
considered messy by Wickham because the meaning of the rows and
columns is not fixed.
Tidy data is based on the idea that a dataset is a collection of values,
where:

• Each value belongs to a variable and an observation.
• Each variable, represented by a column, contains all values that
measure the same attribute across (observational) units.
• Each observation, represented by a row, contains all values measured
on the same unit across attributes.
• Attributes are the characteristics of the units, e.g., height, temperature,
duration.
• Observational units are the individual entities being measured, for
instance, a person, a day, an experiment.

8H. Wickham (2014). “Tidy Data”. In: Journal of Statistical Software 59.10, pp. 1–23.
doi: 10.18637/jss.v059.i10.
Table 4.5 summarizes the main concepts.

Table 4.5: Tidy data concepts.

Concept Structure Contains Across


Variable Column Same attribute Units
Observation Row Same unit Attributes

Table 4.6: Example of tidy data.

Country Year Cases


Brazil 2019 100
Brazil 2020 200
USA 2019
USA 2020 400

An example of tidy data from the data in table 4.4.

If we follow this structure, the meaning of the data is implicit in the


table. Table 4.6 shows the same data in a tidy format. The table is now
longer, but the variables and observations are clear from the table itself.
However, it is not always trivial to organize data in a tidy format.
Usually, we have more than one level of observational units, each one
represented by a table. Moreover, there might exist more than one way
to define what the observational units in a dataset are9.
9Although Wickham himself implies that there is only one possible way to define the
observational units of the dataset.

To organize data in a tidy format, one can consider that:


• Attributes are functionally related among themselves — e.g., Z is
a linear combination of X and Y, or X and Y are correlated, or
𝑃(𝑋, 𝑌 ) follows some joint distribution.
• Units can be grouped or compared — e.g., person A is taller than
person B, or the temperature in day 1 is higher than in day 2.
A particular point that tidy data do not address is that values in a col-
umn might not be in the same scale or unit of measurement10. For ex-
ample, a column might contain the temperature in an experiment, and
another column might contain the unit of measurement that was used
to measure the temperature. This is a common problem in databases,
and it must be addressed for machine learning and statistical methods
to work properly.
Note that the order of the rows and columns is not important. How-
ever, it might be convenient to sort data in a particular way to facilitate
understanding. For instance, one usually expects that the first columns
are fixed variables11 — i.e., variables that are not the result of a measure-
ment but that describe the experimental design —, and the last columns
are measured variables. Also, arranging rows by some variable might
highlight some pattern in the data.
Usually, columns are named (the collection of all column names is
called the header), while rows are usually numerically indexed.

4.3.1 Common messy datasets


Wickham (2014)12 lists some common problems with messy datasets
and how to tidy them. In this subsection, we focus on the problems
and the tidy solutions. The data handling operations that enable us to
tidy the data are presented in chapter 5. Readers interested in a step-by-
step guide for data tidying are encouraged to read Wickham, Çetinkaya-
Rundel, and Grolemund (2023)13.
The problems are summarized in the following.

10Observational unit is not the same concept as unit of measurement.


11Closely related (and potentially the same as) key in database theory.
12H. Wickham (2014). “Tidy Data”. In: Journal of Statistical Software 59.10, pp. 1–23.
doi: 10.18637/jss.v059.i10.
13H. Wickham, M. Çetinkaya-Rundel, and G. Grolemund (2023). R for Data Science:
Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.

Headers are values, not variable names


For example, consider table 4.7. This table is not tidy because the col-
umn headers are values, not variable names. This format is frequently
used in presentations since it is more compact. It is also useful to per-
form matrix operations. However, it is not appropriate for general anal-
ysis.

Table 4.7: Messy table, from Pew Forum dataset, where headers
are values, not variable names.

Religion <$10k $10-20k $20-30k …


Agnostic 27 34 60 …
Atheist 12 27 37 …
Buddhist 27 21 30 …
… … … … …

To make it tidy, we can transform it into the table 4.8 by explicitly


introducing variables Income and Frequency. Note that the table is now
longer, but it is also narrower. This is a common pattern when fixing
this kind of issue. The table is now tidy because the column headers are
variable names, not values.

Table 4.8: Tidy version of table 4.7 where values are correctly
moved.

Religion Income Frequency


Agnostic <$10k 27
Agnostic $10-20k 34
Agnostic $20-30k 60
… … …
Atheist <$10k 12
Atheist $10-20k 27
Atheist $20-30k 37
… … …
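
In pandas, this transformation is a single lengthening operation, as
sketched below with a small excerpt of the dataset.

import pandas as pd

messy = pd.DataFrame({
    "religion": ["Agnostic", "Atheist", "Buddhist"],
    "<$10k": [27, 12, 27],
    "$10-20k": [34, 27, 21],
    "$20-30k": [60, 37, 30],
})

tidy = messy.melt(id_vars="religion", var_name="income", value_name="frequency")
print(tidy.head())
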

Multiple variables are stored in one column


For example, consider the table 4.9. This table is not tidy because the
column — interestingly called column — contains multiple variables.
This format is frequent, and sometimes the column name contains the
names of the variables. Sometimes it is very hard to separate the vari-
ables.

Table 4.9: Messy table, from TB dataset, where multiple variables


are stored in one column.

country year column cases …


AD 2000 m014 0 …
AD 2000 m1524 0 …
AD 2000 m2534 1 …
AD 2000 m3544 0 …
… … … …

To make it tidy, we can transform it into the table 4.10. Two columns
are created to contain the variables Sex and Age, and the old column
is removed. The table keeps the same number of rows, but it is now
wider. This is a common pattern when fixing this kind of issue. The
new version usually fixes the issue of correctly calculating ratios and
frequency.

Table 4.10: Tidy version of table 4.9 where values are correctly
moved.

country year sex age cases …


AD 2000 m 0–14 0 …
AD 2000 m 15–24 0 …
AD 2000 m 25–34 1 …
AD 2000 m 35–44 0 …
… … … … …
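
A sketch of this fix in pandas follows; the regular expression assumes
codes such as m014 or f1524, and the recoding of ages into ranges (e.g.,
014 to 0-14) is omitted.

import pandas as pd

messy = pd.DataFrame({
    "country": ["AD", "AD", "AD"],
    "year": [2000, 2000, 2000],
    "column": ["m014", "m1524", "m2534"],
    "cases": [0, 0, 1],
})

# Split the combined code into two variables: sex and age.
parts = messy["column"].str.extract(r"^(?P<sex>[mf])(?P<age>\d+)$")
tidy = pd.concat([messy.drop(columns="column"), parts], axis=1)
print(tidy)
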

Variables are stored in both rows and columns


For example, consider the table 4.11. This is the most complicated case
of messy data. Usually, one of the columns contains the names of the
variables, in this case the column element.

Table 4.11: Messy table, adapted from airquality dataset, where


variables are stored in both rows and columns.

id year mo. element d1 d2 … d31


MX17004 2010 1 tmax 24 … 27
MX17004 2010 1 tmin 14 …
MX17004 2010 2 tmax 27 24 … 27
MX17004 2010 2 tmin 14 … 13
… … … … … … … …

To fix this issue, we must first decide which column contains the
names of the variables. Then, we must lengthen the table in function of
the variables (and potentially their names), as seen in table 4.12.

Table 4.12: Partial solution to tidy table 4.11. Note that the table
is now longer.

id date element value


MX17004 2010-01-01 tmax
MX17004 2010-01-01 tmin 14
MX17004 2010-01-02 tmax 24
MX17004 2010-01-02 tmin
… … … …

Afterwards, we widen the table in function of their names. Finally,


we remove implicit information, as seen in table 4.13.

Table 4.13: Tidy version of table 4.11 where values are correctly
moved.

id date tmin tmax


MX17004 2010-01-01 14
MX17004 2010-01-02 24
… … … …
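
The two steps (lengthening over the day columns, then widening over the
element column) can be sketched in pandas as follows, using a small
excerpt with made-up values.

import pandas as pd

messy = pd.DataFrame({
    "id": ["MX17004"] * 2,
    "year": [2010, 2010],
    "month": [1, 1],
    "element": ["tmax", "tmin"],
    "d1": [None, 14.0],
    "d2": [24.0, None],
})

# Lengthen over the day columns and drop implicit missing values.
longer = messy.melt(id_vars=["id", "year", "month", "element"],
                    var_name="day", value_name="value").dropna(subset=["value"])
longer["day"] = longer["day"].str.lstrip("d").astype(int)

# Widen over the element column so tmin and tmax become variables.
tidy = (longer.pivot_table(index=["id", "year", "month", "day"],
                           columns="element", values="value")
               .reset_index())
print(tidy)
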

Multiple types of observational units are stored in the same


table
For example, consider the table 4.14. It is very common during data
collection that many observational units are registered in the same table.

Table 4.14: Messy table, adapted from billboard dataset, where


multiple types of observational units are stored in the same table.

year artist track date rank


2000 2 Pac Baby Don’t Cry 2000-02-26 87
2000 2 Pac Baby Don’t Cry 2000-03-04 82
2000 2 Pac Baby Don’t Cry 2000-03-11 72
2000 2 Pac Baby Don’t Cry 2000-03-18 77
… … … … …
2000 2Ge+her The Hardest… 2000-09-02 91
2000 2Ge+her The Hardest… 2000-09-09 87
2000 2Ge+her The Hardest… 2000-09-16 92
… … … … …

To fix this issue, we must ensure that each observational unit is moved
to a different table. Sometimes, it is useful to create unique identifiers
for each observation. The separation avoids several types of potential
inconsistencies. However, take into account that during data analysis,
it is possible that we have to denormalize them. The two resulting tables
are shown in table 4.15 and table 4.16.

Table 4.15: Tidy version of table 4.14 containing the observational


unit track.

track id artist track


1 2 Pac Baby Don’t Cry
2 2Ge+her The Hardest Part Of Breaking Up
… … …

Table 4.16: Tidy version of table 4.14 containing the observational


unit rank of the track in a certain week.

track id date rank


1 2000-02-26 87
1 2000-03-04 82
1 2000-03-11 72
1 2000-03-18 77
… … …
2 2000-09-02 91
2 2000-09-09 87
2 2000-09-16 92
… … …
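
A sketch of this separation in pandas follows; the surrogate track
identifier is created from the artist and track columns, which is an
illustrative choice.

import pandas as pd

messy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "artist": ["2 Pac", "2 Pac", "2Ge+her", "2Ge+her"],
    "track": ["Baby Don't Cry", "Baby Don't Cry", "The Hardest...", "The Hardest..."],
    "date": ["2000-02-26", "2000-03-04", "2000-09-02", "2000-09-09"],
    "rank": [87, 82, 91, 87],
})

# Create a surrogate key for the track observational unit.
messy["track_id"] = messy.groupby(["artist", "track"], sort=False).ngroup() + 1

tracks = messy[["track_id", "artist", "track"]].drop_duplicates()
ranks = messy[["track_id", "date", "rank"]]
print(tracks)
print(ranks)
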

A single observational unit is stored in multiple tables


For example, consider tables 4.17 and 4.18. It is very common during
data collection that a single observational unit is stored in multiple ta-
bles. Usually, the table (or file) itself represents the value of a variable.
When columns are compatible, it is straightforward to combine the ta-
bles.
To fix this issue, we must first make the columns compatible. Then,
we can combine the tables adding a new column that identifies the ori-
gin of the data. The resulting table is shown in table 4.19.

Table 4.17: Messy tables, adapted from nycflights13 dataset, where


a single observational unit is stored in multiple tables. Assume
that the origin filename is called 2013.csv.

month day time …


1 1 517 …
1 1 533 …
1 1 542 …
1 1 544 …
… … … …

Table 4.18: Messy tables, adapted from nycflights13 dataset, where


a single observational unit is stored in multiple tables. Assume
that the origin filename is called 2014.csv.

month day time …


1 1 830 …
1 1 850 …
1 1 923 …
1 1 1004 …
… … … …

Table 4.19: Tidy data where tables 4.17 and 4.18 are combined.

year month day time …


2013 1 1 517 …
2013 1 1 533 …
2013 1 1 542 …
2013 1 1 544 …
… … … … …
2014 1 1 830 …
2014 1 1 850 …
2014 1 1 923 …
2014 1 1 1004 …
… … … … …
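
In pandas, this combination is a concatenation with an extra column for
the origin of each table, as sketched below with tiny excerpts standing in
for the contents of 2013.csv and 2014.csv.

import pandas as pd

flights_2013 = pd.DataFrame({"month": [1, 1], "day": [1, 1], "time": [517, 533]})
flights_2014 = pd.DataFrame({"month": [1, 1], "day": [1, 1], "time": [830, 850]})

# Record the origin of each row (here, the year from the file name).
tidy = pd.concat([flights_2013.assign(year=2013),
                  flights_2014.assign(year=2014)],
                 ignore_index=True)
print(tidy)
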

4.4 Bridging normalization, tidiness, and data theory
First and foremost, both concepts, normalization and tidy data, are not
in conflict.
In data normalization, given a set of functional, multivalued, and
join dependencies, there exists a normal form that is free of redundancy.
In tidy data, Wickham, Çetinkaya-Rundel, and Grolemund (2023)14 also
state that there is only one way to organize the given data.
Wickham (2014)15 states that tidy data is 3NF. However, he does not
provide a formal proof. Since tidy data focuses on data analysis and not
on data storage, I argue that there is more than one way to organize the
data in a tidy format. It actually depends on what you define as the
observational unit.
Moreover, both of them are related to the philosophical concept of
substance (οὐσία) — see section 2.3.1. Entities and observational units
are substances while attributes are predicates. Each tuple or observation
is a primary substance, i.e., a substance that contrasts with everything
else, particular, individual.
We can also understand primary keys and fixed variables as the same
concept. They both describe the sample uniquely. They connect the en-
tities/observational units to the remaining attributes. They also should
never be fed into a learning machine (more details in chapter 6), since
they are individual and thus not appropriate to generalize.
Table 4.20 summarizes the equivalence (or similarity) of terms in
different contexts.

4.4.1 Tidy or not tidy?


Consider the following example. We want to study the phenomenon of
temperature in a certain city. We fix three sensors in different locations
to measure the temperature. We collect data three times a day. If we
consider as the observational unit the event of measuring the tempera-
ture, we can organize the data in a tidy format as shown in table 4.21.
However, since the sensors are fixed, we can consider the observa-
tional unit as the temperature at some time. In this case, we can organize
the data in a tidy format as shown in table 4.22.
14H. Wickham, M. Çetinkaya-Rundel, and G. Grolemund (2023). R for Data Science:
Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.
15H. Wickham (2014). “Tidy Data”. In: Journal of Statistical Software 59.10, pp. 1–23.
doi: 10.18637/jss.v059.i10.

Table 4.20: Terms in different contexts.

Relations Tidy data Philosophy


Entities Observational units Substance
Tuple Observation Primary substance
Primary key Fixed variables Univocal name
Non-prime attr. Measured variable Predicate

Equivalence (or similarity) of data-related terms in different con-


texts. The ontological understanding of the data influences the
way it is organized.

Table 4.21: Tidy data where the observational unit is the event of
measuring the temperature.

date time sensor temperature


2023-01-01 00:00 1 20
2023-01-01 00:00 2 21
2023-01-01 00:00 3 22
2023-01-01 08:00 1 21
2023-01-01 08:00 2 22
2023-01-01 08:00 3 23
… … … …

Table 4.22: Tidy data where the observational unit is the tempera-
ture at some time.

date time temp. 1 temp. 2 temp. 3


2023-01-01 00:00 20 21 22
2023-01-01 08:00 21 22 23
… … … … …

In both cases, one can argue that the data is also normalized. In the
first case, the primary key is the composite of the columns date, time,
and sensor. In the second case, the primary key is the composite of the
columns date and time.
One can state that the first form is more appropriate, since it is flexi-
ble enough to add more sensors or sensor-specific attributes (using an ex-
tra table). However, the second form is very natural for machine learn-
ing and statistical methods. Given the definition of tidy data, I believe
both forms are correct. It is just a matter of what ontological view you
have of the data.

Table 4.23: Tidy data for measurements of a person’s body.

name chest waist hip


Alice 90 70 100
Bob 100 110 110
… … … …

Still, one can argue that the sensors share the same nature and thus
only the first form is correct (or can even insist that the more flexible
form is the correct one). Consider however the data in table 4.23. The
observational unit is the person, and the attributes are the body mea-
surements.

Table 4.24: Another tidy data for measurements of a person’s body.

name body part measurement


Alice chest 90
Alice waist 70
Alice hip 100
Bob chest 100
Bob waist 110
Bob hip 110
… … …

If we apply the same logic of table 4.21, data in table 4.23 becomes ta-
ble 4.24. Now, the observational unit is the measurement of a body part

of a given person. Now, we can easily include more body parts. Let us
say that we want to add the head circumference. We just need to include
rows such as “Alice, head, 50” and “Bob, head, 55”. Moreover, what if
we want to add the height of the person? Should we create another table
(with “name” and “height”) or should we consider “height” as another
body part (even though it seems weird to consider the full body a part
of the body)?
In the first version of the data (table 4.21), it would be trivial to in-
clude head circumference and height. In the second version, the choice
becomes inconvenient. This table seems “overly tidy”. If the first fits
well for the analysis, it should be preferred.
In summary, tidiness is a matter of perspective.

4.4.2 Change of observational unit


Another very interesting conjecture is whether we can formalize the
eventual change of observational unit in terms of the order in which
joins and grouping operations are performed.
Consider the following example: the relation 𝑅[𝐴, 𝐵, 𝐶, 𝐷, 𝐸] and the
functional dependencies 𝐴 → 𝐷, 𝐵 → 𝐸, and 𝐴𝐵 → 𝐶. The relation can
be normalized up to 3NF by following one of the decomposition trees
shown in fig. 4.1. Every decomposition tree must take into account that
the join of the projections is lossless and dependency preserving.

Figure 4.1: Decomposition trees for the relation 𝑅[𝐴𝐵𝐶𝐷𝐸] and
the functional dependencies 𝐴 → 𝐷, 𝐵 → 𝐸, and 𝐴𝐵 → 𝐶 to
reach 3NF.

[Two decomposition trees. Left: ABCDE splits into AD and ABCE;
ABCE splits into BE and ABC. Right: ABCDE splits into BE and
ABCD; ABCD splits into AD and ABC.]

Note that the decomposition that splits first 𝑅[𝐴𝐵𝐶] is not valid,
since the resulting relation 𝑅[𝐴𝐵] is not a consequence of a functional
dependency; see fig. 4.2.

Figure 4.2: Invalid decomposition trees for the relation 𝑅[𝐴𝐵𝐶𝐷𝐸].

[Two invalid decomposition trees. Left: ABCDE splits into ABC and
ABDE; ABDE splits into AD and ABE; ABE splits into BE and AB.
Right: ABCDE splits into ABC and ABDE; ABDE splits into BE and
ABD; ABD splits into AD and AB.]

We consider the functional dependencies 𝐴 → 𝐷, 𝐵 → 𝐸, and


𝐴𝐵 → 𝐶. Note that 𝑅[𝐴𝐵] is not a consequence of a functional
dependency.

In this kind of relation schema, we have a set of key attributes, here


𝒦 = 𝐴𝐵, and a set of non-prime attributes, here 𝒩 = 𝐶𝐷𝐸. Note that
the case 𝒦 ∩ 𝒩 = ∅ is the simplest one we can have.
Observe, however, that transitive dependencies16 and complex join
dependencies restrict even further the joins we are allowed to perform.
Now, consider a very common case: in our dataset, keys are un-
known. Let 𝐴 be a student id, 𝐵 be the course id, 𝐷 be the student age, 𝐸
be the course load, and 𝐶 be the student grade at the course. If only 𝐶𝐷𝐸
is known, the table 𝑅[𝐶𝐷𝐸] is already tidy — and the observational unit
is the enrollment — once there is no key to perform any kind of normal-
ization. This happens in many cases where privacy is a concern.
But we can also consider that the observational unit is the student.
In this case, we must perform joins traversing the leftmost decomposi-
tion tree in fig. 4.1 from bottom to top. After each join, a summariza-
tion operation is performed on the relation considering the student as
the observational unit, i.e., over attribute 𝐴. The first join results in re-
lation 𝑅[𝐴𝐵𝐶𝐸], and the summarization operation results in a new
relation 𝑅[𝐴𝐹𝐺], where 𝐹 is the average grade and 𝐺 is the total course
load taken by the student (see table 4.25). They are all calculated based
on the rows that are grouped in function of 𝐴. It is important to notice
that, after the summarization operation, all observations must contain
a different value of 𝐴. The second join results in relation 𝑅[𝐴𝐷𝐹𝐺] =
𝑅[𝐴𝐷] ⋈ 𝑅[𝐴𝐹𝐺]. This relation has functional dependency 𝐴 → 𝐷𝐹𝐺,
and it is in 3NF (which is also tidy).

16Actually, when an attribute is both key and non-prime, some joins may generate
invalid tables.

Table 4.25: Example of a dataset where the observational unit is
the student.

A (student)   B (course)   C (grade)   E (load)
1             1            7           60
1             2            8           30
2             1            7           60
2             3            9           40
…             …            …           …

A (student)   F (average grade)   G (total load)
1             7.5                 90
2             8                   100
…             …                   …

Relation 𝑅[𝐴𝐵𝐶𝐸] becomes 𝑅[𝐴𝐹𝐺] after the summarization
operation. Now each row represents a student (values in 𝐴 are
unique).
Unfortunately, it is not trivial to calculate all possible decomposition
trees for a given dataset. It is up to the data scientist to decide which
directions to follow. However, it is important to notice that the order of
the joins and summarization operations are crucial to the final result.
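
The join-then-summarize path described above can be sketched in pandas
as follows; the grades and loads follow table 4.25, while the student ages
are made up for illustration.

import pandas as pd

enrollment = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 1, 3],
                           "C": [7, 8, 7, 9]})              # R[ABC]: grades
course = pd.DataFrame({"B": [1, 2, 3], "E": [60, 30, 40]})  # R[BE]: course loads
student = pd.DataFrame({"A": [1, 2], "D": [20, 22]})        # R[AD]: ages (made up)

# First join, then summarize over the new observational unit (the student).
abce = enrollment.merge(course, on="B")
summary = (abce.groupby("A")
                .agg(F=("C", "mean"), G=("E", "sum"))
                .reset_index())

# Second join: each row now describes one student (3NF and tidy).
result = student.merge(summary, on="A")
print(result)
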

4.5 Data semantics and interpretation


In the rest of the book, we focus on a statistical view of the data. Besides
the functional dependencies, we also consider the statistical dependen-
cies of the data. For instance, attributes 𝐴 and 𝐵 might not be function-
ally dependent, but they might exist in an unknown 𝑃(𝐴, 𝐵) that we can

estimate from the data. Each observed value of a key can represent an
instance of a random variable, and the other attributes can represent
measured attributes or calculated properties.
For data analysis, it is very important to understand the relationships
between the observations. For example, we might want to know if the
observations are independent, if they are identically distributed, or if
there is a known selection bias. We might also want to know if the ob-
servations are dependent on time, and if there are hidden variables that
affect the observations.
Following wrong assumptions can lead to wrong conclusions. For
example, if we assume that the observations are independent, but they
are not, we might underestimate the variance of the estimators.
Although we do not focus on time series, we must consider the tem-
poral dependence of the observations. For example, we might want to
know how the observation 𝑥𝑡 is affected by 𝑥𝑡−1 , 𝑥𝑡−2 , and so on. We
might also want to know if the Markov property holds, and if there is
periodicity and seasonality in the data.
For the sake of the scope of this book, we suggest that any predic-
tion on temporal data should be done in the state space, where it is
safer to assume that observations are independent and identically dis-
tributed. This is a common practice in reinforcement learning and deep
learning. Takens’ theorem17 allows you to reconstruct the state space
of a dynamical system using time-delay embedding. Given a single ob-
served time series, you can create a multidimensional representation
of the underlying dynamical system by embedding the time series in a
higher-dimensional space. This embedding can reveal the underlying
dynamics and structure of the system.
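
A minimal sketch of time-delay embedding is shown below; the series, the
embedding dimension, and the delay are arbitrary choices for illustration.

import numpy as np

def delay_embed(series, dimension=3, delay=1):
    # Stack lagged copies of a 1-D series into state vectors (rows).
    series = np.asarray(series)
    n = len(series) - (dimension - 1) * delay
    return np.column_stack([series[i * delay: i * delay + n]
                            for i in range(dimension)])

x = np.sin(np.linspace(0, 10 * np.pi, 200))   # an observed time series
states = delay_embed(x, dimension=3, delay=2)
print(states.shape)  # (196, 3): each row is a reconstructed state
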

4.6 Unstructured data


Unstructured data are data that do not have a predefined data model or
are not organized in a predefined manner. For example, text, images,
and videos are unstructured data.
Every unstructured dataset can be converted into a structured one.
However, the conversion process is not always straightforward nor loss-
less. For example, we can convert a text into a structured dataset by
counting the number of occurrences of each word18. However, we lose
the order of the words in the text.

17F. Takens (2006). “Detecting strange attractors in turbulence”. In: Dynamical Sys-
tems and Turbulence, Warwick 1980: proceedings of a symposium held at the University of
Warwick 1979/80. Springer, pp. 366–381.
The study of unstructured data is, for the moment, out of the scope
of this book.

18This is called a bag-of-words approach.


5
Data handling

† It’s dangerous to go alone! Take this.


— Unnamed Old Man, The Legend of Zelda

In the previous chapter, I discussed the relationship between data
format and data semantics. We also saw in chapter 3 that data tasks —
specifically integration and tidying — must adjust the available data to
reflect the kind of input we expect in production. Data handling consists
of operating on this data.
For those tasks, we must be careful with the operations we perform
on the data. At the stage of data preparation, for example, we should
never parametrize our data handling pipeline in terms of information
derived1 from the values of the data. This is because such operations lead
to data leakage during evaluation and other biases in our conclusions.
In this chapter, we consider that tables are rectangular data struc-
tures in which values of the same column share the same properties (i.e.
the same type, same restrictions, etc.) and each column has a name.
Moreover, we assume that any value is possibly missing.
From a mathematical definition of such tables, we can define a set of
operations that can be applied to them. These operations are the build-
ing blocks of data handling pipelines: combinations of operations that
transform a dataset into another dataset.
Finally, I highlight some important properties of these operations.
Especially, the split-invariance property, which ensures that the opera-
tions do not add bias to the data due to the way the data was collected.

1For instance, imputation by the mean of a column.


Chapter remarks

Contents
5.1 Formal structured data . . . . . . . . . . . . . . . . . . . 83
5.1.1 Splitting and binding . . . . . . . . . . . . . . . . 84
5.1.2 Split invariance . . . . . . . . . . . . . . . . . . . 86
5.1.3 Illustrative example . . . . . . . . . . . . . . . . . 87
5.2 Data handling pipelines . . . . . . . . . . . . . . . . . . . 90
5.3 Split-invariant operations . . . . . . . . . . . . . . . . . . 91
5.3.1 Tagged splitting and binding . . . . . . . . . . . . 92
5.3.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Joining . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.4 Selecting . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.5 Filtering . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.6 Mutating . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.7 Aggregating . . . . . . . . . . . . . . . . . . . . . 102
5.3.8 Ungrouping . . . . . . . . . . . . . . . . . . . . . 102
5.4 Other operations . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Projecting or grouping . . . . . . . . . . . . . . . 105
5.4.2 Grouped and arranged operations . . . . . . . . . 107
5.5 An algebra for data handling . . . . . . . . . . . . . . . . 109

Context

• Data handling consists of operating on tables.


• Properties of the operations are important to avoid bias.
• Data handling pipelines are a way to organize these operations.

Objectives

• Define a formal structure for tables.


• Define a set of operations that can be applied to tables.

Takeaways

• Split-invariant operations avoid sampling bias.


• One must understand the properties and premises of the opera-
tions.

5.1 Formal structured data


In this section, I present a formal definition of structured data. This def-
inition is compatible with the relational model and tidy data presented
in chapter 4. My definition takes into account the index2 of the table,
which is a key concept in data handling. We also consider that values
can be missing. Repeated rows are represented by allowing cells to con-
tain sets of values. In this chapter, we consider dataset and table as syn-
onyms.

Definition 5.1: (Indexed table)

An indexed table 𝑇 is a tuple (𝐾, 𝐻, 𝑐), where 𝐾 =


{𝐾𝑖 ∶ 𝑖 = 1, … , 𝑘} is the set of index columns, 𝐻 is the set of
(non-index) columns, and 𝑐 ∶ 𝒟(𝐾1 ) × ⋯ × 𝒟(𝐾𝑘 ) × 𝐻 → 𝒱 is the
cell function. Here, 𝒱 represents the space of all possible tuples
of values, which may include missing values ?. Values have
arbitrary types, such as integers, real numbers, strings, etc. Each
index column 𝐾𝑖 has a domain 𝒟(𝐾𝑖 ), which is an enumerable
set of values.

A possible row 𝑟 of the table is indexed by a tuple 𝑟 = (𝑘1 , … , 𝑘𝑘 ),


where 𝑘𝑖 ∈ 𝒟(𝐾𝑖 ). Each row has a cardinality card(𝑟), which represents
how many times the entity represented by the row is present in the table.
A row 𝑟 with card(𝑟) = 0 is considered to be missing.
A cell is then represented by a row 𝑟 and a column ℎ ∈ 𝐻. The value
of the cell, v = 𝑐(𝑟, ℎ) is a tuple of values in the domain 𝒟(ℎ) ∪ {?}, such
that |v| = card(𝑟). We say that 𝒟(ℎ) is the valid domain of the column
ℎ. The order of the elements in the tuple v is arbitrary but fixed.
We can stack nested rows to form a matrix of values. This matrix is
called the value matrix of the row 𝑟.
We assume that value matrices — and consequently row cardinali-
ties — are minimal. This means that there are no nested rows

𝑣 𝑖,1 , … , 𝑣 𝑖,|𝐻|

in the value matrices such that 𝑣 𝑖,𝑗 = ? for all 𝑗.


From these concepts, we can define the basic operations and prop-
erties that can be applied to tables.
2Also called grouping variables.

Definition 5.2: (Nested row)

A nested row consists of a tuple of values that associates different


columns with the same repetition of the entity, i.e.

[𝑣ℎ𝑖 ∶ ℎ ∈ 𝐻],

where 𝑐(𝑟, ℎ) = [𝑣ℎ𝑖 ∶ 𝑖 = 1, … , card(𝑟)], assuming an arbitrary


fixed order of the columns ℎ.

Definition 5.3: (Value matrix)

The value matrix 𝑉 = (𝑣 𝑖,𝑗 ) of the row 𝑟 is

[𝑐(𝑟, ℎ) ∶ ℎ ∈ 𝐻],

with dimensions card(𝑟) × |𝐻|.
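
To make these definitions concrete, the sketch below represents a tiny indexed table in plain Python. This is a hypothetical illustration, not part of the formalism above: the cell function 𝑐 is a dictionary keyed by index tuples, nested repetitions are tuples, and None plays the role of the missing value ?.

    # Toy cell function c(r, h): maps an index tuple r to its column values.
    # Tuples model nested repetitions; None stands for a missing value (?).
    cells = {
        ("Bob", "Chemistry"): {"year": (2018, 2019), "grade": (None, 7)},
        ("Bob", "Physics"):   {"year": (2019, 2020), "grade": (4, 8)},
        ("Alice", "Math"):    {"year": (2019,),      "grade": (8,)},
    }

    def card(r):
        """Cardinality of row r: number of nested repetitions (0 if absent)."""
        row = cells.get(r)
        return len(next(iter(row.values()))) if row else 0

    def value_matrix(r):
        """Stack the nested rows of r into a card(r) x |H| matrix."""
        row = cells.get(r, {})
        return list(zip(*row.values()))

    print(card(("Bob", "Chemistry")))        # 2
    print(value_matrix(("Bob", "Physics")))  # [(2019, 4), (2020, 8)]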

5.1.1 Splitting and binding


Split and bind are very basic operations that can be applied to tables.
They are inverses of each other and are used to divide and combine ta-
bles, respectively. They are important in the data science process be-
cause they play a key role in data semantics and validation of solutions.

Definition 5.4: (Split operation)

Given an indicator function 𝑠 ∶ 𝒟(𝐾1 ) × ⋯ × 𝒟(𝐾𝑘 ) → {0, 1}, the


split operation creates two tables, 𝑇0 and 𝑇1 , that contain only the
rows for which 𝑠(𝑟) = 0 and 𝑠(𝑟) = 1, respectively.
Mathematically, the split operation is defined as

split(𝑇, 𝑠) = (𝑇0 , 𝑇1 ) ,

where 𝑇 = (𝐾, 𝐻, 𝑐), 𝑇𝑖 = (𝐾, 𝐻, 𝑐 𝑖 ), and

𝑐 𝑖 (𝑟, ℎ) = 𝑐(𝑟, ℎ) if 𝑠(𝑟) = 𝑖, and 𝑐 𝑖 (𝑟, ℎ) = () otherwise.

Note that, by definition, the split operation never “breaks” a row. So,
the indices define the indivisible entities of the table. The resulting tables
are disjoint:

Definition 5.5: (Disjoint tables)

Two tables 𝑇0 = (𝐾, 𝐻, 𝑐 0 ) and 𝑇1 = (𝐾, 𝐻, 𝑐 1 ) are said to be dis-


joint if card(𝑟; 𝑐 0 ) = 0 whenever card(𝑟; 𝑐 1 ) > 0 for any row 𝑟, and
vice-versa.

The binding operation is the inverse of the split operation. Given two
disjoint tables 𝑇0 = (𝐾, 𝐻, 𝑐 0 ) and 𝑇1 = (𝐾, 𝐻, 𝑐 1 ), the binding operation
creates a new table 𝑇 that contains all the rows of 𝑇0 and 𝑇1 .

Definition 5.6: (Bind operation)

Mathematically, the binding operation is defined as

bind(𝑇0 , 𝑇1 ) = (𝐾, 𝐻, 𝑐),

where 𝑇𝑖 = (𝐾, 𝐻, 𝑐 𝑖 ) and

𝑐(𝑟, ℎ) = 𝑐 0 (𝑟, ℎ) + 𝑐 1 (𝑟, ℎ).

The operator + stands for the tuple concatenation operatora.


aThe order of the concatenation here is not an issue since we guarantee that
at least one of the operands is empty.

Thus, a requirement for the binding operation is that the tables are
disjoint in terms of the row entities they have.

Premises in real-world applications One important aspect of these


functions is that we assume that the entities represented by the rows are
indivisible, and that any binding operation will never occur for tables
that share the same entities.
In real-world applications, this is not always true. Many times, we
do not know the process someone else has used to collect the data. In
these cases, we must be careful about the guarantees we discuss in this
chapter. On the other hand, one can consider the premises we use as a
guideline to design good data collection processes.

We can see data collection as the result of a splitting operation in the


universe set of all possible entities. This is a good way to think about
data collection, as we can try to ensure that we collect all possible infor-
mation about the entities we are interested in.
This, of course, depends on what we define as the index columns of
the table. Consider the example of collecting information about grades
of students. If we define the student’s name and year as the indexes, we
must ensure that we collect all the grades of all subjects a student has
taken in a year. We do not need, though, to collect information from all
students or all years. On the other hand, if we define only the student’s
name as the index column, we must collect all the grades of all subjects
a student has taken in all years.
In summary, the fewer variables we define as index columns, the
more information we must collect about each entity. However, in the
next sections, we show that assuming many index columns leads to re-
strictions in the operations we can perform on the table.
This conceptual trade-off is important to understand when structur-
ing the problem we are trying to solve. Neglecting these issues can lead
to strong statistical biases and incorrect conclusions.

5.1.2 Split invariance


One property we can study about data handling operations is whether
they are distributive over the bind operation. This property is called split
invariance.
From now on, we will denote

𝑇0 + 𝑇1 = bind(𝑇0 , 𝑇1 ),

for any tables 𝑇0 and 𝑇1 to simplify the notation.

Definition 5.7: (Split invariance)

An arbitrary data handling operation 𝑓(𝑇) is said to be split-


invariant if, for any table 𝑇 and split function 𝑠, the following
equation holds

𝑓(𝑇0 + 𝑇1 ) = 𝑓(𝑇0 ) + 𝑓(𝑇1 ) ,

where 𝑇0 , 𝑇1 = split(𝑇; 𝑠).



Split invariance is a desirable property for data handling operations


during the data tasks described in chapter 3: integration and tidying.
Even while exploring data, we should make an effort to use split-invariant
operations.
The reason is that split invariance ensures that the operation does
not depend on the split performed (usually unknown to us) to create the
table we have in hand. This property is important to avoid data leakage
and to avoid biasing the results of the analysis.
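
The property can also be checked empirically. The sketch below, using pandas with hypothetical column names, splits a table by an indicator on the index column, applies a row-wise operation to each part, and verifies that binding the results agrees with applying the operation to the whole table.

    import pandas as pd

    T = pd.DataFrame({"id": [1, 2, 3, 4], "x": [10, 20, 30, 40]})  # "id" is the index column
    s = T["id"] % 2 == 0                                           # split indicator
    T0, T1 = T[~s], T[s]                                           # split(T, s)

    def f(t):
        # a row-wise operation (a mutate), expected to be split-invariant
        return t.assign(y=t["x"] * 2)

    lhs = f(pd.concat([T0, T1])).sort_values("id").reset_index(drop=True)
    rhs = pd.concat([f(T0), f(T1)]).sort_values("id").reset_index(drop=True)
    assert lhs.equals(rhs)  # f(T0 + T1) == f(T0) + f(T1)

An operation that used, say, the mean of a column over the available rows would fail this check, which is precisely why it is not split-invariant.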

5.1.3 Illustrative example

Table 5.1: Data table of student grades.

student subject year grade


Alice Chemistry 2020 6
Alice Math 2019 8
Alice Physics 2019 7
Bob Chemistry 2018 ?
Bob Chemistry 2019 7
Bob Math 2019 9
Bob Physics 2019 4
Bob Physics 2020 8
Carol Biology 2020 8
Carol Chemistry 2020 3
Carol Math 2020 10

Data collected about student grades. All information that is avail-


able is presented.

Consider the example of data collected about student grades. Ta-


ble 5.1 exemplifies all information we can possibly have about the grades
of students. A missing value in a cell of that table indicates that, for some
reason, the information is not retrievable.
The domains of the variables are:
• 𝒟(student) = {Alice, Bob, Carol};
• 𝒟(subject) = {Biology, Chemistry, Math, Physics};
• 𝒟(year) = ℤ; and

• 𝒟(grade) = [0, 10] ∪ {?}.

Of course, in practice, we have no guarantee that the data we have


is complete, nor do we have a clear specification of the domains of the variables. In-
stead, we must choose good premises about the data we are working
with.
Knowing that the data is complete, we can safely assume that:

1. Alice has never taken Biology;


2. Bob passed Physics, although at the second attempt;
3. Carol has only taken classes in 2020.

Table 5.2: Data table of student grades assuming student and sub-
ject as indices.

s student subject year grade


0 Alice Chemistry (2020) (6)
1 Alice Math (2019) (8)
1 Alice Physics (2019) (7)
0 Bob Chemistry (2018, 2019) (?, 7)
0 Bob Math (2019) (9)
1 Bob Physics (2019, 2020) (4, 8)
0 Carol Biology (2020) (8)
0 Carol Chemistry (2020) (3)
1 Carol Math (2020) (10)

Indexed table with data from table 5.1 assuming student and sub-
ject as indices. The column 𝑠 is the split indicator.

Now consider an arbitrary collection mechanism that considers stu-


dent and subject as the indices of the table. Table 5.2 shows the table we
have in hand. The column 𝑠 is the split indicator. Only rows with 𝑠 = 1
are available to us.
Now, about the statements we made before:

1. There is no way we can know if Alice has taken Biology or not. It


could be that the data collection mechanism failed to collect this
information or that the information simply does not exist.

2. We can safely assume that Bob has passed Physics in his second
attempt, once all information about (Bob, Physics) is assumed to
be available.
3. There is no guarantee that Carol has only taken classes in 2020. It
could be that some row (Carol, subject) with a year different from
2020 is missing in the table.

Table 5.3: Data table of student grades assuming student as the


index.

s student subject year grade

1 Alice (Chemistry, Math, Physics) (2020, 2019, 2019) (6, 8, 7)
0 Bob (Chemistry, Chemistry, Math, Physics, Physics) (2018, 2019, 2019, 2019, 2020) (?, 7, 9, 4, 8)
1 Carol (Biology, Chemistry, Math) (2020, 2020, 2020) (8, 3, 10)

Indexed table with data from table 5.1 assuming student as index.
The column 𝑠 is the split indicator and only rows with 𝑠 = 1 are
available to us.

Now consider an arbitrary collection mechanism that considers stu-


dent as the index of the table. Imposing this restriction would make the data collection process more difficult, but it would guarantee that we have all infor-
mation about each student. Table 5.3 shows the table we have in hand.
As before, the column 𝑠 is the split indicator and only rows with 𝑠 = 1
are available to us.
Our conclusions may change again:

1. We can safely assume that Alice has never taken the Biology class,
as Biology ∉ 𝑐(Alice, subject).
2. There is no information about Bob’s grades, so we can not affirm
nor deny anything about his grades.
3. We can safely assume that Carol has only taken classes in 2020, as
𝑐(Carol, year) contains only values with 2020.

It is straightforward to see that the fewer index columns we have,


the more information we have about the present entities. Also, it is
clear how important the assumptions on the index columns are to the
conclusions we can draw from the data. Consequently, split-invariant
operations can preserve valid conclusions about the data even when in-
formation is missing3.

5.2 Data handling pipelines


In the literature and in software documentation, you will find a variety
of terms used to describe data handling operations4. They often refer to
the same or similar operations, but the terminology can be confusing.
In this section, I present a summary of these operations mostly based
on Wickham, Çetinkaya-Rundel, and Grolemund (2023) definitions5.
During the preparation of data for our project, we will need to per-
form a set of operations on possibly multiple datasets. These operations
are organized in a pipeline, where the outputs of one operation are the
inputs of the next one. Operations are extensively parameterized, for
instance, most of them can use predicates to define the groups, arrange-
ments, or conditions under which they should be applied.
In fig. 5.1, we show an example of a data handling pipeline. The pipe-
line starts with two source datasets, Source 1 and Source 2. The datasets
are processed by a set of operations, 𝑓1 , 𝑓2 , 𝑓3 , 𝑓4 , 𝑓5 , and the output is a
single dataset, Data. Our goal at the data tasks — see section 3.5.3 —
is to create a dataset that is representative of the observational unit we
are interested in. Representative here means that the dataset is tidy6
and that the priors, that is, the distribution of the data, are faithful to the real distribution of the phenomenon.
A pipeline is more flexible than a chain of operations because it can
handle more complex structures, where different branches (forks) of
processing occur simultaneously, and then come together (merges) later
3Absence can be due to incomplete data collection or artificial splitting for validation;
consult chapter 8.
4The terminology “data handling” itself is not universal. Some authors and libraries
call it “data manipulation”, “data wrangling”, “data shaping”, or “data engineering”. I use
the term “data handling” because it seems more generic. Also, it avoids confusion with
the term “data manipulation” which has a negative connotation in some contexts.
5Which they call verbs.
6Remember that our definition of tidiness depends on the observational unit. That
means, in practice, that if the original data sources are in a observational unit different
from the one we are interested in, after joining them, the connecting variables may need
to be removed to eliminate transitive dependencies. Consult sections 4.4.1 and 4.4.2.

Figure 5.1: Example of data handling pipeline.

[Diagram: Source 1 → 𝑓1 → 𝑓2 → 𝑓5 → Data; 𝑓1 → 𝑓4; Source 2 → 𝑓3 → 𝑓4 → 𝑓5.]

A data handling pipeline is a set of operations that transform a


dataset into another dataset. We can have more than one source
dataset and the output is a single dataset where each row repre-
sents a sample in the observational unit we are interested in.

in the workflow. For instance, the output of 𝑓1 is the input of 𝑓2 and 𝑓4


(fork), and 𝑓5 has as input the outputs of 𝑓2 and 𝑓4 (merge).
Pipelines are great conceptual tools to organize the data handling
process. They allow for the separation of concerns, where each opera-
tion is responsible for a single task. Also, declaring the whole pipeline at
once allows for the optimization of the operations and the use of paral-
lel processing. This is important when dealing with large datasets. The
declarative approach, as opposed to the imperative one, makes it easier
to reason about and maintain the code7.
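
As a loose illustration, a pipeline like the one in fig. 5.1 is often written as a single declarative chain. The sketch below uses pandas with hypothetical sources and columns; each chained call plays the role of one operation 𝑓𝑖.

    import pandas as pd

    source1 = pd.DataFrame({"id": [1, 2, 3], "qty": [2, 0, 5]})
    source2 = pd.DataFrame({"id": [1, 2, 3], "price": [10.0, 7.5, 3.0]})

    data = (
        source1
        .merge(source2, on="id", how="left")            # merge the two source branches
        .assign(total=lambda d: d["qty"] * d["price"])  # derive a new column
        .query("total > 0")                             # keep relevant rows only
    )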

5.3 Split-invariant operations


In this section, I present a set of operations that are split-invariant. One
can safely apply these operations to the data without worrying about
biasing the dataset.
For each operation, we discuss its application on some tidying issues
presented in section 4.3.1. The issues I address here8:
• Headers are values, not variable names;
7Tidyverse and Polars are examples of libraries that use a declarative approach to data
handling.
8The issue of multiple types of observational units stored in the same table is better
dealt with by database normalization. More on this subject is discussed in section 5.4.1.

• Multiple variables are stored in one column;


• Variables are stored in both rows and columns;
• A single observational unit is stored in multiple tables.

5.3.1 Tagged splitting and binding


We saw that one trivial, yet important, operation is to bind datasets. This
is the process of combining two or more datasets into a single dataset.
In order to make the operation reversible, we can parametrize it with a
split column that indicates the source of each row.

Definition 5.8: (Tagged bind operation)

Given two or more disjoint tables 𝑇𝑖 = (𝐾, 𝐻, 𝑐 𝑖 ), 𝑖 = 0, 1, … , the


tagged bind operation creates a new table 𝑇 = (𝐾, 𝐻 ∪ {𝑠}, 𝑐) that
contains all the rows of tables 𝑇𝑖 . The split column 𝑠 is a new
column that indicates the source of each row.
Mathematically, the tagged bind operation is defined as

bind𝑠 (𝑇0 , 𝑇1 , … ) = 𝑇,

where 𝑐(𝑟, ℎ) = 𝑐 0 (𝑟, ℎ) + 𝑐 1 (𝑟, ℎ) + … if ℎ ∈ 𝐻 and


𝑐(𝑟, 𝑠) = [𝑖]^𝑑 ,

where 𝑖 is the index of the table 𝑇𝑖 that contains the row 𝑟, i.e.
𝑑 = card(𝑟; 𝑐 𝑖 ) > 0.

When binding datasets by rows, the datasets must have the same
columns. In practice, one can assume, if a column is missing, that all
values in that column are missing.
The indication of the source table usually captures some hidden se-
mantics that has split the tables in the first place. For instance, if each
table represents data collected in a different year, one can create a new
column year that contains the year of the data. It is important to pay at-
tention to the semantics of the split column, as it can also contain extra
information.
Consider table 5.4, which contains monthly gas usage data from US
and Brazil residents. From the requirements described in the previous
section, we can safely bind these datasets — as they are disjoint. We

Table 5.4: Gas usage datasets.

US (left):

month gas distance
1 48.7 1170
2 36.7 1100
3 37.8 970
… … …

Brazil (right):

month gas distance
1 143.7 1470
2 156.7 1700
3 170.8 1870
… … …

Monthly gas usage data from US (left) and Brazil (right) residents.

can use as a tag a new column to represent the country. However, an


attentive reader will notice that the unit of measurement of the gas usage
and distance are different in each table: gallons and miles in the US
dataset and liters and kilometers in the Brazil dataset. Ideally, thus, we
should create two other columns to represent the units of measurement.
It is straightforward to see that this operation solves the issue of a
single observational unit being stored in multiple tables described in
section 4.3.1.
The reverse function consists of splitting the dataset using as a pred-
icate the split column.

Definition 5.9: (Tagged split operation)

Let 𝑠 be a non-index column of a table 𝑇 = (𝐾, 𝐻 ∪ {𝑠}, 𝑐) with


𝒟(𝑠) known and finite, and such that 𝑐(𝑟, 𝑠) contains only unique
values. The tagged split operation parametrized by 𝑠 creates dis-
joint tables 𝑇𝑖 = (𝐾, 𝐻, 𝑐 𝑖 ) that contain only the rows 𝑟 for which
𝑐(𝑟, 𝑠) = 𝑖.
Mathematically, the tagged split operation is defined as

split𝑠 (𝑇) = (𝑇0 , 𝑇1 , … ) ,

where 𝑐 𝑖 (𝑟, ℎ) = 𝑐(𝑟, ℎ) if 𝑖 ∈ 𝑐(𝑟, 𝑠) and 𝑐 𝑖 (𝑟, ℎ) = () otherwise.

Note that the tagged split is split-invariant by definition, since we as-


sume that the nested rows of the input table 𝑇 contain only one value

for column 𝑠 for all rows9. Failing to meet this assumption can lead to a
biased split. Also, it is good practice to keep the column 𝑠 in
the output tables to preserve information about the source of the rows.
In terms of storage, smart strategies can be used to avoid the unneces-
sary repetition of the same value in column 𝑠.
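
A possible pandas rendering of tagged binding and splitting for the gas usage example follows; the country tag and the unit columns are assumptions made here for illustration.

    import pandas as pd

    us = pd.DataFrame({"month": [1, 2, 3], "gas": [48.7, 36.7, 37.8],
                       "distance": [1170, 1100, 970]})
    br = pd.DataFrame({"month": [1, 2, 3], "gas": [143.7, 156.7, 170.8],
                       "distance": [1470, 1700, 1870]})

    # tagged bind: record the source of each row and its units of measurement
    tagged = pd.concat(
        [us.assign(country="US", gas_unit="gal", dist_unit="mi"),
         br.assign(country="BR", gas_unit="L", dist_unit="km")],
        ignore_index=True,
    )

    # tagged split: recover one table per value of the tag column
    parts = {c: g.drop(columns="country") for c, g in tagged.groupby("country")}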

5.3.2 Pivoting
Another important operation is pivoting datasets. There are two types
of pivoting: long-to-wide and wide-to-long. These operations are re-
versible and are the inverse of each other.
Pivoting long-to-wide requires a name column — whose discrete
and finite possible values will become the names of the new columns
— and a value column — whose values will be spread across the rows.
Other than these columns, all remaining columns must be indexes.

Definition 5.10: (Pivot long-to-wide operation)

Let 𝑇 = (𝐾 ∪ {name}, {value}, 𝑐). The pivot long-to-wide operation


is defined as
pivotname (𝑇) = 𝑇 ′ ,
where 𝑇 ′ = (𝐾, 𝒟(name) , 𝑐′ ) and

𝑐′ (𝑟, ℎ) = 𝑐 (𝑟 + (ℎ), value) ,

for all valid row 𝑟 and ℎ ∈ 𝒟(name).

Note however that the operation only works if card(𝑟 + (ℎ); 𝑐) is con-
stant for all ℎ ∈ 𝒟(name). If this is not the case, one must aggregate
the rows before applying the pivot operation. This is discussed in sec-
tion 5.3.7.
Pivoting wide-to-long10 is the reverse operation. One must specify
all the columns whose names are the values of the previously called
“name column.” The values of these columns will be gathered into a
new column. As before, all remaining columns are indexes.

9We consider a slightly different definition of split invariance here, where the binding
operation is applied to each element of the output of the split operation.
10Also known as unpivot.

Definition 5.11: (Pivot wide-to-long operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table. The pivot wide-to-long operation is


defined as
pivot⁻¹ (𝑇) = 𝑇 ′ ,
where 𝑇 ′ = (𝐾 ∪ {name}, {value}, 𝑐′ ), 𝒟(name) = 𝐻 and

𝑐′ ((𝑟, ℎ), value) = 𝑐(𝑟, ℎ),

for all valid row 𝑟 and ℎ ∈ 𝐻.

In practical applications, where not all remaining columns are in-


dexes, one must aggregate rows or drop extra non-indexed columns be-
forehand. This is discussed in sections 5.3.4 and 5.3.7.

Table 5.5: Pivoting example.

Left table (long format):

city year qty.
A 2019 1
A 2020 2
A 2021 3
B 2019 4
B 2020 5
B 2021 6

Right table (wide format):

city 2019 2020 2021
A 1 2 3
B 4 5 6

The left-hand-side table is in the long format and the right-hand-


side table is in the wide format.

Table 5.5 shows an example of pivoting. Here, we can consider city


and year as the index columns. The left-hand-side table is in the long
format and the right-hand-side table is in the wide format. Using the
pivot long-to-wide operation with year as the name column and qty. as
the value column, we can obtain the right-hand-side table. The reverse
operation will give us the left-hand-side table.
To show that the pivot operation is split-invariant, one can see that,
given 𝑇0 = (𝐾, 𝐻, 𝑐 0 ) and 𝑇1 = (𝐾, 𝐻, 𝑐 1 ) disjoint tables,

pivotname (𝑇0 ) + pivotname (𝑇1 ) = (𝐾, 𝒟(name) , 𝑐′0 ) + (𝐾, 𝒟(name) , 𝑐′1 ),

where 𝑐′𝑖 (𝑟, ℎ) = 𝑐 𝑖 (𝑟 + (ℎ), value). However, by the disjoint property of


the tables, we have that

𝑐 0 (𝑟 + (ℎ), value) + 𝑐 1 (𝑟 + (ℎ), value) = 𝑐(𝑟 + (ℎ), value),

for the table 𝑇 = (𝐾, 𝐻, 𝑐) = 𝑇0 + 𝑇1 . So,

(𝐾, 𝒟(name) , 𝑐′0 ) + (𝐾, 𝒟(name) , 𝑐′1 ) = (𝐾, 𝒟(name) , 𝑐′ ) =


pivotname (𝑇) ,

where 𝑐′ (𝑟, ℎ) = 𝑐(𝑟 + (ℎ), value).


Similarly, the reverse operation is also split-invariant.
Using the pivot operation, we can solve the issues of headers being
values, not variable names and variables being stored in both rows and
columns. In the first case, we can pivot the table to have the headers as
the domain of a new index (name column). In the second case, we have
to pivot both long-to-wide and wide-to-long to solve the issue.
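
For reference, the two pivot directions of table 5.5 can be sketched with pandas; pivot plays the role of long-to-wide and melt the role of wide-to-long.

    import pandas as pd

    long = pd.DataFrame({"city": ["A", "A", "A", "B", "B", "B"],
                         "year": [2019, 2020, 2021, 2019, 2020, 2021],
                         "qty":  [1, 2, 3, 4, 5, 6]})

    # long-to-wide: "year" is the name column, "qty" is the value column
    wide = long.pivot(index="city", columns="year", values="qty").reset_index()

    # wide-to-long: gather the year columns back into name/value pairs
    back = wide.melt(id_vars="city", var_name="year", value_name="qty")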

5.3.3 Joining
Joining is the process of combining two datasets into a single dataset
based on common columns. This is one of the two fundamental opera-
tions in relational algebra. We will see the conditions under which the
operation is split invariant. However, the join operation has some other
risks you should be aware of; consult section 4.2 for more details.
Adapting the definitions of join in our context, we can define it as
follows. For the sake of simplicity, we denote 𝑟[𝑈] as the row 𝑟 restricted
to the index columns in 𝑈, i.e.

𝑟[𝑈] = (𝑘𝑖 ∶ 𝑘𝑖 ∈ 𝒟(𝐾𝑖 ) ∀𝐾𝑖 ∈ 𝑈).

The join of two tables is the operation that returns a new table with
the columns of both tables. Let 𝑈 be the common set of index columns.
For each occurring value of 𝑈 in the first table, the operation will look
for the same value in the second table. If it finds it, it will create a new
row with the columns of both tables. If it does not find it, no row will
be created.
Note that, like in pivoting long-to-wide, one must ensure that the
cardinality of the joined rows is constant for all ℎ ∈ 𝐻 ′ ∪ 𝐻 ″ . If this
is not the case, one must aggregate the rows before applying the join
operation. This is discussed in section 5.3.7.

Definition 5.12: (Join operation)

Let 𝑇 ′ = (𝐾 ′ , 𝐻 ′ , 𝑐′ ) and 𝑇 ″ = (𝐾 ″ , 𝐻 ″ , 𝑐″ ) be two tables such that


𝐾 ′ ∩ 𝐾 ″ ≠ ∅ and 𝐻 ′ ∩ 𝐻 ″ = ∅. The join operation is defined as

join(𝑇 ′ , 𝑇 ″ ) = 𝑇,

where 𝑇 = (𝐾 ′ ∪ 𝐾 ″ , 𝐻 ′ ∪ 𝐻 ″ , 𝑐) and

𝑐(𝑟, ℎ) = ()

if card(𝑟[𝐾 ′ ]; 𝑐′ ) = 0 or card(𝑟[𝐾 ″ ]; 𝑐″ ) = 0, for all ℎ. Otherwise:

𝑐(𝑟, ℎ) = 𝑐′ (𝑟[𝐾 ′ ], ℎ) if ℎ ∈ 𝐻 ′ , and 𝑐(𝑟, ℎ) = 𝑐″ (𝑟[𝐾 ″ ], ℎ) if ℎ ∈ 𝐻 ″ .

Before we discuss whether the join operation is split-invariant11, we


can discuss a variation of the join operation: the left join. The left join
is the same as the join operation, but if the value of 𝑈 is missing in the
second table, the operation will create a new row with the columns of
the first table and missing values for the columns of the second table.
In our context, this operation is a unary operation, where the second
table is a fixed parameter.
The left join operation is split-invariant. To see this, consider two
disjoint tables 𝑇0 = (𝐾, 𝐻, 𝑐 0 ) and 𝑇1 = (𝐾, 𝐻, 𝑐 1 ), and a third table
𝑇 ′ = (𝐾 ′ , 𝐻 ′ , 𝑐′ ) such that 𝐾 ∩ 𝐾 ′ ≠ ∅ and 𝐻 ∩ 𝐻 ′ = ∅. We have that

join𝑇 ′ (𝑇0 ) + join𝑇 ′ (𝑇1 ) = 𝑇0′ + 𝑇1′ =


(𝐾 ∪ 𝐾 ′ , 𝐻 ∪ 𝐻 ′ , 𝑐′0 ) + (𝐾 ∪ 𝐾 ′ , 𝐻 ∪ 𝐻 ′ , 𝑐′1 ),
where the meaning of each term is clear from the definition 5.13. It is
straightforward to see that rows in 𝑇0′ and 𝑇1′ are disjoint, since at least
part of the indices in 𝐾 ∪ 𝐾 ′ are different between them.
Moreover,
join𝑇 ′ (𝑇0 + 𝑇1 ) = (𝐾 ∪ 𝐾 ′ , 𝐻 ∪ 𝐻 ′ , 𝑐′ )
with 𝑐′ (𝑟, ℎ) = () only if both card(𝑟[𝐾]; 𝑐 0 ) = 0 and card(𝑟[𝐾]; 𝑐 1 ) = 0.
Otherwise, 𝑐′ (𝑟, ℎ) = 𝑐 0 (𝑟[𝐾], ℎ) + 𝑐 1 (𝑟[𝐾], ℎ) if ℎ ∈ 𝐻 and 𝑐′ (𝑟, ℎ) =
𝑐 0 (𝑟[𝐾], ℎ) if ℎ ∈ 𝐻 ′ .
11Note that up to this point, we have defined this property only for unary operations.

Definition 5.13: (Left join operation)

Let 𝑇 ′ = (𝐾 ′ , 𝐻 ′ , 𝑐′ ) and 𝑇 ″ = (𝐾 ″ , 𝐻 ″ , 𝑐″ ) be two tables such that


𝐾 ′ ∩ 𝐾 ″ ≠ ∅ and 𝐻 ′ ∩ 𝐻 ″ = ∅. The left join operation is defined
as
join(𝑇 ′ ; 𝑇 ″ ) = join𝑇 ″ (𝑇 ′ ) = 𝑇,
where 𝑇 = (𝐾 ′ ∪ 𝐾 ″ , 𝐻 ′ ∪ 𝐻 ″ , 𝑐) and

𝑐(𝑟, ℎ) = ()

if card(𝑟[𝐾 ′ ]; 𝑐′ ) = 0 for all ℎ. Otherwise:

𝑐(𝑟, ℎ) = 𝑐′ (𝑟[𝐾 ′ ], ℎ) if ℎ ∈ 𝐻 ′ , and 𝑐(𝑟, ℎ) = 𝑐″ (𝑟[𝐾 ″ ], ℎ) if ℎ ∈ 𝐻 ″ .

Thus,
join𝑇 ′ (𝑇0 + 𝑇1 ) = join𝑇 ′ (𝑇0 ) + join𝑇 ′ (𝑇1 ).

Our conclusion is that the left join operation given a fixed table is
split-invariant. So we can safely use it to join tables without worrying
about biasing the dataset once we fix the second table.
I conjecture that the (inner) join operation shares similar properties
but it is not as safe; nonetheless, a clear definition of split invariance for
binary operations is needed. This is left as a thought exercise for the
reader. Notice that the traditional join has the ability to “erase” rows
from any of the tables involved in the operation. This is a potential
source of bias in the data. This further emphasizes the importance of
understanding the semantics of the data schema before joining tables
— consult section 4.2.
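
A small pandas sketch with hypothetical data contrasts the two variants: the left join keeps every row of the fixed left table, while the inner join can silently drop rows whose keys have no match.

    import pandas as pd

    students = pd.DataFrame({"student": ["Alice", "Bob", "Carol"],
                             "year": [2019, 2019, 2020]})
    grades = pd.DataFrame({"student": ["Alice", "Bob"], "grade": [8.0, 7.0]})

    left_joined = students.merge(grades, on="student", how="left")    # Carol kept, grade is NaN
    inner_joined = students.merge(grades, on="student", how="inner")  # Carol silently dropped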

5.3.4 Selecting
Selecting is the process of choosing a subset of non-index columns from
a dataset. The remaining columns are discarded. Rows of the table re-
main unchanged.
Although very simple, the selection operation is useful for removing
columns that are not relevant to the analysis. Also, it might be needed
before other operations, such as pivoting, to avoid unnecessary columns
(wide-to-long) and to keep only the value column (long-to-wide).

Definition 5.14: (Selection operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and 𝐻 ′ ⊆ 𝐻. The selection operation


is defined as
select𝐻 ′ (𝑇) = 𝑇 ′ ,
where 𝑇 ′ = (𝐾, 𝐻 ′ , 𝑐).

Sometimes, it is useful to select columns based on a function of the


column properties. In other words, the selection operation can be pa-
rameterized by a predicate. The predicate is a function that returns a
logical value given the column.

Definition 5.15: (Predicate selection operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and 𝑃 ∶ 𝐻 → {0, 1} be a predicate. The


predicate selection operation is defined as

select𝑃 (𝑇) = 𝑇 ′ ,

where 𝑇 ′ = (𝐾, 𝐻 ′ , 𝑐) and 𝐻 ′ = {ℎ ∈ 𝐻 ∶ 𝑃(ℎ) = 1}.

It is trivial to see that, if 𝑃 does not depend on the values of the


columns (i.e., has no access to 𝑐), the predicate selection operation is
split-invariant. This is because the operation does not change the rows
of the table nor does it depend on the values of the rows.
One example of the use of the predicate selection operation is to keep
columns whose values are in a specific domain. For instance, to keep
only columns that contain real numbers, we choose 𝑃(ℎ) = 1 if 𝒟(ℎ) =
ℝ, and 𝑃(ℎ) = 0 otherwise.
The case where the predicate depends on the values of the columns
is discussed in section 5.4.2.
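
In pandas terms, with hypothetical columns, explicit selection and predicate selection on a column property look as follows; select_dtypes is a predicate on the column type, not on the values.

    import pandas as pd

    df = pd.DataFrame({"name": ["Ann", "Bo"], "age": [23, 31], "income": [50.0, 62.5]})

    subset = df[["age", "income"]]                 # select by explicit column names
    numeric = df.select_dtypes(include="number")   # predicate selection: keep numeric columns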

5.3.5 Filtering
Filtering is the process of selecting a subset of rows from a dataset based
on a predicate.
A predicate can be a combination of other predicates using logical
operators, such as logical disjunction (or) or logical conjunction (and).

In the general case, predicates need to be robust enough to deal with


value matrices of any size and those that contain missing values.
After filtering, the dataset will contain only the rows that satisfy the
predicate. Columns remain unchanged.
In its simplest form, we can assume that card(𝑟) ≤ 1 for all 𝑟 and that
the predicates are applied to each row independently. In this case, the
value matrix 𝑉(𝑟) is just a tuple with a single value for each non-index
column.
Without loss of generality, we can assume that predicates are com-
bined using logical disjunction (or)12.
For instance, the predicate age > 18 will select all rows where the
value in the age column is greater than 18. Keeping each row indepen-
dent, we can also generalize predicates to deal with the values of the
indexes as well.

Definition 5.16: (Filtering operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and 𝑃1 , … 𝑃𝑛 be predicates. The filtering


operation is defined as

filter𝑃1 ,…,𝑃𝑛 (𝑇) = 𝑇 ′ ,

where 𝑇 ′ = (𝐾, 𝐻, 𝑐′ ) and


𝑐′ (𝑟, ℎ) = 𝑐(𝑟, ℎ) if 𝑃1 (𝑟, 𝑉 (𝑟)) ∨ ⋯ ∨ 𝑃𝑛 (𝑟, 𝑉 (𝑟)) = 1, and 𝑐′ (𝑟, ℎ) = () otherwise,

where predicate

𝑃𝑖 ∶ ⨉_{𝐾𝑖 ∈ 𝐾} 𝐾𝑖 × ⨉_{ℎ ∈ 𝐻} (𝒟(ℎ) ∪ {?}) → {0, 1}

is applied to the value matrix 𝑉(𝑟) of the row 𝑟.

It is also trivial to see that the filtering operation is split-invariant,


even in its generalized form where the value matrix has many rows. This

12The reason is that sequential application of filtering is equivalent to combining the


predicates using logical conjunction (and).

property comes from the fact that rows are treated independently. More
complex cases are discussed in section 5.4.2.
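
A minimal pandas sketch of row filtering with a hypothetical column; each row is evaluated independently, which is what keeps the operation split-invariant.

    import pandas as pd

    df = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "age": [23, 17, 31]})

    adults = df[df["age"] > 18]    # boolean-mask filtering
    adults = df.query("age > 18")  # equivalent, using a query string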

5.3.6 Mutating
Mutating is the process of creating new columns in a table. The oper-
ation is reversible, as the original columns are kept. The new columns
are added to the dataset.
The values in the new column are determined by a function of the
rows. The expression is a function that returns a vector of values given
the values in the other columns. Similarly to filtering, in its simplest
form, we can assume that card(𝑟) ≤ 1 for all 𝑟 and that the predicates
are applied to each row independently.

Definition 5.17: (Mutation operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and 𝑓 be a transformation function.


The mutating operation is defined as

mutate𝑓 (𝑇) = 𝑇 ′ ,

where 𝑇 ′ = (𝐾, 𝐻 ∪ {ℎ′ }, 𝑐′ ) and

𝑐′ (𝑟, ℎ) = 𝑐(𝑟, ℎ) if ℎ ∈ 𝐻, and 𝑐′ (𝑟, ℎ) = 𝑓(𝑟, 𝑉 (𝑟)) if ℎ = ℎ′ ,

where the function

𝑓 ∶ ⨉_{𝐾𝑖 ∈ 𝐾} 𝐾𝑖 × ⨉_{ℎ ∈ 𝐻} (𝒟(ℎ) ∪ {?}) → 𝒟(ℎ′ ) ∪ {?}

is applied to the value matrix 𝑉(𝑟) of the row 𝑟.

The expression can be a simple function, such as y = x + 1 , or a


more complex function, such as

y = ifelse(x > 0, 1, 0) .

Here, x and y are the names of an existing and the new column, respec-
tively. The ifelse(a, b, c) function is a conditional function that
returns b if the condition a is true and c otherwise.

This function solves the issue of multiple variables stored in one col-
umn described in section 4.3.1.
As with filtering, the mutating operation is split-invariant even if
card(𝑟) > 1 for any 𝑟13. This is because the operation is applied to each
row independently. In this general case, an extra requirement is that the
function 𝑓 must return tuples with the same cardinality as the row it is
applied to.
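
The two example expressions above translate to pandas roughly as in the sketch below; assign adds the new columns without touching the existing ones.

    import pandas as pd

    df = pd.DataFrame({"x": [-1.0, 0.5, 2.0]})

    df = df.assign(y=lambda d: d["x"] + 1,                   # y = x + 1
                   flag=lambda d: (d["x"] > 0).astype(int))  # ifelse(x > 0, 1, 0)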

5.3.7 Aggregating
Many times, it is easier to reason about the table when all rows have
cardinality 1. Aggregation ensures that the table has this property.

Definition 5.18: (Aggregation operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and 𝑓 be an aggregation function. The


aggregation operation is defined as

aggregate𝑓 (𝑇) = 𝑇 ′ ,

where 𝑇 ′ = (𝐾, 𝐻, 𝑐′ ) and

𝑐′ (𝑟, ℎ) = 𝑓(𝑟, 𝑉 (𝑟))[ℎ],

where function 𝑓 is applied to the value matrix 𝑉 (𝑟) of the row 𝑟


and it has an image

(𝒟(ℎ) ∪ {?} ∶ ℎ ∈ 𝐻) ,

independently of the input size. The notation 𝑣[ℎ] refers to the


value corresponding to the column ℎ in the output tuple.

As with mutation, aggregation is split-invariant as it treats each row


independently, even if the function 𝑓 considers order semantics of the
values in the matrix.
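
In a concrete table where repeated index values play the role of nested rows, aggregation amounts to a grouped summary. The sketch below uses pandas with hypothetical aggregation choices: the latest year and the mean grade per (student, subject).

    import pandas as pd

    df = pd.DataFrame({"student": ["Bob", "Bob", "Bob"],
                       "subject": ["Chemistry", "Chemistry", "Math"],
                       "year": [2018, 2019, 2019],
                       "grade": [None, 7.0, 9.0]})

    # collapse repeated (student, subject) rows to cardinality one
    agg = (df.groupby(["student", "subject"], as_index=False)
             .agg(year=("year", "max"), grade=("grade", "mean")))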

5.3.8 Ungrouping
We discussed that the fewer index columns a table has — assuming we
guarantee that all information about that entity is present — the safer
13It just changes the input space of function 𝑓.

it is to infer conclusions from the data. Thus, reducing the number of


indices must be done very carefully — more on that in section 5.4.1.
On the other hand, sometimes it is useful to increase the number of
index columns. For instance, pivoting long-to-wide requires all columns
except one to be indexes. The operation that transforms some of the
columns in the table into indexes is called ungrouping. The reason for
the name is that the operation decreases the cardinality of rows by cre-
ating new rows, effectively ungrouping the values.

Definition 5.19: (Ungrouping operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and ℎ′ ∈ 𝐻 such that 𝒟(ℎ′ ) is known


and finite. The ungrouping operation is defined as

ungroupℎ′ (𝑇) = 𝑇 ′ ,

where 𝑇 ′ = (𝐾 ∪ {ℎ′ }, 𝐻 ∖ {ℎ′ }, 𝑐′ ), and

𝑐′ (𝑟 + 𝑟′ , ℎ) = (𝑣 𝑖,ℎ ∶ 𝑖),

where 𝑟 refers to values of the indices in 𝐾, 𝑟′ refers to the value of


the new index ℎ′ , and 𝑣 𝑖,ℎ is the value of the column ℎ in any 𝑖-th
nested row of the value matrix 𝑉 (𝑟; 𝑇) in the original table such
that
𝑣 𝑖,ℎ′ = 𝑟′ .

Note that the operation requires that the column ℎ′ has no missing
values.
Table 5.6 shows an example of ungrouping. In the top table, there
are two rows, one with cardinality 4 and the other with cardinality 3.
The column year is ungrouped, creating new rows for each value in the
nested row. The bottom table is the result of ungrouping the column
year. Although there were 7 nested rows in the original table, the bottom
table has 6 rows — the number of nested rows is preserved however. The
reason is that the row (A, 2020) has cardinality 2.
The ungrouping operation is split-invariant. To see this, consider
two disjoint tables 𝑇0 = (𝐾, 𝐻, 𝑐 0 ) and 𝑇1 = (𝐾, 𝐻, 𝑐 1 ), we have

ungroupℎ′ (𝑇0 ) + ungroupℎ′ (𝑇1 ) =


(𝐾 ∪ {ℎ′ }, 𝐻 ∖ {ℎ′ }, 𝑐′0 ) + (𝐾 ∪ {ℎ′ }, 𝐻 ∖ {ℎ′ }, 𝑐′1 ) ,

Table 5.6: Ungrouping example.

city year qty.


A (2019, 2020, 2020, 2021) (1, 2, 3, 4)
B (2019, 2020, 2021) (5, 6, 7)

city year qty.


A 2019 1
A 2020 (2, 3)
A 2021 4
B 2019 5
B 2020 6
B 2021 7

The index of the top table is the column city. The bottom table is
the result of ungrouping the column year.

where 𝑐𝑗′ (𝑟 + 𝑟′ , ℎ) = (𝑣 𝑖,ℎ ∶ 𝑖 such that 𝑣 𝑖,ℎ′ = 𝑟′ ). Since the tables are
disjoint, the rows of the output tables are also disjoint. In other words,
for any 𝑟, either card(𝑟 + 𝑟′ ; 𝑐 0 ) = 0 or card(𝑟 + 𝑟′ ; 𝑐 1 ) = 0 indepen-
dently of the value of 𝑟′ . The reason is that there is no possible 𝑣 𝑖,ℎ′ = 𝑟′
if 𝑟 is not present in the table.
Then,

(𝐾 ∪ {ℎ′ }, 𝐻 ∖ {ℎ′ }, 𝑐′0 ) + (𝐾 ∪ {ℎ′ }, 𝐻 ∖ {ℎ′ }, 𝑐′1 ) =


(𝐾 ∪ {ℎ′ }, 𝐻 ∖ {ℎ′ }, 𝑐′ ) ,
where 𝑐′ (𝑟 + 𝑟′ , ℎ) = 𝑐′0 (𝑟 + 𝑟′ , ℎ) + 𝑐′1 (𝑟 + 𝑟′ , ℎ).
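
The ungrouping of table 5.6 corresponds to explode in pandas (multi-column explode requires pandas 1.3 or later); a sketch:

    import pandas as pd

    df = pd.DataFrame({"city": ["A", "B"],
                       "year": [[2019, 2020, 2020, 2021], [2019, 2020, 2021]],
                       "qty":  [[1, 2, 3, 4], [5, 6, 7]]})

    long = df.explode(["year", "qty"])  # one output row per nested row
    # Note: (A, 2020) now appears twice; regrouping by (city, year) would
    # reproduce the nested row (2, 3) shown in the bottom of table 5.6.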

5.4 Other operations


We saw that, under reasonable premises, split-invariant operations are
safe to use in the context of tidying and data integration. However, data
handling does not happen only in the context of tidying and integrating
datasets14. It is also used in tasks like data exploration and data prepro-
cessing. In these cases, other operations are needed.
14And, sometimes, we may need to use other transformations even for these tasks.

In this section, we discuss some of these operations. Instead of fo-


cusing on the mathematical definitions, we will discuss the semantics
of the operations and some of their properties.

5.4.1 Projecting or grouping


Projection is one of the two fundamental operations in relational alge-
bra — consult section 4.2 for more details. In database normalization
theory, tables — called relations — are slightly different from the tables
we are discussing here. The major difference is that they are sets of tu-
ples, which means that each tuple is unique. In our scenario, this is
similar to what we call rows represented by the possible values of the
index columns of the table.
Adapting the definitions of projection to our context, we can define
it as follows.

Definition 5.20: (Projection operation)

Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table and 𝐾 ′ ⊂ 𝐾 a subset of the columns.


The projection operation is defined as

project𝐾 ′ (𝑇) = 𝑇 ′ ,

where 𝑇 ′ = (𝐾 ′ , 𝐻 ∪ (𝐾 ∖ 𝐾 ′ ), 𝑐′ ) and

𝑐′ (𝑟, ℎ) = ∑_{𝑟′} 𝑐(𝑟 + 𝑟′ , ℎ) if ℎ ∈ 𝐻, and 𝑐′ (𝑟, ℎ) = ∑_{𝑘′ ∈ 𝒟(ℎ)} 𝑘′ if ℎ ∈ 𝐾 ∖ 𝐾 ′ ,
for all valid row 𝑟 considering the indices 𝐾 ′ and for all tuples
𝑟′ = (𝑘𝑖 ∶ 𝑖) such that 𝑘𝑖 ∈ 𝒟(𝐾𝑖 ) for all 𝐾𝑖 ∈ 𝐾 ∖ 𝐾 ′ .

We can see that projection for our tables is a little more complex
than the usual projection in relational algebra. Consider the example we
discussed in section 4.2 as well, where we have a table with the columns
student, subject, year, and grade.
Table 5.7 (top) shows that table adapted for our definitions. Suppose
we want to project the table to have only the entity course. Now each
row (bottom table) represents a course. The column student is not an
index column anymore, and the values in the column are exhaustive

Table 5.7: Student grade table.

student course course credits grade


Alice Math 4 A
Alice Physics 3 B
Bob Math 4 B
Bob Physics 3 A
Carol Math 4 C

course student course credits grade


Math (Alice, Bob, Carol) (4, 4, 4) (A, B, C)
Physics (Alice, Bob, Carol) (3, 3, ?) (B, A, ?)

(Top) An example of a table of students and their grades in courses.


The columns student and course are the index columns. (Bottom)
The same table projected into the entity course.

and unique, i.e., the whole set 𝒟(student) is represented in the column
for each row.
Thus, projection is a very useful operation when we want to change
the observational unit of the data, particularly to the entity represented
by a subset of the index columns. Semantically, projection groups the
rows by the values.
It is easy to see that the projection operation is not split-invariant.
Consider the following example. If we split the top table in table 5.7
so the first row (Alice, Math) is in one table and the second row (Alice,
Physics) is in another, the bind operation between the projection into the
entity student of these two tables is not allowed. The reason is that the
row (Alice) will be present in both tables, violating the disjoint property
of the tables.
The consequence is that a poor architecture of the data schema can
lead to incorrect conclusions in the face of missing information (due
to split). This is one of the reasons why database normalization is so
important. The usage of parts of the tables without fully denormalizing
them is a bad practice that can lead to spurious information.
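
Changing the observational unit of table 5.7 to the entity course can be sketched in pandas by grouping on the new index and collecting the remaining columns (list aggregation is available in recent pandas versions).

    import pandas as pd

    grades = pd.DataFrame({"student": ["Alice", "Alice", "Bob", "Bob", "Carol"],
                           "course":  ["Math", "Physics", "Math", "Physics", "Math"],
                           "credits": [4, 3, 4, 3, 4],
                           "grade":   ["A", "B", "B", "A", "C"]})

    by_course = (grades.groupby("course", as_index=False)
                       .agg({"student": list, "credits": list, "grade": list}))

Unlike the formal definition, this sketch does not pad the absent combinations (Carol has no Physics row), which is exactly where missing information can bias the projected table.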

5.4.2 Grouped and arranged operations


In practice, when we need more flexibility in the kind of operations we
can perform — for instance, in data preprocessing —, we use variations
of some operations in section 5.3 that are not split-invariant. These op-
erations are parametrized by the groups and the order of the rows.
We use the following terminology to refer to the data handling pa-
rameters:

• Aggregation function: a function that returns a single value


given a tuple of values; and
• Window function: a function that returns a tuple of values given
a tuple of values of the same size;

where the order of the values may play a role in the result of the function.
Examples of aggregation functions are sum (summation), mean
(average value), count (number of elements), and first (first ele-
ment of the tuple). Examples of window functions are cumsum (cumu-
lative sum), lag (a tuple with the previous values), and rank (the
rank of the value in the tuple given some ordering).
Here, we consider that the rows of the table have cardinality equal to
one — as discussed before, one can use ungrouping (section 5.3.8) and
aggregation (section 5.3.7) to ensure this property. Without loss of gen-
erality, we also assume that there is only one index, called row number,
such that each row has a unique value for this index15.

Mutating with groups and order


We can take as an example the operation of creating a new column. To
create a new column, we use an expression that depends on the values
of the other columns. If the expression depends on an aggregation or
window function, one must specify the groups and/or the order of the
rows.
For example, the expression

y = cumsum(x) group by category sort by date

will create a new column y with the cumulative sum of the x column
for each category given the order of the rows defined by the date
column.
15Since the operations we describe here are not split-invariant, we can assume a previ-
ous projection of the data, see section 5.4.1.

Figure 5.2: Mutating with groups and order.

[Diagram: Source → Group → Mutate → Ungroup → Select; the selected result and the original Source are joined to produce Result.]

The mutating operation with groups and order is implemented as


a pipeline.

This operation can be implemented as a pipeline. First, we group


(project) the table by the category column. Then, we sort all the tu-
ples by the date column. Finally, we apply the cumsum function to
the x column and ungroup everything. In the result table, we select
only columns category and y . Going back to the original table, we
can left-join the original table with the result table using the category
column. Now, we have the new column y in the original table. This is
shown in fig. 5.2.
Note that the trivial group would be the whole table, i.e., a column
with a single value. Thus, the grouping is always required, which makes
the operation not split-invariant. In practical applications, I suggest be-
ing as explicit as possible about the groups and order criteria. This helps
to avoid errors and to make the code more readable.
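
The expression above translates to pandas roughly as in the sketch below, with hypothetical columns: sorting fixes the order, and the grouped cumulative sum is aligned back to the original rows.

    import pandas as pd

    df = pd.DataFrame({"category": ["a", "a", "b", "b"],
                       "date": ["2020-02", "2020-01", "2020-01", "2020-03"],
                       "x": [1, 2, 3, 4]})

    df = df.sort_values("date")
    df["y"] = df.groupby("category")["x"].cumsum()  # cumsum(x) group by category sort by date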
One important aspect about mutating sorted values is that one can
use nontrivial strategies — from completing missing values to rolling
windows — to deal with implicit missing values. This is a powerful tool
to deal with time series data. For example, one can use the lead func-
tion to create a new column with the next value of the x column sorted
by year .
If data contains both x = (1, 3) and year = (2019, 2021) ,

the calculation of the lead will result in

x = (1, ?, 3) , year = (2019, 2020, 2021) , and


lead = (?, 3, ?) ,

since the missing value for the year 2020 was implicit.

Filtering with groups and order


It is easy to see that to filter rows of the table taking into account groups
and order, we just need to create a new column with the expression that
defines the predicate and then filter the rows based on this column. For
instance, the predicate age > mean(age) group by country will
select the rows where the value in the age column is greater than the
mean of the age for each country . Another example is the predicate
cumsum(price) < 100 sort by date , which selects the rows that
satisfy the condition that the cumulative sum of the price column is
less than 100 given the order of the rows defined by the date column.

5.5 An algebra for data handling


In recent years, some researchers have made an effort to create a formal
algebra for data transformations. The idea is to define a set of operations
that can be combined to create complex transformations and describe
their main properties.
Note that statistical data handling differs from relational algebra, be-
cause the former focuses on transformations and the latter on informa-
tion retrieval.
Song, Jagadish, and Alter (2021)16, for example, propose a formal
paradigm for statistical data transformation. They present a data model,
an algebra, and a formal language. Their goal is to create a standard for
statistical data transformation that can be used by different statistical
software.
However, in my opinion, the major deficiency of their work is that
they mostly try to “reverse engineer” the operations that are commonly
used in statistical software. This is useful for the translation of code
16J. Song, H. V. Jagadish, and G. Alter (2021). “SDTA: An Algebra for Statistical Data
Transformation”. In: Proc. of 33rd International Conference on Scientific and Statistical
Database Management (SSDBM 2021). Tampa, FL, USA: Association for Computing Ma-
chinery, p. 12. doi: 10.1145/3468791.3468811.

between different software, but it is not productive to advance the theo-


retical understanding of statistical transformations.
If one ought to tackle the challenge of formally expressing statisti-
cal transformations, I think one should start from the basic operations.
By basic operations, I mean that they are either irreducible — i.e., they
cannot be expressed as a sequence of other operations — or they are so
common and intuitive that they are worth being considered basic.
In this chapter, I try to shed some light on what could be a start for
a formal algebra for general data handling. I present a set of operations
and discuss their properties. I also present the novel concept of split in-
variance, which is a property that I think is important for the operations
in the algebra.
For future directions, I suggest that one should try to express com-
pleteness in the data handling context. Drawing a parallel with compu-
tation theory, one could define a computational model for data handling
and try to prove that the operations in the algebra are complete in the
sense that they can express any transformation that can be expressed in
the model. It would resemble a formal language for data handling that
is “Turing complete.”
A formal “complete” algebra for data handling would be a power-
ful tool for the development of new software and the translation of code
between different software. It would also benefit performance optimiza-
tions and pave the way for semantic analysis of data transformations. It
would be a step as significant as C was from assembly language!
6 Learning from data
To understand God’s thoughts we must study statistics, for
these are the measures of His purpose.
— Florence Nightingale, her diary

As we discussed before, in this book, I focus on the problem of infer-


ring a solution for a predictive task from data. In this chapter, we intro-
duce the basic concepts of the statistical learning theory (SLT), a general
framework for predictive learning tasks.
More specifically, we discuss the inductive learning approach, which
involves deriving general rules from specific observations.
We also formally establish the learning problem, and we define the
two most common predictive tasks: binary data classification and re-
gression estimation. We discuss the optimal solutions for these tasks in
an ideal (although unrealistic) scenario where the distributions of the
data are known.
Moreover, we discuss principles that guide the learning process if the
distribution of the data is unknown. From those principles, we discuss
the properties and limitations of the learning process.
Finally, we realize those concepts for simple linear problems, ex-
plaining two basic algorithms for the learning process: the perceptron
and the maximal margin classifier.


Chapter remarks

Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 The learning problem . . . . . . . . . . . . . . . . . . . . 115
6.2.1 Learning tasks . . . . . . . . . . . . . . . . . . . . 115
6.2.2 A few remarks . . . . . . . . . . . . . . . . . . . . 117
6.3 Optimal solutions . . . . . . . . . . . . . . . . . . . . . . 118
6.3.1 Bayes classifier . . . . . . . . . . . . . . . . . . . . 118
6.3.2 Regression function . . . . . . . . . . . . . . . . . 120
6.4 ERM inductive principle . . . . . . . . . . . . . . . . . . 122
6.4.1 Consistency of the learning process . . . . . . . . 122
6.4.2 Rate of convergence . . . . . . . . . . . . . . . . . 123
6.4.3 VC entropy . . . . . . . . . . . . . . . . . . . . . . 123
6.4.4 Growing function and VC dimension . . . . . . . 124
6.5 SRM inductive principle . . . . . . . . . . . . . . . . . . . 126
6.5.1 Bias invariance trade-off . . . . . . . . . . . . . . 129
6.5.2 Regularization . . . . . . . . . . . . . . . . . . . . 131
6.6 Linear problems . . . . . . . . . . . . . . . . . . . . . . . 132
6.6.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . 132
6.6.2 Maximal margin classifier . . . . . . . . . . . . . 137
6.7 Closing remarks . . . . . . . . . . . . . . . . . . . . . . . 139

Context

• Inductive reasoning is the process of deriving general rules from


specific observations.

Objectives

• Define the learning problem and the common predictive tasks.


• Understand the main principles that guide the learning process.

Takeaways

• Optimal solutions establish how good a solution can possibly be.


• Reducing error is not enough to guarantee a good solution.
• Controlling model complexity is crucial for generalization.

6.1 Introduction
Several problems can be addressed by techniques that utilize data in
some way. Once we focus on one particular problem — inductive learn-
ing —, we need to define the scope of the tasks we are interested in. Let
us start from the broader fields to the more specific ones.
Artificial intelligence (AI) is a very broad field, including not only
the study of algorithms that exhibit intelligent behavior, but also the
study of the behavior of intelligent systems. For instance, it encom-
passes the study of optimization methods, bio-inspired algorithms, ro-
botics, philosophy of mind, and many other topics. We are interested in
the subfield of artificial intelligence that studies algorithms that exhibit
some form of intelligent behavior.
A more specific subfield of AI is machine learning (ML), which stud-
ies algorithms that enable computers to learn and improve their perfor-
mance on a task from experience automatically, without being explicitly
programmed by a human being.
Programming a computer to play chess is a good example of the dif-
ference between traditional AI and ML. In traditional AI, a human pro-
grammer would write a program that contains the rules of chess and the
strategies to play the game. The algorithm might even “search” among
the possible moves to find the best one. In ML, the programmer would
write a program that learns to play chess by playing against itself, against
other programs, or even from watching games played by humans. The
system would learn the rules of chess and the strategies to play the game
by itself.
This field is particularly useful when the task is too complex to be
solved by traditional programming methods or when we do not know
how to solve the task. Among the many tasks that can be addressed by
ML, we can specialize even more.
Predictive learning is the ML paradigm that focuses on making pre-
dictions about outcomes (sometimes about the future) based on histori-
cal data. Predictive tasks involve predicting the value of a target variable
based on the values of one or more input variables1.
Depending on the reasoning behind the learning algorithms, we can
divide the learning field into two main approaches: inductive learning
and transductive learning2.
1Descriptive learning, which is out of the scope of this book, focuses on describing
the relationships between variables in the data without the need for a target variable.
2Transduction is the process of obtaining specific knowledge from specific observa-
tions, and it is not the focus of this book.

Inductive learning involves deriving general rules from specific ob-


servations. The general rules can make predictions about any new in-
stances. Such an approach is exactly what we want to apply in the
project methodology we described in section 3.5: the solution is the gen-
eral rule inferred from the data.

Figure 6.1: Organizational chart of the learning field.

[Diagram: nested fields, from broadest to most specific: artificial intelligence ⊃ machine learning ⊃ predictive learning ⊃ inductive learning.]

Artificial intelligence studies algorithms that exhibit intelligent


behavior and the behavior of intelligent systems. Machine learn-
ing is a subfield of artificial intelligence that studies algorithms
that enable computers to automatically learn from data. Predic-
tive learning, which focuses on making predictions about out-
comes given known input data. Inductive learning is a yet more
specific type of learning that involves deriving general rules from
specific observations.

Figure 6.1 gives us a hierarchical view of the learning field. Alterna-


tives — such as descriptive learning in opposition to predictive learning,
or transductive learning in opposition to inductive learning — are out
of the scope of this book.
Maybe the most general (and useful) framework for predictive learn-
ing is SLT. In this chapter, we will introduce the basic concepts of this
theory and discuss the properties of the main ML methods.

6.2 The learning problem


Consider the set
{(x𝑖 , 𝑦 𝑖 ) ∶ 𝑖 = 1, … , 𝑛} (6.1)
where each sample 𝑖 is associated with a feature vector x𝑖 ∈ 𝒳 and a tar-
get variable 𝑦 𝑖 ∈ 𝒴. We assume that samples are random, independent,
and identically distributed (i.i.d.) observations drawn according to

P(𝑥, 𝑦) = P(𝑦 ∣ 𝑥) P(𝑥).

Both distributions P(𝑥) and P(𝑦 ∣ 𝑥) are fixed but unknown.


This is equivalent to the original SLT setup stated by V. N. Vapnik
(1999), where a generator produces random vectors x according to a
fixed but unknown probability distribution P(𝑥) and a supervisor re-
turns an output value 𝑦 for every input vector 𝑥 according to a condi-
tional distribution function P(𝑦 ∣ 𝑥), also fixed but unknown.
Moreover, note that this setup is compatible with the idea of tidy
data and 3NF (see section 4.4). Of course, we assume 𝑋, 𝑌 are only the
measured variables (or non-prime attributes). In practice, it means that
we set aside the keys in the learning process.
In terms of the tables defined in section 5.1, given a table 𝑇 = (𝐾, 𝐻, 𝑐) in the desired observational unit, a row 𝑟 such that card(𝑟) > 0, and a chosen target variable ℎ ∈ 𝐻, we have a corresponding target 𝑦 = 𝑐(𝑟, ℎ) and a feature vector x that corresponds to the tuple

(𝑐(𝑟, ℎ′ ) ∶ ℎ′ ∈ 𝐻 ∖ {ℎ} ).

Similarly, the variables 𝐾 that describe each unit are set aside, as it does
not make sense to infer general rules from them.
From the statistical point of view, learning problems consist of an-
swering questions about the distribution of the data.

6.2.1 Learning tasks


In terms of predictive learning, given the before-mentioned scenario, we
can refine our goals by tackling specific tasks3.
Consider a learning machine capable of generating a set of functions,
or models, 𝑓(𝑥; 𝜃) ≡ 𝑓𝜃 (𝑥), for a set of parametrizations 𝜃 ∈ Θ and such
that 𝑓𝜃 ∶ 𝒳 → 𝒴. In a learning task, we must choose, among all possible
𝑓𝜃 , the one that predicts the target variable in the best possible way.
3I consider tasks as well-defined subproblems of a higher-level problem.

In order to learn, we must first define the loss (or discrepancy) ℒ


between the response 𝑦 to a given input 𝑥, drawn from P(𝑥, 𝑦), and the
response provided by the learned function.
Then, given the risk function

𝑅(𝜃) = ∫ ℒ(𝑦, 𝑓𝜃 (𝑥)) 𝑑P(𝑥, 𝑦), (6.2)

the goal is to find the function 𝑓𝜃 that minimizes 𝑅(𝜃) where the only
available information is the training set given by (6.1).
This formulation encompasses many specific tasks. I focus on two of
them, which I believe are the most fundamental ones: binary data clas-
sification4 and regression estimation5. (I leave aside the density estimation
problem, since it is not addressed in the remainder of the book.)

Binary data classification task


In this task, the output 𝑦 takes on only two possible values, zero or one6
— called the negative and the positive class, respectively —, and the
functions 𝑓𝜃 are indicator functions. Choosing the loss

ℒ(𝑦, 𝑓𝜃 (𝑥)) = 0 if 𝑦 = 𝑓𝜃 (𝑥), and ℒ(𝑦, 𝑓𝜃 (𝑥)) = 1 if 𝑦 ≠ 𝑓𝜃 (𝑥),

the risk (6.2) becomes the probability of classification error. The func-
tion 𝑓𝜃 , in this case, is called a classifier and 𝑦 is called the label.

Regression estimation task


In this task, the output 𝑦 is a real value and the functions 𝑓𝜃 are real-
valued functions. The loss function is the squared error
ℒ(𝑦, 𝑓𝜃 (𝑥)) = (𝑦 − 𝑓𝜃 (𝑥))² .

In section 6.3, we show that the function that minimizes the risk with
such a loss function is the so-called regression. The estimator 𝑓𝜃 of the
regression, in this case, is called a regressor.
4In SLT, Vapnik calls it pattern recognition.
5We are not talking about regression analysis; regression estimation is closer to the
scoring task definition by N. Zumel and J. Mount (2019). Practical Data Science with R.
2nd ed. Shelter Island, NY, USA: Manning.
6Alternatively, negative class is represented by −1 and positive class by 1.
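
As a toy illustration only, the two losses can be computed empirically in Python; these sample averages stand in for the risk integral over the unknown distribution, they are not the risk itself.

    import numpy as np

    # binary classification: empirical 0-1 loss (error rate)
    y_true = np.array([0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 1])
    error_rate = np.mean(y_true != y_pred)  # fraction of misclassified samples

    # regression estimation: empirical squared error
    y = np.array([1.2, 0.7, 2.5])
    y_hat = np.array([1.0, 1.0, 2.0])
    mse = np.mean((y - y_hat) ** 2)         # mean of squared residuals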

6.2.2 A few remarks


These two tasks are quite general and can be applied to a wide range
of problems. The modeling of the task at hand and choice of the loss
function are crucial to the success of the learning process.
About these learning tasks, we can make a few remarks.

Supervised and semisupervised learning In both cases, classifica-


tion and regression estimation, the learning task is to find the function
that maps the input data to the output data in the best possible way. Al-
though the learning machine described generates models in a supervised
manner — i.e., the target is known for all samples in the training set —,
there are alternative ways to solve the inductive learning problem, such
as the semisupervised approach, where the model can be trained with a
small subset of labeled data and a large subset of unlabeled data — that
is, data whose outputs 𝑦 are unknown.

Generative and discriminative models Any learning machine pro-


duces a model that describes the relationship between the input and out-
put data. This model can be generative or discriminative. Generative
models describe the joint probability distribution P(𝑥, 𝑦) and can also be
used to generate new data. Discriminative models, on the other hand,
describe the conditional probability distribution P(𝑦 ∣ 𝑥) directly and
can only be used to make predictions. Generative models are usually
much more complex than discriminative models7, but they hold more
information about the data. If you only need to solve the predictive prob-
lem, prefer a discriminative model.

Multiclass classification In the binary classification task, the out-


put 𝑦 is a binary variable. However, it is possible to have a multiclass
classification task, where 𝑦 can take on more than two possible values.
Although some learning methods can address directly the multiclass
classification task, it is possible to transform the problem into a binary
classification task. The most common method is one-versus-all, where
we train 𝑙 binary classifiers, one for each class, and the class with the
highest score is the predicted class. Another method is the one-versus-
one method, where we train 𝑙(𝑙 − 1)/2 binary classifiers, one for each
pair of classes, and the class with the most votes is the predicted class.

7Since modeling P(𝑥, 𝑦) indirectly models P(𝑦 ∣ 𝑥) and P(𝑥).



As one should expect, dealing with more than two classes is more com-
plex than dealing with only two classes. If possible, prefer to deal with
binary classification tasks first.

Number of inputs and outputs Note that the definition of the learn-
ing problem does not restrict the number of inputs and outputs. The in-
put data can be a scalar, a vector, a matrix, or a tensor, and the output as
well. The learning machine must be able to handle the input and output
data according to the problem.

6.3 Optimal solutions


In this section, I show that the optimal solutions for the tasks of binary
data classification and regression estimation depend only on P(𝑦 ∣ 𝑥)
(i.e. discriminative models). This is useful to understand how good a
solution can possibly be and to derive practical solutions in the next sec-
tions.

6.3.1 Bayes classifier


The optimal solution for the binary data classification task is the Bayes
classifier, which minimizes the probability of classification error. The
Bayes classifier is defined as

𝑓Bayes(𝑥) = arg max_{𝑦∈𝒴} P(𝑦 ∣ 𝑥).

We can easily see that the Bayes classifier is the optimal solution for
the binary data classification task. The probability of classification error
for an arbitrary classifier 𝑓 is

𝑅(𝑓) = ∫ 𝟙𝑓(𝑥)≠𝑦 𝑑P(𝑥, 𝑦) = ∬ 𝟙𝑓(𝑥)≠𝑦 𝑑P(𝑦|𝑥) 𝑑P(𝑥),

where 𝟙⋅ is the indicator function that returns one if the condition is true
and zero otherwise. Let 𝑏(𝑥) = P(𝑦 = 1 ∣ 𝑥); we have that

∫ 𝟙𝑓(𝑥)≠𝑦 𝑑P(𝑦|𝑥) = 𝑏(𝑥)𝟙𝑓(𝑥)=0 + (1 − 𝑏(𝑥))𝟙𝑓(𝑥)=1 ,

which means only one of the terms is nonzero for each 𝑥. Thus, the risk
is minimized by choosing a classifier that 𝑓(𝑥) = 1 if 𝑏(𝑥) > 1 − 𝑏(𝑥)
and 𝑓(𝑥) = 0 otherwise. This is the Bayes classifier.

Consequently, the Bayes error rate, or irreducible error, is the lowest


possible loss for any classifier in a given problem. The Bayes error rate
sums the errors of the Bayes classifier for each class:

𝑅Bayes = ∫ [𝑏(𝑥)𝟙𝑓Bayes (𝑥)=0 + (1 − 𝑏(𝑥)) 𝟙𝑓Bayes (𝑥)=1 ] 𝑑P(𝑥).

We know that 𝑓Bayes (𝑥) = 1 if 𝑏(𝑥) > 0.5 and 𝑓Bayes (𝑥) = 0 otherwise.
Thus, the Bayes error rate can be rewritten as

𝑅Bayes = ∫ min {𝑏(𝑥), 1 − 𝑏(𝑥)} 𝑑P(𝑥).

Figure 6.2: Bayes classifier illustration.

[Figure: the two class-conditional densities P(𝑥 ∣ 𝑦 = 0) and P(𝑥 ∣ 𝑦 = 1) with a vertical decision boundary between them.]

The Bayes classifier is the line that separates the two classes. The
Bayes error is a result of the darker area in which the distributions
of the classes intersect.

Figure 6.2 illustrates the Bayes classifier and its error rate. The verti-
cal line represents the Bayes classifier that separates the classes the best
way possible in the space of the feature vectors 𝑥. Since the distributions
P(𝑥 ∣ 𝑦 = 0) and P(𝑥 ∣ 𝑦 = 1) may intersect, there is a region where the
Bayes classifier cannot always predict the class correctly.
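To make this concrete, the following sketch (in Python, with assumed Gaussian class-conditional densities and equal priors; all names are illustrative) estimates the Bayes error rate by Monte Carlo and compares it with the closed-form value.

    import numpy as np
    from scipy.stats import norm

    # Assumed setup: P(x|y=0) = N(0, 1), P(x|y=1) = N(2, 1), equal priors.
    p0, p1 = 0.5, 0.5

    def bayes_classifier(x):
        # Predict the class with the largest posterior P(y|x), proportional to P(x|y)P(y).
        return (p1 * norm.pdf(x, loc=2.0) > p0 * norm.pdf(x, loc=0.0)).astype(int)

    # Monte Carlo estimate of the Bayes error rate: E[min{b(x), 1 - b(x)}].
    rng = np.random.default_rng(0)
    y = rng.binomial(1, p1, size=100_000)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    b = p1 * norm.pdf(x, 2.0) / (p1 * norm.pdf(x, 2.0) + p0 * norm.pdf(x, 0.0))
    print("estimated Bayes error:", np.mean(np.minimum(b, 1 - b)))
    # With these densities the decision boundary is x = 1, so the exact
    # Bayes error is P(x > 1 | y=0)/2 + P(x < 1 | y=1)/2 = norm.cdf(-1).
    print("exact Bayes error:", norm.cdf(-1.0))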

6.3.2 Regression function


In the regression estimation task, the goal is to approximate the optimal
solution, called regression function,

𝑟(𝑥) = ∫ 𝑦 𝑑P(𝑦 ∣ 𝑥), (6.3)

that is the expected value of the target variable 𝑦 given the input 𝑥.
It is easy to show that the regression function minimizes the risk
(6.2) with loss
ℒ(𝑦, 𝑟(𝑥)) = (𝑦 − 𝑟(𝑥))².
The risk functional for an arbitrary function 𝑓 is

𝑅(𝑓) = ∫ (𝑦 − 𝑓(𝑥))² 𝑑P(𝑥, 𝑦)
     = ∫ 𝑦² 𝑑P(𝑦) − 2 ∫ 𝑓(𝑥) [∫ 𝑦 𝑑P(𝑦 ∣ 𝑥)] 𝑑P(𝑥) + ∫ 𝑓(𝑥)² 𝑑P(𝑥),

however we can substitute 𝑟(𝑥) for the inner integral and obtain

𝑅(𝑓) = ∫ 𝑦² 𝑑P(𝑦) − 2 ∫ 𝑓(𝑥)𝑟(𝑥) 𝑑P(𝑥) + ∫ 𝑓(𝑥)² 𝑑P(𝑥)
     = ∫ 𝑦² 𝑑P(𝑦) + ∫ [𝑓(𝑥)² − 2𝑓(𝑥)𝑟(𝑥)] 𝑑P(𝑥).

Since the first term is a constant, the risk is minimized by minimizing

𝑓(𝑥)² − 2𝑓(𝑥)𝑟(𝑥).
Differentiating the last expression with respect to 𝑓(𝑥) and setting the derivative to zero, we obtain

d/d𝑓(𝑥) [𝑓(𝑥)² − 2𝑓(𝑥)𝑟(𝑥)] = 2𝑓(𝑥) − 2𝑟(𝑥) = 0 ⇒ 𝑓(𝑥) = 𝑟(𝑥).
Like the Bayes classifier, the stochastic nature of the data leads to an
irreducible error in the regression estimation task. We have that
𝑅(𝑟) = ∫ (𝑦 − 𝑟(𝑥))² 𝑑P(𝑥, 𝑦) = ∫ 𝑦² 𝑑P(𝑦) − ∫ 𝑟(𝑥)² 𝑑P(𝑥),

where the first term is

E[𝑦²] = Var(𝑦) + E[𝑦]²

and the second term is

E[E[𝑦 ∣ 𝑥]²] = Var(E[𝑦 ∣ 𝑥]) + E[E[𝑦 ∣ 𝑥]]² = Var(E[𝑦 ∣ 𝑥]) + E[𝑦]².

Thus, the irreducible error is

𝑅(𝑟) = Var(𝑦) − Var(E[𝑦 ∣ 𝑥]).
The interpretation of the irreducible error comes from the law of the
total variance:
Var(𝑦) = E[Var(𝑦 ∣ 𝑥)] + Var(E[𝑦 ∣ 𝑥]) ,
where the first term is known as the unexplained variance and the sec-
ond term, as the explained variance. The equality 𝑅(𝑟) = E[Var(𝑦 ∣ 𝑥)]
captures the idea that this variance is the intrinsic uncertainty that can-
not be further reduced.

Figure 6.3: Unexplained variance is the error of the regression.

[Figure: points scattered around the line 𝑟(𝑥) = 𝑥 for 𝑥 in [0, 1] with 𝜎 = 1; the vertical spread is the unexplained variance and the spread along the 𝑥-domain is the explained variance.]

Expected error of the regression function for data generated by
P(𝑦 ∣ 𝑥) P(𝑥) such that P(𝑦 ∣ 𝑥) = 𝒩(𝑥, 1) and P(𝑥) = 𝒰(0, 1).
The regression function is 𝑟(𝑥) = 𝑥.

Figure 6.3 illustrates the irreducible error of the regression function


for arbitrary distributions P(𝑦 ∣ 𝑥) and P(𝑥). In this case, E[Var(𝑦 ∣ 𝑥)] =
1, which means that points drawn by P(𝑥, 𝑦) are distributed around the
regression function 𝑟(𝑥) with a standard deviation of one. The explained
variance is the spread of the regression across the x-domain.
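A quick simulation helps to check these quantities. The sketch below (Python, assuming the same distributions as in fig. 6.3) estimates the total, explained, and unexplained variance.

    import numpy as np

    # Assumed setting, as in fig. 6.3:
    # P(x) = Uniform(0, 1), P(y | x) = Normal(x, 1), so r(x) = x.
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=1_000_000)
    y = rng.normal(loc=x, scale=1.0)

    total_var = np.var(y)                 # Var(y)
    explained = np.var(x)                 # Var(E[y|x]) = Var(x) = 1/12
    unexplained = np.mean((y - x) ** 2)   # E[Var(y|x)] = irreducible error = 1

    print(f"Var(y)      = {total_var:.3f}")    # approx. 1 + 1/12
    print(f"explained   = {explained:.3f}")    # approx. 0.083
    print(f"unexplained = {unexplained:.3f}")  # approx. 1.000, the risk of r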

6.4 ERM inductive principle


It is very interesting to study the optimal solution for learning tasks, but
in the real-world, we do not have access to the distributions P(𝑥) and
P(𝑦 ∣ 𝑥). We must rely on the training data (𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ) to infer
a solution.
In the following sections, for the sake of simplicity, let 𝑧 describe the
pair (𝑥, 𝑦) and 𝐿(𝑧, 𝜃) be a generic loss function for the model 𝑓𝜃 . Note
that the training dataset is thus a set of 𝑛 i.i.d. samples 𝑧1 , … , 𝑧𝑛 .
Since the distribution P(𝑧) is unknown, the risk functional 𝑅(𝜃) is
replaced by the empirical risk functional
𝑅𝑛(𝜃) = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝐿(𝑧𝑖, 𝜃). (6.4)

Approximating 𝑅(𝜃) by the empirical risk functional 𝑅𝑛 (𝜃) is the
so-called empirical risk minimization (ERM) inductive principle. The
ERM principle is the basis of the SLT.
Traditional methods, such as least squares, maximum likelihood,
and maximum a posteriori, are all realizations of the ERM principle for
specific loss functions and hypothesis spaces.
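For instance, a minimal sketch of the ERM principle with squared loss and a linear model 𝑓𝜃(𝑥) = 𝜃0 + 𝜃1𝑥 reduces to ordinary least squares (Python; the toy data below is assumed only for illustration).

    import numpy as np

    # Empirical risk R_n(theta) = (1/n) sum L(z_i, theta) with squared loss
    # and a linear model f_theta(x) = theta0 + theta1 * x.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=200)
    y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=200)

    def empirical_risk(theta):
        pred = theta[0] + theta[1] * x
        return np.mean((y - pred) ** 2)

    # Minimizing R_n over theta is exactly the least-squares problem; here we
    # solve it in closed form with the normal equations.
    X = np.column_stack([np.ones_like(x), x])
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("theta_hat =", theta_hat, " R_n(theta_hat) =", empirical_risk(theta_hat))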

6.4.1 Consistency of the learning process


One important question about the ERM principle is the consistency of
the learning process. Consistency means that, given a sufficient number
of samples, the empirical risk functional 𝑅𝑛 (𝜃) converges to the true risk
functional 𝑅(𝜃) over the hypothesis space Θ.
The consistency of the ERM principle is guaranteed by the uniform
(two-sided) convergence8 of the empirical risk functional 𝑅𝑛 (𝜃) to the
true risk functional 𝑅(𝜃) over the hypothesis space Θ. The uniform con-
vergence is defined as

lim_{𝑛→∞} P( sup_{𝜃∈Θ} |𝑅𝑛(𝜃) − 𝑅(𝜃)| > 𝜖 ) = 0.
8Actually, only a weaker one-sided uniform convergence is needed; consult a detailed
explanation in chapter 2 of V. N. Vapnik (1999). The nature of statistical learning theory.
2nd ed. Springer-Verlag New York, Inc. isbn: 978-1-4419-3160-3. The equivalence is a con-
sequence of the key theorem of learning proved by Vapnik and Chervonenkis in 1989 and
later translated to English in V. N. Vapnik and A. Chervonenkis (1991). “The necessary
and sufficient conditions for consistency of the method of empirical risk minimization”.
In: Pattern Recognition and Image Analysis 1.3, pp. 284–305.

6.4.2 Rate of convergence


Beyond consistency, it is also useful to understand the rate at which
𝑅𝑛 (𝜃) converges to 𝑅(𝜃) as the sample size 𝑛 increases. It is possible for
a learning machine to be consistent but have a slow convergence rate,
which means that a large number of samples is needed to achieve a good
solution.
The asymptotic rate of convergence of the empirical risk functional
𝑅𝑛 (𝜃) is fast if, for any 𝑛 > 𝑛0 , the exponential bound

P(𝑅(𝜃𝑛) − 𝑅(𝜃) > 𝜖) < exp(−𝑐𝑛𝜖²)

holds true, where 𝑐 is a positive constant.

6.4.3 VC entropy
Let 𝐿(𝑧, 𝜃), 𝜃 ∈ Θ, be a set of bounded loss functions, i.e.

|𝐿(𝑧, 𝜃)| < 𝑀,

for some constant 𝑀 and all 𝑧 and 𝜃. One can construct 𝑛-dimensional
vectors
𝑙(𝑧1 , … , 𝑧𝑛 ; 𝜃) = [𝐿(𝑧1 , 𝜃), … , 𝐿(𝑧𝑛 , 𝜃)].
Since the loss functions are bounded, this set of vectors belongs to an 𝑛-dimensional cube and has a finite minimal 𝜖-net9.
Consider the quantity 𝑁(𝑧1, … , 𝑧𝑛; Θ, 𝜖) that counts the number of elements of the minimal 𝜖-net of that set of vectors. Since the quantity 𝑁 is a random variable, we can define the VC entropy as

𝐻(𝑛; Θ, 𝜖) = E[ln 𝑁(𝑧1 , … , 𝑧𝑛 ; Θ, 𝜖)] ,

where 𝑧𝑖 are i.i.d. samples drawn from some P(𝑧).


If 𝐿(𝑧, 𝜃), 𝜃 ∈ Θ, is a set of indicator functions (i.e., the loss in a
binary classification task), we measure the diversity of this set using the
quantity 𝑁(𝑧1 , … , 𝑧𝑛 ; Θ) that counts the number of different separations
of the given sample that can be made by the functions. In this case, the
minimal 𝜖-net for 𝜖 < 1 does not depend on 𝜖 and is a subset of the
vertices of the unit cube.
A necessary and sufficient condition for uniform convergence is
lim_{𝑛→∞} 𝐻(𝑛; Θ, 𝜖)/𝑛 = 0, (6.5)
9An 𝜖-net is a set of points that are 𝜖-close to any point in the set.

for all 𝜖 > 0.


The VC entropy measures the complexity of the hypothesis space
Θ. The intuition behind the need for a decreasing VC entropy with in-
creasing numbers of observations is related to the nonfalsifiability of the
learning machine. For instance, the set of functions that can always sep-
arate the training data perfectly (contains all the vertices of the cube) is
nonfalsifiable because it implies that the minimum of the empirical risk
is zero independently of the value of the true risk.

6.4.4 Growth function and VC dimension


It turns out that we can guarantee both the uniform convergence and
the fast rate of convergence independently of P(𝑧). Actually,
lim_{𝑛→∞} 𝐺(𝑛; Θ)/𝑛 = 0

is the necessary and sufficient condition, where

𝐺(𝑛; Θ) = ln sup_{𝑧1,…,𝑧𝑛} 𝑁(𝑧1, … , 𝑧𝑛; Θ),

is the growth function of the hypothesis space Θ.


V. Vapnik and Chervonenkis (1968)10 showed that the growth func-
tion either satisfies
𝐺(𝑛; Θ) = 𝑛 ln 2
or is bounded by
𝐺(𝑛; Θ) ≤ ℎ (ln(𝑛/ℎ) + 1),

where ℎ is an integer. Thus, the growth function is either linear or log-
arithmic in 𝑛. In the first case, we say that the VC dimension of the
hypothesis space is infinite, and in the second case, the VC dimension
is ℎ.
A finite VC dimension is enough to imply both consistency and a
fast rate of convergence.

Intuitions about the VC dimension


For a set of indicator functions, the VC dimension is the maximum num-
ber of vectors that can be shattered by the functions. If, for any 𝑛, there
10V. Vapnik and A. Chervonenkis (1968). “On the uniform convergence of relative
frequencies of events to their probabilities”. In: Doklady Akademii Nauk USSR. vol. 181.
4, pp. 781–787.

Figure 6.4: VC dimension of a set of lines in the plane.

[Figure: three points in the plane separated by lines 𝜃1, 𝜃2, and 𝜃3.]

The VC dimension of the lines in the plane is equal to 3, since a
line can shatter 3 points in all 8 possible ways, but not four points.

is a set of 𝑛 vectors that can be shattered by the functions, the VC di-


mension is infinite. We say that ℎ vectors can be shattered if they can be
separated into two classes in all 2^ℎ possible ways. Figure 6.4 illustrates
the VC dimension of a set of lines in the plane.
One misconception about the VC dimension is that it is related to
the number of parameters of the model. The VC dimension is actually
related to the complexity of the hypothesis space, not to the number of
parameters. For instance, the VC dimension of functions

𝑓(𝑥; 𝜃) = 𝟙_{sin(𝜃𝑥) > 0}

is infinite, even though the parameter 𝜃 is a scalar. See fig. 6.5. By in-
creasing the frequency 𝜃 of the sine wave, the function can approximate
any set of points.
This opens remarkable opportunities to find good solutions contain-
ing a huge number of parameters11 but with a finite VC dimension.

11Sometimes, like in a linear model, the number of parameters is proportional to the


number of dimensions of the feature vector.

Figure 6.5: High-frequency sine wave functions have an infinite


VC dimension.

High-frequency sine waves approximate any set of points well,


even though they may come from a low-frequency sine wave or
any other function.

6.5 SRM inductive principle


The ERM principle is a powerful tool to study the generalization ability
of the learning process. By generalization ability, we mean the ability
of the learning machine to predict the output of new data that was not
seen during the training process. However, it relies on the hypothesis
that the number of samples tends to infinity.
In fact, V. N. Vapnik (1999)12 summarizes the bounds for the gener-
alization ability of learning machines in the following way13:

𝑅(𝜃𝑛) ≤ 𝑅𝑛(𝜃𝑛) + (𝐵ℰ/2) (1 + √(1 + 4𝑅𝑛(𝜃𝑛)/(𝐵ℰ))), (6.6)

12V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-Verlag
New York, Inc. isbn: 978-1-4419-3160-3.
13For the sake of the argument, we consider only the expression for bounded losses
and a hypothesis space with an infinite number of functions. Rigorously, the loss function
may not be bounded; consult the original work for the complete expressions.

with

ℰ = 4 (ℎ (ln(2𝑛/ℎ) + 1) − ln(𝜂/4)) / 𝑛,
where 𝐵 is the upper bound of the loss function, ℎ is the VC dimension
of the hypothesis space, 𝑛 is the number of samples. The term 𝜂 is the
confidence level, i.e., the inequality holds with probability 1 − 𝜂.
It is easy to see that as the number of samples 𝑛 increases, the em-
pirical risk 𝑅𝑛 (𝜃𝑛 ) approaches the true risk 𝑅(𝜃𝑛 ). Also, the greater the
VC dimension ℎ, the greater the term ℰ, decreasing the generalization
ability of the learning machine.
In other words, if 𝑛/ℎ is small, a small empirical risk does not guar-
antee a small value for the actual risk. A consequence is that we need
to minimize both terms of the right-hand side of the inequality eq. (6.6)
to achieve a good generalization ability.
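A small numerical sketch (Python; the values of 𝑅𝑛, 𝑛, ℎ, and 𝜂 are assumed) shows how quickly the bound in eq. (6.6) loosens as the VC dimension grows relative to the number of samples.

    import math

    def vc_bound(emp_risk, n, h, eta=0.05, B=1.0):
        # Upper bound on the true risk from eq. (6.6), for a loss bounded by B.
        eps = 4 * (h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n
        return emp_risk + (B * eps / 2) * (1 + math.sqrt(1 + 4 * emp_risk / (B * eps)))

    # The bound loosens quickly as the VC dimension h grows relative to n.
    for h in (5, 50, 500):
        print(h, round(vc_bound(emp_risk=0.05, n=10_000, h=h), 3))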

Table 6.1: Overfitting and underfitting.

Problem Empirical risk Confidence interval


Underfitting High Low
Overfitting Low High

Two problems that can arise in the learning process are underfit-
ting and overfitting. Underfitting occurs when the model is too
simple (low VC dimension) and cannot capture the complexity of
the training data (high empirical risk). Overfitting occurs when
the model is too complex (high VC dimension increases the con-
fidence interval) and fits the training data almost perfectly (low
empirical risk).

Failure to balance the optimization of these terms leads to two prob-


lems: underfitting and overfitting. Table 6.1 summarizes the problems.
The structural risk minimization (SRM) principle consists of mini-
mizing both the empirical risk (optimizing the parameters of the model)
and the confidence interval (controlling VC dimension).
Let Θ𝑘 ⊂ Θ and
𝑆 𝑘 = {𝐿(𝑧, 𝜃) ∶ 𝜃 ∈ Θ𝑘 }
such that
𝑆 1 ⊂ 𝑆 2 ⊂ ⋯ ⊂ 𝑆𝑛 ⊂ … ,

satisfying
ℎ1 ≤ ℎ2 ≤ ⋯ ≤ ℎ𝑛 ≤ … ,

where ℎ𝑘 is the finite VC dimension14 of each set 𝑆 𝑘 . This is called an


admissible structure.

Figure 6.6: SRM trade-off.

[Figure: risk as a function of the index 𝑘 of the admissible structure; the empirical risk decreases with 𝑘 while the confidence interval grows, and their sum, the risk upper bound, is smallest at some 𝑘∗ between ℎ1 and ℎ𝑛.]

The upper bound of the risk is the sum of the empirical risk and
the confidence interval. The smallest bound is found for some 𝑘∗
in the admissible structure.

Given the observations 𝑧1 , … , 𝑧𝑛 , the SRM principle chooses a func-


tion 𝐿(𝑧, 𝜃𝑛𝑘 ) that minimizes the empirical risk 𝑅𝑛 (𝜃𝑛𝑘 ) in the subset 𝑆 𝑘
for which the guaranteed risk — upper bound considering the confi-
dence interval — is minimal. This is a trade-off between the quality of
the approximation and the complexity of the approximating function —
see fig. 6.6.

14Note that the VC dimension considering the whole set Θ might be infinite. Moreover,
in the original formulation, the sets 𝑆𝑘 also need to satisfy some bounds; read more in
chapter 4 of V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-
Verlag New York, Inc. isbn: 978-1-4419-3160-3.

6.5.1 Bias-variance trade-off


The trade-off that the SRM principle deals with is the general case of the
so-called bias-variance trade-off. The bias-variance trade-off is a well-
known concept in machine learning that describes the relationship be-
tween different kinds of errors a model can have.
The bias error comes from failure to capture relevant relationships
between features and target outputs. The variance error comes from
erroneously modeling the random noise in the training data.
The terms bias and variance (and the irreducible error) are clearly
illustrated by studying the particular regression estimation task.
Consider a learning machine that produces a function 𝑓(𝑥;̂ 𝐷) based
on the training set

𝐷 = {(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 )}

such that
𝑦 𝑖 = 𝑓(𝑥𝑖 ) + 𝜖,

for a fixed function 𝑓 and a random noise 𝜖 with zero mean and variance
𝜎², where 𝑥𝑖 are i.i.d. samples drawn from some distribution P(𝑥).
Also, consider that 𝑓̄(𝑥) is the expected value of the function 𝑓̂(𝑥; 𝐷) over all possible training sets 𝐷, i.e.

𝑓̄(𝑥) = ∫ 𝑓̂(𝑥; 𝐷) 𝑑P(𝐷).

(Note that the models themselves are the random variable we are study-
ing here.)
For any model 𝑓̂, the expected (squared) error for a particular sample (𝑥, 𝑦), E𝐷[(𝑦 − 𝑓̂(𝑥; 𝐷))²], is

∫ (𝑦 − 𝑓̂(𝑥))² 𝑑P(𝐷, 𝜖) = ∫ (𝑦 − 𝑓(𝑥) + 𝑓(𝑥) − 𝑓̂(𝑥))² 𝑑P(𝐷, 𝜖)
  = ∫ (𝑦 − 𝑓(𝑥))² 𝑑P(𝐷)                          (6.7)
  + ∫ (𝑓(𝑥) − 𝑓̂(𝑥))² 𝑑P(𝐷)                       (6.8)
  + 2 ∫ (𝑦 − 𝑓(𝑥)) (𝑓(𝑥) − 𝑓̂(𝑥)) 𝑑P(𝐷, 𝜖).       (6.9)

The term (6.7) is the irreducible error:

∫ (𝑦 − 𝑓(𝑥))² 𝑑P(𝐷) = ∫ (𝑓(𝑥) + 𝜖 − 𝑓(𝑥))² 𝑑P(𝐷) = ∫ 𝜖² 𝑑P(𝐷) = 𝜎². (6.10)
As the best solution is 𝑓 itself, the error that comes from the noise is
unavoidable.
The term (6.9) is null:

∫ (𝑦 − 𝑓(𝑥)) (𝑓(𝑥) − 𝑓̂(𝑥)) 𝑑P(𝐷, 𝜖) = ∫ 𝜖 (𝑓(𝑥) − 𝑓̂(𝑥)) 𝑑P(𝐷, 𝜖)
  = [∫ 𝜖 𝑑P(𝜖)] ∫ (𝑓(𝑥) − 𝑓̂(𝑥)) 𝑑P(𝐷) = 0,

since P(𝐷) and P(𝜖) are independent and E[𝜖] = 0 by definition.
We can apply a similar strategy to analyze the term (6.8):

∫ (𝑓(𝑥) − 𝑓̂(𝑥))² 𝑑P(𝐷) = ∫ (𝑓(𝑥) − 𝑓̄(𝑥) + 𝑓̄(𝑥) − 𝑓̂(𝑥))² 𝑑P(𝐷)
  = ∫ (𝑓(𝑥) − 𝑓̄(𝑥))² 𝑑P(𝐷)                      (6.11)
  + ∫ (𝑓̄(𝑥) − 𝑓̂(𝑥))² 𝑑P(𝐷)                      (6.12)
  + 2 ∫ (𝑓(𝑥) − 𝑓̄(𝑥)) (𝑓̄(𝑥) − 𝑓̂(𝑥)) 𝑑P(𝐷).      (6.13)
Now, the term (6.13) is also null:

∫ (𝑓(𝑥) − 𝑓̄(𝑥)) (𝑓̄(𝑥) − 𝑓̂(𝑥; 𝐷)) 𝑑P(𝐷)
  = (𝑓(𝑥) − 𝑓̄(𝑥)) ∫ (𝑓̄(𝑥) − 𝑓̂(𝑥; 𝐷)) 𝑑P(𝐷)
  = (𝑓(𝑥) − 𝑓̄(𝑥)) (𝑓̄(𝑥) − ∫ 𝑓̂(𝑥; 𝐷) 𝑑P(𝐷)) = 0,

since 𝑓̄(𝑥) is the expected value of 𝑓̂(𝑥; 𝐷).
The term (6.11) does not depend on the training set, so

∫ (𝑓(𝑥) − 𝑓̄(𝑥))² 𝑑P(𝐷) = (𝑓(𝑥) − 𝑓̄(𝑥))². (6.14)

This term is the square of the bias of the models.
The term (6.12) is the variance of the function 𝑓̂(𝑥; 𝐷):

∫ (𝑓̄(𝑥) − 𝑓̂(𝑥; 𝐷))² 𝑑P(𝐷) = E𝐷[(𝑓̄(𝑥) − 𝑓̂(𝑥; 𝐷))²] = Var𝐷(𝑓̂(𝑥; 𝐷)). (6.15)

Finally, putting it all together — i.e. eqs. (6.10), (6.14) and (6.15) — we have that the expected error for a particular sample (𝑥, 𝑦) is

E𝐷[(𝑦 − 𝑓̂(𝑥; 𝐷))²] = 𝜎² + (𝑓(𝑥) − E[𝑓̂(𝑥; 𝐷)])² + Var𝐷(𝑓̂(𝑥; 𝐷)).

The irreducible error is the regression error that cannot be reduced


by any model — see section 6.3.2. The bias error is the error that one
expects from the model acquired by the learning machine and that we
observe in the training data — i.e. the empirical risk. The variance er-
ror, which does not depend on the real function 𝑓 but on the models the
learning machine can generate, is the error that comes from how differ-
ent the models can be from each other — i.e. the confidence interval
that comes from the VC dimension.
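The decomposition can be checked numerically. The sketch below (Python, with an assumed target function, noise level, and a polynomial least-squares learner) estimates the bias and variance terms at a single point by simulating many training sets.

    import numpy as np

    # Monte Carlo check of E[(y - f_hat)^2] = sigma^2 + bias^2 + variance,
    # assuming f(x) = sin(2*pi*x), sigma = 0.3, and a degree-3 polynomial learner.
    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)
    sigma, n, degree, x0 = 0.3, 30, 3, 0.25   # evaluate at a single point x0

    preds = []
    for _ in range(2000):                      # many training sets D
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)      # fit f_hat(.; D)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)

    bias2 = (f(x0) - preds.mean()) ** 2
    variance = preds.var()
    print(f"sigma^2 = {sigma**2:.3f}, bias^2 = {bias2:.4f}, variance = {variance:.4f}")
    print(f"expected error at x0 = {sigma**2 + bias2 + variance:.4f}")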

6.5.2 Regularization
Also related to the SRM principle is the concept of regularization. Reg-
ularization encourages models to learn robust patterns within the data
rather than memorizing it.
Regularization techniques usually modify the loss by adding a pen-
alty term that depends on the complexity of the model. So, instead of
minimizing the empirical risk 𝑅𝑛 (𝜃), the learning machine minimizes
the regularized empirical risk
𝑅𝑛 (𝜃) + 𝜆Ω(𝜃),
where Ω(𝜃) is the complexity of the model and 𝜆 is a hyperparameter
that controls the trade-off between the empirical risk and the complex-
ity. Note that the regularization term acts as a proxy for the confidence
interval in the SRM principle. However, regularization is often justified
by common sense or intuition, rather than by strong theoretical argu-
ments.
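As a minimal sketch of explicit regularization, consider the penalty Ω(𝜃) = ‖𝜃‖² (the ridge penalty), for which the regularized empirical risk under squared loss has a closed-form minimizer (Python; illustrative only, not a prescription of any particular library).

    import numpy as np

    # Regularized empirical risk R_n(theta) + lambda * Omega(theta)
    # with squared loss and Omega(theta) = ||theta||^2.
    def regularized_risk(theta, X, y, lam):
        residuals = y - X @ theta
        return np.mean(residuals ** 2) + lam * np.sum(theta ** 2)

    # Closed-form minimizer for this particular loss and penalty:
    # theta* = (X'X + n*lambda*I)^(-1) X'y.
    def ridge_fit(X, y, lam):
        n, d = X.shape
        return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

Larger values of 𝜆 shrink the coefficients toward zero, playing the role of the confidence-interval term in the SRM trade-off.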
Other approaches that indirectly control the complexity of the model
— such as early stopping, dropout, ensembles, and pruning — are often
called implicit regularization.

6.6 Linear problems


To realize the concepts of the SRM principle in practice, we consider
linear classification tasks.
For the examples in the following subsections, we use the datasets
for the AND and the XOR problem — see table 6.2. The AND problem
is linearly separable, while the XOR problem is not.

Table 6.2: AND and XOR datasets.

𝑥1  𝑥2  𝑦 = 𝑥1 ∧ 𝑥2        𝑥1  𝑥2  𝑦 = 𝑥1 ⊕ 𝑥2
0   0   0                  0   0   0
0   1   0                  0   1   1
1   0   0                  1   0   1
1   1   1                  1   1   0

The AND and XOR datasets are binary classification datasets


where the output 𝑦 is the “logical AND” and the “exclusive OR”
of the inputs 𝑥1 and 𝑥2 , i.e., 𝑦 = 𝑥1 ∧ 𝑥2 and 𝑦 = 𝑥1 ⊕ 𝑥2 .

We show two learning machines that implement the SRM principle


in different ways:

• The perceptron, which fixes the complexity of the model and tries
to minimize the empirical risk; and
• The maximal margin classifier, which fixes the empirical risk —
in this case, zero – and tries to minimize the confidence interval.

6.6.1 Perceptron
The perceptron is a linear classifier that generates a hyperplane that sep-
arates the classes in the feature space. It is a parametric model, and the
learning process minimizes the empirical risk by adjusting its fixed set
of parameters.
Parametric models are usually simpler and faster to fit, but they are
less flexible. In other words, it is up to the researcher to choose the best
model “size” for the problem. If the model is too small, it will not be
able to capture the complexity of the data. If the model is too large, it
tends to be too complex, too slow to train, and might overfit to the data.

Definition 6.1: (Parametric model)

If the learning machine generates a set of functions 𝑓𝜃 where the


number of parameters |𝜃| is always fixed, the models are called
parametric.

Note, however, that the VC dimension and number of parameters are


not the same thing — consult section 6.4.4.
The perceptron model (with two inputs) is

𝑓(𝑥1 , 𝑥2 ; w = [𝑤 0 , 𝑤 1 , 𝑤 2 ]) = 𝑢(𝑤 0 + 𝑤 1 𝑥1 + 𝑤 2 𝑥2 ),

where 𝑢 is the Heaviside step function

𝑢(𝑥) = { 1  if 𝑥 > 0,
       { 0  otherwise.

The parameters 𝜃 = w are called the weights of the perceptron. The


equation w⋅x = 0, where x = [1, 𝑥1 , 𝑥2 ], is the equation of a hyperplane.

Figure 6.7: Perceptron decision boundaries in the AND dataset.

[Figure: the four AND points in the (𝑥1, 𝑥2) plane with the separating line of the perceptron.]

The perceptron assumes that the classes are linearly separable.
The hyperplane that separates the classes comes from the weights
of the model. In this case, 𝑤0 = −1.1, 𝑤1 = 0.6, and 𝑤2 = 1.

In fig. 6.7, we show the hyperplane (in this case, a line) that the
model with weights w = [−1.1, 0.6, 1] generates in this feature space.
As one can see, the classes are linearly separable, and the perceptron
model classifies the dataset correctly; see table 6.3.

Table 6.3: Truth table for the predictions of the perceptron in the
AND dataset.

𝑥1  𝑥2  𝑦   −1.1 + 0.6𝑥1 + 𝑥2   𝑦̂
0   0   0   −1.1                0
0   1   0   −0.1                0
1   0   0   −0.5                0
1   1   1    0.5                1

The perceptron model with parameters 𝑤0 = −1.1, 𝑤1 = 0.6, and
𝑤2 = 1 classifies the AND dataset correctly.

Figure 6.8: Perceptron decision boundaries in the XOR dataset.

[Figure: the four XOR points in the (𝑥1, 𝑥2) plane with one candidate separating line.]

The XOR dataset is not linearly separable. The hyperplane that
separates the classes comes from the weights of the model. In
this case, 𝑤0 = −0.5, 𝑤1 = 1, and 𝑤2 = −1. There is no way to
classify the XOR dataset correctly with a perceptron.

In fig. 6.8, we show the hyperplane that the model w = [−0.5, 1, −1]
generates for the XOR dataset. As one can see, the perceptron model
fails to solve the task since there is no single decision boundary that can
classify this data.

Table 6.4: Truth table for the predictions of the perceptron in the
XOR dataset.

𝑥1  𝑥2  𝑦   −0.5 + 𝑥1 − 𝑥2   𝑦̂
0   0   0   −0.5             0
0   1   1   −1.5             0
1   0   1    0.5             1
1   1   0   −0.5             0

The perceptron model with parameters 𝑤0 = −0.5, 𝑤1 = 1, and
𝑤2 = −1 fails to classify the XOR dataset correctly — as any other
perceptron would do.

It is easy to see that there are an infinite number of hyperplanes that


can separate the classes in the AND dataset. The training procedure
of the perceptron is a simple algorithm that adjusts the weights of the
model to find one of these hyperplanes — effectively minimizing the
empirical risk. The algorithm updates the weights iteratively for each
sample that is misclassified, repeating the samples as many times as nec-
essary. It stops when all samples are correctly classified.
For a binary classification problem and a perceptron with weights
w, there are 4 situations for a given sample x and 𝑦:
1. 𝑦 = 0 and 𝑢(w ⋅ x) = 0;
2. 𝑦 = 0 and 𝑢(w ⋅ x) = 1;
3. 𝑦 = 1 and 𝑢(w ⋅ x) = 0;
4. 𝑦 = 1 and 𝑢(w ⋅ x) = 1.
By definition, the algorithm must update the weights when situation
2 or 3 occurs. Let 𝑒 = 𝑦 − 𝑢(w ⋅ x) be the error of the model for a given
sample.
In situation 2, we have that w ⋅ x > 0 which means that the angle
𝛼 between the vectors w and x is less than 90∘ , since ‖w‖‖x‖ cos 𝛼 >
0 ⟹ cos 𝛼 > 0 ⟹ 𝛼 < 90∘ . To increase the angle between the

vectors, we can subtract 𝜂x from w, for some small 𝜂 > 0 — see fig. 6.9.
The error here is 𝑒 = −1.

Figure 6.9: Angle between w and x in a positive output.

[Figure: vectors w and x at an angle 𝛼 < 90∘; subtracting 𝜂x from w yields w′.]

A positive output for the perceptron with weights w and input
x means that the angle between the vectors is less than 90∘. To
increase the angle between the vectors, we can subtract 𝜂x from
w, for some small 𝜂 > 0.

In situation 3, we have that w ⋅ x < 0 which means that the angle


𝛼 between the vectors w and x is greater than 90∘ , since ‖w‖‖x‖ cos 𝛼 <
0 ⟹ cos 𝛼 < 0 ⟹ 𝛼 > 90∘ . To decrease the angle between the
vectors, we can add 𝜂x to w, for some small 𝜂 > 0 — see fig. 6.10. Now,
the error is 𝑒 = 1.

Figure 6.10: Angle between w and x in a negative output.

[Figure: vectors w and x at an angle 𝛼 > 90∘; adding 𝜂x to w yields w′.]

A negative output for the perceptron with weights w and input x
means that the angle between the vectors is greater than 90∘. To
decrease the angle between the vectors, we can add 𝜂x to w, for
some small 𝜂 > 0.

From those observations, we can derive a general update rule


w′ = w + 𝜂𝑒x,

where 𝜂 is a small positive number that controls the step size of the algo-
rithm. Note that this rule works even for cases 1 and 4, where the error
is zero.
The algorithm converges if 𝜂 is sufficiently small and the dataset is linearly separable. Note that the algorithm makes no effort to reduce the confidence interval.
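A minimal implementation of this training procedure on the AND dataset of table 6.2 is sketched below (Python; the learning rate and the stopping criterion are the ones described above).

    import numpy as np

    # A minimal perceptron trained on the AND dataset (table 6.2) with the
    # update rule w' = w + eta * e * x, where e = y - u(w . x).
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # x = [1, x1, x2]
    y = np.array([0, 0, 0, 1])                                  # y = x1 AND x2

    w, eta = np.zeros(3), 0.1
    for _ in range(100):                       # repeat passes over the samples
        errors = 0
        for xi, yi in zip(X, y):
            e = yi - (w @ xi > 0)              # u is the Heaviside step
            w += eta * e * xi
            errors += e != 0
        if errors == 0:                        # stop when all are classified
            break

    print("weights:", w)
    print("predictions:", (X @ w > 0).astype(int))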
The perceptron is (possibly) the simplest artificial neural network.
More complex networks can be built by stacking perceptrons in layers
and adding non-linear activation functions. The training strategies for
those networks are usually based on reducing the empirical risk using
the gradient descent algorithm while controlling the complexity of the
model with regularization techniques15. Consult appendix B.1.

6.6.2 Maximal margin classifier


We saw that the perceptron tries to minimize the empirical risk, but it
makes no effort to reduce the confidence interval. A different approach
would be to fix the empirical risk — in our case, assuming that the
classes are linearly separable, to fix it to zero — and minimize the con-
fidence interval.
The confidence interval is an increasing function

Ω(ℎ/𝑛),

where ℎ is the VC dimension of the hypothesis space and 𝑛 is the number


of samples. Since the number of training samples 𝑛 is fixed and finite
(sometimes even small), we can minimize the confidence interval by
minimizing the VC dimension ℎ.
In the case of the perceptron, since it can generate any hyperplane,
the VC dimension is ℎ = 𝑑 + 1, where 𝑑 is the number of dimensions of
the feature space — consult section 6.4.4.
Before we dive into the classifier that minimizes the confidence interval, consider the following property. V. N. Vapnik (1999)16 states that a Δ-margin separating hyperplane is the hyperplane

(w ⋅ x) − 𝑏 = 0, ‖w‖ = 1,

15To counterbalance the potential “excess” of neurons, techniques like 𝑙2 regulariza-


tion “disable” some neurons by pressuring their weights to zero.
16V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-Verlag
New York, Inc. isbn: 978-1-4419-3160-3.

that classifies a vector x as

𝑦 = {  1   if (w ⋅ x) − 𝑏 ≥ Δ,
    { −1   if (w ⋅ x) − 𝑏 ≤ −Δ.

Given that vectors x ∈ ℝ𝑑 belong to a (hyper)sphere of radius 𝑅, the VC


dimension of the Δ-margin separating hyperplane is

ℎ ≤ min(⌊𝑅²/Δ²⌋, 𝑑) + 1,

which can be less than 𝑑 + 1.


From that property, to minimize the confidence interval, we can
maximize the margin Δ of the hyperplane. The maximal margin clas-
sifier is a learning machine that generates the hyperplane that separates
the classes with no error and maximizes the margin between the classes.

Figure 6.11: Maximal margin classifier for the AND dataset.

[Figure: the four AND points in the (𝑥1, 𝑥2) plane, the optimal separating hyperplane, and its margin.]

The maximal margin classifier generates the hyperplane that max-
imizes the margin between the classes. In this case, the margin is
Δ = 0.5.

Thus, given the training set (x1 , 𝑦1 ), … , (x𝑛 , 𝑦𝑛 ), x ∈ ℝ𝑑 and 𝑦 𝑖 ∈


{−1, 1}, the optimal hyperplane — see fig. 6.11 — is the one that satisfies

𝑦 𝑖 [(w ⋅ x𝑖 − 𝑏)] ≥ 1,

for all 𝑖 = 1, … , 𝑛, that minimizes ‖w‖². The intuition behind the minimization of the coefficients is that we want only the vectors in the margin to be exactly equal to ±1. (Consequently, the vectors farther from the margin have values greater than 1 or less than −1.)
Without entering into the details of the optimization process, one
interesting property of the maximal margin classifier is that the separat-
ing hyperplane is built from the support vectors — the vectors that are
exactly in the margin. In fig. 6.11, the support vectors are the points
(1, 0), (0, 1), and (1, 1).
In other words, the maximal margin classifier is

𝑓(x) = sign(∑_{𝑖=1}^{𝑛} 𝑦𝑖 𝑎𝑖 (x𝑖 ⋅ x) − 𝑏),

for some 𝑏 and coefficients 𝑎𝑖 > 0 for the support vectors (𝑎𝑖 = 0 other-
wise).
In the case that the classes are not linearly separable, the maximal
margin classifier can be extended to the soft margin classifier, which sets
the empirical risk to a value greater than zero.
Moreover, since the number of parameters of the maximal margin
classifier depends on the training data (i.e., the number of support vec-
tors), it is a nonparametric model. Nonparametric models are those in
which the number of parameters is not fixed and can grow as needed
to fit the data. This property becomes more clear when we consider the
kernel trick, which allows the maximal margin classifier to deal with
nonlinear problems. Consult V. N. Vapnik (1999)17 for more details.
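In practice, a hard-margin linear support vector machine approximates the maximal margin classifier. The sketch below (Python with scikit-learn, using a large 𝐶 so that the soft margin is effectively hard) recovers the support vectors of the AND dataset.

    import numpy as np
    from sklearn.svm import SVC

    # AND data from table 6.2, with labels mapped to {-1, +1}.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([-1, -1, -1, 1])

    # A very large C makes the soft-margin SVM behave as a hard-margin
    # (maximal margin) classifier on separable data.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    print("w =", clf.coef_[0], " b =", clf.intercept_[0])
    print("support vectors:\n", clf.support_vectors_)   # (0,1), (1,0), (1,1)
    print("predictions:", clf.predict(X))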

6.7 Closing remarks


The SRM principle is a powerful tool to understand the generalization
ability of learning machines. The principle not only explains many of
the empirical results in ML but also provides a theoretical framework to
guide the development of new learning machines.
Many powerful methods have been proposed in the literature — e.g.,
support vector machines, boosting, and deep learning — that can deal
with complex nonlinear problems. I encourage the reader to dive into
the literature to learn more about these methods and the theoretical
principles behind them. Some comments about a few methods are given
in appendix B.
17V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-Verlag
New York, Inc. isbn: 978-1-4419-3160-3.
7  Data preprocessing
I find your lack of faith disturbing.
— Darth Vader, Star Wars: Episode IV – A New Hope (1977)

In this chapter, we discuss the data preprocessing, which is the process


of adjusting the data to make it suitable for a particular learning ma-
chine or, at the least, to ease the learning process.
Similarly to data handling, data preprocessing is done by applying a
series of operations to the data. However, some of the parameters of the
operations are not fixed but rather are fit from a data sampling. In the
context of inductive learning, the sampling is the training set.
The operations are dependent on the chosen learning method. So,
when planning the solution in our project, we must consider the prepro-
cessing tasks that are necessary to make the data suitable for the chosen
methods.
I present the most common data preprocessing tasks in three cat-
egories: data cleaning, data sampling, and data transformation. For
each task, I discuss the behavior of the data preprocessing techniques
in terms of fitting, adjustment of the training set, and application of the
preprocessor in production.
Finally, I discuss the importance of the default behavior of the model
when the preprocessing chain degenerates over a sample, i.e. when the
preprocessor decides that it has no strategy to adjust the data to make it
suitable for the model.


Chapter remarks

Contents
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.1 Formal definition . . . . . . . . . . . . . . . . . . 144
7.1.2 Degeneration . . . . . . . . . . . . . . . . . . . . 144
7.1.3 Data preprocessing tasks . . . . . . . . . . . . . . 145
7.2 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.1 Treating inconsistent data . . . . . . . . . . . . . 146
7.2.2 Outlier detection . . . . . . . . . . . . . . . . . . 148
7.2.3 Treating missing data . . . . . . . . . . . . . . . . 149
7.3 Data sampling . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.1 Random sampling . . . . . . . . . . . . . . . . . . 152
7.3.2 Scope filtering . . . . . . . . . . . . . . . . . . . . 152
7.3.3 Class balancing . . . . . . . . . . . . . . . . . . . 153
7.4 Data transformation . . . . . . . . . . . . . . . . . . . . . 154
7.4.1 Type conversion . . . . . . . . . . . . . . . . . . . 155
7.4.2 Normalization . . . . . . . . . . . . . . . . . . . . 157
7.4.3 Dimensionality reduction . . . . . . . . . . . . . . 158
7.4.4 Data enhancement . . . . . . . . . . . . . . . . . 159
7.4.5 Comments on unstructured data . . . . . . . . . . 160

Context

• Tidy data is not necessarily suitable for modeling.


• Parameters of the preprocessor are fitted rather than being fixed.

Objectives

• Understand the main data preprocessing tasks and techniques.


• Learn the behavior of the preprocessing chain in terms of fitting,
adjustment, and application.

Takeaways

• Each learning method requires specific data preprocessing tasks.


• Fitting the preprocessor is crucial to avoid leakage.
• Default behavior of the model when the preprocessing chain de-
generates must be specified.

7.1 Introduction
In chapters 4 and 5, we discussed data semantics and the tools to han-
dle data. They provide the grounds for preparation of the data as we
described in the data sprint tasks in section 3.5.3. However, the focus is
to guarantee that the data is tidy and in the observational unit of interest,
not to prepare it for modeling.
As a result, although data might be appropriate for the learning tasks
we described in chapter 6 — in the sense that we know what the feature
vectors and the target variable are —, they might not be suitable for the
machine learning methods we will use.
One simple example is the perceptron (section 6.6.1) that assumes
all input variables are real numbers. If the data contains categorical
variables, we must convert them to numerical variables before applying
the perceptron.
For this reason, the solution sprint tasks in section 3.5.3 include not
only the learning tasks but also the data preprocessing tasks, which are
dependent on the chosen machine learning methods.

Definition 7.1: (Data preprocessing)

The process of adjusting the data to make it suitable for a particu-


lar learning machine or, at the least, to ease the learning process.

This is done by applying a series of operations to the data, like in


data handling. The difference here is that some of the parameters of the
operations are not fixed; rather, they are fit from a data sampling. Once
fitted, the operations can be applied to new data, sample by sample.
As a result, a data processing technique acts in three steps:

1. Fitting: The parameters of the operation are adjusted to the train-


ing data (which has already been integrated and tidied, represents
well the phenomenon of interest, and each sample is in the correct
observational unit);
2. Adjustment: The training data is adjusted according to the fitted
parameters, eventually changing the sampling size and distribu-
tion;
3. Applying: The operation is applied to new data, sample by sam-
ple.

Understanding these steps and correctly defining the behavior of


each of them is crucial to avoid data leakage and to guarantee that the
model will behave as expected in production.

7.1.1 Formal definition


Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table that represents the data in the desired obser-
vational unit — as defined in section 5.1. In this chapter, without loss
of generality — as the keys are not used in the modeling process —, we
can consider 𝐾 = {1, 2, … } such that card(𝑖) = 0 if, and only if, 𝑖 > 𝑛.
That means that every row 𝑟 ∈ {1, … , 𝑛} is present in the table.
A data preprocessing strategy 𝐹 is a function that takes a table 𝑇 =
(𝐾, 𝐻, 𝑐) and returns an adjusted table 𝑇 ′ = (𝐾 ′ , 𝐻 ′ , 𝑐′ ) and a fitted pre-
processor 𝑓(𝑧; 𝜙) ≡ 𝑓𝜙 (𝑧) such that

𝑧 ∈ ⨉_{ℎ∈𝐻} 𝒟(ℎ) ∪ {?}

and 𝜙 are the fitted parameters of the operation. Similarly, 𝑧 ′ = 𝑓𝜙 (𝑧),


called the preprocessed tuple, satisfies

𝑧′ ∈ ⨉_{ℎ′∈𝐻′} 𝒟(ℎ′) ∪ {?}.

Note that we make no restrictions on the number of rows in the adjusted


table, i.e., preprocessing techniques can change the number of rows in
the training table.
In practice, strategy 𝐹 is a chain of dependent preprocessing opera-
tions 𝐹1 , …, 𝐹𝑚 such that, given 𝑇 = 𝑇 (0) , each operation 𝐹𝑖 is applied
to the table 𝑇 (𝑖−1) to obtain 𝑇 (𝑖) and the fitted preprocessor 𝑓𝜙𝑖 . Thus,
𝑇 ′ = 𝑇 (𝑚) and

𝑓(𝑧; 𝜙 = {𝜙1 , … , 𝜙𝑚 }) = (𝑓𝜙1 ∘ … ∘ 𝑓𝜙𝑚 ) (𝑧),

where ∘ is the composition operator. I say that they are dependent since
none of the operations can be applied to the table without the previous
ones.

7.1.2 Degeneration
The objective of the fitted preprocessor is to adjust the data to make it
suitable for the model. However, sometimes it cannot achieve this goal

for a particular input 𝑧. This can happen for many reasons, such as
unexpected values, information “too incomplete” to make a prediction,
etc.
Formally, we say that the preprocessor 𝑓𝜙 degenerates over tuple 𝑧
if it outputs 𝑧 ′ = 𝑓𝜙 (𝑧) such that 𝑧 ′ = (?, … , ?). In practice, that means
that the preprocessor decided that it has no strategy to adjust the data to
make it suitable for the model. For the sake of simplicity, if any step 𝑓𝜙𝑖
degenerates over tuple 𝑧 (𝑖) , the whole preprocessing chain degenerates1
over 𝑧 = 𝑧 (0) .
Consequently, in the implementation of the solution, the developer
must choose a default behavior for the model when the preprocessing
chain degenerates over a tuple. It can be as simple as returning a default
value or as complex as redirecting the tuple to a different pair of prepro-
cessor and model. Sometimes, the developer can choose to integrate this
as an error or warning in the user application.
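The fitting, adjustment, and application steps, together with a default behavior on degeneration, can be organized as in the following sketch (Python; all names are illustrative and only outline the idea).

    # Minimal sketch of the three steps of a preprocessing strategy and of a
    # default behavior when the chain degenerates over a sample.
    class Degenerate(Exception):
        """Raised when a preprocessor has no strategy for the given tuple."""

    def fit_chain(steps, rows):
        """Fit each operation on the training table, in order."""
        for step in steps:
            step.fit(rows)
            rows = step.adjust(rows)     # adjustment may change the sampling
        return steps, rows

    def apply_chain(steps, row, model, default=None):
        """Apply the fitted chain to one new sample; fall back on degeneration."""
        try:
            for step in steps:
                row = step.apply(row)
        except Degenerate:
            return default               # default behavior chosen by the developer
        return model(row)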

7.1.3 Data preprocessing tasks


The most common data preprocessing tasks can be divided into three
categories:

• Data cleaning;
• Data sampling; and
• Data transformation.

In the next sections, I address some of the most common data pre-
processing tasks in each of these categories. I present them in the order
they are usually applied in the preprocessing, but note that the order is
not fixed and can be changed according to the needs of the problem.

7.2 Data cleaning


Data cleaning is the process of removing errors and inconsistencies from
the data. This is usually done to make the data more reliable for train-
ing and to avoid bias in the learning process. Usually, such errors and
inconsistencies make the learning machines “confused” and can lead to
poor performance models.
Also, it includes the process of dealing with missing information,
which most machine learning methods do not cope with. Solutions
1Usually, this is implemented as an exception or similar programming mechanism.

range from the simple removal of the observations with missing data
to the creation of new information to encode the missing data.

7.2.1 Treating inconsistent data


There are a few, but important, tasks to be done during data preprocess-
ing in terms of invalid and inconsistent data — note that we assume
that most of the issues in terms of the semantics of the data have been
solved in the data handling phase. Especially in production, the devel-
oper must be aware of the behavior of the model when it faces informa-
tion that is not supposed to be present in the data.
One of the tasks is to ensure that physical quantities are dealt with
standard units. One must check whether all columns that store physical
quantities have the same unit of measurement. If not, one must convert
the values to the same unit. A summary of this preprocessing task is
presented in table 7.1.

Table 7.1: Unit conversion preprocessing task.

Unit conversion
Goal        Convert physical quantities into the same unit of measurement.
Fitting     None. User must declare the units to be used and, if appropriate, the conversion factors.
Adjustment  Training set is adjusted sample by sample, independently.
Applying    Preprocessor converts the numerical values and drops the unit of measurement column.

Moreover, if one knows that a variable must follow a specific range


of values, we can check whether the values are within this range. If
not, one must replace the values with missing data or with the closest
valid value. Alternatively, one can discard the observation based on that
criterion. Consult table 7.2 for a summary of this operation.
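A minimal sketch of the range check summarized in table 7.2, with the three possible behaviors, could look like the following (Python; illustrative names, reusing the Degenerate exception from the chain sketch of section 7.1.2).

    class Degenerate(Exception):
        pass  # as in the chain sketch of section 7.1.2

    # Range check for a numeric column with a fixed, user-declared range [a, b];
    # `mode` selects among the three behaviors of table 7.2.
    class RangeCheck:
        def __init__(self, column, a, b, mode="missing"):
            self.column, self.a, self.b, self.mode = column, a, b, mode
        def fit(self, rows):
            pass                                   # nothing to fit
        def adjust(self, rows):
            out = []
            for row in rows:
                try:
                    out.append(self.apply(dict(row)))
                except Degenerate:
                    continue                       # drop degenerated training rows
            return out
        def apply(self, row):
            x = row[self.column]
            if x is None or self.a <= x <= self.b:
                return row
            if self.mode == "missing":
                row[self.column] = None            # (a) replace with missing value
            elif self.mode == "clip":
                row[self.column] = max(self.a, min(self.b, x))  # (b) closest valid value
            else:
                raise Degenerate(self.column)      # (c) discard the observation
            return row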

Table 7.2: Range check preprocessing task.

Range check
Goal        Check whether the values are within the expected range.
Fitting     None. User must declare the valid range of values.
Adjustment  Training set is adjusted sample by sample, independently. If appropriate, degenerated samples are removed.
Applying    Preprocessor checks whether the value 𝑥 of a variable is within the range [𝑎, 𝑏]. If not, it replaces 𝑥 with: (a) missing value ?, (b) the closest valid value max(𝑎, min(𝑏, 𝑥)), or (c) degenerates (discards the observation).

Another common problem with inconsistent information is that the same category might be represented by different strings. This is usually fixed by creating a dictionary that maps the different names to a single one, standardizing lower or upper case, removing special characters, or using more advanced fuzzy matching techniques — see table 7.3.

Table 7.3: Category standardization preprocessing task.

Category standardization
Goal        Create a dictionary and/or function to map different names to a single one.
Fitting     None. User must declare the mapping.
Adjustment  Training set is adjusted sample by sample, independently.
Applying    Preprocessor replaces the categorical variable 𝑥 with the mapped value 𝑓(𝑥) that implements case standardization, special character removal, and/or dictionary fuzzy matching.

Note that these technique parameters are not fitted from the data,
but rather are fixed from the problem definition. As a result, they could

be done in the data handling phase. The reason we put them here is that
the new data in production usually come with the same issues. Having
the fixes programmed into the preprocessor makes it easier to guarantee
that the model will behave as expected in production.

7.2.2 Outlier detection


Outliers are observations that are significantly different from the other
observations. They can be caused by errors or by the presence of differ-
ent phenomena mixed in the data collection process. In both cases, it is
important to deal with outliers before modeling.
The standard way to deal with outliers is to remove them from the
dataset. Assuming that the errors or the out-of-distribution data appear
randomly and rarely, this is a good strategy.
Another approach is dealing with each variable independently. This
way, one can replace the outlier value with missing data. There are many
ways to detect outlier values, but the simplest one is probably a heuristic
based on the interquartile range (IQR).
Let 𝑄1 and 𝑄3 be the first and the third quartiles of the values in a
variable, respectively. The IQR is defined as 𝑄3 − 𝑄1 . The values that
are less than 𝑄1 − 1.5 IQR or greater than 𝑄3 + 1.5 IQR are considered
outliers. See table 7.4.

Table 7.4: Outlier detection using the interquartile range.

Outlier detection using the IQR
Goal        Detect outliers using the IQR.
Fitting     Store the values of 𝑄1 and 𝑄3 for each variable.
Adjustment  Training set is adjusted sample by sample, independently.
Applying    Preprocessor replaces the outlier values with missing data.
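A sketch of the IQR rule with the fit and apply steps separated is given below (Python; NaN plays the role of the missing value, and the quartiles are fitted on the training column only).

    import numpy as np

    # IQR-based outlier replacement (table 7.4): quartiles are fitted on the
    # training column and reused unchanged on new data.
    class IQROutlierToMissing:
        def __init__(self, k=1.5):
            self.k = k
        def fit(self, values):
            q1, q3 = np.nanpercentile(values, [25, 75])
            iqr = q3 - q1
            self.low, self.high = q1 - self.k * iqr, q3 + self.k * iqr
            return self
        def apply(self, x):
            # Replace outlier values with a missing marker (here, NaN).
            return x if self.low <= x <= self.high else float("nan")

    # Example: fit on training values, then apply to new samples.
    train = np.array([1.0, 1.2, 0.9, 1.1, 15.0, 1.05])
    pre = IQROutlierToMissing().fit(train)
    print(pre.apply(1.0), pre.apply(15.0))   # 1.0 nan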

More sophisticated methods can be used to detect samples that are


outliers, such as using the definition of an outlier in the DBSCAN2. But,
this is not enough to fit the parameters of the preprocessor. The reason
2M. Ester et al. (1996). “A density-based algorithm for discovering clusters in large
spatial databases with noise”. In: kdd. Vol. 96. 34, pp. 226–231.

is that descriptive methods like DBSCAN — in this case, a method for


clustering — do not generalize to new data. I suggest using methods like
One-Class SVM3 to fit the parameters of the preprocessor that detects
outliers. Thus, any new data point can be classified as an outlier or not.
Like filtering operations in the pipeline, the developer must specify
a default behavior for the model when an outlier sample is detected in
production. See table 7.5.

Table 7.5: Task of filtering outliers.

Outlier removal
Goal        Remove the observations that are outliers.
Fitting     Parameters of the outlier classifier.
Adjustment  Training set is adjusted sample by sample, independently, removing degenerated samples.
Applying    Preprocessor degenerates if the sample is classified as an outlier and does nothing otherwise.

7.2.3 Treating missing data


Since most models cannot handle missing data, it is crucial to deal with
it in the data preprocessing.
There are four main strategies to deal with missing data:
• Remove the observations (rows) with missing data;
• Remove the variables (columns) with missing data;
• Just impute the missing data;
• Use an indicator variable to mark the missing data and impute it.
Removing rows and columns are commonly used when the num-
ber of missing data is small compared to the total number of rows or
columns. However, be aware that removing rows “on demand” can ar-
tificially change the data distribution, especially when the missing data
3B. Schölkopf et al. (2001). “Estimating the support of a high-dimensional distribu-
tion”. In: Neural computation 13.7, pp. 1443–1471.

is not missing at random. Row removal suffers from the same problem
as any filtering operations (degeneration) in the preprocessing step; the
developer must specify a default behavior for the model when a row is
discarded in production. See table 7.6.

Table 7.6: Task of filtering rows based on missing data.

Row removal based on missing data
Goal        Remove the observations with missing data in any (or some) variables.
Fitting     None. Variables to look for missing data are declared beforehand.
Adjustment  Training set is adjusted sample by sample, independently, removing degenerated samples.
Applying    Preprocessor degenerates over the rows with missing data in the specified variables.

In the case of column removal, the preprocessor just learns to drop


the columns that have missing data during fitting. Beware that valuable
information might be lost when removing columns for all the samples.

Table 7.7: Task of dropping columns based on missing data.

Column removal based on missing data
Goal        Remove the variables with missing data.
Fitting     All variables with missing data in the training set are marked to be removed.
Adjustment  Marked columns are dropped from the training set.
Applying    Preprocessor drops the columns chosen during fitting.

Imputing the missing data is usually done by replacing the missing


values with some statistic of the available values in the column, such as

the mean, the median, or the mode4. This is a simple and effective strat-
egy, but it can introduce bias in the data, especially when the number
of samples with missing data is large. See table 7.8.

Table 7.8: Task of imputing missing data.

Imputation of missing data
Goal        Replace the missing data with a statistic of the available values.
Fitting     The statistic is calculated from the available data in the training set.
Adjustment  Training set is adjusted sample by sample, independently.
Applying    Preprocessor replaces the missing values with the chosen statistic. If an indicator variable is required, it is created and filled with the logical value: missing or not missing.

Just imputing data is not suitable when one is not sure whether the
missing data is missing because of a systematic error or phenomenon. A
model can learn the effect of the underlying reason for missingness for
the predictive task. In that case, creating an indicator variable is a good
strategy. This is done by creating a new column that contains a logical
value indicating whether the data is missing or not5.
Many times the indicator variable is already present in the data. For
instance, in a dataset that contains information about pregnancy, let us
say the number of days since the last pregnancy. This information will
certainly be missing if sex is male or the number of children is zero. In
this case, no new indicator variable is needed. See table 7.8.
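A minimal sketch of mean imputation with an indicator variable follows (Python; NaN marks missing values, and the statistic is fitted on the training column only).

    import numpy as np

    # Mean imputation with an indicator column (table 7.8): the statistic is
    # fitted on the training data; each new sample is adjusted independently.
    class ImputeWithIndicator:
        def fit(self, values):
            self.mean = np.nanmean(values)
            return self
        def apply(self, x):
            missing = np.isnan(x)
            return (self.mean if missing else x), int(missing)

    train = np.array([2.0, np.nan, 4.0, 6.0])
    imp = ImputeWithIndicator().fit(train)
    print(imp.apply(5.0))           # (5.0, 0)
    print(imp.apply(float("nan")))  # (4.0, 1)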

4More sophisticated methods can be used, such as the k-nearest neighbors algorithm,
for example, consult O. Troyanskaya et al. (June 2001). “Missing value estimation methods
for DNA microarrays”. In: Bioinformatics 17.6, pp. 520–525. issn: 1367-4803. doi: 10.
1093/bioinformatics/17.6.520.
5Some kind of imputation is still needed, but we expect the model to deal better with
it since it can decide using both the indicator and the original variable.

7.3 Data sampling


Once data is cleaned, the next step is (typically) to sample the data. Sam-
pling is the process of selecting a random subset of the data or creating
variations of the original training set.
There are three main tasks that sample the data: subsampling, scope
filtering, and class balancing.

7.3.1 Random sampling


Some machine learning methods are computationally expensive, and a
smaller dataset might be enough to solve the problem. Random sam-
pling is simply done by selecting a random subset of the training data
with a user-defined size.
However, note that the preprocessor for this task must never do any-
thing with the new data (or the test set we discuss in chapter 8). See
table 7.9.

Table 7.9: Task of random sampling.

Random sampling
Goal        Select a random subset of the training data.
Fitting     None. User must declare the size of the sample.
Adjustment  Rows of the training set are randomly chosen.
Applying    Pass-through: preprocessor does nothing with the new data.

7.3.2 Scope filtering


Scope filtering is the process of reducing the scope of the phenomenon
we want to model. Like the filtering operation in the data handling pipe-
line (consult section 5.3.5), the data scientists choose a set of predefined
rules to filter the data.
Unlike outlier detection, we assume that the rule is fixed and known
beforehand. The preprocessor degenerates over the samples that do not
satisfy the rule. A summary of the task is presented in table 7.10.

Table 7.10: Task of filtering the scope of the data.

Scope filtering
Goal        Remove the observations that do not satisfy a predefined rule.
Fitting     None. User must declare the rule.
Adjustment  Training set is adjusted sample by sample, independently, removing degenerated samples.
Applying    Preprocessor degenerates over the samples that do not satisfy the rule.

An interesting variation is the model trees6. They are shallow deci-


sion trees that are used to filter the data. At each leaf, a different model
is trained with the data that satisfies the rules that reach the leaf. This
is a good strategy when the phenomenon is complex and can be divided
into simpler subproblems. In this case, the preprocessor does not de-
generate over the samples, but rather the preprocessing chain branches
into different models (and potentially other preprocessing steps).

7.3.3 Class balancing


Some data classification methods are heavily affected by the number of
observations in each class. This is especially true for methods that learn
the class priors directly from the data, like the naïve Bayes classifier.
Two strategies are often used to balance the classes: oversampling
and undersampling. The former is done by creating synthetic observa-
tions of the minority class. The latter is done by removing observations
of the majority class.
Undersampling can be done by removing observations of the ma-
jority class randomly (similarly to random sampling, section 7.3.1). On
the other hand, oversampling can be done by creating synthetic observa-
tions of the minority class. The most common method is resampling7,
which selects a random subset of the data with replacement. A draw-

6F. Stulp and O. Sigaud (2015). “Many regression algorithms, one unified model: A
review”. In: Neural Networks 69, pp. 60–79. issn: 0893-6080. doi: https://doi.org/10.
1016/j.neunet.2015.05.005.
7Sometimes called bootstrapping.

back of this method is that it produces repeated observations that con-


tain no new information.
In any case, the preprocessor for this task must never do anything
with the new data (or the test set we discuss in chapter 8). See table 7.11.

Table 7.11: Task of class balancing.

Class balancing
Goal        Balance the number of observations in each class.
Fitting     None. User must declare the number of samples in each class.
Adjustment  Rows of the training set are randomly chosen.
Applying    Pass-through: preprocessor does nothing with the new data.

More advanced sampling methods exist. For instance, the SMOTE


algorithm8 creates synthetic observations of the minority class without
repeating the same observations.
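
As a rough sketch (Python with NumPy; this implements the random over- and undersampling by resampling described above, not the SMOTE algorithm), class balancing can be written as:

import numpy as np

def balance_classes(X, y, n_per_class, seed=0):
    """Return a training set (NumPy arrays) with n_per_class rows per class."""
    rng = np.random.default_rng(seed)
    Xb, yb = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        # Oversample with replacement if the class is too small,
        # undersample without replacement otherwise.
        chosen = rng.choice(idx, size=n_per_class, replace=len(idx) < n_per_class)
        Xb.append(X[chosen])
        yb.append(y[chosen])
    return np.concatenate(Xb), np.concatenate(yb)

Note that, as with random sampling, the preprocessor is a pass-through for new data; only the training set is adjusted.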

7.4 Data transformation


Another important task in data handling is data transformation. This is
the process of adjusting the types of the data and the choice of variables
to make it suitable for modeling.
At this point, the data format is acceptable, i.e., each observation is
in the correct observational unit, there are no missing values, and the
sampling is representative of the phenomenon of interest. Now, we can
perform a series of operations to make the column’s types and values
suitable for modeling. The reason for this is that most machine learn-
ing methods require the input variables to follow some restrictions. For
instance, some methods require the input variables to be real numbers,
others require the input variables to be in a specific range, etc.

8N. V. Chawla et al. (2002). “SMOTE: synthetic minority over-sampling technique”.


In: Journal of artificial intelligence research 16, pp. 321–357.

7.4.1 Type conversion

Type conversion is the process of changing the type of the values in the
columns. We do so to make the input variables compatible with the
machine learning methods we will use.
The most common type conversion is the conversion from categori-
cal to numerical values. Ideally, the possible values of a categorical vari-
able are known beforehand. For instance, given the values 𝑥 ∈ {𝑎, 𝑏, 𝑐}
in a column, there are two main ways to convert them to numerical val-
ues: label encoding and one-hot encoding. If there is a natural order
𝑎 < 𝑏 < 𝑐, label encoding is usually sufficient. Otherwise, one-hot
encoding can be used.
Label encoding is the process of replacing the values 𝑥 ∈ {𝑎, 𝑏, 𝑐}
with the values 𝑥′ ∈ {1, 2, 3}, where 𝑥′ = 1 if 𝑥 = 𝑎, 𝑥′ = 2 if 𝑥 = 𝑏, and
𝑥′ = 3 if 𝑥 = 𝑐. Other numerical values can be assigned depending on
the specific problem.
One-hot encoding is the process of creating a new column for each
possible value of the categorical variable. The new column is filled with
the logical value 1 if the value is present and 0 otherwise.
However, in the second case, the number of categories might be
too large or might not be known beforehand. So, the preprocessing
step must identify the unique values in the column and create the new
columns accordingly. It is common to group the less frequent values
into a single column, called the other column. See table 7.12.
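
A minimal sketch of such an encoder (plain Python; the class and the frequency threshold for the other column are illustrative assumptions, not the book's implementation):

from collections import Counter

class OneHotEncoder:
    def __init__(self, min_count=1):
        self.min_count = min_count            # values rarer than this go to "other"
        self.categories = []

    def fit(self, values):
        # Fitting: store the unique values that are frequent enough.
        counts = Counter(values)
        self.categories = sorted(v for v, c in counts.items() if c >= self.min_count)

    def apply(self, value):
        # Applying: one logical column per stored category, plus the "other" column.
        row = {f"is_{c}": int(value == c) for c in self.categories}
        row["is_other"] = int(value not in self.categories)
        return row

enc = OneHotEncoder(min_count=2)
enc.fit(["a", "a", "b", "b", "c"])            # "c" is grouped into "other"
enc.apply("c")                                # {'is_a': 0, 'is_b': 0, 'is_other': 1}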
The other direction is also common: converting numerical values to
categorical values. This is usually done by binning the numerical vari-
able, either by frequency or by range. In both cases, the user declares the
number of bins. Binning by frequency is done by finding the percentiles
of the values and creating the bins accordingly. Binning by range is done
by dividing the range of the values into equal parts, given the minimum
and maximum values. See table 7.13.
Another common task, although it receives less attention, is the con-
version of dates (or other interval variables) to numerical values. Inter-
val variables, like dates, have almost no information in their absolute
values. However, the difference between two dates can be very informa-
tive. For example, the difference between the date of birth and the date
of the last purchase becomes the age of the customer.

Table 7.12: One-hot encoding preprocessing task.

One-hot encoding
Goal Create a new column for each possible
value of the categorical variable.
Fitting Store the unique values of the categorical
variable. If appropriate, indicate the spe-
cial category other.
Adjustment Training set is adjusted sample by sam-
ple, independently.
Applying Preprocessor creates a new column for
each possible value of the categorical
variable. The new column is filled
with the logical value 1 if the old value
matches the new column and 0 other-
wise. If the value is new or among the
less frequent values, it is assigned to the
other column.

Table 7.13: Binning numerical values preprocessing task.

Binning numerical values


Goal Create a new categorical column from a
numerical one.
Fitting Store the range of each bin.
Adjustment Training set is adjusted sample by sam-
ple, independently.
Applying Preprocessor converts each numerical
value to a categorical value by checking
which bin the value falls into.
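
As a sketch (Python with NumPy; the function names are mine, not the book's), binning by frequency stores percentile boundaries during fitting, while binning by range stores equally spaced boundaries:

import numpy as np

def fit_bins_by_frequency(values, n_bins):
    # Fitting: boundaries are the percentiles of the training values.
    return np.percentile(values, np.linspace(0, 100, n_bins + 1))

def fit_bins_by_range(values, n_bins):
    # Fitting: boundaries divide [min, max] into equal parts.
    return np.linspace(np.min(values), np.max(values), n_bins + 1)

def apply_bins(x, boundaries):
    # Applying: return the bin (1..n_bins) that the value falls into.
    pos = np.searchsorted(boundaries, x, side="right")
    return int(np.clip(pos, 1, len(boundaries) - 1))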

7.4.2 Normalization
Normalization is the process of scaling the values in the columns. This
is usually done to keep data within a specific range or to make different
variables comparable. For instance, some machine learning methods
require the input variables to be in the range [0, 1].
The most common normalization methods are standardization and
rescaling. The former is done by subtracting the mean and dividing by
the standard deviation of the values in the column. The latter is per-
formed so that the values are in a specific range, usually [0, 1] or [−1, 1].
Standardization works well when the values in the column are nor-
mally distributed. It not only keeps the values in an expected range but
also makes the data distribution comparable with other variables. Given
that 𝜇 is the mean and 𝜎 is the standard deviation of the values in the
column, the standardization is done by

    𝑥′ = (𝑥 − 𝜇) / 𝜎.    (7.1)
See table 7.14.

Table 7.14: Standardization preprocessing task.

Standardization
Goal Scale the values in a column.
Fitting Store the statistics of the variable: the
mean and the standard deviation.
Adjustment Training set is adjusted sample by sam-
ple, independently.
Applying Preprocessor scales the values according
to eq. (7.1).

In the case of rescaling, during production, the preprocessor usually clamps9 the values after rescaling. This is done to avoid the model making predictions that are out of the range of the training data. Given that we want to rescale the values in the column to the range [𝑎, 𝑏], and that 𝑥min and 𝑥max are the minimum and maximum values in the column,

9The operation clamp(𝑥; 𝑎, 𝑏) where 𝑎 and 𝑏 are the lower and upper bounds, respec-
tively, is defined as max(𝑎, min(𝑏, 𝑥)).

the rescaling is done by

    𝑥′ = 𝑎 + (𝑏 − 𝑎) (𝑥 − 𝑥min) / (𝑥max − 𝑥min).    (7.2)

See table 7.15.

Table 7.15: Rescaling preprocessing task.

Rescaling
Goal Rescale the values in a column.
Fitting Store the appropriate statistics of the vari-
able: the minimum and the maximum
values.
Adjustment Training set is adjusted sample by sam-
ple, independently.
Applying Preprocessor scales the values according
to eq. (7.2).

Related to normalization is the log transformation, which applies the logarithm to the values in the column. This is usually done to make
the data distribution more symmetric over the mean or to reduce the
effect of outliers.
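
A compact sketch of the two normalizers above (Python with NumPy; the statistics are fitted on the training column only, and the rescaler clamps at apply time as described; class names are illustrative):

import numpy as np

class Standardizer:
    def fit(self, col):
        self.mu, self.sigma = np.mean(col), np.std(col)    # statistics of eq. (7.1)

    def apply(self, x):
        return (x - self.mu) / self.sigma

class Rescaler:
    def __init__(self, a=0.0, b=1.0):
        self.a, self.b = a, b

    def fit(self, col):
        self.xmin, self.xmax = np.min(col), np.max(col)    # statistics of eq. (7.2)

    def apply(self, x):
        scaled = self.a + (self.b - self.a) * (x - self.xmin) / (self.xmax - self.xmin)
        return np.clip(scaled, self.a, self.b)             # clamp(x; a, b)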

7.4.3 Dimensionality reduction


Dimensionality reduction is the process of reducing the number of vari-
ables in the data. It can identify irrelevant variables and reduce the com-
plexity of the model (since there are fewer variables to deal with).
There are two main types of dimensionality reduction algorithms:
feature selection and feature extraction. The former selects a subset of
the existing variables that leads to the best models. The latter creates
new variables that are combinations of the original ones.
One example of feature selection is ranking the variables by their
mutual information with the target variable and selecting the top 𝑘 vari-
ables. Mutual information is a measure of the amount of information
that one variable gives about another. So, it is expected that variables
with high mutual information with the target variable are more impor-
tant for the model.

Feature extraction uses either linear methods, such as principal component analysis (PCA), or non-linear methods, such as autoencoders.
These methods are able to compress the information in the training data
into a smaller number of variables. Thus, the model can learn the solu-
tion in a lower-dimensional space. A drawback of this method is that
the new variables are hard to interpret, since they are combinations of
the original variables.
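
As a hedged sketch (assuming scikit-learn is available; one possible realization of the two families above, not the book's prescription), mutual-information ranking and PCA could be used as follows:

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.decomposition import PCA

def select_top_k(X_train, y_train, k):
    # Feature selection: keep the k variables with the highest mutual information.
    mi = mutual_info_classif(X_train, y_train)
    return np.argsort(mi)[::-1][:k]           # indices of the selected columns

pca = PCA(n_components=5)
# Feature extraction: fit on the training data only, then apply to new data.
# X_train_small = pca.fit_transform(X_train)
# X_new_small = pca.transform(X_new)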

7.4.4 Data enhancement


The “opposite” of dimensionality reduction is data enhancement. This
is the process of bringing to the dataset external information that com-
plements the existing data. For example, imagine that in the tidy data
we have a column with the zip code of the customers. We can use this
information to join (in this case, always a left join) a dataset with social
and economic information about the region of the zip code.
The preprocessor, then, stores the external dataset and the column
to join the data. During production, it enhances any new observation
with the external information. See table 7.16.

Table 7.16: Data enhancement preprocessing task.

Data enhancement
Goal Enhance the dataset with external infor-
mation.
Fitting Store the external dataset and the column
to join.
Adjustment Training set is left joined with the exter-
nal dataset. Because of the properties of
the left join, the new dataset has the same
number of rows as the original dataset,
and it is equivalent to enhancing each
row independently.
Applying Preprocessor enhances the new data with
the external information.
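
For instance (a sketch assuming pandas and an external table keyed by zip code; the column names are hypothetical), the fitted preprocessor just keeps the external table and performs a left join at apply time:

import pandas as pd

class ZipCodeEnhancer:
    def fit(self, external: pd.DataFrame, key: str = "zip_code"):
        # Fitting: store the external dataset and the column to join.
        self.external, self.key = external, key

    def apply(self, data: pd.DataFrame) -> pd.DataFrame:
        # Applying: always a left join, so no original rows are lost.
        return data.merge(self.external, on=self.key, how="left")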

7.4.5 Comments on unstructured data


Any unstructured data can be transformed into structured data. We can
see this task as a data preprocessing task. Techniques like bag of words,
word embeddings, and signal (or image) processing can be seen as pre-
processing techniques that transform unstructured data into structured
data, which is suitable for modeling.
Also, modern machine learning methods, like convolutional neural
networks (CNNs), are simply models that learn both the preprocessing
and the model at the same time. This is done by using convolutional
layers that learn the features of the data. In digital signal processing, this
is called feature extraction. The difference there is that the convolution
filters are handcrafted, while in CNNs they are learned from the data.
The study of unstructured data is a vast field and is out of the scope
of this book. I recommend Jurafsky and Martin (2008)10 for a complete
introduction to Natural Language Processing and Szeliski (2022)11 for a
comprehensive introduction to Computer Vision.

10D. Jurafsky and J. H. Martin (2008). Speech and Language Processing. An Introduc-
tion to Natural Language Processing, Computational Linguistics, and Speech Recognition.
2nd ed. Hoboken, NJ, USA: Prentice Hall. A new edition is under preparation and it is
available for free: D. Jurafsky and J. H. Martin (2024). Speech and Language Processing:
An Introduction to Natural Language Processing, Computational Linguistics, and Speech
Recognition with Language Models. 3rd ed. Online manuscript released August 20, 2024.
url: https://web.stanford.edu/~jurafsky/slp3/.
11R. Szeliski (2022). Computer vision. Algorithms and applications. 2nd ed. Springer
Nature. url: https://szeliski.org/Book/.
8 Solution validation
All models are wrong, but some are useful.
— George E. P. Box, Robustness in Statistics

Once we have defined what an inductive problem is and the means to solve it, we need to think about how to validate the solution.
In this chapter, we present the experimental planning that one can
use in the data-driven parts of a data science project. Experimental plan-
ning in the context of data science involves designing and organizing ex-
periments to gather performance data systematically in order to reach
specific goals or test hypotheses.
The reason we need to plan experiments is that data science is ex-
perimental, i.e., we usually lack a theoretical model that can predict the
outcome of a given algorithm on a given dataset. This is why we need to
run experiments to gather performance data and make inferences from
it. The stochastic nature of data and of the learning process makes it
more difficult to predict the outcome of a given algorithm on a given
dataset. Robust experimental planning is essential to ensure that the re-
sults of the experiments are reliable and can be used to make decisions.
Moreover, we need to understand the main metrics that are used to
evaluate the performance of a solution — i.e., the pair preprocessor and
model. Each learning task has different metrics, and the goals of the
project will determine which metrics are more important.
There is not a single way to plan experiments, but there are some
common steps that can be followed to design a good experimental plan.
In this chapter, we present a framework for experimental planning that
can be used in most data science projects for inductive problems.


Chapter remarks

Contents
8.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1.1 Binary classification evaluation . . . . . . . . . . 164
8.1.2 Regression estimation evaluation . . . . . . . . . 169
8.1.3 Probabilistic classification evaluation . . . . . . . 172
8.2 An experimental plan for data science . . . . . . . . . . . 175
8.2.1 Sampling strategy . . . . . . . . . . . . . . . . . . 176
8.2.2 Collecting evidence . . . . . . . . . . . . . . . . . 177
8.2.3 Estimating expected performance . . . . . . . . . 180
8.2.4 Comparing strategies . . . . . . . . . . . . . . . . 182
8.2.5 About nesting experiments . . . . . . . . . . . . . 183
8.3 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . 184

Context

• Before putting a solution into production, we need to validate it.


• The validation process is experimental.

Objectives

• Understand the importance of experimental planning.


• Learn the main evaluation metrics used in predictive tasks.
• Learn how to design an experimental plan to validate a solution.

Takeaways

• Evaluation metrics should be chosen according to the goals of the


project.
• The experimental plan should be designed to gather performance
data systematically.
• A hypothesis test can be used to validate the results of the experi-
ments.

8.1 Evaluation
One fundamental step in the validation of a data-driven solution for a
task is the evaluation of the pair preprocessor and model. This chapter
presents strategies to measure performance of classifiers and regressors,
and how to interpret the results.
We consider the following setup. Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table that
represents the data in the desired observational unit — as defined in
section 5.1. Without loss of generality — as the keys are not used in the
modeling process —, we can consider 𝐾 = {1, 2, … } such that card(𝑖) =
1, if 𝑖 ∈ {1, … , 𝑛}, and card(𝑖) = 0, otherwise. That means that every
row 𝑟 ∈ {1, … , 𝑛} is present in the table.
The table is split into two sets: a training set, given by indices (or keys) ℐtraining ⊆ {1, … , 𝑛}, and a test set, given by indices ℐtest ⊆ {1, … , 𝑛}, such that
ℐtraining ∩ ℐtest = ∅
and
ℐtraining ∪ ℐtest = {1, … , 𝑛}.
The bridge between the table format (definition 5.1) and the data
format used in the learning process (as described in section 6.2) is ex-
plained in the following. We say that the pair (x𝑖 , 𝑦 𝑖 ) contains the fea-
ture vector x𝑖 and the target value 𝑦 𝑖 of the sample with key 𝑖 in table 𝑇.
Mathematically, given target variable ℎ ∈ 𝐻, we have that 𝑦 𝑖 = 𝑐(𝑖, ℎ)
and x𝑖 is the tuple
(𝑐(𝑖, ℎ′ ) ∶ ℎ′ ∈ 𝐻 ∖ {ℎ} ).
For evaluation, we consider a data preprocessing technique 𝐹 and a
learning machine 𝑀. The following steps are taken.

Preprocessing Preprocessing technique 𝐹 is applied to the training set 𝑇training = (𝐾, 𝐻, 𝑐training) where

    𝑐training(𝑖, ℎ) = { 𝑐(𝑖, ℎ)  if 𝑖 ∈ ℐtraining,
                      { ()       otherwise.

The result is an adjusted training set 𝑇training and a fitted preprocessor 𝑓(x; 𝜙) ≡ 𝑓𝜙(x), where x ∈ 𝒳 for some space 𝒳 that does not include (or does not modify) the target variable — consult section 7.1.1. Note that, by definition, the size of the adjusted training set can be different from the original due to sampling or filtering. The hard requirement is that the target variable ℎ is not changed.

Learning The learning machine 𝑀 is trained on the adjusted training set 𝐷training = {(x′𝑖, 𝑦′𝑖)}, where the pairs (x′𝑖, 𝑦′𝑖) come from the adjusted table 𝑇training. The result is a model 𝑓(x′; 𝜃) ≡ 𝑓𝜃(x′) — consult chapter 6.

Transformation The preprocessor 𝑓𝜙 is applied to the test set 𝑇test = (𝐾, 𝐻, 𝑐test) where

    𝑐test(𝑖, ℎ) = { 𝑐(𝑖, ℎ)  if 𝑖 ∈ ℐtest,
                  { ()       otherwise.

The result is a preprocessed test set 𝑇test from which we can obtain the set 𝐷test = {(x′𝑖, 𝑦𝑖) ∶ 𝑖 ∈ ℐtest} such that x′𝑖 = 𝑓𝜙(x𝑖). Note that, to avoid data leakage and other issues, the preprocessor has no access to the target values 𝑦𝑖 (even if the adjusted training set uses the label somehow).

Prediction The model 𝑓𝜃 is used to make predictions on the preprocessed test set 𝐷test to obtain predicted values 𝑦̂𝑖 = 𝑓𝜃(x′𝑖) for all 𝑖 ∈ ℐtest.

Evaluation By comparing 𝑦̂𝑖 with 𝑦𝑖 for all 𝑖 ∈ ℐtest, we evaluate how well the choice of 𝜙 (parameters of the preprocessor) and 𝜃 (parameters of the model) is.

8.1.1 Binary classification evaluation


In order to assess the quality of a solution for a binary classification task,
we need to know which samples in the test set were classified into which
classes. This information is summarized in the confusion matrix, which
is the basis for performance metrics in classification tasks.

Confusion matrix
The confusion matrix is a table where the rows represent the true classes
and the columns represent the predicted classes. The diagonal of the
matrix represents the correct classifications, while the off-diagonal ele-
ments represent errors. For binary classification, the confusion matrix
is given by
                    Predicted
                     1     0
    Expected   1   ( TP    FN )
               0   ( FP    TN )

where TP is the number of true positives

|{𝑖 ∈ ℐtest ∶ 𝑦 𝑖 = 1 ∧ 𝑦 ̂𝑖 = 1}|,

TN is the number of true negatives

|{𝑖 ∈ ℐtest ∶ 𝑦 𝑖 = 0 ∧ 𝑦 ̂𝑖 = 0}|,

FN is the number of false negatives

|{𝑖 ∈ ℐtest ∶ 𝑦 𝑖 = 1 ∧ 𝑦 ̂𝑖 = 0}|,

and FP is the number of false positives

|{𝑖 ∈ ℐtest ∶ 𝑦 𝑖 = 0 ∧ 𝑦 ̂𝑖 = 1}|.

Performance metrics
From the confusion matrix, we can derive several performance metrics.
Each of them focuses on different aspects of the classification task, and
the choice of the metric depends on the problem at hand. Each metric
prioritizes different types of errors and yields a value between 0 and 1,
where 1 is the best possible value.

Accuracy is the proportion of correct predictions over the total number of samples in the test set, given by

    Accuracy = (TP + TN) / (TP + TN + FP + FN).
This metric is simple and easy to interpret: a classifier with an accuracy
of 1 is perfect, while a classifier with an accuracy of 0.5 misses half of
the predictions. Accuracy assigns the same weight to any kind of error
— i.e., false positives and false negatives. As a result, if the proportion
of positive and negative samples is imbalanced, the value of accuracy
may become misleading. Let 𝜋 be the ratio of positive samples in the
test set — consequently, 1 − 𝜋 is the ratio of negative samples —, then
a classifier that correctly predicts all positive samples and none of the
negative samples will have an accuracy of 𝜋. If 𝜋 is close to 1, the classi-
fier will have a high value of accuracy even if it is not good at predicting
the negative class.
This issue is not impeditive for the usage of accuracy in imbalanced
datasets, but one needs to be aware that accuracy values lower than
max(𝜋, 1 − 𝜋) are not better than guessing.

Balanced accuracy aims to solve this interpretation issue of the accuracy. It is the average of the true positive rate (TPR) and the true negative rate (TNR), given by

    Balanced Accuracy = (TPR + TNR) / 2,

where

    TPR = TP / (TP + FN)

and

    TNR = TN / (TN + FP).
Each term penalizes a different type of error independently: TPR penal-
izes false negatives, while TNR penalizes false positives. Balanced ac-
curacy is useful when the cost of errors on the minority class is higher
than the cost of errors on the majority class. This way, any value greater
than 0.5 is better than random guessing.
A limitation of the balanced accuracy is that it “automatically” as-
signs the weight of errors based on the class proportion, which may not
be the best choice for the problem. Other metrics focus only on one of
the classes and are more flexible to adjust the weight of errors.

Precision is an asymmetrical metric that focuses on the positive class. It is the proportion of true positive predictions over the total number of samples predicted as positive, given by

    Precision = TP / (TP + FP).
This metric is useful when the cost of false alarms is high, as it quanti-
fies the ability of the classifier to avoid false positives. For example, in
a medical diagnosis task, precision is important to avoid unnecessary
treatments (false positive diagnoses). Semantically, precision measures
how confident we can be that a positive prediction is actually positive.
Note that it measures nothing about the ability of the classifier in terms
of the negative predictions.

Recall is another asymmetrical metric that also focuses on the positive class. It is the proportion of true positive predictions over the total number of samples that are actually positive, given by

    Recall = TPR = TP / (TP + FN).

This metric is useful when the cost of missing a positive sample is high,
as it quantifies the ability of the classifier to avoid false negatives. It can
also be interpreted as the “completeness” of the classifier: how many
positive samples were correctly retrieved. For example, in a medical
diagnosis task, recall is important to avoid missing a diagnosis.

F-score is a way of balancing both kinds of errors, false positives and false negatives, while maintaining the focus on the positive class. It is the weighted harmonic mean of precision and recall given by

    F-score(𝛽) = F𝛽-score = ((1 + 𝛽²) ⋅ Precision ⋅ Recall) / (𝛽² ⋅ Precision + Recall),

where 𝛽 > 0 is a parameter that controls the relative weight of recall in the metric. The most common value for 𝛽 is 1, which gives the F1-score. Higher values of 𝛽 give more weight to recall (𝛽 > 1), while lower values give more weight to precision (0 < 𝛽 < 1).

Specificity goes in the opposite direction of recall, focusing on the negative class. It is the proportion of true negative predictions over the total number of samples that are actually negative, given by

    Specificity = TNR = TN / (TN + FP).
This metric is very common in the medical literature, but less common
in other contexts. The probable reason is that it is easier to interpret the
metrics that focus on the positive class, as the negative class is usually
the majority class — and, thus, less interesting.
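
As a small sketch (plain Python, not tied to any particular library), all of these metrics follow directly from the four confusion-matrix counts:

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Divisions by zero can occur for degenerate classifiers (see table 8.2).
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "F1": 2 * precision * recall / (precision + recall),
    }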

Interpretation of metrics
Table 8.1 summarizes the properties of the classification performance
metrics. Accuracy and balanced accuracy are good metrics when no
particular class is more important than the other. Remember, however,
that balanced accuracy gives more weight to errors on the minority class.
Precision and recall are useful to evaluate the performance of the solu-
tion in terms of the positive class. They are complementary metrics, and
looking at only one of them may give a biased view of the performance
— more on that below. The F-score is a way to balance precision and
recall with a controllable parameter.

Table 8.1: Summary of the properties of data classification performance metrics.

Metric Focus Interpretation


Accuracy Symmetrical Penalizes all
Balanced Accuracy Symmetrical Penalizes all (weighted)
Recall (TPR) Positive Penalizes FN
Precision Positive Penalizes FP
F-score Positive Penalizes all (weighted)
Specificity (TNR) Negative Penalizes FP

A common misconception about the asymmetrical metrics (especially precision) is that they are always robust to class imbalance. Ob-
serve table 8.2, which shows the behavior of the classification perfor-
mance metrics for three (useless) classifiers: one that always predicts
the positive class (Guess 1), another that always predicts the negative
class (Guess 0), and a classifier that randomly guesses the class inde-
pendently of the class priors (Random).

Table 8.2: Behavior of classification performance metrics for different classifiers.

Metric Guess 1 Guess 0 Random


Accuracy† 𝜋 1−𝜋 0.5
Balanced Accuracy 0.5 0.5 0.5
Recall (TPR) 1 0 0.5
Precision† 𝜋 0/0 = 0 𝜋
F1-score† 2𝜋/(1+𝜋) 0 2𝜋/(1+2𝜋)
Specificity (TNR) 0 1 0.5

Performance of different classifiers in the example of a dataset with ratio 𝜋 of positive and 1 − 𝜋 of negative samples. Metrics affected by class imbalance are marked with †.

We can see that, as 𝜋 → 1, i.e., the positive class dominates the dataset, guessing the positive class achieves maximum values for metrics like accuracy, precision, and F1-score. Even for random guessing
the class, precision (and F1 -score) is affected by the class imbalance,
yielding 1 (and 2/3) as 𝜋 → 1. As a result, these metrics should be
preferred when the positive class is the minority class, so the results are
not erroneously inflated — and, consequently, mistakenly interpreted
as good. C. K. I. Williams (2021)1 provides an interesting discussion on
that.
Finally, besides accuracy, the other metrics do not behave well when
the evaluation set is too small. In this case, the metrics may be too sen-
sitive to the particular samples in the test set or may not be able to be
calculated at all.

8.1.2 Regression estimation evaluation


Performance metrics for regression tasks are usually calculated based
on the error (also called residual)

    𝜖𝑖 = 𝑦̂𝑖 − 𝑦𝑖

for all 𝑖 ∈ ℐtest, or a scaled version

    𝜖𝑖^(𝑓) = 𝑓(𝑦̂𝑖) − 𝑓(𝑦𝑖),

for some scaling function 𝑓.

Performance metrics
From the errors, we can calculate several performance metrics that give
us useful information about the behavior of the model. Specifically, we
are interested in understanding what kind of errors the model is making
and how large they are. Unlike classification, the higher the value of the
metric, the worse the model is.

Mean absolute error is probably the simplest performance metric for regression estimation tasks. It is the average of the absolute values of the errors, given by

    MAE = (1/𝑛) ∑_{𝑖=1}^{𝑛} |𝜖𝑖|.

1C. K. I. Williams (Apr. 2021). “The Effect of Class Imbalance on Precision-Recall


Curves”. In: Neural Computation 33.4, pp. 853–857. issn: 0899-7667. doi: 10.1162/neco_
a_01362.

This metric is easy to interpret, is in the same unit as the target vari-
able, and gives an idea of the average error of the model. It ignores the
direction of the errors, so it is not useful to understand if the model is
systematically overestimating or underestimating the target variable.

Mean squared error is the average of the squared residuals, given by

    MSE = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝜖𝑖².

This metric penalizes large errors more than the mean absolute error, as
the squared residuals are summed.

Root mean squared error is the square root of the mean squared
error, given by
RMSE = √MSE.

This metric is in the same unit as the target variable, which makes it
easier to interpret. It keeps the same properties as the mean squared
error, such as penalizing large errors more than the mean absolute error.
Both MAE and RMSE (or MSE) work well for positive and negative
values of the target variable. However, they might be misleading when
the range of the target variable is large.

Mean absolute percentage error is an alternative when the target variable (and the predictions) assume only strictly positive values, i.e., 𝑦𝑖 > 0 and 𝑦̂𝑖 > 0. It is the average of the relative errors, given by

    MAPE = (1/𝑛) ∑_{𝑖=1}^{𝑛} |𝜖𝑖| / 𝑦𝑖.

This metric is useful when the range of the target variable is large, as it
gives an idea of the relative error of the model, not the absolute error.

Mean absolute logarithmic error is an alternative for the MAPE under the same premises of the target values. It aims to reduce the influence of outliers in the error calculation, especially when the target variable prior follows a long-tail distribution — many small values and few large values. Distributions like that are common in practice, e.g., in sales, income, and population data. It is given by

    MALE = (1/𝑛) ∑_{𝑖=1}^{𝑛} |𝜖𝑖^(ln)| = (1/𝑛) ∑_{𝑖=1}^{𝑛} |ln 𝑦̂𝑖 − ln 𝑦𝑖|.
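
As a small sketch (Python with NumPy), the four regression metrics can be computed from the expected and predicted values as:

import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true                                  # residuals
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": np.mean(np.abs(err) / y_true),             # requires y > 0
        "MALE": np.mean(np.abs(np.log(y_pred) - np.log(y_true))),  # requires y, y_hat > 0
    }

# The two rows of table 8.3 as a single test set: a tenfold overestimate
# and a tenfold underestimate.
regression_metrics([10, 10], [100, 1])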

Interpretation of metrics
Note that, unlike the classification performance metrics, the scale of the
regression performance metrics is not bounded between 0 and 1. This
makes it potentially harder to interpret the results, as the values depend
on the scale of the target variable.
Absolute error metrics, like MAE and RMSE, are useful for under-
standing the central tendency of the magnitude of the errors. They are
easy to interpret because they are in the same unit as the target variable.
However, they tend to be less informative when the target variable has
a large range or when the errors are not normally distributed.
In those situations, relative error metrics, like MAPE and MALE, are
more useful. For instance, imagine we are predicting house prices. The
error of $20,000 for a house that costs $100,000 is more significant than
the same error for a house that costs $1,000,000. The absolute error is
the same in both cases, but the relative error is different.
In that example, the MAPE would be 20% for the first house and 2%
for the second house. Note, however, that MAPE punishes overestimat-
ing more than underestimating in multiplicative terms. Consider the
example in table 8.3. In the first row, the prediction is ten times larger
than the actual value, which results in a MAPE of 900%. In the second
row, the prediction is one tenth of the actual value, which results in a
MAPE of 90%.

Table 8.3: Comparison of relative error metrics.

𝑦̂ 𝑦 𝜖 MAPE exp(MALE)
100 10 90 9.0 10
1 10 9 0.9 10

MAPE and MALE for two predictions. The MAPE punishes over-
estimating more than underestimating.

If multiplicative factors of the error are important, one should consider using MALE. Observe that ln(𝑦̂) − ln(𝑦) = ln(𝑦̂/𝑦), which is the logarithm of the ratio of the prediction to the actual value. In the case of the absolute value, we have another interesting property:

    |ln 𝑦̂ − ln 𝑦| = |ln(𝑦̂/𝑦)| = |ln(𝑦/𝑦̂)| = ln max(𝑦̂/𝑦, 𝑦/𝑦̂).

Tofallis (2015)2 discusses some of these advantages. To interpret MALE, we can use the exponential function, which gives us a multiplicative factor of the error. In the example in table 8.3, we have that

    exp(ln max(100/10, 10/100)) = max(100/10, 10/100) = 10.

Finally, for the experimental plan we propose in this book, we should avoid metrics like the coefficient of determination, 𝑅², as we do not make
assumptions about the model — in this case, we do not assume that the
model is linear. Similarly to data classification, we should prefer metrics
that work well with small test sets.

8.1.3 Probabilistic classification evaluation


A particular case of the regression estimation is when we want to esti-
mate the probability3 of a sample belonging to the positive class — i.e.
𝑦 = 1. In this case, the output of the model should be a value in the
interval [0, 1]. We can use a threshold 𝜏 to convert the probabilities into
binary predictions. The default threshold is usually 𝜏 = 0.5 — a sam-
ple is positive if the probability is greater than or equal to 0.5, and it is
negative, otherwise.
However, the threshold can be adjusted to change the trade-off be-
tween recall and specificity. A low threshold, 𝜏 ≈ 0, will increase recall
at the expense of specificity, while a high threshold, 𝜏 ≈ 1, will increase
specificity at the expense of recall.
Thus, any regressor 𝑓𝑅 ∶ 𝒳 → [0, 1] can be converted into a binary classifier 𝑓𝐶 ∶ 𝒳 → {0, 1} by comparing the output with the threshold 𝜏:

    𝑓𝐶(x; 𝜏) = { 1  if 𝑓𝑅(x) ≥ 𝜏,
               { 0  otherwise.

2C. Tofallis (2015). “A better measure of relative prediction accuracy for model selection and model estimation”. In: Journal of the Operational Research Society 66.8, pp. 1352–1362. doi: 10.1057/jors.2014.103.
3Although the term probability is used, the output of the regressor does not need to be a probability in the strict sense. It is a confidence level in the interval [0, 1] that can be interpreted as a probability.
Since the task is still a classification task, one should not use regres-
sion performance metrics. On the other hand, instead of choosing a
particular threshold and measuring the resulting classifier performance,
we can summarize the performance of all possible variations of the clas-
sifiers using appropriate metrics.
Before diving into the metrics, consider the following error metric. Let the false positive rate (FPR) be the proportion of false positive predictions over the total number of samples that are actually negative,

    FPR = FP / (FP + TN).
It is the complement of the specificity, i.e. FPR = 1 − Specificity.
Consider the example in table 8.4 of a given test set and the predic-
tions of a regressor. We can see that a threshold of 0.5 would yield a
classifier that errors in 3 out of 9 samples. We can adjust the threshold
to understand the behavior of the other possible classifiers.

Table 8.4: Illustrative example of probability regressor output.

Expected Predicted
0 0.1
0 0.5
0 0.2
0 0.6
1 0.4
1 0.9
1 0.7
1 0.8
1 0.9

We first sort the samples by the predicted probabilities and then cal-
culate the TPR (recall) and FPR for each threshold. We need to con-
sider only thresholds equal to the predicted values to understand the
variations. In this case, TPR values become the cumulative sum of the
expected outputs divided by the total number of positive samples, and

FPR values become the cumulative sum of the complement of the ex-
pected outputs divided by the total number of negative samples.

Table 8.5: Illustrative example of classifiers derived from different thresholds.

Expected Threshold TPR FPR
- ∞ 0/5 0/4
1 0.9 1/5 0/4
1 0.9 2/5 0/4
1 0.8 3/5 0/4
1 0.7 4/5 0/4
0 0.6 4/5 1/4
0 0.5 4/5 2/4
1 0.4 5/5 2/4
0 0.2 5/5 3/4
0 0.1 5/5 4/4

Performance of different classifiers derived from the regressor output in table 8.4. The thresholds are equal to the predicted values.

Note that, from the ordered list of predictions, we can easily see that
a threshold of 0.7 would yield a classifier that commits only one error. A
way to summarize the performance of all possible classifiers is presented
in the following.

Receiver operating characteristic


The receiver operating characteristic (ROC) curve is a graphical repre-
sentation of the trade-off between TPR and FPR as the threshold 𝜏 is
varied. The ROC curve is obtained by plotting the TPR against the FPR
for all possible thresholds. Figure 8.1 is the ROC curve for the example
in table 8.5.
The ROC curve is useful to explore the trade-off between recall and
specificity. The diagonal line represents a random classifier, and points
above the diagonal are better than random.
The area under the ROC curve (AUC) is an interesting metric of
the performance of the family of classifiers. It ranges between 0 and
1, where 1 is the best possible value. The AUC is scale invariant, which

Figure 8.1: Illustrative example of ROC curve.

[Plot of TPR (vertical axis, 0 to 1) against FPR (horizontal axis, 0 to 1) for the thresholds of table 8.5, with the diagonal reference line.]

ROC curve for the example in table 8.5. The diagonal line represents a random classifier, and points above the diagonal are better than random.

means that it measures how well predictions are ranked, rather than their absolute values. It is also robust to class imbalance, since both recall and specificity are considered. In our example, the AUC is 0.9.
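
A brief sketch (Python with NumPy; the trapezoidal rule is one common way of computing the area, not necessarily the book's choice) that reproduces the TPR/FPR sweep and the AUC for the data in table 8.4:

import numpy as np

def roc_points(y_true, scores):
    """TPR and FPR for thresholds equal to the predicted values, in decreasing order."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    y_sorted = y_true[np.argsort(-scores)]
    pos, neg = y_true.sum(), len(y_true) - y_true.sum()
    tpr = np.concatenate(([0.0], np.cumsum(y_sorted) / pos))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y_sorted) / neg))
    return fpr, tpr

y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
s = [0.1, 0.5, 0.2, 0.6, 0.4, 0.9, 0.7, 0.8, 0.9]
fpr, tpr = roc_points(y, s)
auc = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal rule
print(auc)                                                      # 0.9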

8.2 An experimental plan for data science


Like any other experimental science, data science requires a robust ex-
perimental plan to ensure that evaluation results are reliable and can be
used to make decisions. Failing to make good use of the resources we have at hand — i.e., the limited amount of data — can lead to incorrect conclusions about the performance of a solution.
There are important elements that should be considered when de-
signing an experimental plan. These elements are:

• Hypothesis: The main question that the experiment aims to val-


idate. In this chapter, we address common questions in data sci-
ence projects and how to validate them.

• Data: The dataset that will be used in the experiment. In chap-


ters 2 and 4, we address topics about collecting and organizing
data. In chapter 5, we address topics about preparing the data for
the experiments.

• Solution search algorithm: Techniques that find a solution for


the task. We use the term “search” because the chosen algorithm
aims at optimizing both the parameters of the preprocessing chain
and those of the model. The theoretical basis for these techniques
is in chapters 6 and 7.

• Performance measuring: The metric that will be used to eval-


uate the performance of the model. Refer to section 8.1 for the
main metrics used in binary classification and regression estima-
tion tasks.

A general example of a description of an experimental plan is “What is the probability that technique 𝐴 will find a model that reaches a performance 𝑋 in terms of metric 𝑌 in the real world, given dataset 𝑍 as the training set (assuming 𝑍 is a representative dataset)?”
Another example is “Is technique 𝐴 better than technique 𝐵 for find-
ing a model that predicts the output with 𝐷 as a training set in terms of
metric 𝐸?”
In the next sections, we consider these two cases: estimating expected
performance and comparing algorithms. Before that, we discuss a strat-
egy to make the best use of the finite amount of data we have available.

8.2.1 Sampling strategy


When dealing with a data-driven solution, the available data is a repre-
sentation of the real world. So, we have to make the best use of the data
we have to estimate how well our solution is expected to be in produc-
tion.
As we have seen, the more data we use to search for a solution, the
better the solution is expected to be. Thus, we use the whole dataset for
deploying a solution. But, what method for preprocessing and learning
should we use? How well is that technique expected to perform in the
real world?
Let us say we fix a certain technique, let us call it 𝐴. Let 𝑀 be the
solution found by 𝐴 using the whole dataset 𝐷. If we assess 𝑀 using the
whole dataset 𝐷, the performance 𝑝 we get is optimistic. This is because
𝑀 has been trained and tested on the same data.

One could argue that we could use a hold-out set to estimate the per-
formance of 𝑀 — i.e., splitting the dataset into a training set and a test
set once. However, this does not solve the problem. The performance
𝑝 we observe in the test set might be an overestimation or an underes-
timation of the performance of 𝑀 in production. This is because the
randomly chosen test set might be an “outlier” in the representation of
the real world, containing cases that are too easy or too hard to predict.
The correct way to estimate the performance of 𝑀 is to address per-
formance as a random variable, since both the data and the learning
process are stochastic. By doing so, we can study the distribution of the
performance, not particular values.
As with any statistical evaluation, we need to generate samples of
the performance of the possible solutions that 𝐴 is able to obtain. To
do so, we use a sampling strategy to generate datasets 𝐷1 , 𝐷2 , … from 𝐷.
Each dataset is further divided into a training set and a test set, which
must be disjoint. Each training set is thus used to find a solution —
𝑀1 , 𝑀2 , … for each training set — and the test set is used to evaluate the
performance — 𝑝1 , 𝑝2 , … for each test set — of the solution. The test
set emulates the real-world scenario, where the model is used to make
predictions on new data.
The most common sampling strategy is cross-validation. It assumes that the data are independent and identically distributed (i.i.d.). This strategy randomly divides the dataset into 𝑟 folds of the same size. Each part (fold) is used as a test set once and as a training set
𝑟 − 1 times. So, first we use as training set folds 2, 3, … , 𝑟 and as test set
fold 1. Then, we use as training set folds 1, 3, … , 𝑟 and as test set fold 2.
And so on. See fig. 8.2.
If possible, one should use repeated cross-validation, where this pro-
cess is repeated many times, each having a different fold partitioning
chosen at random. Also, when dealing with classification problems,
we should use stratified cross-validation, where the distribution of the
classes is preserved in each fold.
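
A minimal sketch of the fold partitioning (Python with NumPy; a plain, unstratified version of the idea in fig. 8.2):

import numpy as np

def k_fold_indices(n, r, seed=0):
    """Yield (training, test) index arrays for r-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), r)   # random partition into r folds
    for k in range(r):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(r) if j != k])
        yield train, test

# Repeated cross-validation: call it again with a different seed.
for train_idx, test_idx in k_fold_indices(n=20, r=4):
    pass   # fit the preprocessor and the model on train_idx, evaluate on test_idx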

8.2.2 Collecting evidence


Once we understand the sampling strategy, we can design the experi-
mental plan to collect evidence about the performance of the solution.
The plan involves the following steps.
The solution search algorithm 𝐴 involves both a given data prepro-
cessing chain and a machine learning method. Both of them generate a
different result for each dataset 𝐷𝑘 used as an input. In other words, the

Figure 8.2: Cross-validation

Fold 1 Fold 2 Fold 3 Fold 4

Test Training Training Training

Training Test Training Training

Training Training Test Training

Training Training Training Test

Cross-validation is a technique to sample training and test sets. It divides the dataset into 𝑟 folds, using 𝑟 − 1 folds as a training set and the remaining fold as a test set.

parameters 𝜙 of the data preprocessing step are adjusted — see chapter 7


— and the parameters 𝜃 of the machine learning model are adjusted —
see chapter 6. These parameters, [𝜙𝑘, 𝜃𝑘], are the solution 𝑀𝑘 and must
be calculated exclusively using the training set 𝐷𝑘,train .
Once the parameters 𝜙𝑘 and 𝜃𝑘 are fixed, we apply them in the test
set 𝐷𝑘,test . For each sample (𝑥𝑖 , 𝑦 𝑖 ) ∈ 𝐷𝑘,test , we calculate the prediction
𝑦 ̂𝑖 = 𝑓𝜙,𝜃 (𝑥𝑖 ). The target value 𝑦 is called the ground-truth or expected
outcome.
Given a performance metric 𝑅, for each dataset 𝐷𝑘 , we calculate

𝑝 𝑘 = 𝑅([𝑦 𝑖 ∶ 𝑖] , [𝑦 ̂𝑖 ∶ 𝑖]) .

Note that, by definition, 𝑝 𝑘 is free of data leakage, as [𝜙𝑘 , 𝜃𝑘 ] are found


without the use of the data in 𝐷𝑘,test and to calculate 𝑦 ̂𝑖 we use only 𝑥𝑖
(with no target 𝑦 𝑖 ).
For a detailed explanation of this process for each sampling, con-
sult section 8.1. A summary of the experimental plan for estimating
expected performance is shown in fig. 8.3.
Finally, we can study the sampled performance values 𝑝1 , 𝑝2 , … like
any other statistical data to prove (or disprove) the hypothesis. This pro-
cess is called validation.

Figure 8.3: Experimental plan for estimating expected performance of a solution.

[Diagram: the data pass through the sampling strategy to produce training/test pairs; each training set goes through the solution search algorithm (data handling pipeline plus machine learning), yielding the parameters 𝜙 and 𝜃; the fitted preprocessor and model are applied to the test set (without targets) to produce predictions, which are compared with the test targets to obtain a performance value 𝑝; the resulting samples 𝑝1, 𝑝2, … feed the hypothesis test.]

The experimental plan for estimating the expected performance of a solution involves sampling the data, training and testing the solution, evaluating the performance, and validating the results.

Definition 8.1: (Validation)

We call evaluation the process of assessing the performance of a solution using a test set; validation, on the other hand, is the
process of interpreting or confirming the meaning of the evalua-
tion results. Validation is the process of determining the degree
to which the evaluation results support the intended use of the
solution (unseen data).

The results are not the “real” performance of the solution 𝑀 in the
real world, as that would require new data to be collected. However, we
can safely interpret the performance samples as being sampled from the
same distribution as the real-world performance of the solution 𝑀.

8.2.3 Estimating expected performance


We have seen that we need a process of interpreting or confirming the
meaning of the evaluation results. Sometimes, it is as simple as calculat-
ing the mean and standard deviation of the performance samples. Other
times, we need to use more sophisticated techniques, like hypothesis
tests or Bayesian analysis.
Let us say our goal is to reach a certain performance threshold 𝑝0 .
After an experiment done with 10 repeated 10-fold cross-validation, we
have the average performance 𝑝 ̄ and the standard deviation 𝜎. If 𝑝 ̄ −
𝜎 ≫ 𝑝0 , it is very likely that the solution will reach the threshold in
production. Although this is not a formal validation, it is a good and
likely indication.
Also, it is common to use visualization techniques to analyze the re-
sults. Box plots are a good way to see the distribution of the performance
samples.
A more sophisticated technique is to use Bayesian analysis. In this
case, we use the performance samples to estimate the probability dis-
tribution of the performance of the algorithm. This distribution can be
used to calculate the probability of the performance being better than a
certain threshold.
Benavoli et al. (2017)4 propose an interesting Bayesian test that accounts for the overlapping training sets in the cross-validation5.

4A. Benavoli et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis”. In: Journal of Machine Learning Research 18.77, pp. 1–36. url: http://jmlr.org/papers/v18/16-305.html.

Let 𝑧𝑘 = 𝑝𝑘 − 𝑝∗ be the difference between the performance on the 𝑘-th fold and the performance goal 𝑝∗. A generative model for the data is
z = 1𝜇 + v,
where z = (𝑧1 , 𝑧2 , … , 𝑧𝑛 ) is the vector of performance gains, 1 is a vec-
tor of ones, 𝜇 is the parameter of interest (the mean performance gain),
and v ∼ MVN(0, Σ) is a multivariate normal noise with zero mean and
covariance matrix Σ. The covariance matrix Σ is characterized as
    Σ𝑖𝑖 = 𝜎², Σ𝑖𝑗 = 𝜎²𝜌,

for all 𝑖 ≠ 𝑗 ∈ {1, 2, … , 𝑛}, where 𝜌 is the correlation (between folds) and 𝜎² is the variance. The likelihood model of the data is

    P(z ∣ 𝜇, Σ) = (1 / ((2𝜋)^{𝑛/2} √|Σ|)) exp(−(1/2) (z − 1𝜇)^𝑇 Σ^{−1} (z − 1𝜇)).
According to them, such a likelihood does not allow estimating the corre-
lation from data, as the maximum likelihood estimate of 𝜌 is zero regard-
less of the observations. Since 𝜌 is not identifiable, the authors suggest
using the heuristic where 𝜌 is the ratio between the number of folds and
the total number of performance samples.
To estimate the probability of the performance of the solution being
greater than the threshold, we first estimate the parameters 𝜇 and 𝜈 =
𝜎⁻² of the generative model. Benavoli et al. consider the prior
P(𝜇, 𝜈 ∣ 𝜇0 , 𝜅0 , 𝑎, 𝑏) = NG(𝜇, 𝜈; 𝜇0 , 𝜅0 , 𝑎, 𝑏),
which is a Normal-Gamma distribution with parameters (𝜇0 , 𝜅0 , 𝑎, 𝑏).
This is a conjugate prior to the likelihood model. Choosing the prior
parameters 𝜇0 = 0, 𝜅0 → ∞, 𝑎 = −1/2, and 𝑏 = 0, the posterior distri-
bution of 𝜇 is a location-scale Student distribution. Mathematically, we
have
    P(𝜇 ∣ z, 𝜇0, 𝜅0, 𝑎, 𝑏) = St(𝜇; 𝑛 − 1, 𝑧̄, (1/𝑛 + 𝜌/(1 − 𝜌)) 𝑠²),

where

    𝑧̄ = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑧𝑖,
5This is actually a particular case of the proposal in the paper, where the authors
consider the comparison between two performance vectors — which is the case described
in section 8.2.4.

and

    𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑧𝑖 − 𝑧̄)².

Thus, validating that the solution obtained by the algorithm in production will surpass the threshold 𝑝∗ consists of checking whether

    P(𝜇 > 0 ∣ z) > 𝛾,
where 𝛾 is the confidence level.
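
As a hedged sketch (Python with SciPy; it follows the posterior just described, with 𝜌 given by the heuristic above, and is not code from the book):

import numpy as np
from scipy import stats

def prob_mu_positive(z, rho):
    """Posterior probability P(mu > 0 | z) under the correlated Bayesian test."""
    z = np.asarray(z, float)
    n = len(z)
    z_bar = z.mean()
    s2 = z.var(ddof=1)                                   # sample variance
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * s2)
    # The posterior of mu is a Student-t with n - 1 degrees of freedom.
    return 1.0 - stats.t.cdf(0.0, df=n - 1, loc=z_bar, scale=scale)

# z: performance samples minus the goal p*, from 10 repetitions of 10-fold
# cross-validation; rho = 10/100 following the heuristic above.
# The solution is validated if prob_mu_positive(z, rho=0.1) > gamma.

For the comparison described in section 8.2.4, the same function applies with z = p(𝜆) − p(𝜆0).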
Note that the Bayesian analysis is a more sophisticated technique
than null hypothesis significance testing, as it allows us to estimate the
probability of the hypothesis instead of the probability of observing the
data given the hypothesis. Benavoli et al. (2017)6 thoroughly discuss the
subject.
Also, be aware that the choice of the model and the prior distribution
can affect the results. Benavoli et al. suggest using 10 repetitions of 10-
fold cross-validation to estimate the parameters of the generative model.
They also show experimental evidence that their procedure is robust to
the choice of the prior distribution. However, one should be aware of
the limitations of the model.

8.2.4 Comparing strategies


When we have two or more strategies to solve a problem, we need to
compare them to see which one is better. This is a common situation
in data science projects, as we usually have many techniques to solve a
problem.
One way to look at this problem is to consider that the algorithm7 𝐴
has hyperparameters 𝜆 ∈ Λ. A hyperparameter here is a parameter that
is not learned by the algorithm, but is set by the user. For example, the
number of neighbors in a k-NN algorithm is a hyperparameter. For the
sake of generality, we can consider that the hyperparameters may also
include different learning algorithms or data handling pipelines.
Let us say we have a baseline algorithm 𝐴(𝜆0 ) — for instance, some-
thing that is in production, the result of the last sprint or a well-known
algorithm — and a new candidate algorithm 𝐴(𝜆). Suppose p(𝜆0 ) and

6A. Benavoli et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple
Classifiers Through Bayesian Analysis”. In: Journal of Machine Learning Research 18.77,
pp. 1–36. url: http://jmlr.org/papers/v18/16-305.html.
7That includes both data preprocessing and machine learning.

p(𝜆) are the performance vectors of the baseline and the candidate al-
gorithms, respectively, that are calculated using the same strategy de-
scribed in section 8.2.3. It is important to note that the same samplings
must be used to compare the algorithms — i.e., performance samples
must be paired, each one of them coming from the same sampling, and
consequently, from the same training and test datasets.
We can validate whether the candidate is better than the baseline by

P(𝜇 > 0 ∣ z) > 𝛾,

where z is now p(𝜆) − p(𝜆0 ). The interpretation of the results is similar;


𝛾 is the chosen confidence level and 𝜇 is the expected performance gain
of the candidate algorithm — or the performance loss, if negative.
This strategy can be applied iteratively to compare many algorithms.
For example, we can compare 𝐴(𝜆1 ) with 𝐴(𝜆0 ), 𝐴(𝜆2 ) with 𝐴(𝜆1 ), and
so on, keeping the best algorithm found so far as the baseline. In the
cases where the confidence level is not reached, but the expected per-
formance gain is positive, we can consider additional characteristics of
the algorithms, like the interpretability of the model, the computational
cost, or the ease of implementation, to decide which one is better. How-
ever, one should pay attention to whether the probability

P(𝜇 < 0 ∣ z)

is too high or not. Always ask yourself if the risk of performance loss is
worth it in the real-world scenario.

8.2.5 About nesting experiments


Mathematically speaking, there is no difference between assessing the
choice of [𝜙, 𝜃] and the choice of 𝜆. Thus, some techniques — like grid
search — can be used to find the best hyperparameters using a nested
experimental plan.
The idea is the same: we assess how good the expected choice of
the hyperparameter-optimization technique 𝐵 is to find the appropriate
hyperparameters. Similarly, the choice of the hyperparameters and the
parameters that go to production is the application of 𝐵 to the whole
dataset. However, never use the choices of the hyperparameters in the
experimental plan to make decisions about what goes to production.
(The same is true for the parameters [𝜙, 𝜃] in the traditional case.)
Although nesting experiments usually lead to a general understand-
ing of the performance of the solution, it is not always the best choice.

Nested experiments are computationally expensive, as the possible com-


binations are multiplied. Also, the size of the dataset in the inner exper-
iment is smaller, which can lead to a less reliable estimate of the perfor-
mance.
Nonetheless, we can always unnest the search by taking the options
as different algorithms two by two, like we described in section 8.2.4.
This solves the problem of the size of the dataset in the inner experi-
ment, but it does not solve the problem of the computational cost —
often increasing it.

8.3 Final remarks


In this chapter, we presented a framework for experimental planning
that can be used in most data science projects for inductive tasks. One
major limitation of the framework is that it assumes that the data is i.i.d.
This is not always the case, as the data can be dependent on time or
space. In these cases, the sampling strategy must be adjusted to account
for the dependencies.
Unfortunately, changing the sampling strategy also means that the
validation method must be adjusted. That is why tasks like time-series
forecasting and spatial data analysis require a different approach to ex-
perimental planning.
A Mathematical foundations
Maar ik maak steeds wat ik nog niet kan om het te leeren kunnen. (But I keep making what I cannot do yet, in order to learn how to do it.)
— Vincent van Gogh, The Complete Letters of Vincent Van
Gogh, Volume Three

Foundations in data science come from a variety of fields, including alge-


bra, statistics, computer science, optimization theory, and information
theory. This appendix provides a brief overview of the main computa-
tional, algebraic, and statistical concepts in data science.
My goal is not to provide a comprehensive treatment of these topics,
but to consolidate notations and definitions that are used throughout
the book. The reader is encouraged to consult the references provided at
the end of each topic for a more in-depth treatment. Statisticians with a
strong programming background and computer scientists with a strong
statistics background will probably not find much new here.
I first introduce the main concepts in algorithms and data structures,
which are the building blocks of computational thinking. Then, I show
the basic concepts in set theory and linear algebra, which are impor-
tant mathematical foundations for data science. Finally, I introduce
the main concepts in probability theory, the cornerstone of statistical
learning and inference.


Chapter remarks

Contents
A.1 Algorithms and data structures . . . . . . . . . . . . . . . 187
A.1.1 Computational complexity . . . . . . . . . . . . . 187
A.1.2 Algorithmic paradigms . . . . . . . . . . . . . . . 188
A.1.3 Data structures . . . . . . . . . . . . . . . . . . . 192
A.2 Set theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
A.2.1 Set operations . . . . . . . . . . . . . . . . . . . . 196
A.2.2 Set operations properties . . . . . . . . . . . . . . 196
A.2.3 Relation to Boolean algebra . . . . . . . . . . . . . 197
A.3 Linear algebra . . . . . . . . . . . . . . . . . . . . . . . . 197
A.3.1 Operations . . . . . . . . . . . . . . . . . . . . . . 198
A.3.2 Systems of linear equations . . . . . . . . . . . . . 200
A.3.3 Eigenvalues and eigenvectors . . . . . . . . . . . 200
A.4 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.4.1 Axioms of probability and main concepts . . . . . 201
A.4.2 Random variables . . . . . . . . . . . . . . . . . . 202
A.4.3 Expectation and moments . . . . . . . . . . . . . 203
A.4.4 Common probability distributions . . . . . . . . . 205
A.4.5 Permutations and combinations . . . . . . . . . . 208
B.1 Multi-layer perceptron . . . . . . . . . . . . . . . . . . . . 210
B.2 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . 212

Context

• Data science relies on a variety of mathematical and computational


concepts.
• The main concepts are algorithms, data structures, set theory, lin-
ear algebra, and probability theory.

Objectives

• Introduce a brief overview of the main computational, mathemat-


ical, and statistical concepts in data science.
• Remind the reader of the main definitions and properties of these
concepts.
• Consolidate notations and definitions that are used throughout the
book.

A.1 Algorithms and data structures


Algorithms are step-by-step procedures for solving a problem. They are
used to manipulate data structures, which are ways of organizing data
to solve problems. They are realized in programming languages, which
are formal languages that can be used to express algorithms.
My suggestion for a comprehensive book about algorithms and data
structures is Cormen et al. (2022)1. An alternative for beginners is Gut-
tag (2021)2.

A.1.1 Computational complexity


The computational complexity of an algorithm is the total amount of
resources it uses to run as a function of the size of the input. The most
common resources are time and space.
Usually, we are interested in the asymptotic complexity of an algo-
rithm, i.e., how the complexity grows as the size of the input grows. The
most common notation for asymptotic complexity is the Big-O notation.

Big-O notation Let 𝑓 and 𝑔 be functions from the set of natural numbers to the set of real numbers, i.e., 𝑓, 𝑔 ∶ ℕ → ℝ. We say that 𝑓 is 𝑂(𝑔) if there exist a constant 𝑐 > 0 and a natural number 𝑛0 such that 𝑓(𝑛) ≤ 𝑐𝑔(𝑛) for all 𝑛 ≥ 𝑛0 . We can order functions by their asymptotic complexity. For example, 𝑂(1) < 𝑂(log 𝑛) < 𝑂(𝑛) < 𝑂(𝑛 log 𝑛) < 𝑂(𝑛²) < 𝑂(2ⁿ) < 𝑂(𝑛!). Throughout this book, we consider log 𝑛 = log₂ 𝑛, i.e., whenever the base of the logarithm is not specified, it is assumed to be 2.
The asymptotic analysis of algorithms is usually done in the worst-
case scenario, i.e. the maximum amount of resources the algorithm uses
for any input of size 𝑛. Thus, it gives us an upper bound on the com-
plexity of the algorithm. In other words, an algorithm with complexity
𝑂(𝑔(𝑛)) is guaranteed to run in at most 𝑐𝑔(𝑛) time for some constant 𝑐.
It does not mean, for instance, that an algorithm with time complexity 𝑂(𝑛) will always run faster than an algorithm with time complexity 𝑂(𝑛²), but that the former will run faster for a large enough input size.

1T. H. Cormen et al. (2022). Introduction to Algorithms. 4th ed. The MIT Press, p. 1312.
isbn: 978-0262046305.
2J. V. Guttag (2021). Introduction to Computation and Programming Using Python.
With Application to Computational Modeling and Understanding Data. 3rd ed. The MIT
Press, p. 664. isbn: 978-0262542364.

An important property of the Big-O notation is that

𝑂(𝑓) + 𝑂(𝑔) = 𝑂(max(𝑓, 𝑔)),

i.e. if an algorithm has two sequential steps with time complexity 𝑂(𝑓)
and 𝑂(𝑔), the highest complexity is the one that determines the overall
complexity.

A.1.2 Algorithmic paradigms


Some programming techniques are used to solve a wide variety of prob-
lems. They are called algorithmic paradigms. The most common ones
are listed below.

Divide and conquer The problem is divided into smaller subprob-


lems that are solved recursively. The solutions to the subproblems are
then combined to give a solution to the original problem. Some example
algorithms are merge sort, quick sort, and binary search.

Algorithm A.1: Binary search algorithm.

Data: A sorted array a = [𝑎1 , 𝑎2 , … , 𝑎𝑛 ] and a key 𝑥
Result: True if 𝑥 is in a, false otherwise
𝑙 ← 1
𝑟 ← 𝑛
while 𝑙 ≤ 𝑟 do
    𝑚 ← ⌊(𝑙 + 𝑟)/2⌋
    if 𝑥 = 𝑎𝑚 then
        return true
    if 𝑥 < 𝑎𝑚 then
        𝑟 ← 𝑚 − 1
    else
        𝑙 ← 𝑚 + 1
return false

An iterative algorithm that searches for a key in a sorted array.

Consider as an example algorithm A.1, which solves the binary search problem. Given an 𝑛-element sorted array a = [𝑎1 , 𝑎2 , … , 𝑎𝑛 ], 𝑎1 ≤ 𝑎2 ≤ ⋯ ≤ 𝑎𝑛 , and a key 𝑥, the algorithm returns true if 𝑥 is in a and false otherwise. The algorithm works by dividing the array in half at each step and comparing the key with the middle element. Each time the key is not found, the search space is reduced by half.

Algorithm A.2: Recursive binary search algorithm.

function bsearch([𝑎1 , 𝑎2 , … , 𝑎𝑛 ] , 𝑥) is
    if 𝑛 = 0 then
        return false
    𝑚 ← ⌊𝑛/2⌋
    if 𝑥 = 𝑎𝑚 then
        return true
    if 𝑥 < 𝑎𝑚 then
        return bsearch([𝑎1 , … , 𝑎𝑚−1 ] , 𝑥)
    else
        return bsearch([𝑎𝑚+1 , … , 𝑎𝑛 ] , 𝑥)

A recursive algorithm that searches for a key in a sorted array.


Note that trivial conditions — 𝑛 = 0 and key found — are handled
first, so the recursion stops when the problem is small enough.

Divide and conquer algorithms can be implemented using recursion.


Recursion is also an algorithmic paradigm where a function calls itself to
solve smaller instances of the same problem. The recursion stops when
the problem is small enough to be solved directly.
Algorithm A.2 displays a recursive implementation of the binary
search algorithm. The smaller instances, or so-called base cases, are
when the array is empty or the key is found in the middle. Other condi-
tions — key is smaller or greater than the middle element — are handled
by calling the function recursively with the left or right half of the array.
This solution — both algorithms — has a worst-case time complexity of 𝑂(log 𝑛). The search space is halved at each step; thus, in the 𝑖-th iteration, the remaining number of elements in the array is 𝑛/2ⁱ⁻¹. In the worst case, the algorithm stops when the search space has size 1 or smaller, i.e.,

𝑛/2ⁱ⁻¹ = 1 ⟹ 𝑖 = 1 + log 𝑛.

Note that this strategy leads to such a low time complexity that we can solve large instances of the problem in a reasonable amount of time. Consider the case of an array with 2⁶⁴ = 18,446,744,073,709,551,616 elements: the algorithm will find the key in at most 65 steps.
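For concreteness, a direct Python transcription of algorithms A.1 and A.2 is sketched below, using 0-based indexing. Note that the recursive variant slices the list, which copies it; the pseudocode avoids this cost by passing indices instead.

def bsearch_iterative(a, x):
    # Return True if x is in the sorted list a, using iterative binary search.
    l, r = 0, len(a) - 1
    while l <= r:
        m = (l + r) // 2
        if a[m] == x:
            return True
        if x < a[m]:
            r = m - 1
        else:
            l = m + 1
    return False

def bsearch_recursive(a, x):
    # Recursive variant: each call works on a smaller slice of a.
    if not a:
        return False
    m = len(a) // 2
    if a[m] == x:
        return True
    if x < a[m]:
        return bsearch_recursive(a[:m], x)
    return bsearch_recursive(a[m + 1:], x)

print(bsearch_iterative([1, 3, 5, 7, 9], 7))  # True
print(bsearch_recursive([1, 3, 5, 7, 9], 4))  # False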

Greedy algorithms The problem is solved with incremental steps,


each of which is locally optimal. The overall solution is not guaranteed
to (but might) be optimal. Some example algorithms are Dijkstra’s al-
gorithm and Prim’s algorithm. Greedy algorithms are usually very effi-
cient in terms of time complexity — see more in the following.
One example of a suboptimal greedy algorithm is a heuristic solution
for the knapsack problem. The knapsack problem is a combinatorial op-
timization problem where the goal is to maximize the value of items in
a knapsack without exceeding its capacity. The problem is mathemati-
cally defined as
maximize ∑_{𝑖=1}^{𝑛} 𝑣𝑖 𝑥𝑖 ,

subject to ∑_{𝑖=1}^{𝑛} 𝑤𝑖 𝑥𝑖 ≤ 𝑊 ,

where 𝑣 𝑖 is the value of item 𝑖, 𝑤 𝑖 is the weight of item 𝑖, 𝑥𝑖 is a binary


variable that indicates if item 𝑖 is in the knapsack, and 𝑊 is the capacity
of the knapsack.
An algorithm that finds a suboptimal solution for the knapsack prob-
lem is shown in algorithm A.3. It iterates over the items in decreasing
order of value and puts the item in the knapsack if it fits. The algorithm
is suboptimal because there might exist small-value items that, when
combined, would fit in the knapsack and yield a higher total value.
The most costly operation in the algorithm is the sorting of the items in decreasing order of value, which has a time complexity³ of 𝑂(𝑛 log 𝑛).

Brute force The problem is solved by trying all possible solutions.


Most of the time, brute force algorithms have exponential time complex-
ity, leading to impractical solutions for large instances of the problem.
On the other hand, brute force algorithms are usually easy to implement
and understand, as well as guaranteed to find the optimal solution.
3Considering the worst-case time complexity of the sorting algorithm, consult T. H.
Cormen et al. (2022). Introduction to Algorithms. 4th ed. The MIT Press, p. 1312. isbn:
978-0262046305 for more details.

Algorithm A.3: Heuristic solution for the knapsack problem.

Data: A list of 𝑛 items, each with a value 𝑣𝑖 and a weight 𝑤𝑖 , and a capacity 𝑊
Result: The binary variable 𝑥𝑖 for each item 𝑖 that maximizes the total value
Sort the items in decreasing order of value
𝑉 ← 0
𝑥𝑖 ← 0, ∀𝑖
for 𝑖 ← 1 to 𝑛 do
    if 𝑤𝑖 ≤ 𝑊 then
        𝑥𝑖 ← 1
        𝑉 ← 𝑉 + 𝑣𝑖
        𝑊 ← 𝑊 − 𝑤𝑖
return 𝑥𝑖 , ∀𝑖

A greedy algorithm that solves the knapsack problem suboptimally. The algorithm iterates over the items in decreasing order of value and puts the item in the knapsack if it fits.
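A minimal Python sketch of algorithm A.3 follows; the item values, weights, and capacity are made up for illustration.

def greedy_knapsack(values, weights, capacity):
    # Greedy heuristic: take items in decreasing order of value while they fit.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    chosen, total = [0] * len(values), 0
    for i in order:
        if weights[i] <= capacity:
            chosen[i] = 1
            total += values[i]
            capacity -= weights[i]
    return chosen, total

# Hypothetical instance: taking item 0 first blocks the better pair {1, 2}.
values, weights, capacity = [10, 6, 6], [5, 3, 2], 5
print(greedy_knapsack(values, weights, capacity))  # ([1, 0, 0], 10), while the optimum is 12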

In the previous example, a brute force algorithm for the knapsack


problem would try all possible combinations of items and select the one
that maximizes the total value without exceeding the capacity. One can
easily see that the time complexity of such an algorithm is 𝑂(2ⁿ), where 𝑛 is the number of items, as there are 2ⁿ possible combinations of items.
Such an exhaustive search is impractical for large 𝑛, but it is guaranteed
to find the optimal solution.
One should avoid brute force algorithms whenever possible, as they
are usually too costly to be practical. However, they are useful for small
instances of the problem, for verification of the results of other algo-
rithms, and for educational purposes.
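For comparison, a brute-force sketch that enumerates all 2ⁿ assignments; on the small made-up instance above it recovers the optimal value that the greedy heuristic misses.

from itertools import product

def brute_force_knapsack(values, weights, capacity):
    # Try every binary assignment and keep the best feasible one.
    best_value, best_choice = 0, [0] * len(values)
    for choice in product((0, 1), repeat=len(values)):  # 2**n combinations
        weight = sum(w * c for w, c in zip(weights, choice))
        value = sum(v * c for v, c in zip(values, choice))
        if weight <= capacity and value > best_value:
            best_value, best_choice = value, list(choice)
    return best_choice, best_value

print(brute_force_knapsack([10, 6, 6], [5, 3, 2], 5))  # ([0, 1, 1], 12)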

Backtracking The problem is solved incrementally, one piece at a


time. If a piece does not fit, it is removed and replaced by another piece.
Some example algorithms are the naïve solutions for N-queens problem
and for the Sudoku problem. Backtracking, as a special case of brute
force, often leads to exponential (or worse) time complexity.
Many times, backtracking algorithms are combined with other tech-

niques to reduce the search space and make the algorithm more effi-
cient. For example, the backtracking algorithm for the Sudoku problem
is combined with constraint propagation to reduce the number of possi-
ble solutions.
A Sudoku puzzle consists of an 𝑛 × 𝑛 grid, divided into 𝑛 subgrids
of size √𝑛 × √𝑛. The goal is to fill the grid with numbers from 1 to 𝑛
such that each row, each column, and each subgrid contains all numbers
from 1 to 𝑛 but no repetitions. The most common grid size is 9 × 9.
An illustration of backtracking to solve a 4 × 4 Sudoku puzzle⁴ is shown in fig. A.1. The puzzle is solved by trying all possible numbers in each cell and backtracking when a number does not fit. The solution is found when all cells are filled and the constraints are satisfied. Arrows indicate the steps of the backtracking algorithm. Every time a constraint is violated — indicated in gray —, the algorithm backtracks to the previous cell and tries a different number.
One can easily see that a puzzle with 𝑚 missing cells has 𝑛ᵐ candidate assignments. For small values of 𝑚 and 𝑛, the algorithm is practical, but for large values, it becomes too costly.
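A sketch of this backtracking strategy in Python for an arbitrary 4 × 4 instance (not necessarily the puzzle of fig. A.1); empty cells are marked with 0.

import math

def solve(grid):
    # Solve an n x n Sudoku grid in place by backtracking.
    # Empty cells hold 0. Returns True when a solution is found.
    n = len(grid)
    box = math.isqrt(n)

    def valid(r, c, v):
        if any(grid[r][j] == v for j in range(n)):   # row constraint
            return False
        if any(grid[i][c] == v for i in range(n)):   # column constraint
            return False
        br, bc = box * (r // box), box * (c // box)  # top-left cell of the subgrid
        return all(grid[i][j] != v
                   for i in range(br, br + box)
                   for j in range(bc, bc + box))

    def backtrack():
        for r in range(n):
            for c in range(n):
                if grid[r][c] == 0:
                    for v in range(1, n + 1):
                        if valid(r, c, v):
                            grid[r][c] = v        # tentative placement
                            if backtrack():
                                return True
                            grid[r][c] = 0        # undo and try the next value
                    return False                  # no value fits here: backtrack
        return True                               # no empty cell left: solved

    return backtrack()

# A made-up 4 x 4 instance (0 marks an empty cell).
puzzle = [[1, 0, 3, 0],
          [0, 4, 0, 2],
          [2, 0, 4, 0],
          [0, 3, 0, 1]]
if solve(puzzle):
    for row in puzzle:
        print(row)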

A.1.3 Data structures


Data structures are ways of organizing data to solve problems. The most
common ones are listed below. A comprehensive material about the
properties and implementations of data structures can be found in Cor-
men et al. (2022)5.

Arrays An array is a homogeneous collection of elements that are ac-


cessed by an integer index. The elements are usually stored in contigu-
ous memory locations. In the scope of this book, it is equivalent to a
mathematical vector whose elements’ type are not necessarily numeri-
cal. Thus, an 𝑛-element array a is denoted by [𝑎1 , 𝑎2 , … , 𝑎𝑛 ], where the 𝑖 in 𝑎𝑖 is the index of the element.

Stacks A stack is a collection of elements that are accessed in a last-


in-first-out (LIFO) order. Elements are added to the top of the stack and
removed from the top of the stack. In other words, only two operations

4Smaller puzzles are more didactic, but the same principles apply to larger puzzles.
5T. H. Cormen et al. (2022). Introduction to Algorithms. 4th ed. The MIT Press, p. 1312.
isbn: 978-0262046305.

Figure A.1: Backtracking to solve a Sudoku puzzle.

[Figure: a sequence of partially filled 4 × 4 grids connected by arrows, showing the value tried at each step of the backtracking search.]

A Sudoku puzzle — in this case, 4 × 4 — is solved by trying all possible numbers in each cell and backtracking when a number does not fit. The solution is found when all cells are filled and the constraints are satisfied. Arrows indicate the backtracking steps. The question mark indicates an empty cell that needs to be filled at that step. Constraint violations are shown in gray.

are allowed: push (add an element to the top of the stack) and pop (re-
move the top element). Only the top element is accessible.

Queues A queue is a collection of elements that are accessed in a first-


in-first-out (FIFO) order. Elements are added to the back of the queue
and removed from the front of the queue. The two operations allowed
are enqueue (add an element to the back of the queue) and dequeue
(remove the front element). Only the front and back elements are acces-
sible.
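In Python, for instance, a list works as a stack and collections.deque as a queue (a small sketch):

from collections import deque

stack = []
stack.append('a')        # push
stack.append('b')        # push
top = stack.pop()        # pop -> 'b' (last in, first out)

queue = deque()
queue.append('a')        # enqueue at the back
queue.append('b')        # enqueue at the back
front = queue.popleft()  # dequeue from the front -> 'a' (first in, first out)

print(top, front)  # b a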

Trees A tree is a collection of nodes. Each node contains a value and


a list of references to its children. The first node is called the root. A
node with no children is called a leaf. No cycles are allowed in a tree,
i.e., a child cannot be an ancestor of its parent. The most common type
of tree is the binary tree, where each node has at most two children.
Mathematically, a binary tree is a recursive data structure. A binary
tree is either empty or consists of a root node and two binary trees, called
the left and right children. Thus, a binary tree 𝑇 is

𝑇 = ∅ if it is empty, or
𝑇 = (𝑣, 𝑇𝑙 , 𝑇𝑟 ) if it has a value 𝑣 and two children 𝑇𝑙 and 𝑇𝑟 .

Note that the left and right children are themselves binary trees. If 𝑇 is
a leaf, then 𝑇𝑙 = 𝑇𝑟 = ∅.
These properties make it easy to represent a binary tree using paren-
theses notation. For example, (1, (2, ∅, ∅), (3, ∅, ∅)) is a binary tree
with root 1, left child 2, and right child 3.
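The parentheses notation maps directly to nested Python tuples; a small sketch with an in-order traversal:

# A binary tree is either None (empty) or a tuple (value, left, right).
tree = (1, (2, None, None), (3, None, None))

def in_order(t):
    # Yield the values of a binary tree in left-root-right order.
    if t is None:
        return
    value, left, right = t
    yield from in_order(left)
    yield value
    yield from in_order(right)

print(list(in_order(tree)))  # [2, 1, 3]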

Graphs A graph is also a collection of nodes. Each node contains a


value and a list of references to its neighbors; the references are called
edges. A graph can be directed or undirected. A graph is directed if the
edges have a direction.
Mathematically, a graph is a pair 𝐺 = (𝑉 , 𝐸), where 𝑉 is a set of
vertices and 𝐸 ⊆ 𝑉 × 𝑉 is a set of edges. An edge is a pair of vertices,
i.e., 𝑒 = (𝑣 𝑖 , 𝑣𝑗 ), where 𝑣 𝑖 , 𝑣𝑗 ∈ 𝑉 . If the graph is directed, the edge is an
ordered pair, i.e., 𝑒 = (𝑣 𝑖 , 𝑣𝑗 ) ≠ (𝑣𝑗 , 𝑣 𝑖 ).
Not only can each node hold a value, but also each edge can have
a weight. A weighted graph is a graph where there exists a function
𝑤 ∶ 𝐸 → ℝ that assigns a real number to each edge.

Figure A.2: A graph with four vertices and five edges.

[Figure: four vertices, numbered 1 to 4, connected by five directed edges drawn as arrows.]

A graph with four vertices and five edges. Vertices are numbered from 1 to 4, and edges are represented by arrows. The graph is directed, as the edges have a direction.

A graphical representation of a directed graph with four vertices and


five edges is shown in fig. A.2. The vertices are numbered from 1 to 4,
and the edges are represented by arrows.
Another common representation of a graph is the adjacency matrix.
An adjacency matrix is a square matrix 𝐴 of size 𝑛 × 𝑛, where 𝑛 is the
number of vertices. The 𝑖, 𝑗-th entry of the matrix is 1 if there is an edge
from vertex 𝑖 to vertex 𝑗, and 0 otherwise. The adjacency matrix of the
graph in fig. A.2 is

    ⎛ 0 1 1 0 ⎞
𝐴 = ⎜ 0 0 1 0 ⎟ .
    ⎜ 0 0 0 1 ⎟
    ⎝ 1 0 0 0 ⎠
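A sketch that builds this adjacency matrix from the edge list of fig. A.2:

n = 4
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1)]  # directed edges of fig. A.2

# Build the n x n adjacency matrix (1-indexed vertices, 0-indexed matrix).
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i - 1][j - 1] = 1

for row in A:
    print(row)
# [0, 1, 1, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 1]
# [1, 0, 0, 0]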

A.2 Set theory


A set is a collection of elements. The elements of a set can be anything,
including other sets. The elements of a set are unordered, and each ele-
ment is unique. The most common notation for sets is the curly braces
notation, e.g., {1, 2, 3}.
Some special sets are listed below.

Universe set The universe set is the set of all elements in a given con-
text. It is denoted by Ω.

Empty set The empty set is the set with no elements. It is denoted by
the symbol ∅. Depending on the context, it can also be denoted by {}.

A.2.1 Set operations


The basic operations on sets are union, intersection, difference, and
complement.

Union The union of two sets 𝐴 and 𝐵 is the set of elements that are
in 𝐴 or 𝐵. It is denoted by 𝐴 ∪ 𝐵. For example, the union of {1, 2, 3} and
{3, 4, 5} is {1, 2, 3, 4, 5}.

Intersection The intersection of two sets 𝐴 and 𝐵 is the set of ele-


ments that are in both 𝐴 and 𝐵. It is denoted by 𝐴 ∩ 𝐵. For example, the
intersection of {1, 2, 3} and {3, 4, 5} is {3}.

Difference The difference of two sets 𝐴 and 𝐵 is the set of elements


that are in 𝐴 but not in 𝐵. It is denoted by 𝐴 ∖ 𝐵. For example, the
difference of {1, 2, 3} and {3, 4, 5} is {1, 2}.

Complement The complement of a set 𝐴 is the set of elements that


are not in 𝐴. It is denoted by 𝐴𝑐 = Ω ∖ 𝐴.

Inclusion Inclusion is a relation between sets. A set 𝐴 is included in


a set 𝐵 if all elements of 𝐴 are also elements of 𝐵. It is denoted by 𝐴 ⊆ 𝐵.
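Python's built-in set type implements these operations directly; a small sketch:

omega = {1, 2, 3, 4, 5}           # universe set for the complement
A, B = {1, 2, 3}, {3, 4, 5}

print(A | B)        # union: {1, 2, 3, 4, 5}
print(A & B)        # intersection: {3}
print(A - B)        # difference: {1, 2}
print(omega - A)    # complement of A with respect to omega: {4, 5}
print({1, 2} <= A)  # inclusion: True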

A.2.2 Set operations properties


Union and intersection are commutative, associative, and distributive.
Thus, given sets 𝐴, 𝐵, and 𝐶, the following statements hold:

• Commutativity: 𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴 and 𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴;

• Associativity: (𝐴∪𝐵)∪𝐶 = 𝐴∪(𝐵∪𝐶) and (𝐴∩𝐵)∩𝐶 = 𝐴∩(𝐵∩𝐶);

• Distributivity: 𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) and 𝐴 ∩ (𝐵 ∪ 𝐶) =
(𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶).

The difference operation can be expressed in terms of union and in-


tersection as
𝐴 ∖ 𝐵 = 𝐴 ∩ 𝐵𝑐 .
The complement of the union of two sets is the intersection of their
complements, i.e.
(𝐴 ∪ 𝐵)𝑐 = 𝐴𝑐 ∩ 𝐵 𝑐 .

Similarly, the complement of the intersection of two sets is the union of


their complements, i.e.

(𝐴 ∩ 𝐵)𝑐 = 𝐴𝑐 ∪ 𝐵 𝑐 .

This property is known as De Morgan’s laws.


In terms of inclusion, given sets 𝐴, 𝐵, and 𝐶, the following state-
ments hold:
• Reflexivity: 𝐴 ⊆ 𝐴;
• Antisymmetry: 𝐴 ⊆ 𝐵 and 𝐵 ⊆ 𝐴 if and only if 𝐴 = 𝐵;
• Transitivity: 𝐴 ⊆ 𝐵 and 𝐵 ⊆ 𝐶 implies 𝐴 ⊆ 𝐶.

A.2.3 Relation to Boolean algebra


Set operations are closely related to Boolean algebra. In Boolean algebra,
the elements of a set are either true or false, many times represented by 1
and 0, respectively. The union operation is equivalent to the logical OR
operation, expressed by the symbol ∨; and the intersection operation
is equivalent to the logical AND operation, expressed by the symbol ∧.
The complement operation is equivalent to the logical NOT operation,
expressed by the symbol ¬.
The distributive property of set operations is equivalent to the dis-
tributive property of Boolean algebra. Important properties like De Mor-
gan’s laws also hold in Boolean algebra, i.e. ¬(𝐴 ∨ 𝐵) = ¬𝐴 ∧ ¬𝐵 and
¬(𝐴 ∧ 𝐵) = ¬𝐴 ∨ ¬𝐵.
Boolean algebra is the foundation of digital electronics and com-
puter science. The logical operations are implemented in hardware us-
ing logic gates, and the logical operations are used in programming lan-
guages to control the flow of a program.
Readers interested in more details about Boolean algebra and Dis-
crete Mathematics should consult Rosen (2018)6.

A.3 Linear algebra


Linear algebra is the branch of mathematics that studies vector spaces
and linear transformations. It is a fundamental tool in many areas of
science and engineering. The basic objects of linear algebra are vectors
6K. H. Rosen (2018). Discrete Mathematics and Its Applications. 8th ed. McGraw Hill,
p. 1120. isbn: 9781259676512.

and matrices. A common textbook that covers the subject in depth is


Strang (2023)7.

Vector A vector is an ordered collection of numbers. It is denoted by


a bold lowercase letter, e.g., v = [𝑣 𝑖 ]𝑖=1,…,𝑛 is a vector of length 𝑛.

Matrix A matrix is a rectangular collection of numbers. It is denoted


by an uppercase letter, e.g., 𝐴 = (𝑎𝑖𝑗 )𝑖=1,…,𝑛; 𝑗=1,…,𝑚 is the matrix with
𝑛 rows and 𝑚 columns.

Tensor Tensors are generalizations of vectors and matrices. A tensor


of rank 𝑘 is a multidimensional array with 𝑘 indices. Scalars are tensors
of rank 0, vectors are tensors of rank 1, and matrices are tensors of rank
2. Tensors are commonly used in machine learning and physics.

A.3.1 Operations
The main operations in linear algebra are presented below.

Addition The sum of two vectors v and w is the vector v + w whose


𝑖-th entry is 𝑣 𝑖 +𝑤 𝑖 . The sum of two matrices 𝐴 and 𝐵 is the matrix 𝐴+𝐵
whose 𝑖, 𝑗-th entry is 𝑎𝑖𝑗 + 𝑏𝑖𝑗 . (The same rules apply to subtraction.)

Scalar multiplication The product of a scalar 𝛼 and a vector v is the


vector 𝛼v whose 𝑖-th entry is 𝛼𝑣 𝑖 . Similarly, the product of a scalar 𝛼 and
a matrix 𝐴 is the matrix 𝛼𝐴 whose 𝑖, 𝑗-th entry is 𝛼𝑎𝑖𝑗 .

Dot product The dot product of two vectors v and w is the scalar

v ⋅ w = ∑_{𝑖=1}^{𝑛} 𝑣𝑖 𝑤𝑖 .

The dot product is also called the inner product.

7G. Strang (2023). Introduction to Linear Algebra. 6th ed. Wellesley-Cambridge Press,
p. 440. isbn: 978-1733146678.

Matrix multiplication The product of two matrices 𝐴 and 𝐵 is the


matrix 𝐶 = 𝐴𝐵 whose 𝑖, 𝑗-th entry is
𝑐𝑖𝑗 = ∑_{𝑘=1}^{𝑛} 𝑎𝑖𝑘 𝑏𝑘𝑗 .

The number of columns of 𝐴 must be equal to the number of rows of 𝐵,


and the resulting matrix 𝐶 has the same number of rows as 𝐴 and the
same number of columns as 𝐵. Unless otherwise stated, we consider
the vector v with length 𝑛 as a column matrix, i.e., a matrix with one
column and 𝑛 rows.

Transpose The transpose of a matrix 𝐴 is the matrix 𝐴𝑇 whose 𝑖, 𝑗-


th entry is the 𝑗, 𝑖-th entry of 𝐴. If 𝐴 is a square matrix, then 𝐴𝑇 is the
matrix obtained by reflecting 𝐴 along its main diagonal.

Determinant The determinant of a square matrix 𝐴 is a scalar that is


a measure of the (signed) volume of the parallelepiped spanned by the
columns of 𝐴. It is denoted by det(𝐴) or |𝐴|.
The determinant is nonzero if and only if the matrix is invertible
and the linear map represented by the matrix is an isomorphism – i.e.,
it preserves the dimension of the vector space. The determinant of a
product of matrices is the product of their determinants.
Particularly, the determinant of a 2 × 2 matrix is

| 𝑎 𝑏 |
| 𝑐 𝑑 | = 𝑎𝑑 − 𝑏𝑐.

Inverse matrix An 𝑛 × 𝑛 matrix 𝐴 has an inverse 𝑛 × 𝑛 matrix 𝐴−1 if

𝐴𝐴−1 = 𝐴−1 𝐴 = 𝐼𝑛 ,

where 𝐼𝑛 is the 𝑛 × 𝑛 identity matrix, i.e., a matrix whose diagonal en-


tries are 1 and all other entries are 0. If such a matrix exists, 𝐴 is said
to be invertible. A square matrix that is not invertible is called singu-
lar. A square matrix with entries in a field is singular if and only if its
determinant is zero.
To calculate the inverse of a matrix, we can use the formula

𝐴⁻¹ = adj(𝐴) / det(𝐴),

where adj(𝐴) is the adjugate (or adjoint) of 𝐴, i.e., the transpose of the
cofactor matrix of 𝐴.
The cofactor of the 𝑖, 𝑗-th entry of a matrix 𝐴 is the determinant of
the matrix obtained by removing the 𝑖-th row and the 𝑗-th column of 𝐴,
multiplied by (−1)𝑖+𝑗 .
In the case of a 2 × 2 matrix, the inverse is

⎛ 𝑎 𝑏 ⎞⁻¹      1      ⎛  𝑑  −𝑏 ⎞
⎝ 𝑐 𝑑 ⎠    = ───────  ⎝ −𝑐   𝑎 ⎠ .
             𝑎𝑑 − 𝑏𝑐

A.3.2 Systems of linear equations


A system of linear equations is a collection of linear equations that share
their unknowns. It is usually written in matrix form as 𝐴x = b, where
𝐴 is a matrix of constants, x is a vector of unknowns, and b is a vector
of constants.
The system has a unique solution if and only if the matrix 𝐴 is in-
vertible. The solution is x = 𝐴−1 b.
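A numerical sketch with NumPy, whose np.linalg.solve routine is generally preferable to forming the inverse explicitly; the 2 × 2 system below is made up for illustration.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)             # solves A x = b directly
x_via_inverse = np.linalg.inv(A) @ b  # same result, but less stable and more costly

print(x, np.allclose(x, x_via_inverse))  # [0.8 1.4] True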

A.3.3 Eigenvalues and eigenvectors


An eigenvalue of an 𝑛 × 𝑛 square matrix 𝐴 is a scalar 𝜆 such that there
exists a non-zero vector v satisfying

𝐴v = 𝜆v. (A.1)

The vector v is called an eigenvector of 𝐴 corresponding to 𝜆.


The eigenvalues of a matrix are the roots of its characteristic polyno-
mial, i.e., the roots of the polynomial det(𝐴 − 𝜆𝐼𝑛 ) = 0, where 𝐼𝑛 is the
𝑛 × 𝑛 identity matrix.
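Numerically, eigenvalues and eigenvectors can be obtained with np.linalg.eig; a small sketch:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # the eigenvalues 2 and 3 (order is not guaranteed)

# Each column v of `eigenvectors` satisfies A v = lambda v.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))  # True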

A.4 Probability
Probability is the branch of mathematics that studies the likelihood of
events. It is used to model uncertainty and randomness. The basic ob-
jects of probability are events and random variables.
For a comprehensive material about probability theory, the reader is
referred to Ross (2018)8 and Ross (2023)9.
8S. M. Ross (2018). A First Course in Probability. 10th ed. Pearson, p. 528. isbn: 978-
1292269207.
9S. M. Ross (2023). Introduction to Probability Models. 13th ed. Academic Press, p. 870.
isbn: 978-0443187612.

A.4.1 Axioms of probability and main concepts


The Kolmogorov axioms of probability are the foundation of probability
theory. They are

1. The probability of an event 𝐴 is a non-negative real number, i.e.


P(𝐴) ≥ 0;

2. The probability of the sample space¹⁰, denoted by Ω, is one, i.e.


P(Ω) = 1; and

3. The probability of the union of disjoint events, 𝐴 ∩ 𝐵 = ∅, is the


sum of the probabilities of the events, i.e. P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵).

Sum rule A particular consequence of the third axiom is the addition


law of probability. If 𝐴 and 𝐵 are not disjoint, then

P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵) − P(𝐴 ∩ 𝐵).

Joint probability The joint probability of two events 𝐴 and 𝐵 is the


probability that both events occur. It is denoted by P(𝐴, 𝐵) = P(𝐴 ∩ 𝐵).

Law of total probability The law of total probability states that if


𝐵1 , … , 𝐵𝑛 are disjoint events such that ∪𝑛𝑖=1 𝐵𝑖 = Ω, then for any event 𝐴,
we have that
𝑛
P(𝐴) = ∑ P(𝐴, 𝐵𝑖 ).
𝑖=1

Conditional probability The conditional probability of an event 𝐴


given an event 𝐵 is the probability that 𝐴 occurs given that 𝐵 occurs. It
is denoted by P(𝐴 ∣ 𝐵).

Independence Two events 𝐴 and 𝐵 are independent if the probability


of 𝐴 given 𝐵 is the same as the probability of 𝐴, i.e., P(𝐴 ∣ 𝐵) = P(𝐴). It
is equivalent to P(𝐴, 𝐵) = P(𝐴) ⋅ P(𝐵).

¹⁰The set of all possible outcomes.



Bayes’ rule Bayes’ rule is a formula that relates the conditional prob-
ability of an event 𝐴 given an event 𝐵 to the conditional probability of 𝐵
given 𝐴. It is
P(𝐴 ∣ 𝐵) = P(𝐵 ∣ 𝐴) ⋅ P(𝐴) / P(𝐵). (A.2)
Bayes’ rule is one of the most important formulas in probability theory
and is used in many areas of science and engineering. Particularly, for
data science, it is used in Bayesian statistics and machine learning.
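As a small worked example with made-up numbers: suppose a condition affects 1% of cases (P(𝐴) = 0.01), a test detects it with probability 0.95 when it is present (P(𝐵 ∣ 𝐴)), and the test also fires on 5% of unaffected cases. Bayes' rule gives the probability of the condition given a positive test:

p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B | A): positive test when the condition is present
p_b_given_not_a = 0.05  # P(B | not A): false positive rate

# Law of total probability: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' rule (eq. A.2)
print(round(p_a_given_b, 3))           # 0.161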

A.4.2 Random variables


A random variable is a function that maps the sample space Ω to the
real numbers. It is denoted by a capital letter, e.g., 𝑋.
Formally, let 𝑋 ∶ Ω → 𝐸 be a random variable. The probability that
𝑋 takes on a value in a set 𝐴 ⊆ 𝐸 is

P(𝑋 ∈ 𝐴) = P({𝜔 ∈ Ω ∶ 𝑋(𝜔) ∈ 𝐴}). (A.3)

If 𝐸 = ℝ, then 𝑋 is a continuous random variable. If 𝐸 = ℤ, then 𝑋


is a discrete random variable. The random variable 𝑋 is said to follow
a certain probability distribution 𝑃 — denoted by 𝑋 ∼ 𝑃 — given by its
probability mass function or probability density function — see below.

Probability mass function The probability mass function (PMF) of


a discrete random variable 𝑋 is the function 𝑝𝑋 ∶ ℤ → [0, 1] defined by

𝑝𝑋 (𝑥) = P(𝑋 = 𝑥). (A.4)

Probability density function We call probability density function


(PDF) of a continuous random variable 𝑋 the function 𝑓𝑋 ∶ ℝ → [0, ∞)
defined by
P(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_{𝑎}^{𝑏} 𝑓𝑋 (𝑥) 𝑑𝑥. (A.5)

Cumulative distribution function Similarly, the cumulative distri-


bution function (CDF) of a random variable 𝑋 is the function 𝐹𝑋 ∶ ℝ →
[0, 1] defined by
𝐹𝑋 (𝑥) = P(𝑋 ≤ 𝑥). (A.6)
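These functions are readily available for the common distributions of section A.4.4, for instance in scipy.stats (a sketch):

from scipy import stats

poisson = stats.poisson(mu=3)        # discrete: Poisson with lambda = 3
print(poisson.pmf(2))                # P(X = 2)
print(poisson.cdf(2))                # P(X <= 2)

normal = stats.norm(loc=0, scale=1)  # continuous: standard normal
print(normal.pdf(0.0))               # density at 0 (approx. 0.3989)
print(normal.cdf(1.96))              # P(X <= 1.96) (approx. 0.975)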

A.4.3 Expectation and moments


Expectation is a measure of the average value of a random variable. Mo-
ments are measures of the shape of a probability distribution.

Expectation The expectation of a random variable 𝑋 is the average


value of 𝑋. It is denoted by E[𝑋]. By definition, it is

E[𝑋] = ∑_{𝑥} 𝑥 ⋅ 𝑝𝑋 (𝑥),

if 𝑋 is discrete, or

E[𝑋] = ∫_{−∞}^{∞} 𝑥 ⋅ 𝑓𝑋 (𝑥) 𝑑𝑥,

if 𝑋 is continuous.
The main properties of expectation are listed below.
The expectation operator is linear. Given two random variables 𝑋
and 𝑌 and a real number 𝑐, we have

E[𝑐𝑋] = 𝑐 E[𝑋],

E[𝑋 + 𝑐] = E[𝑋] + 𝑐,
and
E[𝑋 + 𝑌 ] = E[𝑋] + E[𝑌 ].
Under a more general setting, given a function 𝑔 ∶ ℝ → ℝ, the
expectation of 𝑔(𝑋) is

E[𝑔(𝑋)] = ∑_{𝑥} 𝑔(𝑥) ⋅ 𝑝𝑋 (𝑥),

if 𝑋 is discrete, or

E[𝑔(𝑋)] = ∫_{−∞}^{∞} 𝑔(𝑥) ⋅ 𝑓𝑋 (𝑥) 𝑑𝑥,

if 𝑋 is continuous.

Variance The variance of a random variable 𝑋 is a measure of how


spread out the values of 𝑋 are. It is denoted by Var(𝑋). By definition, it
is
Var(𝑋) = E[(𝑋 − E[𝑋])²] . (A.7)

Note that, as a consequence, the expectation of 𝑋 2 — called the sec-


ond moment — is
E[𝑋 2 ] = Var(𝑋) + E[𝑋]2 ,

since
2
Var(𝑋) = E[(𝑋 − E[𝑋]) ]
= E[𝑋 2 − 2𝑋 E[𝑋] + E[𝑋]2 ]
= E[𝑋 2 ] − 2 E[𝑋] E[𝑋] + E[𝑋]2
= E[𝑋 2 ] − E[𝑋]2 .

Higher moments are defined similarly. The 𝑘-th moment of 𝑋 is

E[𝑋ᵏ] = ∑_{𝑥} 𝑥ᵏ ⋅ 𝑝𝑋 (𝑥),

if 𝑋 is discrete, or

E[𝑋ᵏ] = ∫_{−∞}^{∞} 𝑥ᵏ ⋅ 𝑓𝑋 (𝑥) 𝑑𝑥,

if 𝑋 is continuous.

Sample mean The sample mean is the average of a sample of ran-


dom variables. Given a sample 𝑋1 , … , 𝑋𝑛 such that 𝑋𝑖 ∼ 𝑋 for all 𝑖, the
sample mean is
𝑋̄ = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋𝑖 .

Law of large numbers The law of large numbers states that the aver-
age of a large number of independent and identically distributed (i.i.d.)
random variables converges to the expectation of the random variable.
Mathematically,
lim_{𝑛→∞} (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋𝑖 = E[𝑋],

given 𝑋𝑖 ∼ 𝑋 for all 𝑖.



Sample variance The sample variance is a measure of how spread


out the values of a sample are. Given a sample 𝑋1 , … , 𝑋𝑛 such that 𝑋𝑖 ∼
𝑋 for all 𝑖, the sample variance is
𝑆² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝑋̄)² .

Note that the denominator is 𝑛 − 1 instead of 𝑛 to correct the bias of the


sample variance.

Sample standard deviation The sample standard deviation is the


square root of the sample variance, i.e., 𝑆 = √𝑆 2 .

Sample skewness The skewness is a measure of the asymmetry of


a probability distribution. The sample skewness is based on the third
moment of the sample. Given a sample 𝑋1 , … , 𝑋𝑛 such that 𝑋𝑖 ∼ 𝑋 for
all 𝑖, the sample skewness is
Skewness = [(1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝑋̄)³] / 𝑆³ .
Skewness is zero for a symmetric distribution. Otherwise, it is positive
for a right-skewed distribution, and negative for a left-skewed distribu-
tion.

Sample kurtosis The kurtosis is a measure of the tailedness of a prob-


ability distribution. The sample kurtosis is based on the fourth moment
of the sample. Given a sample 𝑋1 , … , 𝑋𝑛 such that 𝑋𝑖 ∼ 𝑋 for all 𝑖, the
sample kurtosis is
Kurtosis = [(1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑋𝑖 − 𝑋̄)⁴] / 𝑆⁴ − 3.
Kurtosis is positive if the tails are heavier than a normal distribution,
and negative if the tails are lighter.
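A NumPy sketch that computes these sample statistics following the formulas above; the data vector is made up for illustration.

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()
var = x.var(ddof=1)   # sample variance with the n - 1 denominator
std = np.sqrt(var)    # sample standard deviation

skewness = np.mean((x - mean) ** 3) / std ** 3
kurtosis = np.mean((x - mean) ** 4) / std ** 4 - 3

print(mean, var, round(skewness, 3), round(kurtosis, 3))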

A.4.4 Common probability distributions


Several phenomena in nature and society can be modeled as random
variables. Some distributions are frequently used to model these phe-
nomena. The main ones are listed below.

Bernoulli distribution The Bernoulli distribution is a discrete distri-


bution with two possible outcomes, usually called success and failure. It
is parametrized by a single parameter 𝑝 ∈ [0, 1], which is the probability
of success. It is denoted by Bern(𝑝).
The expected value of 𝑋 ∼ Bern(𝑝) is E[𝑋] = 𝑝, and the variance is
Var(𝑋) = 𝑝(1 − 𝑝).

Poisson distribution The Poisson distribution is a discrete distribu-


tion that models the number of events occurring in a fixed interval of
time or space. It is parametrized by a single parameter 𝜆 > 0, which is
the average number of events in the interval. It is denoted by Poisson(𝜆).
The probability mass function of 𝑋 ∼ Poisson(𝜆) is

𝑝𝑋 (𝑥) = 𝑒^{−𝜆} 𝜆^{𝑥} / 𝑥! . (A.8)
The expected value of 𝑋 ∼ Poisson(𝜆) is E[𝑋] = 𝜆, and the variance
is Var(𝑋) = 𝜆.

Normal distribution The normal distribution is a continuous distri-


bution with a bell-shaped density. It is parametrized by two parameters,
the mean 𝜇 ∈ ℝ and the standard deviation 𝜎 > 0. It is denoted by
𝒩(𝜇, 𝜎2 ).
The special case where 𝜇 = 0 and 𝜎 = 1 is called the standard normal
distribution. It is denoted by 𝒩(0, 1).
The probability density function of 𝑋 ∼ 𝒩(𝜇, 𝜎²) is

𝑓𝑋 (𝑥) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)²/(2𝜎²)). (A.9)

The expected value of 𝑋 ∼ 𝒩(𝜇, 𝜎2 ) is E[𝑋] = 𝜇, and the variance is


Var(𝑋) = 𝜎2 .

Central limit theorem The central limit theorem states that the nor-
malized version of the sample mean converges to a standard normal
distribution11. Given 𝑋1 , … , 𝑋𝑛 i.i.d. random variables with mean 𝜇 and
finite variance 𝜎2 < ∞,

√𝑛(𝑋̄ − 𝜇) ∼ 𝒩(0, 𝜎2 ),
11This statement of the central limit theorem is known as the Lindeberg-Levy CLT.
There are other versions of the central limit theorem, some more general and some more
restrictive.

as 𝑛 → ∞. In other words, for a large enough 𝑛, the distribution of


the sample mean gets closer12 to a normal distribution with mean 𝜇 and
variance 𝜎2 /𝑛.
The central limit theorem is one of the most important results in
probability theory and statistics. Its implications are fundamental in
many areas of science and engineering.
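A quick simulation sketch illustrates the theorem: averaging many draws from a skewed distribution yields sample means whose spread matches 𝜎/√𝑛.

import numpy as np

rng = np.random.default_rng(42)
n, repetitions = 200, 10_000

# Exponential(1) has mean 1 and standard deviation 1, but is far from normal.
sample_means = rng.exponential(scale=1.0, size=(repetitions, n)).mean(axis=1)

print(sample_means.mean())  # close to mu = 1
print(sample_means.std())   # close to sigma / sqrt(n) = 1 / sqrt(200), about 0.0707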

T distribution The T distribution is a continuous distribution with


a bell-shaped density. It is parametrized by a single parameter 𝜈 > 0,
called the degrees of freedom. It is denoted by 𝒯(𝜈).
The T distribution generalizes to the three-parameter location-scale
t distribution 𝒯(𝜇, 𝜎2 , 𝜈), where 𝜇 is the location parameter and 𝜎 is
the scale parameter. Thus, given 𝑋 ∼ 𝒯(𝜈), we have that 𝜇 + 𝜎𝑋 ∼
𝒯(𝜇, 𝜎2 , 𝜈).
Note that
lim_{𝜈→∞} 𝒯(𝜈) = 𝒩(0, 1).

Thus, the T distribution converges to the standard normal distribution


as the degrees of freedom go to infinity.

Gamma distribution The Gamma distribution is a continuous dis-


tribution with a right-skewed density. It is parametrized by two param-
eters, the shape parameter 𝛼 > 0 and the rate parameter 𝛽 > 0. It is
denoted by Gamma(𝛼, 𝛽).
The probability density function of 𝑋 ∼ Gamma(𝛼, 𝛽) is

𝑓𝑋 (𝑥) = 𝛽^{𝛼} 𝑥^{𝛼−1} 𝑒^{−𝛽𝑥} / Γ(𝛼), (A.10)

where Γ(𝛼) is the gamma function, defined by



Γ(𝛼) = ∫_{0}^{∞} 𝑡^{𝛼−1} 𝑒^{−𝑡} 𝑑𝑡. (A.11)

In Bayesian analysis, the Gamma distribution is commonly used as


a conjugate prior. A conjugate prior is a prior distribution that, when
combined with the likelihood, results in a posterior distribution that is
of the same family as the prior.
12Formally, this is called convergence in distribution, refer to P. Billingsley (1995).
Probability and Measure. 3rd ed. John Wiley & Sons. isbn: 0-471-00710-2 for more de-
tails.

A.4.5 Permutations and combinations


For the sake of reference, we present some definitions and formulas
from combinatorics. Combinatorics is the branch of mathematics that
studies the counting of objects.

Factorial The factorial of a non-negative integer 𝑛 is the product of


all positive integers up to 𝑛. It is denoted by

𝑛! = 𝑛 ⋅ (𝑛 − 1) ⋅ … ⋅ 2 ⋅ 1.

By definition, 0! = 1.

Permutation A permutation is an arrangement of a set of elements.


The number of permutations of 𝑛 elements is 𝑛!. Permutations are used
in combinatorics to count the number of ways to arrange a set of ele-
ments.

Combination A combination is a selection of a subset of elements


from a set. The number of combinations of 𝑘 elements from a set of 𝑛
elements is
⎛𝑛⎞        𝑛!
⎝𝑘⎠  =  ─────────── .
        𝑘! (𝑛 − 𝑘)!
Combinations are used in combinatorics to count the number of ways
to select a subset of elements from a set. The binomial coefficient is also
called a choose function, and is read as “𝑛 choose 𝑘”.
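Python's math module exposes these counts directly (factorial, perm, and comb, available since Python 3.8):

import math

print(math.factorial(5))  # 5! = 120
print(math.perm(5, 2))    # ordered arrangements of 2 out of 5 = 20
print(math.comb(5, 2))    # "5 choose 2" = 10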
B Topics on learning machines
Oh, the depth of the riches and wisdom and knowledge of God!
How unsearchable are his judgments and how inscrutable his
ways!
— Romans 11:33 (ESV)

This appendix is under construction. Topics like the kernel trick, back-
propagation, and other machine learning algorithms will be discussed
here.


B.1 Multi-layer perceptron


The multilayer perceptron (MLP) is a non-linear classifier that gener-
ates a set of hyperplanes that separates the classes. In order to simplify
understanding, consider that the activation function of the hidden layer
is the discrete step function

1 if 𝑥 > 0
𝜎(𝑥) = {
0 otherwise.

A model with two neurons in the hidden layer (effectively the combina-
tion of three perceptrons) is

𝑓(𝑥1 , 𝑥2 ; 𝜃 = {w(1) , w(2) , w(3) }) = 𝜎 (w(3) ⋅ [1, 𝜎(w(1) ⋅ x), 𝜎(w(2) ⋅ x)]) .

The parameters w(1) and w(2) represent the hyperplanes that sepa-
rate the classes in the hidden layer, and w(3) represents how the hyper-
planes are combined to generate the output. If we set weights w(1) =
[−0.5, 1, −1] (like the perceptron in the previous example) and w(2) =
[−0.5, −1, 1], we use the third neuron to combine the results of the first
two neurons. This way, a possible solution for the XOR problem is set-
ting w(3) = [0, 1, 1].
Figure B.1 and table B.1 show the class boundaries and the predic-
tions of the MLP for the XOR problem.
Note that there are many possible solutions for the XOR problem us-
ing the MLP. Learning strategies like back-propagation are used to find
the optimal parameters for the model and regularization techniques,
like 𝑙1 and 𝑙2 regularization, are used to prevent overfitting.
Deep learning is the study of neural networks with many layers. The
idea is to use many layers to learn not only the boundaries that separate
the classes (or the function that maps inputs and outputs) but also the
features that are relevant to the problem. A complete discussion of deep
learning can be found in Goodfellow, Bengio, and Courville (2016)1.

1I. Goodfellow, Y. Bengio, and A. Courville (2016). Deep Learning. http : / / www .
deeplearningbook.org. MIT Press.

Figure B.1: MLP class boundaries for the XOR problem.

[Figure: the 𝑥1–𝑥2 plane with the four XOR points and the two linear boundaries learned by the hidden neurons.]

MLP with two neurons in the hidden layer generates two linear hyperplanes that separate the classes, effectively solving the XOR problem.

Table B.1: Truth table for the predictions of the MLP.

𝑥1   𝑥2   𝑦   1st neuron   2nd neuron   𝑦̂
0    0    0    0            0           0
0    1    1    0            1           1
1    0    1    1            0           1
1    1    0    0            0           0

Predictions of the MLP for the XOR problem. The output of the 1st and 2nd neurons are hyperplanes that separate the classes in the hidden layer, which are combined by the 3rd neuron to generate the correct output.
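A sanity-check sketch of this forward pass with the weights discussed above, reproducing table B.1:

import numpy as np

w1 = np.array([-0.5, 1.0, -1.0])  # first hidden neuron
w2 = np.array([-0.5, -1.0, 1.0])  # second hidden neuron
w3 = np.array([0.0, 1.0, 1.0])    # output neuron

def step(z):
    return 1 if z > 0 else 0

def mlp(x1, x2):
    x = np.array([1.0, x1, x2])            # prepend the bias input
    h = [step(w1 @ x), step(w2 @ x)]       # hidden layer activations
    return step(w3 @ np.array([1.0, *h]))  # combine with the output neuron

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, mlp(x1, x2))  # outputs 0, 1, 1, 0 — the XOR truth table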

B.2 Decision trees


The decision tree is a non-linear classifier that generates a set of hyper-
planes that are orthogonal to the axes. Consider the decision tree in
fig. B.2.

Figure B.2: Decision tree representation.

[Figure: the root node tests 𝑥1; if 𝑥1 ≤ 0.5, the prediction is 𝑦̂ = 0; otherwise a second node tests 𝑥2, predicting 𝑦̂ = 0 if 𝑥2 ≤ 0.5 and 𝑦̂ = 1 if 𝑥2 > 0.5.]

The decision tree that solves the AND problem.

The spatial representation of the decision tree is shown in fig. B.3. As stated above, the hyperplanes generated by a decision tree are orthogonal to the axes.
Decision trees are nonparametric models: one can easily increase the depth of the tree to fit the data, generating as many hyperplanes as necessary to separate the classes. Training a decision tree with a large depth can lead to overfitting, so it is important to use techniques like depth limiting and pruning to prevent this from happening.

Figure B.3: Decision tree spatial representation.

[Figure: the 𝑥1–𝑥2 plane partitioned by axis-orthogonal boundaries at the split values of the tree.]

Decision trees assume that the classes can be separated with hyperplanes orthogonal to the axes.
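The tree of fig. B.2 translates to two nested comparisons; a minimal sketch for the AND problem:

def predict(x1, x2):
    # Decision tree of fig. B.2: axis-orthogonal splits at 0.5.
    if x1 <= 0.5:
        return 0
    if x2 <= 0.5:
        return 0
    return 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, predict(x1, x2))  # outputs 0, 0, 0, 1 — the AND truth table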
Bibliography

Aristotle (2019). Categorias (Κατηγορiαι). Greek and Portuguese. Trans.


by J. V. T. da Mata. São Paulo, Brasil: Editora Unesp. isbn: 978-85-
393-0785-2 (cit. on p. 26).
Baijens, J., R. Helms, and D. Iren (2020). “Applying Scrum in Data Sci-
ence Projects”. In: 2020 IEEE 22nd Conference on Business Informat-
ics (CBI). Vol. 1, pp. 30–38. doi: 10.1109/CBI49978.2020.00011 (cit.
on p. 41).
Beaumont, P. B. and R. G. Bednarik (2013). In: Rock Art Research 30.1,
pp. 33–54. doi: 10.3316/informit.488018706238392 (cit. on p. 7).
Benavoli, A., G. Corani, J. Demšar, and M. Zaffalon (2017). “Time for
a Change: a Tutorial for Comparing Multiple Classifiers Through
Bayesian Analysis”. In: Journal of Machine Learning Research 18.77,
pp. 1–36. url: http : / / jmlr . org / papers / v18 / 16 - 305 . html (cit. on
pp. 180–182).
Billingsley, P. (1995). Probability and Measure. 3rd ed. John Wiley &
Sons. isbn: 0-471-00710-2 (cit. on p. 207).
Breiman, L. (1996). “Bagging predictors”. In: Machine Learning 24.2,
pp. 123–140. doi: 10.1007/BF00058655 (cit. on p. 17).
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002).
“SMOTE: synthetic minority over-sampling technique”. In: Journal
of artificial intelligence research 16, pp. 321–357 (cit. on p. 154).


Cleveland, W. S. (2001). “Data Science: An Action Plan for Expanding


the Technical Areas of the Field of Statistics”. In: ISI Review. Vol. 69,
pp. 21–26 (cit. on p. 4).
Cobb, C. G. (2015). The Project Manager’s Guide to Mastering Agile: Prin-
ciples and Practices for an Adaptive Approach. John Wiley & Sons
(cit. on p. 39).
Codd, E. F. (1970). “A Relational Model of Data for Large Shared Data
Banks”. In: Commun. ACM 13.6, pp. 377–387. issn: 0001-0782. doi:
10.1145/362384.362685 (cit. on p. 9).
Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2022). In-
troduction to Algorithms. 4th ed. The MIT Press, p. 1312. isbn: 978-
0262046305 (cit. on pp. 187, 190, 192).
Cortes, C. and V. N. Vapnik (1995). “Support-vector networks”. In: Ma-
chine Learning 20.3, pp. 273–297. doi: 10.1007/BF00994018 (cit. on
p. 17).
Cover, T. M. (1965). “Geometrical and Statistical Properties of Systems
of Linear Inequalities with Applications in Pattern Recognition”. In:
IEEE Transactions on Electronic Computers EC-14.3, pp. 326–334.
doi: 10.1109/PGEC.1965.264137 (cit. on p. 17).
Dean, J. and S. Ghemawat (Jan. 2008). “MapReduce: simplified data pro-
cessing on large clusters”. In: Commun. ACM 51.1, pp. 107–113. issn:
0001-0782. doi: 10.1145/1327452.1327492 (cit. on p. 11).
Denning, S. (2016). “Why Agile Works: Understanding the Importance
of Scrum in Modern Software Development”. In: Forbes. url: https:
/ / www . forbes . com / sites / stevedenning / 2016 / 08 / 10 / why - agile -
works/ (cit. on p. 38).
Ester, M., H.-P. Kriegel, J. Sander, X. Xu, et al. (1996). “A density-based
algorithm for discovering clusters in large spatial databases with
noise”. In: kdd. Vol. 96. 34, pp. 226–231 (cit. on p. 148).
Fagin, R. (1979). “Normal forms and relational database operators”. In:
Proceedings of the 1979 ACM SIGMOD International Conference on
Management of Data. SIGMOD ’79. Boston, Massachusetts: Associa-
tion for Computing Machinery, pp. 153–160. isbn: 089791001X. doi:
10.1145/582095.582120 (cit. on p. 61).
Friedman, J. H. (2001). “Greedy function approximation: A gradient
boosting machine.” In: The Annals of Statistics 29.5, pp. 1189–1232.
doi: 10.1214/aos/1013203451 (cit. on p. 17).
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. http:
//www.deeplearningbook.org. MIT Press (cit. on p. 210).

Grajalez, C. G., E. Magnello, R. Woods, and J. Champkin (2013). “Great


moments in statistics”. In: Significance 10.6, pp. 21–28. doi: 10.1111/
j.1740-9713.2013.00706.x (cit. on p. 7).
Guttag, J. V. (2021). Introduction to Computation and Programming Us-
ing Python. With Application to Computational Modeling and Under-
standing Data. 3rd ed. The MIT Press, p. 664. isbn: 978-0262542364
(cit. on p. 187).
Hayashi, C. (1998). “What is Data Science? Fundamental Concepts and
a Heuristic Example”. In: Data Science, Classification, and Related
Methods. Ed. by C. Hayashi, K. Yajima, H.-H. Bock, N. Ohsumi, Y.
Tanaka, and Y. Baba. Tokyo, Japan: Springer Japan, pp. 40–51. isbn:
978-4-431-65950-1 (cit. on p. 21).
Hillis, W. D. (1985). “The Connection Machine”. PhD thesis. Cambridge, MA, USA: Massachusetts Institute of Technology. url: http://hdl.handle.net/1721.1/14719 (cit. on p. 11).
Ho, T. K. (1995). “Random decision forests”. In: Proceedings of 3rd Inter-
national Conference on Document Analysis and Recognition. Vol. 1,
278–282 vol.1. doi: 10.1109/ICDAR.1995.598994 (cit. on p. 17).
Hunt, E. B., J. Marin, and P. J. Stone (1966). Experiments in Induction.
New York, NY, USA: Academic Press (cit. on p. 15).
Ifrah, G. (1998). The Universal History of Numbers, from Prehistory to the
Invention of the Computer. First published in French, 1994. London:
Harvill. isbn: 1 86046 324 x (cit. on p. 7).
Jurafsky, D. and J. H. Martin (2008). Speech and Language Processing.
An Introduction to Natural Language Processing, Computational Lin-
guistics, and Speech Recognition. 2nd ed. Hoboken, NJ, USA: Prentice
Hall (cit. on p. 160).
— (2024). Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recog-
nition with Language Models. 3rd ed. Online manuscript released
August 20, 2024. url: https : / / web . stanford . edu / ~jurafsky / slp3/
(cit. on p. 160).
Kelleher, J. D. and B. Tierney (2018). Data science. The MIT Press (cit. on
pp. 6, 22).
Kraut, N. and F. Transchel (2022). “On the Application of SCRUM in
Data Science Projects”. In: 2022 7th International Conference on Big
Data Analytics (ICBDA), pp. 1–9. doi: 10.1109/ICBDA55095.2022.
9760341 (cit. on p. 41).
Le Cun, Y. (1986). “Learning Process in an Asymmetric Threshold Net-
work”. In: Disordered Systems and Biological Organization. Berlin,

Heidelberg: Springer Berlin Heidelberg, pp. 233–240. isbn: 978-3-


642-82657-3 (cit. on p. 16).
Naur, P. (1974). Concise Survey of Computer Methods. Lund, Sweden:
Studentlitteratur. isbn: 91-44-07881-1. url: http://www.naur.com/
Conc.Surv.html (cit. on p. 3).
Quinlan, J. R. (1986). “Induction of Decision Trees”. In: Machine Learn-
ing 1, pp. 81–106. url: https://api.semanticscholar.org/CorpusID:
13252401 (cit. on p. 16).
Rosen, K. H. (2018). Discrete Mathematics and Its Applications. 8th ed.
McGraw Hill, p. 1120. isbn: 9781259676512 (cit. on p. 197).
Ross, S. M. (2018). A First Course in Probability. 10th ed. Pearson, p. 528.
isbn: 978-1292269207 (cit. on p. 200).
— (2023). Introduction to Probability Models. 13th ed. Academic Press,
p. 870. isbn: 978-0443187612 (cit. on p. 200).
Rubin, S. (2012). “Scrum for Teams: Maximizing Efficiency in Short It-
erations”. In: Agile Processes Journal 8, pp. 45–52 (cit. on p. 39).
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). “Learning
representations by back-propagating errors”. In: Nature 323.6088,
pp. 533–536. doi: 10.1038/323533a0 (cit. on p. 16).
Saltz, J. and A. Sutherland (2019). “SKI: An Agile Framework for Data
Science”. In: 2019 IEEE International Conference on Big Data (Big
Data), pp. 3468–3476. doi: 10 . 1109 / BigData47090 . 2019 . 9005591
(cit. on p. 41).
Schapire, R. E. (1990). “The strength of weak learnability”. In: Machine
Learning 5.2, pp. 197–227. doi: 10.1007/BF00116037 (cit. on p. 17).
Schölkopf, B., J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. William-
son (2001). “Estimating the support of a high-dimensional distribu-
tion”. In: Neural computation 13.7, pp. 1443–1471 (cit. on p. 149).
Schwaber, K. and J. Sutherland (2020). Scrum Guide: The Definitive
Guide to Scrum: The Rules of the Game. Scrum.org. url: https :
//scrumguides.org/docs/scrumguide/v2020/2020- Scrum- Guide-
US.pdf (cit. on p. 38).
Smith, J. (2019). “Understanding Scrum Roles: Product Owner, Scrum
Master, and Development Team”. In: Open Agile Journal 12, pp. 22–
28 (cit. on p. 39).
Song, J., H. V. Jagadish, and G. Alter (2021). “SDTA: An Algebra for
Statistical Data Transformation”. In: Proc. of 33rd International Con-
ference on Scientific and Statistical Database Management (SSDBM
2021). Tampa, FL, USA: Association for Computing Machinery,
p. 12. doi: 10.1145/3468791.3468811 (cit. on p. 109).

Strang, G. (2023). Introduction to Linear Algebra. 6th ed. Wellesley-


Cambridge Press, p. 440. isbn: 978-1733146678 (cit. on p. 198).
Stulp, F. and O. Sigaud (2015). “Many regression algorithms, one unified
model: A review”. In: Neural Networks 69, pp. 60–79. issn: 0893-6080.
doi: https://doi.org/10.1016/j.neunet.2015.05.005 (cit. on p. 153).
Szeliski, R. (2022). Computer vision. Algorithms and applications. 2nd ed.
Springer Nature. url: https://szeliski.org/Book/ (cit. on p. 160).
Takens, F. (2006). “Detecting strange attractors in turbulence”. In: Dy-
namical Systems and Turbulence, Warwick 1980: proceedings of a
symposium held at the University of Warwick 1979/80. Springer,
pp. 366–381 (cit. on p. 79).
Tofallis, C. (2015). “A better measure of relative prediction accuracy
for model selection and model estimation”. In: Journal of the Opera-
tional Research Society 66.8, pp. 1352–1362. doi: 10.1057/jors.2014.
103 (cit. on p. 172).
Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tib-
shirani, D. Botstein, and R. B. Altman (June 2001). “Missing value
estimation methods for DNA microarrays”. In: Bioinformatics 17.6,
pp. 520–525. issn: 1367-4803. doi: 10.1093/bioinformatics/17.6.520
(cit. on p. 151).
Vapnik, V. and A. Chervonenkis (1968). “On the uniform convergence
of relative frequencies of events to their probabilities”. In: Doklady
Akademii Nauk USSR. Vol. 181. 4, pp. 781–787 (cit. on p. 124).
Vapnik, V. N. and A. Chervonenkis (1991). “The necessary and suffi-
cient conditions for consistency of the method of empirical risk min-
imization”. In: Pattern Recognition and Image Analysis 1.3, pp. 284–
305 (cit. on p. 122).
Vapnik, V. N. (1999). The nature of statistical learning theory. 2nd ed.
Springer-Verlag New York, Inc. isbn: 978-1-4419-3160-3 (cit. on
pp. 6, 115, 122, 126, 128, 137, 139).
Vapnik, V. N. and R. Izmailov (2015). “Learning with Intelligent Teacher:
Similarity Control and Knowledge Transfer”. In: Statistical Learn-
ing and Data Sciences. Ed. by A. Gammerman, V. Vovk, and H.
Papadopoulos. Cham: Springer International Publishing, pp. 3–32.
isbn: 978-3-319-17091-6 (cit. on p. 18).
Velleman, P. F. and L. Wilkinson (1993). “Nominal, Ordinal, Interval,
and Ratio Typologies are Misleading”. In: The American Statistician
47.1, pp. 65–72. doi: 10.1080/00031305.1993.10475938 (cit. on p. 58).
Verri, F. A. N. (2024). Data Science Project: An Inductive Learning Ap-
proach. Version v0.1.0. doi: 10.5281/zenodo.14498011. url: https:
//leanpub.com/dsp (cit. on p. ii).

Vincent, M. W. (1997). “A corrected 5NF definition for relational


database design”. In: Theoretical Computer Science 185.2. Theoreti-
cal Computer Science in Australia and New Zealand, pp. 379–391.
issn: 0304-3975. doi: 10.1016/S0304-3975(97)00050-9 (cit. on p. 63).
Wickham, H. (2014). “Tidy Data”. In: Journal of Statistical Software
59.10, pp. 1–23. doi: 10.18637/jss.v059.i10 (cit. on pp. 64, 66, 73).
Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund (2023). R for
Data Science: Import, Tidy, Transform, Visualize, and Model Data.
2nd ed. O’Reilly Media (cit. on pp. 21, 36, 66, 73, 90).
Williams, C. K. I. (Apr. 2021). “The Effect of Class Imbalance on
Precision-Recall Curves”. In: Neural Computation 33.4, pp. 853–
857. issn: 0899-7667. doi: 10.1162/neco_a_01362 (cit. on p. 169).
Zumel, N. and J. Mount (2019). Practical Data Science with R. 2nd ed.
Shelter Island, NY, USA: Manning (cit. on pp. 21, 35–37, 42, 44, 50,
116).
Glossary

AI artificial intelligence 113

BI business intelligence 10

BNF Backus–Naur form 3

CDF cumulative distribution function 202

CI/CD continuous integration/continuous deployment 53

CNN convolutional neural network 160

data leakage A situation where information from the test set is used
to transform the training set in any way or to train the model. 50,
51, 81, 87, 144, 164, 178

ERM empirical risk minimization 16, 17, 122, 126

ETL extract, transform, load 9

FIFO first-in-first-out 194

HDFS Hadoop distributed file system 11


IBM International Business Machines Corporation 8


IoT internet of things 12
IQR interquartile range 148

LIFO last-in-first-out 192


LUSI learning using statistical inference 18

ML machine learning 113, 114, 139


MLP multilayer perceptron 210, 211
model A general function that can be used to estimate the relationship
between the input and output variables in a dataset. 42

ontology Ontology is the study of being, existence, and reality. In com-


puter science and information science, an ontology is a formal
naming and definition of the types, properties, and interrelation-
ships of the entities that really or fundamentally exist for a partic-
ular domain. 26

PCA principal component analysis 159


PDF probability density function 202
PMF probability mass function 202

RDBMS relational database management system 9, 29

SLT statistical learning theory 111, 114–116, 122


SQL structured query language 9
SRM structural risk minimization 127–129, 131, 132, 139
SVM support vector machine 17
