F.A.N. VERRI
DATA SCIENCE PROJECT
AN INDUCTIVE LEARNING APPROACH
FILIPE A. N. VERRI
@misc{verri2024datascienceproject,
author = {Verri, Filipe Alves Neto},
title = {Data Science Project: An Inductive Learning Approach},
year = 2024,
publisher = {Leanpub},
version = {v0.1.0},
doi = {10.5281/zenodo.14498011},
url = {https://leanpub.com/dsp}
}
F. A. N. Verri (2024). Data Science Project: An Inductive Learning Approach. Version v0.1.0.
doi: 10.5281/zenodo.14498011. url: https://leanpub.com/dsp.
Disclaimer: This version is a work in progress. Proofreading and professional editing are
still pending.
The book is typeset with XeTeX using the Memoir class. All figures are original and cre-
ated with TikZ. Proudly written in Neovim with the assistance of GitHub Copilot. The
book cover image was created with the assistance of Gemini and DALL·E 2. We use the
beautiful STIX fonts for text and math. Some icons are from Font Awesome 5 by Dave
Gandy.
Scripture quotations are from The ESV® Bible (The Holy Bible, English Standard Version®),
copyright © 2001 by Crossway, a publishing ministry of Good News Publishers. Used by
permission. All rights reserved.
Contents

Foreword
Preface
2 Fundamental concepts
  2.1 Data science definition
  2.2 The data science continuum
  2.3 Fundamental data theory
    2.3.1 Phenomena
    2.3.2 Measurements
    2.3.3 Knowledge extraction
4 Structured data
  4.1 Data types
  4.2 Database normalization
    4.2.1 Relational algebra
    4.2.2 Normal forms
  4.3 Tidy data
    4.3.1 Common messy datasets
  4.4 Bridging normalization, tidiness, and data theory
    4.4.1 Tidy or not tidy?
    4.4.2 Change of observational unit
  4.5 Data semantics and interpretation
  4.6 Unstructured data
5 Data handling
  5.1 Formal structured data
    5.1.1 Splitting and binding
    5.1.2 Split invariance
    5.1.3 Illustrative example
  5.2 Data handling pipelines
  5.3 Split-invariant operations
    5.3.1 Tagged splitting and binding
    5.3.2 Pivoting
    5.3.3 Joining
    5.3.4 Selecting
    5.3.5 Filtering
    5.3.6 Mutating
    5.3.7 Aggregating
    5.3.8 Ungrouping
  5.4 Other operations
    5.4.1 Projecting or grouping
    5.4.2 Grouped and arranged operations
  5.5 An algebra for data handling
Bibliography
Glossary
Foreword
Data is now a ubiquitous presence, collected at all times and everywhere. However, the real challenge lies in harnessing this data to
generate actionable insights that guide decision-making and drive in-
novation. This is the essence of data science, a multidisciplinary field
that leverages mathematical, statistical, and computational techniques
to analyse data and solve complex problems.
The book “Data Science Project: An Inductive Learning Approach”
by F.A.N. Verri provides readers with a structured and insightful explo-
ration of the entire data science pipeline, from the initial stages of data
collection to the final step of model deployment. The book effectively
balances theory and practice, focusing on the inductive principles un-
derpinning predictive analytics and machine learning.
While other texts focus solely on machine learning algorithms or
delve deeply into tool-specific details, this book provides a holistic view
of every stage of a data science project. It emphasises the importance of
robust data handling, sound statistical learning principles, and meticu-
lous model evaluation. The author thoughtfully integrates the mathe-
matical foundations and practical considerations needed to design and
execute successful data science projects.
Beyond the technical mechanics, this book challenges readers to crit-
ically evaluate their models’ strengths and limitations. It underscores
the importance of semantics in data handling, equipping readers with
the skills to interpret and transform data meaningfully.
Whether you are a student embarking on your first data science project or a data science professional seeking to expand and refine your skills, this book's clarity, rigour, and practical focus make it a guide that will serve you well in tackling the complex challenges of data-driven innovation.
Preface
Dear reader,
This book is based on the lecture notes from my course PO-235 Data
Science Project, which I teach to graduate students at both the Aero-
nautics Institute of Technology (ITA) and the Federal University of São
Paulo (UNIFESP) in Brazil. I have been teaching this subject since 2021,
and I have continually updated the material each year.
Also, I was the coordinator of the Data Science Specialization Pro-
gram (CEDS) at ITA. That experience, which included a great deal of
administrative work, as well as teaching and supervising professionals
in the course, has helped me to understand the needs of the market and
the students.
Moreover, parts of the project development methodology presented
here came from my experience as a lead data scientist in R&D projects
for the Brazilian Air Force, which included projects in areas such as im-
age processing, natural language processing, and spatio-temporal data
analysis.
The literature provides us with a wide range of excellent theoretical material on machine learning and statistics, as well as highly regarded practical books on data science tools. However, I have missed a text that provides a solid foundation for data science, covering all steps in a data science project, including its software engineering aspects.
My goal is to provide a book that serves as a textbook for a course on
data science projects or as a reference for professionals working in the
field. I strive to maintain a formal tone while preserving the practical
aspects of the subject. I do not focus on a specific tool or programming
language, but rather seek to explain the semantics of data science tasks
that can be implemented in any programming language.
Also, instead of teaching specific machine learning algorithms, I try
to explain why machine learning works, thereby increasing awareness
of its pitfalls and limitations. For this purpose, I assume you have a
strong mathematical and statistical foundation.
One important artificial constraint I have imposed in the material
(for the sake of the course) is that I only consider predictive methods,
more specifically inductive ones. I do not address topics such as cluster-
ing, association rule mining, transductive learning, anomaly detection,
time series forecasting, reinforcement learning, etc.
I expect my approach to the subject to provide an understanding of all steps in a data science project, including a deeper focus on the correct evaluation and validation of data science solutions.
Note that, in this book, I openly express my opinions and beliefs. On several occasions, they may sound controversial. I am not trying to be rude
or to demean any researcher or practitioner in the field; rather, I aim to
be honest and transparent.
Contributors
I would like to thank the following contributors for their help in improv-
ing this book:
• Johnny C. Marques
• Manoel V. Machado (aka ryukinix)
• Vitor V. Curtis
All contributors have freely waived their rights to the content they
contributed to this book.
1 A brief history of data science
“Begin at the beginning,” the King said gravely, “and go on till
you come to the end: then stop.”
— Lewis Carroll, Alice in Wonderland
There are many points of view regarding the origin of data science. For
the sake of contextualization, I separate the topic into two approaches:
the history of the term itself and a broad timeline of data-driven sciences,
highlighting the important figures in each age.
I believe that the history of the term is important for understand-
ing the context of the discipline. Over the years, the term has been em-
ployed to label quite different fields of study. Before presenting my view
on the term, I present the views of two influential figures in the history
of data science: Peter Naur and William Cleveland.
Moreover, studying the key facts and figures in the history of data-
driven sciences enables us to comprehend the evolution of the field and hopefully guides us towards evolving it further. Besides, history also
teaches us ways to avoid repeating the same mistakes.
Most of the significant theories and methods in data science have
been developed simultaneously across different fields, such as statistics,
computer science, and engineering. The history of data-driven sciences
is long and rich. I present a timeline of the ages of data handling and
the most important milestones of data analysis.
I do not intend to provide a comprehensive history of data science.
I aim to provide enough context to support the development of the ma-
terial in the following chapters, sometimes avoiding directions that are
not relevant in the context of inductive learning.
Chapter remarks
Contents
1.1 The term "data science"
1.2 Timeline and historical markers
  1.2.1 Timeline of data handling
  1.2.2 Timeline of data analysis
Context
• The term “data science” is recent and has been used to label rather
different fields.
• The history of data-driven sciences is long and rich.
• Many important theories and methods in data science have been
developed simultaneously in different fields.
• The history of data-driven sciences is important to understand the
evolution of the field.
Objectives
Takeaways
Peter Naur (1928 – 2016) The term “data science” itself was coined in
the 1960s by Peter Naur (/naʊə/). Naur was a Danish computer scientist
and mathematician who made significant contributions to the field of
computer science, including his work on the development of program-
ming languages1. His ideas and concepts laid the groundwork for the
way we think about programming and data processing today.
Naur disliked the term computer science and suggested it be called
datalogy or data science. In the 1960s, the subject was practised in Den-
mark under Peter Naur’s term datalogy, which means the science of data
and data processes.
He coined this term to emphasize the importance of data as a fun-
damental component of computer science and to encourage a broader
perspective on the field that included data-related aspects. At that time,
the field was primarily centered on programming techniques, but Naur’s
concept broadened the scope to recognize the intrinsic role of data in
computation.
In his book2, "Concise Survey of Computer Methods", he starts from
the concept that data is “a representation of facts or ideas in a formalised
manner capable of being communicated or manipulated by some pro-
cess.”3 Note however that his view of the science only “deals with data
[…] while the relation of data to what they represent is delegated to other
fields and sciences.”
It is interesting to see the central role he gave to data in the field of
computer science. His view that the relation of data to what they rep-
resent is delegated to other fields and sciences is still debatable today.
Some scientists argue that data science should focus on the techniques
to deal with data, while others argue that data science should encom-
pass the whole business domain. A depiction of Naur’s view of data
science is shown in fig. 1.1.
1He is best remembered as a contributor, with John Backus, to the Backus–Naur form
(BNF) notation used in describing the syntax for most programming languages.
2P. Naur (1974). Concise Survey of Computer Methods. Lund, Sweden: Studentlitter-
atur. isbn: 91-44-07881-1. url: http://www.naur.com/Conc.Surv.html.
3I. H. Gould (ed.): ‘IFIP guide to concepts and terms in data processing’, North-
Holland Publ. Co., Amsterdam, 1971.
Figure 1.1: For Naur, data science studies the techniques to deal with data, but he delegates the meaning of data to other fields.
4W. S. Cleveland (2001). “Data Science: An Action Plan for Expanding the Technical
Areas of the Field of Statistics”. In: ISI Review. Vol. 69, pp. 21–26.
5W. S. Cleveland. Data Science: An Action Plan for the Field of Statistics. Statistical
Analysis and Data Mining, 7:414–417, 2014. reprinting of 2001 article in ISI Review, Vol
69.
Figure: Cleveland's view of data science, combining domain expertise, statistics, and computer science.
William Cleveland, in turn, proposed data science as an action plan for expanding the technical areas of the field of statistics. He argued that the field should incorporate computational methods and that the domain expertise should be considered in the analysis.
6Press, Gil. “Data Science: What’s The Half-Life of a Buzzword?”. Forbes. Available
at forbes.com/sites/gilpress/2013/08/19/data-science-whats-the-half-life-of-a-buzzword.
Pre-digital age
We can consider the earliest records of data collection to be the notches
on sticks and bones (probably) used to keep track of the passing of time.
The Lebombo bone, a baboon fibula with notches, is one of the earliest
known mathematical objects. It was found in the Lebombo Mountains
located between South Africa and Eswatini. It is estimated to be more than 40,000 years old. It is conjectured to be a tally stick, but its exact
7J. D. Kelleher and B. Tierney (2018). Data science. The MIT Press.
8V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-Verlag
New York, Inc. isbn: 978-1-4419-3160-3.
purpose is unknown. Its 29 notches suggest that it may have been used
as a lunar phase counter. However, since it is broken at one end, the 29
notches may or may not be the total number9.
Another milestone in the history of data collection is the record of
demographic data. One of the first known censuses was conducted in
3,800 BC in the Babylonian Empire. It was ordered to assess the popu-
lation and resources of the empire. Records were stored on clay tiles10.
With the early forms of writing, humanity's ability to register data and events increased significantly. The first known written records date
back to around 3,500 BC, the Sumerian archaic (pre-cuneiform) writ-
ing. This writing system was used to represent commodities using clay
tokens and to record transactions11.
“Data storage” was also a challenge. Some important devices that
increased our capacity to register textual information are the Sumerian
clay tablets (3,500 BC), the Egyptian papyrus (3,000 BC), the Roman wax
tablets (100 BC), the codex (100 AD), the Chinese paper (200 AD), the
printing press (1440), and the typewriter (1868).
Besides those improvements in unstructured data storage, at least in the Western and Middle Eastern world, there were no significant advances in structured data collection until the 19th century. (Researching an Eastern timeline seems much harder to do; unfortunately, I have left it out of this book.)
I consider a major influential figure in the history of data collection
to be Florence Nightingale (1820 – 1910). She was a passionate statis-
tician and probably the first person to use statistics to influence pub-
lic and official opinion. The meticulous records she kept during the
Crimean War (1853 – 1856) were the evidence that saved lives — part of
the mortality came from lack of sanitation. She was also the first to use
statistical graphics to present data in a way that was easy to understand.
She is credited with developing a form of the pie chart now known as the
polar area diagram. She also reformed healthcare in the United King-
dom and is considered the founder of modern nursing; a great part of that work was collecting data in a standardized way so that conclusions could be drawn quickly12.
9P. B. Beaumont and R. G. Bednarik (2013). In: Rock Art Research 30.1, pp. 33–54.
doi: 10.3316/informit.488018706238392.
10C. G. Grajalez et al. (2013). “Great moments in statistics”. In: Significance 10.6,
pp. 21–28. doi: 10.1111/j.1740-9713.2013.00706.x.
11G. Ifrah (1998). The Universal History of Numbers, from Prehistory to the Invention of
the Computer. First published in French, 1994. London: Harvill. isbn: 1 86046 324 x.
12C. G. Grajalez et al. (2013). "Great moments in statistics". In: Significance 10.6, pp. 21–28. doi: 10.1111/j.1740-9713.2013.00706.x.
Digital age
In the modern period, several devices were developed to store digital13
information. One particular device that is important for data collection
is the punched card. It is a piece of stiff paper that contains digital infor-
mation represented by the presence or absence of holes in predefined
positions. The information can be read by a mechanical or electrical de-
vice called a card reader. The earliest famous usage of punched cards
was by Basile Bouchon (1725) to control looms. Most of the advances un-
til the 1880s were about the automation of machines (data processing)
using hand-punched cards, and not particularly specialized for data col-
lection.
However, the 1890 census in the United States was the first to use
machine-readable punched cards to tabulate data. Processing 1880 cen-
sus data took eight years, so the Census Bureau contracted Herman
Hollerith (1860 – 1929) to design and build a tabulating machine. He
founded the Tabulating Machine Company in 1896, which later merged
with other companies to become International Business Machines Cor-
poration (IBM) in 1924. Later models of the tabulating machine were
widely used for business applications such as accounting and inventory
control. Punched card technology remained a prevalent method of data
processing for several decades until more advanced electronic comput-
ers were developed in the mid-20th century.
The invention of the digital computer is responsible for a revolution
in data handling. The amount of information we can capture and store
increased exponentially. ENIAC (1945) was the first electronic general-
purpose computer. It was Turing-complete, digital, and capable of being
reprogrammed to solve a full range of computing problems. It had 20
words of internal memory, each capable of storing a 10-digit decimal
integer number. Programs and data were entered by setting switches
and inserting punched cards.
For the 1950 census, the United States Census Bureau used the UNI-
VAC I (Universal Automatic Computer I), the first commercially pro-
duced computer in the United States14.
It goes without saying that digital computers have become much
more powerful and sophisticated since then. The data collection process
has been easily automated and scaled to a level that was unimaginable
before. However, “where” storing data is not the only challenge. “How”
to store data is also a challenge. The next period of history addresses
this issue.
Formal age
In 1970, Edgar Frank Codd (1923 – 2003), a British computer scientist,
published a paper entitled “A Relational Model of Data for Large Shared
Data Banks”15. In this paper, he introduced the concept of a relational
model for database management.
A relational model organizes data in tables (relations) where each
row represents a record and each column represents an attribute of the
record. The tables are related by common fields. Codd showed means to
organize the tables of a relational database to minimize data redundancy
and improve data integrity. Section 4.2 provides more details on the
topic.
His work was a breakthrough in the field of data management. The
standardization of relational databases led to the development of struc-
tured query language (SQL) in 1974. SQL is a domain-specific language
used in programming and designed for managing data held in a rela-
tional database management system (RDBMS).
As a result, a new challenge rapidly emerged: how to aggregate data
from different sources. Once data is stored in a relational database, it is
easy to query and manage it. However, data is usually stored in different
databases, and it is not always possible to directly combine them.
Integrated age
The solution to this problem was the development of the extract, trans-
form, load (ETL) process. ETL is a process in data warehousing responsi-
ble for extracting data from several sources, transforming it into a format
that can be analyzed, and loading it into a data warehouse.
The concept of data warehousing dates back to the late 1980s when
IBM researchers Barry Devlin and Paul Murphy developed the “busi-
ness data warehouse.”
Two major figures in the history of ETL are Ralph Kimball (born
1944) and Bill Inmon (born 1945), both American computer scientists.
15E. F. Codd (1970). “A Relational Model of Data for Large Shared Data Banks”. In:
Commun. ACM 13.6, pp. 377–387. issn: 0001-0782. doi: 10.1145/362384.362685.
Although they differ in their approaches, they both agree that data ware-
housing is the foundation for business intelligence (BI) and analytics,
and that data warehouses should be designed to be easy to understand
and fast to query for business users.
A famous debate between Kimball and Inmon is the top-down ver-
sus bottom-up approach to data warehousing. Inmon’s approach is top-
down, where the data warehouse is designed first and then the data
marts16 are created from the data warehouse. Kimball’s approach is
bottom-up, where the data marts are created first and then the data ware-
house is created from the data marts.
One of the earliest and most famous case studies of the implemen-
tation of a data warehouse is that of Walmart. In the early 1990s, Wal-
mart faced the challenge of managing and analyzing vast amounts of
data from its stores across the United States. The company needed a so-
lution that would enable comprehensive reporting and analysis to sup-
port decision-making processes. The solution was to implement a data
warehouse that would integrate data from various sources and provide
a single source of truth for the organization.
Ubiquitous age
The last and current period of history is the ubiquitous age. It is charac-
terized by the proliferation of data sources.
The ubiquity of data generation and the evolution of data-centric
technologies have been made possible by a multitude of figures across
various domains.
• Vinton Gray Cerf (born 1943) and Robert Elliot Kahn (born 1938),
often referred to as the “Fathers of the Internet,” developed the
TCP/IP protocols, which are fundamental to internet communi-
cation.
• Tim Berners-Lee (born 1955), credited with inventing the World
Wide Web, laid the foundation for the massive data flow on the
internet.
• Steven Paul Jobs (1955 – 2011) and Stephen Wozniak (born 1950),
from Apple Inc., and William Henry Gates III (born 1955), from
Microsoft Corporation, were responsible for the introduction of personal computing to the general public.
16A data mart is a specialized subset of a data warehouse that is designed to serve the
needs of a specific business unit, department, or functional area within an organization.
Google's MapReduce programming model also stands out for enabling large-scale, distributed data analysis. MapReduce is not particularly novel, but its simplicity and scalabil-
ity made it popular.
Nowadays, another important topic is internet of things (IoT). IoT
is a system of interrelated computing devices that communicate with
each other over the internet. The devices can be anything from cell-
phones, coffee makers, washing machines, headphones, lamps, wear-
able devices, and almost anything else you can think of. The reality of
IoT increased the challenges of data handling, especially in terms of data
storage and processing.
In summary, we currently live in a world where data is ubiquitous
and comes in many different forms. The challenge is to collect, store,
and process this data in a way that is meaningful and useful, also re-
specting privacy and security.
Summary statistics
The earliest known records of systematic data analysis date back to the
first censuses. The term statistics itself refers to the analysis of data about
the state, including demographics and economics. That early (and sim-
plest) form of statistical analysis is called summary statistics, which con-
sists of describing data in terms of its central tendencies (e.g., arithmetic
mean) and variability (e.g., range).
Probability advent
However, from the 17th century onwards, the foundations of modern probability theory were laid out. Important figures in developing that theory are
Blaise Pascal (1623 – 1662), Pierre de Fermat (1601 – 1665), Christiaan
Huygens (1629 – 1695), and Jacob Bernoulli (1655 – 1705).
The foundation methods brought to life the field of statistical infer-
ence. In the following years, important results were achieved.
William Playfair (1759 – 1823) is credited with inventing the line graph and bar chart of economic data, and the pie chart and circle graph to
show proportions.
20https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15227/1/138.pdf
21After Rosenblatt’s work, however, it was used to solve inductive inference (classifi-
cation) as well. For curiosity, Fisher’s paper introduced the famous Iris data set.
the field of statistics, such as the use of likelihood functions and the
Bayesian approach.
Hunt inducing trees In 1966, Hunt, Marin, and Stone’s book23 de-
scribed a way to induce decision trees from data. The algorithm is based
22Although Rosenblatt was aware of the limitations of the perceptron and was probably
working on solutions, he died in 1971.
23E. B. Hunt, J. Marin, and P. J. Stone (1966). Experiments in Induction. New York, NY,
USA: Academic Press.
26R. E. Schapire (1990). “The strength of weak learnability”. In: Machine Learning 5.2,
pp. 197–227. doi: 10.1007/BF00116037.
27L. Breiman (1996). “Bagging predictors”. In: Machine Learning 24.2, pp. 123–140.
doi: 10.1007/BF00058655.
28T. K. Ho (1995). “Random decision forests”. In: Proceedings of 3rd International
Conference on Document Analysis and Recognition. Vol. 1, 278–282 vol.1. doi: 10.1109/
ICDAR.1995.598994.
29J. H. Friedman (2001). “Greedy function approximation: A gradient boosting ma-
chine.” In: The Annals of Statistics 29.5, pp. 1189–1232. doi: 10.1214/aos/1013203451.
30C. Cortes and V. N. Vapnik (1995). “Support-vector networks”. In: Machine Learning
20.3, pp. 273–297. doi: 10.1007/BF00994018.
31T. M. Cover (1965). “Geometrical and Statistical Properties of Systems of Linear In-
equalities with Applications in Pattern Recognition”. In: IEEE Transactions on Electronic
Computers EC-14.3, pp. 326–334. doi: 10.1109/PGEC.1965.264137.
32https://awards.acm.org/about/2018-turing
33V. N. Vapnik and R. Izmailov (2015). “Learning with Intelligent Teacher: Similarity
Control and Knowledge Transfer”. In: Statistical Learning and Data Sciences. Ed. by A.
Gammerman, V. Vovk, and H. Papadopoulos. Cham: Springer International Publishing,
pp. 3–32. isbn: 978-3-319-17091-6.
2 Fundamental concepts
The simple believes everything,
but the prudent gives thought to his steps.
— Proverbs 14:15 (ESV)
1As it would be as unproductive as creating a “new math” for each new application.
All “sciences” rely on each other in some way
Chapter remarks
Contents
2.1 Data science definition
2.2 The data science continuum
2.3 Fundamental data theory
  2.3.1 Phenomena
  2.3.2 Measurements
  2.3.3 Knowledge extraction
Context
Objectives
Takeaways
2N. Zumel and J. Mount (2019). Practical Data Science with R. 2nd ed. Shelter Island,
NY, USA: Manning.
3H. Wickham, M. Çetinkaya-Rundel, and G. Grolemund (2023). R for Data Science:
Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.
4C. Hayashi (1998). “What is Data Science? Fundamental Concepts and a Heuristic
Example”. In: Data Science, Classification, and Related Methods. Ed. by C. Hayashi et al.
Tokyo, Japan: Springer Japan, pp. 40–51. isbn: 978-4-431-65950-1.
Note that these definitions do not contradict each other. However, they do not attempt to emphasize the "science" aspect of it. From these ideas,
let us define the term.
Figure: The proposed view of data science, drawing on statistics, computer science, and philosophy / domain expertise.

In short, data science is understood here as a new science that incorporates concepts from other sciences. In the next
section, I argue the reasons to understand data science as a new science.
Figure: Data science emerges from established sciences (statistics, computer science, philosophy, and others) through the emergence of its own principles and unique methods.
Similarly, the accepted methodologies in data science differ and keep evolving from those in other sciences, and new questions arise.
2.3.1 Phenomena
Phenomenon is a term used to describe any observable event or process. Phenomena are the sources we use to understand the world around us. In general, we use our senses to perceive phenomena. To make sense of them, we use our knowledge and reasoning.
Philosophy is the study of knowledge and reasoning. It is a very
broad field of study that has been divided into many subfields. One pos-
sible starting point is ontology, which is the study of being, existence,
and reality. Ontology studies what exists and how we can classify it. In
particular, ontology describes the nature of categories, properties, and
relations.
Aristotle (384 – 322 BC) is one of the first philosophers to study on-
tology. In Κατηγορίαι9, he proposed a classification of the world into
ten categories. Substance, or οὐσία, is the most important one. It is the category of being. The other nine categories are quantity, quality, relation, place, time, position, state, action, and affection.
Although rudimentary10, Aristotle’s categories served as a basis for
the development of logical reasoning and scientific classification, espe-
9For Portuguese readers, I suggest Aristotle (2019). Categorias (Κατηγορiαι). Greek
and Portuguese. Trans. by J. V. T. da Mata. São Paulo, Brasil: Editora Unesp. isbn: 978-
85-393-0785-2.
10Most historians agree that Categories was written before Aristotle’s other works.
Many concepts are further developed in his later works.
cially in the Western world. The categories are still used in many appli-
cations, including computer systems and data systems.
Aristotle marked a rupture with many previous philosophers. While
Heraclitus (6th century – 5th century BC) defended that everything is in
a constant state of flux and Plato (c. 427 – 348 BC) defended that only the
perfect can be known, Aristotle focused on the world we can perceive
and understand. His practical view also opposed Antisthenes' (c. 446 – 366 BC) view that the predicate determines the object, which leads to the impossibility of negation and, consequently, of contradiction.
What is the importance of ontology for data science? Describing,
which is basically reducing the complexity of the world to simple, small
pieces, is the first step to understand any phenomenon. Drawing a sim-
plistic parallel, phenomena are like the substance category, and the data
we collect are like the other categories, which describe the properties, re-
lations, and states of the substance. A person who can easily organize
their thoughts to identify the entities and their properties in a problem
is more likely to collect relevant data. Also, the understanding of logical
and grammatical limitations — such as univocal and equivocal terms —
is important to avoid errors in data science applications11.
Another important field in Philosophy is epistemology, which is the
study of knowledge. Epistemology elaborates on how we can acquire
knowledge and how we can distinguish between knowledge and opin-
ion. In particular, epistemology studies the nature of knowledge, justi-
fication, and the rationality of belief.
Finally, logic is the study of reasoning. It studies the nature of rea-
soning and argumentation. In particular, logic studies the nature of in-
ference, validity, and fallacies.
I further discuss knowledge and reasoning in section 2.3.3.
In the context of a data science project, we usually focus on phe-
nomena from a particular domain of expertise. For example, we may be
interested in phenomena related to the stock market, or related to the
weather, or related to human health. Thus, we need to understand the
nature of the phenomena we are studying.
Fully understanding the phenomena we are tackling requires both
general knowledge of epistemology, ontology, and logic, and particular
knowledge of the domain of expertise.
11It is very common to see data scientists reducing the meaning of the columns in a
dataset to a single word. Or even worse, they assume that different columns with the same
name have the same meaning. This is a common source of errors in data science projects.
2.3.2 Measurements
Similarly to Aristotle’s work, data scientists focus on the world we can
perceive with our senses (or using external sensors). In a more restric-
tive way, we focus on the world we can measure12. Measurable phe-
nomena are those that we can quantify in some way. For example, the
temperature of a room is a measurable phenomenon because we can
measure it using a thermometer. The number of people in a room is
also a measurable phenomenon because we can count them.
When we quantify a phenomenon, we perform data collection. Data
collection is the process of gathering data on a targeted phenomenon in
an established systematic way. Systematic means that we have a plan to
collect the data and we understand the consequences of the plan, includ-
ing the sampling bias. Sampling bias is the influence that the method
of collecting the data has on the conclusions we can draw from them.
Once we have collected the data, we need to store them. Data storage is
the process of storing data in a computer.
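For a concrete, if simplistic, illustration of sampling bias, consider the sketch below in Python. The population, the measured attribute, and every number are hypothetical; the only point is that the collection plan changes the conclusions we draw.

    import random

    random.seed(42)

    # Hypothetical population: daily screen time (hours) of 10,000 people.
    population = [random.gauss(4.0, 1.5) for _ in range(10_000)]

    # Systematic plan: a simple random sample of 200 people.
    random_sample = random.sample(population, 200)

    # Biased plan: only heavy users answer the survey (self-selection).
    heavy_users = [hours for hours in population if hours > 4.5]
    biased_sample = random.sample(heavy_users, 200)

    def mean(values):
        return sum(values) / len(values)

    print(f"population mean:    {mean(population):.2f}")
    print(f"random sample mean: {mean(random_sample):.2f}")
    print(f"biased sample mean: {mean(biased_sample):.2f}")

The biased plan systematically overestimates the population mean; understanding and documenting this kind of effect is what makes a collection plan systematic.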
To perform those tasks, we need to understand the nature of data.
Data are any piece of information that can be digitally stored. Data can
be stored in many different formats. For example, we can store data in
a spreadsheet, in a database, or in a text file. We can also store data
in many different types. For example, we can store data as numbers,
strings, or dates.
12Some phenomena might be knowable but not measurable. For example, the existence of God is a knowable phenomenon, but it is not measurable.

In data science, studying data types is important because they need to correctly reflect the nature of the source phenomenon and be compatible with the computational methods we are using. Data types also restrict the operations we can perform on the data.
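To make this concrete, the sketch below uses Python with the pandas library (one possible tool among many; the columns and values are hypothetical) to show how declared data types document the nature of each measurement and restrict the operations allowed on it.

    import pandas as pd

    # Hypothetical measurements with explicit data types.
    df = pd.DataFrame({
        "patient_id": pd.array([101, 102, 103], dtype="int64"),
        "temperature_c": pd.array([36.5, 37.8, 39.1], dtype="float64"),
        "admission": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-05"]),
        "severity": pd.Categorical(["low", "high", "high"],
                                   categories=["low", "medium", "high"],
                                   ordered=True),
    })

    print(df.dtypes)

    # Numeric types support arithmetic ...
    print(df["temperature_c"].mean())

    # ... datetimes support date arithmetic ...
    print(df["admission"].max() - df["admission"].min())

    # ... and ordered categories support comparisons, but not averages.
    print((df["severity"] >= "medium").sum())

Trying to average the severity column, by contrast, raises an error; the type system is simply reflecting the nature of the underlying measurement.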
The foundation and tools to understand data types come from com-
puter science. Among the subfields, I highlight:
• Algorithms and data structures: the study of data types and the
computational methods to manipulate them.
• Databases: the study of storing and retrieving data.
13In mathematics, axioms are the premises and accepted facts. Corollaries, lemmas,
and theorems are the results of the reasoning process.
14It is important to highlight that it is expected that some of the method’s assumptions
are not fully met. These methods are usually robust enough to extract valuable knowledge
even when data contain imperfections, errors, and noise. However, it is still useful to
perform data preprocessing to adjust data as much as possible.
3 Data science project
with contributions from Johnny C. Marques
Once we have established what data science is, we can now discuss how
to conduct a data science project. First of all, a data science project is
a software project. The difference between a data science software and
a traditional software is that some components of the former are con-
structed from data. This means that part of the solution is not designed
from the knowledge of the domain expert.
One example of a project is a spam filter that classifies emails into
two categories: spam and non-spam. A traditional approach is to design
a set of rules that are known to be effective. However, the effectiveness
of the filters is limited by the knowledge of the designer and is cumber-
some to maintain. A data science approach automatically learns the fil-
ters from a set of emails that are already classified as spam or non-spam.
Another important difference in data science projects is that tradi-
tional testing methods, such as unit tests, are not sufficient. The solu-
tion inferred from the data must be validated considering the stochastic
nature of the data.
In this chapter, we discuss common methodologies for data science
projects. We also present the concept of agile methodologies and the
Scrum framework. We finally propose an extension to Scrum adapted
for data science projects.
Chapter remarks
Contents
3.1 CRISP-DM
3.2 ZM approach
  3.2.1 Roles of the ZM approach
  3.2.2 Processes of the ZM approach
  3.2.3 Limitations of the ZM approach
3.3 Agile methodology
3.4 Scrum framework
  3.4.1 Scrum roles
  3.4.2 Sprints and backlog
  3.4.3 Scrum for data science projects
3.5 Our approach
  3.5.1 The roles of our approach
  3.5.2 The principles of our approach
  3.5.3 Proposed workflow
Context
Objectives
Takeaways
3.1 CRISP-DM
CRISP-DM1 is a methodology for data mining projects. It is an acronym for Cross Industry Standard Process for Data Mining. The methodology was developed in the 1990s by a consortium of companies and is still widely used today.
CRISP-DM is a cyclic process composed of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
3.2 ZM approach
Zumel and Mount (2019)2 also propose a methodology for data science
projects — which we call here the ZM approach. Besides describing
each step in a data science project, they further address the roles of each
individual involved in the project. They state that data science projects
are always collaborative, as they require domain expertise, data exper-
tise, and software expertise.
Requirements of a data science project are dynamic, and we need to
perform many exploratory phases. Unlike traditional software projects,
we should expect significant changes in the initial requirements and
goals of the project.
Usually, projects based on data are urgent, and they must be com-
pleted in a short time — not only due to the business requirements, but
also because the data changes over time. The authors state that agile
methodologies are suitable (and necessary) for data science projects.
Project sponsor It is the main stakeholder of the project, the one that
needs the results of the project. He represents the business interests and
champions the project. The project is considered successful if the spon-
sor is satisfied. Note that, ideally, the sponsor should not be the data scien-
tist, but someone who is not involved in the development of the project.
However, he needs to be able to express quantitatively the business goals
and participate actively in the project.
Client The client is the domain expert. He represents the end users’
interests. In a small project, he is usually the sponsor. He translates the
daily activities of the business into the technical requirements of the
software.
Data scientist The data scientist is the one that sets and executes the
analytic strategy. He is the one that communicates with the sponsor and
the client, effectively connecting all the roles. In small projects, he can
also act as the developer of the software. However, in large projects, he is
2N. Zumel and J. Mount (2019). Practical Data Science with R. 2nd ed. Shelter Island,
NY, USA: Manning.
Data architect The data architect is the one that manages data and
data storage. He usually is involved in more than one project, so he is
not an active participant. He is the one who receives instructions to
adapt the data storage and means to collect data.
Figure: The lifecycle of a data science project in the ZM approach, from defining the goal to evaluating the model.
Note that the manifesto does not discard the items on the right, but
rather values the items on the left more. For example, comprehensive
documentation is important, but working software is more important.
Product owner This role is responsible for defining the product vision and maximizing the value of the work done by the team.
4Latest version of the Scrum Guide available at K. Schwaber and J. Sutherland (2020).
Scrum Guide: The Definitive Guide to Scrum: The Rules of the Game. Scrum.org. url:
https://scrumguides.org/docs/scrumguide/v2020/2020-Scrum-Guide-US.pdf.
5S. Denning (2016). “Why Agile Works: Understanding the Importance of Scrum
in Modern Software Development”. In: Forbes. url: https : / / www . forbes . com / sites /
stevedenning/2016/08/10/why-agile-works/.
Scrum master The Scrum master acts as a facilitator and coach for
the Scrum team, ensuring that the team adheres to the Scrum frame-
work and agile principles. Unlike a traditional project manager, the
Scrum master is not responsible for managing the team directly but for
enabling them to perform optimally by removing impediments and fos-
tering a self-organizing culture7.
Figure: The Scrum framework. The product owner manages the product backlog; during sprint planning the development team derives the sprint backlog; work proceeds in sprints with daily scrums facilitated by the Scrum master; each sprint delivers an incremental version and ends with a sprint review and a sprint retrospective.
The product backlog evolves as requirements and changes emerge. The items selected for the sprint become part of
the sprint backlog, a subset of the product backlog that the development
team commits to completing during the sprint.
At the heart of the Scrum process is the daily scrum (or stand-up
meeting), a brief meeting where the team discusses progress toward the
sprint goal, any obstacles they are facing, and their plans for the next
day. This daily inspection ensures that everyone stays aligned and can
quickly adapt to any changes or challenges.
The burn down/up chart is a visual tool used during the sprint to
track the Scrum team’s progress against the planned tasks. It displays
the amount of remaining work (in hours or story points) over time, al-
lowing the team and the product owner to monitor whether the work
is on pace to be completed by the end of the sprint. The chart’s line
decreases as tasks are finished, providing a clear indicator of potential
delays or blockers. If progress is falling behind, the team adjusts the
approach during the sprint by re-prioritizing tasks or removing impedi-
ments. Thus, this chart provides real-time visibility into the team’s effi-
ciency and contributes to more agile and proactive work management.
At the end of each sprint, the team holds a sprint review, during
which they demonstrate the work completed during the sprint. The
sprint review is an opportunity for stakeholders to see progress and pro-
vide feedback, which may lead to adjustments in the Product Backlog.
Following the review, the team conducts a sprint retrospective to dis-
cuss what went well, what did not, and how they can improve their pro-
cesses moving forward. These continuous improvement cycles are key
to Scrum’s success, allowing teams to adapt both their work and their
working methods iteratively.
The sprint retrospective is a crucial event in the Scrum framework,
held at the end of each sprint. Its primary purpose is to provide the
Scrum team with an opportunity to reflect on the sprint that just con-
cluded and identify areas for improvement. During the retrospective,
the team discusses what went well, what challenges they encountered,
and how they can enhance their processes for future sprints. This con-
tinuous improvement focus allows the team to adapt their workflow and
collaboration methods, fostering a more efficient and effective develop-
ment cycle. By encouraging open and honest feedback, the retrospective
plays a vital role in maintaining team cohesion and driving productivity
over time.
9Note that many other adaptations to Scrum have been described in literature. For ex-
ample, J. Saltz and A. Sutherland (2019). "SKI: An Agile Framework for Data Science". In:
2019 IEEE International Conference on Big Data (Big Data), pp. 3468–3476. doi: 10.1109/
BigData47090.2019.9005591; J. Baijens, R. Helms, and D. Iren (2020). “Applying Scrum
in Data Science Projects”. In: 2020 IEEE 22nd Conference on Business Informatics (CBI).
vol. 1, pp. 30–38. doi: 10.1109/CBI49978.2020.00011; N. Kraut and F. Transchel (2022).
“On the Application of SCRUM in Data Science Projects”. In: 2022 7th International Con-
ference on Big Data Analytics (ICBDA), pp. 1–9. doi: 10.1109/ICBDA55095.2022.9760341.
Many data scientists do not possess "hacking-level" skills, and they often do not know good prac-
tices of software development.
Scrum is a good starting point for a compromise between the need
for autonomy (required in dynamic and exploratory projects) and the
need for a detailed plan to guide the project (required to avoid bad prac-
tices and low-quality software). A good project methodology is needed
to ensure that the project is completed on time and within budget.
Moreover, we add two other values besides the Agile Manifesto values, described next.
The first value is based on the observation that the model perfor-
mance is not the most important aspect of the model. The most impor-
tant aspect is being sure that the model behaves as expected (and some-
times why it behaves as expected). It is not uncommon to find models
that seem to perform well during evaluation steps10, but that are not
suitable for production.
The second value is based on the observation that interactive en-
vironments are not suitable for the development of consistent and re-
producible software solutions. Interactive environments help in the ex-
ploratory phases, but the final version of the code must be version con-
trolled. Often, we hear stories that models cannot be reproduced be-
cause the code that generated them is not runnable anymore. This is
a serious problem, and it is not acceptable for maintaining a software
solution.
As in the Agile manifesto, the values on the right are not discarded,
but the values on the left are more important. We do not discard the
importance of model performance or the convenience of interactive en-
vironments, but they are not the most important aspects of the project.
These observations and values are the basis of our approach. The
roles and principles of our approach are described in the following sec-
tions.
The roles of Scrum are associated with the roles defined by Zumel
and Mount (2019). Note that the association is not exact. In our
approach, the data scientist leads the development team and in-
teracts with the business spokesman. The development team in-
cludes people with both data science and software engineering
expertise.
Our approach defines four roles: the business spokesman, the lead data scientist, the Scrum master, and
the data science team. An association between the roles of our proposal,
Scrum, and the ZM approach is shown in table 3.1.
Lead data scientist The lead data scientist, like the product owner,
is the one who represents the interests of the stakeholder. She must
be able to understand the application domain and the business goals.
We decide to call her “lead data scientist” to make it clear that she also
has data science expertise. The reason is that mathematical and statis-
tical expertise is essential to understand the data and the models. Cor-
rectly interpreting the results and communicating them to the business
spokesman are essential tasks of the lead data scientist. All other re-
sponsibilities of the traditional product owner are also delegated to her.
Data science team The data science team is the development team.
It includes people with expertise in data science, database management,
software engineering, and any other domain-specific expertise that is
required for the project.
14https://dvc.org/
15The learned solution that is the result of the application of a learning algorithm to
the dataset.
16https://www.gnu.org/software/make/manual/make.html
Reports as deliverables
During sprints, the deliverables of phases like exploratory analysis and
solution search are not only the source code, but also the reports gen-
erated. That is probably the reason why interactive environments are
so popular in data science projects. However, the data scientist must
guarantee that the reports are version controlled and reproducible. The
reports must be generated in a way that is understandable by the busi-
ness spokesperson.
As the project evolves, the complexity of the methods and the size of the dataset can
be increased. Nowadays, cloud services are a good option for scalability.
Finally, during development, the requirements of the deployment
infrastructure must be considered. The deployment infrastructure must
be able to handle the expected usage of the system. For instance, in the
back-end, one may need to consider the response time of the system,
the programming language, and the infrastructure to run the software.
The choice of communication between the front-end and the back-end
is also important. For instance, one may choose between a REST API or
a WebSocket. A REST API is more suitable for stateless requests, while a
WebSocket is more suitable for stateful requests. For example, if the user
interface must be updated in real-time, a WebSocket is more suitable. If
the user interface is used to submit batch requests, a REST API is more
suitable.
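As a minimal sketch of the two options, consider the Python snippet below using the FastAPI framework (one possible choice among many; the endpoint paths and the predict function are hypothetical placeholders).

    from fastapi import FastAPI, WebSocket

    app = FastAPI()

    def predict(features: dict) -> dict:
        # Placeholder for the deployed preprocessor and model.
        return {"score": 0.5}

    # REST endpoint: stateless; one request, one response (batch-style usage).
    @app.post("/predict")
    def predict_once(features: dict) -> dict:
        return predict(features)

    # WebSocket endpoint: stateful connection, suited to real-time updates.
    @app.websocket("/ws/predict")
    async def predict_stream(websocket: WebSocket):
        await websocket.accept()
        while True:
            features = await websocket.receive_json()
            await websocket.send_json(predict(features))

Either way, the choice is an application task and must match how the front-end consumes the predictions.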
Product backlog
In the data science methodologies described in this chapter, the problem
definition is the first step in a data science project. In our methodology,
this task is dealt with in the product backlog. The product backlog is a
list of all desired work on the project. The product backlog is dynamic,
and it is continuously updated by the lead data scientist.
Each item in the product backlog reflects a requirement or a feature
the business spokesperson wants to be implemented. Like in traditional
Scrum, the items are ordered by priority. Here, however, they are clas-
sified into three types: data tasks, solution search tasks, and application
tasks.
Sprints
The sprints are divided into three types: data sprints, solution sprints,
and application sprints. Sprints occur sequentially, but it is possible to
have multiple sprints of the same type in sequence. Like in traditional
Scrum, the sprint review is performed at the end of each sprint. Data
sprints comprise only data tasks, solution sprints comprise only solution
search tasks, and so on.
17N. Zumel and J. Mount (2019). Practical Data Science with R. 2nd ed. Shelter Island,
NY, USA: Manning.
Figure 3.4 shows the tasks and results for each sprint type and their
relationships. Every time a data sprint results in modifications in the
dataset, the solution search algorithm must be re-executed. The same
occurs when the solution search algorithm is modified: the application
must be updated to use the new preprocessor and model.
Figure 3.4: Tasks and results for each sprint type and their relationships. Data sprints comprise the data tasks (collect data, integrate data, tidy data) together with exploratory analysis; solution sprints comprise the solution search tasks (data preprocessing, machine learning, validation), producing a validation report and the preprocessor and model; application sprints comprise the application tasks (user interface, communication, monitoring), producing the application. Each sprint ends with a sprint review.
Sprint reviews
A proper continuous integration/continuous deployment (CI/CD) pipe-
line guarantees that by the end of the sprint, exploratory analysis, per-
formance reports, and the working software are ready for review. The
sprint review is a meeting where the team presents the results of the
sprint to the business spokesman. The business spokesman must ap-
prove the results of the sprint. (The lead data scientist approves the re-
sults in the absence of the business spokesman.) It is important that the
reports use the terminology of the client, and that the software is easy
to use for the domain expert.
4 Structured data
Chapter remarks
Contents
4.1 Data types
4.2 Database normalization
  4.2.1 Relational algebra
  4.2.2 Normal forms
4.3 Tidy data
  4.3.1 Common messy datasets
4.4 Bridging normalization, tidiness, and data theory
  4.4.1 Tidy or not tidy?
  4.4.2 Change of observational unit
4.5 Data semantics and interpretation
4.6 Unstructured data
Context
Objectives
Takeaways
1P. F. Velleman and L. Wilkinson (1993). “Nominal, Ordinal, Interval, and Ratio Ty-
pologies are Misleading”. In: The American Statistician 47.1, pp. 65–72. doi: 10.1080/
00031305.1993.10475938.
Join The (natural) join of two relations is the operation that returns
a relation with the columns of both relations. For example, if we have
two relations 𝑆[𝑈 ∪ 𝑉] and 𝑇[𝑈 ∪ 𝑊], where 𝑈 is the common set of attributes, the join 𝑆 ⋈ 𝑇 of 𝑆 and 𝑇 is the relation with tuples (𝑢, 𝑣, 𝑤) such that (𝑢, 𝑣) ∈ 𝑆 and (𝑢, 𝑤) ∈ 𝑇. The generalized join is built up out of binary joins: ⋈{𝑅1, 𝑅2, …, 𝑅𝑛} = 𝑅1 ⋈ 𝑅2 ⋈ ⋯ ⋈ 𝑅𝑛. Since
the join operation is associative and commutative, we can parenthesize
however we want.
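To make the operation concrete, the sketch below reproduces a natural join with pandas in Python (the relations, attributes, and values are made up for illustration).

    import pandas as pd

    # S[U ∪ V] with common attributes U = {course_id} and V = {title}.
    S = pd.DataFrame({"course_id": [1, 2, 3],
                      "title": ["Algebra", "Statistics", "Databases"]})

    # T[U ∪ W] with W = {student, grade}.
    T = pd.DataFrame({"course_id": [1, 1, 3],
                      "student": ["Alice", "Bob", "Alice"],
                      "grade": [9.0, 7.5, 8.0]})

    # Natural join S ⋈ T: combine tuples that agree on the common attributes.
    joined = S.merge(T)  # by default, pandas merges on all common columns
    print(joined)

Merging on every shared column, as pandas does by default, mirrors the definition of the natural join above.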
First normal form (1NF) A relation is in 1NF if and only if all at-
tributes are atomic. An attribute is atomic if it is not a set of attributes.
For example, the relation 𝑅[𝐴, 𝐵, 𝐶] is in 1NF if and only if 𝐴, 𝐵, and 𝐶
are atomic.
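As a small illustration in Python (a hypothetical contact list; pandas is used only for convenience), a relation whose attribute holds a list of values is not in 1NF, and the fix is to give each value its own row.

    import pandas as pd

    # Not in 1NF: the "phones" attribute is not atomic.
    contacts = pd.DataFrame({
        "name": ["Alice", "Bob"],
        "phones": [["555-0101", "555-0102"], ["555-0199"]],
    })

    # In 1NF: one atomic phone number per row.
    contacts_1nf = contacts.explode("phones", ignore_index=True)
    print(contacts_1nf)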
Invalid join example Consider the 2NF relation 𝑅[𝐴𝐵𝐶]6 such that
the primary key is the composite of 𝐴, 𝐵, and 𝐶. The relation is thus in
the 4NF, as no column is a determinant of another column. Suppose,
however, the following constraint: if (𝑎, 𝑏, 𝑐′ ), (𝑎, 𝑏′ , 𝑐), and (𝑎′ , 𝑏, 𝑐) are
in 𝑅, then (𝑎, 𝑏, 𝑐) is also in 𝑅. This can be illustrated if we consider 𝐴
6Here we abbreviate 𝐴, 𝐵, 𝐶 as 𝐴𝐵𝐶.
4.3 Tidy data

Organizing data values in a standard and consistent way facilitates the initial exploration and analysis of the data, reducing the time spent on handling the data to get it into the right format for analysis.
Tidy data, proposed by Wickham (2014)8, is a data format that pro-
vides a standardized way to organize data values within a dataset. The
main advantage of tidy data is that it provides clear semantics with a
focus on only one view of the data.
Many data formats might be ideal for particular tasks, such as raw
data, dense tensors, or normalized databases. However, most statistical
and machine learning methods require a particular data format. Tidy
data is a data format that is suitable for those tasks.
In an unrestricted table, the meaning of rows and columns is not
fixed. In a tidy table, the meaning of rows and columns is fixed. The
semantics are more restrictive than usually required for general tabular
data.
              Brazil   USA
Cases (2019)     100
Cases (2020)     200    400

Table 4.7: Messy table, from Pew Forum dataset, where headers are values, not variable names.
Table 4.8: Tidy version of table 4.7 where values are correctly
moved.
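As an illustration, the following Python/pandas sketch lengthens a table of this kind; the numbers are hypothetical and do not necessarily match table 4.7.

import pandas as pd

# Column headers (Brazil, USA) are values of a country variable, not variable names.
messy = pd.DataFrame({
    "year": [2019, 2020],
    "Brazil": [100, 200],
    "USA": [None, 400],
})

# Lengthening moves the headers into a country column and the cells into a cases column.
tidy = messy.melt(id_vars="year", var_name="country", value_name="cases")
print(tidy)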
To make it tidy, we can transform it into table 4.10. Two columns are created to contain the variables Sex and Age, and the old column is removed. The table keeps the same number of rows, but it is now wider, which is a common pattern when fixing this kind of issue. The new version also makes it straightforward to compute ratios and frequencies correctly.
Table 4.10: Tidy version of table 4.9 where values are correctly
moved.
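A sketch of this transformation in Python/pandas follows; the encoding of the original column (sex and age group separated by an underscore) is hypothetical.

import pandas as pd

messy = pd.DataFrame({
    "country": ["BR", "BR", "US"],
    "group": ["m_0-14", "f_15-24", "m_15-24"],   # two variables stored in one column
    "cases": [3, 5, 2],
})

# Create the columns Sex and Age and remove the old column; the number of rows is unchanged.
messy[["sex", "age"]] = messy["group"].str.split("_", expand=True)
tidy = messy.drop(columns="group")
print(tidy)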
To fix this issue, we must first decide which column contains the names of the variables. Then, we must lengthen the table according to those variables (and potentially their names), as seen in table 4.12.
Table 4.12: Partial solution to tidy table 4.11. Note that the table
is now longer.
Table 4.13: Tidy version of table 4.11 where values are correctly
moved.
To fix this issue, we must ensure that each observation unit is moved
to a different table. Sometimes, it is useful to create unique identifiers
for each observation. The separation avoids several types of potential
inconsistencies. However, take into account that during data analysis,
it is possible that we have to denormalize them. The two resulting tables
are shown in table 4.15 and table 4.16.
Table 4.19: Tidy data where tables 4.17 and 4.18 are combined.
Table 4.21: Tidy data where the observational unit is the event of
measuring the temperature.
Table 4.22: Tidy data where the observational unit is the tempera-
ture at some time.
In both cases, one can argue that the data is also normalized. In the
first case, the primary key is the composite of the columns date, time,
and sensor. In the second case, the primary key is the composite of the
columns date and time.
One can state that the first form is more appropriate, since it is flexi-
ble enough to add more sensors or sensor-specific attributes (using an ex-
tra table). However, the second form is very natural for machine learn-
ing and statistical methods. Given the definition of tidy data, I believe
both forms are correct. It is just a matter of what ontological view you
have of the data.
Still, one can argue that the sensors share the same nature and thus
only the first form is correct (or can even insist that the more flexible
form is the correct one). Consider however the data in table 4.23. The
observational unit is the person, and the attributes are the body mea-
surements.
If we apply the same logic of table 4.21, data in table 4.23 becomes ta-
ble 4.24. Now, the observational unit is the measurement of a body part
of a given person. Now, we can easily include more body parts. Let us
say that we want to add the head circumference. We just need to include
rows such as “Alice, head, 50” and “Bob, head, 55”. Moreover, what if
we want to add the height of the person? Should we create another table
(with “name” and “height”) or should we consider “height” as another
body part (even though it seems weird to consider the full body a part
of the body)?
In the first version of the data (table 4.21), it would be trivial to in-
clude head circumference and height. In the second version, the choice
becomes inconvenient. This table seems “overly tidy”. If the first fits
well for the analysis, it should be preferred.
In summary, tidiness is a matter of perspective.
[Figure: decomposition trees of the relation 𝑅[𝐴𝐵𝐶𝐷𝐸].]
Note that the decomposition that splits first 𝑅[𝐴𝐵𝐶] is not valid,
since the resulting relation 𝑅[𝐴𝐵] is not a consequence of a functional
dependency; see fig. 4.2.
[Figure: alternative decomposition trees of the relation 𝑅[𝐴𝐵𝐶𝐷𝐸].]
lation 𝑅[𝐴𝐹𝐺] where 𝐹 is the average grade and 𝐺 is the total course
load taken by the student (see table 4.25). They are all calculated based
on the rows that are grouped in function of 𝐴. It is important to notice
that, after the summarization operation, all observations must contain
a different value of 𝐴. The second join results in relation 𝑅[𝐴𝐷𝐹𝐺] =
𝑅[𝐴𝐷] ⋈ 𝑅[𝐴𝐹𝐺]. This relation has functional dependency 𝐴 → 𝐷𝐹𝐺,
and it is in 3NF (which is also tidy).
Unfortunately, it is not trivial to calculate all possible decomposition trees for a given dataset. It is up to the data scientist to decide which directions to follow. However, it is important to notice that the order of the joins and summarization operations is crucial to the final result.
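For concreteness, the following Python/pandas sketch mirrors this path with hypothetical column names: grades per (student, subject) are summarized per student and then joined with a per-student relation.

import pandas as pd

grades = pd.DataFrame({   # roughly R[A, grade, load], with A = student
    "student": ["Alice", "Alice", "Bob"],
    "grade": [9.0, 7.0, 8.0],
    "load": [60, 30, 60],
})
info = pd.DataFrame({     # roughly R[A, D]
    "student": ["Alice", "Bob"],
    "enrollment": [2019, 2020],
})

# Summarization: after grouping by student, every row has a distinct value of student.
summary = grades.groupby("student", as_index=False).agg(
    avg_grade=("grade", "mean"),
    total_load=("load", "sum"),
)

# Join: the result has the functional dependency student -> (enrollment, avg_grade, total_load).
result = info.merge(summary, on="student")
print(result)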
estimate from the data. Each observed value of a key can represent an
instance of a random variable, and the other attributes can represent
measured attributes or calculated properties.
For data analysis, it is very important to understand the relationships
between the observations. For example, we might want to know if the
observations are independent, if they are identically distributed, or if
there is a known selection bias. We might also want to know if the ob-
servations are dependent on time, and if there are hidden variables that
affect the observations.
Following wrong assumptions can lead to wrong conclusions. For
example, if we assume that the observations are independent, but they
are not, we might underestimate the variance of the estimators.
Although we do not focus on time series, we must consider the tem-
poral dependence of the observations. For example, we might want to
know how the observation 𝑥𝑡 is affected by 𝑥𝑡−1 , 𝑥𝑡−2 , and so on. We
might also want to know if the Markov property holds, and if there is
periodicity and seasonality in the data.
For the sake of the scope of this book, we suggest that any predic-
tion on temporal data should be done in the state space, where it is
safer to assume that observations are independent and identically dis-
tributed. This is a common practice in reinforcement learning and deep
learning. Takens’ theorem17 allows you to reconstruct the state space
of a dynamical system using time-delay embedding. Given a single ob-
served time series, you can create a multidimensional representation
of the underlying dynamical system by embedding the time series in a
higher-dimensional space. This embedding can reveal the underlying
dynamics and structure of the system.
17F. Takens (2006). “Detecting strange attractors in turbulence”. In: Dynamical Sys-
tems and Turbulence, Warwick 1980: proceedings of a symposium held at the University of
Warwick 1979/80. Springer, pp. 366–381.
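A minimal sketch of time-delay embedding with NumPy is shown below; the series, lag, and embedding dimension are arbitrary choices for illustration.

import numpy as np

def delay_embed(x, dim, tau):
    # Each row is the delay vector (x_t, x_{t+tau}, ..., x_{t+(dim-1)tau}).
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 20 * np.pi, 2000)
x = np.sin(t)                           # a single observed time series
states = delay_embed(x, dim=2, tau=25)  # a 2-dimensional reconstruction of the state space
print(states.shape)                     # (1975, 2)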
Chapter remarks
Contents
5.1 Formal structured data . . . . . . . . . . . . . . . . . . . 83
5.1.1 Splitting and binding . . . . . . . . . . . . . . . . 84
5.1.2 Split invariance . . . . . . . . . . . . . . . . . . . 86
5.1.3 Illustrative example . . . . . . . . . . . . . . . . . 87
5.2 Data handling pipelines . . . . . . . . . . . . . . . . . . . 90
5.3 Split-invariant operations . . . . . . . . . . . . . . . . . . 91
5.3.1 Tagged splitting and binding . . . . . . . . . . . . 92
5.3.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Joining . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.4 Selecting . . . . . . . . . . . . . . . . . . . . . . . 98
5.3.5 Filtering . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.6 Mutating . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.7 Aggregating . . . . . . . . . . . . . . . . . . . . . 102
5.3.8 Ungrouping . . . . . . . . . . . . . . . . . . . . . 102
5.4 Other operations . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.1 Projecting or grouping . . . . . . . . . . . . . . . 105
5.4.2 Grouped and arranged operations . . . . . . . . . 107
5.5 An algebra for data handling . . . . . . . . . . . . . . . . 109
Context
Objectives
Takeaways
𝑣 𝑖,1 , … , 𝑣 𝑖,|𝐻|
[𝑣ℎ𝑖 ∶ ℎ ∈ 𝐻],
[𝑐(𝑟, ℎ) ∶ ℎ ∈ 𝐻],
split(𝑇, 𝑠) = (𝑇0 , 𝑇1 ) ,
c_i(r, h) = \begin{cases} c(r, h) & \text{if } s(r) = i, \\ () & \text{otherwise.} \end{cases}
Note that, by definition, the split operation never “breaks” a row. So,
the indices define the indivisible entities of the table. The resulting tables
are disjoint:
The binding operation is the inverse of the split operation. Given two
disjoint tables 𝑇0 = (𝐾, 𝐻, 𝑐 0 ) and 𝑇1 = (𝐾, 𝐻, 𝑐 1 ), the binding operation
creates a new table 𝑇 that contains all the rows of 𝑇0 and 𝑇1 .
Thus, a requirement for the binding operation is that the tables are
disjoint in terms of the row entities they have.
𝑇0 + 𝑇1 = bind(𝑇0 , 𝑇1 ),
Table 5.2: Data table of student grades assuming student and sub-
ject as indices.
Indexed table with data from table 5.1 assuming student and sub-
ject as indices. The column 𝑠 is the split indicator.
2. We can safely assume that Bob has passed Physics in his second
attempt, once all information about (Bob, Physics) is assumed to
be available.
3. There is no guarantee that Carol has only taken classes in 2020. It
could be that some row (Carol, subject) with a year different from
2020 is missing in the table.
Indexed table with data from table 5.1 assuming student as index.
The column 𝑠 is the split indicator and only rows with 𝑠 = 1 are
available to us.
1. We can safely assume that Alice has never taken the Biology class,
as Biology ∉ 𝑐(Alice, subject).
2. There is no information about Bob’s grades, so we can not affirm
nor deny anything about his grades.
3. We can safely assume that Carol has only taken classes in 2020, as
𝑐(Carol, year) contains only values with 2020.
[Figure: a data handling pipeline in which operations 𝑓1, …, 𝑓5 combine two data sources into a single dataset.]
bind𝑠 (𝑇0 , 𝑇1 , … ) = 𝑇,
where 𝑖 is the index of the table 𝑇𝑖 that contains the row 𝑟, i.e.
𝑑 = card(𝑟; 𝑐 𝑖 ) > 0.
When binding datasets by rows, the datasets must have the same
columns. In practice, one can assume, if a column is missing, that all
values in that column are missing.
The indication of the source table usually captures some hidden se-
mantics that has split the tables in the first place. For instance, if each
table represents data collected in a different year, one can create a new
column year that contains the year of the data. It is important to pay at-
tention to the semantics of the split column, as it can also contain extra
information.
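In Python/pandas, a tagged binding can be sketched as follows; the tables and the column country are hypothetical.

import pandas as pd

us = pd.DataFrame({"month": ["Jan", "Feb"], "usage": [130, 118]})
br = pd.DataFrame({"month": ["Jan", "Feb"], "usage": [95, 90]})

# Tag each source before binding so the hidden semantics (the country) becomes an explicit column.
bound = pd.concat(
    [us.assign(country="US"), br.assign(country="Brazil")],
    ignore_index=True,
)
print(bound)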
Consider table 5.4, which contains monthly gas usage data from US
and Brazil residents. From the requirements described in the previous
section, we can safely bind these datasets — as they are disjoint. We
Monthly gas usage data from US (left) and Brazil (right) residents.
for column 𝑠 for all rows9. Failing to meet this assumption can lead to a biased split. In practice, it is also good to keep the column 𝑠 in the output tables to preserve information about the source of the rows.
In terms of storage, smart strategies can be used to avoid the unneces-
sary repetition of the same value in column 𝑠.
5.3.2 Pivoting
Another important operation is pivoting datasets. There are two types of pivoting: long-to-wide and wide-to-long. The two operations are inverses of each other, so pivoting is reversible.
Pivoting long-to-wide requires a name column — whose discrete
and finite possible values will become the names of the new columns
— and a value column — whose values will be spread across the rows.
Other than these columns, all remaining columns must be indexes.
Note however that the operation only works if card(𝑟 + (ℎ); 𝑐) is con-
stant for all ℎ ∈ 𝒟(name). If this is not the case, one must aggregate
the rows before applying the pivot operation. This is discussed in sec-
tion 5.3.7.
Pivoting wide-to-long10 is the reverse operation. One must specify
all the columns whose names are the values of the previously called
“name column.” The values of these columns will be gathered into a
new column. As before, all remaining columns are indexes.
9We consider a slightly different definition of split invariance here, where the binding
operation is applied to each element of the output of the split operation.
10Also known as unpivot.
pivotname (𝑇0 ) + pivotname (𝑇1 ) = (𝐾, 𝒟(name) , 𝑐′0 ) + (𝐾, 𝒟(name) , 𝑐′1 ),
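Both directions of pivoting can be sketched in Python/pandas as follows; the sensor-reading table is hypothetical.

import pandas as pd

long = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"],
    "name": ["temperature", "humidity", "temperature", "humidity"],
    "value": [20.1, 0.45, 19.8, 0.50],
})

# Long-to-wide: the values of the name column become new columns.
wide = long.pivot(index="date", columns="name", values="value").reset_index()

# Wide-to-long: the named columns are gathered back into (name, value) pairs.
back = wide.melt(id_vars="date", var_name="name", value_name="value")
print(wide)
print(back)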
5.3.3 Joining
Joining is the process of combining two datasets into a single dataset
based on common columns. This is one of the two fundamental opera-
tions in relational algebra. We will see the conditions under which the
operation is split invariant. However, the join operation has some other
risks you should be aware of; consult section 4.2 for more details.
Adapting the definitions of join in our context, we can define it as
follows. For the sake of simplicity, we denote 𝑟[𝑈] as the row 𝑟 restricted
to the index columns in 𝑈, i.e.
The join of two tables is the operation that returns a new table with
the columns of both tables. Let 𝑈 be the common set of index columns.
For each occurring value of 𝑈 in the first table, the operation will look
for the same value in the second table. If it finds it, it will create a new
row with the columns of both tables. If it does not find it, no row will
be created.
Note that, like in pivoting long-to-wide, one must ensure that the
cardinality of the joined rows is constant for all ℎ ∈ 𝐻 ′ ∪ 𝐻 ″ . If this
is not the case, one must aggregate the rows before applying the join
operation. This is discussed in section 5.3.7.
join(𝑇 ′ , 𝑇 ″ ) = 𝑇,
where 𝑇 = (𝐾 ′ ∪ 𝐾 ″ , 𝐻 ′ ∪ 𝐻 ″ , 𝑐) and
c(r, h) = () when the key of r has no matching row in the other table, and otherwise

c(r, h) = \begin{cases} c'(r[K'], h) & \text{if } h \in H', \\ c''(r[K''], h) & \text{if } h \in H''. \end{cases}
Thus,
join𝑇 ′ (𝑇0 + 𝑇1 ) = join𝑇 ′ (𝑇0 ) + join𝑇 ′ (𝑇1 ).
Our conclusion is that the left join operation given a fixed table is
split-invariant. So we can safely use it to join tables without worrying
about biasing the dataset once we fix the second table.
I conjecture that the (inner) join operation shares similar properties
but it is not as safe; nonetheless, a clear definition of split invariance for
binary operations is needed. This is left as a thought exercise for the
reader. Notice that the traditional join has the ability to “erase” rows
from any of the tables involved in the operation. This is a potential
source of bias in the data. This further emphasizes the importance of
understanding the semantics of the data schema before joining tables
— consult section 4.2.
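The following Python/pandas sketch illustrates the claim with hypothetical tables: splitting, left joining each part with a fixed table, and binding gives the same result as left joining the whole table.

import pandas as pd

fixed = pd.DataFrame({"student": ["Alice", "Bob", "Carol"], "major": ["CS", "Math", "Bio"]})
t = pd.DataFrame({
    "student": ["Alice", "Alice", "Bob"],
    "subject": ["Math", "Physics", "Math"],
    "grade": [9, 8, 7],
})

t0, t1 = t.iloc[:1], t.iloc[1:]   # an arbitrary split of the rows

join_then_bind = pd.concat(
    [t0.merge(fixed, on="student", how="left"),
     t1.merge(fixed, on="student", how="left")],
    ignore_index=True,
)
bind_then_join = t.merge(fixed, on="student", how="left")

print(join_then_bind.equals(bind_then_join))   # True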
5.3.4 Selecting
Selecting is the process of choosing a subset of non-index columns from
a dataset. The remaining columns are discarded. Rows of the table re-
main unchanged.
Although very simple, the selection operation is useful for removing
columns that are not relevant to the analysis. Also, it might be needed
before other operations, such as pivoting, to avoid unnecessary columns
(wide-to-long) and to keep only the value column (long-to-wide).
select𝑃 (𝑇) = 𝑇 ′ ,
5.3.5 Filtering
Filtering is the process of selecting a subset of rows from a dataset based
on a predicate.
A predicate can be a combination of other predicates using logical
operators, such as logical disjunction (or) or logical conjunction (and).
where predicate
property comes from the fact that rows are treated independently. More
complex cases are discussed in section 5.4.2.
5.3.6 Mutating
Mutating is the process of creating new columns in a table. The oper-
ation is reversible, as the original columns are kept. The new columns
are added to the dataset.
The values in the new column are determined by a function of the
rows. The expression is a function that returns a vector of values given
the values in the other columns. Similarly to filtering, in its simplest form, we can assume that card(𝑟) ≤ 1 for all 𝑟 and that the expression is applied to each row independently.
mutate𝑓 (𝑇) = 𝑇 ′ ,
c'(r, h) = \begin{cases} c(r, h) & \text{if } h \in H, \\ f(r, V(r)) & \text{if } h = h', \end{cases}
y = ifelse(x > 0, 1, 0).

Here, x and y are the names of an existing column and the new column, respectively. The ifelse(a, b, c) function returns b if the condition a is true and c otherwise; in this example, it yields 1 when x > 0 and 0 otherwise.
This function solves the issue of multiple variables stored in one col-
umn described in section 4.3.1.
As with filtering, the mutating operation is split-invariant even if
card(𝑟) > 1 for any 𝑟13. This is because the operation is applied to each
row independently. In this general case, an extra requirement is that the
function 𝑓 must return tuples with the same cardinality as the row it is
applied to.
5.3.7 Aggregating
Many times, it is easier to reason about the table when all rows have
cardinality 1. Aggregation ensures that the table has this property.
aggregate𝑓 (𝑇) = 𝑇 ′ ,
(𝒟(ℎ) ∪ {?} ∶ ℎ ∈ 𝐻) ,
5.3.8 Ungrouping
We discussed that the fewer index columns a table has — assuming we
guarantee that all information about that entity is present — the safer
13It just changes the input space of function 𝑓.
ungroupℎ′ (𝑇) = 𝑇 ′ ,
𝑐′ (𝑟 + 𝑟′ , ℎ) = (𝑣 𝑖,ℎ ∶ 𝑖),
Note that the operation requires that the column ℎ′ has no missing
values.
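The closest counterpart in Python/pandas is explode, sketched below with hypothetical data; note that explode produces one row per nested value, whereas the ungrouping defined here collapses repeated (city, year) pairs into a single row of cardinality greater than one.

import pandas as pd

nested = pd.DataFrame({
    "city": ["A", "B"],
    "year": [[2019, 2020, 2020, 2021], [2019, 2020, 2021]],   # rows with cardinality 4 and 3
})

flat = nested.explode("year").reset_index(drop=True)
print(flat)   # 7 rows; the pair (A, 2020) appears twice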
Table 5.6 shows an example of ungrouping. In the top table, there
are two rows, one with cardinality 4 and the other with cardinality 3.
The column year is ungrouped, creating new rows for each value in the
nested row. The bottom table is the result of ungrouping the column
year. Although there were 7 nested rows in the original table, the bottom
table has 6 rows — the number of nested rows is preserved however. The
reason is that the row (A, 2020) has cardinality 2.
In table 5.6, the index of the top table is the column city, and the bottom table is the result of ungrouping the column year.
The ungrouping operation is split-invariant. To see this, consider two disjoint tables 𝑇0 = (𝐾, 𝐻, 𝑐0) and 𝑇1 = (𝐾, 𝐻, 𝑐1); we have
where 𝑐′𝑗(𝑟 + 𝑟′, ℎ) = (𝑣𝑖,ℎ ∶ 𝑖 such that 𝑣𝑖,ℎ′ = 𝑟′). Since the tables are disjoint, the rows of the output tables are also disjoint. In other words,
For any 𝑟, either card(𝑟 + 𝑟′ ; 𝑐 0 ) = 0 or card(𝑟 + 𝑟′ ; 𝑐 1 ) = 0 indepen-
dently of the value of 𝑟′ . The reason is that there is no possible 𝑣 𝑖,ℎ′ = 𝑟′
if 𝑟 is not present in the table.
Then,
project𝐾 ′ (𝑇) = 𝑇 ′ ,
where 𝑇 ′ = (𝐾 ′ , 𝐻 ∪ (𝐾 ∖ 𝐾 ′ ), 𝑐′ ) and
c'(r, h) = \begin{cases} \sum_{r'} c(r + r', h) & \text{if } h \in H, \\ \sum_{k' \in \mathcal{D}(h)} k' & \text{if } h \in K \setminus K', \end{cases}
for all valid row 𝑟 considering the indices 𝐾 ′ and for all tuples
𝑟′ = (𝑘𝑖 ∶ 𝑖) such that 𝑘𝑖 ∈ 𝒟(𝐾𝑖 ) for all 𝐾𝑖 ∈ 𝐾 ∖ 𝐾 ′ .
We can see that projection for our tables is a little more complex
than the usual projection in relational algebra. Consider the example we
discussed in section 4.2 as well, where we have a table with the columns
student, subject, year, and grade.
Table 5.7 (top) shows that table adapted for our definitions. Suppose
we want to project the table to have only the entity course. Now each
row (bottom table) represents a course. The column student is not an
index column anymore, and the values in the column are exhaustive
and unique, i.e., the whole set 𝒟(student) is represented in the column
for each row.
Thus, projection is a very useful operation when we want to change
the observational unit of the data, particularly to the entity represented
by a subset of the index columns. Semantically, projection groups the rows by the values of the remaining index columns.
It is easy to see that the projection operation is not split-invariant.
Consider the following example. If we split the top table in table 5.7
so the first row (Alice, Math) is in one table and the second row (Alice,
Physics) is in another, the bind operation between the projection into the
entity student of these two tables is not allowed. The reason is that the
row (Alice) will be present in both tables, violating the disjoint property
of the tables.
The consequence is that a poor architecture of the data schema can
lead to incorrect conclusions in the face of missing information (due
to split). This is one of the reasons why database normalization is so
important. The usage of parts of the tables without fully denormalizing
them is a bad practice that can lead to spurious information.
where the order of the values may play a role in the result of the function.
Examples of aggregation functions are sum (summation), mean
(average value), count (number of elements), and first (first ele-
ment of the tuple). Examples of window functions are cumsum (cumu-
lative sum), lag (a tuple with the previous values), and rank (the
rank of the value in the tuple given some ordering).
Here, we consider that the rows of the table have cardinality equal to
one — as discussed before, one can use ungrouping (section 5.3.8) and
aggregation (section 5.3.7) to ensure this property. Without loss of gen-
erality, we also assume that there is only one index, called row number,
such that each row has a unique value for this index15.
will create a new column y with the cumulative sum of the x column
for each category given the order of the rows defined by the date
column.
15Since the operations we describe here are not split-invariant, we can assume a previ-
ous projection of the data, see section 5.4.1.
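A sketch of such a grouped and arranged operation in Python/pandas, with hypothetical column names, follows.

import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "a", "b"],
    "date": pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-01"]),
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Arrange by date, group by category, and apply the window function cumsum within each group.
df = df.sort_values("date")
df["y"] = df.groupby("category")["x"].cumsum()
print(df.sort_index())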
since the missing value for the year 2020 was implicit.
Chapter remarks
Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 The learning problem . . . . . . . . . . . . . . . . . . . . 115
6.2.1 Learning tasks . . . . . . . . . . . . . . . . . . . . 115
6.2.2 A few remarks . . . . . . . . . . . . . . . . . . . . 117
6.3 Optimal solutions . . . . . . . . . . . . . . . . . . . . . . 118
6.3.1 Bayes classifier . . . . . . . . . . . . . . . . . . . . 118
6.3.2 Regression function . . . . . . . . . . . . . . . . . 120
6.4 ERM inductive principle . . . . . . . . . . . . . . . . . . 122
6.4.1 Consistency of the learning process . . . . . . . . 122
6.4.2 Rate of convergence . . . . . . . . . . . . . . . . . 123
6.4.3 VC entropy . . . . . . . . . . . . . . . . . . . . . . 123
6.4.4 Growing function and VC dimension . . . . . . . 124
6.5 SRM inductive principle . . . . . . . . . . . . . . . . . . . 126
6.5.1 Bias-variance trade-off . . . . . . . . . . . . . . . . 129
6.5.2 Regularization . . . . . . . . . . . . . . . . . . . . 131
6.6 Linear problems . . . . . . . . . . . . . . . . . . . . . . . 132
6.6.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . 132
6.6.2 Maximal margin classifier . . . . . . . . . . . . . 137
6.7 Closing remarks . . . . . . . . . . . . . . . . . . . . . . . 139
Context
Objectives
Takeaways
6.1 Introduction
Several problems can be addressed by techniques that utilize data in
some way. Once we focus on one particular problem — inductive learn-
ing —, we need to define the scope of the tasks we are interested in. Let
us start from the broader fields to the more specific ones.
Artificial intelligence (AI) is a very broad field, including not only
the study of algorithms that exhibit intelligent behavior, but also the
study of the behavior of intelligent systems. For instance, it encom-
passes the study of optimization methods, bio-inspired algorithms, ro-
botics, philosophy of mind, and many other topics. We are interested in
the subfield of artificial intelligence that studies algorithms that exhibit
some form of intelligent behavior.
A more specific subfield of AI is machine learning (ML), which stud-
ies algorithms that enable computers to learn and improve their perfor-
mance on a task from experience automatically, without being explicitly
programmed by a human being.
Programming a computer to play chess is a good example of the dif-
ference between traditional AI and ML. In traditional AI, a human pro-
grammer would write a program that contains the rules of chess and the
strategies to play the game. The algorithm might even “search” among
the possible moves to find the best one. In ML, the programmer would
write a program that learns to play chess by playing against itself, against
other programs, or even from watching games played by humans. The
system would learn the rules of chess and the strategies to play the game
by itself.
This field is particularly useful when the task is too complex to be
solved by traditional programming methods or when we do not know
how to solve the task. Among the many tasks that can be addressed by
ML, we can specialize even more.
Predictive learning is the ML paradigm that focuses on making pre-
dictions about outcomes (sometimes about the future) based on histori-
cal data. Predictive tasks involve predicting the value of a target variable
based on the values of one or more input variables1.
Depending on the reasoning behind the learning algorithms, we can
divide the learning field into two main approaches: inductive learning
and transductive learning2.
1Descriptive learning, which is out of the scope of this book, focuses on describing
the relationships between variables in the data without the need for a target variable.
2Transduction is the process of obtaining specific knowledge from specific observations, and it is not the focus of this book.
[Figure: nested scopes of the fields: artificial intelligence ⊃ machine learning ⊃ predictive learning ⊃ inductive learning.]
(𝑐(𝑟, ℎ′ ) ∶ ℎ′ ∈ 𝐻 ∖ {ℎ} ).
Similarly, the variables 𝐾 that describe each unit are set aside, as it does
not make sense to infer general rules from them.
From the statistical point of view, learning problems consist of an-
swering questions about the distribution of the data.
the goal is to find the function 𝑓𝜃 that minimizes 𝑅(𝜃) where the only
available information is the training set given by (6.1).
This formulation encompasses many specific tasks. I focus on two of
them, which I believe are the most fundamental ones: binary data clas-
sification4 and regression estimation5. (I leave aside the density estimation problem, since it is not addressed in the remainder of the book.)
\mathcal{L}(y, f_\theta(x)) = \begin{cases} 0 & \text{if } y = f_\theta(x), \\ 1 & \text{if } y \neq f_\theta(x), \end{cases}
the risk (6.2) becomes the probability of classification error. The func-
tion 𝑓𝜃 , in this case, is called a classifier and 𝑦 is called the label.
In section 6.3, we show that the function that minimizes the risk with
such a loss function is the so-called regression. The estimator 𝑓𝜃 of the
regression, in this case, is called a regressor.
4In SLT, Vapnik calls it pattern recognition.
5We are not talking about regression analysis; regression estimation is closer to the
scoring task definition by N. Zumel and J. Mount (2019). Practical Data Science with R.
2nd ed. Shelter Island, NY, USA: Manning.
6Alternatively, negative class is represented by −1 and positive class by 1.
As one should expect, dealing with more than two classes is more com-
plex than dealing with only two classes. If possible, prefer to deal with
binary classification tasks first.
Number of inputs and outputs Note that the definition of the learn-
ing problem does not restrict the number of inputs and outputs. The in-
put data can be a scalar, a vector, a matrix, or a tensor, and the output as
well. The learning machine must be able to handle the input and output
data according to the problem.
We can easily see that the Bayes classifier is the optimal solution for
the binary data classification task. The probability of classification error
for an arbitrary classifier 𝑓 is
where 𝟙⋅ is the indicator function that returns one if the condition is true
and zero otherwise. Let 𝑏(𝑥) = P(𝑦 = 1 ∣ 𝑥); we have that
which means only one of the terms is nonzero for each 𝑥. Thus, the risk is minimized by choosing a classifier such that 𝑓(𝑥) = 1 if 𝑏(𝑥) > 1 − 𝑏(𝑥) and 𝑓(𝑥) = 0 otherwise. This is the Bayes classifier.
We know that 𝑓Bayes (𝑥) = 1 if 𝑏(𝑥) > 0.5 and 𝑓Bayes (𝑥) = 0 otherwise.
Thus, the Bayes error rate can be rewritten as
[Figure 6.2: class-conditional densities P(𝑥 ∣ 𝑦 = 0) and P(𝑥 ∣ 𝑦 = 1). The Bayes classifier is the line that separates the two classes. The Bayes error is a result of the darker area in which the distributions of the classes intersect.]
Figure 6.2 illustrates the Bayes classifier and its error rate. The verti-
cal line represents the Bayes classifier that separates the classes the best
way possible in the space of the feature vectors 𝑥. Since the distributions
P(𝑥 ∣ 𝑦 = 0) and P(𝑥 ∣ 𝑦 = 1) may intersect, there is a region where the
Bayes classifier cannot always predict the class correctly.
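A small numerical sketch may help. Assume, only for illustration, two Gaussian class conditionals with equal priors; the Bayes classifier then compares the class densities, and its error rate can be estimated by simulation.

import numpy as np

rng = np.random.default_rng(0)

def density(x, mu):
    # Gaussian density with unit variance.
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def bayes_classifier(x):
    # With equal priors, b(x) > 1 - b(x) is equivalent to comparing the class densities.
    return (density(x, 2.0) > density(x, 0.0)).astype(int)

# Monte Carlo estimate of the Bayes error rate.
y = rng.integers(0, 2, size=100_000)
x = rng.normal(loc=2.0 * y, scale=1.0)
print(np.mean(bayes_classifier(x) != y))   # about 0.16, the overlap of the two densities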
that is the expected value of the target variable 𝑦 given the input 𝑥.
It is easy to show that the regression function minimizes the risk
(6.2) with loss
\mathcal{L}(y, r(x)) = (y - r(x))^2.
The risk functional for an arbitrary function 𝑓 is
R(f) = \int (y - f(x))^2 \, dP(x, y) =
however we can substitute 𝑟(𝑥) for the inner integral and obtain
[Figure: regression function 𝑟(𝑥) = 𝑥 with noise 𝜎 = 1, illustrating the explained and unexplained variance.]
Approximating 𝑅(𝜃) by the empirical risk functional 𝑅𝑛 (𝜃) is the
so-called empirical risk minimization (ERM) inductive principle. The
ERM principle is the basis of the SLT.
Traditional methods, such as least squares, maximum likelihood,
and maximum a posteriori, are all realizations of the ERM principle for
specific loss functions and hypothesis spaces.
6.4.3 VC entropy
Let 𝐿(𝑧, 𝜃), 𝜃 ∈ Θ, be a set of bounded loss functions, i.e.
for some constant 𝑀 and all 𝑧 and 𝜃. One can construct 𝑛-dimensional
vectors
𝑙(𝑧1 , … , 𝑧𝑛 ; 𝜃) = [𝐿(𝑧1 , 𝜃), … , 𝐿(𝑧𝑛 , 𝜃)].
Since the loss functions are bounded, this set of vectors belongs to an 𝑛-dimensional cube and has a finite minimal 𝜖-net9.
Consider the quantity 𝑁(𝑧1 , … , 𝑧𝑛 ; Θ, 𝜖) that counts the number of elements of the minimal 𝜖-net of that set of vectors. Since the quantity 𝑁 is a random variable, we can define the VC entropy as
is infinite, even though the parameter 𝜃 is a scalar. See fig. 6.5. By in-
creasing the frequency 𝜃 of the sine wave, the function can approximate
any set of points.
This opens remarkable opportunities to find good solutions contain-
ing a huge number of parameters11 but with a finite VC dimension.
R(\theta_n) \le R_n(\theta_n) + \frac{B\mathcal{E}}{2}\left(1 + \sqrt{1 + \frac{4 R_n(\theta_n)}{B\mathcal{E}}}\right), \qquad (6.6)
12V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-Verlag
New York, Inc. isbn: 978-1-4419-3160-3.
13For the sake of the arguments, we consider only the expression for bounded losses
and an hypothesis space with infinite number of functions. Rigorously, the loss function
may not be bounded; consult the original work for the complete expressions.
with

\mathcal{E} = 4\,\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n},
where 𝐵 is the upper bound of the loss function, ℎ is the VC dimension
of the hypothesis space, 𝑛 is the number of samples. The term 𝜂 is the
confidence level, i.e., the inequality holds with probability 1 − 𝜂.
It is easy to see that as the number of samples 𝑛 increases, the em-
pirical risk 𝑅𝑛 (𝜃𝑛 ) approaches the true risk 𝑅(𝜃𝑛 ). Also, the greater the
VC dimension ℎ, the greater the term ℰ, decreasing the generalization
ability of the learning machine.
In other words, if 𝑛/ℎ is small, a small empirical risk does not guar-
antee a small value for the actual risk. A consequence is that we need
to minimize both terms of the right-hand side of the inequality eq. (6.6)
to achieve a good generalization ability.
Two problems that can arise in the learning process are underfit-
ting and overfitting. Underfitting occurs when the model is too
simple (low VC dimension) and cannot capture the complexity of
the training data (high empirical risk). Overfitting occurs when
the model is too complex (high VC dimension increases the con-
fidence interval) and fits the training data almost perfectly (low
empirical risk).
satisfying
ℎ1 ≤ ℎ2 ≤ ⋯ ≤ ℎ𝑛 ≤ … ,
[Figure: risk, empirical risk, and confidence interval as functions of the VC dimension of the structure elements. The upper bound of the risk is the sum of the empirical risk and the confidence interval. The smallest bound is found for some 𝑘∗ in the admissible structure.]
14Note that the VC dimension considering the whole set Θ might be infinite. Moreover,
in the original formulation, the sets 𝑆𝑘 also need to satisfy some bounds; read more in
chapter 4 of V. N. Vapnik (1999). The nature of statistical learning theory. 2nd ed. Springer-
Verlag New York, Inc. isbn: 978-1-4419-3160-3.
𝐷 = {(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 )}
such that
𝑦 𝑖 = 𝑓(𝑥𝑖 ) + 𝜖,
for a fixed function 𝑓 and a random noise 𝜖 with zero mean and variance 𝜎², where 𝑥𝑖 are i.i.d. samples drawn from some distribution P(𝑥).
Also, consider that \bar{f}(x) is the expected value of the function \hat{f}(x; D) over all possible training sets D, i.e.

\bar{f}(x) = \int \hat{f}(x; D) \, dP(D).
(Note that the models themselves are the random variable we are study-
ing here.)
For any model \hat{f}, the expected (squared) error for a particular sample (x, y), E_D[(y - \hat{f}(x; D))^2], is

\int (y - \hat{f}(x))^2 \, dP(D, \epsilon) = \int (y - f(x) + f(x) - \hat{f}(x))^2 \, dP(D, \epsilon)
  = \int (y - f(x))^2 \, dP(D)   \qquad (6.7)
  + \int (f(x) - \hat{f}(x))^2 \, dP(D)   \qquad (6.8)
  + 2 \int (y - f(x)) (f(x) - \hat{f}(x)) \, dP(D, \epsilon).   \qquad (6.9)
\int (y - f(x)) (f(x) - \hat{f}(x)) \, dP(D, \epsilon) = \int \epsilon \, (f(x) - \hat{f}(x)) \, dP(D, \epsilon)
  = \int \epsilon \, dP(\epsilon) \int (f(x) - \hat{f}(x)) \, dP(D) = 0,
since P(𝐷) and P(𝜖) are independent and E[𝜖] = 0 by definition.
We can apply a similar strategy to analyze the term (6.8):
\int (f(x) - \hat{f}(x))^2 \, dP(D) = \int (f(x) - \bar{f}(x) + \bar{f}(x) - \hat{f}(x))^2 \, dP(D)
  = \int (f(x) - \bar{f}(x))^2 \, dP(D)   \qquad (6.11)
  + \int (\bar{f}(x) - \hat{f}(x))^2 \, dP(D)   \qquad (6.12)
  + 2 \int (f(x) - \bar{f}(x)) (\bar{f}(x) - \hat{f}(x)) \, dP(D).   \qquad (6.13)
Now, the term (6.13) is also null:

\int (f(x) - \bar{f}(x)) (\bar{f}(x) - \hat{f}(x; D)) \, dP(D)
  = (f(x) - \bar{f}(x)) \int (\bar{f}(x) - \hat{f}(x; D)) \, dP(D)
  = (f(x) - \bar{f}(x)) \left( \bar{f}(x) - \int \hat{f}(x; D) \, dP(D) \right) = 0,

since \bar{f}(x) is the expected value of \hat{f}(x; D).
The term (6.11) does not depend on the training set, so
\int (f(x) - \bar{f}(x))^2 \, dP(D) = (f(x) - \bar{f}(x))^2.   \qquad (6.14)
\int (\bar{f}(x) - \hat{f}(x; D))^2 \, dP(D) = E_D\left[(\bar{f}(x) - \hat{f}(x; D))^2\right] = \mathrm{Var}_D\left(\hat{f}(x; D)\right).   \qquad (6.15)
Finally, putting all together — i.e. eqs. (6.10), (6.14) and (6.15) —,
we have that the expected error for a particular sample (𝑥, 𝑦) is
E_D\left[(y - \hat{f}(x; D))^2\right] = \sigma^2 + \left(f(x) - E\left[\hat{f}(x; D)\right]\right)^2 + \mathrm{Var}_D\left(\hat{f}(x; D)\right).
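The decomposition can be observed empirically. The sketch below, with arbitrary choices of target function, noise level, and model families, estimates the bias and variance terms by refitting polynomials on many simulated training sets.

import numpy as np

rng = np.random.default_rng(42)
f = lambda x: np.sin(2 * np.pi * x)      # fixed target function
sigma = 0.3                               # noise standard deviation
x_test = np.linspace(0, 1, 50)

def bias_variance(degree, repeats=200, n=20):
    preds = []
    for _ in range(repeats):              # many independent training sets D
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        coef = np.polyfit(x, y, degree)
        preds.append(np.polyval(coef, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for degree in (1, 9):
    print(degree, bias_variance(degree))  # low degree: high bias; high degree: high variance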
6.5.2 Regularization
Also related to the SRM principle is the concept of regularization. Reg-
ularization encourages models to learn robust patterns within the data
rather than memorizing it.
Regularization techniques usually modify the loss by adding a pen-
alty term that depends on the complexity of the model. So, instead of
minimizing the empirical risk 𝑅𝑛 (𝜃), the learning machine minimizes
the regularized empirical risk
𝑅𝑛 (𝜃) + 𝜆Ω(𝜃),
where Ω(𝜃) is the complexity of the model and 𝜆 is a hyperparameter
that controls the trade-off between the empirical risk and the complex-
ity. Note that the regularization term acts as a proxy for the confidence
interval in the SRM principle. However, regularization is often justified
by common sense or intuition, rather than by strong theoretical argu-
ments.
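As one concrete instance, ridge regression adds the penalty Ω(θ) = ‖θ‖² to the squared loss; the closed-form sketch below uses NumPy and synthetic data for illustration only.

import numpy as np

def ridge_fit(X, y, lam):
    # Minimizes (1/n) * ||y - X theta||^2 + lam * ||theta||^2 (closed form).
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
theta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=50)

print(ridge_fit(X, y, lam=0.0))   # ordinary least squares (lambda = 0)
print(ridge_fit(X, y, lam=1.0))   # larger lambda shrinks the coefficients toward zero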
Other approaches that indirectly control the complexity of the model
— such as early stopping, dropout, ensembles, and pruning — are often
called implicit regularization.
𝑥1  𝑥2  𝑦 = 𝑥1 ∧ 𝑥2        𝑥1  𝑥2  𝑦 = 𝑥1 ⊕ 𝑥2
 0   0   0                  0   0   0
 0   1   0                  0   1   1
 1   0   0                  1   0   1
 1   1   1                  1   1   0
• The perceptron, which fixes the complexity of the model and tries
to minimize the empirical risk; and
• The maximal margin classifier, which fixes the empirical risk —
in this case, zero – and tries to minimize the confidence interval.
6.6.1 Perceptron
The perceptron is a linear classifier that generates a hyperplane that sep-
arates the classes in the feature space. It is a parametric model, and the
learning process minimizes the empirical risk by adjusting its fixed set
of parameters.
Parametric models are usually simpler and faster to fit, but they are
less flexible. In other words, it is up to the researcher to choose the best
model “size” for the problem. If the model is too small, it will not be
able to capture the complexity of the data. If the model is too large, it
tends to be too complex, too slow to train, and might overfit to the data.
𝑓(𝑥1 , 𝑥2 ; w = [𝑤 0 , 𝑤 1 , 𝑤 2 ]) = 𝑢(𝑤 0 + 𝑤 1 𝑥1 + 𝑤 2 𝑥2 ),
u(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{otherwise.} \end{cases}
[Figure 6.7: the AND dataset in the (𝑥1, 𝑥2) feature space and the decision boundary of the perceptron.]
In fig. 6.7, we show the hyperplane (in this case, a line) that the
model with weights w = [−1.1, 0.6, 1] generates in this feature space.
As one can see, the classes are linearly separable, and the perceptron
model classifies the dataset correctly; see table 6.3.
Table 6.3: Truth table for the predictions of the perceptron in the AND dataset.

𝑥1  𝑥2  𝑦   −1.1 + 0.6𝑥1 + 𝑥2   𝑦̂
0   0   0   −1.1                0
0   1   0   −0.1                0
1   0   0   −0.5                0
1   1   1    0.5                1
[Figure 6.8: the XOR dataset in the (𝑥1, 𝑥2) feature space and the decision boundary of the perceptron.]
In fig. 6.8, we show the hyperplane that the model w = [−0.5, 1, −1]
generates for the XOR dataset. As one can see, the perceptron model
fails to solve the task since there is no single decision boundary that can
classify this data.
Table 6.4: Truth table for the predictions of the perceptron in the XOR dataset.

𝑥1  𝑥2  𝑦   −0.5 + 𝑥1 − 𝑥2   𝑦̂
0   0   0   −0.5             0
0   1   1   −1.5             0
1   0   1    0.5             1
1   1   0   −0.5             0
vectors, we can subtract 𝜂x from w, for some small 𝜂 > 0 — see fig. 6.9.
The error here is 𝑒 = −1.
[Figure 6.9: the weight vector is updated to w′ = w − 𝜂x.]
[Figure: the weight vector is updated to w′ = w + 𝜂x.]
where 𝜂 is a small positive number that controls the step size of the algo-
rithm. Note that this rule works even for cases 1 and 4, where the error
is zero.
The algorithm converges provided that 𝜂 is sufficiently small and the dataset is linearly separable. Note that the algorithm does not make any effort to reduce the confidence interval.
The perceptron is (possibly) the simplest artificial neural network.
More complex networks can be built by stacking perceptrons in layers
and adding non-linear activation functions. The training strategies for
those networks are usually based on reducing the empirical risk using
the gradient descent algorithm while controlling the complexity of the
model with regularization techniques15. Consult appendix B.1.
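A minimal implementation of the perceptron learning rule in Python follows; it reproduces the AND example, with an augmented input so that 𝑤0 is treated like the other weights.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    # Update rule: w <- w + eta * e * x, with error e = y - y_hat (no update when e = 0).
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a constant 1 for w_0
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xa, y):
            y_hat = 1 if xi @ w > 0 else 0
            w += eta * (yi - y_hat) * xi
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
w = train_perceptron(X, y_and)
preds = [1 if np.r_[1, xi] @ w > 0 else 0 for xi in X]
print(w, preds)   # the predictions reproduce the AND truth table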
\Omega\left(\frac{h}{n}\right),
(w ⋅ x) − 𝑏 = 0, ‖w‖ = 1,
y = \begin{cases} 1 & \text{if } (\mathbf{w} \cdot \mathbf{x}) - b \ge \Delta, \\ -1 & \text{if } (\mathbf{w} \cdot \mathbf{x}) - b \le -\Delta. \end{cases}
h \le \min\left(\left\lfloor \frac{R^2}{\Delta^2} \right\rfloor, d\right) + 1,
[Figure: the optimal (maximal margin) hyperplane and its margin in the (𝑥1, 𝑥2) feature space.]
𝑦 𝑖 [(w ⋅ x𝑖 − 𝑏)] ≥ 1,
for some 𝑏 and coefficients 𝑎𝑖 > 0 for the support vectors (𝑎𝑖 = 0 other-
wise).
In the case that the classes are not linearly separable, the maximal
margin classifier can be extended to the soft margin classifier, which sets
the empirical risk to a value greater than zero.
Moreover, since the number of parameters of the maximal margin
classifier depends on the training data (i.e., the number of support vec-
tors), it is a nonparametric model. Nonparametric models are those in
which the number of parameters is not fixed and can grow as needed
to fit the data. This property becomes clearer when we consider the
kernel trick, which allows the maximal margin classifier to deal with
nonlinear problems. Consult V. N. Vapnik (1999)17 for more details.
Chapter remarks
Contents
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.1 Formal definition . . . . . . . . . . . . . . . . . . 144
7.1.2 Degeneration . . . . . . . . . . . . . . . . . . . . 144
7.1.3 Data preprocessing tasks . . . . . . . . . . . . . . 145
7.2 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.1 Treating inconsistent data . . . . . . . . . . . . . 146
7.2.2 Outlier detection . . . . . . . . . . . . . . . . . . 148
7.2.3 Treating missing data . . . . . . . . . . . . . . . . 149
7.3 Data sampling . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.1 Random sampling . . . . . . . . . . . . . . . . . . 152
7.3.2 Scope filtering . . . . . . . . . . . . . . . . . . . . 152
7.3.3 Class balancing . . . . . . . . . . . . . . . . . . . 153
7.4 Data transformation . . . . . . . . . . . . . . . . . . . . . 154
7.4.1 Type conversion . . . . . . . . . . . . . . . . . . . 155
7.4.2 Normalization . . . . . . . . . . . . . . . . . . . . 157
7.4.3 Dimensionality reduction . . . . . . . . . . . . . . 158
7.4.4 Data enhancement . . . . . . . . . . . . . . . . . 159
7.4.5 Comments on unstructured data . . . . . . . . . . 160
Context
Objectives
Takeaways
7.1 Introduction
In chapters 4 and 5, we discussed data semantics and the tools to han-
dle data. They provide the grounds for preparation of the data as we
described in the data sprint tasks in section 3.5.3. However, the focus is
to guarantee that the data is tidy and in the observational unit of interest,
not to prepare it for modeling.
As a result, although data might be appropriate for the learning tasks
we described in chapter 6 — in the sense that we know what the feature
vectors and the target variable are —, they might not be suitable for the
machine learning methods we will use.
One simple example is the perceptron (section 6.6.1) that assumes
all input variables are real numbers. If the data contains categorical
variables, we must convert them to numerical variables before applying
the perceptron.
For this reason, the solution sprint tasks in section 3.5.3 include not
only the learning tasks but also the data preprocessing tasks, which are
dependent on the chosen machine learning methods.
z \in \bigtimes_{h \in H} \left( \mathcal{D}(h) \cup \{?\} \right)

z' \in \bigtimes_{h' \in H'} \left( \mathcal{D}(h') \cup \{?\} \right).
where ∘ is the composition operator. I say that they are dependent since
none of the operations can be applied to the table without the previous
ones.
7.1.2 Degeneration
The objective of the fitted preprocessor is to adjust the data to make it
suitable for the model. However, sometimes it cannot achieve this goal
for a particular input 𝑧. This can happen for many reasons, such as
unexpected values, information “too incomplete” to make a prediction,
etc.
Formally, we say that the preprocessor 𝑓𝜙 degenerates over tuple 𝑧
if it outputs 𝑧 ′ = 𝑓𝜙 (𝑧) such that 𝑧 ′ = (?, … , ?). In practice, that means
that the preprocessor decided that it has no strategy to adjust the data to
make it suitable for the model. For the sake of simplicity, if any step 𝑓𝜙𝑖
degenerates over tuple 𝑧 (𝑖) , the whole preprocessing chain degenerates1
over 𝑧 = 𝑧 (0) .
Consequently, in the implementation of the solution, the developer
must choose a default behavior for the model when the preprocessing
chain degenerates over a tuple. It can be as simple as returning a default
value or as complex as redirecting the tuple to a different pair of prepro-
cessor and model. Sometimes, the developer can choose to integrate this
as an error or warning in the user application.
• Data cleaning;
• Data sampling; and
• Data transformation.
In the next sections, I address some of the most common data pre-
processing tasks in each of these categories. I present them in the order
they are usually applied in the preprocessing, but note that the order is
not fixed and can be changed according to the needs of the problem.
range from the simple removal of the observations with missing data
to the creation of new information to encode the missing data.
Unit conversion
Goal: Convert physical quantities into the same unit of measurement.
Fitting: None. User must declare the units to be used and, if appropriate, the conversion factors.
Adjustment: Training set is adjusted sample by sample, independently.
Applying: Preprocessor converts the numerical values and drops the unit of measurement column.
Range check
Goal: Check whether the values are within the expected range.
Fitting: None. User must declare the valid range of values.
Adjustment: Training set is adjusted sample by sample, independently. If appropriate, degenerated samples are removed.
Applying: Preprocessor checks whether the value 𝑥 of a variable is within the range [𝑎, 𝑏]. If not, it replaces 𝑥 with: (a) missing value ?, (b) the closest valid value max(𝑎, min(𝑏, 𝑥)), or (c) degenerates (discards the observation).
Category standardize
Goal: Create a dictionary and/or function to map different names to a single one.
Fitting: None. User must declare the mapping.
Adjustment: Training set is adjusted sample by sample, independently.
Applying: Preprocessor replaces the categorical variable 𝑥 with the mapped value 𝑓(𝑥) that implements case standardization, special character removal, and/or dictionary fuzzy matching.
Note that these technique parameters are not fitted from the data,
but rather are fixed from the problem definition. As a result, they could
be done in the data handling phase. The reason we put them here is that
the new data in production usually come with the same issues. Having
the fixes programmed into the preprocessor makes it easier to guarantee
that the model will behave as expected in production.
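The fit/apply pattern can be sketched as a small Python class; the example implements the range check with the closest-valid-value strategy and is only illustrative.

class RangeCheck:
    """Replace out-of-range values with the closest valid value max(a, min(b, x))."""

    def __init__(self, lower, upper):
        # The range comes from the problem definition, not from the data.
        self.lower, self.upper = lower, upper

    def fit(self, values):
        # Nothing is estimated; the method exists only to follow the fit/apply convention.
        return self

    def transform(self, values):
        return [min(self.upper, max(self.lower, v)) for v in values]

rc = RangeCheck(0.0, 120.0).fit([33.0, 41.0])
print(rc.transform([25.0, 130.0, -4.0]))   # [25.0, 120.0, 0.0]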
Outlier removal
Goal: Remove the observations that are outliers.
Fitting: Parameters of the outlier classifier.
Adjustment: Training set is adjusted sample by sample, independently, removing degenerated samples.
Applying: Preprocessor degenerates if the sample is classified as an outlier and does nothing otherwise.
is not missing at random. Row removal suffers from the same problem
as any filtering operations (degeneration) in the preprocessing step; the
developer must specify a default behavior for the model when a row is
discarded in production. See table 7.6.
the mean, the median, or the mode4. This is a simple and effective strat-
egy, but it can introduce bias in the data, especially when the number
of samples with missing data is large. See table 7.8.
Just imputing data is not suitable when one is not sure whether the
missing data is missing because of a systematic error or phenomenon. A
model can learn the effect of the underlying reason for missingness for
the predictive task. In that case, creating an indicator variable is a good
strategy. This is done by creating a new column that contains a logical
value indicating whether the data is missing or not5.
Many times the indicator variable is already present in the data. For
instance, in a dataset that contains information about pregnancy, let us
say the number of days since the last pregnancy. This information will
certainly be missing if sex is male or the number of children is zero. In
this case, no new indicator variable is needed. See table 7.7.
4More sophisticated methods can be used, such as the k-nearest neighbors algorithm,
for example, consult O. Troyanskaya et al. (June 2001). “Missing value estimation methods
for DNA microarrays”. In: Bioinformatics 17.6, pp. 520–525. issn: 1367-4803. doi: 10.
1093/bioinformatics/17.6.520.
5Some kind of imputation is still needed, but we expect the model to deal better with
it since it can decide using both the indicator and the original variable.
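A sketch of imputation combined with an indicator variable, using Python/pandas and hypothetical data, is shown below; note that the imputation value is fitted on the training set only.

import pandas as pd

train = pd.DataFrame({"days_since_last_pregnancy": [120.0, None, 400.0, None]})
new = pd.DataFrame({"days_since_last_pregnancy": [None, 210.0]})

median = train["days_since_last_pregnancy"].median()   # fitted on the training set

def impute_with_indicator(df):
    out = df.copy()
    out["days_missing"] = out["days_since_last_pregnancy"].isna()
    out["days_since_last_pregnancy"] = out["days_since_last_pregnancy"].fillna(median)
    return out

print(impute_with_indicator(train))
print(impute_with_indicator(new))   # production data is adjusted with the stored median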
Random sampling
Goal: Select a random subset of the training data.
Fitting: None. User must declare the size of the sample.
Adjustment: Rows of the training set are randomly chosen.
Applying: Pass-through: preprocessor does nothing with the new data.
Scope filtering
Goal: Remove the observations that do not satisfy a predefined rule.
Fitting: None. User must declare the rule.
Adjustment: Training set is adjusted sample by sample, independently, removing degenerated samples.
Applying: Preprocessor degenerates over the samples that do not satisfy the rule.
6F. Stulp and O. Sigaud (2015). “Many regression algorithms, one unified model: A
review”. In: Neural Networks 69, pp. 60–79. issn: 0893-6080. doi: https://doi.org/10.
1016/j.neunet.2015.05.005.
7Sometimes called bootstrapping.
Class balancing
Goal: Balance the number of observations in each class.
Fitting: None. User must declare the number of samples in each class.
Adjustment: Rows of the training set are randomly chosen.
Applying: Pass-through: preprocessor does nothing with the new data.
Type conversion is the process of changing the type of the values in the
columns. We do so to make the input variables compatible with the
machine learning methods we will use.
The most common type conversion is the conversion from categori-
cal to numerical values. Ideally, the possible values of a categorical vari-
able are known beforehand. For instance, given the values 𝑥 ∈ {𝑎, 𝑏, 𝑐}
in a column, there are two main ways to convert them to numerical val-
ues: label encoding and one-hot encoding. If there is a natural order
𝑎 < 𝑏 < 𝑐, label encoding is usually sufficient. Otherwise, one-hot
encoding can be used.
Label encoding is the process of replacing the values 𝑥 ∈ {𝑎, 𝑏, 𝑐}
with the values 𝑥′ ∈ {1, 2, 3}, where 𝑥′ = 1 if 𝑥 = 𝑎, 𝑥′ = 2 if 𝑥 = 𝑏, and
𝑥′ = 3 if 𝑥 = 𝑐. Other numerical values can be assigned depending on
the specific problem.
One-hot encoding is the process of creating a new column for each
possible value of the categorical variable. The new column is filled with
the logical value 1 if the value is present and 0 otherwise.
However, in the second case, the number of categories might be
too large or might not be known beforehand. So, the preprocessing
step must identify the unique values in the column and create the new
columns accordingly. It is common to group the less frequent values
into a single column, called the other column. See table 7.12.
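The following Python/pandas sketch, with hypothetical categories, fits the set of frequent values on the training data and maps everything else (including values unseen in training) to the other column.

import pandas as pd

train = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Fitting: store the most frequent categories; the rest goes to "other".
frequent = list(train["color"].value_counts().nlargest(2).index)
columns = [f"color_{c}" for c in frequent + ["other"]]

def one_hot(df):
    mapped = df["color"].where(df["color"].isin(frequent), "other")
    dummies = pd.get_dummies(mapped, prefix="color")
    # Guarantee the same columns in production, even if some category is absent.
    return dummies.reindex(columns=columns, fill_value=0)

print(one_hot(train))
print(one_hot(pd.DataFrame({"color": ["purple", "red"]})))   # purple falls into color_other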
The other direction is also common: converting numerical values to
categorical values. This is usually done by binning the numerical vari-
able, either by frequency or by range. In both cases, the user declares the
number of bins. Binning by frequency is done by finding the percentiles
of the values and creating the bins accordingly. Binning by range is done
by dividing the range of the values into equal parts, given the minimum
and maximum values. See table 7.13.
Another common task, although it receives less attention, is the con-
version of dates (or other interval variables) to numerical values. Inter-
val variables, like dates, have almost no information in their absolute
values. However, the difference between two dates can be very informa-
tive. For example, the difference between the date of birth and the date
of the last purchase becomes the age of the customer.
One-hot encoding
Goal: Create a new column for each possible value of the categorical variable.
Fitting: Store the unique values of the categorical variable. If appropriate, indicate the special category other.
Adjustment: Training set is adjusted sample by sample, independently.
Applying: Preprocessor creates a new column for each possible value of the categorical variable. The new column is filled with the logical value 1 if the old value matches the new column and 0 otherwise. If the value is new or among the less frequent values, it is assigned to the other column.
7.4.2 Normalization
Normalization is the process of scaling the values in the columns. This
is usually done to keep data within a specific range or to make different
variables comparable. For instance, some machine learning methods
require the input variables to be in the range [0, 1].
The most common normalization methods are standardization and
rescaling. The former is done by subtracting the mean and dividing by
the standard deviation of the values in the column. The latter is per-
formed so that the values are in a specific range, usually [0, 1] or [−1, 1].
Standardization works well when the values in the column are nor-
mally distributed. It not only keeps the values in an expected range but
also makes the data distribution comparable with other variables. Given
that 𝜇 is the mean and 𝜎 is the standard deviation of the values in the
column, the standardization is done by
x' = \frac{x - \mu}{\sigma}.   \qquad (7.1)
See table 7.14.
Standardization
Goal: Scale the values in a column.
Fitting: Store the statistics of the variable: the mean and the standard deviation.
Adjustment: Training set is adjusted sample by sample, independently.
Applying: Preprocessor scales the values according to eq. (7.1).
9The operation clamp(𝑥; 𝑎, 𝑏) where 𝑎 and 𝑏 are the lower and upper bounds, respec-
tively, is defined as max(𝑎, min(𝑏, 𝑥)).
Rescaling
Goal: Rescale the values in a column.
Fitting: Store the appropriate statistics of the variable: the minimum and the maximum values.
Adjustment: Training set is adjusted sample by sample, independently.
Applying: Preprocessor scales the values according to eq. (7.2).
Data enhancement
Goal: Enhance the dataset with external information.
Fitting: Store the external dataset and the column to join.
Adjustment: Training set is left joined with the external dataset. Because of the properties of the left join, the new dataset has the same number of rows as the original dataset, and it is equivalent to enhancing each row independently.
Applying: Preprocessor enhances the new data with the external information.
10D. Jurafsky and J. H. Martin (2008). Speech and Language Processing. An Introduc-
tion to Natural Language Processing, Computational Linguistics, and Speech Recognition.
2nd ed. Hoboken, NJ, USA: Prentice Hall. A new edition is is under preparation and it is
available for free: D. Jurafsky and J. H. Martin (2024). Speech and Language Processing:
An Introduction to Natural Language Processing, Computational Linguistics, and Speech
Recognition with Language Models. 3rd ed. Online manuscript released August 20, 2024.
url: https://web.stanford.edu/~jurafsky/slp3/.
11R. Szeliski (2022). Computer vision. Algorithms and applications. 2nd ed. Springer
Nature. url: https://szeliski.org/Book/.
Solution validation
8
All models are wrong, but some are useful.
— George E. P. Box, Robustness in Statistics
Chapter remarks
Contents
8.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.1.1 Binary classification evaluation . . . . . . . . . . 164
8.1.2 Regression estimation evaluation . . . . . . . . . 169
8.1.3 Probabilistic classification evaluation . . . . . . . 172
8.2 An experimental plan for data science . . . . . . . . . . . 175
8.2.1 Sampling strategy . . . . . . . . . . . . . . . . . . 176
8.2.2 Collecting evidence . . . . . . . . . . . . . . . . . 177
8.2.3 Estimating expected performance . . . . . . . . . 180
8.2.4 Comparing strategies . . . . . . . . . . . . . . . . 182
8.2.5 About nesting experiments . . . . . . . . . . . . . 183
8.3 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . 184
Context
Objectives
Takeaways
8.1 Evaluation
One fundamental step in the validation of a data-driven solution for a
task is the evaluation of the pair preprocessor and model. This chapter
presents strategies to measure performance of classifiers and regressors,
and how to interpret the results.
We consider the following setup. Let 𝑇 = (𝐾, 𝐻, 𝑐) be a table that
represents the data in the desired observational unit — as defined in
section 5.1. Without loss of generality — as the keys are not used in the
modeling process —, we can consider 𝐾 = {1, 2, … } such that card(𝑖) =
1, if 𝑖 ∈ {1, … , 𝑛}, and card(𝑖) = 0, otherwise. That means that every
row 𝑟 ∈ {1, … , 𝑛} is present in the table.
The table is split into two sets: a training set, given by indices (or keys) ℐtraining ⊆ {1, … , 𝑛}, and a test set, given by indices ℐtest ⊆ {1, … , 𝑛}, such that
ℐtraining ∩ ℐtest = ∅
and
ℐtraining ∪ ℐtest = {1, … , 𝑛}.
The bridge between the table format (definition 5.1) and the data
format used in the learning process (as described in section 6.2) is ex-
plained in the following. We say that the pair (x𝑖 , 𝑦 𝑖 ) contains the fea-
ture vector x𝑖 and the target value 𝑦 𝑖 of the sample with key 𝑖 in table 𝑇.
Mathematically, given target variable ℎ ∈ 𝐻, we have that 𝑦 𝑖 = 𝑐(𝑖, ℎ)
and x𝑖 is the tuple
(𝑐(𝑖, ℎ′ ) ∶ ℎ′ ∈ 𝐻 ∖ {ℎ} ).
For evaluation, we consider a data preprocessing technique 𝐹 and a
learning machine 𝑀. The following steps are taken.
c_{\text{training}}(i, h) = \begin{cases} c(i, h) & \text{if } i \in \mathcal{I}_{\text{training}}, \\ () & \text{otherwise.} \end{cases}
The result is an adjusted training set 𝑇′training and a fitted preprocessor
𝑓(x; 𝜙) ≡ 𝑓𝜙 (x), where x ∈ 𝒳 for some space 𝒳 that does not include
(or does not modify) the target variable — consult section 7.1.1. Note
that, by definition, the size of the adjusted training set can be different
from the original due to sampling or filtering. The hard requirement is
that the target variable ℎ is not changed.
c_{\text{test}}(i, h) = \begin{cases} c(i, h) & \text{if } i \in \mathcal{I}_{\text{test}}, \\ () & \text{otherwise.} \end{cases}
′
The result is a preprocessed test set 𝑇test from which we can obtain the
set 𝐷test = {(x𝑖 , 𝑦 𝑖 ) ∶ 𝑖 ∈ ℐtest } such that x′𝑖 = 𝑓𝜙 (x𝑖 ). Note that, to avoid
′ ′
data leakage and other issues, the preprocessor has no access to the tar-
get values 𝑦 𝑖 (even if the adjusted training set uses the label somehow).
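To make this protocol concrete, the sketch below follows the same steps in Python; the synthetic data, the standardization preprocessor, and the 70/30 split are illustrative choices of mine, not part of the formalism above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: 100 samples, 3 features, binary target.
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

    # Disjoint training and test indices (I_training and I_test).
    indices = rng.permutation(len(X))
    idx_train, idx_test = indices[:70], indices[70:]

    # Fit the preprocessor (here, standardization) on the training set only.
    mu = X[idx_train].mean(axis=0)
    sigma = X[idx_train].std(axis=0)

    def preprocess(X_part):
        """Apply the fitted preprocessor f_phi; it never sees the targets."""
        return (X_part - mu) / sigma

    X_train, y_train = preprocess(X[idx_train]), y[idx_train]
    X_test, y_test = preprocess(X[idx_test]), y[idx_test]
    # X_train/y_train feed the learning machine M; X_test/y_test are used
    # only for evaluation, avoiding data leakage.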
8.1.1 Binary classification evaluation

Confusion matrix
The confusion matrix is a table where the rows represent the true classes
and the columns represent the predicted classes. The diagonal of the
matrix represents the correct classifications, while the off-diagonal ele-
ments represent errors. For binary classification, the confusion matrix
is given by
                Predicted
                  1     0
    Expected  1   TP    FN
              0   FP    TN
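A minimal sketch of how the four counts can be computed from expected and predicted labels (the label vectors are made up):

    def confusion_counts(expected, predicted):
        """Return (TP, FN, FP, TN) for binary labels in {0, 1}."""
        tp = sum(1 for e, p in zip(expected, predicted) if e == 1 and p == 1)
        fn = sum(1 for e, p in zip(expected, predicted) if e == 1 and p == 0)
        fp = sum(1 for e, p in zip(expected, predicted) if e == 0 and p == 1)
        tn = sum(1 for e, p in zip(expected, predicted) if e == 0 and p == 0)
        return tp, fn, fp, tn

    print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)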
Performance metrics
From the confusion matrix, we can derive several performance metrics.
Each of them focuses on different aspects of the classification task, and
the choice of the metric depends on the problem at hand. Each metric
prioritizes different types of errors and yields a value between 0 and 1,
where 1 is the best possible value.
Recall, also known as sensitivity or true positive rate (TPR), is the
proportion of positive samples correctly classified,

Recall = TPR = TP / (TP + FN).

This metric is useful when the cost of missing a positive sample is high,
as it quantifies the ability of the classifier to avoid false negatives. It can
also be interpreted as the “completeness” of the classifier: how many
positive samples were correctly retrieved. For example, in a medical
diagnosis task, recall is important to avoid missing a diagnosis.
The F-score combines precision and recall into a single value, weighted
by a parameter 𝛽:

F-score(𝛽) = F𝛽-score = (1 + 𝛽²) ⋅ Precision ⋅ Recall / (𝛽² ⋅ Precision + Recall).
Specificity, or true negative rate (TNR), is the proportion of negative
samples correctly classified,

Specificity = TNR = TN / (TN + FP).
This metric is very common in the medical literature, but less common
in other contexts. The probable reason is that it is easier to interpret the
metrics that focus on the positive class, as the negative class is usually
the majority class — and, thus, less interesting.
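For reference, a sketch computing these metrics from the confusion matrix counts; it assumes nonzero denominators, and balanced accuracy is taken here as the mean of recall and specificity.

    def classification_metrics(tp, fn, fp, tn, beta=1.0):
        """Compute the metrics above from the confusion matrix counts."""
        accuracy = (tp + tn) / (tp + fn + fp + tn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)              # true positive rate
        specificity = tn / (tn + fp)         # true negative rate
        balanced_accuracy = (recall + specificity) / 2
        f_score = ((1 + beta**2) * precision * recall
                   / (beta**2 * precision + recall))
        return {"accuracy": accuracy, "balanced accuracy": balanced_accuracy,
                "precision": precision, "recall": recall,
                "specificity": specificity, "F-score": f_score}

    print(classification_metrics(tp=40, fn=10, fp=5, tn=45))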
Interpretation of metrics
Table 8.1 summarizes the properties of the classification performance
metrics. Accuracy and balanced accuracy are good metrics when no
particular class is more important than the other. Remember, however,
that balanced accuracy gives more weight to errors on the minority class.
Precision and recall are useful to evaluate the performance of the solu-
tion in terms of the positive class. They are complementary metrics, and
looking at only one of them may give a biased view of the performance
— more on that below. The F-score is a way to balance precision and
recall with a controllable parameter.
Class imbalance affects metrics like accuracy, precision, and F1-score.
Even when the class is guessed at random, precision (and the F1-score)
is affected by the class imbalance, yielding 1 (and 2/3) as 𝜋 → 1, where
𝜋 denotes the proportion of positive samples. As a result, these metrics
should be preferred when the positive class is the minority class, so the
results are not erroneously inflated and, consequently, mistakenly
interpreted as good. C. K. I. Williams (2021) provides an interesting
discussion on that.
Finally, besides accuracy, the other metrics do not behave well when
the evaluation set is too small. In this case, the metrics may be too
sensitive to the particular samples in the test set or may not be
computable at all.
8.1.2 Regression estimation evaluation

For regression, the evaluation is based on the errors (residuals) of the
predictions in the test set,

𝜖𝑖 = 𝑦̂𝑖 − 𝑦𝑖.
Performance metrics
From the errors, we can calculate several performance metrics that give
us useful information about the behavior of the model. Specifically, we
are interested in understanding what kind of errors the model is making
and how large they are. Unlike classification, the higher the value of the
metric, the worse the model is.
The mean absolute error is given by

MAE = (1/𝑛) ∑_{𝑖=1}^{𝑛} |𝜖𝑖|.

This metric is easy to interpret, is in the same unit as the target variable,
and gives an idea of the average error of the model. It ignores the
direction of the errors, so it is not useful to understand whether the model
is systematically overestimating or underestimating the target variable.
The mean squared error is given by

MSE = (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝜖𝑖².
This metric penalizes large errors more than the mean absolute error, as
the squared residuals are summed.
Root mean squared error is the square root of the mean squared
error, given by
RMSE = √MSE.
This metric is in the same unit as the target variable, which makes it
easier to interpret. It keeps the same properties as the mean squared
error, such as penalizing large errors more than the mean absolute error.
Both MAE and RMSE (or MSE) work well for positive and negative
values of the target variable. However, they might be misleading when
the range of the target variable is large.
The mean absolute percentage error is given by

MAPE = (1/𝑛) ∑_{𝑖=1}^{𝑛} |𝜖𝑖| / 𝑦𝑖.
This metric is useful when the range of the target variable is large, as it
gives an idea of the relative error of the model, not the absolute error.
The mean absolute logarithmic error (MALE) is an alternative for strictly
positive targets whose distribution is skewed, with many small values and a
few large values. Distributions like that are common in practice, e.g., in
sales, income, and population data. It is given by

MALE = (1/𝑛) ∑_{𝑖=1}^{𝑛} |𝜖𝑖^(ln)| = (1/𝑛) ∑_{𝑖=1}^{𝑛} |ln 𝑦̂𝑖 − ln 𝑦𝑖|,

where 𝜖𝑖^(ln) = ln 𝑦̂𝑖 − ln 𝑦𝑖 is the error in logarithmic scale.
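A sketch computing the four regression metrics for positive targets; the example reuses the values of table 8.3 below.

    import math

    def regression_metrics(y_true, y_pred):
        n = len(y_true)
        errors = [yp - yt for yt, yp in zip(y_true, y_pred)]
        mae = sum(abs(e) for e in errors) / n
        mse = sum(e**2 for e in errors) / n
        rmse = math.sqrt(mse)
        mape = sum(abs(e) / yt for e, yt in zip(errors, y_true)) / n
        male = sum(abs(math.log(yp) - math.log(yt))
                   for yt, yp in zip(y_true, y_pred)) / n  # requires positive values
        return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MALE": male}

    print(regression_metrics(y_true=[10, 10], y_pred=[100, 1]))
    # The two samples contribute 9.0 and 0.9 to the MAPE,
    # while both contribute ln(10) to the MALE.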
Interpretation of metrics
Note that, unlike the classification performance metrics, the scale of the
regression performance metrics is not bounded between 0 and 1. This
makes it potentially harder to interpret the results, as the values depend
on the scale of the target variable.
Absolute error metrics, like MAE and RMSE, are useful for under-
standing the central tendency of the magnitude of the errors. They are
easy to interpret because they are in the same unit as the target variable.
However, they tend to be less informative when the target variable has
a large range or when the errors are not normally distributed.
In those situations, relative error metrics, like MAPE and MALE, are
more useful. For instance, imagine we are predicting house prices. The
error of $20,000 for a house that costs $100,000 is more significant than
the same error for a house that costs $1,000,000. The absolute error is
the same in both cases, but the relative error is different.
In that example, the MAPE would be 20% for the first house and 2%
for the second house. Note, however, that MAPE punishes overestimat-
ing more than underestimating in multiplicative terms. Consider the
example in table 8.3. In the first row, the prediction is ten times larger
than the actual value, which results in a MAPE of 900%. In the second
row, the prediction is one tenth of the actual value, which results in a
MAPE of 90%.
     𝑦̂     𝑦     𝜖    MAPE   exp(MALE)
    100    10    90    9.0       10
      1    10    −9    0.9       10

MAPE and MALE for two predictions. The MAPE punishes overestimating
more than underestimating.
MALE, in contrast, treats over- and underestimation symmetrically in
multiplicative terms, since

|ln 𝑦̂ − ln 𝑦| = |ln(𝑦̂/𝑦)| = |ln(𝑦/𝑦̂)| = ln max(𝑦̂/𝑦, 𝑦/𝑦̂).

In the first row of the example,

exp(ln max(100/10, 10/100)) = max(100/10, 10/100) = 10,

and the second row yields the same value.
2C. Tofallis (2015). “A better measure of relative prediction accuracy for model selec-
tion and model estimation”. In: Journal of the Operational Research Society 66.8, pp. 1352–
1362. doi: 10.1057/jors.2014.103.
3Although the term probability is used, the output of the regressor does not need to
be a probability in the strict sense. It is a confidence level in the interval [0, 1] that can be
interpreted as a probability.
8.1.3 Probabilistic classification evaluation

Some classifiers output not a class but a score 𝑓𝑅(x) ∈ [0, 1] that can be
interpreted as the probability of the positive class³. A crisp classifier is
obtained by choosing a threshold 𝜏:

𝑓𝐶(x; 𝜏) = 1 if 𝑓𝑅(x) ≥ 𝜏, and 𝑓𝐶(x; 𝜏) = 0 otherwise.
Since the task is still a classification task, one should not use regres-
sion performance metrics. On the other hand, instead of choosing a
particular threshold and measuring the resulting classifier performance,
we can summarize the performance of all possible variations of the clas-
sifiers using appropriate metrics.
Before diving into the metrics, consider the following error metric.
Let false positive rate (FPR) be the proportion of false positive predic-
tions over the total number of samples that are actually negative,
FPR = FP / (FP + TN).
It is the complement of the specificity, i.e. FPR = 1 − Specificity.
Consider the example in table 8.4 of a given test set and the predictions
of a regressor. We can see that a threshold of 0.5 would yield a classifier
that errs on 3 out of 9 samples. We can adjust the threshold to understand
the behavior of the other possible classifiers.
Expected Predicted
0 0.1
0 0.5
0 0.2
0 0.6
1 0.4
1 0.9
1 0.7
1 0.8
1 0.9
We first sort the samples by the predicted probabilities and then cal-
culate the TPR (recall) and FPR for each threshold. We need to con-
sider only thresholds equal to the predicted values to understand the
variations. In this case, TPR values become the cumulative sum of the
expected outputs divided by the total number of positive samples, and
FPR values become the cumulative sum of the complement of the ex-
pected outputs divided by the total number of negative samples.
Note that, from the ordered list of predictions, we can easily see that
a threshold of 0.7 would yield a classifier that commits only one error. A
way to summarize the performance of all possible classifiers is presented
in the following.
[Figure: ROC curve (TPR versus FPR) for the example in table 8.5. The
diagonal line represents a random classifier, and points above the diagonal
are better than random.]
The area under the ROC curve (AUC) summarizes the performance of all
these classifiers in a single number. The AUC is a ranking metric, which
means that it measures how well predictions are ranked, rather than
their absolute values. It is also robust to class imbalance, since both
recall and specificity are considered. In our example, the AUC is 0.9.
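The sketch below reproduces this computation for the data in table 8.4; it assumes no ties between positive and negative scores, and the AUC is obtained by the trapezoidal rule.

    import numpy as np

    expected = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
    predicted = np.array([0.1, 0.5, 0.2, 0.6, 0.4, 0.9, 0.7, 0.8, 0.9])

    order = np.argsort(-predicted)           # sort by predicted score, descending
    hits = expected[order]

    # Cumulative TPR and FPR as the threshold sweeps the predicted values.
    tpr = np.concatenate(([0.0], np.cumsum(hits) / hits.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(1 - hits) / (1 - hits).sum()))

    # Area under the ROC curve by the trapezoidal rule.
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
    print(auc)                               # 0.9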
8.2 An experimental plan for data science

One could argue that we could use a hold-out set to estimate the per-
formance of 𝑀 — i.e., splitting the dataset into a training set and a test
set once. However, this does not solve the problem. The performance
𝑝 we observe in the test set might be an overestimation or an underes-
timation of the performance of 𝑀 in production. This is because the
randomly chosen test set might be an “outlier” in the representation of
the real world, containing cases that are too easy or too hard to predict.
The correct way to estimate the performance of 𝑀 is to treat the
performance as a random variable, since both the data and the learning
process are stochastic. By doing so, we can study the distribution of the
performance, not particular values.
As with any statistical evaluation, we need to generate samples of
the performance of the possible solutions that 𝐴 is able to obtain. To
do so, we use a sampling strategy to generate datasets 𝐷1, 𝐷2, … from 𝐷.
Each dataset is further divided into a training set and a test set, which
must be disjoint. Each training set is used to find a solution (𝑀1, 𝑀2, …),
and each test set is used to evaluate the performance (𝑝1, 𝑝2, …) of the
corresponding solution. The test set emulates the real-world scenario,
where the model is used to make predictions on new data.
The most common sampling strategy is cross-validation. It assumes
that data are independent and identically distributed (i.i.d.). This
strategy randomly divides the dataset into 𝑟 folds of (approximately) the
same size. Each fold is used as a test set once and as part of the training
set 𝑟 − 1 times. So, first we use folds 2, 3, …, 𝑟 as the training set and
fold 1 as the test set. Then, we use folds 1, 3, …, 𝑟 as the training set and
fold 2 as the test set. And so on. See fig. 8.2.
If possible, one should use repeated cross-validation, where this pro-
cess is repeated many times, each having a different fold partitioning
chosen at random. Also, when dealing with classification problems,
we should use stratified cross-validation, where the distribution of the
classes is preserved in each fold.
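For instance, a repeated stratified cross-validation loop could look like the following sketch; the scikit-learn utilities, the logistic regression model, and accuracy as the metric are placeholder choices of mine, not prescribed by the text.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import RepeatedStratifiedKFold

    X, y = make_classification(n_samples=300, random_state=0)

    # r = 5 folds, repeated 10 times with different random partitionings;
    # stratification preserves the class distribution in each fold.
    sampler = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

    performances = []
    for train_idx, test_idx in sampler.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        performances.append(accuracy_score(y[test_idx], y_pred))

    print(np.mean(performances), np.std(performances))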
Each performance sample is obtained by applying a performance metric 𝑅
to the expected and predicted values of the corresponding test set,

𝑝𝑘 = 𝑅([𝑦𝑖 ∶ 𝑖], [𝑦̂𝑖 ∶ 𝑖]).
[Diagram: the sampling strategy draws training and test pairs from the
data; each training set goes through the data handling pipeline and the
machine learning step, yielding the fitted parameters 𝜙 and 𝜃; the
corresponding test set (with its targets) yields a performance sample 𝑝,
and the collected samples 𝑝1, 𝑝2, … feed a hypothesis test.]
The results are not the “real” performance of the solution 𝑀 in the
real world, as that would require new data to be collected. However, we
can safely interpret the performance samples as being sampled from the
same distribution as the real-world performance of the solution 𝑀.
4A. Benavoli et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple
Classifiers Through Bayesian Analysis”. In: Journal of Machine Learning Research 18.77,
pp. 1–36. url: http://jmlr.org/papers/v18/16-305.html.
and the sample variance

𝑠² = (1/(𝑛 − 1)) ∑_{𝑖=1}^{𝑛} (𝑧𝑖 − 𝑧̄)².
6A. Benavoli et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple
Classifiers Through Bayesian Analysis”. In: Journal of Machine Learning Research 18.77,
pp. 1–36. url: http://jmlr.org/papers/v18/16-305.html.
7That includes both data preprocessing and machine learning.
The performance vectors of the baseline and of the candidate algorithm,
the latter denoted p(𝜆), are calculated using the same strategy described
in section 8.2.3. It is important to note that the same samplings
must be used to compare the algorithms, i.e., performance samples
must be paired, each one of them coming from the same sampling and,
consequently, from the same training and test datasets.
We can validate whether the candidate is better than the baseline by
checking whether the probability of a performance loss,

P(𝜇 < 0 ∣ z),

is too high. Always ask yourself if the risk of performance loss is
worth it in the real-world scenario.
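As an illustration, the sketch below estimates P(𝜇 < 0 ∣ z) following the Bayesian correlated t-test discussed by Benavoli et al. (2017); the correction factor for overlapping training sets and the example accuracies are assumptions of this particular rendering, not a prescription of the text.

    import numpy as np
    from scipy import stats

    def prob_candidate_worse(p_candidate, p_baseline, test_frac):
        """Posterior probability P(mu < 0 | z) that the mean of the paired
        differences z = p_candidate - p_baseline favors the baseline.

        Sketch of the Bayesian correlated t-test (Benavoli et al., 2017),
        where test_frac = n_test / (n_train + n_test) corrects for the
        correlation induced by overlapping training sets."""
        z = np.asarray(p_candidate) - np.asarray(p_baseline)  # paired samples
        n = len(z)
        mean, var = z.mean(), z.var(ddof=1)
        scale = np.sqrt((1 / n + test_frac / (1 - test_frac)) * var)
        return stats.t.cdf(0.0, df=n - 1, loc=mean, scale=scale)

    # Hypothetical paired accuracies from the same 10-fold cross-validation.
    baseline = [0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80]
    candidate = [0.83, 0.84, 0.80, 0.85, 0.84, 0.82, 0.81, 0.85, 0.83, 0.82]
    print(prob_candidate_worse(candidate, baseline, test_frac=0.1))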
A Mathematical foundations
Chapter remarks
Contents
A.1 Algorithms and data structures . . . . . . . . . . . . . . . 187
A.1.1 Computational complexity . . . . . . . . . . . . . 187
A.1.2 Algorithmic paradigms . . . . . . . . . . . . . . . 188
A.1.3 Data structures . . . . . . . . . . . . . . . . . . . 192
A.2 Set theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
A.2.1 Set operations . . . . . . . . . . . . . . . . . . . . 196
A.2.2 Set operations properties . . . . . . . . . . . . . . 196
A.2.3 Relation to Boolean algebra . . . . . . . . . . . . . 197
A.3 Linear algebra . . . . . . . . . . . . . . . . . . . . . . . . 197
A.3.1 Operations . . . . . . . . . . . . . . . . . . . . . . 198
A.3.2 Systems of linear equations . . . . . . . . . . . . . 200
A.3.3 Eigenvalues and eigenvectors . . . . . . . . . . . 200
A.4 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.4.1 Axioms of probability and main concepts . . . . . 201
A.4.2 Random variables . . . . . . . . . . . . . . . . . . 202
A.4.3 Expectation and moments . . . . . . . . . . . . . 203
A.4.4 Common probability distributions . . . . . . . . . 205
A.4.5 Permutations and combinations . . . . . . . . . . 208
B.1 Multi-layer perceptron . . . . . . . . . . . . . . . . . . . . 210
B.2 Decision trees . . . . . . . . . . . . . . . . . . . . . . . . 212
Context
Objectives
Big-O notation Let 𝑓 and 𝑔 be functions from the set of natural num-
bers to the set of real numbers, i.e., 𝑓, 𝑔 ∶ ℕ → ℝ. We say that 𝑓 is 𝑂(𝑔) if
there exist a constant 𝑐 > 0 and a natural number 𝑛0 such that
𝑓(𝑛) ≤ 𝑐 𝑔(𝑛) for all 𝑛 ≥ 𝑛0. We can order functions by their asymptotic
complexity. For example, 𝑂(1) < 𝑂(log 𝑛) < 𝑂(𝑛) < 𝑂(𝑛 log 𝑛) < 𝑂(𝑛²) <
𝑂(2ⁿ) < 𝑂(𝑛!). Throughout this book, we consider log 𝑛 = log₂ 𝑛, i.e.,
whenever the base of the logarithm is not specified, it is assumed to be
2.
The asymptotic analysis of algorithms is usually done in the worst-
case scenario, i.e. the maximum amount of resources the algorithm uses
for any input of size 𝑛. Thus, it gives us an upper bound on the com-
plexity of the algorithm. In other words, an algorithm with complexity
𝑂(𝑔(𝑛)) is guaranteed to run in at most 𝑐𝑔(𝑛) time for some constant 𝑐.
It does not mean, for instance, that an algorithm with time complex-
ity 𝑂(𝑛) will always run faster than an algorithm with time complexity
𝑂(𝑛2 ), but that the former will run faster for a large enough input size.
1T. H. Cormen et al. (2022). Introduction to Algorithms. 4th ed. The MIT Press, p. 1312.
isbn: 978-0262046305.
2J. V. Guttag (2021). Introduction to Computation and Programming Using Python.
With Application to Computational Modeling and Understanding Data. 3rd ed. The MIT
Press, p. 664. isbn: 978-0262542364.
The complexities of sequential steps combine by the maximum rule: if an
algorithm has two sequential steps with time complexities 𝑂(𝑓) and 𝑂(𝑔),
the highest complexity is the one that determines the overall complexity.
A classical example is binary search, which checks whether a key 𝑥 is
present in a sorted array by repeatedly halving the search interval:

    function bsearch([𝑎1, 𝑎2, …, 𝑎𝑛], 𝑥) is
        if 𝑛 = 0 then
            return false
        𝑚 ← ⌊𝑛/2⌋
        if 𝑥 = 𝑎𝑚 then
            return true
        if 𝑥 < 𝑎𝑚 then
            return bsearch([𝑎1, …, 𝑎𝑚−1], 𝑥)
        else
            return bsearch([𝑎𝑚+1, …, 𝑎𝑛], 𝑥)
Note that this strategy leads to such a low time complexity that we
can solve large instances of the problem in a reasonable amount of time.
Even for an array with 2⁶⁴ = 18,446,744,073,709,551,616 elements, the
algorithm will find the key in at most 65 steps.
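For reference, an iterative Python version of the same algorithm (a sketch; the book's pseudocode above is recursive):

    def bsearch(a, x):
        """Return True if x is in the sorted list a, in O(log n) comparisons."""
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if a[mid] == x:
                return True
            if x < a[mid]:
                hi = mid - 1        # discard the upper half
            else:
                lo = mid + 1        # discard the lower half
        return False

    print(bsearch([1, 3, 5, 7, 11], 7))   # True
    print(bsearch([1, 3, 5, 7, 11], 4))   # False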
subject to ∑_{𝑖=1}^{𝑛} 𝑤𝑖 𝑥𝑖 ≤ 𝑊,
Backtracking is usually combined with other techniques to reduce the
search space and make the algorithm more efficient. For example, the
backtracking algorithm for the Sudoku problem is combined with constraint
propagation to reduce the number of possible solutions.
A Sudoku puzzle consists of an 𝑛 × 𝑛 grid, divided into 𝑛 subgrids
of size √𝑛 × √𝑛. The goal is to fill the grid with numbers from 1 to 𝑛
such that each row, each column, and each subgrid contains all numbers
from 1 to 𝑛 but no repetitions. The most common grid size is 9 × 9.
An illustration of backtracking to solve a 4 × 4 Sudoku puzzle4 is
shown in fig. A.1. The puzzle is solved by trying all possible numbers
in each cell and backtracking when a number does not fit. The solu-
tion is found when all cells are filled and the constraints are satisfied.
Arrows indicate the steps of the backtracking algorithm. Every time a
constraint is violated — indicated in gray —, the algorithm backtracks
to the previous cell and tries a different number.
One can easily see that a puzzle with 𝑚 missing cells has 𝑛^𝑚 candidate
assignments to explore. For small values of 𝑚 and 𝑛, the algorithm is
practical, but for large values, it becomes too costly.
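A compact sketch of such a backtracking solver for the 4 × 4 case; the puzzle instance is made up and 0 marks an empty cell.

    import math

    def solve(grid):
        """Fill the n x n Sudoku grid in place by backtracking; 0 = empty."""
        n = len(grid)
        b = int(math.isqrt(n))                      # subgrid size, sqrt(n)
        for r in range(n):
            for c in range(n):
                if grid[r][c] == 0:
                    for v in range(1, n + 1):       # try every candidate value
                        row_ok = v not in grid[r]
                        col_ok = all(grid[i][c] != v for i in range(n))
                        br, bc = b * (r // b), b * (c // b)
                        box_ok = all(grid[i][j] != v
                                     for i in range(br, br + b)
                                     for j in range(bc, bc + b))
                        if row_ok and col_ok and box_ok:
                            grid[r][c] = v
                            if solve(grid):         # recurse on the next cell
                                return True
                            grid[r][c] = 0          # dead end: backtrack
                    return False                    # no value fits: backtrack
        return True                                 # all cells filled

    puzzle = [[1, 0, 3, 0],
              [0, 4, 0, 2],
              [2, 0, 4, 0],
              [0, 3, 0, 1]]
    solve(puzzle)
    print(puzzle)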
4Smaller puzzles are more didactic, but the same principles apply to larger puzzles.
5T. H. Cormen et al. (2022). Introduction to Algorithms. 4th ed. The MIT Press, p. 1312.
isbn: 978-0262046305.
[Figure A.1: Backtracking steps to solve a 4 × 4 Sudoku puzzle. Each grid
shows a partial assignment; arrows indicate the steps of the algorithm, and
grids that violate a constraint (in gray) trigger backtracking to the
previous cell.]
Stack A stack is a data structure in which only two operations are
allowed: push (add an element to the top of the stack) and pop (remove
the top element). Only the top element is accessible.
Binary tree A binary tree 𝑇 is defined recursively as

𝑇 = ∅ if it is empty, or 𝑇 = (𝑣, 𝑇𝑙, 𝑇𝑟) if it has a value 𝑣 and two
children 𝑇𝑙 and 𝑇𝑟.
Note that the left and right children are themselves binary trees. If 𝑇 is
a leaf, then 𝑇𝑙 = 𝑇𝑟 = ∅.
These properties make it easy to represent a binary tree using paren-
theses notation. For example, (1, (2, ∅, ∅), (3, ∅, ∅)) is a binary tree
with root 1, left child 2, and right child 3.
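The parentheses notation maps naturally to nested tuples; a small sketch, using None for the empty tree ∅ and an in-order traversal of my own choosing:

    # A binary tree as nested tuples: (value, left, right); None is the empty tree.
    tree = (1, (2, None, None), (3, None, None))

    def inorder(t):
        """In-order traversal: left subtree, value, right subtree."""
        if t is None:
            return []
        value, left, right = t
        return inorder(left) + [value] + inorder(right)

    print(inorder(tree))  # [2, 1, 3]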
[Figure: A graph with four vertices and five edges. Vertices are numbered
from 1 to 4, and edges are represented by arrows. The graph is directed,
as the edges have a direction.]
Universe set The universe set is the set of all elements in a given con-
text. It is denoted by Ω.
Empty set The empty set is the set with no elements. It is denoted by
the symbol ∅. Depending on the context, it can also be denoted by {}.
Union The union of two sets 𝐴 and 𝐵 is the set of elements that are
in 𝐴 or 𝐵. It is denoted by 𝐴 ∪ 𝐵. For example, the union of {1, 2, 3} and
{3, 4, 5} is {1, 2, 3, 4, 5}.
• Commutativity: 𝐴 ∪ 𝐵 = 𝐵 ∪ 𝐴 and 𝐴 ∩ 𝐵 = 𝐵 ∩ 𝐴;
• Distributivity: 𝐴 ∪ (𝐵 ∩ 𝐶) = (𝐴 ∪ 𝐵) ∩ (𝐴 ∪ 𝐶) and 𝐴 ∩ (𝐵 ∪ 𝐶) =
(𝐴 ∩ 𝐵) ∪ (𝐴 ∩ 𝐶).
De Morgan's laws relate union, intersection, and complement:

(𝐴 ∪ 𝐵)ᶜ = 𝐴ᶜ ∩ 𝐵ᶜ and (𝐴 ∩ 𝐵)ᶜ = 𝐴ᶜ ∪ 𝐵ᶜ.
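Python's built-in sets mirror these operations; a quick check of De Morgan's laws within a small, made-up universe:

    universe = set(range(1, 11))          # Omega = {1, ..., 10}
    A = {1, 2, 3}
    B = {3, 4, 5}

    print(A | B)                          # union: {1, 2, 3, 4, 5}
    print(A & B)                          # intersection: {3}

    def complement(S):
        """Complement relative to the universe."""
        return universe - S

    # De Morgan: (A ∩ B)^c == A^c ∪ B^c
    print(complement(A & B) == complement(A) | complement(B))  # True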
A.3.1 Operations
The main operations in linear algebra are presented below.
Dot product The dot product of two vectors v and w is the scalar

v ⋅ w = ∑_{𝑖=1}^{𝑛} 𝑣𝑖 𝑤𝑖.
7G. Strang (2023). Introduction to Linear Algebra. 6th ed. Wellesley-Cambridge Press,
p. 440. isbn: 978-1733146678.
The determinant of a 2 × 2 matrix is

| 𝑎 𝑏 |
| 𝑐 𝑑 | = 𝑎𝑑 − 𝑏𝑐.
Inverse The inverse of a square matrix 𝐴 is the matrix 𝐴⁻¹ such that

𝐴𝐴⁻¹ = 𝐴⁻¹𝐴 = 𝐼𝑛.

It can be computed as 𝐴⁻¹ = adj(𝐴) / det(𝐴), where adj(𝐴) is the adjugate
(or adjoint) of 𝐴, i.e., the transpose of the cofactor matrix of 𝐴.
The cofactor of the 𝑖, 𝑗-th entry of a matrix 𝐴 is the determinant of
the matrix obtained by removing the 𝑖-th row and the 𝑗-th column of 𝐴,
multiplied by (−1)𝑖+𝑗 .
In the case of a 2 × 2 matrix, the inverse is
( 𝑎 𝑏 )⁻¹ = 1/(𝑎𝑑 − 𝑏𝑐) · (  𝑑  −𝑏 )
( 𝑐 𝑑 )                   ( −𝑐   𝑎 ).
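A quick numerical check of the 2 × 2 formulas with NumPy (the matrix is arbitrary):

    import numpy as np

    A = np.array([[4.0, 7.0],
                  [2.0, 6.0]])
    a, b, c, d = A.ravel()

    det = a * d - b * c                          # ad - bc
    inv = np.array([[d, -b], [-c, a]]) / det     # closed-form 2x2 inverse

    print(np.isclose(det, np.linalg.det(A)))     # True
    print(np.allclose(inv, np.linalg.inv(A)))    # True
    print(np.allclose(A @ inv, np.eye(2)))       # A A^{-1} = I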
Eigenvalues and eigenvectors A nonzero vector v is an eigenvector of a
square matrix 𝐴, with associated eigenvalue 𝜆, if

𝐴v = 𝜆v. (A.1)
A.4 Probability
Probability is the branch of mathematics that studies the likelihood of
events. It is used to model uncertainty and randomness. The basic ob-
jects of probability are events and random variables.
For a comprehensive material about probability theory, the reader is
referred to Ross (2018)8 and Ross (2023)9.
8S. M. Ross (2018). A First Course in Probability. 10th ed. Pearson, p. 528. isbn: 978-
1292269207.
9S. M. Ross (2023). Introduction to Probability Models. 13th ed. Academic Press, p. 870.
isbn: 978-0443187612.
Bayes’ rule Bayes’ rule is a formula that relates the conditional prob-
ability of an event 𝐴 given an event 𝐵 to the conditional probability of 𝐵
given 𝐴. It is
P(𝐴 ∣ 𝐵) = P(𝐵 ∣ 𝐴) ⋅ P(𝐴) / P(𝐵). (A.2)
Bayes’ rule is one of the most important formulas in probability theory
and is used in many areas of science and engineering. Particularly, for
data science, it is used in Bayesian statistics and machine learning.
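A small numerical illustration of eq. (A.2), with made-up numbers for a diagnostic test:

    # Hypothetical diagnostic test: A = "has disease", B = "test is positive".
    p_a = 0.01                    # prevalence P(A)
    p_b_given_a = 0.95            # sensitivity P(B | A)
    p_b_given_not_a = 0.05        # false positive rate P(B | not A)

    # Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

    # Bayes' rule (A.2): P(A | B) = P(B | A) P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(round(p_a_given_b, 3))  # about 0.161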
Expectation The expected value of a random variable 𝑋 is

E[𝑋] = ∑_{𝑥} 𝑥 ⋅ 𝑝𝑋(𝑥),

if 𝑋 is discrete, or

E[𝑋] = ∫_{−∞}^{∞} 𝑥 ⋅ 𝑓𝑋(𝑥) 𝑑𝑥,

if 𝑋 is continuous.
The main properties of expectation are listed below.
The expectation operator is linear. Given two random variables 𝑋
and 𝑌 and a real number 𝑐, we have
E[𝑐𝑋] = 𝑐 E[𝑋],
E[𝑋 + 𝑐] = E[𝑋] + 𝑐,
and
E[𝑋 + 𝑌 ] = E[𝑋] + E[𝑌 ].
Under a more general setting, given a function 𝑔 ∶ ℝ → ℝ, the
expectation of 𝑔(𝑋) is

E[𝑔(𝑋)] = ∑_{𝑥} 𝑔(𝑥) ⋅ 𝑝𝑋(𝑥),

if 𝑋 is discrete, or

E[𝑔(𝑋)] = ∫_{−∞}^{∞} 𝑔(𝑥) ⋅ 𝑓𝑋(𝑥) 𝑑𝑥,

if 𝑋 is continuous.
The variance can equivalently be written as Var(𝑋) = E[𝑋²] − E[𝑋]², since

Var(𝑋) = E[(𝑋 − E[𝑋])²]
       = E[𝑋² − 2𝑋 E[𝑋] + E[𝑋]²]
       = E[𝑋²] − 2 E[𝑋] E[𝑋] + E[𝑋]²
       = E[𝑋²] − E[𝑋]².
The 𝑘-th moment of 𝑋 is

E[𝑋^𝑘] = ∑_{𝑥} 𝑥^𝑘 ⋅ 𝑝𝑋(𝑥),

if 𝑋 is discrete, or

E[𝑋^𝑘] = ∫_{−∞}^{∞} 𝑥^𝑘 ⋅ 𝑓𝑋(𝑥) 𝑑𝑥,

if 𝑋 is continuous.
Law of large numbers The law of large numbers states that the aver-
age of a large number of independent and identically distributed (i.i.d.)
random variables converges to the expectation of the random variable.
Mathematically,
lim_{𝑛→∞} (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑋𝑖 = E[𝑋].
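A quick simulation of the law of large numbers; the exponential distribution and the sample sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    # X ~ Exponential with E[X] = 2; the running average approaches 2.
    samples = rng.exponential(scale=2.0, size=100_000)

    for n in (10, 100, 10_000, 100_000):
        print(n, samples[:n].mean())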
Poisson distribution A Poisson random variable 𝑋 ∼ Poisson(𝜆), with
rate 𝜆 > 0, has probability mass function

𝑝𝑋(𝑥) = 𝑒^{−𝜆} 𝜆^{𝑥} / 𝑥!. (A.8)
The expected value of 𝑋 ∼ Poisson(𝜆) is E[𝑋] = 𝜆, and the variance
is Var(𝑋) = 𝜆.
Normal distribution A normal (Gaussian) random variable 𝑋 ∼ 𝒩(𝜇, 𝜎²),
with mean 𝜇 and variance 𝜎², has probability density function

𝑓𝑋(𝑥) = (1/√(2𝜋𝜎²)) exp(−(𝑥 − 𝜇)² / (2𝜎²)). (A.9)
Central limit theorem The central limit theorem states that the nor-
malized version of the sample mean converges in distribution to a normal
distribution¹¹. Given 𝑋1, …, 𝑋𝑛 i.i.d. random variables with mean 𝜇 and
finite variance 𝜎² < ∞,

√𝑛 (𝑋̄ − 𝜇) → 𝒩(0, 𝜎²), as 𝑛 → ∞.
11This statement of the central limit theorem is known as the Lindeberg-Levy CLT.
There are other versions of the central limit theorem, some more general and some more
restrictive.
Gamma distribution A gamma random variable 𝑋 ∼ Gamma(𝛼, 𝛽), with
shape 𝛼 > 0 and rate 𝛽 > 0, has probability density function

𝑓𝑋(𝑥) = 𝛽^{𝛼} 𝑥^{𝛼−1} 𝑒^{−𝛽𝑥} / Γ(𝛼), (A.10)

for 𝑥 > 0, where Γ is the gamma function.
Factorial The factorial of a natural number 𝑛 is

𝑛! = 𝑛 ⋅ (𝑛 − 1) ⋅ … ⋅ 2 ⋅ 1.

By definition, 0! = 1.
B Topics on learning machines

This appendix is under construction. Topics like the kernel trick, back-
propagation, and other machine learning algorithms will be discussed
here.
B.1 Multi-layer perceptron

Each perceptron applies the step activation function

𝜎(𝑥) = 1 if 𝑥 > 0, and 𝜎(𝑥) = 0 otherwise.
A model with two neurons in the hidden layer (effectively the combina-
tion of three perceptrons) applies two perceptrons to the inputs and a
third perceptron to their outputs.
The parameters w(1) and w(2) represent the hyperplanes that sepa-
rate the classes in the hidden layer, and w(3) represents how the hyper-
planes are combined to generate the output. If we set weights w(1) =
[−0.5, 1, −1] (like the perceptron in the previous example) and w(2) =
[−0.5, −1, 1], we use the third neuron to combine the results of the first
two neurons. This way, a possible solution for the XOR problem is set-
ting w(3) = [0, 1, 1].
Figure B.1 and table B.1 show the class boundaries and the predic-
tions of the MLP for the XOR problem.
Note that there are many possible solutions for the XOR problem us-
ing the MLP. Learning strategies like back-propagation are used to find
the optimal parameters for the model, and regularization techniques,
like 𝑙1 and 𝑙2 regularization, are used to prevent overfitting.
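A direct implementation of this MLP with the weights above is sketched below; it assumes, as the three-component weight vectors suggest, that the first component of each vector is a bias applied to a constant input of 1.

    import numpy as np

    def step(x):
        """Step activation: 1 if x > 0, 0 otherwise."""
        return (x > 0).astype(int)

    w1 = np.array([-0.5, 1, -1])   # first hidden neuron
    w2 = np.array([-0.5, -1, 1])   # second hidden neuron
    w3 = np.array([0, 1, 1])       # output neuron combining the hidden ones

    def mlp(x1, x2):
        h1 = step(np.array([1, x1, x2]) @ w1)
        h2 = step(np.array([1, x1, x2]) @ w2)
        return step(np.array([1, h1, h2]) @ w3)

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x1, x2, mlp(x1, x2))   # 0, 1, 1, 0 (XOR)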
Deep learning is the study of neural networks with many layers. The
idea is to use many layers to learn not only the boundaries that separate
the classes (or the function that maps inputs and outputs) but also the
features that are relevant to the problem. A complete discussion of deep
learning can be found in Goodfellow, Bengio, and Courville (2016)1.
1I. Goodfellow, Y. Bengio, and A. Courville (2016). Deep Learning.
http://www.deeplearningbook.org. MIT Press.
[Figure B.1: The MLP with two neurons in the hidden layer generates two
linear hyperplanes in the (𝑥1, 𝑥2) plane that separate the classes,
effectively solving the XOR problem.]
[Table B.1: Predictions of the MLP for the XOR problem. The outputs of the
1st and 2nd neurons correspond to hyperplanes that separate the classes in
the hidden layer, and they are combined by the 3rd neuron to generate the
correct output.]
[Figure: Decision regions 𝑦̂ = 0 and 𝑦̂ = 1 of a decision tree in the
(𝑥1, 𝑥2) plane. Decision trees assume that the classes can be separated
with hyperplanes orthogonal to the axes.]
Bibliography
BI business intelligence 10
data leakage A situation where information from the test set is used
to transform the training set in any way or to train the model. 50,
51, 81, 87, 144, 164, 178